Monday, 10 December 2007

AAF Mini Grant

Along with my colleague Cathy Clegg, I've been awarded an Australian Access Federation mini grant to produce a shibbolized version of lyceum.

There'll doubtless be much more on this in the coming months.

Thursday, 6 December 2007

p2p, lockss and cheap archival storage

a thought ...

If what we want is cheap archival storage, could we do it very cheaply by using a lot of minimal boxes running p2p software to distribute the load? It should give the same functionality as lockss, and with multiple boxes it should be failure resistant if not exactly resilient.

Put four cheap reasonably big disks in each box ...

How yahoo [does | will do] big file systems ...

Hadoop, in a word. Yahoo is very involved in the development of the hadoop file system, and I guess it's going to be their googlefs equivalent for content hosting.

I'm also guessing that flickr will be in the [does | will do] box as well.

Which is kind of interesting. And leveraging off open source saves some of your development costs as well.

And as a bit of a non sequitur (ok, I want to keep track of the url), there's an interesting tale from the NYT about hadoop, Amazon's cloud computing environment and digital content management.

student filestores, unstructured data and isilon

One of the problems that universities have is that they have large amounts of unstructured data consisting of lots of small files that are subject to periods of rapid change and churn.

They're known as student filestores, and because of their nature they're a pain to administer, back up and restore. The backup/restore problem is due to the treewalk problem, whereby any such backup/restore program has to go and find the files, then build and search a directory of them. This takes time, and on a chatty file system it can mean that you never actually build a proper directory, as the content has changed by the time you've finished treewalking the filestore.

Obviously there are tricks to get round this, such as splitting the filestore into multiple filestores, say based on cohorts, but the problem still exists. Most conventional filesystems have trouble with lots of changing small files, and consequently so do most backup solutions.

Now one way you could do this is to have a database driven solution that tracks when a particular file has been changed, and writes the path somewhere so you only backup new or changed files.

This is a solution we're developing for our student filestore, which is based on Apple xserve and xraid technology. While we can replicate the filestore and do diffs on the filelists to allow user driven restores, we can't actually back it up, which is something we would like to do for DR purposes. So we're developing a database driven system to track changes and build synthetic backups that we can then write out to a volume and back up conventionally.
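As a rough illustration of the change-tracking idea (not our actual system), here is a minimal Python sketch using sqlite3. The table layout and the function names open_catalogue, note_change, pending and mark_backed_up are all made up for the example, and it assumes something else (a filesystem hook, a log scraper, whatever) calls note_change when a file is written.

```python
import sqlite3
import time


def open_catalogue(db_path=":memory:"):
    """Open (or create) the hypothetical change catalogue."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS changed ("
        " path TEXT PRIMARY KEY,"
        " changed_at REAL NOT NULL)"
    )
    return db


def note_change(db, path, when=None):
    """Record that a file changed; a later change overwrites the earlier row."""
    stamp = time.time() if when is None else when
    db.execute(
        "INSERT OR REPLACE INTO changed (path, changed_at) VALUES (?, ?)",
        (path, stamp),
    )
    db.commit()


def pending(db):
    """List the paths that need backing up, oldest change first. No treewalk."""
    rows = db.execute("SELECT path FROM changed ORDER BY changed_at, path")
    return [row[0] for row in rows]


def mark_backed_up(db, paths, started_at):
    """Clear entries for backed-up files, unless they changed again
    after the backup run started."""
    db.executemany(
        "DELETE FROM changed WHERE path = ? AND changed_at <= ?",
        [(p, started_at) for p in paths],
    )
    db.commit()
```

The point of the design is that the backup job asks the database for its work list instead of walking the tree, and a file that changes mid-run simply stays on the list for the next run.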

The alternative would be to use a true metadata and pointer driven filesystem like the google file system, where, while you have to rewrite chunks whenever a file inside a chunk changes, all you need to do is to back up the chunks, which are larger and easier for conventional filesystem type backups to handle. And you don't need to use the google file system - the fossil/venti combination found in plan 9 would work just as well.
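To make the pointer-driven idea a bit more concrete, here is a toy content-addressed block store in Python, loosely in the spirit of venti, which addresses blocks by a hash of their contents. It is not the API of venti or the google file system: the BlockStore class, its method names, the in-memory dict and the use of SHA-256 are all assumptions for the sketch.

```python
import hashlib


class BlockStore:
    """Toy write-once block store: blocks are keyed by a hash of their contents."""

    def __init__(self, block_size=8192):
        self.block_size = block_size
        self.blocks = {}  # hex digest -> block bytes

    def put_file(self, data: bytes):
        """Split data into blocks, store each one, return the list of keys.
        The key list plays the role of the file's pointer metadata."""
        keys = []
        for start in range(0, len(data), self.block_size):
            block = data[start:start + self.block_size]
            key = hashlib.sha256(block).hexdigest()
            # Identical blocks hash to the same key, so they are stored once.
            self.blocks.setdefault(key, block)
            keys.append(key)
        return keys

    def get_file(self, keys):
        """Reassemble a file from its list of block keys."""
        return b"".join(self.blocks[key] for key in keys)
```

In a scheme like this a backup only has to copy blocks it hasn't seen before plus the (small) pointer lists, rather than hunting through a directory tree for changed files.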

This problem also exists in digital preservation. Digital content for long term storage typically consists of lots of unstructured data and content, with a lot of small files. However, as we're keeping these for long term storage, the files don't change; they just get added to.

Most solutions in this space are fairly conservative and rely on conventional file systems for an object store and a database to store the metadata associated with the objects. To this one adds something like SAMFS to store multiple copies of the object store and the individual checksums of the objects, and do some integrity checking to avoid bitrot. Hitachi's Content Archive Platform works like this as does Honeycomb (aka Storagetek 5800) and commercial digital preservation products like digitool from Ex Libris.
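The checksum part of that is simple enough to sketch. The following Python is only an illustration of the general fixity-checking idea, not how SAMFS or any of the products above do it; the function names and the choice of SHA-256 are assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path, bufsize=1 << 20):
    """Checksum a file in fixed-size reads so large objects don't fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(bufsize), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum at ingest time."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root, manifest):
    """Return the relative paths that are missing or no longer match."""
    root = Path(root)
    damaged = []
    for rel, expected in manifest.items():
        candidate = root / rel
        if not candidate.is_file() or sha256_of(candidate) != expected:
            damaged.append(rel)
    return damaged
```

Run verify against each copy of the object store on a schedule, and anything it reports can be repaired from a copy that still matches the manifest.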

And this works fine, because the ingest rate is typically low and there's no churn, which means that whatever storage backend/tape archive/replication solution can cope by doing continuous synthetic backups (or by rsync, rdiff or whatever).

What happens if you're facebook or the Kodak Easyshare site? Users, and there are a lot of them, are continually adding and modifying content, and the content consists of lots of small files. And you've got to keep the content for as long as the user keeps on subscribing, so you have lots of files. Yes, you could quota each user to say 1GB or 1000 files (for file based backup and replication you're more worried about the number of files, or directory table entries, than the sheer amount of filestore), but if you've many thousands of users, it's still a lot of files. Too many to back up conventionally, but which you would probably replicate multiple times.

So you could say that the flickr filestore or the Kodak Easyshare filestore would be a close model for a typical student filestore on drugs.

Now I don't know how flickr provide their store but Kodak uses Isilon to provide their store.

So when Isilon came to Canberra to spruik their solution I was interested. Especially as we also potentially have a long term archiving problem with medical images, astronomical images, and astronomical data, as well as having to provide a large live student filestore. Something that would scale to 1600TB is interesting.

And it was interesting. Basically start with three nodes. Stripe the data across them in such a way as to have multiple redundancy across disks within boxes and between boxes to ensure that you could lose either a random set of disks or a box and keep going. Use infiniband as a backplane to glue the boxes together. Additional nodes can be added to increase the amount of filestore available provided you stay within the redundancy rules. And you don't back it up - you replicate it. (As an aside I suddenly realised at this moment why Hitachi had put MAID (like Copan) into their preservation package - most times you never need to access the replicated copies so why keep the disks spinning - simple when you think it through).
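I don't know how Isilon actually lays the data out, but the general "lose a box and keep going" trick can be shown with plain single-parity striping across three nodes. This Python sketch is generic RAID-style XOR parity under that assumption, with made-up function names; it is not Isilon's scheme and it only survives the loss of one node.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def stripe(data: bytes):
    """Split data into two halves plus an XOR parity block, one per node.
    Returns the three node blocks and the original length."""
    half = (len(data) + 1) // 2
    first = data[:half]
    second = data[half:].ljust(half, b"\0")  # pad so both halves match
    return [first, second, xor_bytes(first, second)], len(data)


def recover(nodes, length):
    """Rebuild the original data when at most one node is None (lost).
    Losing two nodes is beyond what single parity can repair."""
    first, second, parity = nodes
    if first is None:
        first = xor_bytes(second, parity)
    elif second is None:
        second = xor_bytes(first, parity)
    return (first + second)[:length]
```

For example, stripe some bytes, set any one of the three returned blocks to None, and recover gives back the original data.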

And you present it as a contiguous filesystem, presented as shares and accessible by NFS, CIFS, HTTP and (grudgingly) the apple filing protocol (AFP). Cost is around $10K/TB plus some extra for some of the replication tools. Not astoundingly cheap, but not ridiculous either.

But I had a niggle. It was all sales and nothing about how the filesystem worked. How do they manage it? Given its potential size it can't be a conventional inode or cluster based system, so I'm guessing it must be a distributed system something like fossil. They said they'd get back to me but they haven't. Certainly fossil would give them efficiencies.

And then there's the other problem - googling for technical information I came across a whole set of entries suggesting that there might be some financial problems in the parent.

On the other side, they did drop a hint that another university in Australia (Victoria actually) was possibly about to buy their solution for medical imagery. They promised to confirm that as well - something else I'm still waiting to hear about.

So, promising, not cheap, and a few doubts, but if it works the way I'm guessing, it could be a really useful technology for holding large amounts of unstructured data, either as filestore or for archiving.

Seattle/California trip powerpoint

I've been asked to do a high speed overview of my trip to Educause in Seattle and the Caudit study tour. It's a personal take, but if you're interested you can download the presentation. It's a little under 1MB in size.