Thursday, 6 December 2007

student filestores, unstructured data and isilon

One of the problems that universities have is that they hold large amounts of unstructured data consisting of lots of small files that are subject to periods of rapid change and churn.

They're known as student filestores, and because of their nature they're a pain to administer, back up and restore files from. The backup/restore problem is due to the treewalk problem, whereby any backup or restore program has to go and find the files, then build and search a directory of them. This takes time, and on a chatty filesystem it can mean that you never actually build a proper directory, as the content has changed by the time you've finished treewalking the filestore.

Obviously there are tricks to get round this, such as splitting the filestore into multiple filestores, say based on cohorts, but the problem still exists. Most conventional filesystems have trouble with lots of changing small files, and consequently so do most backup solutions.

Now one way you could do this is to have a database driven solution that tracks when a particular file has been changed and writes the path somewhere, so you only back up new or changed files.

This is a solution we're developing for our student filestore, which is based on Apple Xserve and Xserve RAID technology. While we can replicate the filestore and do diffs on the filelists to allow user driven restores, we can't actually back it up, which is something we would like to do for DR purposes. So we're developing a database driven system to track changes and build synthetic backups that we can then write out to a volume and back up conventionally.
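As a rough illustration of the idea (not the system we are actually building), here is a minimal Python sketch of a change-tracking table. The table layout, the function names and the example paths are all hypothetical, and it assumes something on the file server (a hook, a log reader, whatever) calls `record_change` whenever a file is written.

```python
import sqlite3
import time

def open_tracker(path=":memory:"):
    """Open (or create) the change-tracking database. Schema is an assumption."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS changes ("
        " path TEXT PRIMARY KEY,"
        " changed_at REAL NOT NULL)"
    )
    return db

def record_change(db, path, when=None):
    """Note that a file changed; a later change overwrites the earlier row."""
    when = time.time() if when is None else when
    db.execute(
        "INSERT OR REPLACE INTO changes (path, changed_at) VALUES (?, ?)",
        (path, when),
    )
    db.commit()

def changed_since(db, last_backup):
    """Return the paths changed after the last backup, with no treewalk needed."""
    rows = db.execute(
        "SELECT path FROM changes WHERE changed_at > ? ORDER BY path",
        (last_backup,),
    )
    return [row[0] for row in rows]

# Hypothetical usage: three writes, then ask what changed after time 150.
db = open_tracker()
record_change(db, "/home/alice/essay.doc", when=100.0)
record_change(db, "/home/bob/notes.txt", when=200.0)
record_change(db, "/home/alice/essay.doc", when=300.0)
print(changed_since(db, 150.0))
```

The point of the design is that the backup job only queries the table for paths, so its cost depends on the number of changes rather than the number of files in the filestore.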

The alternative would be to use a true metadata and pointer driven filesystem like the Google File System where, while you have to rewrite chunks whenever a file inside a chunk changes, all you need to do is back up the chunks, which are larger and easier for conventional filesystem type backups to handle. And you don't need to use the Google File System - the fossil/venti combination found in Plan 9 would work just as well.

This problem also exists in digital preservation. Digital content for long term storage typically consists of lots of unstructured data and content, with a lot of small files. However, because this is long term storage the files don't change; they just get added to.

Most solutions in this space are fairly conservative and rely on a conventional filesystem for the object store and a database to store the metadata associated with the objects. To this one adds something like SAMFS to store multiple copies of the object store and the individual checksums of the objects, and to do some integrity checking to avoid bitrot. Hitachi's Content Archive Platform works like this, as do Honeycomb (aka StorageTek 5800) and commercial digital preservation products like DigiTool from Ex Libris.
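The checksum side of this can be sketched in a few lines of Python. This is only an illustration of the general technique, not how any of the products above do it; the function names and file names are made up, and the choice of SHA-256 is an assumption.

```python
import hashlib
import os
import tempfile

def sha256_of(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large objects need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(paths):
    """Record a checksum per object at ingest time."""
    return {path: sha256_of(path) for path in paths}

def find_damaged(manifest):
    """Return objects that are missing or no longer match their checksum."""
    damaged = []
    for path, expected in sorted(manifest.items()):
        if not os.path.exists(path) or sha256_of(path) != expected:
            damaged.append(path)
    return damaged

# Demonstration on throwaway files: corrupt one object and detect it.
with tempfile.TemporaryDirectory() as store:
    good = os.path.join(store, "scan-001.tif")
    bad = os.path.join(store, "scan-002.tif")
    for path, payload in ((good, b"first image"), (bad, b"second image")):
        with open(path, "wb") as handle:
            handle.write(payload)
    manifest = build_manifest([good, bad])
    with open(bad, "wb") as handle:
        handle.write(b"second imagf")  # simulate a corrupted byte
    damaged = [os.path.basename(p) for p in find_damaged(manifest)]
print(damaged)
```

In a real archive the manifest would live in the metadata database, and a damaged object would be repaired from one of the other copies.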

And this works fine, because the ingest rate is typically low and there's no churn, which means that whatever storage backend, tape archive or replication solution is in use can cope by doing continuous synthetic backups (or by rsync, rdiff or whatever).

What happens if you're Facebook or the Kodak EasyShare site? Users, and there are a lot of them, are continually adding and modifying content, and the content consists of lots of small files. You've got to keep the content for as long as the user keeps subscribing, so you have lots of files. Yes, you could quota each user to, say, 1GB or 1000 files (for file based backup and replication you're more worried about the number of files, or directory table entries, than the sheer amount of filestore), but if you've many thousands of users it's still a lot of files: too many to back up conventionally, but which you would probably replicate multiple times.

So you could say that the Flickr filestore or the Kodak EasyShare filestore would be a close model for a typical student filestore on drugs.

Now I don't know how Flickr provide their store, but Kodak uses Isilon to provide theirs.

So when Isilon came to Canberra to spruik their solution I was interested. Especially as we also potentially have a long term archiving problem with medical images, astronomical images, and astronomical data, as well as having to provide a large live student filestore. Something that would scale to 1600TB is interesting.

And it was interesting. Basically, start with three nodes. Stripe the data across them in such a way as to have multiple redundancy across disks within boxes and between boxes, so that you could lose either a random set of disks or a whole box and keep going. Use InfiniBand as a backplane to glue the boxes together. Additional nodes can be added to increase the amount of filestore available, provided you stay within the redundancy rules. And you don't back it up - you replicate it. (As an aside, I suddenly realised at this moment why Hitachi had put MAID (like Copan) into their preservation package: most times you never need to access the replicated copies, so why keep the disks spinning? Simple when you think it through.)

And you present it as a contiguous filesystem, exposed as shares and accessible by NFS, CIFS, HTTP and, grudgingly, the Apple Filing Protocol (AFP). Cost is around $10K/TB plus some extra for some of the replication tools. Not astoundingly cheap, but not ridiculous either.

But I had a niggle. It was all sales and nothing about how the filesystem worked. How do they manage it? Given its potential size it can't be a conventional inode or cluster based system, so I'm guessing it must be a distributed system, something like fossil. They said they'd get back to me but they haven't. Certainly fossil would give them efficiencies.

And then there's the other problem - googling for technical information, I came across a whole set of entries suggesting that there might be some financial problems at the parent company.

On the other side, they did drop a hint that another university in Australia (Victoria actually) was possibly about to buy their solution for medical imagery. They promised to confirm that as well - something else I'm still waiting to hear about.

So: promising, not cheap, and a few doubts, but if it works the way I'm guessing, it could be a really useful technology for holding large amounts of unstructured data, either as filestore or for archiving.


Arthur said...

So we're developing a database driven system to track changes and build synthetic backups that we can then write out to a volume and backup conventionally

Isn't this what Time Machine does? It reads a changes list and copies those files. Are you polling the OS X change list or maintaining your own?

ZFS can do similar things with snapshots and copy on write (ditto NetApp), as I'm sure you know.

COOLJOE said...

If you want to back up conventional files without having to go through the traditional backup methods, then I suggest you look at, it's a block based backup that only backs up the block changes, and not the whole file. It gets rid of the redundant file structure as well as speeding up the backup process. Furthermore, upon restore, you restore from the last backup, unlike the sequential recovery process in the traditional methods. Evault is easy, fast and cost effective; it's a true disk based backup system, unlike Veritas, CA etc. which work on VTL and emulate tape backup on disk.

As for Isilon, this is a great company with a fantastic technology that works; it's easy to use and requires very little support. It's great for unstructured data, especially video files, images, music files, content, medical content etc. Anything that is large in size and unmanageable. The true cost savings on Isilon come in the form of scalability, high availability and support costs... It's going to be a big winner in the storage industry; although it had a rough start, it's getting its act together.

dgm said...

Blocklet type technology as pioneered by Rocksoft is certainly interesting - as I wrote in 2006:

[Information Architecture] Blocklets
posted Tue, 05 Sep 2006 18:20:30 -0700

Interesting data compression software idea from a company in Adelaide called Rocksoft.

Idea goes as follows:

Suppose we have a sentence "Now is the time for all good men to come to the aid of the party". We decide that words are groups of non-space characters bounded by spaces. And just like zip files we start tokenizing them - the same technique is used in online dictionaries - and we use hex numbers as the tokens.

The sentence now becomes 01 02 03 04 05 06 07 08 09 0A 09 03 0B 0C 03 0D.

It gets better. Suppose we have a sentence "The Party needs good men". Ignoring case, we can tokenize it as 03 0D 0E 07 08, and suddenly we're getting compression gain, as only one new word had to be added to the dictionary.
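The word-token analogy can be tried out with a few lines of Python. This is only a sketch of the analogy as described here, not of Rocksoft's actual blocklet algorithm; the `tokenize` function is made up for illustration, and it assumes case is folded so that "The" and "the" share a token.

```python
def tokenize(sentence, dictionary):
    """Replace each word with a two-digit hex token, growing the shared dictionary."""
    tokens = []
    for word in sentence.lower().split():
        if word not in dictionary:
            # New word: give it the next free token number.
            dictionary[word] = len(dictionary) + 1
        tokens.append(format(dictionary[word], "02X"))
    return " ".join(tokens)

dictionary = {}
first = tokenize(
    "Now is the time for all good men to come to the aid of the party",
    dictionary,
)
second = tokenize("The Party needs good men", dictionary)

print(first)   # 01 02 03 04 05 06 07 08 09 0A 09 03 0B 0C 03 0D
print(second)  # 03 0D 0E 07 08
print(len(dictionary))  # 14 distinct words cover both sentences
```

The second sentence adds only one entry ("needs") to the dictionary, which is where the gain comes from as more data passes through.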

Do this in real time on a disk based archiving system and we start to get compression ratios of 40 times, given a large enough dictionary.

More details at

Rocksoft have since been bought by ADIC, who were bought by Quantum, and the technology now shows up in some of Quantum's de-duplication products.

However, sexy as it is, you do still need to calculate the changes and then do file synthesis. The trick in the Quantum VTL is to dump everything to disk and then do the calculations as post processing, rather than trying to do them as part of the initial backup process.

What I don't know is how well this works with a churny filestore. Obviously it's going to be dependent on the type of filesystem, which is where journaling systems like ZFS give you a win.