Tuesday 8 December 2009

Repositories, clouds, and corruption

Digital repositories are interesting examples of file systems. Typically they consist of a presentation layer, a database, and a storage layer.

The presentation layer is the application, or applications, that people use to add data to the repository, search for data within the repository, and retrieve data.

The database stores information about the individual objects (files) in the repository and the relationships between them, as well as metadata describing each file and its contents.

The files themselves are stored in a file system, usually with unique system-generated names redolent of Babylonian prophets. As you only ever search for the object, you never need to know its name; all that is required is that the system does. This is why, for example, files downloaded from Flickr have weird long hexadecimal names. It also means that the filestore is unstructured, and contains lots of files of similar size, the majority of which are only accessed rarely.
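One common way to generate such opaque names (a sketch, not necessarily what Flickr or any particular repository actually does) is to derive them from a hash of the file's contents, which gives a long hexadecimal name with no human-meaningful structure:

```python
import hashlib

def object_name(path):
    """Derive an opaque store name from a file's contents.

    Hypothetical sketch: hashing the bytes yields a long hexadecimal
    string, so the store needs no meaningful directory structure and
    identical files collapse to a single name.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large objects don't have to fit in memory
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()
```

The database then simply records that name against the object's descriptive metadata.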

The filesystem is part of the storage layer.

Repositories are interesting as one typically only adds files to them, never deletes content from them, but one needs to guard against corruption and data loss. Typically this is done by making multiple copies of each file, checksumming the files, storing the checksum in the database, and periodically rerunning the checksum operation and comparing the answers with the answer stored at time of ingest. If one of the copies is corrupt, it's replaced by copying a good file.
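That ingest-then-verify cycle can be sketched in a few lines. The function names here are hypothetical, and a real repository would check the stored checksums out of its database rather than take them as arguments:

```python
import hashlib
import shutil

def checksum(path):
    """Checksum a file's contents in chunks (MD5 was the usual
    fixity algorithm in repositories of this era)."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_and_repair(copies, stored_checksum):
    """Compare each copy against the checksum recorded at ingest,
    then overwrite any corrupt copy from a known-good one.

    Returns the list of copies that had to be repaired.
    """
    good = [p for p in copies if checksum(p) == stored_checksum]
    bad = [p for p in copies if checksum(p) != stored_checksum]
    if not good:
        raise RuntimeError("all copies corrupt - restore from backup")
    for p in bad:
        shutil.copyfile(good[0], p)
    return bad
```

Run periodically over the whole store, this is the "self-healing" behaviour that products like SAM-FS/QFS packaged up.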

Typically, in the old days, one would use a product like SAM-FS/QFS to do this. It was also expensive to license, so most repositories didn't, and instead trusted in tape backups and rsync.

Of course, backing up repository stores to tape is an interesting exercise in itself, given that they consist of lots of small files in a flat structure - after all, the database doesn't need a directory structure. This can be extremely inefficient and slow to back up. Much better, in these days of cheap disks, to keep several copies on disk.

And of course, suddenly what one starts looking at resembles a distributed clustered filestore, like the Google File System or Amazon S3. And there have been experiments in running repositories on cloud infrastructure.

But of course, that costs money.

Building your own shared distributed clustered filestore may be a viable solution. And given that not just repositories but LMS applications are moving to a repository-style architecture, there may be a use case for building a local shared pool, using an application such as GlusterFS - a distributed, self-healing system that is tolerant of node crashes.

Doing this neatly decouples the storage layer from the presentation layer. As long as the presentation layer can write to a file system, the file system has the smarts to curate the data, which means it becomes easy for separate applications running on separate nodes to both write and share data - after all, a shared object is just a database entry pointing to an object already stored elsewhere on the system.

Definitely one to take further - other than the slight problem that, while people have tried running DSpace and Fedora against systems such as the Sun Honeycomb, no one seems to have considered GlusterFS...
