Tuesday 18 August 2009

Wither digital repositories?

A long time ago (2004 in fact) I was the digital asset management project manager for AIATSIS.

This was a really interesting project to procure and implement a digital asset management system, which bore a close resemblance to a digital repository, to store for all time the digitised patrimony of the Aboriginal cultures of Australia. The Aboriginal cultures were oral cultures, and while poor in terms of physical cultural artefacts they were immensely rich in terms of stories, songs, dance and the rest.

As the traditional societies broke down during the nineteenth and twentieth centuries there was a great risk of losing this material: social dislocation, disruption, the breakup of kinship groups and so on.

However AIATSIS had built up a great store of anthropologists' field notes, recordings on cassette and quarter inch tape, film, often 8mm or 16mm, and video.

Much of this material was in a poor state as it had not been conserved at all - in one case a box of tapes was discovered in a tin shed on someone's property after the original owner died. Being stored for forty years in a tin shed in the desert does not do anything for the longevity of quarter inch tape.

So the decision was made to digitise the materials, as recording technologies had moved from analog to digital. The result of this was a large amount of data that needed to be properly indexed and stored with appropriate metadata, and also made available to the societies whose data it originally was - digital cultural repatriation.

My part in this was to acquire a solution to do this. Previous to this I'd done a lot of work on backup solutions and had been on the UK Mirror Service steering group, so I wasn't new to the technology, although perhaps new to the concepts, but then everyone was in 2004.

Digital repositories are fairly simple. They consist of a database which contains the metadata for the digital objects. This metadata is in two parts: the technical metadata, usually things like where the object is stored, what format it is stored in, and so on, and the informational metadata, which contains things like provenance and access rights. The objects themselves live in an object store, or more precisely a persistent object store, i.e. some form of filesystem that has inbuilt resilience such as SAM-FS, or that is replicated multiple times with md5 checksums used to prove the copies are accurate. The checksums are then periodically rerun to see (a) whether the answer you get is the same as previously and (b) whether all copies continue to give you the same answer. This is basically a check against corruption, and is part of what SAM-FS/QFS will do for you.
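
As an aside, the periodic checksum part of that is easy to picture in code. A minimal sketch, assuming two replica directories and a small JSON catalogue of previously recorded checksums - the paths, file names and catalogue layout here are invented purely for illustration:

    import hashlib
    import json
    from pathlib import Path

    CATALOGUE = Path("fixity_catalogue.json")            # hypothetical catalogue of known checksums
    REPLICAS = [Path("/store/copy_a"), Path("/store/copy_b")]  # hypothetical replica roots

    def checksum(path: Path) -> str:
        """MD5 of a file, read in chunks so large objects don't exhaust memory."""
        h = hashlib.md5()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def verify(object_name: str, recorded: str) -> bool:
        """Check (a) each copy still matches the recorded checksum and
        (b) all copies therefore still agree with each other."""
        sums = [checksum(root / object_name) for root in REPLICAS]
        return all(s == recorded for s in sums)

    def run_audit() -> None:
        catalogue = json.loads(CATALOGUE.read_text())
        for name, recorded in catalogue.items():
            if not verify(name, recorded):
                print(f"fixity failure: {name}")   # flag for repair from a known good copy

    if __name__ == "__main__":
        run_audit()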

Such repositories can be very large: because the object store can be spread across multiple filesystems you are not limited by any single filesystem, and you search for things rather than addressing the individual file objects directly.

What you don't do is back them up. Backup is at its most basic a process where you copy the contents of a filesystem to another filesystem to give you a copy of the filesystem as it was at a single point in time. And you do this with sufficient frequency that your copies are reasonably accurate, so on a slow-changing filesystem you might make less frequent copies than on a busy filesystem.

Of course there is the risk that the filesystem contents might be changing as you copy it, which is why databases are typically quiesced and dumped, and the dump is backed up rather than the live database.

However, if you have a 1 terabyte filesystem and you back it up once a week and you keep your backups for six months, you have to store 26 terabytes. Decide that you need to do nightly backups because the filesystem changes so much, and that you're going to keep these for the first month, and you suddenly find yourself storing 30 nightly plus 22 weekly copies, i.e. 52 terabytes. Doesn't scale, and starts becoming expensive in terms of storage media and so on.
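
The arithmetic is worth spelling out - a back-of-the-envelope sketch using the figures above, and assuming every copy is a full copy rather than an incremental:

    TB_PER_COPY = 1     # a full copy of the 1 terabyte filesystem

    weekly_for_six_months = 26 * TB_PER_COPY        # one copy a week, kept for 26 weeks
    nightly_first_month = 30 * TB_PER_COPY          # one copy a night for roughly 30 days
    weekly_remaining_months = 22 * TB_PER_COPY      # weekly copies for the other ~22 weeks

    print(weekly_for_six_months)                            # 26 TB
    print(nightly_first_month + weekly_remaining_months)    # 52 TB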

Of course there are ways to mitigate this, so as to cope with the changes in a big filesystem. If your big filesystem contains lots of rapidly changing small files, such as a student filestore, you have a different bag of problems, as the filestore is different every time you look at it to see what's changed. So you end up with tricks like writing every file to two separate filesystems so you've got an automatic backup. And if you track changes you can then build a synthetic point in time copy.

Now the point is that conventional serial backup doesn't scale. And if you track the changes (in a database perhaps) you can regenerate a synthetic copy of any filesystem at a specific time (within reason).
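
A toy illustration of the track-the-changes idea: if the change log is just a time-ordered list of (timestamp, path, object reference) entries, then a point-in-time view is simply a replay of that log. The record layout here is invented for illustration, not any particular product's format:

    from typing import NamedTuple, Optional

    class Change(NamedTuple):
        timestamp: float            # when the file was written
        path: str                   # which file changed
        object_ref: Optional[str]   # pointer into the object store; None means deleted

    def filesystem_at(changes: list[Change], when: float) -> dict[str, str]:
        """Replay the change log up to 'when' to build a synthetic
        point-in-time view mapping path -> object reference."""
        state: dict[str, str] = {}
        for change in sorted(changes, key=lambda c: c.timestamp):
            if change.timestamp > when:
                break
            if change.object_ref is None:
                state.pop(change.path, None)     # the file was deleted
            else:
                state[change.path] = change.object_ref
        return state

    # e.g. the filesystem as it stood at time 150
    log = [
        Change(100.0, "thesis.doc", "md5:aaa"),
        Change(200.0, "thesis.doc", "md5:bbb"),
        Change(300.0, "notes.txt", None),
    ]
    print(filesystem_at(log, 150.0))    # {'thesis.doc': 'md5:aaa'}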

And suddenly your filesystem starts looking like a repository.

Now there's a reason why I'm telling you this. After doing AIATSIS's digital repository they asked me to be their IT manager, and from there I moved to ANU to be an Operations manager looking after servers, storage and backup, and making sure that the magic kept working. I've since taken another left turn and am now doubling up as ANU's repository manager.

OK, and ?

Well I went today to hear Bob Hammer, the CEO of Commvault, the company that produces our backup solution, speak. I'd gone as an Operations manager to hear about the technology enhancements and so on that were on the horizon.

What Bob Hammer had to say was more interesting than that, and very interesting from the repository point of view. In summary, it was that conventional linear backup is going to disappear, and that all the information lifecycle management, plus the clever stuff around replication, deduplication and cheap disk storage, is really going to give you an indexed store of persistent objects in which you search for objects against their metadata - essentially e-discovery - and that the content, i.e. the value of the information, is what you are preserving, not the files.

The other interesting point was that such a model means you can decouple storage from the repository, and that the repository could live in a data cloud somewhere, because what matters is fast search - as long as the results come back quickly the retrieval time matters less, the Google experience. It also means we can bridge different vendors' storage, and we no longer care desperately about filesystems and their efficiencies. The key is the metadata database.
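
A sketch of what that decoupling could look like: the repository holds only a metadata record, including a list of candidate locations, and retrieval is a metadata lookup followed by a fetch from whichever store answers. The record fields, store URLs and fetch() helper here are all hypothetical stand-ins:

    # Toy stand-ins for two different storage back ends.
    STORES = {
        "s3://vendor-a-bucket/obj-001": b"...wav bytes...",
        "file:///tape-library/obj-001": b"...wav bytes...",
    }

    def fetch(location: str) -> bytes:
        if location not in STORES:
            raise IOError(location)
        return STORES[location]

    # Hypothetical metadata record: the repository knows about the object,
    # not which filesystem or vendor happens to hold a copy of it.
    record = {
        "id": "obj-001",
        "title": "Field recording, 1968",
        "format": "audio/wav",
        "checksum": "md5:0123456789abcdef0123456789abcdef",
        "locations": [
            "s3://vendor-a-bucket/obj-001",
            "file:///tape-library/obj-001",
        ],
    }

    def retrieve(record: dict) -> bytes:
        """Try each location in turn; which store the bytes come from is
        invisible to the user, who only ever searched the metadata."""
        for location in record["locations"]:
            try:
                return fetch(location)
            except IOError:
                continue            # fall through to the next copy
        raise IOError(f"no copy of {record['id']} could be retrieved")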

He also said a great many more interesting things, but it was the idea of decoupling and going for a metadata approach that piqued my interest - here was the CEO of a backup company saying it was all going to change and this was his view of the changes.

There is also the implication that the filestore has resilience built in and everything is based on the metadata approach - a bit like the Google File System.

Of course the implication is that if conventional backup goes away and persistent storage looks a lot like a digital archive, what happens to repositories let alone filesystems?

In a sense the persistent store allows you to build collections with object reuse: you query the store's metadata, taking account of access rights, to identify suitable objects.
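
In that view a collection is little more than a saved query over the metadata. A small sketch, with the metadata field names invented for illustration:

    from typing import Callable

    # Hypothetical metadata records: each points at an object in the persistent store.
    records = [
        {"id": "obj-001", "subject": "oral history", "year": 1968, "access": "open"},
        {"id": "obj-002", "subject": "song cycle",   "year": 1968, "access": "restricted"},
        {"id": "obj-003", "subject": "oral history", "year": 1972, "access": "open"},
    ]

    def collection(predicate: Callable[[dict], bool]) -> list[str]:
        """A collection is just the set of object identifiers matching a query;
        the same object can appear in any number of collections."""
        return [r["id"] for r in records if predicate(r)]

    recordings_1968 = collection(lambda r: r["year"] == 1968 and r["access"] == "open")
    oral_history = collection(lambda r: r["subject"] == "oral history" and r["access"] == "open")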

So the questions are:

1) At what point does a digital archive just become a set of logic that controls the formats in which objects are loaded (ingested) and allows the recording of informational metadata? (The same is of course true of retrieval.)

2) If archives are collections of metadata that point at objects distributed across multiple object stores, is that a problem - provided of course the objects are properly persistent?

3) Object reuse just becomes a special case of #1: the same object can be ingested into multiple collections, e.g. a podcast could be ingested both into a history collection and into an archive of recordings made in a particular year.

4) And we need to think about what happens if an institution suffers a catastrophic failure of an object store. If all we have is a set of collections reusing objects, what happens if we lose the objects? Do we need to think about dark clones of these object stores, not unlike what Clockss provides for e-journals?
