Tuesday, 16 November 2010

storage, storage, storage

The key to the digital archiving game is reliable, replicated, persistent storage, i.e. storage where we can put stuff in and be assured that what comes out at a later date is what we put in.

On small archives this is of course simple to do: you copy the files multiple times, periodically compute MD5 checksums to make sure each copy still matches the original checksum, and hope to do this often enough that you don't end up with all the copies going bad.
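As a rough illustration of that periodic check, here is a minimal Python sketch. The function names and the idea of a separately recorded "expected" digest are my own assumptions for the example, not a description of any particular archive's tooling. MD5 is used because that is what is described above; passing "sha256" as the algorithm works the same way.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read until handle.read() returns an empty bytes object (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_copies(copies, expected_digest, algorithm="md5"):
    """Split copies into (good, bad) lists by comparing against the recorded digest.

    A copy that is missing counts as bad. 'expected_digest' is assumed to be
    the checksum recorded when the file was first deposited.
    """
    good, bad = [], []
    for path in copies:
        path = Path(path)
        ok = path.is_file() and file_digest(path, algorithm) == expected_digest
        (good if ok else bad).append(path)
    return good, bad
```

Any copy that lands in the bad list can then be replaced from one in the good list, which is only possible while at least one good copy survives, hence the need to run the check often enough.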

Statistically this is unlikely, although as David Rosenthal has recently pointed out, the bigger your archive and the longer you keep it, the greater the chance of spectacular failure. For most small academic archives this is less of an issue, principally because the small size of the archive and the hardware refresh cycle should mean that disks are replaced before their reliability starts to decline with age.

Basically, if you replace the hardware every three years you should get twice as much newer and more reliable storage for your dollar. And the size of the archive is such that you can probably even do periodic tape backups as a belt-and-braces exercise.

Large archives are of course different and have various problems of scale.

However, one problem that does happen is vendor change. Vendors regularly decide to stop making things. For example, we had a student filestore solution based on Xserves that did replication and the like, and could conceivably have been turned into an archival filesystem.

Apple have, of course, end-of-lifed the Xserve, which means that the solution will need to be migrated to new hardware. Whether this is an opportunity or a challenge I suppose depends on your view of life. And to be fair, we only started with Xserves to provide better AFP support.

Now, for student filestore we have a range of options, from migrating to StorNext to outsourcing the whole thing.

While student data is as valuable as any other data, it is short-lived, meaning that as long as we can move the files reliably once, we probably won't need to move them again.

Archival filestores are of course different. Even if we only plan to keep the contents for ten years, on a three-year hardware refresh cycle that's three migrations. If we think about keeping stuff for a lifetime, that's twenty-five migrations, each with its attendant problems and risk of corruption.
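The arithmetic behind those numbers can be sketched as follows. The three-year refresh cycle comes from the discussion above; reading "a lifetime" as 75 years, the 1% chance of a migration going wrong, and the assumption that migrations fail independently of each other are all mine, chosen purely for illustration.

```python
def migrations_needed(retention_years, refresh_years=3):
    """Count the whole hardware refreshes that fall within the retention period."""
    return retention_years // refresh_years


def chance_of_at_least_one_bad_migration(migrations, per_migration_risk):
    """Probability that at least one of n independent migrations goes wrong."""
    return 1 - (1 - per_migration_risk) ** migrations


# Hypothetical inputs: 10 years, and a 75-year "lifetime", at 1% risk per migration.
for years in (10, 75):
    n = migrations_needed(years)
    risk = chance_of_at_least_one_bad_migration(n, 0.01)
    print(f"{years} years: {n} migrations, {risk:.1%} chance of at least one problem")
```

With these made-up figures the risk goes from about 3% over ten years to about 22% over a lifetime. The exact values matter less than the shape: a risk that looks negligible for a single migration compounds over the life of the archive.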

Now most migrations go smoothly, and of course you (usually) have a usable backup.

Ninety percent of the contents of most reasonably large archival stores is never accessed after the first few years, so there is a temptation to save costs by only actively verifying the more commonly accessed data, which of course means we start to risk silent corruption.

Now I have a lot of photos online of my wife and cat. And I'll probably still occasionally want to look at them twenty-five years from now. Can I be assured I can access them? Or PDFs, or this blog?

And of course these sit on large commercial providers. For small academic archives the problem is worse, as they may hold the only copy and be too resource-limited to test and verify it, leaving them at the mercy of migration anomalies, especially as these migrations tend to be single point-in-time changes, rather than the evolutionary changes seen in large archives ...
