Friday 3 January 2020

Digital preservation and preserving the digital

I've been doing digital preservation stuff in one way or another for over twenty years, managing projects, building servers, installing systems, even writing code.

In the early days it was all a bit finger in the air - no agreed standards, no agreed preservation formats, and lots of home-grown and ultimately unmaintainable solutions.

Nowadays - and I'm possibly out of the loop a little, being retired - it's basically a 'just add dollars' problem.

There is a tacit agreement about what to do, how to do it, and what software to deploy.

For some people a proprietary solution with predictable recurrent costs makes sense; others may find an open source solution, with its low entry costs but implied long-term funding for software maintenance (and, by implication, a specialist team to maintain it), more to their taste.

Either is good, either will work; it's basically a financial decision as to how best to proceed.

The days of having a test server sitting under your desk (and I've been there) are gone.

But there's an elephant in the room.

To date most digital preservation efforts have been focused on preserving digitised artefacts.

Books, early modern ballad sheets, insects collected in the nineteenth century, rude French postcards, and so on.

And this is partly because a lot of digital preservation efforts have been driven by librarians and archivists who wished to digitise items under their care for preservation purposes.

And this model works well, and can be easily extended to born digital records, be they photographs, medical records, or research data - it's all ones and zeros and we know how to store them.

And, being linear media with a beginning and an end, we can read the files as long as we understand the format.

What it doesn't work well for is dynamic hypertextual resources that do not have beginnings or ends but are instead database-driven, query-centric artefacts:

From memory, the Wagiman dictionary was written mostly in Perl and did a whole lot of queries on a database. I know of other projects, such as the Malay Concordance Project, that use similar technologies.

Essentially what they have is a web-based front end and a software mechanism for making queries into a database. The precise details of how individual projects work are not really relevant; what matters is that the web server needs to be maintained - not only does it have to be upgraded to handle modern browsers, it needs to be kept secure, the database software needs to be maintained, and of course the query mechanism needs to be upgraded as well.
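To make that concrete, here's a rough sketch of the pattern in Python (the real projects used Perl and the like, and the database, table and field names here are all invented) - a tiny web listener that turns a query string into a database lookup:

# A minimal sketch of the general pattern, not anyone's actual code:
# a small web front end that turns a query string into a database lookup.
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

DB_PATH = "dictionary.db"   # hypothetical database file

class QueryHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Pull the search term out of the query string, e.g. /?headword=wag
        params = parse_qs(urlparse(self.path).query)
        term = params.get("headword", [""])[0]
        conn = sqlite3.connect(DB_PATH)
        rows = conn.execute(
            "SELECT headword, gloss FROM entries WHERE headword LIKE ?",
            (term + "%",),
        ).fetchall()
        conn.close()
        body = "\n".join(f"{h}: {g}" for h, g in rows).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), QueryHandler).serve_forever()

Every moving part in those few lines - the listener, the database library, the query itself - is something that has to be kept patched and compatible, which is exactly the maintenance burden described above.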

Big commercial sites do this all the time of course, but academic projects suffer from the curse of the three year funding cycle - once something is developed there's usually no funding for sustainability, which means that even if it becomes very useful, no one is there to upgrade the environment, and it starts to suffer from bitrot. Left alone, sitting on a single machine with a fixed operating system version, it would probably run forever, but hardware dies, operating systems are upgraded, and all sorts of incompatibilities set in.

While it's not quite the same thing, go and take that old tablet or computer you've had sitting on a shelf since 2010 and have never quite got round to throwing out. Try accessing some internet-based services. Some will work, some won't.

And the reason, of course, is that things have changed in the intervening ten years. New versions, new protocols and so on.

So, what to do?

One solution is to build a virtual machine using all the old versions of software (assuming of course you can virtualise things successfully). The people who get old computer games running have a lot to teach us here - most games used every tweak possible to get as much performance as possible out of the hardware of the time. The fact that hardware is now immensely more powerful than even five years ago means that the performance cost of running emulations can be ignored, even if they're particularly complex.
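By way of illustration only - the disk image name and port numbers below are invented, and in practice you'd drive this through whatever virtualisation system you have - booting an emulated legacy machine under something like QEMU can be as simple as:

# A sketch only: boot an emulated legacy machine from an old disk image and
# forward its web port to the host. The image name and ports are invented;
# the QEMU options (-m, -hda, -nic user,hostfwd=...) are standard ones.
import subprocess

subprocess.run([
    "qemu-system-i386",
    "-m", "512",                           # the modest RAM the old box had
    "-hda", "wagiman-server-2004.img",     # hypothetical image of the original disk
    "-nic", "user,hostfwd=tcp::8080-:80",  # expose the legacy web server on host port 8080
])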

This gets rid of the hardware maintenance problem - as long as your virtualisation system allows, and continues to allow, you to emulate a machine that old, you're home and dry - except you're not.

You need to think about secure access, and that probably means a second machine that is outward-facing and passes the queries through to your virtual machine. This isn't rocket science; there are a surprising number of commercial legacy systems out there - business process automation solutions, for example - that work like this.
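As a sketch of the idea - in real life you'd probably reach for a proper reverse proxy such as nginx or Apache, and the internal address here is made up - the outward-facing machine just accepts queries and passes them through, read-only, to the legacy virtual machine:

# Rough sketch of an outward-facing gateway: accept queries from the internet
# and pass them through to the legacy VM on an internal-only address.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

LEGACY_VM = "http://192.168.56.10:8080"   # invented internal address of the emulated host

class GatewayHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only simple GET queries are passed through; anything else is refused,
        # which keeps the fragile legacy stack off the open internet.
        try:
            with urlopen(LEGACY_VM + self.path, timeout=30) as upstream:
                body = upstream.read()
                content_type = upstream.headers.get("Content-Type", "text/html")
        except OSError:
            self.send_error(502, "Legacy system unavailable")
            return
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8000), GatewayHandler).serve_forever()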

The other thing that needs to happen is that the system needs to be documented. People forget, people retire to Bali and so on, and sooner or later the software solution will need to be fixed because of an unforeseen problem ...
