Monday, 23 June 2014

Curating Legacy Data

Data is the new black: everything seems to be about data at the moment, and it's desperately trendy. At the same time there is an entirely laudable movement towards researchers making the data that underlies their research available, for all sorts of reasons, including substantiation, reuse, and recombination with other data sources.
This sometimes gets conflated with Big Data, but it shouldn't - outside of a few disciplines, most experimental data is pretty small, and even quite large sets of results will fit comfortably on a twenty dollar USB stick.
It's important to remember that this is a recent phenomenon - only in the last five years or so have cloud storage and large, cheap commodity hard disks become widely available.
Before then, data would be stored on Zip drives (remember those?), CDs, DVDs, or DAT tapes - all formats which are either dead or dying, and all of which are subject to maintenance issues: basically bitrot due to media degradation.
Even if you've stored your data on an older external hard disk you can have problems. My wife did just that, and then lost the cable to a four year old external disk. It turned out, of course, to be a slightly non-standard variant of USB, and it took us a lot of searching of documentation and cable vendors (including a couple of false starts) to find a suitable cable. When is a standard not a standard? When it's a proprietary one.
Recovering this legacy data is labour intensive. It can be in formats that are difficult to read, and it can require conversion (with all that implies) to a newer format to be accessible - which can be a special kind of fun when it's not a well-known or well-documented format (nineteen nineties multi-channel data loggers come to mind).
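To give a concrete feel for what that conversion step can involve, here is a minimal Python sketch. It assumes a purely hypothetical logger format - fixed-width binary records of a four byte timestamp followed by eight two byte channel readings, with an invented scale factor - none of which comes from any real instrument.

```python
import csv
import struct

# Hypothetical record layout: big-endian uint32 timestamp + 8 x int16 channels.
RECORD = struct.Struct(">I8h")
SCALE = 0.01  # assumed counts-to-units conversion, purely illustrative

def convert(raw_path, csv_path):
    """Read fixed-width binary records and write them out as CSV."""
    with open(raw_path, "rb") as src, open(csv_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["timestamp"] + [f"ch{i}" for i in range(8)])
        while True:
            chunk = src.read(RECORD.size)
            if len(chunk) < RECORD.size:
                break  # end of file, or a truncated final record
            timestamp, *channels = RECORD.unpack(chunk)
            writer.writerow([timestamp] + [c * SCALE for c in channels])
```

The real work, of course, is usually in reverse engineering the record layout in the first place.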
So, what data should we convert?
Well, most scientific publications are rarely read or cited, so we could take a guess and say that it's probably not cost effective to convert the data underlying those. That said, someone did once ask me if I still had the data from an experiment I did back in the nineteen eighties - it turned out they were having difficulty getting regulatory approval for their physiology study and thought that reanalysing my data might give them something to help their case. I'm afraid I couldn't help them: the data was all on five and a quarter inch Cromemco Z2D disks, or else punch cards, and long gone.
So a legacy data curation strategy should probably focus on the data underlying highly cited papers - it's probably of greater value, and there's a chance it might have been stored in a more accessible format.
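As a rough illustration of that triage idea, here is a small Python sketch that ranks datasets by the citation count of the associated paper and recovers the most-cited ones first until a fixed budget is exhausted. The papers, counts, and costs are made up for the example.

```python
def triage(papers, budget):
    """Pick datasets to recover, most-cited first, within a fixed budget."""
    selected, spent = [], 0
    for paper in sorted(papers, key=lambda p: p["citations"], reverse=True):
        if spent + paper["recovery_cost"] <= budget:
            selected.append(paper["title"])
            spent += paper["recovery_cost"]
    return selected, spent

# Illustrative figures only - not drawn from any real catalogue
papers = [
    {"title": "Highly cited study", "citations": 420, "recovery_cost": 3000},
    {"title": "Modestly cited study", "citations": 35, "recovery_cost": 1200},
    {"title": "Rarely cited study", "citations": 2, "recovery_cost": 2500},
]
print(triage(papers, budget=5000))
# -> (['Highly cited study', 'Modestly cited study'], 4200)
```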
However, even recovering data that's been looked after still has costs associated with it - costs to cover the labour of getting it off the original source media and making it useful. And these are real dollar costs.
From experience, getting half a dozen nine-track tapes read costs around fifteen hundred dollars if it's done by a specialist media conversion company, and the administration, shipping, and making-useful phase probably another fifteen hundred - less if some poor graduate student can be persuaded to do the work, but it's still a reasonable chunk of money, and money that needs to come from somewhere.
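Putting those numbers together as a back-of-the-envelope calculation (these are just the rough figures quoted above, not a quote from any particular bureau):

```python
tapes = 6
media_reading = 1500       # specialist media conversion company
admin_and_cleanup = 1500   # shipping, administration, making it useful
total = media_reading + admin_and_cleanup
print(f"total ~ ${total}, or about ${total / tapes:.0f} per tape")
# total ~ $3000, or about $500 per tape
```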
So, who pays, and is it worth it?

[Update 26/06/2014 : Notes of a meeting in Sydney on this very subject ...] 
Written with StackEdit.
