Monday 20 August 2012

So, what to do with this data stuff?


So, having said that we can treat the literary canon as data for some analyses, what is data?

My view is that it is all just stuff.

Historically, researchers have never cared overmuch about data and its retention - the obvious exception being the social and health sciences, where people have built whole careers out of reanalysing survey data and combining sets of survey data.

In other disciplines attitudes have varied widely, often driven by journal requirements to retain data for substantiation purposes. And in the Sciences Humaines no one has really thought about their research material as data until very recently.

What we can be sure of is that the nature of scholarly output is changing, with the increased use of blogs to communicate work in progress, videos to present results, and so on. I'm sure we can all come up with a range of examples. The established communication methods of the last 150 years are breaking down and changing. And to be fair, they only really applied to the sciences and social sciences. In the Humanities and other disciplines they have never been the universal model for scholarly communication.

Likewise in teaching and learning, the rise of the VLE (virtual learning environment), with its implicit emphasis on the use of electronic media, has changed the way that learning resources are presented to students.

Data is just another resource and, being electronic, highly volatile. We can still read Speirs Bruce's Antarctic survey data because it was handwritten in bound paper notebooks. Reading data stored as ClarisWorks spreadsheets written on a 1990s Mac is rather more complex - for a start you need a working machine and a copy of the software in order to export the files ...

However, not all is total gloom: sometimes one can recover the data.

Just recently I've helped recover some data from the 1980s. It was written on magnetic tape by machines built by a manufacturer that no longer exists. Fortunately the tapes had been written in a standard tape format which could be read by a specialist data recovery company, the files had reasonably self-describing names, and the people who had originally carried out the research were still around to explain what the files were.

In 10 years' time recovering this data might well be near impossible.

Once recovered, looking after the data is relatively simple - electronic resources are just ones and zeros. No files or content need special handling in order to be preserved – the techniques to enable this are well understood, as are those to provide large data stores.

It is the accompanying metadata that adds value and makes content discoverable. And that of course is where the people come in. Electronic resources are incomprehensible without documentation, be it technical information on file format and structure or contextual information.

So, if we are to attempt to preserve legacy data we need to start now while the people who can explain and help document the data are still around.

It also means that you have to make the process of deposit easy. Users will not understand the arcane classification rules we may come up with based on discipline or data type.

While it is perfectly possible, and indeed sensible, to implement a whole range of specialist collections and repositories, if you are taking a global view you need to start with a service of last resort. Such a service should be designed to be agnostic as regards content, and should also be able to hold content by reference, i.e. it can hold the metadata without any restrictions as to type and point to the content stored on some other filesystem not managed by it.

It is essentially a digital asset management system. It is not a universal solution to enable all things to all people.
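
As a rough sketch of what "agnostic as regards content" and "by reference" might look like in practice, here is a minimal Python record type. Everything in it (the AssetRecord name, its fields, the example values) is my own invention for illustration, not a description of any real repository product.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AssetRecord:
    """Hypothetical metadata record for a content-agnostic store."""
    identifier: str
    title: str
    depositor: str
    description: str = ""
    # Free-form metadata: no restriction by discipline or data type.
    extra: dict = field(default_factory=dict)
    # Exactly one of these is set: content held by the service itself,
    # or a pointer to content on a filesystem the service does not manage.
    managed_path: Optional[str] = None
    external_uri: Optional[str] = None

    def __post_init__(self):
        # Reject records that point nowhere, or in two places at once.
        if (self.managed_path is None) == (self.external_uri is None):
            raise ValueError("set exactly one of managed_path or external_uri")

    @property
    def held_by_reference(self) -> bool:
        return self.external_uri is not None


# Made-up example: metadata only, content lives elsewhere.
tapes = AssetRecord(
    identifier="asset-0001",
    title="Recovered survey files",
    depositor="example-user",
    external_uri="file:///some/other/filesystem/survey/",
    extra={"original_medium": "magnetic tape"},
)
```

The point of the sketch is that the record does not care what the content is: the service only has to look after the metadata and know where the bytes live.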

It is perfectly sensible to group digital assets in a variety of ways that reflect the Institution's internal processes. PDFs of research papers live in ePrints. Learning objects live in the learning object repository, should we have one. A specialist collection of East Asian texts lives in a dedicated collection, etc.

This means three things:

1) A framework round data governance and data management is a must-have. It needs to say things like 'if it's not in a centrally managed repository you need to make sure it's backed up and has a sensible management plan'.
2) The institution concerned requires a collection registry that holds information about all the collections and where they are located, to aid searching and discovery.
3) We need as simple and as universal a submission and ingest process as possible. If it's not simple to use, people won't use it. We might have a model as to what goes where, and have demarcation disputes, but these are irrelevant to the users of the system(s).
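
To make the collection registry in point 2 a little more concrete, here is a toy sketch in Python. The class names, the fields and the example.org locations are all assumptions of mine for illustration; a real registry would need rather more than this (identifiers, access rules, persistence).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Collection:
    """Hypothetical registry entry: what a collection is and where it lives."""
    name: str
    location: str      # URL or path of the collection (made up below)
    description: str


class CollectionRegistry:
    """Toy in-memory registry of collections, searchable by keyword."""

    def __init__(self):
        self._collections = {}

    def register(self, collection):
        # Later registrations with the same name replace earlier ones.
        self._collections[collection.name] = collection

    def search(self, term):
        # Case-insensitive substring match on name and description.
        term = term.lower()
        return [
            c for c in self._collections.values()
            if term in c.name.lower() or term in c.description.lower()
        ]


registry = CollectionRegistry()
registry.register(Collection(
    "ePrints", "https://eprints.example.org/", "PDFs of research papers"))
registry.register(Collection(
    "East Asian texts", "https://texts.example.org/", "Specialist text collection"))

hits = registry.search("research papers")
```

The users never need to know the model of what goes where: they search the registry and are pointed at the right collection.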
