So, having said that we can treat the
literary canon as data for some analyses, what is data?
My view is that it is all just stuff.
Historically, researchers have never cared overmuch about data and its retention, the obvious exception being the social and health sciences, where people have built whole careers out of reanalysing survey data and combining sets of survey data.
In other disciplines attitudes have varied widely, often driven by journal requirements to retain data for substantiation purposes. And in the Sciences Humaines, no one really thought about their research material as data until very recently.
What we can be sure of is that the nature of scholarly output is changing, with the increased use of blogs to communicate work in progress, videos to present results, and so on. I'm sure we can all come up with a range of examples. The established communication methods of the last 150 years are breaking down and changing. And to be fair, they only ever really applied to the sciences and social sciences; in the humanities and other disciplines they were never the universal model for scholarly communication.
Likewise in teaching and learning, the rise of the VLE, with its implicit emphasis on electronic media, has changed the way learning resources are presented to students.
Data is just another resource and, being electronic, a highly volatile one. We can still read Speirs Bruce's Antarctic survey data because it was handwritten in bound paper notebooks. Reading data stored as ClarisWorks spreadsheets written on a 1990s Mac is rather more complex: for a start you need a working machine and a copy of the software in order to export the files ...
However, it is not all gloom; sometimes the data can be recovered.
Just recently I helped recover some data from the 1980s. It had been written on magnetic tape by machines from a manufacturer that no longer exists. Fortunately the tapes had been written in a standard tape format, which a specialist data recovery company could read, the files had reasonably self-describing names, and the people who had originally carried out the research were still around to explain what the files were.
In ten years' time, recovering this data might well be near impossible.
Once recovered, looking after the data is relatively simple: electronic resources are just ones and zeros. No file or type of content needs special handling in order to be preserved; the techniques to enable this are well understood, as are those for providing large data stores.
It is the accompanying metadata that adds value and makes content discoverable. And that, of course, is where the people come in. Electronic resources are incomprehensible without documentation, be it technical information on file format and structure or contextual information.
So, if we are to attempt to preserve legacy data, we need to start now, while the people who can explain and help document the data are still around.
It also means that the process of deposit has to be easy. Users will not understand whatever arcane classification rules we may come up with based on discipline or data type.
While it is perfectly possible, and indeed sensible, to implement a whole range of specialist collections and repositories, if you are taking a global view you need to start with a service of last resort. Such a service should be designed to be agnostic as regards content, and should also be able to hold content by reference, i.e. it can hold the metadata, without any restrictions as to type, and point to the content stored on some other filesystem not managed by it.
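To make "content by reference" concrete, here is a minimal sketch of what such a record might look like: the service stores only the metadata and a pointer to content that lives on a filesystem it does not manage. All field names here are illustrative assumptions, not taken from any real repository schema.

```python
# Sketch of a content-by-reference record: the repository holds only
# this metadata; the file itself lives on an external filesystem the
# repository does not manage. Field names are illustrative only.
record = {
    "identifier": "dataset-0001",
    "title": "Recovered 1980s magnetic tape survey data",
    "format": "text/plain",          # technical metadata
    "description": "Survey files; see accompanying documentation",
    "location": "file://archive-server/store/dataset-0001/",
    "managed_by_repository": False,  # content held elsewhere, by reference
}

def is_by_reference(rec):
    """A record is 'by reference' when the service only points at the
    content rather than holding and managing it directly."""
    return not rec["managed_by_repository"] and "location" in rec

print(is_by_reference(record))  # True
```

The point of the sketch is that nothing about the record restricts what type of content it describes; the service stays agnostic and only the pointer and the documentation matter.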
It is essentially a digital asset
management system. It is not a universal solution to enable all
things to all people.
It is perfectly sensible to group digital assets in a variety of ways that reflect the institution's internal processes. PDFs of research papers live in ePrints. Learning objects live in the learning object repository, should we have one. A specialist collection of East Asian texts lives in a dedicated collection, and so on.
This means three things:
1) A framework around data governance and data management is a must-have. It needs to say things like 'if it's not in a centrally managed repository, you need to make sure it's backed up and has a sensible management plan'.
2) The institution concerned requires a collection registry that holds information about all the collections and where they are located, to aid searching and discovery.
3) We need a submission and ingest process that is as simple and as universal as possible. If it's not simple for people to use, people won't use it. We might have a model as to what goes where, and we might have demarcation disputes, but these are irrelevant to the users of the system(s).
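The collection registry in point 2 can be sketched very simply: a list of collections, what each holds, and where it lives, with a basic search over it. The collection names echo the examples above; the URLs and field names are hypothetical, invented for illustration.

```python
# Minimal sketch of a collection registry: one entry per collection,
# recording what it holds and where it is, to aid discovery.
# All URLs and field names are hypothetical.
registry = [
    {"name": "ePrints",                "holds": "research paper PDFs",
     "location": "https://eprints.example.edu"},
    {"name": "learning objects",       "holds": "teaching materials",
     "location": "https://vle.example.edu/repository"},
    {"name": "East Asian texts",       "holds": "specialist digitised texts",
     "location": "https://library.example.edu/east-asian"},
    {"name": "service of last resort", "holds": "everything else",
     "location": "https://data.example.edu"},
]

def find_collections(term):
    """Basic discovery: return the names of collections whose name or
    description mentions the search term."""
    term = term.lower()
    return [c["name"] for c in registry
            if term in c["holds"].lower() or term in c["name"].lower()]

print(find_collections("texts"))  # ['East Asian texts']
```

Even a registry this thin answers the user's real question, "where do I look?", without exposing them to the demarcation rules behind it.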