Wednesday, 13 April 2011

Sustaining content

The threatened closure of AIATSIS's digitisation programme has made explicit the long-term problem of all digitisation and digital content programs - sustainability.

Sustainability is more accurately described as the problem of how we keep the curation process going after the initial funding has expired, where the curation process means checking that the data is still accessible and has not become corrupted in some way.

And this is a process that takes money.
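As a rough illustration of what that checking involves, here is a minimal sketch of a fixity check: record a checksum for every file once, then periodically confirm that each file is still present and unchanged. The function names and the manifest layout are my own assumptions for illustration, not something any particular repository package prescribes.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record a checksum for every file under root (hypothetical layout)."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def check_fixity(root: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare the files on disk against a previously recorded manifest."""
    report: dict[str, list[str]] = {"missing": [], "corrupt": []}
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file():
            report["missing"].append(rel_path)   # no longer accessible
        elif sha256_of(target) != expected:
            report["corrupt"].append(rel_path)   # contents have changed
    return report
```

In practice the manifest would be stored somewhere other than the filestore it describes, and the check run on a schedule; it is the staff time to act on the report that costs money, not the script.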

The usual output of a digitisation program is a website that gives access to the content, a database server that allows you to query the content (or, more accurately, the content's metadata), plus the content itself, which needs to be backed up, sanity checked and the rest. I don't have figures for the cost of replicated filestore, so I'm going to assume that it's 2.5 times the cost of straight filestore.

Ignoring the costs of software licences, you need 2 reasonable virtual machines - one for the front end, one for the database backend. Each server would cost a little less than $1000 in terms of hardware resource to provide for a 5 year term, but running costs are probably around $1500 per annum for power and cooling, plus $2500 per annum for operating system maintenance and patching. So the cost of the servers themselves is minimal compared to the cost of running them for five years: $4000 per annum, or $20000 over the term.

Storage is quite cheap to provide as well - say around $4000 for a terabyte of slow SAN-based SATA storage over a 5 year term. Using NAS would be a bit cheaper.

So let's say $10000 for storage if you add in replication/sanity checking.

So we could say that a project that needs to be maintained costs around $30000 for five years after the end of funding, or $6000 per annum.

Not a great sum.
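The arithmetic behind that figure can be written down as a small sketch, so you can substitute your own numbers. The constants are the rough estimates quoted above; the function names are mine. Two assumptions to flag: the $4000 per annum running cost is treated as the combined figure for both virtual machines (which is what the $20000 total implies), and the hardware cost of under $1000 per server is left out, as it is in the $30000 total.

```python
YEARS = 5

# Rough estimates quoted in the post, in dollars.
POWER_AND_COOLING_PER_YEAR = 1500
OS_MAINTENANCE_PER_YEAR = 2500
STORAGE_PER_TB_OVER_TERM = 4000   # slow SAN-based SATA, 5 year term
REPLICATION_FACTOR = 2.5          # assumed multiplier for replicated filestore


def five_year_cost(terabytes: float = 1.0) -> dict[str, float]:
    """Cost of keeping one project running 'as is' for five years."""
    running = (POWER_AND_COOLING_PER_YEAR + OS_MAINTENANCE_PER_YEAR) * YEARS
    storage = STORAGE_PER_TB_OVER_TERM * REPLICATION_FACTOR * terabytes
    total = running + storage
    return {
        "running": running,
        "storage": storage,
        "total": total,
        "per_year": total / YEARS,
    }


def aggregate_per_year(projects: int, terabytes_each: float = 1.0) -> float:
    """Annual bill for an institution hosting several such projects."""
    return projects * five_year_cost(terabytes_each)["per_year"]


if __name__ == "__main__":
    print(five_year_cost())
    print(aggregate_per_year(10))
```

With the defaults this reproduces the $30000 over five years, or $6000 per annum, for a project with a terabyte of content.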

However, there is a problem. Few digitisation or data hosting projects include a coherent, costed sustainability plan; the default seems to be 'oh, my institution will look after it'. The institution might well do so for a few years if it was only $6000 a year, but if it was 10 projects, each at $6k a year, the $60000 total is getting on for the cost of a grade 4 library person, trainee network support guy, or whatever.

Projects need better exit and sustainability plans, ones with real costs. And while some might be sustainable by selling subscriptions not all will be - in fact most of them won't be.

The simplest solution is probably to require projects to pay into a ringfenced fund, something akin to an annuity, that will provide for the sustainability of the data for a fixed term. Given that the half life of scientific publications is around five years (i.e. 50% are never referenced five years after publication) we can probably assume the same is true of data, and can say that the value of a dataset or digitisation project depends on the level of access it attracts.

This is an approach which we used to take at the UK Mirror Service when deciding which data sets to cease hosting - and while we also had special pleading it had the advantage of transparency.
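To give a feel for what paying into a ringfenced fund might mean in practice, here is a minimal sketch that sizes a one-off contribution using the standard present value formula for an annuity. The 4% annual return is purely an assumption for illustration, as is the convention that costs are paid at the end of each year; the function names are mine.

```python
def upfront_contribution(annual_cost: float, years: int, annual_return: float) -> float:
    """Lump sum needed now to pay annual_cost at the end of each year."""
    if annual_return == 0:
        return annual_cost * years
    # Present value of an ordinary annuity.
    return annual_cost * (1 - (1 + annual_return) ** -years) / annual_return


def run_down(balance: float, annual_cost: float, years: int, annual_return: float) -> list[float]:
    """Fund balance at the end of each year, after growth and that year's bill."""
    balances = []
    for _ in range(years):
        balance = balance * (1 + annual_return) - annual_cost
        balances.append(balance)
    return balances


if __name__ == "__main__":
    lump_sum = upfront_contribution(6000, 5, 0.04)   # 4% is an assumed rate
    print(round(lump_sum, 2))
    print([round(b, 2) for b in run_down(lump_sum, 6000, 5, 0.04)])
```

At the assumed 4%, covering $6000 a year for five years needs a contribution of about $26,700 rather than the full $30000, and the fund runs down to zero at the end of the term. A real scheme would also have to allow for inflation in running costs, which this sketch ignores.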

The costs cited are from internal estimates from my day job of what it costs to provide these things. They're not necessarily accurate; however, the server costs are not too far from the costs charged by a large hosting company here in Australia. The storage costs are slightly higher than Amazon's for hosting on redundant storage in their Singapore facility, but remember that Amazon will charge for data transfer costs (essentially website accesses and database lookups).

However, these are all back of the envelope prices, and you may find your technology and hosting costs differ significantly. The point remains though that even using serious hardware (virtual machines, high quality blade servers, high performance SAN hardware), the cost of maintaining the resource 'as is' is comparatively low on an annualised basis. In aggregate, however, these costs can amount to a reasonable degree of expense - meaning that the host institution probably should care about sustainability even if the individual projects do not ...


Arthur said...

That ignores the main cost of maintaining the repository. You can't just leave a pile of bits on disk and expect it to work. Software needs updating, data formats need updating etc, etc. That's what costs.

dgm said...

The annual $2500 operating system maintenance and patching includes this. Equally, one doesn't need a repository per se - only repository-like functionality, as the data is essentially static and needs only to be searched and indexed by external systems.