Friday 2 March 2012

How big is your data?

When you're trying to design a data archive, one of the problems you face is deciding how much storage you need.

This is in fact almost impossible. The real answer is 'Lots, as much as possible'. Providing a justifiable number is a little trickier, because people really have no idea how much data they have.

For example, a few years ago I was responsible for the design and specification of an anthropological archive. Some of the material was digitised, some wasn't. The digitised information consisted of photographs (say 500KB each), TIFF images (somewhere between 1 and 2MB each), MP3s of language recordings (say 20MB each), and some video (God alone knows).

To make a plausible number, do something like:

([number digitised] x [average size]) + ([estimated annual digitisation rate] x [average size] x 3)

ie how big is what we have, how quickly it will grow, and how long it will be until we can buy more storage.

Of course no one knows how quickly it will grow, and there probably isn't a business plan that gives you a number, but using the past rate of digitisation is probably a good start (say 2,000 images in three years).

Take this number. Double it (or triple it), since you want multiple copies for redundancy and to guard against bit rot, then add 50% for headroom, and that's your answer.
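To make the arithmetic concrete, here's a minimal sketch of that estimate in Python. The function and the example figures (2,000 TIFFs at roughly 1.5MB each, growing at the "2000 images in three years" rate) are illustrative assumptions, not numbers from a real archive.

```python
# Back-of-envelope storage estimate for an archive: what we hold now, plus
# three years of growth at the recent digitisation rate, multiplied out for
# redundant copies, with 50% headroom on top. All inputs are assumptions.

def storage_estimate_gb(num_digitised, avg_size_gb, annual_rate,
                        years_until_expansion=3, copies=2, headroom=0.5):
    """Return an estimated storage requirement in gigabytes."""
    current = num_digitised * avg_size_gb
    growth = annual_rate * avg_size_gb * years_until_expansion
    return (current + growth) * copies * (1 + headroom)

# Example: ~2000 digitised TIFFs at ~1.5 MB each, digitised at
# about 667 images a year.
MB = 1 / 1024  # one megabyte expressed in gigabytes
estimate = storage_estimate_gb(num_digitised=2000, avg_size_gb=1.5 * MB,
                               annual_rate=667)
print(f"Estimated requirement: {estimate:.1f} GB")  # roughly 17.6 GB
```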

For an average research institute or archive the total is probably in the low tens of terabytes, or possibly a hundred or so at most if video is involved. Remember that on the basis of these figures a two-hundred-page handwritten journal scanned as TIFFs is still less than half a gigabyte. The same goes for a thousand images. By comparison, I have a 16GB SD card in my e-reader. The average epub without graphics is under 500KB, ie I can have a year's reading on a single card that, coincidentally, costs around $16. The point here is that storage has grown much faster than the size of individual items, and that the cost of storing stuff is minimal when costed on a per-item basis.
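As a quick sanity check on those per-item figures (again, the sizes are the rough assumptions used above, including the text-only epub size):

```python
# Rough per-item arithmetic behind the claims above (sizes are assumptions).
PAGE_TIFF_MB = 2           # generous size for one scanned journal page
EPUB_KB = 500              # typical text-only epub, assumed
CARD_GB = 16               # the SD card in the e-reader

journal_gb = 200 * PAGE_TIFF_MB / 1024             # 200-page journal as TIFFs
books_on_card = CARD_GB * 1024 * 1024 // EPUB_KB   # epubs that fit on the card

print(f"200-page journal: ~{journal_gb:.2f} GB")   # ~0.39 GB
print(f"Epubs on a 16GB card: {books_on_card}")    # roughly 33,000
```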

The takeaway is that the data, while substantial, is actually small in terms of storage.

Now think about archiving scientific data. Astronomical data is big. We know this. So is Genomics/Phenomics data, and some of the earth sciences and climatology data.

The rest? Well, a lot of it is spreadsheets and analyses, and probably not that big. I did try running a survey to find out, but with little success.

My guess is that most items are less than a gigabyte, and that aggregated into a dataset or collection they come to much less than a terabyte. Certainly this is true for legacy data, as until recently we have not been able to store (or back up) that much.

Given the size of storage devices available now this probably means we can store almost everything except for the data already known to be big, and video data.

This of course isn't the whole answer, as there are operational costs for backup and for migration to new filestore as the old kit ages, but it does mean that if Microsoft can give everyone with Skydrive 25GB, then potentially so could any research archive.

The major cost is not in storing it, it's in getting the stuff in, ingested and catalogued in the first place, and as such we probably shouldn't worry overmuch about allocation models. What's more important is to understand the data and its storage requirements ...

1 comment:

dgm said...

Interestingly, the Internet archive compresses down to around 7PB ....