Friday, 1 December 2017

More on spreadsheet preservation and normalisation


Yesterday, inspired on a post about preserving Google sheets I blogged about spreadsheet preservation in general.

As  is the way of these things the question has been rumbling round my brain ever since.

A long time ago, the National Archive of Australia released Xena, a normalisation tool that converts files into open xml based formats - essentially the open office formats used by Libre Office and others, on the basis that the xml produced is both documented  and readily parsable and that it would be possible to recover the data and the calculations from any preservation file.

And in fact when we built the original ANU data archive, we silently implemented this normalisation process as part of the workflow. We didn't use Xena, but after using Pronom to work out if we could recognise the file type, and if we had a normalisation engine for it - essentially an xml export tool, we would use that to produce a long term preservation copy which we would store, along with the original, in a bagit archive.

The idea of storing both, of course, is that as we didn't test the normalisation processes, and tended to trust the tools, it is just possible we could have produced garbage as part of the normalisation process.

In fact we deliberately ignored the year 1900 problem, as we reckoned that only a small number of spreadsheets would be affected.

So what does this mean for Google sheets?

Exporting to an xml format such as ods would seem to be the way to go, but given that it's not possible to preserve the original document, the sensible thing would be to download the spreadsheet in two formats, both ods and xlsx, given that both are in xml and that parsers exist for both formats.

The reconstituted spreadsheets should of course give identical results imported into the appropriate utilities.

Exporting a single sheet spreadsheet as as csv, or whatever, is only appropriate where there are no calculations involved, an example being where the spreadsheet was used to record species abundances in a number of quadrats.

The decision about whether to use an ascii format such as csv is best left to the researcher, they know their data, and whether it's appropriate.

The standard procedure should be to use a richer xml based format, and preferably two of them.

Ideally there should be some sanity checking before ingest ...

No comments: