Friday, 7 September 2007

OOXML, ODF, PDF and long term preservation

There's been a lot of talk over the last few days about Microsoft's attempts to get OOXML accepted as a standard. OOXML is basically their new XML-based set of document formats, which includes docx, the new and non-backwards-compatible document format for Word.

Now there's a lot of argument about OOXML and ODF (the Open Document Format used by OpenOffice among others) as standards for documents and, by implication, for their long term archiving and accessibility. Let's face it, no one cares very much what format an ephemeral document is in, as no one plans to access it three years down the track.

But since the world's gone digital we do care about electronic documents, as they might include important things like contracts, where you want to be assured that what you are seeing in 2011 is what you filed in 2007.

This is where thinking about PDF gets to be useful. PDF arose in the days of multiple word processing formats to provide a common, non-revisable document format. Basically a PDF file is a munged PostScript page description file, such as a printer would interpret to put marks on paper. It's good and it works. And Adobe did something remarkable. Even though PDF remains a proprietary format, Adobe published the format specification and its subsequent revisions, and were happy for people to write tools to create and display PDF files. Which is why we have xpdf, Preview, Ghostscript/GSview, and a number of other third party tools. Adobe reckoned that this would make the format much more widely adopted and that they could make enough out of selling their PDF creation and manipulation tools. Hell, they even give away the viewer for free.

And revision after revision Adobe have continued to do that. There is also a long term archival version, PDF/A, which is essentially a subset of PDF 1.4: a base version that is universally supported and that all their software should always be able to read.

So we end up with a de facto open standard, and a file format that we know can be read even if Adobe were to disappear. To be frank, we trust Adobe to do the right thing purely because they always have done so in the past. And as a further confidence building measure, Adobe has submitted PDF 1.7 as an ISO document standard, in other words cementing its status as a publicly available and well known format.

ODF is in a similar position. It's well known, open and documented, and consequently you have a high degree of belief that it is possible to write a program to parse and manipulate ODF format files. Furthermore it has been adopted as a file format standard, which provides a degree of assurance both as to what ODF compliant means and as to the long term accessibility of files in that standard.
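To make the "you can write a program to parse it" point concrete, here is a minimal sketch in Python using only the standard library. It relies on two things from the published ODF specification: an ODF text document is a ZIP package containing a content.xml file, and paragraphs in that file are text:p elements. The function names and the tiny sample package are my own hypothetical illustrations, and the sample is not a complete, valid document (it has no manifest, styles or metadata).

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces defined by the ODF specification for content.xml
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"


def odf_paragraphs(source):
    """Return the text of each text:p paragraph in an ODF text document.

    `source` is a path or a file-like object holding the package.
    Hypothetical helper: nested paragraphs (e.g. inside frames) are not
    treated specially.
    """
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read("content.xml"))
    return ["".join(p.itertext()) for p in root.iter(f"{{{TEXT_NS}}}p")]


def make_sample_odt():
    """Build a tiny ODF-like package in memory, purely for the demo.

    Assumption: only mimetype and content.xml are written, which is enough
    for this sketch but not for a real word processor.
    """
    content = (
        f'<office:document-content xmlns:office="{OFFICE_NS}" '
        f'xmlns:text="{TEXT_NS}">'
        "<office:body><office:text>"
        "<text:p>Contract filed in 2007.</text:p>"
        "<text:p>Still <text:span>readable</text:span> in 2011.</text:p>"
        "</office:text></office:body></office:document-content>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        package.writestr("content.xml", content)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for paragraph in odf_paragraphs(make_sample_odt()):
        print(paragraph)
```

Nothing here depends on any vendor's software, which is really the point: given the published specification, a few lines of general purpose code get at the text.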

Then we come to OOXML. OOXML is basically Microsoft's response to ODF. As part of their attempt to maintain their market share they are attempting to turn their document formats into a set of standards. Of course by doing this they have to open up their file format, and it also means that they can't tweak and fiddle with the format the way they have done between various versions of Word, say.

The argument is really about how tight the standard is. Microsoft have a reputation for being assertive in preserving their market share, and in the past have tweaked file formats to give their products an advantage. Microsoft's draft standard has been rejected because of alleged ambiguities in the draft, and because of the unspoken worry that Microsoft will carry on in their own monopolistic, capitalistic ways by exploiting these ambiguities.

If the standard is drafted tightly enough to avoid these loopholes the problem goes away. If Microsoft tweak things then they, like any other software manufacturer in the same circumstances, would be in breach of the standard. It wouldn't matter that they had originally authored it.

Of course Microsoft could silence all the critics by adopting ODF, but realistically that won't happen.

The bottom line is that people don't trust Microsoft on the basis of past performance and some of their more gross abuses of their monopoly position. That's sad, but that's how it is. It also doesn't mean that Microsoft is intrinsically evil. The computer industry is littered with examples of similar abuses of standards and formats; it's just that Microsoft have been around for longer and have been a bit more blatant in their behaviour.

You might notice that I've hardly mentioned XML in this piece. That's because it's irrelevant to the discussion. The concern is about how tightly drafted the standard is and how open it is, not how the file format is encoded. The XML-ness of the format is a red herring, as is shown by the success of the PostScript-like PDF format.
