Friday, 22 June 2007

DocX - the nightmare continues ...

Well, whatever we feel about docx, it isn't going away, especially now that Microsoft have End-of-Life'd Office 2003 in a move to boost the uptake of Office 2007. That means we need to be pragmatic and come up with a workable solution, which in the case of the Mac seems to be NeoOffice. Microsoft's own import filter for the Mac just barfs on my machine, but NeoOffice imports neatly, graphics and all, and lets you export the document in various useful formats.

Making 2003 EoL is of course also going to be a nightmare for multi-platform sites, as docX will start to spread through their Windows fleet in a near-viral manner, causing mayhem for the non-Windows installed base. Sites with large numbers of Windows machines will experience a similar problem, because upgrading everyone at once is a serious financial hit, so old and new versions of Office will end up running side by side. OpenOffice as a corporate office suite? It reads all your legacy documents just fine. The only problem is that OpenOffice isn't really integrated into Aqua on the Mac, even though they're working on this.

So for now NeoOffice is your friend if you have Macs on site. It does what OpenOffice does and handles docX to boot. One quirk, though, is that Safari recognises that docx, like ODF and the good old OpenOffice format, is a zip-based format and helpfully unpacks the docx bundle for you. To work round this little problem I resorted to downloading the offending file using Parallels and dragging it from the Windows desktop to the Mac desktop. Surely there's got to be something a tad more elegant ...
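
If Safari has already unpacked the bundle, one possibly tidier workaround is simply to zip the folder back up. What follows is a minimal sketch in Python, not a tested recipe for any particular version of Word or NeoOffice. It assumes the unpacked folder still has the usual docx layout with [Content_Types].xml at the top level; the function name rezip_docx and the example paths are made up for illustration.

```python
import zipfile
from pathlib import Path


def rezip_docx(unpacked_dir, output_path):
    """Zip an unpacked docx folder back into a single .docx file.

    Hypothetical helper: assumes unpacked_dir is the folder the browser
    produced, with [Content_Types].xml sitting at its top level.
    """
    root = Path(unpacked_dir)
    output = Path(output_path)

    # Every docx package carries this part at the root of the archive.
    if not (root / "[Content_Types].xml").is_file():
        raise ValueError(f"{root} does not look like an unpacked docx")

    # Collect the files before creating the archive, and skip Finder
    # litter plus the output file itself in case it lives in the folder.
    files = sorted(
        p for p in root.rglob("*")
        if p.is_file()
        and p.name != ".DS_Store"
        and p.resolve() != output.resolve()
    )

    with zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in files:
            # Archive names are relative to the folder, with forward slashes.
            zf.write(path, path.relative_to(root).as_posix())
    return output
```

Something like rezip_docx("paper", "paper.docx") should then give you a file that NeoOffice can be pointed at, assuming the unpacking step didn't rename or drop anything along the way.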

But this isn't a complete solution to the problem of submission to scientific journals I blogged about earlier. At least it lets you edit .docx documents. Still, it doesn't really handle the underlying problem, which is that the equation editor in Office 2007 doesn't use MathML or a compatible format.
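
If you want to know up front whether a submitted .docx is going to cause equation trouble, you can look inside the zip for Office 2007's own math markup. This is a rough sketch only: it assumes the main text lives in word/document.xml (the usual place, though strictly the package can point elsewhere) and that equations are stored as oMath elements in the Office math namespace. The function name count_omml_equations is my own invention.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace Office 2007 uses for its native equation markup (not MathML).
OMML_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"


def count_omml_equations(docx_path):
    """Count Office-native equations in the main body of a .docx file.

    Hypothetical helper: only inspects word/document.xml, so equations in
    headers, footnotes and the like are not counted.
    """
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    # Each equation is wrapped in an oMath element in the math namespace.
    return sum(1 for _ in root.iter(f"{{{OMML_NS}}}oMath"))
```

A non-zero count is a hint that the document contains equations which a MathML or TeX based workflow won't understand without a conversion step.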

And this is important, as typesetting equations is hard, and computerised typesetters can have their own quirks. One of the reasons AmiPro was so popular with mathematical scientists when it appeared was that it had an equation editor that produced TeX code and yet was a proper onscreen word processor. Equally, the only reason TeX has hung on is that typesetters understand it and what you put in is what you get out.

Now writing a program to parse markup isn't that hard (OK it is, but it's doable), which means you can convert TeX, MathML or any markup-based document to something a typesetting machine understands (SGML or whatever - one of the weirdest sights I ever saw was a commercial printer that had a floor full of people in cubes editing raw SGML in vi on Macs to feed it into a typesetter and fix any conversion problems).

The other key point is that if the document is in a known, or well understood, format you can always convert it to something else. docX isn't: the specification is owned by Microsoft and they can tweak it to fix problems, which means that you get creep, which is a document conversion engineer's nightmare. ODF and the other open formats have specification documents which you can refer to. Adobe have published detailed specification documents for PDF to allow you to write your own PDF export utilities, meaning the format is open in the sense that the knowledge of how to parse it is publicly available.

All good. As the world's gone digital, lots of documents, research findings, whatever only exist in electronic form. Yes, people may have printed copies scattered round their offices, in the same way they used to have offprints, but there are no catalogued, findable, non-digital copies.

If the document is in a format that can't be read it might as well be dead, or written in Linear B, Maya, Tocharian, or something equally obscure. If the format's known we can always access the knowledge. And fundamentally that's why docX is a problem. It doesn't follow open, described standards, so there's no guarantee of future access, or that a docX document created with Word 2007 will look exactly the same when we open it in 2012 with Word 2011.

[Addendum: In this discussion I'm ignoring the very similar problems caused by Excel's new xlsx format in Office 2007, but that doesn't mean they're not out there]
