Andrew Jackson of the British Library has recently published a study of the use of particular file types over time, focusing on versions of PDF, image and HTML files, in an attempt to determine whether being widely distributed and in use is a guard against obsolescence.
It's a valuable and interesting chunk of work. However, it's possible to pick a couple of holes in the study:
- it doesn't address the problem of different document formats, e.g. the variation in doc and ppt formats, as recently exemplified by Chris Rusbridge's attempt to recover some PowerPoint 4 format files
- it doesn't explore the problem of legacy formats (my favourite examples are Claris Works and AmiPro files), nor that of legacy formats without a MIME type, such as the data formats used by specialist dataloggers
However, what it does show is that once a format is in common use it is protected against obsolescence. The real problem is with formats from the days before storing documents on the web became the default for many people, when the conventions were not yet fully established.
For example, I recently needed to check some documentation about a legacy file format. The manufacturer had put the documentation on the web as TeX files. While perfectly readable, this did entail installing OzTeX to read the downloaded file.
Andrew's study also did not address the problem of legacy media formats such as Exabyte tapes and the like. To be fair, he explicitly looked only at the UK web corpus, which by definition is online, so he was concerned only with file formats, not media formats.
It would be interesting to run a similar study over the filestore of a medium to large university and see how great a diversity of file types there is, as well as rerunning the study to look at document formats ...
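
As a very rough idea of what a first pass over a filestore might look like, here is a minimal Python sketch. It is not the method used in Andrew's study: it simply walks a directory tree and guesses a MIME type from each file's extension using the standard library. The function name survey_file_types is my own invention, and guessing from extensions alone is a big simplifying assumption, since a real survey would need to inspect file contents to tell format versions apart.

```python
import mimetypes
import os
from collections import Counter


def survey_file_types(root):
    """Count files under root by MIME type guessed from the file extension.

    Hypothetical helper for illustration only: it never opens the files,
    so it cannot distinguish between versions of the same format.
    """
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            mime, _encoding = mimetypes.guess_type(name)
            # Files whose extension maps to no known MIME type are grouped
            # together; these are the likely legacy or specialist formats.
            counts[mime or "unknown"] += 1
    return counts
```

Calling survey_file_types on the top of a filestore and then most_common() on the result would give a ranked list of types, with the size of the "unknown" bucket hinting at how much material has no registered MIME type at all.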