Tuesday, 30 September 2008

Putting some medieval digitisation strands together

Over the past few weeks I've posted various links and updates about digitising medieval manuscripts, character recognition, and then using the resulting material to build up a corpus for textual analysis.

Now with the Stavanger Medieval English Grammar project we see how such a solution would work. Crucially, we need to go back and digitise the original sources: later editors 'smoothed' the text in places and regularised transcriptions, which means that sources like Project Gutenberg simply don't work. The sources are not actually transcriptions of a single document either. Medieval books are more like open source projects, with the same basic text but some bits added in or taken out. Think Ubuntu, Kubuntu, Xubuntu: all basically the same, but with different utilities and desktop environments. So we have to identify common passages for analysis. Not intellectually difficult, but it does take longer.
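To give a rough idea of what "identifying common passages" could look like in practice, here is a minimal sketch using Python's standard difflib module. It is purely illustrative and not anything the Stavanger project is known to use. The function name common_passages, the min_words threshold and the two sample "witnesses" are all invented for the example. It also assumes the spelling has already been normalised, which is a big assumption for medieval texts, where the same word can be spelt several ways.

```python
from difflib import SequenceMatcher


def common_passages(text_a: str, text_b: str, min_words: int = 4) -> list[str]:
    """Return runs of at least min_words consecutive words shared by both texts.

    Hypothetical helper for illustration only: it compares lower-cased,
    whitespace-separated words exactly, so variant spellings will not match.
    """
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    # autojunk=False stops difflib from ignoring very frequent words
    # (such as "the") in longer texts.
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    return [
        " ".join(words_a[block.a:block.a + block.size])
        for block in matcher.get_matching_blocks()
        if block.size >= min_words
    ]


# Two invented "witnesses" of the same text, one with a bit taken out
# and one with a bit added in.
witness_1 = "here begins the tale of the knight and his lady as it was told in the north"
witness_2 = "here begins the tale of the knight as it was told to me in the north"

passages = common_passages(witness_1, witness_2)
for passage in passages:
    print(passage)
```

Short shared runs below the threshold are dropped, so min_words trades noise against coverage. With more than two manuscripts you would have to compare them pairwise or pick one as a base text, which is where the "it does take longer" part comes in.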

The other source we have is legal texts, such as the records of the Scottish Parliament, where transcriptions are likely to be more accurate, if rather less interesting to read. Of course accuracy is not necessarily a help here: it's the mistakes that are interesting, not the fidelity of the copy. But as these records contain a lot of stock bits of boilerplate, we can probably see the evolution of grammatical changes in how the same formulae are worded over time.

The other, unanswered question is how good automatic recognition of medieval handwriting is. Clerks, who produced manuscripts as an act of devotion, tended to have neat handwriting. Commonplace books and legal records less so, sometimes quite a lot less so ...

dgm

URL for newspaper report on the John Rylands project: