Friday, 12 September 2008

What transcription mistakes in manuscripts might tell us

Words change over time. That's how language changes and diverges. Sometimes the change is rapid and sometimes it's slow. That's why we can more or less follow Shakespeare but not Chaucer, and it's doubtful whether Shakespeare would have had any less difficulty understanding Chaucer than we have.

Equally, language changes over geography as well as time - the English spoken in Kingston, Jamaica is very different from that spoken in Kingston in the ACT or Kingston upon Thames in London, although the latter two are not very different from each other, for a whole lot of reasons such as consistent recent two-way migration, a greater degree of education and so on.

By looking at language changes over time we can see how language changes and show that it evolves, with less common words changing their forms more quickly than more common words, as people are more likely to make mistakes with the rare ones than the common ones.

Anecdotally, you can observe this in Australia, where the English spoken, while almost the same as that in the south of England, is a simpler version, probably because of the need to absorb migrants from non-English-speaking backgrounds, whose command of the language may be a little shaky.

And then I got to wondering. Projects such as the Canterbury Tales project transcribe old manuscripts and collate the differences in an attempt to build a consensus about what Chaucer originally wrote. But these manuscripts also tell us something about how people spoke, because the transcription 'mistakes' the scribes made were often unconscious corrections to usage.

They are in fact a frozen record of language change. Of course it's more complicated than that: we need to know the provenance of the manuscripts to work out which are temporal corrections, reflecting changes in language over time, and which are dialectal corrections, reflecting geographic distance. And we need a big corpus.
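
To make the idea of collating differences a little more concrete, here is a minimal sketch in Python using the standard library's difflib. It is not how the Canterbury Tales project or any real collation tool works, just an illustration of lining up two transcriptions word by word and listing where they disagree. The function name collate is my own, and the second spelling of the line is an invented variant, not a real manuscript reading.

```python
from difflib import SequenceMatcher


def collate(base, witness):
    """Return word-level variants as (base_words, witness_words) pairs."""
    a, b = base.split(), witness.split()
    # autojunk=False keeps difflib from discarding frequent words in long texts
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    variants = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Record every stretch where the two transcriptions differ
            variants.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return variants


# The second line is a made-up variant spelling, for illustration only
base = "whan that aprill with his shoures soote"
witness = "when that april with his showres sote"

for old, new in collate(base, witness):
    print(f"{old!r} -> {new!r}")
```

Each pair that comes out is a candidate 'mistake'. Deciding whether it is a temporal or a dialectal correction would still need the provenance of each manuscript, which this sketch knows nothing about.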

So how do we get a big corpus of text? Typically these texts have been transcribed by hand, but advances in character recognition algorithms and the impact of cheap technology, including cheap digitization technology, should give us a large corpus to subject to genetic analysis.

This could be very interesting (in a geeky sort of way) ...

dgm said...

See the Middle English Grammar Project for a description of a real project using a similar methodology.