Tuesday 19 August 2008

reading old documents

Reading old documents can be difficult - incomplete text, blobby type and all the rest. There have been fairly successful automated attempts, about which I've previously blogged elsewhere:

Searching Manuscripts Electronically posted Mon, 13 Feb 2006 09:29:51 -0800

Digitisation of historical records is fine, but all you end up with from digitisation projects for historical documents is a series of high resolution images, which may be easier to work with and increase access but do nothing for search. Printed documents are more or less OK for search. Scanned and OCR'd versions of printed books, even very old printed books such as those from Renaissance Italy, are fine, even if you do sometimes need to 'teach' the OCR software how to deal with a non-standard font. Manuscripts, however, have up to now been a no-go.

The only way to make an electronic text was to type it in by hand and mark it up using an encoding scheme such as those developed by the Text Encoding Initiative Consortium (TEI-C).

Laborious.  

Now comes news of a really clever idea. Allan Smeaton's research group has been looking at shape recognition software - basically software to recognise objects such as cars and planes in photographs as cars or planes. This is trickier than it seems, as you need to be able to find an object and recognise it from any angle, something humans can do easily but machines find hard. For example, from the window of my office I can see a car park containing a mixture of sedans, hatchbacks and SUVs, all by different manufacturers and all different colours, but I can still recognise them all as cars. The clever thing about Smeaton's software is that it can look at an image and twist it to match a category, so it can tell that a Peugeot hatchback, a big Ford sedan and a Subaru Forester are all cars.

On a whim Smeaton fed digitised images of George Washington's letters into his software, and it recognised an 'A' as an 'A', a 'B' as a 'B' and so on - all of which was pretty impressive, because while Washington was taught to write in an age when legibility was prized, handwriting being the only real means of communication other than face to face discourse, like all of us his handwriting got sloppier (and more variable) as he got older and busier.

Smeaton has also tried this on digitised medieval manuscripts. These were actually easier to handle, as the monks were going for legibility, and hence repeatability.

Smeaton has now obtained funding from Google, among others, to develop this as a search tool for digitised manuscripts - essentially a sort of plastic OCR that copes with variation. Copperplate and other highly repeatable handwriting - and I would guess not just in Latin script - appears to be in reach, but I would guess that highly variable scribbly script, such as in diaries, especially from the C20 and C21, is not. This is because in the last hundred years or so handwritten documents were usually for personal consumption only, with most other documents being typescript or latterly computer printed, and hence handwriting has been subject to greater variation (aka scribble). This may mean that the TEI-C folks are still in business, either doing difficult cases by hand or correcting errors in shape-recognised texts.
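The core trick - spotting a glyph shape in a page image despite variation - can be illustrated with something as simple as normalised cross-correlation template matching. To be clear, this is not Smeaton's actual method, just a minimal sketch of the underlying idea; the page, glyph and threshold below are all made up:

```python
import numpy as np

def normalized_xcorr(image, template):
    """Slide the template over the image and return a map of
    normalised cross-correlation scores in the range [-1, 1]."""
    th, tw = template.shape
    ih, iw = image.shape
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum())
    scores = np.full((ih - th + 1, iw - tw + 1), -1.0)
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            patch = image[y:y + th, x:x + tw]
            p = patch - patch.mean()
            p_norm = np.sqrt((p ** 2).sum())
            if p_norm > 0 and t_norm > 0:
                scores[y, x] = (p * t).sum() / (p_norm * t_norm)
    return scores

def find_letter(image, template, threshold=0.8):
    """Return (row, col) positions where the glyph matches well enough."""
    scores = normalized_xcorr(image, template)
    return list(zip(*np.where(scores >= threshold)))

# Hypothetical example: a blank 'page' with one glyph pasted into it.
page = np.zeros((10, 10))
glyph = np.array([[1.0, 0.0],
                  [1.0, 1.0]])
page[3:5, 4:6] = glyph
print(find_letter(page, glyph, threshold=0.99))
```

A real system would of course also need to normalise for slant, scale and ink weight before matching - that is the "twisting to match a category" part.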

[Years ago, I came across another object recognition project whose aim was to write a naked people detector. Apart from the algorithm's use as an engine for censorware for the prurient, it's a genuinely hard problem, given that people come in all shapes and sizes, are photographed from all sorts of angles, and come in two sexes, both of which have nipples - meaning you can't simply cheat by guessing that something is a torso and then deciding that if it's got nipples (one or two round red-brown circles two thirds of the way up) it's naked and not for public consumption.]

Now there's an interesting alternative method - use the human eyeball, by extracting text from old, difficult-to-process documents and then using the extracted words as captchas.
Now, if you put these two approaches together, would it work for reading ancient manuscripts or cuneiform tablets, and could you make the system self-learning?
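That combination is roughly what a captcha-driven transcription system would need: pair each unreadable word image with a control word whose text is already known, only trust answers from users who get the control word right, and once enough independent readings agree, promote the new word to "known" - which is the self-learning step. A minimal sketch, with entirely hypothetical image IDs, words and thresholds:

```python
from collections import Counter

AGREEMENT_NEEDED = 3  # independent matching guesses before we accept a word

class TranscriptionPool:
    def __init__(self):
        self.known = {}    # image_id -> accepted text (usable as control words)
        self.guesses = {}  # image_id -> list of human guesses so far

    def submit(self, known_id, known_answer, unknown_id, unknown_answer):
        """Record a user's answers for one (control word, unknown word) pair.
        Returns False if they failed the control word."""
        if self.known.get(known_id) != known_answer.lower():
            return False  # failed the control word; discard both answers
        votes = self.guesses.setdefault(unknown_id, [])
        votes.append(unknown_answer.lower())
        text, count = Counter(votes).most_common(1)[0]
        if count >= AGREEMENT_NEEDED:
            self.known[unknown_id] = text  # promoted: now a control word itself
            del self.guesses[unknown_id]
        return True

# Hypothetical usage: three users agree on the unreadable word.
pool = TranscriptionPool()
pool.known["img1"] = "liberty"
for _ in range(3):
    pool.submit("img1", "liberty", "img2", "Washington")
print(pool.known.get("img2"))
```

Closing the loop on the original question would then mean feeding the accepted transcriptions back in as training data for the shape recogniser.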
