Tuesday, 21 February 2012

OCR for fun and profit



Someone recently asked me about how easy it would be to provide pdf versions of digitised microfilm copies of documents and also provide an OCR’d version of them for reuse.

This was because the person in question had a lot of material stored on microfilm, and a surprising number of archival projects have used microfilm as an easy and robust way of capturing material.

I knew that what they were asking for was possible, and I did mention using ImageMagick and Tesseract in my answer, but I realised that I actually knew nothing about the process or technology involved.

Just because I’m like that I went off to investigate and build myself a test OCR environment on an old laptop running Ubuntu 11.10.

In pre-digitisation days, microfilming texts that had to be copied in situ was one of the easiest ways of capturing material. The Mormons, for religious reasons of their own, used to go around microfilming parish records of births, marriages and deaths, incidentally building a massive genealogical resource as a result.

When people have microfilm images digitised by a specialist contractor, they find that most professional digitisation houses produce a set of tiff images, one per frame. Any lossless format would do; there is no logical justification for using TIFF over PNG or JPEG2000, other than that TIFF files are usually written with no compression whatsoever. Both of these formats have been used in digitisation projects, as has ordinary JPEG when coupled with a cheap digital camera and a home-made book scanner.

These sets of images as they stand are not particularly human friendly for reading through a text that spreads over several pages. One of the simplest solutions would be to take the files and output them as a pdf.

ImageMagick will happily convert images from tiff (or any other format) to pdf, provided you have Ghostscript installed.

However, it does this by effectively printing the tiff images to a file one after another, and each page it prints is a raster image.
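
As a rough illustration of that step, here is a minimal Python sketch that builds the ImageMagick command line for bundling a set of frames into a single pdf. The function name and the file names are made up for the example, and it assumes the frames have zero-padded names so that sorting by name gives the right page order.

```python
def build_pdf_command(frames, output_pdf):
    """Return the ImageMagick command that bundles the frames into one pdf."""
    if not frames:
        raise ValueError("no frames given")
    # Sort by name so pages follow frame order (assumes zero-padded names).
    ordered = sorted(str(frame) for frame in frames)
    # `convert in1 in2 ... out.pdf` writes one raster page per input image.
    return ["convert", *ordered, str(output_pdf)]


# To actually run it (hypothetical file names):
#   import subprocess
#   subprocess.run(build_pdf_command(["frame_001.tif", "frame_002.tif"], "reel.pdf"), check=True)
```

Keeping the command building separate from running it makes it easy to print the command and check it by eye before letting it loose on a whole reel.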

This works because of the Janus-like nature of a pdf file.

The pdf file format was developed from the PostScript page description language used in printers.

Text was handled as text, that is as words plus formatting instructions saying what font and size of text to use; vector graphics were handled in a special way; and raster (bitmap) images were handled as a compressed binary run-length encoding.

The result is that you can create a pdf that looks like printed text with words and letters in it, but which is simply a graphic representation.

Fine for reading, but no use at all if you want to manipulate the text, for example to search for the most commonly used terms in a body of text, as with the Google Ngram Viewer.
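
To make the point concrete, this is the kind of thing that only becomes possible once you have real text rather than a picture of text. It is a minimal sketch in Python; the function name is made up for the example, and a serious analysis would want better tokenising than this.

```python
import re
from collections import Counter


def most_common_terms(text, n=10):
    """Return the n most frequent words in text as (word, count) pairs."""
    # Lower-case the text and pull out runs of letters,
    # allowing a straight apostrophe inside a word.
    words = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())
    return Counter(words).most_common(n)
```

Feed it the plain text that comes out of an OCR run and you have a crude term frequency list.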

To do this you need to run the text through an OCR engine, of which the two common open source engines are Tesseract and Cuneiform. Both only work reliably with a small subset of (human) languages and have problems with some more exotic fonts, but both do a reasonable job.

To play interactively with OCR, one of the most straightforward applications to use is OCRfeeder, which is effectively an interface that lets you preprocess images with ImageMagick, run them through Tesseract, Cuneiform, or one or two other OCR engines, and dump the result out as an OpenOffice odt format file for subsequent manipulation.

For example, if you wanted to make a reflowable document such as an epub out of this, you need the text, not a graphic representation of the text. The same goes for a whole lot of textual analyses.

One of the virtues of OCRfeeder is that it is scriptable from the command line, meaning that you can experiment first to see which OCR engine works best on some test samples of your material and then script it to run through a set of images.
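
As a sketch of what such a batch run might look like, here is a short Python example. Note that it calls Tesseract directly rather than going through OCRfeeder's command line, whose options are not reproduced here. The function names, directory and file pattern are made up for the example; the only thing taken as given is Tesseract's basic `tesseract image outputbase -l lang` form.

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path, lang="eng"):
    """Build the command that OCRs one image into <image name>.txt."""
    image = Path(image_path)
    # Tesseract takes an output *base* name and appends .txt itself.
    output_base = image.with_suffix("")
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_directory(frame_dir, pattern="*.tif", lang="eng"):
    """Run Tesseract over every matching image in frame_dir, in name order."""
    for image in sorted(Path(frame_dir).glob(pattern)):
        # check=True stops the batch at the first image Tesseract fails on.
        subprocess.run(tesseract_command(image, lang), check=True)
```

Calling `ocr_directory("frames")` would leave a .txt file next to each tiff, which is enough to compare engines or language settings on a handful of test pages before committing to a full run.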

One nice thing that you can also do with OCRfeeder is import an existing text based pdf and extract the text for subsequent manipulation where you don’t have the original source documents. OCRfeeder can also be used as a console for correcting transcription errors.

The National Archives of Australia's Xena application does much the same thing, except that it uses JPedal for pdf file manipulation. JPedal comes in two versions: a cut-down free version, and a fuller version with text extraction capabilities. Depending on your use case you may find that OCRfeeder is a better choice than Xena, or vice versa.

Both will output to OpenOffice format. OpenOffice can of course be driven from the command line or a script to take the file and export it into whatever format you require for subsequent processing, such as plain text.
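
If plain text is all you are after, an alternative to driving OpenOffice is to read the odt file directly, since an odt is just a zip archive with the document body in content.xml. The following is a minimal Python sketch of that different approach, not of the OpenOffice route itself. The function name is made up, and it deliberately ignores refinements such as tabs, repeated spaces and text nested inside frames.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the OpenDocument text elements (text:p, text:h and so on).
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def odt_to_text(odt_path):
    """Return the headings and paragraphs of an .odt file as plain text."""
    with zipfile.ZipFile(odt_path) as odt:
        root = ET.fromstring(odt.read("content.xml"))
    wanted = (f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h")
    lines = []
    for element in root.iter():
        if element.tag in wanted:
            # itertext() also collects text inside nested spans.
            lines.append("".join(element.itertext()))
    return "\n".join(lines)
```

This avoids needing an office suite on the machine doing the processing, at the cost of losing any layout information.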

OCRfeeder, ImageMagick, Tesseract, Cuneiform, Ghostscript and the rest are all available as packages for Ubuntu, making building a test install a snap. You can be up and experimenting within 30 minutes of building an Ubuntu machine or vm. Something which would have been difficult or expensive to accomplish a few years ago can now be done sensibly with a scanner, some free software and a second hand pc, substantially increasing access to digitised resources and allowing institutions in financially disadvantaged parts of the world to mount their own digitisation exercise.
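
Once the packages are installed, a quick way to confirm that everything is actually on the path is a few lines of Python. This is only a sketch: the function name is made up, and the executable names in the list are assumptions about what the Ubuntu packages install, so adjust them if your system differs.

```python
import shutil

# Assumed executable names: ImageMagick's convert, Ghostscript's gs,
# plus tesseract, cuneiform and ocrfeeder.
TOOLS = ["convert", "gs", "tesseract", "cuneiform", "ocrfeeder"]


def find_tools(names=TOOLS):
    """Map each executable name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}
```

Anything that comes back as None is either not installed or installed under a different name.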

Nothing I have done off and on over the past couple of days is very new or very difficult. All I have done is show how easy it is. Easy enough for a local historical society or individual researcher to get going. Couple this with a basic digital camera built into a home made book scanner and you have an incredibly valuable tool.

There are, however, a couple of downsides to this. Most OCR tools have some knowledge of the language they’re scanning, in part to deal sensibly with diacritics and suchlike. This means that they might not cope so well with languages that use different diacritics or additional characters.

For example, you might well be able to digitise Swahili text using an English language filter, given that Swahili does not use diacritics, but not a language such as Yolngu, which includes some extra characters. Yolngu writes the ng sound with a special letter, eng (Ŋ), which is Unicode character U+014A.
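
One way to get a feel for whether a given language is likely to be a problem is to take a sample of correctly typed text and list the letters in it that fall outside plain ASCII. A minimal Python sketch follows; the function name is made up for the example.

```python
import unicodedata


def non_ascii_letters(text):
    """Return {character: Unicode name} for every non-ASCII letter in text."""
    return {
        ch: unicodedata.name(ch, "UNKNOWN")
        for ch in sorted(set(text))
        if ch.isalpha() and ord(ch) > 127
    }
```

An empty result suggests that an English language model has at least a chance of coping; anything listed is a character the OCR engine needs to know about.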

Cuneiform is limited to a subset of languages using Latin scripts and the more common Cyrillic variants. It does not have support for some of the made-up Cyrillic-like scripts invented by the Soviets in the 1930s for writing some of the Siberian languages, Mongol and some of the Turkic languages in that area. Tesseract supports a wider range of scripts, including Korean (Hangul), Chinese and Japanese, as well as Thai, opening up the prospect of being able to digitise material from East Asia in a reasonably straightforward manner.
