Someone recently asked me how easy it would be to provide PDF
versions of digitised microfilm copies of documents, and also to
provide an OCR’d version of them for reuse.
This was because the person in question had a lot of material
stored on microfilm and a surprising number of archival projects used
microfilm as an easy and robust way of capturing material.
I knew what they were asking for was possible, and in fact I mentioned
using ImageMagick and Tesseract in my answer, but I realised that
I actually knew nothing about the process or technology involved.
Just because I’m like that I went off to investigate and build
myself a test OCR environment on an old laptop running Ubuntu 11.10.
In pre-digitisation days, microfilming texts that had to be copied in
situ was one of the easiest ways of capturing material. The Mormons,
for religious reasons of their own, used to go around
microfilming parish records of births, marriages and deaths,
incidentally building a massive genealogical resource as a result.
When people have microfilm images digitised by a specialist
contractor they find that most professional digitisation houses
produce a set of TIFF images, one per frame. Any lossless format
would do; there is no logical justification for using TIFF over PNG
or JPEG2000, other than that TIFF can store images with no
compression whatsoever. Both of these formats have been used in
digitisation projects, as has ordinary JPEG when coupled with a cheap
digital camera and a home-made book scanner.
These sets of images as they stand are not particularly
human-friendly for reading through a text that spreads over several
pages. One of the simplest solutions would be to take the files and
output them as a PDF.
ImageMagick will happily convert images from TIFF (or any other
format) to PDF, providing you have Ghostscript installed.
However, it does this by effectively printing the TIFF images to a
file one after another, and each page is printed as a raster image.
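As a rough sketch of how that conversion might be scripted, the snippet below builds and runs an ImageMagick command over a directory of frames. The function names, the `*.tif` pattern and the zero-padded file naming are my own assumptions for illustration, and `convert` is the ImageMagick 6 command name (ImageMagick 7 uses `magick`).

```python
import subprocess
from pathlib import Path


def build_convert_command(frame_dir, output_pdf, pattern="*.tif"):
    """Return the ImageMagick command that bundles every frame into one PDF."""
    # Sort by file name so pages appear in frame order
    # (assumes zero-padded names such as frame_0001.tif).
    frames = sorted(str(p) for p in Path(frame_dir).glob(pattern))
    if not frames:
        raise ValueError(f"no files matching {pattern} in {frame_dir}")
    # "convert" is the ImageMagick 6 command; ImageMagick 7 calls it "magick".
    return ["convert", *frames, str(output_pdf)]


def images_to_pdf(frame_dir, output_pdf):
    """Run the conversion; raises CalledProcessError if ImageMagick fails."""
    subprocess.run(build_convert_command(frame_dir, output_pdf), check=True)
```

Keeping the command builder separate from the call that runs it makes it easy to print the command and check it before letting it loose on a few hundred frames.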
This works because of the Janus-like nature of a PDF file.
Text is handled as text, that is, as words plus formatting
instructions as to what font and size of text to use; vector graphics
are handled in a special way; and raster (bitmap) images are handled
as compressed binary data, for example with run-length encoding.
The result is that you can create a PDF that looks like printed text
with words and letters in it, but which is simply a graphic
representation.
Fine for reading, but no use at all if you want to manipulate the
text, for example to search for the most commonly used terms in a
body of text, as with the Google Ngram Viewer.
To do this you need to run the text through an OCR engine, of which
the two common open source engines are Tesseract and Cuneiform.
Both only work reliably with a small subset of (human) languages and
have problems with some more exotic fonts but both do a reasonable
job.
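For the curious, here is a minimal sketch of calling Tesseract from a script, assuming the `tesseract` binary is on the path and the relevant language data is installed. The function names are hypothetical; the command form used is `tesseract <image> <output base> -l <language>`.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, lang="eng"):
    """Return the Tesseract command that OCRs one image to a .txt file."""
    image = Path(image_path)
    # Tesseract takes an output *base* name and appends ".txt" itself,
    # so frame_0001.tif produces frame_0001.txt alongside it.
    output_base = image.with_suffix("")
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run Tesseract on one image and return the recognised text."""
    subprocess.run(build_tesseract_command(image_path, lang), check=True)
    return Path(image_path).with_suffix(".txt").read_text(encoding="utf-8")
```

The `lang` argument is where the language limitation bites: it has to name a language for which Tesseract has trained data.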
To play interactively with OCR, one of the most straightforward
applications to use is OCRfeeder, which is effectively an interface
that allows you to process images with ImageMagick, run them through
Tesseract, Cuneiform, or one or two other OCR engines, and dump the
result out as an OpenOffice ODT file for subsequent manipulation.
For example, if you wanted to make a reflowable document, such as an
EPUB, out of this you need the text, not a graphic representation of
the text. The same goes for a whole lot of textual analyses.
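As a small illustration of the kind of textual analysis that needs real text rather than a picture of it, here is a sketch that counts the most common terms in a string. The function name and the simple word-splitting rule are my own assumptions; a serious analysis would want proper tokenisation and a stop-word list.

```python
import re
from collections import Counter


def most_common_terms(text, n=10):
    """Return the n most frequent words in text as (word, count) pairs."""
    # \w+ matches runs of letters, digits and underscores, including
    # non-ASCII letters; lower-casing merges "The" and "the".
    words = re.findall(r"\w+", text.lower())
    return Counter(words).most_common(n)
```

Feed it the output of an OCR run and you get a crude frequency table; feed it a raster-only PDF and you get nothing, because there are no words in there to count.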
One of the virtues of OCRfeeder is that it is scriptable from the
command line, meaning that you can experiment first to see which OCR
engine works best on some test samples of your material and then
script it to run through a set of images.
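One way of putting a number on "works best" is to type up a sample page by hand and measure how close each engine's output comes to it. The sketch below does this with Python's standard `difflib`; it is my own illustration rather than anything OCRfeeder provides, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def ocr_accuracy(ocr_text, reference_text):
    """Return a similarity ratio between 0.0 and 1.0 (1.0 = identical)."""
    # Collapse runs of whitespace so differing line breaks between
    # engines are not counted as recognition errors.
    ocr_norm = " ".join(ocr_text.split())
    ref_norm = " ".join(reference_text.split())
    return SequenceMatcher(None, ocr_norm, ref_norm).ratio()


def best_engine(outputs, reference_text):
    """Pick the engine whose output is closest to the reference.

    outputs maps an engine name to the text it produced for the sample page.
    """
    return max(outputs, key=lambda name: ocr_accuracy(outputs[name], reference_text))
```

The ratio is only a rough similarity score, not a proper character error rate, but it is good enough to rank two engines on a handful of sample pages.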
One nice thing that you can also do with OCRfeeder is import an
existing text-based PDF and extract the text for subsequent
manipulation where you don’t have the original source
documents. OCRfeeder can also be used as a console for correcting
transcription errors.
The National Archives Xena application does much the same thing,
except that it uses JPedal for PDF file manipulation. JPedal comes in
two versions: a cut-down free version, and a fuller version
with text extraction capabilities. Depending on your use case you may
find that OCRfeeder is a better choice than Xena, or vice versa.
Both will output to OpenOffice format. OpenOffice can of course be
driven from the command line or a script to take the file and export
it into whatever format you require for subsequent processing, such
as plain text.
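If all you want is the plain text, you do not strictly need to drive OpenOffice at all. An ODT file is a zip archive whose body lives in `content.xml`, so a short script can pull the paragraphs out directly. This is a swapped-in alternative to the OpenOffice route, not a description of it; the function names are hypothetical, and the sketch ignores tabs, explicit line breaks and all formatting.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used for text elements in OpenDocument files.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
PARAGRAPH_TAGS = (f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h")


def _collect(element, lines, inside_paragraph=False):
    """Append the text of each outermost paragraph or heading to lines."""
    is_paragraph = element.tag in PARAGRAPH_TAGS
    if is_paragraph and not inside_paragraph:
        # itertext() gathers text from nested spans and text boxes too,
        # so nested paragraphs are not emitted a second time.
        lines.append("".join(element.itertext()))
    for child in element:
        _collect(child, lines, inside_paragraph or is_paragraph)


def odt_to_text(odt_path):
    """Return the paragraphs and headings of an .odt file as plain text."""
    # An .odt file is a zip archive; the document body is in content.xml.
    with zipfile.ZipFile(odt_path) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    lines = []
    _collect(root, lines)
    return "\n".join(lines)
```

Each paragraph or heading becomes one line of output, which is usually all that subsequent text processing needs.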
OCRfeeder, ImageMagick, Tesseract, Cuneiform, Ghostscript and the rest
are all available as packages for Ubuntu, making building a test
install a snap. You can be up and experimenting within 30 minutes of
building an Ubuntu machine or VM. Something which would
have been difficult or expensive to accomplish a few years ago can
now be done sensibly with a scanner, some free software and a
second-hand PC, substantially increasing access to digitised
resources and allowing institutions in financially disadvantaged
parts of the world to mount their own digitisation exercises.
Nothing I have done off and on over the past couple of days is very
new or very difficult. All I have done is show how easy it is. Easy
enough for a local historical society or individual researcher to get
going. Couple this with a basic digital camera built into a home-made
book scanner and you have an incredibly valuable tool.
There are, however, a couple of downsides to this. Most OCR tools have
some knowledge of the language they’re scanning, in part to
deal sensibly with diacritics and suchlike. This means that they
might not cope so well with languages that use different diacritics
or additional characters.
For example, you might well be able to digitise Swahili text using an
English language filter, given that Swahili does not use diacritics,
but not text in a language such as Yolngu, which includes some extra
characters.
For example, Yolngu uses a special letter, eng, for the ng sound,
represented as Unicode character U+014A (Ŋ, with lower-case ŋ at
U+014B).
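A quick way to find out whether your material is likely to trip up an English language filter is to scan a hand-typed sample for letters outside the plain a to z alphabet. The sketch below is my own illustration, with a hypothetical function name and a default alphabet that you would adjust for the filter you intend to use.

```python
import unicodedata


def unexpected_characters(text, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Return the letters in text that fall outside the given alphabet.

    The result maps each such character to its Unicode name.
    """
    found = {}
    for char in text:
        # Only letters are checked; digits, spaces and punctuation pass.
        if char.isalpha() and char.lower() not in alphabet:
            found[char] = unicodedata.name(char, "UNKNOWN")
    return found
```

An empty result suggests the English filter has a fighting chance; anything else is a list of characters the OCR engine will need to know about.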
Cuneiform is limited to a subset of languages using Latin-based
scripts and the more common Cyrillic variants. It does not have
support for some of the made-up Cyrillic-like scripts invented by the
Soviets in the 1930s for writing some of the Siberian languages,
Mongol and some of the Turkic languages in that area. Tesseract
supports a wider range of scripts including Korean (Hangul), Chinese
and Japanese, as well as Thai, opening up the prospect of being able
to digitise material from East Asia in a reasonably straightforward
manner.