Tuesday 21 February 2012

OCR for fun and profit



Someone recently asked me about how easy it would be to provide pdf versions of digitised microfilm copies of documents and also provide an OCR’d version of them for reuse.

This was because the person in question had a lot of material stored on microfilm and a surprising number of archival projects used microfilm as an easy and robust way of capturing material.

I knew what they were asking for was possible, and I did mention using ImageMagick and Tesseract in my answer, but I realised that I actually knew nothing about the process or technology involved.

Just because I’m like that I went off to investigate and build myself a test OCR environment on an old laptop running Ubuntu 11.10.

In pre-digitisation days, microfilming texts that had to be copied in situ was one of the easiest ways of capturing material. The Mormons, for religious reasons of their own, used to go around microfilming parish records of births, marriages and deaths, incidentally building a massive genealogical resource as a result.

When people have microfilm images digitised by a specialist contractor they find that most professional digitisation houses produce a set of TIFF images, one per frame. Any lossless format would do; there is no logical justification for using TIFF over PNG or JPEG2000, other than that TIFF files are usually written without any compression whatsoever. Both of these formats have been used in digitisation projects, as has ordinary JPEG when coupled with a cheap digital camera and a home-made book scanner.

These sets of images as they stand are not particularly human friendly for reading through a text that spreads over several pages. One of the simplest solutions would be to take the files and output them as a pdf.

ImageMagick will happily convert images from tiff (or any other format) to pdf, provided you have Ghostscript installed.

However it does this by effectively printing the tiff images to a file one after another, and when it prints it prints the image as a raster image.
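As a rough sketch of that step, assuming ImageMagick's convert command is on the path and the frames are named so that they sort into page order (the directory and file names are hypothetical, as are the function names), the command can be assembled and run from Python like this:

```python
import subprocess
from pathlib import Path


def build_pdf_command(frame_dir, output_pdf):
    """Return the ImageMagick command that joins every TIFF frame into one pdf."""
    # Sort by name so pages come out in frame order (assumes zero-padded names).
    frames = sorted(str(p) for p in Path(frame_dir).glob("*.tif"))
    if not frames:
        raise ValueError(f"no .tif frames found in {frame_dir}")
    # 'convert in1.tif in2.tif ... out.pdf' writes one raster page per input image.
    return ["convert", *frames, str(output_pdf)]


def frames_to_pdf(frame_dir, output_pdf):
    """Run the command; raises CalledProcessError if convert fails."""
    subprocess.run(build_pdf_command(frame_dir, output_pdf), check=True)
```

Calling frames_to_pdf("frames", "document.pdf") would then produce a single pdf with one page per frame. The pages are still pictures of text, not text.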

This works because of the Janus-like nature of a pdf file.

The pdf file format was developed from the PostScript page description language used in printers.

Text was handled as text, that is as words plus formatting instructions as to what font and size to use; vector graphics were handled in a special way; and raster (bitmap) images were handled as a compressed binary run-length encoding.

The result is that you can create a pdf that looks like printed text with words and letters in it, but which is simply a graphic representation.

Fine for reading, but no use at all if you want to manipulate text, for example to search for the most commonly used terms in a body of text as with the Google Ngram viewer.
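To make the point concrete, here is a minimal sketch of the kind of analysis that only becomes possible once you have real text. It is nothing like as sophisticated as the Ngram viewer; the function name and the rule for what counts as a word are my own assumptions.

```python
import re
from collections import Counter


def most_common_terms(text, n=10):
    """Return the n most frequent words in text as (word, count) pairs."""
    # Lower-case the text and keep runs of letters only, so digits and
    # punctuation are ignored. The pattern also accepts accented letters.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words).most_common(n)
```

Fed the plain text that comes out of an OCR run, this gives a quick frequency table. Fed a pdf full of raster images, it has nothing to count.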

To do this you need to run the text through an OCR engine, of which the two common open source engines are Tesseract and Cuneiform. Both only work reliably with a small subset of (human) languages and have problems with some more exotic fonts, but both do a reasonable job.

To play interactively with OCR, one of the most straightforward applications to use is OCRfeeder, which is effectively an interface that lets you preprocess images with ImageMagick, run them through Tesseract, Cuneiform, or one or two other OCR engines, and dump the result out as an OpenOffice odt format file for subsequent manipulation.

For example if you wanted to make a reflowable document, such as an epub out of this you need the text, not a graphic representation of the text. The same goes for a whole lot of textual analyses.

One of the virtues of OCRfeeder is that it is scriptable from the command line, meaning that you can experiment first to see which OCR engine works best on some test samples of your material and then script it to run through a set of images.
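The same idea of scripting a run over a set of images can be sketched by calling Tesseract directly rather than going through OCRfeeder, since its command line is simply the image, an output base name and an optional language. This assumes tesseract is installed and on the path; the directory layout and function names are hypothetical.

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path, lang="eng"):
    """Return the Tesseract command that OCRs one image to a text file."""
    image_path = Path(image_path)
    # Tesseract adds the ".txt" extension to the output base name itself.
    output_base = image_path.with_suffix("")
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_directory(frame_dir, lang="eng"):
    """OCR every TIFF frame in frame_dir, writing one .txt file per frame."""
    for image in sorted(Path(frame_dir).glob("*.tif")):
        subprocess.run(tesseract_command(image, lang), check=True)
```

Running ocr_directory("frames") on a handful of test frames, then again with a different engine or language setting, is a cheap way of comparing results before committing to a full run.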

One nice thing that you can also do with OCRfeeder is import an existing text-based pdf and extract the text for subsequent manipulation where you don’t have the original source documents. OCRfeeder can also be used as a console for correcting transcription errors.

The National Archives Xena application does much the same thing except that it uses Jpedal for pdf file manipulation. Jpedal comes in two versions, a cut down, free version, and a fuller version with text extraction capabilities. Depending on your use case you may find that OCRfeeder is a better choice than Xena, or vice versa.

Both will output to OpenOffice. OpenOffice can of course be driven from the command line or a script to take the file and export it into whatever format you require for subsequent processing, such as plain text.

OCRfeeder, ImageMagick, Tesseract, Cuneiform, Ghostscript and the rest are all available as packages for Ubuntu, making building a test install a snap. You can be up and experimenting within 30 minutes of building an Ubuntu machine or vm. Something which would have been difficult or expensive to accomplish a few years ago can now be done sensibly with a scanner, some free software and a second hand pc, substantially increasing access to digitised resources and allowing institutions in financially disadvantaged parts of the world to mount their own digitisation exercises.

Nothing I have done off and on over the past couple of days is very new or very difficult. All I have done is show how easy it is. Easy enough for a local historical society or individual researcher to get going. Couple this with a basic digital camera built into a home made book scanner and you have an incredibly valuable tool.

There are however a couple of downsides to this. Most OCR tools have some knowledge of the language they’re scanning, in part to deal sensibly with diacritics and such like. This means that they might not cope so well with languages that use different mark up schemes.

For example you might well be able to digitise Swahili text using an English language filter, given that Swahili does not use diacritics, but not text in a language such as Yolngu that includes some extra characters. Yolngu, for example, uses a special letter (eng) for the ng sound, represented as Unicode character U+014A in its capital form.
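A quick way of checking whether some sample text strays outside what an English language filter expects is to list the letters that are not plain ASCII. This is only a sketch: the function name is mine, and the sample strings in the note below are illustrative rather than taken from any real corpus.

```python
import unicodedata


def letters_outside_ascii(text):
    """Map each non-ASCII letter found in text to its Unicode name."""
    return {
        ch: unicodedata.name(ch, "UNKNOWN")
        for ch in text
        if ch.isalpha() and ord(ch) > 127
    }
```

Given "Kiswahili" this returns an empty dictionary, while "Yol\u014bu" reports the lowercase eng, U+014B, which is the lowercase partner of U+014A.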

Cuneiform is limited to a subset of languages using Latin scripts and the more common Cyrillic variants. It does not have support for some of the made-up Cyrillic-like scripts invented by the Soviets in the 1930s for writing some of the Siberian languages, Mongol and some of the Turkic languages in that area. Tesseract supports a wider range of scripts including Korean (Hangul), Chinese and Japanese, as well as Thai, opening up the prospect of being able to digitise material from East Asia in a reasonably straightforward manner.

Wednesday 15 February 2012

seven inch tablets and keyboards

or how you can have your cake and eat it.

In a reply to a comment on my recent post on whether or not we are living in a post-pc world I mentioned the 7" Android netbooks out of China that periodically turn up on eBay and DHgate.

Who, if anyone, buys them, I don't know, but they would appear to give you some basic functionality at a low price - web, skype, email and so on.

While looking for something else I noticed a new development. As well as the Android netbooks there's a slew of 7" Android tablets out of China. Obviously the sub $100 ones will have resistive screens and so on, but based on my zPad experience there's no reason to believe that using one would be in any way worse than using a brand name device.

The interesting thing is that these are increasingly being offered with an adapter that gives a couple of standard USB ports and a wired network port. ie you can plug in one of these roll up keyboards and a mouse and you've got the functionality of an android netbook.

Pack them away and you've got a tablet. And if you've got a resistive screen and a stylus you probably don't mind the smaller screen. (After all, people successfully used Palm Pilots, which had a much smaller screen, to write emails and take notes.)

Adding a keyboard to a tablet in itself isn't remarkable - Apple sells an external iPad keyboard, and Lenovo among others sells a keyboard and charger combo unit for the IdeaPad - but I get the impression that uptake of these items is not that great. Certainly I have never seen anyone using one of these in anger.

However, I have this theory that if you speak and write Chinese, a tablet makes a lot of sense. With an onscreen keyboard, building up characters out of stroke elements is easy, and picking them with a stylus (or a finger) is a fairly natural seeming experience. I suspect this helps drive the production of low cost Android tablets in China.

Certainly when I started with my zPad I used the default pinyin keyboard application that had an option for stroke selection using predictive text input as well as assembling characters based on Roman text input.

I discovered this through the inevitable finger trouble one has (well I do anyway) with glass keyboards, when I would occasionally hit the 'switch to Chinese predictive input' key instead of the shift key.

I'll admit that while I found the insight gained by using a Chinese keyboard valuable, I later changed to a more western keyboard application.

Ignoring Steve Jobs's jibe that seven inch tablet makers should bundle sandpaper to help people file down their fingers so they could use them, the seven inch tablet + stylus probably makes a lot of sense for handling ideographic languages.

However, for those of us who use Latin script, the really interesting thing about this phenomenon is the price. The no name tablet plus foldable keyboard bundle usually comes in at around $100 before shipping and taxes, which has definite implications for tablet adoption and for using them as simple note takers and educational devices ...

Monday 13 February 2012

Has Microsoft Word affected the way we work? | Technology | The Observer

Following on from the typewriters and literary history of word processing thread, I belatedly (via this week's Guardian Weekly) happened across this piece by John Naughton from the Observer:


Has Microsoft Word affected the way we work? | Technology | The Observer

There are two takeaways in the piece. The first is that personal word processing was most definitely around before 1985 - I for one remember writing documents on a BBC Micro in 1984 using a ROM based word processing application whose name I forget. The second is that professional writers, be they journalists or authors, made the switch as soon as they could.

At the same time as I was using a lightweight application on a BBC Micro, we were using WordStar running on top of CP/M on SuperMicro Z80 based computers in the lab to create complex reports and mailmerge documents, including using a daisywheel based printer to generate camera ready text in nice fonts for publication.

However, word processing, while it was most definitely around, was not nearly as common a couple of years before, when it was all TeX and troff, and access to computers, not to mention personal ownership of them, was very rare.

Only with the advent of cheap computers like the BBC Micro did word processing start to become common. Only with the arrival of the IBM PC and its clones did it start heading towards the pervasive ...

Dataset and game archiving

Over on one of my WordPress blogs I've a post on how archiving experimental data (as opposed to observational data) poses much the same problems as those faced by people who try to archive computer games:

Dataset and game archiving | Building an archive solution

The real push behind this comes when we look at the -omics type data produced by gene sequencers and the like. Much of the data is produced by extraordinarily complex machines and is pre-analysed in complex ways, with the implication that while it might be observational, (a) it is in fact very like experimental data and (b) the final data set is very dependent on the computing environment that produced it ...

Sunday 12 February 2012

Roman Empire ran on camel power ??

An intriguing little snippet:

Roman Empire ran on camel power

Now it probably didn't, but it's _just_ possible that camels were more significant than we thought, and that we simply haven't discovered an image of camel-using soldiers on a grave marker or found a written account.

Now if they were really common you'd think we'd have reasonable material evidence, and you'd probably be right. But we do know that there were all sorts of soldiers from all over the empire stationed in weird places - just look at the roster for the regiments manning Hadrian's Wall.

So it's possible that there were camel-using troops in Belgium, just that there were only a few detachments.

And the camels were eaten, failed to thrive, and passed from memory ...

Friday 3 February 2012

The power of paper in a digital age ...


Building on the literary history of word processing theme, here's an interesting post from the Guardian:

The power of paper in a digital era | Robert McCrum | Books | guardian.co.uk

McCrum's major point is of course that, due to the inherent revisability of word processed documents, we cannot necessarily follow an author's thought processes through drafts. This is not quite true: the dreaded 'track changes' can go some way towards this, even though what it produces is incredibly overloaded and impenetrable.

However the situation is even worse for historical archives. We may have 'track changes' set on and the documents correctly indexed and filed, but we are missing the marginal notes, the comments (and the Post-it notes and highlighting) that might reveal how a decision was arrived at.

We would only ever have the official version with the official revisions. Even in the days of typewritten memos we always had the marginal notes because of the expense in time and effort of creating revised versions.

Nowadays of course clean copies can be produced at the click of a mouse, and the scribbled annotated drafts end up in the big blue secure disposal bin ...

Wednesday 1 February 2012

are we living in a post-pc world ?

Yesterday I tweeted a link to a Guardian article by Jean-Louis Gassée about how the rise of the iPad means we may be living in a post-pc world.

JLG used to run product development at Apple, helped run BeOS and later had a major role with Palm so his opinion probably both matters and is based on some considerable industry experience.

It's undoubtedly true people want iPads. Not Android tablets, even though they may have a similar end user experience but iPads. We can argue about why this is so but people want iPads.

The consequence is that a lot of the applications development focuses on developing for the iPad which has a ratchet effect meaning the iPad has by far the largest software base out there, be it for new house searches in Sydney or whatever.

This has some interesting side effects. Consider your favourite newspaper application. This of course actually pulls content using http from the newspaper's content management system and displays it. No website or browser involved. The newspaper may front end its content management system with a website (or sites - the Guardian now has a UK and a US website with differing content) - but the website is just another delivery mechanism.

Equally, take a look at the Guardian on Facebook - it's an app displaying content, and you could just about imagine a scenario where Facebook became simply an application or plugin multiplexer, just as iGoogle could evolve into an application portal.

What we can conclude from the iPad and Facebook experience is that people like apps more than browsers and that people like iPads.

Apple controls the app store through iTunes, a service it honed during the iPod era, and thus potentially controls access to content. Your tablet's browser of course provides an escape hatch from this expansive but walled garden.

What is also apparent is that the Chromebook has been less than a resounding success. While it is possible to do most of your day to day work (email, write drafts, meeting notes, surf, blog etc) with a browser alone, the applications don't feel as rich as a desktop application and don't have the attractiveness of a tablet app.

Much as I like Google Docs, a local word processor gives a richer experience. What using Google Apps gives is pervasiveness - access your content from anywhere with any browser.

This is a superficially persuasive model for education - give the kids Chromebooks and let them connect to Moodle and Wikipedia - but it stops being persuasive the moment they need to write something serious or create something. A Chromebook is simply a stateless ookygoo - great for travelling, but unlike the ookygoo (or any other netbook or lightweight laptop/ultrabook) dependent on the internet to be more than a kilo of gubbins.

So are we living in a post-pc / post-browser world?

It would seem so. That doesn't mean that we won't still have things that look like personal computers around - we will as long as we need to write letters, email things, run presentations, hack code and anything else that creates files and stuff but these will go back to being work tools.

iPads and their descendants will be what we use at home and recreationally. We'll use them to watch movies, read books, share photos and listen to music - I've already seen real people - tourists outside Old Parliament House - using an iPad to take photos of each other.

Browsers will still exist, but increasingly they won't be the prime content access mechanism for mainstream content - the stuff everyone uses like newspaper websites, Wikipedia, YouTube and the rest - and will become something a bit specialist for searching and accessing information.

So we'll end up with a screen full of icons, each of which does some special thing - just like in Windows 3.1 ...