Thursday, 18 July 2013

OCR and Percy B Molesworth

I recently tweeted a link to an article on using OCR from the command line. The article reminded me of my previous experiments with OCR.

In between then and now I'd played with both Zamzar and Adobe File Conversion services trying to convert Florence Caddy's account of her time in Sri Lanka to editable text. Both failed for various reasons, as did the Calibre e-book management application.

So I thought I would try OCR out on Florence Caddy's book. In the end I didn't have to, as I found a version in epub format that met my requirements. I was probably being a little ambitious anyway: the problems with OCR include font recognition and artefacts caused either by blobbiness in the print or by dirt picked up in the scanning process, and I suspect the original scanned PDF I had would have had a reasonable number of these.

However, I did try to baseline how much work it would be by OCR'ing the obituary of Percy B Molesworth provided by the NASA Astrophysics Data System. In this, the pages have been scanned as images with headers and footers added. (Percy Molesworth was a late nineteenth century British astronomer who worked in Sri Lanka.)

The actual text of the article is about one A4 page, but spread across two pages. To process it I used OCRFeeder running on my Linux laptop, and inside OCRFeeder selected the Tesseract OCR engine. The initial results were not too bad - good enough to play with without needing to try an alternative engine.
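For anyone who would rather skip the GUI, the same engine can be driven from a script. What follows is only a rough sketch, not what I actually did: it assumes the `tesseract` command line tool is installed and on the PATH, that the page has already been saved as an image (the file name `page1.png` is made up), and that English is the right language. The function names are my own invention.

```python
import shutil
import subprocess


def tesseract_command(image_path, lang="eng"):
    """Build the argument list for a basic Tesseract run.

    Using "stdout" as the output base asks Tesseract to print the
    recognised text instead of writing it to a file.
    """
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run Tesseract on one image and return the recognised text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise if Tesseract reports an error
    )
    return result.stdout


if __name__ == "__main__":
    # "page1.png" is a hypothetical file name used for illustration only.
    print(tesseract_command("page1.png"))
```

Calling `ocr_image("page1.png")` would then give back the raw text, which still needs the same fix up and editing described below.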

OCRFeeder exports in ODT format by default, making LibreOffice an obvious choice as the document fix up and editing tool.

Using LibreOffice I cut and pasted the relevant chunk of text into a new document, and fixed the formatting so that the paragraph structure reflected that of the original document. (While I was doing this I found it incredibly useful to have the original PDF open on a second laptop while I worked on the LibreOffice document.)

I then ran the LibreOffice spell checker, which fixed up most of the minor errors such as "aud" for "and". This basically left a few complete bits of garbage to be fixed manually - dates for some reason were bizarre, and the fact that Percy had a twelve and a half inch instrument produced a string of garbage.
However, at the end of the exercise I had a usable document which I could save and export from LibreOffice as required.
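To give a feel for why a spell checker catches slips like "aud" but not the real garbage, here is a toy sketch of the idea using Python's standard `difflib` module. It is not how the LibreOffice spell checker works internally; the tiny word list and the 0.6 similarity cutoff are assumptions chosen purely for illustration.

```python
import difflib
import re

# A deliberately tiny, made-up vocabulary; a real checker uses a full dictionary.
VOCAB = ["and", "the", "of", "instrument", "astronomer", "observatory"]


def correct_word(word, vocab=VOCAB, cutoff=0.6):
    """Return the closest vocabulary word, or the word unchanged if none is close."""
    lower = word.lower()
    if lower in vocab:
        return word
    matches = difflib.get_close_matches(lower, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text, vocab=VOCAB):
    """Apply correct_word to every run of letters, leaving everything else alone."""
    return re.sub(r"[A-Za-z]+", lambda m: correct_word(m.group(0), vocab), text)


if __name__ == "__main__":
    print(correct_text("sun aud moon"))
```

Near misses such as "aud" or "instrumeut" sit close enough to a real word to be fixed, whereas a mangled date or measurement has no close neighbour and is left for manual repair. With a full dictionary an automatic replacement like this could also turn a correct but unusual word into the wrong one, which is why reviewing each suggestion by hand is the safer route.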

The key thing is the time it took: 30 minutes to do a page of A4 text. At that rate your average five to six hundred page nineteenth century travel book would need 250 to 300 hours, or a little over eight weeks at the upper end, assuming the same amount of work per page and a standard working week. The major time consumer was not the initial OCR process. That took me around five minutes, most of which went on reminding myself how the application worked. The majority of the time went into the fix up and edit process.
With that sort of time, retyping the text is probably a viable option ...
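The back of the envelope sum is easy to redo with other numbers. This is a minimal sketch: the 30 minutes per page is my measured figure, but the 35 hour working week is an assumption, and the function name is made up for the example.

```python
def weeks_to_correct(pages, minutes_per_page=30, hours_per_week=35):
    """Estimate the working weeks needed to OCR and fix up a book.

    hours_per_week is an assumed "standard working week"; adjust to taste.
    """
    total_hours = pages * minutes_per_page / 60
    return total_hours / hours_per_week


if __name__ == "__main__":
    for pages in (500, 600):
        print(pages, "pages:", round(weeks_to_correct(pages), 1), "weeks")
```

Changing `hours_per_week` to 40 brings the six hundred page case down to seven and a half weeks, which does not really change the conclusion.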