Monday, 1 August 2016

OCR, scanning and print on demand

I’ve been thinking about Ernesta, Christina and OCR (and the problem of book digitisation).

Scanning a book essentially means taking an image of each page. That may introduce artefacts caused by blotches, foxing and dead spiders, but the result is generally something that can be read by a human being, and of course the book can be printed out, bound, and sold.

That’s what the various print on demand people do and what the Espresso book machine was designed to do.

And because human beings are good at working out page layouts, you don’t need to correct for headers, footers, page numbers and so on. What you scan is what you get.

Enter the ebook reader, either as hardware or software.

For a start, they want text, not images. OCR software is good but not perfect, which means that artefacts introduced either by poor-quality printing or by dead spiders will give you unreadable runs of gobbledygook. Page headers and page numbers also tend to end up embedded in the text. Because the text is reflowable, i.e. automatically formatted to fit the screen, those headers and page numbers no longer make any sense: the screen is treated as a little porthole through which you see the book as if it were printed on a roll of toilet paper being unwound past it.
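To give a feel for the sort of cleanup involved, here is a rough Python sketch that strips running headers and bare page numbers from OCR output before the text is reflowed. It rests on several assumptions of mine: each page arrives as a separate string, a page number sits alone on a line of digits, and a running header is the first remaining line and repeats on at least half the pages. The function name and the sample pages are made up for illustration, and real scans will be messier than this.

```python
import re
from collections import Counter

# Assumption: a page number is a line containing nothing but 1 to 4 digits.
PAGE_NUMBER = re.compile(r"^\d{1,4}$")


def strip_page_furniture(pages, min_share=0.5):
    """Drop bare page numbers and repeated first-line headers, then join
    the pages into one run of reflowable text (hypothetical helper)."""
    split_pages = []
    first_lines = Counter()
    for page in pages:
        # Keep non-blank lines that are not bare page numbers.
        lines = [ln.strip() for ln in page.splitlines() if ln.strip()]
        lines = [ln for ln in lines if not PAGE_NUMBER.match(ln)]
        split_pages.append(lines)
        if lines:
            first_lines[lines[0]] += 1

    # A first line counts as a running header only if it repeats on
    # at least two pages and on at least min_share of all pages.
    threshold = max(2, min_share * len(pages))
    headers = {ln for ln, n in first_lines.items() if n >= threshold}

    cleaned = []
    for lines in split_pages:
        if lines and lines[0] in headers:
            lines = lines[1:]
        cleaned.append(" ".join(lines))
    return " ".join(part for part in cleaned if part)


# Made-up sample pages, as OCR might return them.
sample = [
    "SAMPLE TITLE\n12\nThe first page ends in the middle",
    "SAMPLE TITLE\n13\nof a sentence.",
]
print(strip_page_furniture(sample))
```

Even this toy version shows why the job is fiddly: a chapter that happens to open with the same line on several pages, or a page number the OCR has misread as letters, would slip straight past it.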

And having read several books that have been converted like this I can tell you it’s a pain on the Kindle.

Gutenberg ebooks are usually good because they have been rekeyed. Books from mainstream publishers are usually generated from the electronic source and structured appropriately.

Scanned and OCR’d books need some TLC (and possibly some TEI), and that takes time and requires effort, which is expensive.

This means that print on demand will live on as a way of reading out-of-print, out-of-copyright books, as it’s probably cheaper to do this than to restructure and correct the text, especially for one-off productions ...
