Tuesday, 6 October 2015

Reading old digitised books

Over the long weekend I caught up on some of my reading backlog, including a biography of Louise Bryant.

Louise Bryant was at one time married to John Reed (he of 'Ten Days that Shook the World' fame) and after his death married William Bullitt, who was later US ambassador to the Soviet Union.

Louise Bryant's story is that of someone who desperately wanted to be someone, rather than of a serious revolutionary. While she had her fifteen minutes of fame as a journalist, she was ultimately a tragic figure who died in obscurity. To quote Emma Goldman's cynical remark: 'she was never a communist, she only slept with a communist'.

William Bullitt had a diplomatic career before he met Louise Bryant. His first wife, Ernesta Drinker, accompanied him on a diplomatic mission to the Central Powers (Germany, Austria-Hungary) before the USA joined Britain and France on the Western Front in 1917. Ernesta kept a diary of the trip and published it as a book afterwards.

Now one of my interests is the lead up to the Russian Revolution. There's plenty of material in English about the First World War, but that naturally concentrates on Gallipoli and the Western Front. There's actually very little available about how things were in Germany and Austria-Hungary, so I thought I'd try and track down a digitised copy of Ernesta's diary.

Well, there's not a copy on Gutenberg, but it's been digitised as part of the Google Books initiative, and it's reasonably easy to obtain a copy of the scanned text via the Internet Archive as either a pdf or an epub. The text has scanning errors in it but it's not too bad, even if the structuring of the pages is a bit annoying, with 'digitised by Google' added in at the bottom of each individual page image.

The text is, however, good enough for input to any text analysis program. Good enough for what people rather grandly call 'distant reading'.

It is, however, a pain to read. I could of course take the text and write a little Python script to clean it up a bit and generate my own epub, and perhaps I should, but that does defer the instant gratification aspect of tracking down a book, so I went looking for a clean copy.

The various Indian print on demand operations offer to print a corrected version for around $15, and a couple of websites offer access to a corrected version for a modest fee which allows you to download the text. One of them offers a try before you buy option to see a sample of the pages, and certainly they look reasonable. A quick search of AbeBooks turned up nothing other than the print on demand versions at a reasonable price - none of the original editions were being offloaded for a dollar or two.

So it's back to the digitised text.

One of the problems with the text digitisation effort is that a lot of the scanning initiatives have been focused on producing the text either for input to some machine learning programs or in producing a page by page set of images. And if one is using the pdf version, having an added footer is not really a problem, providing that one views the page image screen by screen at the original page size.

But one never does that. The easiest way is to use a reflowable format such as epub, which allows one to adapt the text display to the capabilities of the device being used, or to coerce the page to A4 in a pdf viewer. And this leads to the footers and original page breaks being scattered through the document.

And this is because the text has been digitised by scanning the pages, adding the footers, and then running OCR over the page images to extract the text. Which is fine if one wants a digital representation of the original book, but rather less so if one wants to read the damn thing ...
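
For what it's worth, the sort of little clean-up script I had in mind might look something like the sketch below. It is only a rough illustration, not something I have run against the actual book: the footer wording, the idea that stray page numbers sit on a line of their own, and the function name clean_ocr_text are all my assumptions, and a real scan would doubtless need more rules.

```python
import re

# Assumed footer wording; OCR output varies, so match both spellings loosely.
FOOTER = re.compile(r"^\s*digiti[sz]ed\s+by\s+google\s*$", re.IGNORECASE)
# Assumption: a line holding nothing but a short number is a stray page number.
PAGE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$")


def clean_ocr_text(text):
    """Drop scanning footers and bare page numbers, then tidy the line breaks."""
    kept = []
    for line in text.splitlines():
        if FOOTER.match(line) or PAGE_NUMBER.match(line):
            continue  # skip residue added by the scanning process
        kept.append(line.rstrip())
    cleaned = "\n".join(kept)
    # Rejoin words split by a hyphen at the end of a line.
    # Caveat: this also merges genuine hyphenated compounds that happen
    # to break across a line.
    cleaned = re.sub(r"(\w)-\n(\w)", r"\1\2", cleaned)
    # Collapse runs of blank lines left behind by the removals.
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)
    return cleaned.strip()


if __name__ == "__main__":
    sample = (
        "We left for Ber-\n"
        "lin on Monday.\n"
        "Digitized by Google\n"
        "12\n"
        "The train was late."
    )
    print(clean_ocr_text(sample))
```

The cleaned text could then be fed to whatever tool one prefers for building an epub; that step is left out here.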

