Thursday 29 October 2015

Google Calendar URLs and Orage synchronisation

On a hot sticky afternoon in 2007 I wrote a little script to import a Google Calendar file into Orage.

For some reason it became quite popular, mostly due to the lack of a suitable alternative import mechanism.

Google have recently announced that the URL used by Google Calendar will change.

If you've been using my script with an old style URL, this will of course break, but the fix is relatively straightforward - simply update the calendar URL used by the wget command ...
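
If you'd rather do the fetch in Python than with wget, a minimal sketch might look something like this - the calendar ID is a placeholder, and I'm assuming the new style calendar.google.com host:

import urllib.request

# the calendar ID is a placeholder - substitute your own,
# and note the new style calendar.google.com host
calendar_id = 'example@group.calendar.google.com'
url = ('https://calendar.google.com/calendar/ical/'
       + calendar_id + '/public/basic.ics')

# fetch the ics file and save it locally
with urllib.request.urlopen(url) as response:
    with open('basic.ics', 'wb') as f:
        f.write(response.read())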


Wednesday 21 October 2015

El Capitan ...

While I was having my Twitter discussion on 'What is a repository?' I wasn't idle.

I upgraded my five year old MacBook to the latest version of OS X, El Capitan. I'd previously upgraded it to Yosemite, and while it was fine in use, it did tend to slow down over the working week and was at times tediously slow to boot after a shutdown.

It's a little too early to tell how good it'll be after a period of sustained use but I think the upgraded system is a little faster.

The upgrade process was fairly slick, as you'd expect with Apple. I did have a moment of panic halfway through when the system rebooted and, instead of simply flashing the power light on and off, made it flicker along with a horrid buzzing noise that reminded me of a system with corrupt firmware - but that was just me being paranoid.

Initial login and configuration seemed to take forever, giving me plenty of time to appreciate the aesthetics of the redesigned spinning beach ball of death, but we got there in the end.

As usual with a Mac, everything just seems to work and there's no playing with settings (well, apart from reconnecting to iCloud).

We'll see how things are in a week or so ...

Monday 19 October 2015

No one wants an e-reader

A few days ago I blogged about peak e-reader and how one big English bookshop chain had stopped stocking Kindles.

So, is it really the end for the e-reader?

I think so. A search on eBay reveals a lot of second-hand Kobos, Sonys and Kindles, but no new devices or cheap third-party Asian devices.

The cheapest I could find was a refurbished Kobo for fifty bucks - as a comparison, on the same day I could get a new end-of-range 7" Android tablet for an extra ten bucks - from Telstra of all people.

The market has voted - no one wants them ...

Wednesday 14 October 2015

IPython notebooks and text cleaning

Twice in the last few days I've had an interactive text cleaning session - once with Ernesta's diary and once with a 125-page listing of journal editor strings that lacked some quality control - leading spaces, double spaces between some entries, and so on.

All easy enough to fix with sed to get a file where the names were nicely comma separated and hence easy to split into the individual names.

None of this is rocket science - mostly it's just

s/ and /,/
s/ & /,/
s/ ,/,/
s/,,/,/


and of course each time you do it the steps you follow are slightly different.

Most times I don't use sed; mostly I use gedit, which includes the same search and replace functionality. It could also be done interactively from the command line using perl or python, as I did when cleaning up Ernesta, when I felt lazy and raided Stack Overflow rather than working it out myself.
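
For what it's worth, the same substitutions are easy enough in Python, in a form that would drop straight into a notebook cell - a minimal sketch, with placeholder file names:

import re

# placeholder file name - the listing of editor strings
with open('editors.txt') as f:
    text = f.read()

# the same substitutions as the sed expressions above
text = re.sub(' and ', ',', text)
text = re.sub(' & ', ',', text)
text = re.sub(' ,', ',', text)
text = re.sub(',,', ',', text)

# and once the names are nicely comma separated,
# splitting out the individual names is trivial
names = [name.strip() for name in text.split(',')]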

The crucial thing is of course that I don't actually have a record of what I did. I have notes of what I think I did, but these are reconstructed from a screenscrape of a terminal session and emails to colleagues. And if you use a tool like gedit, you don't get a record of what you've done at all.

The same goes for work done in R, such as my experiments to make a Middle Scots stopword list - while I'm sure I've archived my script somewhere, I don't have a record of what I did.

While it might be overkill, the answer is to use something like IPython notebooks as an interactive work environment - and of course they're not just for Python any more - they're increasingly language agnostic.

So my little self-improvement project is to get to grips with IPython notebooks, which if nothing else should improve my Python skills ...

Thursday 8 October 2015

Peak e-reader?

Waterstone's, the big UK bookshop chain, has stopped reselling the Kindle. At the same time their competitor, which resells the Nook, reports that sales are flat, with few people buying an e-reader for the first time, and that such sales as they do have are mostly people replacing failed devices.

This shouldn't surprise us. E-readers are conceptually simple devices that do one or two things very well. There's no pressure to upgrade or replace unless the device breaks or is left out in a summer storm.

For example, while I use a Kindle for recreational reading, I still use my 2009 vintage Cool-er for reading public domain epubs and cleaned up texts such as Ernesta Drinker's diary. Despite being totally unsupported, my Cool-er still works fine - the only problem being that the paint has scuffed off some of the arrow keys.

So that's one problem. The devices are reliable. The other problem is the multiple device problem. A lot of reading takes place on public transport, and if you've already got a tablet with you why carry a second device when you can just as easily read your book on your tablet?

So it's probably legitimate to say that the e-reader market is saturated, at least in the developed, English-speaking world. Given the cheapness of tablets these days, less developed countries may never do the e-reader thing, especially as a tablet is considerably more flexible a resource - after all, if you had a choice between a $100 tablet and a $100 e-reader, which would you choose?

None of this says anything about e-book adoption rates.

E-books remain a versatile distribution medium. There will always be people who prefer paper books, and books that simply aren't available in an electronic format. And there's definitely a role for them as reference material.

But e-books are here to stay.

Wednesday 7 October 2015

Ernesta Drinker and Surfaces, MacBooks and the rest

Yesterday I posted about my quick and dirty clean up of Ernesta Drinker's journal.

A few hours later, on the other side of the planet, Microsoft announced a slew of devices, including the Surface Book, which is being touted by some journalists as a MacBook Pro killer.

Well, I have a MacBook Pro (well, work bought it for me, and it's actually 5 years old and overdue for replacement), but all my work fiddling with Ernesta Drinker's book was carried out on an even more elderly Dell Latitude running Linux.

It's Linux that made it possible, because of its rich toolset, though I could have done it on my MacBook via a terminal window because of OS X's BSD heritage.

Windows? Well, given I used perl for a lot of it, I could have done it by running the scripts from the command line, but it would have been a bit of a hassle.

And that's something that tends to be forgotten. There are those of us who use machines for work, and quite often what we have on our desks is driven by our software requirements for work, and how effective that makes us.

If I was being cynical, I'd say the only reason I have Microsoft Office is that I once had to write a set of grant proposals using a template that didn't work in LibreOffice.

Necessity is the mother of software choice, not how fast or sexy your hardware is ...

Tuesday 6 October 2015

Fixing Ernesta

Fixing Ernesta Drinker's book turned out to be easier than expected.

First of all I used gedit to remove the front matter from the text file, and then used cat -s to squeeze the runs of blank lines introduced by the digitisation process and get a halfway clean file.

I then used sed to delete the lines containing the header and footer strings

sed '/header_string/d' infile.txt > outfile.txt

which gave me a reasonably clean text. The only problem was that the file had hard-coded end of line markers, and paragraphs were mostly separated by double end of line markers. Here perl was my friend

perl -pi -0 -w -e 's/\n\n/ qq9 /g' infile.txt

to replace the paragraph breaks with qq9 - a string that did not occur in the document. Then I used

perl -p -e 's/\n//g' infile.txt > tmpfile.txt

to take out the end of line markers, and then

perl -p -e 's/qq9/\n /g' tmpfile.txt > outfile.txt

to put back the paragraph breaks. (And yes, I used Stack Overflow.) I could have wrapped all of this up in a script, but working out the best order of operations was a bit iterative, and consequently I ran the individual operations in a terminal window.
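
If I ever do wrap it up, a rough Python equivalent of the whole sequence might look something like this - the header pattern and file names are of course placeholders:

import re

# placeholder file names
with open('ernesta-raw.txt') as f:
    text = f.read()

# drop the header and footer lines - header_string is a placeholder
text = re.sub(r'^header_string.*\n', '', text, flags=re.MULTILINE)

# squeeze runs of blank lines down to a single blank line, like cat -s
text = re.sub(r'\n{3,}', '\n\n', text)

# protect the paragraph breaks, strip the hard-coded line ends,
# then put the paragraph breaks back
text = text.replace('\n\n', ' qq9 ')
text = text.replace('\n', '')
text = text.replace('qq9', '\n ')

with open('ernesta-clean.txt', 'w') as f:
    f.write(text)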

At this point I opened the text with LibreOffice to check the format and remove a couple of headers garbled in the OCR process. If I was being pedantic I could then have spell-checked the document, but what I had was good enough to read and take notes from, so I simply used CloudConvert to make an epub file from the saved file.

Not perfect, but good enough.

Reading old digitised books

Over the long weekend I caught up on some of my reading backlog, including a biography of Louise Bryant.

Louise Bryant was at one time married to John Reed (he of 'Ten Days that Shook the World' fame) and after his death married William Bullitt, who was later US ambassador to the Soviet Union.

Louise Bryant's story is of someone who desperately wanted to be someone, rather than a serious revolutionary. While she had her fifteen minutes of fame as a journalist, she was ultimately a tragic figure, dying in obscurity. To quote Emma Goldman's cynical remark, 'she was never a communist, she only slept with a communist'.

William Bullitt had a diplomatic career before he met Louise Bryant.
His first wife, Ernesta Drinker, accompanied him on a diplomatic mission to the Central Powers (Germany, Austria-Hungary) before the USA joined Britain and France on the Western Front in 1917. Ernesta kept a diary of the trip and published it as a book afterwards.

Now one of my interests is the lead up to the Russian Revolution. There's plenty of material in English about the First World War, but that naturally concentrates on Gallipoli and the Western Front. There's actually very little available about how things were in Germany and Austria-Hungary, so I thought I'd try and track down a digitised copy.

Well, there's no copy on Gutenberg, but it's been digitised as part of the Google Books initiative, and it's reasonably easy to obtain a copy of the scanned text via the Internet Archive as either a pdf or an epub. The text has scanning errors in it but it's not too bad, even if the structure of the pages is a bit annoying, with 'digitised by Google' added at the bottom of each individual page image.

The text is however good enough for input to any text analysis program. Good enough for what people rather grandly call 'distant reading'.

It is however a pain to read. I could of course take the text and write a little python script to clean it up a bit and generate my own epub, and perhaps I should, but that does defer the instant gratification aspect of tracking down a book, so I went looking for a clean copy.

The various Indian print-on-demand operations offer to print a corrected version for around $15, and a couple of websites offer access to a corrected version for a modest fee that allows you to download the text. One of them offers a try-before-you-buy option to see a sample of the pages, and certainly they look reasonable. A quick search of AbeBooks turned up nothing other than the print-on-demand versions at a reasonable price - none of the original editions being offloaded for a dollar or two.

So it's back to the digitised text.

One of the problems with the text digitisation effort is that a lot of the scanning initiatives have been focused either on producing text for input to machine learning programs or on producing a page-by-page set of images. And if one is using the pdf version, an added footer is not really a problem, provided one views the page images screen by screen at the original page size.

But one never does that. The easiest way is to use a reflowable format such as epub, which allows one to adapt the text display to the capabilities of the device being used, or to use a pdf viewer and coerce the page to A4. And this leads to the footers and original page breaks being scattered through the document.

And this is because the way the text has been digitised has been to scan the pages, add the footers, and OCR the page images to extract the text. Which is fine if one wants a digital representation of the original book, but rather less so if one wants to read the damn thing ...
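
If I do eventually get round to generating my own epub, stripping the per-page residue would be the first step - a minimal sketch, assuming the Google footer and bare page numbers are the only residue (the pattern and file names are guesses):

import re

# placeholder file names
with open('scan.txt') as f:
    lines = f.readlines()

# drop the 'Digitised by Google' footer lines and bare page numbers
cleaned = [line for line in lines
           if not re.match(r'\s*Digiti[sz]ed by Google\s*$', line)
           and not re.match(r'\s*\d+\s*$', line)]

with open('scan-clean.txt', 'w') as f:
    f.writelines(cleaned)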