Friday, 26 September 2014

Almost instant ebook production

For reasons described elsewhere I was searching for pictures of Louise Bryant, who witnessed some of the events following the 1917 revolution in Russia and was married to John Reed. Reed is best known for Ten Days That Shook the World, but he also had a pedigree in left-wing journalism and agitation in the US in the years before the First World War.
However, this post is only tangentially about my interest in the history of the Bolshevik revolution and subsequent civil war. While searching for pictures of Louise Bryant I came across an unpublished biography at (unsurprisingly), which I thought might be worth reading.
The biography is published as a set of long html pages - as if it was a set of blog posts, and I really wanted to read it offline using either a tablet or an ebook reader. So I decided to make it into an epub file for personal use.
It turned out to be really easy to do, and as there are quite a few other books and texts out there that have been transcribed and published as a set of web pages, I thought I’d publish my recipe …
  • Using Firefox, download and save each of the web pages in the document
    as a file - choose web page as the format - this will save the html
    and also save any embedded images in a subdirectory named after the
    web file, eg if the first section of the book is called part_one it
    will save a file called part_one.html and create a directory called
    part_one_files containing the embedded images.
  • Then create a new blank document in LibreOffice and, using the insert command, select and insert each of the files in turn into the document. This will give you a document containing the entire text.
    • Automagically the images will also be embedded in more or less the
      correct place.
  • Save the file as somefile.odt to your Dropbox
  • Go to CloudConvert and connect it to your
    Dropbox account.
  • Select somefile.odt as your import document and choose either epub (for most generic ebook readers) or mobi (for Kindle) as the output format.
  • Select save to dropbox as your output target, click convert and it
    will write the output file out to ~/dropbox/apps/cloudconvert/somefile.epub
    • (Under windows it will be in the apps\cloudconvert directory in My
    • If you chose mobi as the output format the output file will be
At that point you can transfer the file to your chosen device, either by sideloading it or emailing it to your Kindle, or by opening it from Dropbox in an ebook reader application on your tablet.
The whole exercise took me about ten minutes.
I see no reason why this solution should not work equally well for other texts transcribed and published as web pages.
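
For anyone who would rather script the merge step than insert each saved page by hand, here is a minimal sketch in Python. It is an alternative to the LibreOffice insert step, not what I actually did: the function name `merge_saved_pages` and the output name `book.html` are made up for illustration, and it assumes the pages were saved as described above and are encoded as UTF-8.

```python
import re
from pathlib import Path

# Matches the contents of the <body> element, whatever its case or attributes.
BODY_RE = re.compile(r"<body[^>]*>(.*?)</body>", re.IGNORECASE | re.DOTALL)


def merge_saved_pages(pages, output):
    """Join the <body> contents of saved web pages into one HTML file.

    Write `output` into the same directory as the saved pages so that
    relative image links such as part_one_files/... keep working.
    """
    bodies = []
    for page in pages:
        html = Path(page).read_text(encoding="utf-8", errors="replace")
        match = BODY_RE.search(html)
        # Fall back to the whole file if no <body> element is found.
        bodies.append(match.group(1) if match else html)
    merged = (
        '<html><head><meta charset="utf-8"></head><body>\n'
        + "\n<hr>\n".join(bodies)
        + "\n</body></html>\n"
    )
    Path(output).write_text(merged, encoding="utf-8")
    return Path(output)
```

Calling `merge_saved_pages(["part_one.html", "part_two.html"], "book.html")` (where part_two.html is a hypothetical second section) should leave a single file that can be opened in LibreOffice and saved as somefile.odt, after which the rest of the recipe is unchanged.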

> Written with StackEdit.

Tuesday, 23 September 2014

Digital Preservation Strategies ...

I came across a beautifully succinct quotation from National Records of Scotland:

‘If digital records are not captured there can be no preservation and
if there is no preservation there can be no access’

Which is a beautifully concise description of why we do data capture. If we don’t, there is no way of retracing our steps and no way of substantiating research, because we don’t have the original data.

And of course, if we don’t have the data all our arguments about preferred archival formats are moot. And in a very real sense they are anyway - formats change over time, and preferences change over time. Legal documents and court transcripts in WordPerfect from the nineties are a key example.

They may still have validity, but they are in a dead file format. No one when they created these transcripts knew that in twenty years the files would be in a dead format - they chose a widely used well documented format - it’s just that preferences changed.

Tools such as Tika, PRONOM and FIDO give us a chance, on capture, to also record information about the file format, which gives us a clue about how we might read the file in the future.
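
To make the idea concrete, here is a deliberately crude sketch in Python of recording a format hint at capture time by looking at a file's leading bytes. It is not how Tika or FIDO are implemented (they rely on far richer signature sets), and the names `SIGNATURES`, `sniff_format` and `capture_record` are invented for this example. The four signatures listed are well-known ones, but the list is nowhere near complete.

```python
from pathlib import Path

# A handful of well-known leading byte signatures ("magic numbers").
SIGNATURES = [
    (b"%PDF", "PDF"),
    (b"\xffWPC", "WordPerfect"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 container (e.g. legacy Microsoft Office)"),
    (b"PK\x03\x04", "ZIP container (e.g. ODF, OOXML, EPUB)"),
]


def sniff_format(path):
    """Return a rough format label based on the first bytes of the file."""
    with open(path, "rb") as handle:
        head = handle.read(16)
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return "unknown"


def capture_record(path):
    """Build a small metadata record to store alongside a captured file."""
    p = Path(path)
    return {"name": p.name, "size": p.stat().st_size, "format": sniff_format(p)}
```

The point of the sketch is only that the format information is cheap to collect at the moment of capture and much harder to reconstruct twenty years later.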

And of course the technology to read files changes as well. All we can do is try to make sensible decisions that make life easy for anyone who wants to access captured files.

File normalisation is one such decision. What it really means, of course, is ‘convert files in a known proprietary format to an open format on ingest’, usually using something like LibreOffice in batch mode, and storing the converted file along with the original.
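
As a rough illustration of normalisation on ingest, the following Python sketch drives LibreOffice's headless converter. It assumes the `soffice` executable is on the PATH; the function names `build_convert_command` and `normalise` and the choice of odt as the target are mine, for the sake of the example. The converted copy is written to a separate directory, so the original file is left untouched.

```python
import subprocess
from pathlib import Path


def build_convert_command(source, outdir, target="odt", soffice="soffice"):
    """Return the LibreOffice command line for a headless conversion."""
    return [
        soffice,
        "--headless",
        "--convert-to", target,
        "--outdir", str(outdir),
        str(source),
    ]


def normalise(source, outdir, target="odt"):
    """Convert `source` to an open format, keeping the original in place."""
    Path(outdir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if LibreOffice exits with an error.
    subprocess.run(build_convert_command(source, outdir, target), check=True)
    # LibreOffice names the output after the source file's stem.
    return Path(outdir) / (Path(source).stem + "." + target)
```

A real ingest pipeline would also want to record which version of LibreOffice did the conversion, since the converter itself changes over time.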

The idea is, of course, that the converted file will be easier to read because it’s in an open format rather than a proprietary one. Of course, when we say proprietary format we mean Microsoft, because we worry about its dominance of the file format ecology.

And we are of course most certainly wrong - there is just so much material in Microsoft formats that it is difficult to believe that there will be a future in which there are no applications to read these files - what one should be worrying about is the less well used formats such as Pages or AbiWord where there is a greater risk of losing access.

But the point remains, that unless we capture the files in the first place we will have no chance of reading them in the future …

Written with StackEdit.

Tuesday, 2 September 2014

Moving people away from commercial cloud services

A few days ago I posted an update on my thoughts about Eresearch support services.
One of the points I made was that no matter how desirable it was to move people off commercially hosted services such as Dropbox, it wouldn't be easy:

This ease of sharing and the fact that Dropbox is hosted 
outwith Australia is something that of course gives intellectual 
property managers the willies, but it is also a fact of life, and 
something that has to be dealt with - in other words, as Dropbox 
is already out there in the wild, and whatever is provided as a 
replacement has to be at least as good, and at least as flexible 
- which of course means it will bring the same intellectual property 

Dropbox, and the others, such as Evernote and Box, are in with the woodwork as they already have widespread adoption.

I’ve just had a real world example in which a researcher shared data with me via Dropbox that he wanted to have uploaded to our data repository, and have a Digital Object Identifier minted for that data so that it was citable.

In my conversations with him I followed the party line and suggested he use Cloudstor, AARNet’s file transfer service, which is based on FileSender, to transfer the data to me.

As a service, it’s pretty easy to use. However, my client used Dropbox instead, simply because it’s what he was familiar with and he knew that it worked.

I am, of course, as bad as everyone else. I routinely share documents and notebooks stored in Evernote with colleagues, and share Google documents with colleagues, so I’m most definitely not going to complain about using Dropbox here - after all it’s exactly what I would have done, and as I’ve said before I’ve had publishers share material for review in exactly the same way.

Instead of complaining, I’m going to take this as a learning experience:
  • services like Cloudstor are not going to succeed without a major educational campaign to raise awareness among the user community
  • competitor services like Dropbox are already well established and users have a high degree of familiarity with them - any educational campaign needs to focus on Cloudstor’s unique features
  • whatever value proposition is made needs to be relevant to the users - so if we want to build a unique selling proposition around keeping intellectual property onshore we’d better make it relevant and explain that as well
and the last point is something that we would need to think carefully about. My client was passing me his data because he wanted not only to make it citable, but also to make it open access, as he was publishing a paper in a journal that required this.

And if it were me my first question would be

If it’s open access, does it matter that it’s gone via Dropbox?

And I must admit, I’d be hard pressed to find a reason why it mattered …
Written with StackEdit.