Friday, 26 September 2014

Almost instant ebook production

For reasons described elsewhere I was searching for pictures of Louise Bryant, who witnessed some of the events following the 1917 revolution in Russia. She was married to John Reed, best known for Ten Days that Shook the World, but who also had a pedigree in left wing journalism and agitation in the US in the years before the first world war.
However, this post is only tangentially about my interest in the history of the Bolshevik revolution and subsequent civil war. While searching for pictures of Louise Bryant I came across an unpublished biography at (unsurprisingly), which I thought might be worth reading.
The biography is published as a set of long html pages - as if it was a set of blog posts, and I really wanted to read it offline using either a tablet or an ebook reader. So I decided to make it into an epub file for personal use.
It turned out to be really easy to do so, and as there are quite a few other books and texts out there that have been transcribed and published as a set of web pages, I thought I’d publish my recipe …
  • Using firefox download and save each of the web pages in the document
    as a file - choose web page as the format - this will save the html
    and also save any embedded images in a subdirectory named after the
    web file, eg if the first section of the book is called part_one it
    will save a file called part_one.html and create a directory called
    part_one_files containing the embedded images.
  • Then create a new blank document in libre office and using the insert command select and insert each of the files in turn into the document. This will give you a document containing the entire text.
    • Automagically the images will also be embedded in more or less the
      correct place.
  • Save the file as somefile.odt onto your dropbox
  • Go to CloudConvert and connect it to your
    dropbox account.
  • Select somefile.odt as your import document and choose either epub (for most generic ebook readers) or mobi (for kindle) as the output format.
  • Select save to dropbox as your output target, click convert and it
    will write the output file out to ~/dropbox/apps/cloudconvert/somefile.epub
    • (Under windows it will be in the apps\cloudconvert directory in My
    • If you chose mobi as the output format the output file will be
At that point you can transfer the file to your chosen device, either by sideloading it or by emailing it to your kindle, or you can open it on your tablet straight from dropbox using an ebook reader application.
The whole exercise took me about ten minutes.
I see no reason why this solution should not work equally well for other texts transcribed and published as web pages.
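
For anyone who would rather script the merge step than insert each file by hand, here is a minimal Python sketch. It is not the method used above - it is an assumed alternative that pulls the body out of each saved page and joins them into a single html file, which libre office should then be able to open and save as odt. The function names are made up for illustration, and the regular expression is only a rough stand-in for a proper html parser.

```python
import re
from pathlib import Path

# Rough pattern for the contents of <body>...</body>; an assumption that the
# saved pages are simple enough for a regex rather than a full HTML parser.
BODY_RE = re.compile(r"<body[^>]*>(.*?)</body>", re.IGNORECASE | re.DOTALL)


def extract_body(html: str) -> str:
    """Return the markup inside <body>, or the whole text if no body tag is found."""
    match = BODY_RE.search(html)
    return match.group(1).strip() if match else html.strip()


def merge_pages(pages: list[Path], title: str) -> str:
    """Join the bodies of the saved pages, in the order given, into one HTML document."""
    sections = [
        extract_body(page.read_text(encoding="utf-8", errors="replace"))
        for page in pages
    ]
    body = "\n<hr/>\n".join(sections)
    head = "<head><meta charset='utf-8'><title>" + title + "</title></head>"
    return "<html>" + head + "\n<body>\n" + body + "\n</body></html>\n"


def write_merged(pages: list[Path], out_path: Path, title: str) -> None:
    """Write the merged document to out_path (hypothetical helper)."""
    out_path.write_text(merge_pages(pages, title), encoding="utf-8")
```

The pages have to be passed in reading order, as sorting names like part_one, part_two and part_three alphabetically would shuffle them. The merged file should also be written into the same directory as the saved pages, so that any relative links to the part_one_files style image directories still resolve.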

Written with StackEdit.

Tuesday, 23 September 2014

Digital Preservation Strategies ...

I came across a beautifully succinct quotation from National Records of Scotland:

‘If digital records are not captured there can be no preservation and
if there is no preservation there can be no access’

Which is a beautifully concise description of why we do data capture. If we don’t, there is no way of retracing our steps and no way of substantiating research, because we don’t have the original data.

And of course, if we don’t have the data all our arguments about preferred archival formats are moot. And in a very real sense they are anyway - formats change over time, and preferences change over time. Legal documents and court transcripts in WordPerfect from the nineties are a key example.

They may still have validity, but they are in a dead file format. No one when they created these transcripts knew that in twenty years the files would be in a dead format - they chose a widely used well documented format - it’s just that preferences changed.

Tools such as Tika, Pronom and Fido give us the chance, at capture time, to record information about the file format as well, which gives us a clue about how we might read the file in the future.
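
To make the idea concrete, here is a toy Python sketch of recording format information at capture. It is emphatically not how Tika, Pronom or Fido work internally - it just checks the first few bytes of a file against a handful of well known signatures and stores the result alongside a checksum. The signature table is a tiny illustrative subset, and the record layout and function names are assumptions of mine.

```python
import hashlib
from pathlib import Path

# Illustrative subset of well known file signatures ("magic numbers").
# A real registry such as PRONOM holds far more, with far more nuance.
SIGNATURES = [
    (b"\xffWPC", "WordPerfect document"),
    (b"%PDF", "PDF document"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 container (e.g. legacy Microsoft Office)"),
    (b"PK\x03\x04", "ZIP container (e.g. ODF, OOXML, EPUB)"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
]


def identify(header: bytes) -> str:
    """Return a format label for the leading bytes of a file, or 'unknown'."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"


def capture_record(path: Path) -> dict:
    """Build a small metadata record for a file at the point of capture."""
    data = path.read_bytes()
    return {
        "name": path.name,
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        "format": identify(data[:16]),
    }
```

Even a record this crude tells a future reader that a file with an unhelpful name was, say, a WordPerfect document, which is most of the battle in working out how to open it.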

And of course the technology to read files changes as well. All we can do is try to make sensible decisions that make life easy for anyone who wants to access captured files.

File normalisation is one such decision. What it really means, of course, is ‘convert files in a known proprietary format to an open format on ingest’ - usually using something like libre office in batch mode - and storing the converted file along with the original.
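
As a rough illustration of what normalisation on ingest might look like, here is a short Python sketch that drives libre office in headless mode. It assumes the `soffice` binary is installed and on the path, that odt is the chosen open format, and that the original and the converted copy live side by side in one archive directory - all of which are my assumptions rather than anyone’s recommended practice. The function names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(src: Path, out_dir: Path, target: str = "odt") -> list[str]:
    """Build the headless LibreOffice command line for one conversion."""
    return [
        "soffice",
        "--headless",
        "--convert-to", target,
        "--outdir", str(out_dir),
        str(src),
    ]


def normalise(src: Path, archive_dir: Path, target: str = "odt") -> Path:
    """Copy the original into archive_dir and store a converted copy next to it."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    # Keep the original untouched alongside the normalised version.
    shutil.copy2(src, archive_dir / src.name)
    # Raises CalledProcessError if LibreOffice reports a failure.
    subprocess.run(build_convert_command(src, archive_dir, target), check=True)
    return archive_dir / (src.stem + "." + target)
```

Keeping the command builder separate from the function that runs it makes it easy to check what would be executed before letting it loose on a whole ingest queue.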

The idea is, of course, that the converted file will be easier to read because it’s in an open format rather than a proprietary one. Of course, when we say proprietary format we mean Microsoft, because we worry about its dominance of the file format ecology.

And we are of course most certainly wrong - there is just so much material in Microsoft formats that it is difficult to believe that there will be a future in which there are no applications to read these files - what one should be worrying about is the less well used formats such as Pages or AbiWord where there is a greater risk of losing access.

But the point remains, that unless we capture the files in the first place we will have no chance of reading them in the future …


Tuesday, 2 September 2014

Moving people away from commercial cloud services

A few days ago I posted an update on my thoughts about Eresearch support services.
One of the points I made was that no matter how desirable it was to move people off commercially hosted services such as Dropbox, it wouldn't be easy:

This ease of sharing and the fact that Dropbox is hosted
outwith Australia is something that of course gives intellectual
property managers the willies, but it is also a fact of life, and
something that has to be dealt with - in other words, as Dropbox
is already out there in the wild, whatever is provided as a
replacement has to be at least as good, and at least as flexible
- which of course means it will bring the same intellectual property
concerns.

Dropbox, and the others, such as Evernote and Box, are in with the woodwork as they already have widespread adoption.

I’ve just had a real world example in which a researcher shared data with me via Dropbox that he wanted to have uploaded to our data repository, and have a Digital Object Identifier minted for that data so that it was citable.

In my conversations with him I followed the party line and suggested he use Cloudstor, AARNET’s file transfer service, which is based on FileSender, to transfer the data to me.

As a service, it’s pretty easy to use. However, my client used Dropbox instead, simply because it’s what he was familiar with and he knew that it worked.

I am, of course, as bad as everyone else. I routinely share documents and notebooks stored in Evernote with colleagues, and share Google documents with colleagues, so I’m most definitely not going to complain about using Dropbox here - after all it’s exactly what I would have done, and as I’ve said before I’ve had publishers share material for review in exactly the same way.

Instead of complaining, I’m going to take this as a learning experience:
  • services like Cloudstor are not going to succeed without a major educational campaign to raise awareness among the user community
  • competitor services like Dropbox are already well established and users have a high degree of familiarity with them - any educational campaign needs to focus on Cloudstor’s unique features
  • whatever value proposition is made needs to be relevant to the users - so if we want to build a unique selling proposition around keeping intellectual property onshore we’d better make it relevant and explain that as well
And the last point is something that we would need to think carefully about. My client was passing me his data as he wanted not only to make it citable, but also to make it open access, as he was publishing a paper in a journal that required this.

And if it were me my first question would be

If it’s open access, does it matter that it’s gone via Dropbox?

And I must admit, I’d be hard pressed to find a reason why it mattered …

Tuesday, 26 August 2014

Eresearch services

About a year ago I posted my two cents worth on what an eresearch support service should look like.

A year or so on, and after innumerable conversations with users, potential users and people who are interested, I find my views are not much changed:

User wants can be broadly summarised as

  • storage
    • dropbox like sharing capability
    • lots of it
    • handling of diverse media types (agnostic)
    • assurance it is secure, backed up and accessible
  • virtual machines
    • data analysis & manipulation
  • secure long term storage of data
    • publication of data for substantiation
    • digital object identifiers
  • advice on legacy data
    • format conversion
    • media conversion
    • digitisation
    • some bespoke programming, data wrangling etc

Dropbox is extremely popular because of its ease of use and universality, meaning people can share data from the field with colleagues, with colleagues overseas etc.

I have a second life in which I review books - it’s noticeable that in the past year publishers have moved from sending you the epub or mobi version to sharing it with you via dropbox. I don’t see any reason why researchers should be any different in their habits.

This ease of sharing and the fact that Dropbox is hosted outwith Australia is something that of course gives intellectual property managers the willies, but it is also a fact of life, and something that has to be dealt with - in other words, as Dropbox is already out there in the wild, whatever is provided as a replacement has to be at least as good, and at least as flexible - which of course means it will bring the same intellectual property concerns.

And of course it’s not just Dropbox, we can say the same about Evernote, OneDrive, OneNote and Google Drive.

However, in the course of my conversations one thing that comes up over and over again is the need for decent work in progress storage, and for work in progress storage into which it is easy to load data, either by direct capture from instruments or by some easy finder/file manager like process. People expect to be able to drag’n’drop, and telling them about some command line incantation with rsync doesn’t play.

There is an interest in data publication, but at the moment it’s basically driven by journals requiring that data has to be made available, but I expect that this will build as more and more journals require this. I also expect to see more interest in publishing source code and things like R scripts as part of the whole substantiation and open review thing.

There’s also an undercurrent of people wanting to return to research they did earlier and finding themselves locked out of their data because it’s been stored on media no longer in common use - such as zip drives, or in older data formats that made sense at the time. We could rehearse the open formats argument here, but that doesn’t fix the problem, which needs to be addressed. Allied to this is the need for a little bespoke programming or data wrangling to get data into a usable format, or to clean data.

So, one year on I’d say change hasn’t happened, but there’s nothing to say that it won’t …


Wednesday, 20 August 2014

Munich to ditch Linux ?

The internet has been all a-twitter today with the news that Munich was considering dumping Linux and going back to Microsoft.

I’m not surprised. Saddened perhaps, but not surprised. Much as Apple through the iPad owns the tablet space, Microsoft still owns the office desktop, and this means that if you want to do something different you have to not only do it as well as Microsoft, you have to do it better.

So let’s look at the Linux software environment and compare it with Microsoft. And of course when we’re talking about local government we’re largely talking about administrative and management tasks, which means word processing, spreadsheets, email and workflows - in other words office applications.

Libre Office and Open Office basically do everything Microsoft Office does, but slightly more clunkily. Clever formatting in Office documents sometimes comes out a little weird, especially if the original document has been edited with two or three different versions of Office, but in the main it’s perfectly usable. You’d be being snippy to say it wasn’t.

Ditto for Evolution as a mail and calendar client. Not as polished as Outlook but perfectly usable. And if you were a private individual or running a little home business there’s no reason why Linux wouldn’t work for you. The same argument applies to Macs and OS X. Or running anything with Google Docs.

And then there’s collaboration, workflows, business automation, call it what you will. Sharepoint does that pretty well. And in the Linux world?

Sure there are solutions but they usually involve keeping squads of wild eyed sandal wearing geeks in the basement - ie you can’t just license it, get some nice consultants in at inflated prices to configure it for you and leave it running the way you want.

And there’s lots of things out there to integrate. Useful things like invoicing and payment management solutions. Move to something definitely not mainstream and you have to re-engineer every damn thing …


Wednesday, 13 August 2014


I recently happened across an application called Multcloud.

To be honest, the manufacturers asked me if I was interested in reviewing it.

I declined because, as a matter of policy, I don’t write reviews of things I haven’t tested myself by using them for real work, or for which payment (or some other inducement) is offered. I’ve always believed in eating my own dogfood, and I find that way I sleep better at night.

However, I was sufficiently curious to take a look.

The idea is quite simple - we all have multiple cloud based accounts, OneDrive, GoogleDrive, Box, and the rest and we all end up with files scattered across all of them, and if you’re like me have different machines that mount different subsets of these drives.

The idea is to provide you with a browser window into which you connect all your accounts, and then which allows you to search across them just as you would search the disks attached to your pc, and to copy files between them.

Not a stupid idea - in fact quite a good idea. Obviously there’s a raft of security concerns, but the vendors claim on their website that all authentication is by OAuth, and that no data is cached on their servers.

Now as I said, I haven’t tested this tool, and have no idea how well it performs. It’s also not the only such application out there - a little googling comes up with a list of alternatives. However this is a product that might well fill a need for some people. Remember that your mileage may vary …


Monday, 11 August 2014

Zotero, RefMan and Jabref

Zotero is a rather nice bibliography manager which can export references in a number of formats, including BibTeX.

I was playing with reference managers last week trying to work out a set of workflows to get the information exported for reloading into a different solution - it’s the old problem of tracking people’s publication history and loading it into a research management system.

There’s a good little BibTeX exporter for Zotero known as autozotbib. It works as a plugin for both the desktop client and firefox, and pushes the records out in BibTeX as they are updated.

If you use dropbox for filesharing you can of course output the export file directly to Dropbox, which makes it readily accessible to a number of other reference management products, including JabRef.

In the course of playing about with this I also tried installing RefMan on my Android tablet, and telling it to read the Zotero output file - which it did.

Now I’m by no means a power Zotero user, but one thing I do sometimes need to do is check references and information, and increasingly I find myself reaching for a tablet to do so, because of its extreme portability, rather than using either a Chromebook or one of my aging netbooks. While I’ve only got one way synchronization - ie all the changes have to be made at the Zotero end of things - this little trick makes it comparatively easy to search reference lists with a native (and free) android app …
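
The same exported file can of course be searched without any app at all. As a minimal sketch, assuming the export writes one field per line (which is how BibTeX exports usually look, but is an assumption here), a few lines of Python are enough to pull out the entries and search them. This is not a full BibTeX parser and the function names are mine - nested braces and multi-line fields will trip it up.

```python
import re

# Start of an entry, e.g. "@book{reed1919,"
ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
# One "name = value" field per line, with an optional trailing comma
FIELD_RE = re.compile(r"\s*(\w+)\s*=\s*(.*?),?\s*$")


def parse_bibtex(text: str) -> list[dict]:
    """Parse a simple one-field-per-line BibTeX export into a list of dicts."""
    entries = []
    starts = list(ENTRY_RE.finditer(text))
    for i, match in enumerate(starts):
        # An entry runs from its header to the start of the next one.
        end = starts[i + 1].start() if i + 1 < len(starts) else len(text)
        entry = {"type": match.group(1).lower(), "key": match.group(2)}
        for line in text[match.end():end].splitlines():
            field = FIELD_RE.match(line)
            if field:
                # Drop the surrounding braces or quotes from the value.
                entry[field.group(1).lower()] = field.group(2).strip().strip('{}"')
        entries.append(entry)
    return entries


def search(entries: list[dict], term: str) -> list[dict]:
    """Return the entries in which any field contains term, ignoring case."""
    needle = term.lower()
    return [e for e in entries if any(needle in v.lower() for v in e.values())]
```

Reading the file from the Dropbox folder and calling `search(parse_bibtex(text), "some author")` gives a quick lookup on any machine with Python on it, with the same one way caveat as before - any corrections still have to be made in Zotero.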
