Saturday, 14 November 2009

slight complication in the epub game ...

Good news and bad news time: Stanza happily imports pdfs and exports them as epubs without difficulty. The bad news is that most pdfs are created without many of the metadata fields filled in, so the resulting epub has the document name and the document author set to 'unknown'.

Stanza doesn't care - it opens the file you ask it to open and so if you've created a file called thing.epub it will open the file as thing.epub.

The Cool-er is annoyingly cleverer. It knows the file is an epub, so it opens the epub to display the document name rather than the filename. And of course, as the document name was never set in the original pdf, that's what you get - a document called unknown (and often created by unknown). If you have two such documents it picks the first one it finds and doesn't display the second - logical but annoying.
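To see what a reader like the Cool-er is likely to display, you can look up the title and author recorded inside the epub yourself. What follows is a minimal Python sketch, not anything the Cool-er or Stanza provides: the function name epub_title_and_author is my own, and it assumes the epub has the standard META-INF/container.xml pointing at a package file with Dublin Core title and creator elements.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the epub container file and Dublin Core metadata
CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"
DC_NS = "http://purl.org/dc/elements/1.1/"


def epub_title_and_author(path):
    """Return (title, author) as recorded in the epub's package (.opf) file.

    Hypothetical helper for illustration; either value is None if the
    corresponding element is missing.
    """
    with zipfile.ZipFile(path) as z:
        # container.xml says where the package file lives inside the archive
        container = ET.fromstring(z.read("META-INF/container.xml"))
        rootfile = container.find(f".//{{{CONTAINER_NS}}}rootfile")
        opf = ET.fromstring(z.read(rootfile.get("full-path")))
    title = opf.findtext(f".//{{{DC_NS}}}title")
    author = opf.findtext(f".//{{{DC_NS}}}creator")
    return title, author
```

Calling epub_title_and_author("thing.epub") on a freshly converted pdf would be expected to show the 'unknown' values rather than anything derived from the filename.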

So the next stage is to unpick an existing epub - edit the files appropriately and then zip them back together ....

a little later ...

which following the jedisaber.com tutorial turned out to be fairly easy.

Let us say we're editing a file called lnt.epub - my sequence was:

$ mv lnt.epub lnt.zip
$ unzip lnt.zip

edit the following files in the OEBPS directory using the text editor of your choice

content.opf
toc.ncx
title.xhtml

replacing all the unknowns with appropriate text - title, author etc

then

$ zip -r lnt.zip OEBPS/content.opf OEBPS/toc.ncx OEBPS/title.xhtml

(if you're feeling paranoid, or your version of zip doesn't support this, it could be done as three separate commands)

$ mv lnt.zip lnt.epub
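The same unzip, edit and re-zip sequence can be sketched in Python for anyone who would rather script it. This is only a rough sketch under assumptions: the function name replace_in_epub is made up, the three member names are simply the ones listed above, and the strings to replace depend on whatever markup your converter actually wrote, so check the files by hand first.

```python
import zipfile

# The three files edited by hand above; other epubs may lay things out differently
DEFAULT_MEMBERS = ("OEBPS/content.opf", "OEBPS/toc.ncx", "OEBPS/title.xhtml")


def replace_in_epub(src, dst, replacements, members=DEFAULT_MEMBERS):
    """Copy the epub at src to dst, applying plain string replacements
    (a dict of old -> new) to the named member files only.

    Hypothetical helper for illustration. Entries are written in their
    original order with their original compression settings.
    """
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        for info in zin.infolist():
            data = zin.read(info.filename)
            if info.filename in members:
                text = data.decode("utf-8")
                for old, new in replacements.items():
                    text = text.replace(old, new)
                data = text.encode("utf-8")
            # Passing the ZipInfo keeps each entry's name and compression type
            zout.writestr(info, data)
```

A call might look like replace_in_epub("lnt.epub", "lnt-fixed.epub", {"<dc:title>unknown</dc:title>": "<dc:title>My Title</dc:title>"}), where the exact markup and the title are placeholders. Writing to a new file rather than updating in place leaves the original untouched if something goes wrong.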

Check the result with Stanza. Now this is where there's a problem - in my version there are two spurious unknowns. Originally I didn't edit the title.xhtml file, and that produced an epub with four spurious unknowns. Editing title.xhtml got rid of two of them, so I'm guessing there are a couple of other fields - and certainly changing the appropriate values in part1.xhtml got rid of the remaining unknowns, even though I didn't change the values in part2.xhtml - something I thought might provoke an error in the application parsing the epub.

Now I'm hacking this - the jedisaber tutorial doesn't mention doing this, so I'm going to guess I'm working with a revision to the format which I don't quite understand as yet. Certainly there should be a way of suppressing these values ...

What is also clear is that, given a suitable application to extract text from a pdf and convert it to xhtml, it would be reasonably simple to write a perl script to build the supporting files programmatically from a template.
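As a rough illustration of the template idea, here is a sketch in Python rather than perl. It only builds a minimal content.opf - the template text, the build_opf name and the item ids are my own assumptions, it is not what Stanza generates, and a real package would also need toc.ncx, the mimetype file and META-INF/container.xml.

```python
from string import Template
from xml.sax.saxutils import escape, quoteattr

# A deliberately minimal OPF 2.0 package template (an assumption, not Stanza's output)
OPF_TEMPLATE = Template("""<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>$title</dc:title>
    <dc:creator>$author</dc:creator>
    <dc:language>$language</dc:language>
    <dc:identifier id="bookid">$identifier</dc:identifier>
  </metadata>
  <manifest>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
$manifest_items
  </manifest>
  <spine toc="ncx">
$spine_items
  </spine>
</package>
""")


def build_opf(title, author, identifier, parts, language="en"):
    """Fill in the template for a book whose content is the xhtml files in parts."""
    manifest_items = "\n".join(
        f'    <item id="part{i}" href={quoteattr(name)} '
        f'media-type="application/xhtml+xml"/>'
        for i, name in enumerate(parts, 1)
    )
    spine_items = "\n".join(
        f'    <itemref idref="part{i}"/>' for i, _ in enumerate(parts, 1)
    )
    return OPF_TEMPLATE.substitute(
        # escape() protects characters such as & and < in the metadata values
        title=escape(title),
        author=escape(author),
        language=escape(language),
        identifier=escape(identifier),
        manifest_items=manifest_items,
        spine_items=spine_items,
    )
```

Something like build_opf("My Title", "A. N. Author", "urn:uuid:...", ["title.xhtml", "part1.xhtml"]) would then give the text to write out as OEBPS/content.opf, with the identifier left for you to supply.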

It also means that it is relatively straightforward for anyone with text that can be converted to xhtml to package their books as epubs - an ideal way for small specialist publishers to redistribute their back list.

2 comments:

dgm said...

OpenOffice creates decent enough xhtml for our purposes - and of course can be driven from the command line. The Linux poppler-utils package has command line utils to extract the metadata, and to convert pdf to text (or html) - next stage I guess is to play with poppler to work out how best to extract data and then recast it as an epub ...

dgm said...

http://www.hxa.name/articles/content/epub-guide_hxa7241_2007.html is a useful alternative reference to the epub format