Good news and bad news time: Stanza happily imports PDFs and exports them as epubs without difficulty. The bad news is that most PDFs are created with few of their metadata fields filled in, so the resulting epub has both the document name and the document author set to 'unknown'.
Stanza doesn't care - it opens the file you ask it to open and so if you've created a file called thing.epub it will open the file as thing.epub.
The Cool-er is annoyingly cleverer. It knows the file is an epub so it opens the epub file to display the document name - not the filename, and of course as the document name is unknown as it was never set in the original pdf that's what you get - a document called unknown (and often created by unknown). If you have two such documents it picks the first one it finds and doesn't display the second - logical but annoying.
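To see what a reader like the Cool-er would find, here is a rough Python sketch that pulls the title and author out of an epub's package file. The function name read_epub_metadata is my own invention, and it assumes the epub has the usual META-INF/container.xml pointing at the OPF file.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the container file and by Dublin Core metadata.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
DC = "{http://purl.org/dc/elements/1.1/}"


def read_epub_metadata(epub):
    """Return (title, author) as stored inside the epub, not the filename.

    `epub` may be a path or a file-like object. Hypothetical helper,
    assuming a standard META-INF/container.xml is present.
    """
    with zipfile.ZipFile(epub) as zf:
        # container.xml says where the OPF package file lives.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        opf = ET.fromstring(zf.read(rootfile.get("full-path")))
    title = opf.find(f".//{DC}title")
    creator = opf.find(f".//{DC}creator")
    return (
        title.text if title is not None else None,
        creator.text if creator is not None else None,
    )
```

Run against one of the Stanza-generated files, I'd expect this to report 'unknown' for both values, which is exactly what the Cool-er displays.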
So the next stage is to unpick an existing epub - edit the files appropriately and then zip them back together ....
a little later ...
which following the jedisaber.com tutorial turned out to be fairly easy.
Let us say we're editing a file called lnt.epub - my sequence was:
$ mv lnt.epub lnt.zip
$ unzip lnt.zip
edit the following files in the OEBPS directory using the text editor of your choice
content.opf
toc.ncx
title.xhtml
replacing all the unknowns with appropriate text - title, author etc
then
$ zip -r lnt.zip OEBPS/content.opf OEBPS/toc.ncx OEBPS/title.xhtml
(if you're feeling paranoid, or your version of zip doesn't support adding several files at once, this could be done as three separate commands)
$ mv lnt.zip lnt.epub
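The same unzip, edit, zip sequence could be scripted. Below is a minimal Python sketch, not what I actually ran: rewrite_epub, set_element_text and the EDITABLE tuple are hypothetical names, and it assumes the files are UTF-8. It copies every entry across in its original order and with its original compression, so the mimetype entry stays where it was.

```python
import re
import zipfile
from xml.sax.saxutils import escape

# The three files edited by hand above (matched by filename suffix).
EDITABLE = ("content.opf", "toc.ncx", "title.xhtml")


def set_element_text(xml_text, tag, value):
    """Replace the text of every <tag ...>...</tag> element with value."""
    pattern = re.compile(
        r"(<%s\b[^>]*>).*?(</%s>)" % (re.escape(tag), re.escape(tag)), re.S
    )
    # A function replacement avoids backslash surprises in `value`.
    return pattern.sub(lambda m: m.group(1) + escape(value) + m.group(2), xml_text)


def rewrite_epub(src, dst, edit):
    """Copy the epub at src to dst, passing editable files through edit().

    edit(name, text) must return the new text for that file.
    """
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        for info in zin.infolist():
            data = zin.read(info.filename)
            if info.filename.endswith(EDITABLE):
                data = edit(info.filename, data.decode("utf-8")).encode("utf-8")
            # Preserve entry order and per-entry compression.
            zout.writestr(info.filename, data, compress_type=info.compress_type)
```

An edit function for content.opf might call set_element_text(text, "dc:title", ...) and set_element_text(text, "dc:creator", ...); toc.ncx and title.xhtml use different elements, so they would need their own replacements.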
Check the result with Stanza. This is where there's a problem: my version still showed two spurious unknowns. Originally I didn't edit title.xhtml, and that produced an epub with four spurious unknowns. Editing title.xhtml got rid of two of them, so I guessed there were a couple of other fields, and sure enough changing the appropriate values in part1.xhtml got rid of the remaining unknowns, even though I didn't change the values in part2.xhtml - something I thought might provoke an error in the application parsing the epub.
Now I'm hacking this - the jedisaber tutorial doesn't mention doing this, so I'm going to guess I'm working with a revision to the format which I don't quite understand as yet. Certainly there should be a way of suppressing these values ...
What is also clear is that, given a suitable application to extract text from a pdf and convert it to xhtml, it would be reasonably simple to write a perl script to build the supporting files programmatically from a template.
It also means that it is relatively straightforward for anyone with text that can be converted to xhtml to package their books as epubs - an ideal way for small specialist publishers to redistribute their back list.
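As a rough illustration of the template idea, here is a sketch in Python rather than perl. It builds a minimal OPF 2.0 content.opf from a template; build_opf, OPF_TEMPLATE and the ch1, ch2 ... ids are names I've made up, and a real package would probably want more metadata than this.

```python
from string import Template
from xml.sax.saxutils import escape, quoteattr

# Minimal OPF 2.0 package file; $-placeholders are filled in by build_opf().
OPF_TEMPLATE = Template("""<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>$title</dc:title>
    <dc:creator>$author</dc:creator>
    <dc:identifier id="bookid">$identifier</dc:identifier>
    <dc:language>$language</dc:language>
  </metadata>
  <manifest>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
$items
  </manifest>
  <spine toc="ncx">
$itemrefs
  </spine>
</package>
""")


def build_opf(title, author, identifier, chapters, language="en"):
    """Return content.opf text listing each xhtml chapter in reading order."""
    items = "\n".join(
        '    <item id="ch%d" href=%s media-type="application/xhtml+xml"/>'
        % (i, quoteattr(href))
        for i, href in enumerate(chapters, 1)
    )
    itemrefs = "\n".join(
        '    <itemref idref="ch%d"/>' % i for i in range(1, len(chapters) + 1)
    )
    return OPF_TEMPLATE.substitute(
        title=escape(title),
        author=escape(author),
        identifier=escape(identifier),
        language=escape(language),
        items=items,
        itemrefs=itemrefs,
    )
```

toc.ncx and the title page could be generated the same way from their own templates, so that the title and author are set once and never left as 'unknown'.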
2 comments:
OpenOffice creates decent enough xhtml for our purposes - and of course can be driven from the command line. The Linux poppler-utils have command line utilities to extract the metadata, and to convert pdf to text (or html) - the next stage, I guess, is to play with poppler to work out how best to extract data and then recast it as an epub ...
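A first stab at the metadata side might look like the Python sketch below. It assumes poppler's pdfinfo is on the PATH and that it prints one 'Key: value' pair per line; parse_pdfinfo and pdf_metadata are hypothetical helper names, not part of poppler.

```python
import subprocess


def parse_pdfinfo(output):
    """Turn pdfinfo's 'Key:   value' lines into a dict."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only, so values containing colons survive.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def pdf_metadata(path):
    """Run pdfinfo on a pdf and return its metadata fields as a dict."""
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    return parse_pdfinfo(result.stdout)
```

Fields such as Title or Author that come back missing or empty are the ones that would otherwise end up as 'unknown' in the epub, so a script could prompt for them before building the package.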
http://www.hxa.name/articles/content/epub-guide_hxa7241_2007.html is a useful alternative reference to the epub format