Fixing Ernesta Drinker's book turned out to be easier than expected.
First of all I used gedit to remove the front matter from the text file, and then used cat -s to squeeze the doubled blank lines introduced by the digitisation process, which gave me a halfway clean file.
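That blank-line squeeze is a one-liner; the file names here are placeholders, not the ones from my session:

```shell
# cat -s squeezes each run of blank lines down to a single blank line.
# Demo input with a tripled blank line standing in for the OCR text:
printf 'line one\n\n\n\nline two\n' > raw.txt
cat -s raw.txt > infile.txt
cat infile.txt
```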
I then used sed to replace the header and footer strings
sed -i 's/header_string//g' infile.txt
with null strings, which gave me a reasonably clean text. The only remaining problem was that the file had hard-coded end-of-line markers, and paragraphs were mostly separated by doubled end-of-line markers. Here perl was my friend
perl -pi -0777 -w -e 's/\n\n/ qq9 /g' infile.txt
to replace the paragraph breaks with qq9, a string that did not occur anywhere in the document. Then I used
perl -p -e 's/\n//g' infile.txt > outfile.txt
to take out the end-of-line markers, and
perl -p -e 's/qq9/\n /g' infile.txt > outfile.txt
to put back the paragraph breaks. (And yes, I used Stack Overflow.) I could have wrapped all of this up in a script, but working out the best order of operations was a bit iterative, and consequently I ran the individual operations in a terminal window.
At this point I opened the text in LibreOffice to check the formatting and remove a couple of headers garbled in the OCR process. If I had been pedantic I could then have spell-checked the document, but what I had was good enough to read and take notes from, so I simply used CloudConvert to turn the saved file into an epub.
Not perfect, but good enough.