Wednesday, 14 October 2015

ipython notebooks and text cleaning

Twice in the last few days I've had an interactive text cleaning session - once with Ernesta's diary and once with a 125-page listing of journal editor strings that lacked some quality control: leading spaces, double spaces between some entries and so on.

All of this was easy enough to fix with sed, giving a file where the names were cleanly comma separated and hence easy to split into individual names.

None of this is rocket science; mostly it's just

s/ and /,/g
s/ & /,/g
s/ ,/,/g
s/,,/,/g


and of course each time you do it the steps you follow are slightly different.

Most of the time I don't use sed; I use gedit, which includes the same search and replace functionality. It could also be done interactively from the command line using perl or python, as I did when cleaning up Ernesta, when I felt lazy and raided Stack Overflow rather than working it out myself.
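As a rough illustration of the python route, here is a minimal sketch that applies the same four substitutions and then splits a line into names. The names RULES, clean_line and split_names and the sample line are made up for the example, not taken from the actual files, and re.sub replaces every match on a line rather than just the first.

```python
import re

# Substitution rules mirroring the sed expressions above, applied in order.
# (Hypothetical names; the sample data below is invented for illustration.)
RULES = [
    (r" and ", ","),
    (r" & ", ","),
    (r" ,", ","),
    (r",,", ","),
]


def clean_line(line):
    """Apply each rule in turn to every match on the line."""
    for pattern, replacement in RULES:
        line = re.sub(pattern, replacement, line)
    return line


def split_names(line):
    """Clean a line, split it on commas, and tidy each name."""
    names = []
    for part in clean_line(line).split(","):
        # Collapse runs of whitespace and drop leading/trailing spaces.
        name = " ".join(part.split())
        if name:
            names.append(name)
    return names


sample = " A. Smith and B. Jones & C. Brown ,,D.  Green"
print(split_names(sample))
```

Keeping the rules in a list means the order they run in is written down, which is most of what gets lost when doing this by hand.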

The crucial thing is that I don't actually have a record of what I did. I have notes of what I think I did, but these are reconstructed from a screenscrape of a terminal session and emails to colleagues. And if you use a tool like gedit, you don't get a record of what you've done at all.

The same goes for work done in R, such as my experiments to make a Middle Scots stopword list - while I'm sure I've archived my script somewhere, I don't have a record of what I actually did.

While it might be overkill, the answer is to use something like ipython notebooks as an interactive work environment - and of course they're not just for python anymore - they're increasingly language agnostic.
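To give a feel for what that record might look like, here is a small sketch of the sort of cell one could keep in a notebook: the rules are held as data and each step logs how many replacements it made. The function name apply_and_log, the rule list and the sample string are all assumptions for illustration, not something I've actually run on the editor listing.

```python
import re


def apply_and_log(text, rules):
    """Apply (pattern, replacement) rules in order and record what each did."""
    log = []
    for pattern, replacement in rules:
        # re.subn returns the new text plus the number of replacements made.
        text, count = re.subn(pattern, replacement, text)
        log.append({"pattern": pattern, "replacement": replacement, "count": count})
    return text, log


# Hypothetical rule list: collapse double spaces, then normalise separators.
rules = [
    (r" {2,}", " "),
    (r" and ", ","),
    (r" & ", ","),
    (r" ,", ","),
    (r", ", ","),
]

cleaned, log = apply_and_log("A. Smith and B. Jones &  C. Brown , D. Green", rules)
print(cleaned)
for entry in log:
    print(entry)
```

Saved alongside the output, the log is the record of what was done, in the order it was done.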

So my little self improvement project is to get to grips with ipython notebooks, which if nothing else should improve my python skills ...
