Twice in the last few days I've had an interactive text cleaning session: once with Ernesta's diary, and once with a 125-page listing of journal editor strings that had lacked some quality control, with leading spaces, double spaces between some entries and so on.
All easy enough to fix with sed to get a file where the names were nicely comma separated and hence easy to split into the individual names.
None of this is rocket science; mostly it's just
s/ and /,/
s/ & /,/
s/ ,/,/
s/,,/,/
and of course each time you do it the steps you follow are slightly different.
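For what it's worth, the same substitutions can be sketched in Python. This is only a minimal sketch, not the commands I actually ran: the `RULES` list and the `clean_line` and `split_names` functions are names made up for illustration. Unlike the sed expressions above, which without a `g` flag only change the first match on each line, this version replaces every occurrence and repeats until nothing changes.

```python
import re

# Hypothetical rule list mirroring the sed expressions above.
RULES = [
    (r" and ", ","),
    (r" & ", ","),
    (r" ,", ","),
    (r",,", ","),
]


def clean_line(line):
    """Strip the line, then apply every rule repeatedly until it stops changing."""
    line = line.strip()
    while True:
        new = line
        for pattern, repl in RULES:
            new = re.sub(pattern, repl, new)
        if new == line:
            return line
        line = new


def split_names(line):
    """Split a cleaned line on commas and drop empty entries."""
    return [name.strip() for name in clean_line(line).split(",") if name.strip()]
```

Calling `split_names` on a messy line should give back the individual names as a list, ready for whatever comes next.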
Most times I don't actually use sed; mostly I use gedit, which includes the same search and replace functionality. It could also be done interactively from the command line using perl or python, as I did when cleaning up Ernesta, when I felt lazy and raided Stack Overflow rather than working it out myself.
The crucial thing is of course that I don't actually have a record of what I did. I have notes of what I think I did, but these are reconstructed from a screenscrape of a terminal session and emails to colleagues. And if you use a tool like gedit, you don't get a record of what you've done at all.
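To show the sort of record I mean, here is a rough sketch in Python in which each substitution is logged as it is applied. The `logged_sub` function and the layout of the log entries are my own invention for illustration, and the sample string is made up; `re.subn` is the standard library call that returns both the new text and the number of replacements.

```python
import json
import re


def logged_sub(text, pattern, repl, log):
    """Apply one substitution and append what happened to the log."""
    new_text, count = re.subn(pattern, repl, text)
    log.append({"pattern": pattern, "replacement": repl, "matches": count})
    return new_text


log = []
text = "A Smith and B Jones & C Brown"  # made-up sample line
text = logged_sub(text, r" and ", ",", log)
text = logged_sub(text, r" & ", ",", log)

# Serialise the log so it could be saved alongside the cleaned file.
record = json.dumps(log, indent=2)
```

The point is simply that the record is a by-product of doing the work, rather than something reconstructed afterwards.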
The same goes for work done in R, such as my experiments to make a Middle Scots stopword list - while I'm sure I've archived my script somewhere, I don't have a record of the steps I actually took.
While it might be overkill, the answer is to use something like IPython notebooks as an interactive work environment - and of course they're not just for Python anymore, they're increasingly language agnostic.
So my little self-improvement project is to get to grips with IPython notebooks, which if nothing else should improve my Python skills ...