A couple of days ago I blogged
about my rather brain dead experiments with topic modelling
software and a couple of books by W G Burn Murdoch.
While the experiments were not mature
they certainly taught me a lot about text mining and topic modelling.
Essentially we take a body of texts,
strip out the common words (the a's, the the's etc. that glue a
sentence together) and search for statistically significant
combinations of words that differ in abundance from their usage in a
body of reference texts. Cluster analysis with attitude in other
words.
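
To make that description concrete, here is a minimal Python sketch of the "differs in abundance from a reference" step. It is an assumption-laden toy, not what any of the packages I tried actually do: the stop list is a tiny made-up one, the function names (tokenize, distinctive_words) are hypothetical, and the scoring is a simple smoothed log ratio of word frequencies rather than a real topic model, which does considerably more.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop list (an assumption); real tools ship much longer ones.
STOP_WORDS = {"a", "an", "the", "and", "of", "to", "in", "is", "it", "that", "on", "was"}


def tokenize(text):
    """Lowercase the text and return its words with stop words removed."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def distinctive_words(target, reference, top_n=5):
    """Rank words in target by how over-represented they are versus reference.

    Score is the log ratio of add-one-smoothed relative frequencies.
    """
    t_counts = Counter(tokenize(target))
    r_counts = Counter(tokenize(reference))
    vocab = set(t_counts) | set(r_counts)
    # Add-one smoothing: every vocabulary word gets one extra count in each text.
    t_total = sum(t_counts.values()) + len(vocab)
    r_total = sum(r_counts.values()) + len(vocab)
    scores = {}
    for word in t_counts:
        t_rate = (t_counts[word] + 1) / t_total
        r_rate = (r_counts[word] + 1) / r_total
        scores[word] = math.log(t_rate / r_rate)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


target = "the whale and the ship and the whale on the sea"
reference = "the garden and the house and the ship in the town"
print(distinctive_words(target, reference, top_n=2))
```

Words that turn up in both texts at similar rates score near zero, which is why the reference body matters: it is what stops the merely common words from floating to the top.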
Just by coincidence there is an
excellent
recent post by Ted Underwood that reiterates and amplifies most
of what I've worked out by myself.
Since my initial experiments I've
confirmed my findings with Voyant. Basically I got the same results.
I was going to try the Stanford
text mining tools as well, but I need to teach myself a little
Scala first.
The point which I want to make is that
this exploratory research (aka geeking about) was trivial on my part
– I downloaded the software onto an old Dell laptop running Ubuntu,
chmod'd stuff where appropriate and I was doing it. Installation and
execution didn't demand a lot of effort.
People may of course object that I've
been playing with computers for years and that these things are easy
for me. Well, they are. But they are just as easy for everybody else. And the uses
you can make of them can be innovative.
I started with a fairly simple
question. Because I understand a little about cluster analysis etc I
had no real problems in understanding what the data gave me – and
incidentally came up with a different question – the role of
critical reading in all of this.
However the exercise does have some
value in itself.
When I showed J the 'fridge poetry'
topic lists and wordcloud stuff she immediately downloaded the
software to her Mac and fed the Gutenberg Bronte texts through it –
she already had them lying around as she was in the process of hacking
out passages as discussion texts for her English Novels students –
just for curiosity to see if it gave anything that looked useful as a
supplement for teaching. Questions like why do we get these words and
not their synonyms.
Now J is not technical – perhaps even
less so than she needs to be as she's got me around – but again the
effort involved in doing it was minimal – even less than that
required on Ubuntu. In other words she thought maybe this might be a fun
approach – let's see if it spits out anything I can use ...
So the cost of curiosity with this
stuff is minimal as are the computing resources required.
There's a message in there somewhere
...