Wednesday, 15 August 2012

More fun with text analysis


A couple of days ago I blogged about my rather brain-dead experiments with topic modelling software and a couple of books by W G Burn Murdoch.

While the experiments were not mature they certainly taught me a lot about text mining and topic modelling.

Essentially we take a body of texts, strip out the common words (the a's, the the's, etc. that glue a sentence together) and search for statistically significant combinations of words that differ in abundance from their usage in a body of reference texts. Cluster analysis with attitude, in other words.
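For the curious, here is a rough sketch in Python of the idea as I've just described it. It is not what the topic modelling software actually does under the hood: it scores single words rather than combinations, and the score is a simple smoothed log-ratio of word frequencies rather than a proper significance test. The tiny stopword list and the function names (content_words, distinctive_words) are made up for illustration.

```python
import math
import re
from collections import Counter

# A tiny illustrative stopword list; real tools ship much longer ones.
STOPWORDS = {"a", "an", "the", "and", "of", "to", "in", "on", "is", "it", "was"}


def content_words(text):
    """Lowercase the text, split it into words, and drop the stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def distinctive_words(target_text, reference_text, top_n=5):
    """Rank words by how over-represented they are in the target text.

    The score is the log of the ratio of add-one smoothed relative
    frequencies (target versus reference). Positive means the word is
    more common in the target than in the reference.
    """
    target = Counter(content_words(target_text))
    reference = Counter(content_words(reference_text))
    vocab = set(target) | set(reference)

    # Add-one smoothing so that words missing from the reference
    # do not cause a division by zero.
    target_total = sum(target.values()) + len(vocab)
    reference_total = sum(reference.values()) + len(vocab)

    scores = {}
    for word in target:
        p_target = (target[word] + 1) / target_total
        p_reference = (reference[word] + 1) / reference_total
        scores[word] = math.log(p_target / p_reference)

    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]


if __name__ == "__main__":
    # Made-up sentences, purely to show the call.
    target = "The yak and the monastery on the pass. The yak was slow."
    reference = "The train was slow and the station was on the hill."
    for word, score in distinctive_words(target, reference):
        print(f"{word}\t{score:.2f}")
```

Feed it a book as the target and a pile of other books as the reference and the words at the top of the list are the ones the book uses unusually often. Real tools go a good deal further, but the stripping and counting steps look much like this.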

Just by coincidence there is an excellent recent post by Ted Underwood that reiterates and amplifies most of what I've worked out by myself.

Since my initial experiments I've confirmed my findings with Voyant. Basically I got the same results.

I was going to try the Stanford text mining tools as well, but I need to teach myself a little Scala first.

The point which I want to make is that this exploratory research (aka geeking about) was trivial on my part – I downloaded the software onto an old Dell laptop running Ubuntu, chmod'd stuff where appropriate and I was doing it. Installation and execution didn't demand a lot of effort.

People may of course object that I've been playing with computers for years and that these things are easy for me. Well, they are. But they are easy for everybody else too. And the uses you can make of them can be innovative.

I started with a fairly simple question. Because I understand a little about cluster analysis etc., I had no real problems in understanding what the data gave me – and incidentally came up with a different question: the role of critical reading in all of this.

However, the exercise does have some value in itself.

When I showed J the 'fridge poetry' topic lists and wordcloud stuff she immediately downloaded the software to her Mac and fed the Gutenberg Bronte texts through it, just out of curiosity, to see if it gave anything that looked useful as a supplement for teaching. She already had the texts lying around, as she was in the process of hacking out passages as discussion texts for her English Novels students. Questions like: why do we get these words and not their synonyms?

Now J is not technical – perhaps even less so than she needs to be as she's got me around – but again the effort involved in doing it was minimal – even less than that required on Ubuntu. In other words, she thought maybe this might be a fun approach – let's see if it spits out anything I can use ...

So the cost of curiosity with this stuff is minimal, as are the computing resources required.

There's a message in there somewhere ...
