Thursday, 16 August 2012

Geeking about with wordcloud...

People may be wondering why I've been geeking about with Wordcloud.

I actually do have a rational reason - information presentation. Research paper titles are sometimes wonderfully opaque and the keywords are sometimes not much better.

People also don't have a lot of time to read lots of abstracts.

So I thought, why not generate a wordcloud for the paper and store it in just the same way that we might store an image thumbnail?

Doing this is actually more complex than it might seem.

First of all, as my Gawain experiment showed, you really need a discipline-specific stopwords file. What the stopwords should be I don't know, but feeding in a whole lot of research papers on a particular topic and doing a frequency count should help generate a set of common discipline-specific terms that are essentially 'noise'. The human eyeball would also need to play a part - if you have a set of papers on primatology, for example, you don't want the term 'baboon' to end up in the noise file just because it's common.
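As a rough illustration of that frequency count, here is a minimal Python sketch. It is not what I actually ran: the function name `candidate_stopwords`, the `min_share` threshold and the `keep` list (standing in for the human eyeball) are all assumptions made up for the example. It counts how many papers each term appears in and proposes the terms that turn up in most of them as noise candidates.

```python
import re
from collections import Counter

def candidate_stopwords(papers, min_share=0.8, keep=()):
    """Propose noise terms: words found in at least min_share of the papers.

    `keep` is a hand-maintained list of terms a human reviewer wants to
    protect (hypothetical; it stands in for the manual review step).
    """
    doc_freq = Counter()
    for text in papers:
        # Count each term once per paper (document frequency).
        doc_freq.update(set(re.findall(r"[a-z]+", text.lower())))
    threshold = min_share * len(papers)
    protected = {term.lower() for term in keep}
    return sorted(
        term for term, n in doc_freq.items()
        if n >= threshold and term not in protected
    )

# Toy corpus, invented for the example.
papers = [
    "The baboon troop changed foraging behaviour during the drought.",
    "Foraging behaviour of the baboon in the dry season.",
    "The baboon and the study of social behaviour.",
]
noise = candidate_stopwords(papers, min_share=1.0, keep=["baboon"])
print(noise)
```

Without the `keep` list, 'baboon' would land in the noise file simply because every paper mentions it, which is exactly the problem described above.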

Equally you need to be able to classify papers in some way. Going back to my baboon paper, is a paper on changes in foraging behaviour in baboon troops as a result of drought ethology, ecology or climate science?

Hence the idea behind topic modelling and the 'fridge poetry' output - my idea is to do something like the following:

Feed the text of a research paper through topic modelling software. Compare the results with discipline-specific lists that you made earlier by feeding a whole set of papers sorted by discipline through the same software. This should give you some measure of 'likeness', and allow you to allocate the paper to no more than three fields of research.
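To make the 'likeness' step a bit more concrete, here is a hedged sketch. It does not use real topic modelling software: plain term counts stand in for topic model output, and cosine similarity is just one plausible way of scoring likeness. The discipline profiles and the names `term_vector`, `cosine` and `top_fields` are invented for the example.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Term counts; a crude stand-in for the output of a topic model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def top_fields(paper_text, discipline_profiles, limit=3):
    """Return up to `limit` (field, score) pairs, best match first."""
    paper = term_vector(paper_text)
    scores = {name: cosine(paper, profile)
              for name, profile in discipline_profiles.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, score) for name, score in ranked if score > 0][:limit]

# Hypothetical profiles; in practice these would come from sets of
# papers already sorted by discipline.
profiles = {
    "ethology": term_vector("behaviour foraging troop social grooming behaviour"),
    "ecology": term_vector("habitat foraging species population drought"),
    "climate science": term_vector("drought rainfall temperature climate"),
    "astronomy": term_vector("galaxy telescope orbit star"),
}
paper = "changes in foraging behaviour in baboon troops as a result of drought"
fields = top_fields(paper, profiles)
print([name for name, _ in fields])
```

Fields with no overlap at all are dropped rather than padded out to three, so a paper can end up with fewer than three fields of research.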

Then, taking the top-scoring field of research for a paper, feed the paper through the wordcloud software with the appropriate stopword list.
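The wordcloud software does the drawing, but the interesting part is what goes in. As a sketch only, the following computes the term weights a cloud would be drawn from once the discipline's stopwords are removed. The function name `cloud_weights` and the font size range are assumptions for the example, not anything a particular wordcloud tool requires.

```python
import re
from collections import Counter

def cloud_weights(text, stopwords, top_n=20, min_size=10, max_size=48):
    """Return (term, font_size) pairs for the most frequent non-noise terms."""
    noise = {word.lower() for word in stopwords}
    counts = Counter(
        word for word in re.findall(r"[a-z]+", text.lower())
        if word not in noise
    )
    top = counts.most_common(top_n)
    if not top:
        return []
    biggest = top[0][1]
    # Scale linearly so the most frequent term gets max_size.
    return [
        (term, min_size + (max_size - min_size) * n // biggest)
        for term, n in top
    ]

# Toy text and stopword list, invented for the example.
text = ("Baboon troops forage widely. During drought the baboon troops "
        "forage less and the baboon range shrinks.")
stopwords = {"the", "and", "during", "less", "widely"}
weights = cloud_weights(text, stopwords, top_n=3)
print(weights)
```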

This will give you a visual representation of the key themes in a paper, and allow people to rapidly flick through material and identify the papers they are interested in. Of course you also store the classification words to allow people to search for, to continue the example, 'climate + baboon'.
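Searching on the stored classification words could be as simple as the following sketch, which builds a small index from classification word to paper and treats '+' as 'and'. The paper identifiers, the stored words and the function names are all made up for illustration.

```python
from collections import defaultdict

def build_index(records):
    """Map each classification word to the set of papers tagged with it."""
    index = defaultdict(set)
    for paper_id, terms in records.items():
        for term in terms:
            index[term.lower()].add(paper_id)
    return index

def search(index, query):
    """Return the papers tagged with every word in a 'a + b' style query."""
    terms = [t.strip().lower() for t in query.split("+") if t.strip()]
    if not terms:
        return set()
    hits = [index.get(term, set()) for term in terms]
    return set.intersection(*hits)

# Hypothetical stored classification words.
records = {
    "paper-1": ["baboon", "climate", "ethology"],
    "paper-2": ["baboon", "ethology"],
    "paper-3": ["climate", "rainfall"],
}
index = build_index(records)
print(search(index, "climate + baboon"))
```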

I say papers. I do mean papers, but at the back of my mind is the fact that scientific communication is changing - blogs as research diaries are becoming important, videos of conference sessions are bubbling up and we need some sort of way of classifying them and producing an easy-to-read visual representation of the themes - for example here's one of one of my other blogs:

Is this a good idea? I honestly don't know. There is no substitute for reading the material, but finding relevant material has always been a problem. As publications move away from the established journals to self deposit in various repositories we need to think of ways to make research more discoverable.
