Monday, 20 August 2012

So, what to do with this data stuff?

So, having said that we can treat the literary canon as data for some analyses, what is data?

My view is that it is all just stuff.

Historically researchers have never cared overmuch about data and its retention - the obvious exception being the social and health sciences, where people have built whole careers out of reanalysing survey data and combining sets of survey data.

In other disciplines attitudes have varied widely, often driven by journal requirements to retain data for substantiation purposes. And in the Sciences Humaines no one has really thought about their research material as data until very recently.

What we can be sure of is that the nature of scholarly output is changing, with the increased use of blogs to communicate work in progress, videos to present results and so on. I'm sure we can all come up with a range of examples. The established communication methods of the last 150 years are breaking down and changing. And to be fair they only really applied to the sciences and social sciences. In the Humanities and other disciplines they have never been the universal model for scholarly communication.

Likewise in teaching and learning, the rise of the VLE, with its implicit emphasis on the use of electronic media, has changed the way that learning resources are presented to students.

Data is just another resource and, being electronic, highly volatile. We can still read Speirs Bruce's Antarctic survey data because it was hand written in bound paper notebooks. Reading data stored as ClarisWorks spreadsheets written on a 1990s Mac is rather more complex - for a start you need a working machine and a copy of the software in order to export the files ...

However not all is total gloom; sometimes one can recover the data.

Just recently I've helped recover some data from the 1980s. It was written on magnetic tape by machines built by a manufacturer that no longer exists. Fortunately the tapes had been written in a standard tape format which could be read by a specialist data recovery company, the files had reasonably self-describing names, and the original people who had carried out the research were still around to explain what the files were.

In 10 years' time recovering this data might well be near impossible.

Once recovered, looking after the data is relatively simple - electronic resources are just ones and zeros. No files or content need special handling in order to preserve them – the techniques to enable this are well understood, as are those to provide large data stores.
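One of those well understood techniques is a fixity check: record a checksum when the data is deposited and recompute it later to confirm that the ones and zeros have not changed. The following is only a minimal sketch; the function names are invented for illustration, and the choice of SHA-256 is an assumption rather than anything prescribed here.

```python
import hashlib
import io
from typing import BinaryIO


def fixity_digest(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(stream: BinaryIO, recorded_digest: str) -> bool:
    """True if the stream still matches the digest recorded at deposit."""
    return fixity_digest(stream) == recorded_digest


# Illustrative use with in-memory bytes standing in for a recovered file.
recorded = fixity_digest(io.BytesIO(b"ones and zeros"))
print(is_unchanged(io.BytesIO(b"ones and zeros"), recorded))
```

For a real file the stream would come from `open(path, "rb")`, and the recorded digest would be stored alongside the rest of the metadata.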

It is the accompanying metadata that adds value and makes content discoverable. And that of course is where the people come in. Electronic resources are incomprehensible without documentation, be it technical information on file format and structure or contextual information.

So, if we are to attempt to preserve legacy data we need to start now while the people who can explain and help document the data are still around.

It also means that you have to make the process of deposit easy. Users will not understand the arcane classification rules we may come up with based on discipline or data type.

While it is perfectly possible, and indeed sensible, to implement a whole range of specialist collections and repositories, if you are taking a global view you need to start with a service of last resort. Such a service should be designed to be agnostic as regards content, and should also be able to hold content by reference, i.e. it can hold the metadata without any restrictions as to type and point to the content stored on some other filesystem not managed by it.

It is essentially a digital asset management system. It is not a universal solution to enable all things to all people.
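As a rough sketch of what "holding content by reference" could look like, the record below keeps free-form metadata and a pointer to wherever the content actually lives. Every name here (`AssetRecord`, its fields, the example identifier and location) is hypothetical and only meant to illustrate the idea.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class AssetRecord:
    """Metadata for one deposited item; the content itself may live elsewhere."""

    identifier: str
    content_location: str        # URI or path, not necessarily managed by this service
    managed_here: bool = False   # False means the content is held by reference only
    metadata: Dict[str, Any] = field(default_factory=dict)  # no restriction on keys or types


# Hypothetical example: the service stores only the description and the pointer.
record = AssetRecord(
    identifier="example-0001",
    content_location="file:///some/other/filesystem/survey.dat",
    metadata={
        "title": "Recovered survey data",
        "format": "fixed-width text",
        "contact": "original research team",
    },
)
print(record.managed_here, record.metadata["format"])
```

Keeping the metadata as an unrestricted dictionary is what makes the service agnostic about content; anything stricter belongs in the specialist collections.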

It is perfectly sensible to group digital assets in a variety of ways that reflect the Institution's internal processes. PDFs of research papers live in ePrints. Learning objects live in the learning object repository, should we have one. A specialist collection of East Asian texts lives in a dedicated collection, etc.

This means three things:

1) A framework round data governance and data management is a must have. It needs to say things like 'if it's not in a centrally managed repository you need to make sure it's backed up and has a sensible management plan'
2) The institution concerned requires a collection registry that holds information about all the collections and where they are located to aid searching and discovery
3) We need as simple and as universal a submission and ingest process as possible. If it's not simple for people to use, people won't use it. We might have a model as to what goes where and have demarcation disputes, but these are irrelevant to the users of the system(s)
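The collection registry in point 2 can be pictured as little more than a lookup table of collections, their locations and what they hold. This is a minimal sketch with made-up entries: the collection names echo the examples above, but the locations and the `find_collections` helper are assumptions for illustration only.

```python
from typing import Dict, List

# Hypothetical registry: collection name -> where it lives and what it holds.
REGISTRY: Dict[str, Dict[str, str]] = {
    "eprints": {
        "location": "https://eprints.example.org",
        "holds": "PDFs of research papers",
    },
    "learning-objects": {
        "location": "https://lor.example.org",
        "holds": "learning objects",
    },
    "east-asian-texts": {
        "location": "https://collections.example.org/east-asian",
        "holds": "specialist East Asian collection",
    },
    "last-resort": {
        "location": "https://assets.example.org",
        "holds": "anything without a more specific home",
    },
}


def find_collections(term: str) -> List[str]:
    """Return the names of collections whose name or description mentions the term."""
    needle = term.lower()
    return sorted(
        name
        for name, entry in REGISTRY.items()
        if needle in name.lower() or needle in entry["holds"].lower()
    )


print(find_collections("texts"))
```

The point of the sketch is that discovery works across collections without the user needing to know the demarcation rules.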

Text Analysis, neither snake oil nor a cure all ...

A lot of this text analysis stuff is about treating text documents as data.

And while you can get valuable insights from these analyses it's important to understand that the original creators of these documents were not creating content or depositing data. Jane Austen did not create content. She wrote novels.

When she set out to write these novels, which we could describe as comedies of manners in the main, she inadvertently described the society in which she lived, one in which, for women, a 'good' marriage was necessary to ensure financial security, and one in which communication and travel were difficult, leading to a small and compressed social circle.

Critical reading of these novels allows us to build a portrait of how they lived.

I've picked on Jane Austen as an example, but I could just as easily have chosen Aristophanes or Juvenal.

It is important to understand that the two approaches are complementary. When, for example, I used the Google Ngram viewer to plot the use of the term Burmah, I could get some measure of the significance of the term to the ordinary reader at the time.

It doesn't tell you anything about how colonial society functioned.

This isn't of course to rubbish topic modelling or other such techniques. They let you identify topics of concern within a corpus, just as looking at the frequency of medieval property transfers might identify times of social turmoil and change.

So we need to be critical in our approach. Topic modelling and other text mining techniques are now possible due to the sheer amount of digitised text available, and they definitely give an index of popular concern.

They are however not a substitute for critical analysis. Rather, they complement it ...

Thursday, 16 August 2012

Geeking about with wordcloud...

People may be wondering why I've been geeking about with Wordcloud.

I actually do have a rational reason - information presentation. Research paper titles are sometimes wonderfully opaque and the keywords are sometimes not much better.

People also don't have a lot of time to read lots of abstracts.

So I thought, why not generate a wordcloud on the paper and store it just in the same way that we might store an image thumbnail.

Doing this is actually more complex than it might seem.

First of all, as my Gawain experiment showed, you really need a discipline specific stopwords file. What the stopwords should be I don't know, but feeding in a whole lot of research papers on a particular topic and doing a frequency count should help generate a set of common discipline specific terms that are essentially 'noise'. The human eyeball would also need to play a part - if you have a set of papers on primatology, for example, you don't want the term 'baboon' to end up in the noise file just because it's common.
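The frequency count plus human eyeball could be sketched roughly as follows. This is an illustration under assumptions, not a tested recipe: the function names, the 80% threshold and the sample sentences are all invented, and "common" is taken to mean "appears in most of the papers".

```python
import re
from collections import Counter
from typing import AbstractSet, Iterable, List


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def candidate_noise_terms(
    papers: Iterable[str],
    min_share: float = 0.8,
    protected: AbstractSet[str] = frozenset(),
) -> List[str]:
    """Terms found in at least min_share of the papers, minus hand-protected terms."""
    doc_counts: Counter = Counter()
    total = 0
    for paper in papers:
        total += 1
        doc_counts.update(set(tokenize(paper)))  # count each term once per paper
    if total == 0:
        return []
    return sorted(
        term
        for term, count in doc_counts.items()
        if count / total >= min_share and term not in protected
    )


# Invented sample "papers"; 'baboon' is common but protected by the human eyeball.
papers = [
    "The troop of baboon foragers was observed during the study period.",
    "Results of the study show baboon ranging shifted.",
    "This study of baboon diet used the standard methods.",
]
print(candidate_noise_terms(papers, min_share=1.0, protected={"baboon"}))
```

The `protected` set is where the eyeball comes in: the count proposes candidates, and a person decides which of them carry meaning in that discipline.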

Equally you need to be able to classify papers in some way. Going back to my baboon paper, is a paper on changes in foraging behaviour in baboon troops as a result of drought a piece of ethology, ecology or climate science?

Hence the idea behind topic modelling and the 'fridge poetry' output - my idea is to do something like the following:

Feed the text of a research paper through topic modelling software. Compare the results with discipline specific lists that you made earlier by feeding a whole set of papers sorted by discipline through the same software. This should give you some measure of 'likeness', and allow you to allocate the paper to no more than three fields of research.
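One simple way to get a measure of 'likeness' is the overlap between the paper's topic words and each discipline list. The sketch below uses Jaccard similarity, which is an assumption chosen for simplicity, not something the topic modelling software provides; the field word lists and paper terms are invented.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_fields(
    paper_terms: Set[str],
    field_terms: Dict[str, Set[str]],
    limit: int = 3,
) -> List[Tuple[str, float]]:
    """Score each field against the paper and keep at most `limit` non-zero matches."""
    scores = [(name, jaccard(paper_terms, terms)) for name, terms in field_terms.items()]
    scores = [pair for pair in scores if pair[1] > 0]
    scores.sort(key=lambda pair: (-pair[1], pair[0]))  # best score first, then by name
    return scores[:limit]


# Invented discipline lists and topic words for the baboon paper.
field_terms = {
    "ethology": {"behaviour", "troop", "foraging", "dominance"},
    "ecology": {"habitat", "foraging", "drought", "rainfall"},
    "climate science": {"drought", "rainfall", "temperature", "model"},
    "linguistics": {"syntax", "phoneme", "corpus", "grammar"},
}
paper_terms = {"baboon", "troop", "foraging", "behaviour", "drought"}

for name, score in rank_fields(paper_terms, field_terms):
    print(f"{name}: {score:.2f}")
```

With these made-up lists the paper comes out as ethology first, then ecology, then climate science, which is the kind of multiple allocation described above.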

Then, taking the top scoring field of research for a paper, feed it through the wordcloud software with the appropriate stopword list.

This will give you a visual representation of the key themes in a paper, and allow people to rapidly flick through material and identify the papers they are interested in. Of course you also store the classification words to allow people to search for, to continue the example, 'climate + baboon'.

I say papers. I do mean papers, but at the back of my mind is the fact that scientific communication is changing - blogs as research diaries are becoming important, videos of conference sessions are bubbling up and we need some sort of way of classifying them and producing an easy to read visual representation of the themes - for example here's a wordcloud of one of my other blogs:

Is this a good idea? I honestly don't know. There is no substitute for reading the material, but finding relevant material has always been a problem. As publications move away from the established journals to self deposit in various repositories we need to think of ways to make research more discoverable.

Gawayne the Green Knight meets the wordcloud

I've always had an affection for Sir Gawayne and the Green Knight (the first bit of real Middle English I read), so I thought that as a final act of fiddling about with wordcloud I'd feed the Gutenberg version into the IBM wordcloud software just to see what came out

which neatly demonstrates the need for a proper Middle English stopwords file. Hacking my original file to produce an extended, though very incomplete, file one gets something a little better:

which shows that one of the things we need to take this outside of playing with nineteenth and twentieth century English text is a set of agreed stopword files for analyses.

This would clearly also apply to analyses with other languages, be it Malay or Old Irish...

Wednesday, 15 August 2012

Chaucer wordcloud

And finally,

for fun I fed the Gutenberg Collected works of Chaucer into the wordcloud software ...

this is actually quite interesting.

I didn't have a Middle English stopwords file so of course we see that common forms of speech (ye, thou, thee, thy etc) predominate. So I made myself a very simple supplementary stopwords file consisting of the obvious bits of Middle English in the wordcloud (thee, thy, thou, ye, eke, gan) and then reran the generation process:

which I think we can agree is possibly a bit better though it needs more work - for example quoth, hath, anon and may should probably be excluded.

Using an extended stopwords list one can come up with something like this:

which is possibly a more accurate model of Chaucer's drivers. I must say that I'm quietly impressed with the power of this to display the themes in a body of text ...
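The effect of the supplementary stopwords file can be sketched as a word count with and without the extra list. This is not how the IBM wordcloud software works internally, just an illustration of the filtering step; the base stopword list is a tiny stand-in, and the sample line is invented rather than quoted from Chaucer.

```python
import re
from collections import Counter
from typing import AbstractSet, List, Tuple

# Tiny stand-in for a modern English stopword list (an assumption for illustration).
BASE_STOPWORDS = {"the", "and", "of", "to", "a", "in", "that"}
# The supplementary Middle English words listed in the post.
MIDDLE_ENGLISH_EXTRA = {"thee", "thy", "thou", "ye", "eke", "gan"}


def word_weights(text: str, stopwords: AbstractSet[str], top: int = 10) -> List[Tuple[str, int]]:
    """Count the words that survive the stopword filter; counts act as cloud weights."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word for word in words if word not in stopwords).most_common(top)


# Invented sample line, not a quotation.
sample = "thou knight and thy lady, ye knight of love, eke the lady gan love the knight"

print(word_weights(sample, BASE_STOPWORDS))
print(word_weights(sample, BASE_STOPWORDS | MIDDLE_ENGLISH_EXTRA))
```

With only the base list the pronouns and fillers clutter the result; adding the supplementary list leaves just the content words, which is the improvement seen in the second cloud.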

Wuthering Heights wordcloud

and just for fun this is what you get when you feed Wuthering Heights into the IBM wordcloud software:

More fun with text analysis

A couple of days ago I blogged about my rather brain dead experiments with topic modelling software and a couple of books by W G Burn Murdoch.

While the experiments were not mature they certainly taught me a lot about text mining and topic modelling.

Essentially we take a body of texts, strip out the common words (the a's the the's etc that glue a sentence together) and search for statistically significant combinations of words that differ in abundance from their usage in a body of reference texts. Cluster analysis with attitude in other words.
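The "differ in abundance from a body of reference texts" step can be illustrated for single words with a log-likelihood comparison. This is a sketch under assumptions: the statistic is one common choice for comparing word frequencies between two corpora, and it is not what MALLET does internally (MALLET fits an LDA topic model). The function names and toy texts are invented.

```python
import math
import re
from collections import Counter
from typing import AbstractSet, List, Tuple


def tokens(text: str, stopwords: AbstractSet[str]) -> List[str]:
    """Lower-case word tokens with the glue words stripped out."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stopwords]


def log_likelihood(a: int, b: int, c: int, d: int) -> float:
    """G2 for a word seen a times in c target tokens and b times in d reference tokens."""
    expected_target = c * (a + b) / (c + d)
    expected_reference = d * (a + b) / (c + d)
    total = 0.0
    if a > 0:
        total += a * math.log(a / expected_target)
    if b > 0:
        total += b * math.log(b / expected_reference)
    return 2 * total


def overused_terms(
    target: str, reference: str, stopwords: AbstractSet[str], top: int = 10
) -> List[Tuple[str, float]]:
    """Words relatively more frequent in the target than the reference, ranked by G2."""
    t_counts = Counter(tokens(target, stopwords))
    r_counts = Counter(tokens(reference, stopwords))
    c, d = sum(t_counts.values()), sum(r_counts.values())
    if c == 0 or d == 0:
        return []
    scored = [
        (word, log_likelihood(a, r_counts.get(word, 0), c, d))
        for word, a in t_counts.items()
        if a / c > r_counts.get(word, 0) / d
    ]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top]


# Toy texts: 'seal' is heavily over-represented in the target.
print(overused_terms("seal seal seal ice ship", "ship ship ice tea tea tea", frozenset()))
```

A higher score means the word's frequency in the target departs more strongly from what the reference texts would lead you to expect.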

Just by coincidence there is an excellent recent post by Ted Underwood that reiterates and amplifies most of what I've worked out by myself.

Since my initial experiments I've confirmed my findings with Voyant. Basically I got the same results.

I was going to try the Stanford text mining tools as well, but I need to teach myself a little Scala first.

The point which I want to make is that this exploratory research (aka geeking about) was trivial on my part – I downloaded the software onto an old Dell laptop running Ubuntu, chmod'd stuff where appropriate and I was doing it. Installation and execution didn't demand a lot of effort.

People may of course object that I've been playing with computers for years and that these things are easy for me. Well they are. But they are easy for everybody else too. And the uses you can make of them can be innovative.

I started with a fairly simple question. Because I understand a little about cluster analysis etc I had no real problems in understanding what the data gave me – and incidentally came up with a different question – the role of critical reading in all of this.

However the exercise does have some value in itself.

When I showed J the 'fridge poetry' topic lists and wordcloud stuff she immediately downloaded the software to her Mac and fed the Gutenberg Bronte texts through it – she already had them lying around as she was in the process of hacking out passages as discussion texts for her English Novels students – just out of curiosity, to see if it gave anything that looked useful as a supplement for teaching. Questions like why do we get these words and not their synonyms.

Now J is not technical – perhaps even less so than she needs to be as she's got me around – but again the effort involved in doing it was minimal – even less than that required on Ubuntu. In other words she thought maybe this might be a fun approach – let's see if it spits out anything I can use ...

So the cost of curiosity with this stuff is minimal as are the computing resources required.

There's a message in there somewhere ...

Monday, 13 August 2012

W G Burn Murdoch meets topic modelling ...

Quite some time ago I blogged about W G Burn Murdoch's From Edinburgh to Burmah, chronicling a trip he made in 1908.

As well as being an enjoyable bit of Edwardian travel writing, one thing that struck me was the writer's developing sense of Scottishness and also his sympathy with the Burmese people over the annexation of Upper Burma.

At the time I suggested that one could trace a Scottishness meme through his friendship with William Speirs Bruce and the Scottish Antarctic Expedition.

So today I decided to test this out.

First of all I downloaded and installed the GUI version of the MALLET topic modelling software and fed the texts of both his books through it. Some beautiful fridge poetry resulted but not much of a hint of Scottishness.

Edinburgh to Burmah topics:

1. water sand great left grey soft burmah till evening top
2. home sea made good night indian things natives pretty dinner
3. light hot country burmese brown told thought royal figures faces
4. white black time morning dark board long put head steps trees back yellow small colours feel train flowers notes
6. people men man air women fish music chinese young coloured colour sun house full prince deck golden hair low half gold sky miles big shore pass hand days
9. side feet work project high ladies native gutenberg pleasant make
10. river round green india place night line open east ground

Edinburgh to Antarctica topics:

1. great night boat seals feet doctor till found called hours white man illustrations south put weather warm ship op
3. water snow grey left seal line hard option balaena thought
4. antarctic vo edinburgh round vols back crew svo fcp red
5. wind air work boats world mate sea top brought cabin
6. long men small blue islands sky birds penguins rev turned
7. sea made black life cr home half dark colour green day days light whale north amp bergs whales sun
9. board ship good works lay heard cold pack making end
10. time land deck make head side skins mist hands sir

The second list is a little odd as Burn Murdoch's Antarctica book is a non-proofread OCR version, which contains what are obviously font or word recognition errors.

So to sanity check what I was seeing I installed the IBM wordcloud software and fed the books through that. Again, not a hint of Scottishness stood out.

Edinburgh to Antarctica wordcloud

Edinburgh to Burmah wordcloud

Now I'm not about to rubbish topic modelling as a technique; however, it is possibly not a complete substitute for critical reading. In his 1908 Burmah book I certainly got a sense of W G's developing sense of Scottishness as opposed to Britishness, and that this informed his feelings about Upper Burmah. It doesn't show up in these analyses.
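The eyeball check of the topic lists can also be done mechanically, by looking for a handful of seed words in each topic. The sketch below uses three of the Burmah topics as listed above; the seed words are a guess at what a 'Scottishness' theme might look like, and the helper name is invented.

```python
from typing import Dict, List, Set


def topics_mentioning(topics: Dict[int, List[str]], seeds: Set[str]) -> Dict[int, List[str]]:
    """Map topic number -> seed words found in that topic (topics with no hits are omitted)."""
    hits: Dict[int, List[str]] = {}
    for number, words in topics.items():
        found = [word for word in words if word in seeds]
        if found:
            hits[number] = found
    return hits


# Topics 1 to 3 from the Edinburgh to Burmah list above.
burmah_topics = {
    1: "water sand great left grey soft burmah till evening top".split(),
    2: "home sea made good night indian things natives pretty dinner".split(),
    3: "light hot country burmese brown told thought royal figures faces".split(),
}

# Hypothetical seed words for the theme being looked for.
scottish_seeds = {"scotland", "scottish", "scots", "scotch", "highland"}

print(topics_mentioning(burmah_topics, scottish_seeds))
```

An empty result only says that none of the seed words surfaced as topic keywords, which is consistent with the theme being something a reader picks up from tone and context rather than vocabulary.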

And that I think is important. Applied to newspaper reports or scientific publications it quite clearly can pull out important themes. What it doesn't pull out is the subjective and impressionistic ...

Monday, 6 August 2012

Reading books on the bus on the phone

Way back in September last year I blogged about what sort of devices people were using to read books on the bus.

A few days ago I had to drop the car off for an oil change, so I caught the bus into work. The thing that struck me was the number of young Asian women (I'd guess the usual Canberra mix of Vietnamese, Chinese and Korean) who were reading books on their smartphones.

It of course makes perfect sense - you carry one device, and getting books in your preferred language, let's say Chinese for the sake of argument, as ebooks means that you can get the latest novel/romance or whatever from an online store without having to trawl round local Chinese language bookstores and wait for them to get a delivery.

It's not just an Asian phenomenon.

J is teaching nineteenth century novels this semester, and was about to berate a girl for fiddling with her phone during a session, thinking that Emily Bronte had been displaced by Facebook.

But no, the student said she'd forgotten her version of the text, but she'd found a copy on Project Gutenberg, downloaded it and was now flicking through to find the section under discussion (no page numbers you see ...)