Thursday, 12 September 2013

Making a Middle Scots stopword file ...

Over a year ago I played with topic modelling and wordclouds. The reason for doing so has never quite gone away, and a year on I thought I'd better teach myself how to do it properly using R.

Now one of the things I found when I played about with wordclouds was that if you feed Middle English text into a wordcloud it helps to have a Middle English stopword file.

Playing with the Gutenberg version of Troilus and Criseyde, I found it was quite easy, using R, to come up with a stopwords file based on the 100 most common words in the text, excluding the names of the protagonists.
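For anyone wanting to try this, a minimal sketch in base R might look something like the following. The file name and the list of protagonists' names are just placeholders, not the exact ones I used.

    # read the Gutenberg plain text and fold to lower case
    txt <- tolower(readLines("troilus.txt", warn = FALSE))
    # split on anything that isn't a letter to get individual words
    words <- unlist(strsplit(txt, "[^a-z]+"))
    words <- words[words != ""]
    # drop the protagonists' names so they don't swamp the stopword list
    protagonists <- c("troilus", "criseyde", "pandarus", "pandare", "diomede")
    words <- words[!(words %in% protagonists)]
    # frequency table, most common first
    freq <- sort(table(words), decreasing = TRUE)
    # take the hundred most common words as the stopword list
    writeLines(names(freq)[1:100], "troilus_stopwords.txt")

The same frequency table is all you need for a wordcloud later on - just remove the stopwords first.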

The choice of 100 words is purely arbitrary - Ranks.nl links to some example stopword files for modern English and they sit around the (200 +/- 50) mark. Chaucer used just over 5600 distinct words in Troilus, so we'll assume that the hundred most common words make a valid stopword list. (In fact, applying the eyeball test, a stopword list of around 70 is probably close enough.)

Now, a stopword list based on a single poem might be interesting, but it's not very useful. You need a number of poems to come up with a stopwords file that's valid for a particular author.

Then you can do such tricks as comparing the frequency of words (minus the stopwords) between poems. If one has a very different distribution of words it might be by a different author.
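As a rough sketch of what that comparison might look like in R (the helper and the file names here are hypothetical, and rank correlation is just one of several ways you could compare the distributions):

    # relative word frequencies for one poem, with stopwords removed
    word_freq <- function(file, stopwords) {
      txt <- tolower(readLines(file, warn = FALSE))
      words <- unlist(strsplit(txt, "[^a-z]+"))
      words <- words[words != "" & !(words %in% stopwords)]
      table(words) / length(words)
    }

    stopwords <- readLines("middle_scots_stopwords.txt")
    a <- word_freq("poem_a.txt", stopwords)
    b <- word_freq("poem_b.txt", stopwords)

    # compare the two distributions over the vocabulary they share;
    # a low rank correlation hints that the poems use words very differently
    shared <- intersect(names(a), names(b))
    cor(as.numeric(a[shared]), as.numeric(b[shared]), method = "spearman")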

So, having discovered how to make a stopwords file, I thought I'd make one for Middle Scots and then see if I can find frequency differences between various poems by various authors, as well as using it to generate wordclouds.

For the corpus I chose the works in the Oxford Text Archive Early Scottish Texts archive. I chose Middle Scots quite deliberately, as (a) it was different enough from contemporary English in its spelling to treat as if it were a different language, (b) there was a decent body of online text available, and (c) it didn't do anything complicated with word endings other than using -is for a plural rather than -s.

This meant that I could use standard off-the-shelf programs written for contemporary English, but simulate using them on a different language with all the default assumptions turned off, rather than relying on someone else's choice of stopwords.

The files came with some angle-bracket-delimited non-standard markup, which was probably intended to be read by some other program. I wrote a simple Perl script to strip this markup, remove irrelevant bracketed inline text such as ( ITM ) and a few other stray characters, and, while I was at it, convert the files to lower case for later processing.
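The script itself was a straightforward Perl affair; as a rough idea of the sort of cleanup involved, the R equivalent below does much the same thing (the patterns are illustrative rather than the exact ones used):

    raw <- readLines("text_raw.txt", warn = FALSE)
    clean <- gsub("<[^>]*>", "", raw)                      # strip the angle-bracket markup
    clean <- gsub("\\([^)]*\\)", "", clean)                # drop bracketed inline text such as ( ITM )
    clean <- gsub("[^[:alpha:][:space:]'-]", "", clean)    # remove stray characters
    clean <- tolower(clean)                                # lower case for later processing
    writeLines(clean, "text_clean.txt")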

I didn't try to fix any orthographic quirks - I made the assumption that all the likely stopwords would be words in common use with an agreed spelling. Given that I'd ended up with a sample of around 860,000 words I was running on the basis that any really common variant would probably turn up in the stopwords file.

After some final processing with R the source text contained just under 50,000 unique items, which is probably a rich enough corpus, although this may be masking orthographic quirks. The resulting stopword list consists of the first 200 words in the frequency list.
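The last step is trivial once you have the frequency table (again a sketch, assuming freq is the sorted frequency table built from the cleaned corpus, and the output file names are just suggestions):

    # the first 200 entries become the Middle Scots stopword list
    writeLines(names(freq)[1:200], "middle_scots_stopwords.txt")
    # and the 500 most common words, with their counts, go out as a csv
    top500 <- data.frame(word = names(freq)[1:500], count = as.integer(freq[1:500]))
    write.csv(top500, "middle_scots_top500.csv", row.names = FALSE)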

Of these, I'd say four are words you might possibly wish to exclude:

prince (444th most common)
kingis (462nd)
knycht (565th)
lordis (526th)

The csv file containing the 500 most common words is also available for download if you wish to make your own decisions as to what should be in the stopwords file ...
