Well, having made a stopwords file, the thing to do is test it.
I chose to use the text of Barbour's Brus, as the Oxford Text Archive copy was fairly clean of inline markup, clean enough to fix by hand rather than modifying my original text cleaning code.
The first time around the results were not quite what I expected:
so I modified the stopwords list by removing the following from the list:
king
kingis
lordis
lord
haly
and adding:
fayis
ner
yen
schyr
yan
yis
gan
towart
swa
her
gert
which gave a better representation:
A little more tweaking might be required, but it has promise as a technique.
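The adjustment described above amounts to a couple of set operations. A minimal sketch in Python follows; the function names `adjust_stopwords` and `filter_tokens` are hypothetical and this is not the original cleaning code, only an illustration of removing and adding the words listed earlier.

```python
# Words taken out of the stopword list (they carry meaning in the Brus).
REMOVE = {"king", "kingis", "lordis", "lord", "haly"}

# Words added to the stopword list (noise words the first pass missed).
ADD = {"fayis", "ner", "yen", "schyr", "yan", "yis",
       "gan", "towart", "swa", "her", "gert"}


def adjust_stopwords(stopwords, remove=REMOVE, add=ADD):
    """Return a new stopword set with `remove` dropped and `add` included."""
    return (set(stopwords) - set(remove)) | set(add)


def filter_tokens(tokens, stopwords):
    """Keep only tokens whose lowercase form is not a stopword."""
    return [t for t in tokens if t.lower() not in stopwords]
```

Keeping the removals and additions as separate named sets makes it easy to record each round of tweaking and rerun the analysis.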
This statistical generation of stopword lists could also be applied to analyses of bodies of scientific literature. Generating discipline-specific extra stopword files would let one filter out the common noise words and get a better impression of a research group's strengths and focus from its published papers. That is increasingly important, as at least one study of search practices among researchers suggests a dependence on Google and, by implication, its search algorithms.
Building topic or keyword extraction models may help counter this by allowing the generation of 'other related' lists ...
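As a rough illustration of how discipline-specific stopword candidates might be generated, the sketch below flags words that occur in a large share of the documents in a corpus. The document-frequency approach, the `min_doc_fraction` threshold and the function names are all assumptions made for the example, not necessarily the method used to build the stopwords file discussed here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def candidate_stopwords(documents, min_doc_fraction=0.8):
    """Return words appearing in at least `min_doc_fraction` of documents.

    Each word is counted once per document, so a word repeated many
    times in a single paper does not qualify on that basis alone.
    """
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    threshold = min_doc_fraction * len(documents)
    return {word for word, n in doc_freq.items() if n >= threshold}
```

The resulting candidates would still need reviewing by hand, since a word common to every paper in a field may be exactly the term that defines it.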