Recently, I’ve been thinking a lot about how to measure impact in a research data repository.
Impact is a fairly loose concept - it’s not something objectively countable, such as citation rates, but rather some expression of perceived value.
Here, perceived value is some way of producing numbers (and numbers of course don’t lie) that seem to indicate that the data is being used and accessed by people.
Counting accesses is easy. You can use something like AWStats, which will tell you who is accessing what from where. Actually, of course, it doesn’t: it tells you that a computer at a particular IP address has accessed a particular document.
There is of course no way to tell if that is a result of someone idly surfing the web and following a link out of curiosity, or if it’s the start of a major engagement. Both have impact but there’s no way of quantifying or distinguishing the two.
Likewise, relying on IP addresses is of little value: in this always-on, contracted-out world we live in, there is no way to tell who is accessing you from a revered academic institution’s network and who is on the number 267 bus. The fact that the number 267 bus terminates next to the revered institution is probably unknown to you.
Basically, all web statistics give us is crude counts: this URL is more popular than that URL. We cannot assess the value of the access.
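To make "crude counts" concrete, here is a minimal sketch of roughly what a log analyser does under the hood. It assumes the web server writes Common Log Format lines; the function name `crude_counts`, the sample paths, and the sample addresses (taken from the reserved documentation ranges) are all made up for illustration, not taken from any real repository.

```python
import re
from collections import Counter, defaultdict

# Start of a Common/Combined Log Format line:
# client IP, two ignored fields, [timestamp], "METHOD path protocol", status
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:GET|HEAD) (\S+)[^"]*" (\d{3})'
)

def crude_counts(lines):
    """Return {path: (hits, distinct_ips)} for successful requests only."""
    hits = Counter()
    clients = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip anything that is not a recognisable request line
        ip, path, status = match.groups()
        if status != "200":
            continue  # ignore errors and redirects
        hits[path] += 1
        clients[path].add(ip)
    return {path: (hits[path], len(clients[path])) for path in hits}

# Hypothetical log lines, using documentation-only IP addresses
sample = [
    '192.0.2.10 - - [10/Oct/2013:13:55:36 +1000] "GET /dataset/42 HTTP/1.1" 200 2326',
    '192.0.2.10 - - [10/Oct/2013:13:56:01 +1000] "GET /dataset/42 HTTP/1.1" 200 2326',
    '198.51.100.7 - - [10/Oct/2013:14:02:11 +1000] "GET /dataset/42 HTTP/1.1" 200 2326',
    '198.51.100.7 - - [10/Oct/2013:14:03:40 +1000] "GET /dataset/7 HTTP/1.1" 200 512',
    '203.0.113.5 - - [10/Oct/2013:14:05:00 +1000] "GET /missing HTTP/1.1" 404 209',
]

for path, (n_hits, n_ips) in sorted(crude_counts(sample).items()):
    print(f"{path}: {n_hits} hits from {n_ips} distinct IP addresses")
```

All this yields is a popularity ranking: one path has more hits, from more addresses, than another. Nothing in the log says whether any of those hits mattered to anyone.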
For example, if I look at the Google Analytics figures for this blog, I can say that most posts are read from around 30 individual IP addresses. Some are read by considerably more people, a few by considerably fewer. If I look at the originating IP addresses, I can see that a few are read from recognisable academic institutions, but that most of the accesses come from elsewhere.
For instance, I know that a friend of mine at Oxford has read a particular post, but no Oxford University IP address is reflected in the accesses. I’m going to guess he read it on the bus, or at home.
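The same limitation shows up if you try to classify accesses by network. The sketch below is only an illustration: the campus range in `CAMPUS_NETWORKS` is a hypothetical stand-in (a reserved documentation range, not any real university's network), and the readers and their addresses are invented.

```python
import ipaddress

# Hypothetical campus range; 192.0.2.0/24 is reserved for documentation,
# so this is NOT a real institution's network.
CAMPUS_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]

def looks_institutional(ip):
    """True if the address falls inside one of the assumed campus ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in CAMPUS_NETWORKS)

# Invented readers: the first two are the same person on different networks
readers = [
    ("researcher at her desk", "192.0.2.44"),
    ("same researcher on the bus", "198.51.100.23"),
    ("curious passer-by", "203.0.113.9"),
]

labels = {who: looks_institutional(ip) for who, ip in readers}
for who, institutional in labels.items():
    print(f"{who}: {'institutional' if institutional else 'elsewhere'}")
```

The researcher on the bus and the curious passer-by end up with the same label, which is exactly the problem: the address describes the network, not the reader.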
And then of course there is the question of exactly what the crude counts tell us. Two of the most popular posts on this blog have been on using Office 365 with Gmail and using Google Calendar with Orage. Both have clearly had some impact, as people have emailed me both to compliment and to complain about them. Interestingly, most people seem to have found them via a search engine, not through being passed on from individual to individual via some other mechanism such as Twitter.
And that perhaps explains the problem with impact. People nowadays search for things rather than look them up (I know that seems a tautology, but what I mean is that they use Google in preference to looking at a research paper and following the citations).
Which of course means that impact is at the mercy of the search engine algorithm. And in the case of datasets, or other digital objects, which are basically incomprehensible blobs to the search engine, we are at the mercy of the quality of the metadata associated with these objects …