Thursday 28 November 2013

Impact and Impact Story

At Tuesday’s ANDS/Intersect meeting, one piece of software getting some traction was ImpactStory.

So I thought I’d check it out and make myself an ImpactStory profile. After all, the only way to find out about these things and their usability, or otherwise, is to experiment on oneself.

In very crude terms, ImpactStory is an attempt to build a Klout-like application for academia. Like Klout, it attempts to quantify that nebulous thing ‘influence’. Unlike Klout, it is an open source, publicly funded initiative (by the NSF and the Alfred P. Sloan Foundation, no less) and transparent - the code and the algorithms used are open to scrutiny.

And of course it’s an important initiative. Scholarly communication has changed. Researchers blog and tweet about work in progress, and this forms an important source of preliminary announcements, as well as the usual crosstalk about ideas and interpretations. Project websites have increasingly become a source of research information in their own right.

So what about ImpactStory?

I think it’s fair to say it’s a work in progress - it harvests data from Twitter, ORCID, WordPress, SlideShare, Vimeo and a few others. Crucially, it doesn’t harvest material from Blogger, Academia.edu or Scribd, all of which are often used to host preprints and grey literature, such as working papers and various reports, for example Rebecca Barley’s work on late Roman copper coins in South India.
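
As a very rough sketch of what this kind of harvesting involves - purely illustrative, with made-up provider functions rather than ImpactStory’s actual code:

```python
# Purely illustrative sketch of altmetrics harvesting - not ImpactStory's code.
# Each 'provider' is a function returning simple counts for a researcher
# identifier (here an ORCID iD); a real harvester would call each service's API.

def fetch_twitter_mentions(orcid_id):
    return {"tweets": 12}        # placeholder for a Twitter API call

def fetch_slideshare_views(orcid_id):
    return {"slide_views": 340}  # placeholder for a SlideShare API call

PROVIDERS = [fetch_twitter_mentions, fetch_slideshare_views]

def harvest_profile(orcid_id):
    """Aggregate the counts from every provider into a single profile."""
    profile = {}
    for provider in PROVIDERS:
        profile.update(provider(orcid_id))
    return profile

print(harvest_profile("0000-0002-1825-0097"))  # ORCID's documented example iD
```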

In short, it’s a first or second pass at a solution. It shows us what the future might be as regards the quantification of ‘impact’ - basically the buzz around a researcher and their research work - but it is not the future.

In its current state it probably under-reports a researcher’s activity: it can show that a researcher has some impact, but it cannot show a lack of impact, and we should consequently be wary of including its metrics in any report.

Written with StackEdit.

Monday 25 November 2013

Student Computer Use ii

Way back in September this year I blogged about student computer use. Slightly to my surprise, that post garnered quite a high readership despite its extremely informal methodology:

  • walk round campus at lunch time
  • note what computers you see students using
  • blog about it

Over the weekend a colleague from York posted some interesting figures from there on computer use:

OS stats of connections INTO campus for desktop OSs: Win81 1%, Win8 9%, Win7 50%, Vista 5%, XP 15%, Linux 3%, OS X 17%

OS stats of computers ON campus: Win81 5%, Win8 20%, Win7 19%, Vista 7%, XP 16%, Linux 6%, OS X 26%

Which is kind of interesting. Now, I don’t know how the figures were filtered, but they probably reflect a mix of staff and student connections, with the second set of figures reflecting student use more accurately than the first, given that York is a university where a lot of students live on campus.

From this I think it’s fair to say students have a strong preference for OS X. The presence of Vista and XP is interesting, and I think confirms my long-held suspicion that students buy themselves a laptop at the start of their degree course and never upgrade the OS over the course of their three or four years.

If I’m right, it probably means that the end of support for XP is less of a problem than it might be, as the XP and Vista machines will age out of the population by the end of the northern hemisphere academic year. (Of course, here in Canberra our academic year is already over.)

This also explains why a quarter of the machines run Windows 8 or 8.1 - they represent new purchases.

Connections into campus probably reflect a mix of staff and graduate student connections - and the dynamics of machine replacement are probably different for them: they probably use a machine for more of its natural life of four to five years, and, given the initial distaste for Windows 8, they probably tried to replace machines with Windows 7 where possible.

The numbers of Vista and XP machines are concerning, but given that most people never upgrade their computers anyway, one would need to take human factors into account in any upgrade campaign.

Sidegrading to Ubuntu is probably a step too far for most of these users, given the current low penetration of Linux among that community.

However, the key takeaways are that OS X has made substantial inroads into the student computer community, Linux hasn’t, and despite OS X’s advance, Windows OSs still make up the majority.

Written with StackEdit.

Friday 22 November 2013

Impact (again!)

I’ve been thinking some more about impact and what it means for datasets.

From a researcher’s point of view, what measuring impact amounts to is trying to create something akin to a Klout score for academics.

Klout is a social media analytics company that generates a ranking score using a proprietary algorithm that purports to measure influence through the amount of social media engagement generated.

Because I experiment on myself, I know what my Klout score is - an average of 43 (+/- 1), which is respectable but not stellar. Now the interesting things about this score are:
  • While I’ve connected my Facebook account to it, I don’t actively use Facebook
  • I have a small band of loyal Twitter followers (230 at the last count)
  • Google Analytics shows my blogs have reasonable penetration, with an average readership of around 30 per post
In other words, while I am active in social media terms, I’m not massively so. So Klout must be doing some qualitative ranking as well as some quantitative ranking, perhaps along the lines of: X follows you, X has N followers, and those N followers have an average of M followers each. I’m of course guessing - I actually have no idea how they do it. The algorithm is proprietary and the scoring system could be completely meaningless.
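
A toy version of that guess - entirely my own invention, not Klout’s algorithm - might weight each follower by the size of their own audience:

```python
# Toy influence score - my own invention, not Klout's secret algorithm.
# Each follower contributes according to the (log-damped) size of their
# own audience, so one well-followed follower beats many obscure ones.
import math

# made-up data: follower name -> that follower's own follower count
followers = {"alice": 1500, "bob": 40, "carol": 300}

def influence(followers):
    return sum(math.log10(1 + count) for count in followers.values())

print(round(influence(followers), 1))  # 7.3 for the sample data above
```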

It is however interesting as an attempt to measure impact, or, in Klout’s terms, social media influence.

Let’s turn to research papers. The citation rates of scientific papers reflect their influence within a particular field, and we all know that a paper in the Journal of Important Things gets you a nice email from the Dean, while one in the Proceedings of the Zagistani Cybernetics Institute does not. This, of course, is the idea behind bibliometrics and the attempt to quantify impact. Crudely, a paper in a well-respected journal is likely to be more widely read than one that is not.
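
For reference, the familiar two-year journal impact factor formalises roughly this intuition (stated here from memory, so treat it as a sketch):

$$\mathrm{IF}_y = \frac{\text{citations received in year } y \text{ by items published in years } y-1 \text{ and } y-2}{\text{number of citable items published in years } y-1 \text{ and } y-2}$$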

Even if it is not widely cited, such a paper has had more influence than one less widely read.
And of course we know this is probably more or less true: if you’re an ethologist, you’re probably going to want some publications in Animal Behaviour on your CV.

So we can see it could sort of work within disciplines, or at least those in which journal publication is the norm. There are disciplines, such as computer science, where a lot of the material is published in conference proceedings, and that’s a slightly different game.

Let’s now think about dataset citation. By its nature, data that is widely available is open access, and there is no real established publication infrastructure for it, with the exception of a few dedicated specialist repositories such as the Archaeology Data Service in the UK and IRIS in the US for the earth sciences.

These work because they hold a critical mass of data for the disciplines, and thus archaeologists ‘know’ to look at the ADS just as ethologists ‘know’ to look at Animal Behaviour.

Impact is not a function of the dataset itself, but of some combination of its accessibility and dissemination. In other words it comes down to:
  • Can I find it?
  • Can I get access to it?
Dataset publication and citation are immature. Sites such as Research Data Australia go some way by aggregating the information held in institutional data repositories, but they are in a sense a halfway house - if I were working in a university in the UK, would I think to search RDA? Possibly not. And remember that most datasets are only of interest to a few specialists, so they are not going to zoom up the Google page rank.

At this stage of the game, there are no competing repositories in the way that there are competing journals, which means that we can simply use raw citation rates to compute influence. And to use citation rates we need to be able to identify individual datasets uniquely - which is where digital object identifiers come in: not only do they make citation simpler, they make counting citations simpler …
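
And that counting really is simple once everything has a unique identifier. A minimal sketch, assuming some harvested list of dataset citations (the DOIs below are invented examples):

```python
# Minimal sketch: once each dataset has a DOI, measuring raw influence is
# just tallying identifiers. The DOIs below are invented examples.
from collections import Counter

citations = [
    "10.9999/example-dataset-a",
    "10.9999/example-dataset-b",
    "10.9999/example-dataset-a",
]

for doi, n in Counter(citations).most_common():
    print(doi, n)
```
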
Written with StackEdit.

Raspberry Pi as a media player

I’ve recently been looking at buying myself a Raspberry Pi - admittedly more as a toy than for any serious purpose - as I’ve been feeling the lack of a Linux machine at home ever since my old machines, built out of recycled bits, died in a rainstorm.

(To explain: I have a bench in the garage where I play with such things, and not unnaturally I had the machines on the floor. We had a rainstorm of tropical dimensions, one of the garage downpipes blocked, and the rainwater backed up, got in under the tin, and flowed across the floor, straight through my Linux machines.)

Anyway, to the point, I’ve been researching options to buy a Pi, especially as we don’t really have much of a local ecosystem in Australia.

And one thing that became very obvious is that they have a major role in powering media players and displays - which kind of makes sense given that they have HDMI output and run a well-known operating system, making them ideal for streaming content from a local source or powering a display system: run a kiosk app on the Pi and push your content out onto a display device. Wonderful what you can do with cheap technology.
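
The kiosk trick itself needs almost nothing. A minimal sketch, assuming a Pi with Chromium installed and some content to show (the URL is a placeholder):

```python
# Minimal kiosk sketch for a Pi-driven display: launch a full-screen browser
# pointed at the content to show. Assumes Chromium is installed; the URL is
# a placeholder, not a real service.
import subprocess

DISPLAY_URL = "http://signage.example.org/playlist"

subprocess.run(["chromium-browser", "--kiosk", DISPLAY_URL])
```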

Again, by pure coincidence I came across a post describing the role of cheap Android devices and how in the main they are used as ways of viewing video content or else as embedded devices.

In other words, there is a lot of under-the-radar demand for content viewing, which is different from how we think tablets are used - for more engaged activities such as web surfing and email, as well as routine tasks like online banking.

And here we have the key takeaway: tablets, like Raspberry Pis, are versatile computing devices, just as PCs are. And just as PCs have a lot of uses other than general purpose computing, so do tablets and other such devices.

PCs became general purpose computing devices in the main because of their open architecture and the fact that various factories in the Far East could make parts for them relatively cheaply, meaning that if you wanted to make, say, a gene sequencer, rather than having to design embedded hardware, and then face the difficulty of maintaining and upgrading it, you could write software and use the standard interfaces available on a PC - thus significantly reducing your development and delivery costs.

Android and the Raspberry Pi, both of which are open systems like the original PC, are giving us a similar effect - cutting development and delivery costs for embedded systems, as the software environment is already there …

Written with StackEdit.

Wednesday 6 November 2013

Measuring impact

Recently, I’ve been thinking a lot about how to measure impact in a research data repository.

Impact is a fairly loose concept - it’s not something objectively countable such as a citation rate; it is rather more an expression of perceived value.

Here, perceived value means some way of producing numbers (and numbers, of course, don’t lie) that seem to indicate that the data is being used and accessed by people.

Counting accesses is easy. You can use something like AWStats - this will tell you who, from where, is accessing what. Actually, of course, it doesn’t: it tells you that a computer with a particular IP address has accessed a particular document.
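
That crude counting is easy to reproduce by hand. A minimal sketch over a Common Log Format access log (the log path is a placeholder):

```python
# Minimal sketch of raw access counting from a Common Log Format file -
# which is really all a web log can tell you: IP address X fetched URL Y.
from collections import Counter

hits = Counter()
with open("access.log") as log:            # placeholder path
    for line in log:
        parts = line.split()
        if len(parts) > 6:                 # CLF: ip - - [date] "GET /url ..." ...
            ip, url = parts[0], parts[6]
            hits[(ip, url)] += 1

for (ip, url), n in hits.most_common(5):
    print(ip, url, n)
```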

There is of course no way to tell whether that access is the result of someone idly surfing the web and following a link out of curiosity, or the start of a major engagement. Both have impact, but there’s no way of quantifying or distinguishing the two.

Likewise, if you rely on IP addresses: in this always-on, contracted-out world we live in, there is little value in being able to tell who is accessing you from a revered academic institution’s network and who is on the number 267 bus - especially as the fact that the number 267 bus terminates next to the revered institution is probably unknown to you.

Basically, all web statistics give us is crude counts: this URL is more popular than that URL. We cannot assess the value of the access.

For example, if I look at the Google Analytics figures for this blog, I can say that most posts are read by around 30 individual IP addresses; some are read by considerably more, and a few by considerably fewer. If I look at the originating IP addresses, I can see that a few reads come from recognisable academic institutions, but most accesses come from elsewhere.

For example, I know that a friend of mine at Oxford has read a particular post, but no Oxford University IP address shows up in the accesses. I’m going to guess he read it on the bus, or at home.

And then of course there is the question of exactly what the crude counts tell us. Two of the most popular posts on this blog have been on using Office 365 with Gmail and using Google Calendar with Orage. Both have clearly had some impact, as people have emailed me both to compliment and to complain about them. Interestingly, most people seem to have found them via a search engine, not through being passed from individual to individual via some other mechanism such as Twitter.

And that perhaps explains the problem with impact. People nowadays search for things rather than look them up (I know that sounds like a tautology, but what I mean is that they use Google in preference to reading a research paper and following the citations).

Which of course means that impact is at the mercy of the search engine algorithm. And in the case of datasets, or other digital objects, which are basically incomprehensible blobs to a search engine, we are at the mercy of the quality of the metadata associated with those objects …

Written with StackEdit.