Wednesday, 30 October 2013

A gang of seventeenth century puritans and research impacts ...

I’ve recently become interested in the history of the Providence Island Company.

In abbreviated terms, in the late sixteenth and early seventeenth centuries a slew of merchant venturer companies were set up to fund and initiate exploration for new lands.

This was in the main a reaction to the Spanish conquest of the Aztec and Inca polities and the resultant flood of wealth. There was a range of companies, including the East India Company, but one of the most interesting was the Providence Island Company.

The what ? Well if you know anything of the disputes between the king and parliament, and look at the names of the principal investors in the Providence Island Company, some names leap out at you - John Pym for one, and others less well known such as Gregory Gawsell, who was later the treasurer of the Eastern Association, one of the most effective Parliamentary military organisations in the early stages of the civil war.

In short the Providence Island Company provided a legitimate vehicle for men who went on to lead the Parliamentary side in the early stages of what became the first English Civil War.

None of this is of course new. Historians have known this for years, just as they know that various country houses owned by the protagonists, such as Broughton Castle in Oxfordshire, were the scenes of various conversations and resolutions in the run up to the wars.

In short they all knew each other, and many of them were connected to the people who signed Charles the first’s death warrant.

So being a geek, I thought it might be fun to try and build a social graph and then feed it through a network analysis tool such as Gephi.

There is of course no convenient list of names and relationships, so I started building one using YAML - perhaps not the ideal choice, but it lets me do little index card entries for each person, with information like who participated in which body, and who knows who. Due to its flexibility YAML allows me to create a little folksonomy rather than trying to make a formal database while I’m working out what I want to do.

At some point I’ll probably need to write a little code to express the YAML content as RDF triples. The great virtue of YAML is that it’s text based, which means that I can use regexes and suchlike to extract information from the file.
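As a rough sketch of what that conversion might look like in Python: the card layout, the `member_of` field and the `memberOf` predicate below are all hypothetical, the regexes only cope with this flat layout rather than YAML in general, and plain tuples stand in for real RDF terms.

```python
import re

# Hypothetical index-card layout: one unindented key per person with an
# indented "member_of" list. This is an assumption, not the actual file.
CARDS = """\
John Pym:
  member_of:
    - Providence Island Company
Gregory Gawsell:
  member_of:
    - Providence Island Company
    - Eastern Association
"""

PERSON = re.compile(r"^(\S.*):\s*$")     # unindented "Name:" line
ITEM = re.compile(r"^\s+-\s+(.+?)\s*$")  # indented "- value" line


def cards_to_triples(text):
    """Return (subject, predicate, object) tuples, one per membership."""
    triples = []
    person = None
    for line in text.splitlines():
        match = PERSON.match(line)
        if match:
            person = match.group(1)
            continue
        match = ITEM.match(line)
        if match and person:
            triples.append((person, "memberOf", match.group(1)))
    return triples


for triple in cards_to_triples(CARDS):
    print(triple)
```

A proper YAML parser and an RDF library would be the sturdier route once the folksonomy settles down; the regex version just shows that the text format does not get in the way.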

As a data source I’m using wikipedia and following links to compile my YAML folksonomy. Very geeky, but it keeps me amused.

And it’s quite fascinating in a geeky sort of way. For example, Thomas Rainsborough, a Leveller leader (in so far as the Levellers had leaders) was related by marriage to John Winthrop, the Puritan governor of Massachusetts and had also visited the Providence Island colony, even though he had no direct relationship with the directors of the Providence Island Company.

Once I’ve got a big enough data set I’ll transform it and feed it into Gephi and see what comes out.
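The transformation could be as simple as writing an edge list, since Gephi can import edges from a CSV file with `Source` and `Target` columns. A minimal Python sketch, assuming the memberships have already been extracted into a dictionary (the `MEMBERSHIPS` structure and the rule of linking people who sat on the same body are illustrative assumptions):

```python
import csv
import io
from itertools import combinations

# Assumed input: body -> people recorded as belonging to it.
MEMBERSHIPS = {
    "Providence Island Company": ["John Pym", "Gregory Gawsell"],
    "Eastern Association": ["Gregory Gawsell"],
}


def edges_csv(memberships):
    """Link every pair of people who share a body; return CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Source", "Target", "Type", "Label"])
    for body, people in sorted(memberships.items()):
        for a, b in combinations(sorted(people), 2):
            writer.writerow([a, b, "Undirected", body])
    return out.getvalue()


print(edges_csv(MEMBERSHIPS))
```

A body with only one recorded member produces no edges, so the graph only fills out as more index cards are added.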

However this is not just an exercise in geekery, it does have a degree of more general applicability.

Universities are very interested these days in the impact that their researchers have. Using similar social network analyses it ought to be possible to show who has collaborated with whom, and how regularly they have done so.
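To make the idea concrete, here is a small Python sketch that counts how often each pair of authors appears together. The author lists are invented placeholders, and real data would need name disambiguation first.

```python
from collections import Counter
from itertools import combinations

# Hypothetical author lists, one per paper.
PAPERS = [
    ["A. Author", "B. Student"],
    ["A. Author", "B. Student", "C. Postdoc"],
    ["C. Postdoc", "D. Other"],
]


def collaboration_counts(papers):
    """Count the number of papers each pair of authors shares."""
    counts = Counter()
    for authors in papers:
        for pair in combinations(sorted(set(authors)), 2):
            counts[pair] += 1
    return counts


for pair, n in collaboration_counts(PAPERS).most_common():
    print(pair, n)
```

The pair counts can be used directly as edge weights in a network analysis tool.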

As a result of our Metadata stores project we actually have a lot of this data, and will shortly have it in an RDF expression.

Potentially, by analysing information such as the email addresses used in subsequent papers, it might be possible to show where secondary authors (typically graduate students and postdocs) have moved to. Coupled with some bibliometric data this might just give us a measure of the impact of graduate students and postdocs within, say, five years of their moving elsewhere.

In other words trying to gauge the impact of researchers, not just research papers …

Thursday, 24 October 2013

Further thoughts on eresearch services

I’ve been watching the eresearch services thread from the eresearch 2013 conference. I’m beginning to regret not having gone - it looks to have been more interesting than I expected, but that’s life.
A lot of people seem to be getting interested in eresearch services. I’ve already expressed my opinions about such things, but generally a heightened level of interest is probably a good thing.
However there seems to be a bit of conflation going on, so here’s my $0.02:
  • eresearch is not the same as big data - big data refers to the handling of very big data sets and their analysis using inferential analyses, eresearch refers to the application of computational and numerical techniques to research data
  • eresearch is not the same as digital humanities - digital humanities really refers to the move to using electronic resources in the humanities - this move may enable the application of eresearch techniques to humanities research
  • astronomers, physicists, economists and many more have been using inferential analyses such as cluster analysis for many years - eresearch is not new, but its spread and penetration is
  • the rise of cheap (cloud) computing and cheap storage are key drivers in the adoption of eresearch by allowing bigger datasets to be handled more easily and cheaply
In short, you can do perfectly good eresearch with an old laptop and an internet connection, you don’t need all the gee-whizzy stuff, all you need is a problem, the desire to solve it, and a little bit of programming knowledge.
Any eresearch service is going to be there to support such work, by facilitating access and providing advice and support to researchers. In fact it’s taking on the role of research support that belonged to computing services in the days before the commodification of computing and the rise of the internet, when in the main computing meant time on a timesharing system to run some analytical software …
Written with StackEdit.

Wednesday, 23 October 2013

Data Science for business [book review]

Data Science for Business: What you need to know about data mining and data-analytic thinking
Foster Provost and Tom Fawcett
O'Reilly Media 

Data science is the new best thing, but like Aristotle’s elephant people struggle to define 
exactly what data science is and what the skills required are.

When we see data science we tend to recognise what it is, that mixture 
of analysis, inference and logic  that pulls information out of numbers, be it social 
network analysis, plotting interest in a topic over time, or predicting the impact of the 
weather on supermarket stock levels.

This book serves as an introduction to the topic. It’s designed for use as a 
college textbook and perhaps  aimed at business management courses. It starts at a very 
low level, assuming little or no knowledge of statistics or of any of the more advanced 
techniques such as cluster analysis or topic modelling.

If all you ever do is read the first two chapters you’ll come away with enough 
high level knowledge to fluff your way through a job interview as long as you’re 
not expected to get your hands dirty.

Chapter three and things get a bit more rigorous. The book noticeably changes 
gear and takes you through some fairly advanced mathematics, discussing 
regression, cluster analysis and the overfitting of mathematical models, all of 
which are handled fairly well.

It’s difficult to know where this book sits. The first two chapters are most 
definitely ‘fluffy’, the remainder demand some knowledge of probability theory 
and statistics of the reader, plus an ability not to be scared by equations embedded 
in the text.

It’s a good book, it’s a useful book. It probably asks too much to be ideal for the 
general reader or even the non numerate graduate. I’d position it as an 
introduction to data analysis for beginning researchers and statisticians 
rather than as a backgrounder on data science.

[originally written for LibraryThing]

Tuesday, 22 October 2013

What does an eresearch service look like ?

There has been a lot of discussion about eresearch and eresearch services. However when you try and pin down what constitutes an eresearch service it seems to be all things to all people.

In an effort to try and find some consensus I did a very simple survey. I typed 'eresearch services' into Google and chose pages from Australian universities. I've tabulated the results of this fairly unscientific survey in a google spreadsheet.

Each institution of course described the service on offer differently, so the spreadsheet is purely my interpretation of the information available on the web.

There are some clear trends - all sites offer help with
  • storage
  • access to compute/virtual machines
  • cloud services
  • collaboration (which includes data transfer and video conferencing)
Other services tend to be more idiosyncratic, perhaps reflecting the strengths of individual institutions. However it's clear that a lot of the effort revolves around facilitation.

My personal view is that we should not try to second guess researchers. Instead of prescribing, we facilitate by helping researchers get on with the business of research.

This is based on my experience. Over the course of the data commons project we fielded a number of out of band questions such as
  • Access to storage for work in progress data
  • Data management and the use of external services like dropbox and skydrive
  • Access to a bare vm to run an application or service
  • Starting a blog to chronicle a project
  • What is the best tablet for a field trip
  • How can I publish my data on the web
which suggests what researchers want is advice and someone to help them do what they want to do - a single point of contact.

Provision of a single point of contact hides any internal organisational complexity from the researcher; it becomes the contact’s problem to manage access to services and not the researcher’s.

There are of course other views - for example this presentation from eresearch 2013 but I think we can agree that what researchers want is easy access to services and a small amount of support and help ...

Monday, 14 October 2013

Recovering data serially

Over the past few weeks I've noticed a number of posts along the lines of

we've an old XYZ machine without a network connection, can anyone help with recovering data from it?

Not having an ethernet connection is a problem, but assuming that the machine still powers up and the disk spins, it might not be so much of a problem.

The key is to go looking to see if it has a terminal application. This isn't as odd a question as it sounds, as a lot of computers were used to access timesharing systems back in those pre web days, and a terminal application was fairly standard.

The good thing about terminal applications back then is that they usually incorporated a serial file transfer protocol such as xmodem, ymodem, zmodem or kermit. Of these kermit is perhaps the best, not the least because it can be put into server mode and you can push files from your host in batches.
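To give a feel for how simple these protocols are, here is a Python sketch of how a single block is framed in the original checksum variant of xmodem: a start byte, the block number and its complement, 128 bytes of data and a one byte checksum. It is illustration only (the function name is made up, and it ignores the CRC variant and the ACK/NAK handshake), so use lrzsz or kermit for real transfers.

```python
SOH = 0x01  # start-of-header byte that opens a 128-byte block
SUB = 0x1A  # padding byte used to fill a short final block


def xmodem_packet(block_number, payload):
    """Frame up to 128 bytes as one checksum-mode xmodem block."""
    if len(payload) > 128:
        raise ValueError("xmodem blocks carry at most 128 bytes")
    data = payload.ljust(128, bytes([SUB]))  # pad to the fixed block size
    seq = block_number % 256                 # block numbers wrap at 256
    checksum = sum(data) % 256               # simple 8-bit sum of the data
    return bytes([SOH, seq, 255 - seq]) + data + bytes([checksum])


packet = xmodem_packet(1, b"hello from the old machine")
print(len(packet), packet[:3].hex())
```

The fixed block size is why files transferred this way often arrive padded at the end, something to bear in mind when converting them later.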

The good news is that both lrzsz, the ?modem client for linux, and ckermit are available for install on ubuntu from the repositories via apt-get.

Then all you need is a usb to 9 pin serial adapter cable and a serial nine pin null modem cable - both available from ebay for a few dollars - and then you should be able to transfer data from your old machine to the new.

You will of course need to set up things like parity and baud rate, and it might be an idea to practise transferring data first by setting up a second linux machine and transferring data between the two.
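Terminal programs and kermit have their own commands for these settings, but as an illustration of what is being set, here is a Python sketch using the standard `termios` module on linux. The 9600 baud, 8 data bits, no parity, 1 stop bit choice and the `/dev/ttyUSB0` device name are assumptions; the old machine's manual or terminal settings screen should say what it actually expects.

```python
import os
import termios


def eight_n_one(attrs, baud=termios.B9600):
    """Return a copy of a termios attribute list set to raw 8N1."""
    iflag, oflag, cflag, lflag, _ispeed, _ospeed, cc = attrs
    # No parity, one stop bit, clear the character size bits...
    cflag &= ~(termios.PARENB | termios.CSTOPB | termios.CSIZE)
    # ...then 8 data bits, ignore modem control lines, enable receiver.
    cflag |= termios.CS8 | termios.CLOCAL | termios.CREAD
    # Raw mode: no line editing, echo, signals or newline translation.
    lflag &= ~(termios.ICANON | termios.ECHO | termios.ISIG)
    iflag &= ~(termios.IXON | termios.IXOFF | termios.ICRNL)
    oflag &= ~termios.OPOST
    return [iflag, oflag, cflag, lflag, baud, baud, list(cc)]


def configure_port(path="/dev/ttyUSB0", baud=termios.B9600):
    """Open a serial device (path is an assumption) and apply 8N1."""
    fd = os.open(path, os.O_RDWR | os.O_NOCTTY)
    termios.tcsetattr(fd, termios.TCSANOW,
                      eight_n_one(termios.tcgetattr(fd), baud))
    return fd
```

Both ends have to agree on every one of these values; a mismatch in baud rate or parity is the usual cause of a screen full of garbage characters.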

Despite this sounding a bit of a black art, it's actually quite easy. The other good thing is that a number of embedded communications devices are still configured over a serial port, so most network technicians still know something about debugging serial connections.

Once you have managed to establish a working connection you should then be able to get the serial communications software on your problematical machine to talk to your newly enabled serial host.

From there it's simply a matter of transferring the files across one by one and converting them to something usable - if they're wordprocessor files, LibreOffice can read most of the legacy formats and web based services like cloudconvert and zamzar can read many more ...

Written with StackEdit.

Thursday, 10 October 2013

Archiving, persistence and robots.txt

Web archiving is a hazardous business. Content gets created, content gets deleted, content gets changed every minute of every day. There's basically so much content you can't hope to archive it all.

Also a lot of web archiving assumes that the pages are static, even if they've been generated from a script - pure on the fly pages have no chance of being archived.

However you can usually make an assumption that if something was a static web page and there long enough, that it will be on the wayback machine in some form.

Not necessarily it turns out. I recently wanted to look at some content I'd written years ago. I didn't have the original source, but I did have the url and I did remember searching successfully for the same content on the wayback machine some years ago. (I even had a screenshot as proof that my memory wasn't playing tricks).

So, you would think it would be easy. Nope. Access is denied because the wayback machine honours the site's current robots.txt file, not the one current at the time of the snapshot, meaning that if your favourite site changes its robots.txt between then and now to deny access you are locked out.
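The effect is easy to demonstrate with Python's standard `urllib.robotparser`. In this sketch the two robots.txt files, the URL and the use of `ia_archiver` as the crawler name are assumptions for illustration, not the actual files involved.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents at snapshot time and today.
ROBOTS_AT_SNAPSHOT = "User-agent: *\nDisallow:\n"
ROBOTS_TODAY = "User-agent: *\nDisallow: /\n"

URL = "http://example.org/old/page.html"  # placeholder address


def allowed(robots_txt, url, agent="ia_archiver"):
    """Would this robots.txt let the given crawler fetch the url?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


print("at snapshot time:", allowed(ROBOTS_AT_SNAPSHOT, URL))
print("today:", allowed(ROBOTS_TODAY, URL))
```

The page was legitimately crawlable when it was captured, yet a check against today's file says no, which is exactly the lock out described above.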

Now there's a lot of reasons why they've enacted the policy they have but it effectively locks away content that was once public, and that doesn't seem quite right ...

Written with StackEdit.

Wednesday, 9 October 2013


If you're someone who follows my twitter stream you may have noticed that I seem to post bursts of tweets around the same time every day.

This is because I've taken to using Bufferapp to stage some of my tweets. Basically bufferapp is a little application that integrates nicely with Chrome and AddThis, which allows you to put tweets into a buffer to be reposted later in the day.

I only use the free version, which means that my buffer is only 10 deep, but that seems to cover most of the tweets I'm likely to make in a day. I'm not obsessive compulsive about twitter, no matter what it seems like.

Why use it?

One could imagine lots of scenarios including making it look as if one was online when one wasn't but my reasons are a little different. Basically I tweet about two topics - geeky stuff to do with computing and data storage, and equally geeky stuff about history and archaeology. There is of course some overlap, for example big digitisation projects and computational text analysis, but in the main there are two topic groups and two topic audiences. (I had the same thing with my blogs, which is why I split them - this one is more technically focussed, while the other one is a bit more discursive and random.)

When I look at my twitter followers I can say very roughly that the computing and data people are in the same or adjacent timezones to me, but the people interested in the geeky history stuff are clustered in North America and Western Europe - of course that's not quite true, I have followers in South Africa and Chile to name but two, but it's a good enough approximation.

In other words the history followers tend to be between eight and eighteen time zones away from me on the east coast of Australia, and hence unlikely to be awake when I'm tweeting (well except for Chile and the west coast of America where there's a few hours of overlap).
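A quick Python sketch of the arithmetic. The fixed offsets (UTC+11 for the Australian east coast, UTC+1 for London and UTC-4 for New York, as they stood in early October 2013) and the 3pm tweet time are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC offsets assumed for early October 2013.
SYDNEY = timezone(timedelta(hours=11))
AUDIENCES = {
    "London": timezone(timedelta(hours=1)),
    "New York": timezone(timedelta(hours=-4)),
}


def local_times(when):
    """Convert one moment into each audience's local time."""
    return {city: when.astimezone(tz) for city, tz in AUDIENCES.items()}


# A hypothetical mid-afternoon tweet from the east coast of Australia.
tweet = datetime(2013, 10, 9, 15, 0, tzinfo=SYDNEY)
for city, local in local_times(tweet).items():
    print(city, local.strftime("%a %H:%M"))
```

A tweet sent at three in the afternoon lands at five in the morning in London and at midnight in New York, which is why holding it back for a few hours makes sense.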

So I've taken to using bufferapp to delay the tweets for that audience, which has the effect of de cluttering the feed for the computing and data people.

I'm still tweaking the schedule and I'm conscious (because some of my followers have said so) that some of both communities like a leavening of the other sort of information so it's not a hard split, and of course there's always the daily summary of the most popular tweets from both me and the people I follow ...
Written with StackEdit.

Tuesday, 8 October 2013

Usenet, VMs and Pan

For me, as for most people who were in at the beginnings of the internet as something widespread (I'll say sometime around 1991, when JANET connected to the Internet and abandoned Coloured Books for TCP/IP), Usenet News filled the niche taken nowadays by twitter and blog feeds.

Usenet news fell apart and lost popularity in the main due to it being hijacked by trolls and other malefactors, with the result that people walked away from it when the signal to noise ratio got too low.

In fact I closed down work's usenet news server a few years ago. It was quite an interesting experience as we had a couple of downstream servers elsewhere that we provided a feed to under an SLA. Finding someone at the downstream sites who could remember what a usenet news server was and why we should agree to terminate the SLA (and the feed) was a task in itself. People really don't use it anymore.

However, despite that, there are still a couple of technical newsgroups I find useful, especially now the trolls have abandoned usenet for twitter and facebook, making the experience kind of like the old days.

To access them I use pan running on a minimal crunchbang linux vm.

This of course has the problem of getting the information out of pan and into somewhere useful - having that useful post sitting on a vm you run up once a week isn't really that useful.

There are lots of ways of solving that problem, but I didn't want to spend a lot of time installing extra software such as dropbox on the vm. My answer is incredibly simple and incredibly old school - install alpine on the vm, set up a dummy mail account, manually attach the usenet posts as text files and email them to my evernote account, my work account, or wherever suits.
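If the manual step ever became tedious, the same thing could be scripted. This is not what I do with alpine, just a Python sketch using the standard `email` package; the addresses, subject and file name are placeholders, and the message is only built here, not sent.

```python
from email.message import EmailMessage


def post_as_email(post_text, filename, sender, recipient, subject):
    """Build a message carrying a usenet post as a text attachment."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content("Usenet post attached.")
    msg.add_attachment(post_text, subtype="plain", filename=filename)
    return msg


# Placeholder addresses and content, for illustration only.
message = post_as_email(
    "Subject: Re: serial cables\n\nUse a null modem cable.\n",
    "post.txt",
    "dummy@example.org",
    "notes@example.org",
    "Useful usenet post",
)
print(message["Subject"], [a.get_filename() for a in message.iter_attachments()])
```

Handing the result to `smtplib` with the dummy account's server details would send it on its way.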

Remarkably old school, but remarkably efficient ...

Written with StackEdit.