Friday, 30 January 2015

Software and reproducibility

In my career I've seen two cases where errors in the processor hardware or support environment caused problems in reproducing results - one was a GFLOAT bug in VAX microcode, the other was Intel's infamous Pentium floating point bug.

So if someone asked me how likely it was that a problem reproducing a set of results was due to a firmware or hardware fault, I'd say pretty unlikely.

Unfortunately I can't say the same about software. Libraries change, compilers have interesting features, and increasingly these days code is dependent on third party libraries and extensions, all of which conspire to make reproducing results a nightmare.

Throw in powerful environments like R and IPython notebooks and one has a verification problem. It's also unrealistic to expect a researcher to document all the libraries and dependencies involved - it's not their core competence.
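Some of that documentation burden can be automated, though. As a minimal sketch (assuming a reasonably modern Python with `importlib.metadata` available), a notebook could end with a cell that records the interpreter and every installed package version alongside the results:

```python
# Minimal sketch: record the computing environment a notebook ran in,
# so a future reader knows which interpreter and library versions
# produced the results.
import sys
import importlib.metadata

def environment_snapshot():
    """Return a dict mapping installed package names to versions."""
    return {
        dist.metadata["Name"].lower(): dist.version
        for dist in importlib.metadata.distributions()
    }

print(sys.version)
for name, version in sorted(environment_snapshot().items()):
    print(name, version)
```

R users get much the same effect from `sessionInfo()`. Neither guarantees the environment can be rebuilt, but at least the dependencies are on record.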

Just like one expects Word or Open Office to open the previously saved draft of a research paper one expects that R will open a previous notebook and run the job.

However, if a few years have elapsed and the dependencies and performance of a particular library have changed, or one's running ostensibly the same version in a different computing environment there's a risk that things might not work the same.

The risk is small, and for computationally light research, say botanical work on plant morphology, can be ignored. However the problem is rather more significant in the case of computationally heavy research such as genetics, or in rerunning complex models, such as are found in climate research.

In the case of less intensive work - the sort that can be run on a laptop - creating, saving and documenting a VirtualBox image might be a solution. We're assuming that VirtualBox images are future proofed, but at least we can guarantee that the state of the machine, in terms of libraries and so on, is identical to that used by the original researcher, and what's more it saves anyone trying to reproduce the research the travail of building an identical environment.

This of course does not scale to computationally intensive work. Cloud computing and Docker are perhaps our saviours here. Most 'big' computing jobs are run in the cloud these days, or at least in a cloud-like environment. Likewise Docker, which was designed to simplify application portability, can be our friend by allowing us to save the state of our computing environment - an idea recently put forward in a paper by Carl Boettiger.
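To give a flavour of the idea, a Dockerfile pins the whole environment - base image, system libraries, language packages - in a few declarative lines. This is only an illustrative sketch, not Boettiger's actual recipe; the image tag, package names and script name are stand-ins:

```dockerfile
# Illustrative only: pin a specific R release so the environment
# is rebuilt identically years later.
FROM r-base:3.1.2

# Install the analysis's package dependencies at build time,
# so they are frozen inside the image.
RUN R -e "install.packages(c('ggplot2', 'dplyr'), repos='http://cran.r-project.org')"

# Copy the analysis code into the image.
COPY analysis.R /home/analysis.R

# Running the container reruns the analysis.
CMD ["Rscript", "/home/analysis.R"]
```

Once built, the resulting image is a single file that can be archived and rerun anywhere Docker runs.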

This actually is quite a neat idea. As the Docker container is a digital object we can archive it and manage it in the same way as we manage other digital objects, using some form of data repository solution with all the capabilities such solutions have to guard against bitrot.

As a solution it also allows the archiving of valuable interactive resources such as websites, particularly those driven out of a database, as a lot of language resources are.

I like the idea of containerization as an archiving tool - it neatly sidesteps a lot of the problems associated with archiving software and interactive environments ...
