Thursday, 14 October 2010

Source code and data archiving

Interesting article (pdf, doi:10.1145/1831407.1831415) in this month's Communications of the ACM on whether scientists should release their source code along with their experimental data for review.

It's my view that they should. Large experiments in disciplines such as genomics, astronomy and physics often produce terabytes of data, far more than standard processing techniques can handle, which means the data is often filtered at the instrument level, sometimes by custom-built FPGAs.

And this quite simply means there is a risk of introducing artefacts through errors in the gate array code: we risk producing chimeras, i.e. results that aren't actually there, the digital equivalent of cold fusion.

This risk exists in every discipline that preprocesses its data, and here the source code is simply part of the experimental method and so should be open to review. The same applies to the code that processes the results. Errors can creep in, and not necessarily through coding errors on the part of the people carrying out the analysis: both the Pentium floating point bug and the VAX G_FLOAT microcode bug could have introduced errors. (In fact the latter was noticed precisely because running the same code on a VAX 8650 gave different results from running it on an 8250.)
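
As a concrete illustration (mine, not the article's), the Pentium flaw was famously demonstrated with a single division; the sketch below uses that widely circulated test case, with the caveat that any modern, correct FPU will of course report a residual of zero.

    /* Classic check for the Pentium FDIV bug, using the widely
       circulated 4195835 / 3145727 test case.  On a correct FPU the
       residual is exactly 0; affected chips were reported to give 256. */
    #include <stdio.h>

    int main(void)
    {
        volatile double x = 4195835.0;
        volatile double y = 3145727.0;

        double residual = x - (x / y) * y;   /* should be 0 */

        printf("x / y    = %.15g\n", x / y);
        printf("residual = %g\n", residual);
        return 0;
    }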

And this introduces a whole new problem for archiving:

If we archive the source code, can we be sure that it will run identically when recompiled under a different operating system with a different compiler?

It should, but experience tells us that this won't always be the case. And emulation, while it helps, is probably only part of the answer.
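
To make the point concrete, here is a minimal sketch (again my example, not the article's) of one way the same source can give different numbers after recompilation: floating point addition is not associative, so a compiler that is allowed to reorder arithmetic, for instance gcc with -ffast-math, can legitimately change the result.

    /* Floating point addition is not associative.  Evaluated left to
       right, (big + small) - big loses the small term to rounding and
       prints 0; a compiler permitted to reassociate (e.g. gcc with
       -ffast-math) may compute big - big + small and print 1 instead. */
    #include <stdio.h>

    int main(void)
    {
        double big   = 1.0e16;
        double small = 1.0;

        double r = (big + small) - big;

        printf("(big + small) - big = %g\n", r);
        return 0;
    }

Same code, same inputs, a different answer - exactly the kind of discrepancy that the VAX 8650 versus 8250 comparison turned up.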
