Stuff, geeky stuff: 12/01/2017

Thursday, 28 December 2017

Orage revisited

Way back in 2007 I wrote a fairly simple script to download a Google calendar file in ics format and stuff it into Orage, a desktop calendar application bundled with the Xfce desktop that came with Xubuntu.

I did it just to see how easy it was to do. Nothing more.

A year or so later I started using a PPC iMac with Xubuntu as my principal desktop machine, but I still didn't invest much effort in the script, even though some people at the time found it useful. I preferred to use Evolution to handle mail and calendar type stuff.

Fast forward to 2017:

For no good reason other than it was the day after Xmas I decided to see if I could get jpilot to import a google calendar file with a bit of handwritten code to convert the ics file to a basic palm compatible csv file.

Well, I haven't yet got as far as doing the csv conversion bit, as I found my Orage download script didn't work any more.

Orage now keeps its ics file in ~/.local/share/orage and Google's calendar file syntax has changed.

So I fixed it:

# Make sure the download target exists, and timestamp the log
touch ~/calendar/basic.ics
date >> ~/calendar/google_download.log
# Keep retrying every 30 seconds until wget delivers a non-empty file
while test ! -s ~/calendar/basic.ics
do
    wget -rK -nH \
      https://calendar.google.com/calendar/ical/yourprivateicsfile.ics \
      -O ~/calendar/basic.ics -a ~/calendar/google_download.log
    sleep 30
done
# Keep the previous calendar as a backup, then install the new one
if test -s ~/calendar/basic.ics
then
    mv ~/.local/share/orage/orage.ics ~/.local/share/orage/orage_old.ics
    mv ~/calendar/basic.ics ~/.local/share/orage/orage.ics
fi

Obviously you replace yourprivateicsfile.ics with the link to your private Google calendar ics file. If you are unsure how to find this, check out this Google help page - the bit you want is titled 'See your calendar...'.

I've also spread the wget command over three lines for improved readability. Depending on the unix shell you are using, you may need to get rid of the backslashes and turn it back into a single very long line to get it to execute.

wget now whinges about the combination of command line options but you can cheerfully ignore that (or fix it if you want)...
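For anyone who would rather fix it: the complaint most likely comes from combining -O (write everything to one file) with the recursive options, which a single-file download doesn't need. What follows is a minimal sketch rather than a drop-in replacement. The function names fetch_calendar and is_ics, the retry limit, and the check for a BEGIN:VCALENDAR first line are all my own assumptions, not part of the original script.

```shell
#!/bin/sh
# Sketch only: is_ics and fetch_calendar are hypothetical helper names.

# is_ics FILE
# Succeeds if FILE is non-empty and its first line starts with
# BEGIN:VCALENDAR, which is how an iCalendar file opens.
is_ics() {
    [ -s "$1" ] && head -n 1 "$1" | grep -q '^BEGIN:VCALENDAR'
}

# fetch_calendar URL DEST [TRIES]
# Download URL to a temporary file, retrying up to TRIES times
# (default 5), and only replace DEST once the download looks sane.
fetch_calendar() {
    url=$1
    dest=$2
    tries=${3:-5}
    tmp=$(mktemp) || return 1
    while [ "$tries" -gt 0 ]; do
        # No -r, -K or -nH here: a single file needs none of them.
        if wget -q -O "$tmp" "$url" && is_ics "$tmp"; then
            # Keep the previous calendar as a backup, if there is one.
            if [ -f "$dest" ]; then
                mv "$dest" "${dest%.ics}_old.ics"
            fi
            mv "$tmp" "$dest"
            return 0
        fi
        tries=$((tries - 1))
        if [ "$tries" -gt 0 ]; then
            sleep 30
        fi
    done
    rm -f "$tmp"
    return 1
}

# Example call (same placeholder URL as in the script above):
# fetch_calendar \
#   https://calendar.google.com/calendar/ical/yourprivateicsfile.ics \
#   ~/.local/share/orage/orage.ics
```

Downloading to a temporary file first means a failed or half-finished download never overwrites the calendar Orage is using.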

Wednesday, 20 December 2017

Using open source products for data collection

Following on from my little to-do with Excel and the problems in getting a product activation key updated when I was off the corporate network, I'm even more strongly of the opinion that open source products are the way to go.

While the organisation that provides our IT support resolved the problem efficiently, professionally, and with good humour, it did take an hour of phone calls. Given that I'm IT literate, even though I'm no Windows engineer, I do wonder how easy it would have been had I been a less expert user.

In contrast, with open source the maintenance overhead is so little - no licence keys to worry about, and while there is clearly still a day to day support cost, it's probably not much different from proprietary, and in these days of Google and StackOverflow, it's less than it might once have been.

There is of course a case for ensuring that the applications used are of suitable quality, and perhaps also a need for standard toolkits. It is unrealistic to expect individual researchers to do this, which is where product directories such as the DiRT tools directory play a crucial role in allowing researchers to select and use suitable tools. Equally, we also need to think about putting together a set of standard toolkits as a means of building up community knowledge of how to resolve common problems ...

Friday, 15 December 2017

Laptops for data collection

Over the years, a number of people have asked me about what I would suggest in the way of a computer for fieldwork, or research work in dusty libraries without internet or convenient power sockets.

Fieldwork computers tend to have a hard life: carried about repeatedly, bounced about in trucks, and always at risk from the wet, whether rain or spillages, and from dust and dirt.

My advice has always been to aim for the longest battery life for the lowest cost to keep the replacement cost down. Also these devices don’t need to do a lot - run a spreadsheet to record data, some sort of note management program and a text editor.

I’ve tried the cheap android tablet and keyboard combo, and that’s pretty good for straight note taking or even creating structured text (eg markdown) but tends not to shine for creating tabular data. Which is a pity as they are cheap enough to be treated as a consumable.

So recently I’ve swung back to the refurbished netbook or laptop with linux, and a combination of basic tools. The software base of linux is so large that you can find just about anything, but I tend to favour CherryTree for notes management, Gnumeric for recording tabular data, gedit or kate for basic text, and perhaps something more specialist such as ReText for structured text, although kate’s syntax checker is pretty good.

If you want something for writing up draft reports, Focuswriter is fast and lightweight.

The downside is that battery life is poor. Two hours, three hours at most. Not enough for a decent session.

However, there are a number of these cheap eMMC memory based windows laptops available. Mostly I’ve avoided these as the amount of storage, typically 32GB, is too small, given that Windows will take around 20GB, depending exactly how it’s configured.

Add a few extra programs and a bit of data, and there’s not a lot of headroom there. However, devices with 64GB of storage are beginning to appear at a reasonable price. For example, the Lenovo Yoga 310-6K can be picked up from the usual suspects for around $400 - 450, which is about the midway price for a refurbished laptop.

But there are two downsides to the refurbished laptop route - firstly if you want to keep windows, you’ll probably end up having to pay for a Windows 10 upgrade, and secondly battery life won’t be great. And if you go for an older or cheaper machine it’ll probably have a 5400 rpm SATA drive, so you won’t be getting lightning disk performance anyway.

These cheaper eMMC laptops come with Windows 10. Versions of CherryTree, Gnumeric, and Focuswriter are available for windows. There’s always notepad or windows Codewriter as an editor, and if you need something a little more flashy for structured text there’s Typora, or Texts.io which will cost you around US$15 for a licence key.

What of course you’re getting is the longer battery life. You also get the bonus of being able to use the device in tablet mode, which makes showing people images - be it of plants, finds, sites, or handwritten text - much easier than on a laptop. The other bonus is OneNote, Microsoft’s note management tool.

I didn’t use to like OneNote - it seemed clumsy and slow compared to Evernote, but since working on the Dow’s Pharmacy project I’ve warmed to it.

Evernote remains the best ragbag management tool ever for categorising snippets garnered from everywhere. OneNote really isn’t good at imposing structure on chaos. What it is good for is building up a collection or collections of related notes - a subtle difference but an important one.

And of course you can have the best of both worlds and have both Evernote and OneNote on your machine.

So, what would I choose?

A few months ago I would have gone down the refurbished laptop with linux route, and if we’re talking about clever stuff like using R or iPython notebooks for on site data management and analysis I still would. For pure data collection, I’m not so sure. The increased storage and longer battery life certainly make these eMMC based devices an interesting option ...

Update 16/12/2017

I've ignored iPads - deliberately - simply because they have the same problem as an android tablet: the lack of a decent software base for data entry.

Friday, 1 December 2017

More on spreadsheet preservation and normalisation

Yesterday, inspired by a post about preserving Google sheets, I blogged about spreadsheet preservation in general.

As is the way of these things the question has been rumbling round my brain ever since.

A long time ago, the National Archives of Australia released Xena, a normalisation tool that converts files into open xml based formats - essentially the open office formats used by LibreOffice and others - on the basis that the xml produced is both documented and readily parsable, and that it would be possible to recover the data and the calculations from any preservation file.

And in fact when we built the original ANU data archive, we silently implemented this normalisation process as part of the workflow. We didn't use Xena. Instead we used PRONOM to work out whether we could recognise the file type, and if we had a normalisation engine for that type - essentially an xml export tool - we would use it to produce a long term preservation copy, which we would store, along with the original, in a BagIt archive.

The idea of storing both, of course, is that as we didn't test the normalisation processes, and tended to trust the tools, it is just possible we could have produced garbage as part of the normalisation process.

In fact we deliberately ignored the year 1900 problem, as we reckoned that only a small number of spreadsheets would be affected.
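To make the "store both" idea concrete, here is a minimal sketch of a BagIt bag holding an original file and its normalised copy. It is not the ANU workflow code, just an illustration of the layout: payload files under data/, a bagit.txt declaration, and a checksum manifest. The function names make_bag and verify_bag are hypothetical, and sha256sum from GNU coreutils is assumed to be available.

```shell
#!/bin/sh
# Sketch only: make_bag and verify_bag are hypothetical helper names.

# make_bag BAGDIR FILE...
# Copy each FILE into BAGDIR/data and write the two tag files a
# minimal bag needs: bagit.txt and a payload manifest.
make_bag() {
    bag=$1
    shift
    mkdir -p "$bag/data" || return 1
    for f in "$@"; do
        cp "$f" "$bag/data/" || return 1
    done
    # Bag declaration
    printf 'BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n' \
        > "$bag/bagit.txt"
    # One "checksum  path" line per payload file, relative to the bag
    ( cd "$bag" && sha256sum data/* > manifest-sha256.txt )
}

# verify_bag BAGDIR
# Re-check every payload file against the manifest.
verify_bag() {
    ( cd "$1" && sha256sum -c --quiet manifest-sha256.txt )
}

# Example call, with made-up file names:
# make_bag survey_bag survey.xlsx survey.ods && verify_bag survey_bag
```

Because the original and the normalised copy both sit in the payload with their own checksums, a later check can at least tell you that neither has changed since ingest, even if it cannot tell you whether the normalisation itself was faithful.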

So what does this mean for Google sheets?

Exporting to an xml format such as ods would seem to be the way to go, but given that it's not possible to preserve the original document, the sensible thing would be to download the spreadsheet in two formats, both ods and xlsx, given that both are in xml and that parsers exist for both formats.

The reconstituted spreadsheets should of course give identical results when imported into the appropriate utilities.

Exporting a single sheet spreadsheet as csv, or whatever, is only appropriate where there are no calculations involved, an example being where the spreadsheet was used to record species abundances in a number of quadrats.

The decision about whether to use an ascii format such as csv is best left to the researcher: they know their data, and whether it's appropriate.
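As a rough sketch of the two-format download: Google Sheets has, as far as I know, an export URL that takes a format parameter. The URL pattern here is an assumption on my part rather than something from a documented API, it only works without logging in for sheets shared by link, and the function names export_url and download_both are hypothetical.

```shell
#!/bin/sh
# Sketch only: export_url and download_both are hypothetical helpers.

# export_url SHEET_ID FORMAT
# Build the export URL for one format (assumed URL pattern).
export_url() {
    printf 'https://docs.google.com/spreadsheets/d/%s/export?format=%s\n' \
        "$1" "$2"
}

# download_both SHEET_ID OUTDIR
# Fetch the same spreadsheet as both ods and xlsx into OUTDIR.
download_both() {
    mkdir -p "$2" || return 1
    for fmt in ods xlsx; do
        wget -q -O "$2/$1.$fmt" "$(export_url "$1" "$fmt")" || return 1
    done
}

# Example call, with a placeholder sheet id:
# download_both yoursheetid ~/preservation
```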

The standard procedure should be to use a richer xml based format, and preferably two of them.

Ideally there should be some sanity checking before ingest ...
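One very cheap sanity check, offered as a sketch rather than a complete validation: both ods and xlsx files are zip containers, so a file that doesn't start with the zip signature is certainly not a usable spreadsheet (a saved login or error page is a common way to end up with one). The function name looks_like_zip is hypothetical, and passing this test says nothing about whether the contents are sensible.

```shell
#!/bin/sh
# Sketch only: looks_like_zip is a hypothetical helper name.

# looks_like_zip FILE
# Succeeds if FILE is non-empty and begins with the zip local file
# header signature "PK\003\004" (bytes 50 4b 03 04).
looks_like_zip() {
    [ -s "$1" ] || return 1
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "504b0304" ]
}

# Example: refuse to ingest anything that fails the check
# for f in sheet.ods sheet.xlsx; do
#     looks_like_zip "$f" || echo "not ingesting $f" >&2
# done
```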