Wednesday 30 September 2020

Of bookscanning and image sizes

 J, my life companion, is an accomplished pastel artist, and wanted to put some of her artwork into a competition.

Pre-Covid, this would have meant selecting a picture or two, getting them framed, driving somewhere, and watching someone from the exhibition team put them on the wall.

This year, of course, everything is different. Pictures are photographed, and the images uploaded to the exhibition website, where they are loaded into some gallery software.

Now, what was interesting about this process is that the exhibition organisers said to use a digital SLR for the images, not a mobile phone, because of the image quality.

Now, J's artworks are normally somewhere between A4 and A3 in size (those are the sizes specialist paper for pastel work comes in), and for archival purposes she takes a picture with her iPhone, which has an 8 Megapixel camera, and archives them in iCloud, using what I'll call iPhoto (it's actually called Photos these days).

Apart from iPhoto's tendency to produce smaller than expected jpegs on export, this works well as a process.


Internally, Photos uses the newer High Efficiency Image File Format (HEIF) rather than one of the other more standard formats, to achieve an efficient use of resources using what is claimed to be lossless compression.

As always, we can argue about compression, image formats and archiving, but using HEIF is no more at risk of introducing compression artefacts than anything else, and may even be better as it is claimed to be lossless.

Professionally though, most people use cameras for archiving work rather than mobile phones.

We've all seen pictures of archivists using digital SLRs mounted vertically on a stand to take images of old photographs, and obviously when you don't know the exact size of the image and want a high quality image this makes sense.

But the question is what is good enough?

Well, my little experiment using a photoscanning app on a phone has convinced me that a phone produces a good enough image, even if the OCR output would need a little work:


and there was a report in Nature this morning (which I retweeted) about a group of scientists using the Covid hiatus to scan old lab notebooks.


Now the interesting thing is that most of the work was done using mobile phone cameras and a phone scanning app - in other words the scientists concerned found the images perfectly adequate.

At the same time, if one searches for a book scanner on Google Shopping or Amazon, one gets results similar to this:


Delving into the specifications, one finds that they all use a camera with a fixed image size - the cheaper ones tend to be designed to image only a set page size, usually A4, while the more sophisticated 'bendy' ones can be adjusted to scan a page up to a maximum paper size - usually A4 or A3. All, or almost all, use either an 8 Megapixel or a 5 Megapixel camera, with the better or pricier devices using an 8MP camera and the cheaper fixed image size devices a 5MP camera.

I don't know this, but I'd guess that the scanners are using mobile phone camera assemblies. An 8MP image of an A4 page would give you roughly 300 dots per inch, which is pretty sharp and as sharp as many high quality printed images. (If you are planning to OCR the text, you actually don't want a supersharp image of old typeset pages, as this can introduce artefacts that confuse the OCR software.)
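As a back of the envelope check on that 300 dots per inch figure, here is a minimal sketch. It assumes a 4:3 sensor of 3264 x 2448 pixels, which is a common 8MP layout but not something I've confirmed for any particular scanner, and it assumes the page exactly fills the frame, which it never quite does in practice. The function name is mine.

```python
# Rough check of the "8MP on A4 is about 300 dpi" figure.
# Assumptions (not measured): a 4:3 sensor of 3264 x 2448 pixels,
# a common 8MP layout, and the page filling the whole frame.

A4_MM = (210.0, 297.0)   # ISO 216 A4, short side x long side
A3_MM = (297.0, 420.0)   # ISO 216 A3
MM_PER_INCH = 25.4


def pixels_per_inch(pixels, page_mm):
    """Return (short-side, long-side) resolution in pixels per inch."""
    short_px, long_px = sorted(pixels)
    short_mm, long_mm = sorted(page_mm)
    return (short_px / (short_mm / MM_PER_INCH),
            long_px / (long_mm / MM_PER_INCH))


if __name__ == "__main__":
    for name, page in (("A4", A4_MM), ("A3", A3_MM)):
        short_ppi, long_ppi = pixels_per_inch((3264, 2448), page)
        print(f"{name}: short side {short_ppi:.0f} ppi, long side {long_ppi:.0f} ppi")
```

Under those assumptions A4 comes out at a little under 300 pixels per inch (about 296 on the short side and 279 on the long side), while the same sensor stretched over an A3 page drops to roughly 200.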

So, where does that leave us?

For J's artwork, 8MP is probably good enough for a sub A4 image, and for book scanning it's certainly good enough for OCR.

If your image is bigger, yes there's probably an advantage in using a higher quality camera, but for most purposes 8MP is good enough ...



Sunday 27 September 2020

Huawei mediapad

 It's no secret that I like messing around with old documents.

Normally, when working with digitised content, I use an old 2008 vintage iMac - long unsupported but still with an excellent screen - to display the item, and I'll type the notes into a laptop.

I could, I guess, have a single machine with dual screens, but at the moment this works for me. What this solution is not, is portable, which can be a pain when working somewhere like a library (which I haven't done for six months because of Covid).

Now the little note taking iPad I bought myself a year or two ago has become useful as a carry around device - but the screen is a little too small to work with when looking at old documents.

Given that I normally work off of a laptop, I decided that a standard format tablet would probably be the thing to go for.

An iPad Air would have filled the bill, but not at the price Apple charge for a new one, and decent refurbished items have disappeared off the market.

So that meant Android.

Now if you go to any of the big box stores you have a choice of Lenovo or Samsung, and the items with decent quality displays are reasonably pricy. 

So I read some reviews, and overseas people seemed to rate the Huawei mediapad - decent screen, good battery life etc. There are two models, and the better specc'd 64GB model isn't currently available in Australia - except that for some reason Amazon will sell you one from Amazon UK via their marketplace.



So that's what I did.

It only took a couple of weeks to get here. Gratifyingly it was not crammed full of bloatware, giving you a fairly vanilla machine to work with. The only problem was of course that it came with one of these bizarre UK claw chargers:


which wasn't a problem as, like most people, I've oodles of spare micro-USB chargers. I've also got an array of international sockets in my workshop dating from the days when I used to play with kit from overseas.


Setup is standard Android, and the tablet comes bundled with the Office 365 tools for Android. There's not a lot in the way of unnecessary bundled apps, but the device comes with Huawei's own app store as well as Google Play. The whole setup experience is pretty vanilla.

In use the device is responsive, and the bundled Microsoft SwiftKey virtual keyboard is one of the nicest I've come across. Screen quality is as good as promised, and the device is light and sits nicely in the hand.

Definitely a business class machine despite its low price.

Huawei include their own mail client and 5GB of their own cloud storage, but there's no compulsion to use them, or their own app store - you can just as easily use your own preferred mail client and cloud provider, and delete their apps from the device should you prefer.

With Huawei being banned from 5G networks in Australia, and the recent reported hacks of university computer data, there are obviously going to be some questions around security.

Personally, I think that if, like me, you are a private individual, your data is probably no more at risk with Huawei than with any other cloud provider.

If, however, I still worked for a university or a government body, paranoia might kick in and I might think twice about buying such a device, but equally, you can be too paranoid - after all Telstra, no less, sold me a Huawei 4G broadband modem a couple of years ago ...

Equally, if you want to be careful, you can simply avoid installing applications like online banking on the device, or simply access them via the web.

It's a shame that the Mediapad is not better known in Australia. It's a good, well made, well priced device that does what it says.






Saturday 12 September 2020

What should happen when an online journal dies

 

Over the last few days I’ve tweeted links to a research paper and two news articles, one from The Register, the other from Nature, on the phenomenon of disappearing open access journals.

I must say I’m not surprised.

While I have never worked on an open access journal, I have built a number of data repository solutions for both higher education and government, and was once even on the management committee for the long gone UK Higher Education National Software Archive.

And if there’s one problem with every solution I’ve built, it’s sustainability.

While the systems are comparatively cheap and simple to deploy – you can build an Omeka instance in an afternoon, and building a non customised DSpace install is similarly quick – production based systems need hardening, security and customisation, all of which requires a small team of software engineers – usually about two – and a part time manager to manage the install and deployment of the solution. Because the only metric we have is money, we can say that if deployment takes a year it will cost around $300,000.

Pre-cloud, and pre-virtualisation, the cost of hardware and storage was a significant consideration – nowadays, less so, so let’s stick with the $300,000 annual cost but assume we manage to deploy and get signed off in less than twelve months, and that we are using a virtualised server and cloud based storage. Sure there are hosting fees and storage costs, but you don’t need to worry about redundancy, backups, and maintenance costs for the hardware – a lot of these costs are simply abstracted into your monthly hosting and cloud storage bill.

After you’ve got your solution deployed, there’s probably less work for your deployment team, but they still need to have a role patching your repository or journal system, adding features, and so on.

So while you may not need so much of your repository or journal system team’s time, you’ll still need a reasonable bit of it, so let’s stick our fingers in the air and say that the ongoing cost of running a solution is around $200,000 a year.

Remember that’s the cost of keeping it running. It doesn’t cover any of the costs, in the case of a journal solution, associated with managing the publication workflow – getting the submitted paper in, out to the peer reviewers, back from the peer reviewers, updated, revised, returned to the reviewers etc.

It’s quite a lot of work and requires employing at least a couple of staff and a journal manager. Obviously, you can reduce your costs by running a preprint server as opposed to an open journal solution. Typically, though, preprint servers do not charge a submission fee, and trust that anyone submitting a preprint cares enough about their academic reputation not to publish rubbish.

Many open access journals work by charging a fee for you to publish your research – for example PLOS One charges a one off fee of US$1350. In the case of PLOS One, a well known journal with high impact scores, they almost certainly have a submission rate that allows them to cover their operating costs.

For smaller journals, and ones dealing with a highly specialist area, it may be difficult to charge a fee sufficient to cover their costs, or indeed achieve a submission rate that generates a sufficient level of income.
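To put a rough number on that, here is a minimal sketch of the break even arithmetic. It takes my finger in the air $200,000 a year running cost and the PLOS One fee of 1350 at face value, and for simplicity treats the two as if they were in the same currency, which they probably are not. It also ignores the editorial staff costs mentioned above, so the real figure would be higher. The function name is mine.

```python
import math


def breakeven_articles(annual_cost, fee_per_article):
    """Smallest whole number of paid articles per year that covers annual_cost."""
    if fee_per_article <= 0:
        raise ValueError("fee_per_article must be positive")
    return math.ceil(annual_cost / fee_per_article)


if __name__ == "__main__":
    # Assumption: both figures treated as the same currency, hosting costs only.
    print(breakeven_articles(200_000, 1350))
```

On those assumptions a journal needs about 149 paid articles a year just to keep the servers running, which is a lot for a small specialist title, and lowering the fee only pushes the number up.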

Inevitably, that will mean that the cost of running the journal is subsidised in some way by a learned society or by an academic institution, sometimes for reasons of prestige.

Now times are hard in academia. Government funding is grudging to say the least, and in these Covid times, student fees don’t provide the income they once did.

And departmental managers then look at the $200,000 or so it’s costing them to host a journal and not unnaturally think ‘we could get three, even four, postdocs for that, and they might do something significant’.

And so the journal ceases publication.

But of course it doesn’t end there.

To keep the already published content available, you need to keep the server running and patched, which means employing someone with suitable skills. In the old days you could trust that some libraries would keep the old issues on the shelf. With electronic journals it's a little more tricky.

So not surprisingly, sometimes the host ends up killing the whole thing and the content simply goes. Specialist dark archives such as CLOCKSS sometimes ensure that the content survives, but CLOCKSS is not resourced to cover everything, so smaller journals might simply be missed and disappear through the cracks.

People who start small specialist journals sometimes fail to understand that starting a specialist journal is a bit like owning a cat – when you take on a cat you agree to cover its costs, feed it, take it to the vet, and in return you get affection, companionship and the occasional dead rodent – but the point is that you agree, implicitly, to pay for the animal for the fifteen or seventeen years of its life, and if circumstances change you get the animal rehomed so it can continue to scratch furniture for the rest of its natural life – in other words you have a tacit sustainability plan.

Small online journals need to have such a sustainability plan to cover what happens when the host institution can no longer afford to cover the costs of the journal, including alternative hosting arrangements …