Sunday 21 May 2023

Literature searches in the time of Chat-GPT

 I have never been convinced by bibliometrics, viewing it as something between black magic and a shell game. The fact that Scopus is owned by Elsevier didn’t exactly help either.

It’s been my view that all these attempts to measure impact are flawed and incorporate unconscious bias against researchers who work at less prestigious institutions and perhaps do not publish mainly in English.

The reasons for this are complex, but I suspect it is in part because the reviewers and editorial boards tend to be drawn from a small number of anglophone institutions who tend to favour researchers working in institutions known to them.

And twenty years ago, they might have had a point. Computing resources were expensive and access to libraries and journals was difficult outside of institutions that took a full range of journals. (When I was a researcher forty years or more ago, I had an Inter Library Loan allowance to cover gaps in my home institution’s journal collection, but it did mean the process of reviewing the literature on a topic could be tedious as one waited for the loan article to arrive and then inevitably had to request another two.)

Nowadays it’s easier. Most laptops are powerful enough to run quite complex data analyses, R is free, open source software and a cornucopia of tools and techniques, online access to journals is relatively easy, and if there’s a problem getting hold of something, you can always hope the lead author is on ResearchGate or Academia.edu and amenable to providing an electronic offprint.

But searching the literature is still much the same. One starts with a search engine and a query.

Usually it’s Google, but it could be Bing or Kagi, all of which make use of large language models, with Bing being the most reliant on one.

Let’s say I was doing some fecal analysis at an archaeological site - old latrines and their deposits provide a host of information about what people ate, even if the contents are not the nicest to work with, and I had discovered a lot of raspberry seeds.

Raspberries do grow wild in Europe, so I might wish to know if they were cultivated, or gathered wild.

So a reasonable first query would be ‘when were raspberries first cultivated in Europe’.

I’d expect then to search for sources for the results, perhaps the results of other fecal analyses, but as part of the search process, the first results would be crucial. I ran the same query on the AI-enabled Bing, Google, Kagi, and, as a control, on the old-school Yandex search engine.

The results are, shall we say, inconsistent.


[Images: results from Bing, Google, Kagi and Yandex]


Bing, even though it quotes a less reliable source, is possibly the most accurate. There’s a lot of evidence for fruit cultivation starting in monastery gardens across Europe from the 12th century onwards.

Kagi is helpful, Google less so, and Yandex simply goes for Wikipedia, which is as good a solution as any.

None of them mentioned cesspits, so I reran the exercise specifically mentioning cesspits in the hope of getting more focused results.

When asked about raspberry seeds being found in cesspits, most of them found the same content. Only Kagi found detailed research, although Bing made a creditable attempt.

[Images: results from Bing, Yandex, Google and Kagi]


And what does this mean for research citation?

I’m not sure, but the differences suggest that the various large language models have biases, and probably it’s best at the moment to run literature searches on multiple search engines ...
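One rough way to put a number on how much the engines disagree is to compare the sets of result URLs each one returns for the same query. The following is only a sketch under stated assumptions: the engine names and URLs are placeholders, not the results of the searches described above, and `jaccard` and `compare_engines` are hypothetical helper names chosen for illustration. Collecting the result lists from each engine is left out entirely.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of results two engines have in common (0 = none, 1 = identical)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty result lists are treated as identical
    return len(a & b) / len(a | b)


def compare_engines(results):
    """Return the overlap score for every pair of engines."""
    return {
        (x, y): jaccard(results[x], results[y])
        for x, y in combinations(sorted(results), 2)
    }


# Placeholder data only: these are NOT real search results.
results = {
    "engine_a": ["https://example.org/1", "https://example.org/2", "https://example.org/3"],
    "engine_b": ["https://example.org/2", "https://example.org/3", "https://example.org/4"],
    "engine_c": ["https://example.org/5"],
}

overlaps = compare_engines(results)
for pair, score in overlaps.items():
    print(pair, round(score, 2))
```

A low score between two engines means they are surfacing largely different material, which is exactly the case where running the search on both is worth the effort.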

Wikis (again)

 Earlier today I read a Mastodon post about how someone had upgraded their personal website to a different static site generator, a topic about which I am woefully ignorant, although I can immediately see the value.

It may seem strange but, despite having been a computer fiddler since Algol W was trendy, I don't run my own website.

I have my blogs and my wiki of interesting links, but I don't have my very own server. 

Even though there are occasions where it might have been useful (and I admit to having had, in the past, instances of DSpace and Omeka running on a machine under my desk - purely for test and evaluation, your honour).

And the reason is very simple. Having once managed a content management system based web site, I'm acutely aware of the sheer amount of work required to keep things patched and secure. If you're into that sort of thing, that's great, but I've always felt that dealing with system internals is like dealing with waste water systems - you do it if you have to, but on the whole it's better to get someone else to do it.

And so it is with my links wiki.

The web view is boringly simple, nothing flash.

And that's because of one of the superpowers of a wiki - creating simple content is quick. I actually use Notepad as a simple text editor to write the text, markup and all, and then paste it into the wiki page editor and do a validation.

And because there's no complicated design or web wiggling, I can concentrate purely on the content rather than worry about the HTML or the page appearance, making page editing and updating pretty trivial.  

Trivial, because it's been separated from the furniture, the stuff that web designers and implementers do to make a site look both consistent and nice.

It's interesting that the Zola static site generator takes a similar approach: you generate some very simple furniture (actually it can be as simple or as complex as you want) and then add content, with the content being written in Markdown, something which simplifies content creation.
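The idea of keeping content apart from furniture can be sketched in a few lines of Python. This is not how Zola or any wiki engine is actually implemented; it is a minimal illustration using only the standard library, and `FURNITURE` and `render_page` are hypothetical names made up for the example. The furniture is written once as a template, and each page supplies nothing but a title and some paragraphs.

```python
import html
from string import Template

# The "furniture": written once, shared by every page.
FURNITURE = Template(
    "<html><head><title>$title</title></head>\n"
    "<body>\n<h1>$title</h1>\n$body\n</body></html>"
)


def render_page(title, paragraphs):
    """Wrap plain-text content in the shared furniture."""
    # Escape the content so stray < or & characters cannot break the page.
    body = "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)
    return FURNITURE.substitute(title=html.escape(title), body=body)


# The content: no layout decisions at all.
page = render_page(
    "History links",
    ["Roman roads & forts", "A smattering of nineteenth century stuff"],
)
print(page)
```

Changing the look of every page then means editing the template alone, while the content stays untouched.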

Thursday 18 May 2023

History links (plus some other stuff)

 As anyone who has been following along at home will know, I've largely abandoned social media, deleting my accounts to avoid any temptation to go back.

At the same time I've started collecting the links I might otherwise have posted to Twitter in a wiki, starting a new page every Friday.

This week's links are at

(and there's a link back to the previous week's links at the top of the page.)

It's mostly about Roman history and archaeology with a smattering of nineteenth century stuff.

I don't collect usage data, I'm purely doing this for fun, and the links are simply things that floated my boat ...