Monday, 23 September 2013

So what do you actually use - Q3 update

Since my Q2 update I have of course become a Chromebook user - and that's the major change this quarter ...
  • Dropbox – used mainly to sync files across computers irrespective of file format
  • Libre Office – platform agnostic document editor for offline writing. Often used in conjunction with Dropbox
  • Evernote – used as a notes and document management system (Nixnote is used on Linux to access my Evernote files)
  • Wunderlist for 'to do' list management
  • Chrome – browser extraordinaire
  • Gmail – email solution
  • Postbox - lightweight email client for Windows to cope with slow connections
  • Evolution - Linux email client principally used in conjunction with Libre Office
  • Google docs – fast means to create quick and dirty documents irrespective of platform
  • Windows Live Writer – offline blog post creation
  • TextEdit – Android text editor for note taking that integrates nicely with Evernote and Gmail
  • Kate - my favourite editor
  • TextWrangler - my second favourite editor
  • Stackedit - Google chrome markdown editor (and blog posting tool)
  • Pandoc - converts markdown to a range of other formats
  • Microsoft Skydrive – used for document backup
  • Excel Web App – for those occasions when Google Spreadsheets or Libre Office Calc will not do
  • GanttProject for Gantt chart generation
  • InoReader for RSS feed tracking
  • Twitter for tracking interesting things – rarely for messaging
  • Hosted Wordpress and blogger for blogging, and wikidot for creating structured web pages
  • Hojoki for tracking documents and tasks (Gives unified visibility of GoogleDocs, Skydrive, GitHub, Dropbox and Evernote)
The real change has been to the hardware used. My trusty old Android tablet is still in use for checking email and reading news websites at breakfast time - as evidenced by some of the gluckier marks on the screen. The newer seven inch device is still in use as a note taker and I see no reason to change for the moment. The real change has been the Chromebook. It allows me to check my email, create quick and dirty drafts using either Google Docs or StackEdit, and surf the web and research things. If anything has ever demonstrated how much of my day to day reading and specification checking has moved to the web, the Chromebook certainly has.
It's also fast, well fast enough, boots quickly and shuts down quickly. It's not a full featured computer but it most definitely provides on the go functionality.
In fact it shows why my original Asus netbook was such an effective tool and the Windows netbook a bit of a clunker - basically load time. The platform is irrelevant; it's access to a browser that counts.
Incidentally, over the last month I've been failing to upgrade the memory in the Windows netbook. It turns out that some models use DDR2 and some use DDR3. I got myself some DDR3 and it turns out I've got a model that uses DDR2. What is hopefully the correct module is currently somewhere between Shenzhen and here. If it works I'll write up the whole upgrade saga ...
Written with StackEdit.

Tuesday, 17 September 2013

Using Alpine with

On Friday I wrote that you could now use imap with Microsoft's mail service.

I also wrote that I'd had problems getting it to work with Evolution on a VirtualBox VM running CrunchBang. I still don't know why it didn't work but I'm happy to report that Alpine - the newer updated replacement for Pine - works just fine on the same VM.

Pine is a mail program with a heritage going back to the early nineties and was one of the first mailers to use IMAP.

It was extensively used on multi user Unix systems, and when I was managing York's early nineties managed PC desktop service we used PC-Pine as a PC mail client due to licensing and performance problems with the other Windows IMAP clients available at the time. (This wasn't such a problem as it might have been: existing Unix Pine users made the shift pretty seamlessly, we could reuse the training materials and documentation from the Unix version, and I wrote a program (in Turbo Pascal no less) to automatically populate a user's configuration file the first time they ran the application.)

I of course haven't used Pine seriously for years, and had never used Alpine in production, so I used Sanjeeva Wijeyesakere's post on setting up Alpine with Gmail as a starting point. Basically if you follow his advice but set the inbox path to


and the smtp server to

as well as setting the domain name

and the personal name


it all worked. Obviously you replace myusername with your account name. Being old school I edited the .pinerc file directly with nano rather than using the Alpine configuration menu. You could of course use gedit, vi, or any other text editor.
Written with StackEdit.

Specially for text analysis people

Project Gutenberg have just released A Middle English Vocabulary by John Ronald Reuel Tolkien, which looks to be a nicely structured document and one from which it would be comparatively easy to extract wordlists etc ...

Written with StackEdit.

Friday, 13 September 2013

and evolution

Microsoft have recently announced that their email service now supports IMAP.
So I thought I'd try it with that well known Linux GUI email client, Evolution. I've previously got Evolution to work with Office 365 so I thought it would be straightforward - change the name of the servers and set the client to be IMAP and it should work (and royally confused me to boot).

It does - or more accurately, with Evolution on Ubuntu on a standalone, real machine it works fine. Installed on a CrunchBang Linux virtual machine it doesn't, for reasons I haven't got to the bottom of.

However setting it up is quite simple - basically if you give a Microsoft mail service style address evolution will try and set you up for pop.

To override this, change the server type under receiving mail to IMAP, set the server to and the port to 993 with SSL encryption enabled. For sending mail set the server to  with port 587 with TLS encryption - it should then just work ...

As I've said before the only real use case for this is Libre/Open Office integration and the ability to send emails from inside the application. I suspect that the number of habitual Linux users with accounts is fairly small, unless like me, you've been using the service since 1998.
Written with StackEdit.

Students and Public access computers

In the building in which I work we have a raft of public access computers for students.

In the old days - like last year - students would come up to them, log in, do work, such as writing papers or running some specialist software, log off, and go away.

Recently I've noticed an increasingly common trend whereby they put their laptop on the desk, work on their laptop, and at the same time use the public access computer to access some specialist resource.

It's not a rational behaviour but I've seen it often enough to reckon that it's a thing now.

It looks like either students have not worked out how to share content from their university filestore (it's a WebDAV mount), or we've failed in the communication business by either not telling them this, or not making it easy for them to push data back and forth.

It also lends credence to my belief that students have to a large extent self outsourced their computing needs, and that the name of the game is connectivity and access to specialist services...
Written with StackEdit.

Thursday, 12 September 2013

Using the scots stopword list on Barbour's Brus

Well, having made a stopwords file the thing to do is test it.

I chose to use the text of Barbour's Brus, as the Oxford Text archive copy was fairly clean of inline markup, clean enough to fix by hand rather than modifying my original text cleaning code.

The first time around the results were not quite what I expected:

so I modified the stopwords list by removing the following from the list:

and adding:

which gave a better representation:

A little more tweaking might be required, but it has promise as a technique.

This statistical generation of stopword lists could also be applied to analyses of bodies of scientific literature by generating discipline specific extra stopword files, so one could filter out the common noise words to get a better impression of a research group's strengths and focus from their published papers. That is increasingly important, as at least one study of search practices among researchers suggests a dependence on Google and by implication its search algorithms.

Building topic or keyword extraction models may help counter this by allowing the generation of 'other related' lists ...

Making a Middle Scots stopword file ...

Over a year ago I played with topic modelling and wordclouds. As always the reason has not quite gone away, and a year on, I thought I'd better teach myself how to do it properly using R.

Now one of the things I found when I played about with wordclouds is that if you feed Middle English text into a wordcloud it does help to have a Middle English stopword file.

Playing with the Gutenberg version of Troilus and Cressida I found it was quite easy using R to come up with a stopwords file based on the 100 most common words in the file excluding the names of the protagonists.

The choice of 100 words is purely arbitrary - links some example stopwords files for modern English and they sit around the (200 +/- 50) mark. Chaucer used just over 5600 distinct words in Troilus so we'll assume that the hundred most common words are a valid stopword list. (In fact, applying the eyeball test, a stopword list of around 70 is probably close enough.)
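The approach described above, taking the N most common words and leaving out the protagonists' names, can be sketched in a few lines. The original work was done in R; this is a Python stand-in, and the function name `build_stopwords`, the tokenising rule and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter


def build_stopwords(text, n=100, exclude=()):
    """Return the n most frequent words in text, skipping any in exclude."""
    # Assumed tokenising rule: lower-case, then keep runs of word
    # characters and apostrophes, so letters such as yogh or thorn survive.
    words = re.findall(r"[^\W\d_']+(?:'[^\W\d_']+)*", text.lower())
    excluded = {name.lower() for name in exclude}
    counts = Counter(word for word in words if word not in excluded)
    # most_common returns (word, count) pairs in descending order of count.
    return [word for word, _ in counts.most_common(n)]


# Tiny made-up sample; a real run would read the whole poem from a file.
sample = "and troilus seyde and criseyde seyde and the the lord"
stopwords = build_stopwords(sample, n=2, exclude=["troilus", "criseyde"])
print(stopwords)
```

Changing `n` is all it takes to try the 70 word list suggested by the eyeball test instead of the 100 word one.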

Now, a stopword list based on a single poem might be interesting, but it's not very useful. You need a number of poems to come up with a stopwords file that's valid for a particular author.

Then you can do such tricks as comparing the frequency of words (minus the stopwords) between poems. If one has a very different distribution of words it might be by a different author.
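As a rough sketch of that comparison, the Python below turns each poem into relative word frequencies (stopwords removed) and sums the absolute differences. The distance measure is my own assumption for illustration, not something the experiments above used, and the function names and sample lines are hypothetical.

```python
import re
from collections import Counter


def relative_frequencies(text, stopwords):
    """Map each non-stopword to its share of the remaining words."""
    words = [w for w in re.findall(r"[^\W\d_]+", text.lower())
             if w not in stopwords]
    total = len(words)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(words).items()}


def frequency_distance(freq_a, freq_b):
    """Sum of absolute differences: 0.0 for identical profiles, 2.0 for disjoint ones."""
    vocabulary = set(freq_a) | set(freq_b)
    return sum(abs(freq_a.get(w, 0.0) - freq_b.get(w, 0.0)) for w in vocabulary)


# Made-up sample lines standing in for two poems.
stop = {"the", "and"}
poem_a = relative_frequencies("the knycht and the king", stop)
poem_b = relative_frequencies("the lordis and the prince", stop)
print(frequency_distance(poem_a, poem_b))
```

A large distance on its own proves nothing about authorship; it only flags a pair of texts as worth a closer look.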

So having discovered how to make a stopwords file I thought I'd make a stopwords file for Middle Scots and then see if I can find frequency differences between various poems by various authors as well as using it to generate wordclouds.

For the corpus I chose the works in the Oxford Text Archive Early Scottish Texts archive. I chose Middle Scots quite deliberately, as (a) it was different enough from contemporary English in its spelling to treat as if it was a different language, (b) there was a decent body of online text available and (c) it didn't do anything complicated with word endings other than using 'is' for a plural rather than 's'.

As such it meant that I could use standard off the shelf programs written for contemporary English, but simulate using them on a different language with all the default assumptions turned off rather than relying on someone else's choice of stopwords.

The files came with some angle bracket delimited non-standard markup which was probably intended to be read by some other program. I wrote a simple Perl script to remove this markup, remove irrelevant bracketed inline text such as ( ITM ), and a few other stray characters, and while I was at it converted the files to lower case for future processing.
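The cleaning steps above can be approximated as follows. This is a Python sketch rather than the Perl script itself, and it makes assumptions: that every parenthesised span is disposable, and that "stray characters" means digits and punctuation other than apostrophes. The function name `clean_text` and the sample line are invented for illustration.

```python
import re


def clean_text(raw):
    """Strip markup and stray characters, then lower-case the text."""
    # Remove angle bracket delimited markup such as <l n=1>.
    text = re.sub(r"<[^>]*>", " ", raw)
    # Remove bracketed inline text such as ( ITM ).
    # Assumption: nothing inside round brackets is worth keeping.
    text = re.sub(r"\([^)]*\)", " ", text)
    # Assumption: stray characters are digits, underscores and punctuation
    # other than apostrophes. Letters such as yogh are kept.
    text = re.sub(r"[^\w\s']|[\d_]", " ", text)
    # Collapse the runs of whitespace left behind by the removals.
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()


print(clean_text("<l n=1> Quhen ( ITM ) the Kingis* men"))
```

Each substitution replaces with a space rather than nothing, so words either side of a removed tag are never run together.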

I didn't try to fix any orthographic quirks - I made the assumption that all the likely stopwords would be words in common use with an agreed spelling. Given that I'd ended up with a sample of around 860,000 words I was running on the basis that any really common variant would probably turn up in the stopwords file.

After some final processing with R the source text contained just under 50,000 unique items, which is probably a rich enough corpus, although this may be masking orthographic quirks. The resulting stopword list consists of the first 200 words in the frequency list.

Of these I'd say four are possibly words you might wish to exclude:

prince (444th most common)
kingis (462) 
knycht (565) 
lordis (526)

The csv file containing the 500 most common words is also available for download if you wish to make your own decisions as to what should be in the stopwords file ...

Monday, 9 September 2013

Skydrive and the student filestore

A year ago I blogged that skydrive had killed the student filestore. At the time I argued that as students increasingly had multiple computing devices they would tend to self-outsource and store their work on solutions such as google drive, skydrive and dropbox.

And a year on, I see no reason to think otherwise. Students do seem to be using such solutions to store their work. They probably start at high school or college, and carry the habit on to university.

For the rest of this post I'll talk about skydrive, but it's really shorthand for all the options out there.

The one thing that skydrive doesn't do is provide network shares. You cannot use it interactively with an application as if it was a bit of mounted filestore.

As always, that's not quite true: there are solutions like Gladinet that let you do this, which is useful for things like background automated backup, but it really doesn't give you a truly interactive service like a share, purely because it's just too slow.

So that got me thinking again about filestores and why we have them. In part it's tradition, just like providing an interactive time sharing Unix box. We've done it so we keep on doing it, ignoring the fact that the box never has more than two or three sessions live on it at any one time.

We started providing filestore on a large scale to students twenty or more years ago when computers started becoming really common in education. In the main we did it because we couldn't provide a computer for everyone, and so went for the public toilet model of computer provision - lots of more or less similar computers with more or less similar software. Didn't matter which one you used, it was all the same.

Students of course needed somewhere to store their work between sessions, and making them use floppy disks or other local removable storage was impracticable for a whole range of reasons, so we took to providing filestore.

In the meantime, computers have become cheap enough so that anyone who can afford course fees can afford a computer, and one that contains more storage than anyone is likely to use over their course.

The result is that students have self outsourced for all the routine tasks like essay writing and project reports.

In fact the only reason for providing filestore is to allow access to specialist software, whether we deliver this via some sort of VDI solution or via the classic public toilet model - in short they need enough storage for coursework that requires the use of specialist facilities and a means of getting data off it.

They have more storage available through services like skydrive than we are likely to provide.

A few days ago I trawled some UK university websites (I chose the UK because it is the start of the academic year there and thus what it says about provision is current).

Most sites seem to offer between 1 and 2GB storage - quite a lot offer only 1GB - significantly less than Skydrive's default of 7GB and Amazon's 5GB, but they all offer ways of easily moving data to and from the filestore, ie there is a tacit admission that they are no longer the primary storage provider.

So, what does this mean?

As long as students need access to specialist facilities they will still need filestore as a place to write out their work and to store work in progress between lab sessions.

This storage requirement is fairly modest as students have ready access to other storage and consequently we should actively expect them to wish to upload and download data.

The storage requirement then ceases to be onerous, a few terabytes at most, and one that can easily be met by the provision of off the shelf NAS technology. Implicit in this is moving responsibility for looking after their work to students, rather than looking after it for them, meaning that we no longer put substantial resources into mirroring or backing up the student filestore as we treat it purely as work in progress filestore.

Given that students have already self outsourced this is not as big a change as it might be, but it is a move that should not happen by default.

Such a move should also be accompanied by a push for better education about data management, the sensible use of commercial storage services, and the risks involved ...

Wednesday, 4 September 2013

When your desktop is in the cloud ...

... you don't care about the operating system on the device in front of you.

The old arguments about how the software base is what sold machines and put Windows in such a dominant position no longer apply. If (and it is still a big if) everything you use is abstracted to the cloud you really don't care.

You then start buying the thing in front of you on other factors such as look, feel, street cred and the rest. This may in part explain the growing preference for Macs - the triumph of design over utility. (I'm not immune to this, as my ten dollar watch story shows.)

It then starts to matter which cloud software ecology you belong to - Google has one, Microsoft and their hardware friends have one, Zoho has an interesting set of tools. And Apple doesn't really have one ...

Written with StackEdit.

Tuesday, 3 September 2013

Student Computer use

It's spring here in Canberra, with the days in the low to mid twenties, even if the nights are still chilly.
On campus this means that the students finally emerge from their winter burrows and start sitting out during the day, even though they are still doing work.

In my day this usually meant intimidatingly thick textbooks, or xeroxed copies of research papers plus a notepad or two. The technology of the mid seventies didn't allow for much more, although in one share house we bought an old desk and put it in the back yard, and I have happy memories of banging away on a portable typewriter in the evening sun with a glass of cheap Romanian red ...

Well, these days students seem to have got over the textbook and notepad thing and use their laptops out of doors. Given we've wifi just about everywhere this means they can sit outside and work providing it's not too far from a building.

So in the spirit of my informal surveys of how people read books on the bus here's my totally unscientific survey.
  • Macbooks are the computer of choice, overwhelmingly so
    • the most preferred MacBook appears to be the Air
  • There is no dominant recognisable preferred Windows based computer brand. 
    • All the common retail brands are represented plus some Vaios 
    • there's a preference for Ultrabooks
  • No one seemed to be using a tablet out of doors with or without a keyboard

Given that students carry them around all day I can see why ultrabook style computers are popular. I was surprised at the lack of tablets + keyboards given their better battery life and their potential as note takers.

Of course it could be that only the cool people are out of doors and the rest are inside beavering away on their clunky old laptops ...
Written with StackEdit.

Thin clients and clouds

I've previously written about how cloud providers can provide services more efficiently through their economies of scale and through the fact they don't need fancy maintenance contracts.

I remember the first time we had NetApp servers: all the big system boys couldn't get their heads around the fact that NetApp didn't do hardware support. They gave you some spare disks, memory and a motherboard. If you could build your own PC you could do first line maintenance on a NetApp filer.

But I digress. The other thing that is happening in computing is that processing, as well as data, is moving to the cloud. Virtualisation, astonishingly cheap compute and elastic on demand resourcing have made it a viable option to run your applications from the cloud. Zoho does it. Google Apps does it. Office 365 does it.

At the same time the rise of tablet computing has pushed the migration of services to the cloud - tablets individually do very little real work and instead serve content fetched over the internet - of course not quite true, TextEdit for example lets you create and save files without an internet connection.

A chromebook is fundamentally the same concept - almost all the processing is offloaded onto the cloud and it is effectively a thin client that runs a browser that acts as a gateway to non Google services.

This is of course not a new idea - a French ISP was doing something similar back in 2007, and there have been various attempts over the years to introduce thin client computing.

I've been playing off and on with thin clients since the mid nineties. The problems have always been:

  1. The backend infrastructure has required one off big lumps of investment
  2. Software licensing has been a nightmare and overly expensive
  3. Not everything runs in the environment

As we know the application everyone wants on their desktop is Office, and Microsoft licensing in thin client environments has always been a nightmare as Microsoft have never done concurrent licensing (i.e. buy n, share them among m users where m>n, but you never run more than n sessions). Cynics might speculate that the only reason Sun got interested in Star Office, the precursor of Libre Office and Open Office, was the need to provide an office application in their SunRay thin client environment.
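For anyone who hasn't met concurrent licensing, the rule in the brackets above is easy to show in code. This is a toy Python model only: the class `ConcurrentLicencePool` and its methods are invented for illustration and don't correspond to any real licence server.

```python
class ConcurrentLicencePool:
    """Toy model: n licences shared by any number of users, at most n live sessions."""

    def __init__(self, seats):
        self.seats = seats      # n, the number of licences bought
        self.active = set()     # users with a session running right now

    def start_session(self, user):
        """Return True if the user gets (or already holds) a seat, else False."""
        if user in self.active:
            return True
        if len(self.active) >= self.seats:
            return False        # all n seats are in use, so this user waits
        self.active.add(user)
        return True

    def end_session(self, user):
        """Free the user's seat so someone else can use it."""
        self.active.discard(user)


# Two licences (n = 2) shared among three users (m = 3).
pool = ConcurrentLicencePool(seats=2)
results = [pool.start_session(name) for name in ("ann", "bob", "cat")]
print(results)
pool.end_session("ann")
print(pool.start_session("cat"))
```

The point is that the limit applies to simultaneous sessions, not to the number of people entitled to use the software.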

Cloud changes this. Elastic compute plus virtualisation means that most things can be got to run, and suddenly you no longer have to buy all that infrastructure in one big lump that hits the cashflow. It's just the same as buying individual PCs or tablets being more manageable than buying a container load - basically you buy what you need.

And the rise of tablet computing has caused changes. For example you now rent Office 365 at a cost that's considerably less than going out and buying licences. Yes, you end up paying just as much if not more, but you pay for what you use.

So, cloud computing plus reliable networking has finally given us thin client computing. Strictly of course it's browser based computing - things like virtual desktops are stepping stones, ways of getting these specialist applications available in the environment.

However the main points are:

  1. Clouds and virtualisation have largely removed the cost barriers to investing in the host infrastructure required for thin client computing
  2. The end devices increasingly need do no more than run a browser - look at the Chromebook as a possible model

The implication is that not only will the pc makers be squeezed by the commodification of the server market, but that desktop sales will decline as desktops are increasingly replaced with low cost delivery devices.

Written with StackEdit.