Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+,
GitHub, and More
O'Reilly Media (2013), Edition: Second Edition, Paperback, 448 pages
- also available as an ebook in most common formats
Thursday, 5 December 2013
Tuesday, 3 December 2013
Over the past few years I’ve been doing a ‘what worked’ every December, but this year I’ve also been doing quarterly updates on what I actually use. So this year’s post is also my quarterly update on the tools used:
- Dropbox – used mainly to sync files across computers irrespective of file format
- Libre Office – platform agnostic document editor for off line writing. Often used in conjunction with Dropbox
- Evernote – used as a notes and document management system (Nixnote is used on Linux to access my Evernote files)
- Wunderlist for ‘to do’ list management
- Chrome – browser extraordinaire
- Gmail – email solution
- Postbox - lightweight email client for Windows to cope with slow connections - used with great success in Sri Lanka
- Evolution - Linux email client principally used in conjunction with Libre Office
- Google docs – fast means to create quick and dirty documents irrespective of platform
- Windows Live Writer – offline blog post creation
- TextEdit – Android text editor for note taking that integrates nicely with Evernote and Gmail
- Kate - my favourite editor
- TextWrangler - my second favourite editor
- Stackedit - Google chrome markdown editor (and blog posting tool)
- Pandoc - converts markdown to a range of other formats
- Microsoft Skydrive – used for document backup
- Amazon cloud drive - also used for documents
- Excel Web App – for these occasions when Google Spreadsheets or Libre Office Calc will not do
- GanttProject for Gantt chart generation
- InoReader for RSS feed tracking
- Twitter for tracking interesting things – rarely for messaging
- Hosted Wordpress and blogger for blogging, and wikidot for creating structured web pages
- Hojoki for tracking documents and tasks (Gives unified visibility of GoogleDocs, Skydrive, GitHub, Dropbox and Evernote)
The real change has been to the hardware used. My trusty old Android tablet is still in use for checking email and reading news websites at breakfast time - as evidenced by some of the gluckier marks on the screen. The newer seven inch device is still in use as a note taker and I see no reason to change for the moment although I do admit being tempted by the new iPad mini - more because of the software base and the availability of decent keyboard solutions than anything else.
TextEdit - the Android text editor - is now unsupported, and while I’m continuing to use it successfully I fear that one day there will be an API change on Google Drive or Evernote that will break things.
The real change has been the Chromebook. It allows me to check my email, create quick and dirty drafts using either Google Docs or StackEdit, as well as surf the web and research things. If anything has ever demonstrated how much of my day to day reading and specification checking has moved to the web the Chromebook certainly has.
It’s also fast, well fast enough, boots quickly and shuts down quickly. It’s not a full featured computer but it most definitely provides on the go functionality.
In fact it shows why my original Asus netbook was such an effective tool and the Windows netbook a bit of a clunker - basically load time. The platform is irrelevant, it’s access to a browser that counts.
However I still use my windows netbook - the Chromebook’s dependence on the internet makes it useless for off net travel, and my windows netbook does support my Virgin 3G dongle, though admittedly I seem to have been staying in places with poor Virgin coverage lately.
My Kindle has become my recreational reading device of choice although my venerable Cool-er has taken on a new life as a means of reading Gutenberg epub texts.
The Asus netbook has finally reached the end of its useful life but I’m tempted to try Crunchbang Linux on it as a basic writing/note machine, especially as it has a nicer keyboard than my seven inch Android tablet.
Despite several attempts to resurrect them I’m forced to admit that my pair of ppc imacs are too old and slow to be much use and are most probably headed for the great data centre in the sky …
Written with StackEdit.
Monday, 2 December 2013
NLA Innovative Ideas talk 02 Dec 2013
Ed Summers @edsu
This talk was held at the National Library of Australia. I went out of curiosity expecting a demo of cool things from the Library of Congress. Well there were certainly some cool things but given my current interest in the quantification of impact I came away with something else - a set of arguments and positions about access and impact and what exactly that means.
This post is basically my edited and cleaned up notes - any opinions or asides are my own and this is my interpretation of Ed’s talk. Comments and asides are marked up like this.
- Ed Summers has worked on digitisation at Library of Congress
- Linkypedia developer among other tools
- Now a software dev at Library of Congress
Library of Congress is de facto US National Library
- includes repository development centre - essentially a digital preservation group
- but no final view on what is preservation or repository
- could make the argument that a digital repository is just all the infrastructure for storage and access facilitation
use as justification to focus on doing useful things
could make similar argument for eresearch - rather than focus on grand initiatives focus on being useful
Doing useful things
role of access especially web based access and what access means in the context of digital preservation
digital preservation is access in the future
preservation means access as a way of enabling preservation
- access really is the same as web based access - no brainer
if people engage with your content it will be sustained
balanced value impact model - Simon Tanner
- think about how preservation has impact
what is the benefit? eg cultural preservation and return of digital patrimony to the originating communities, such as Aboriginal groups - may not show formal cost/benefit result
idea of web as customer service medium - the great success of the web has been around involvement and engagement on a mass scale eg social media
example nla newspaper ocr correction by crowd sourcing
success of wikipedia by author engagement
GLAM galleries archives libraries and museums
- wikipedia glam effort to engage with GLAM community
- use of GLAM content in wikipedia
- American memory - 1990 effort to digitise Library of Congress data and distribute on laser disc to universities to provide access
- innovative move to web 1993
- very hierarchical content model oriented round collections - very taxonomic view
- lots of clicks to get to an item
- wondered on content use -moved to Flickr to make more searchable photo stream - no massive click frenzy to get to an item
- simplify access -get 200% increase in access
- flickr allowed people to tag and reuse content ie engage with content
click counting does not measure impact
linkypedia - shows how web content is used on wikipedia - find how many articles on wikipedia use a particular resource for citation
- gives counts of the number of secondary links - indicator of degree of value and reuse
- usage can be monitored by rss
see reuse of data in sites such as pinterest
wikistream - harvest content from wikipedia via irc to harvest updates
- wikipedia content very dynamic - lots of changes
easy to build an application by gluing tools- unix style building
wikipulse - shows edit activity using a speedometer metaphor
wikipedia community is active and engaged
Chroma - tool running on amazon to give impression of Wikipedia activity and site usage - question is what use people made of a resource
visual representations give a richer, more impressionistic view of site use
use of twitterbot to provide auto feed 100 years ago today to build engagement
mechanical curator - bl tumblr feed of image detail
It’s all about usefulness - if it is useful people will engage with content, cite it and make use of it, and much of the repository space should be around providing access to content - if it’s useful people will engage
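The linkypedia idea in the notes above, counting how many Wikipedia articles cite a given resource, can be sketched in a few lines. This is only an illustration of the counting step, not how linkypedia itself is built: it assumes you already have a list of (article title, external URL) pairs from somewhere, and the function name and sample data are made up.

```python
from collections import defaultdict
from urllib.parse import urlparse


def articles_citing_hosts(links):
    """Map each cited host to the set of article titles that link to it.

    `links` is an iterable of (article_title, external_url) pairs.
    Using a set means an article citing the same site twice counts once.
    """
    citing = defaultdict(set)
    for article, url in links:
        host = urlparse(url).hostname  # None if the URL has no host part
        if host:
            citing[host].add(article)
    return dict(citing)


# Hypothetical sample data, purely for illustration.
sample_links = [
    ("Article A", "http://example.org/item/1"),
    ("Article A", "http://example.org/item/2"),
    ("Article B", "http://example.org/item/1"),
    ("Article C", "http://example.com/photo/9"),
]

counts = {
    host: len(titles)
    for host, titles in articles_citing_hosts(sample_links).items()
}
print(counts)
```

Counting distinct articles rather than raw links is a design choice: it is closer to 'how many people found this useful enough to cite' than to a click count.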
Thursday, 28 November 2013
So I thought I’d check it out and make myself an ImpactStory profile. After all the only way to find out about these things and their usability, or otherwise, is to experiment on oneself.
In very crude terms, ImpactStory is an attempt to build a Klout like application for academia. Like Klout it attempts to quantify that nebulous thing ‘influence’. Unlike Klout, it is an open source publicly funded initiative (by the NSF and Alfred P. Sloan Foundation, no less) and transparent - the code and the algorithms used are up for scrutiny.
And of course it’s an important initiative. Scholarly communication has changed. Researchers blog and tweet about work in progress and this forms an important source of preliminary announcements, as well as the usual crosstalk about ideas and interpretations. Project websites have increasingly become a source of research information, etc etc.
So what about ImpactStory ?
I think it’s fair to say it’s a work in progress - it harvests data from twitter, orcid, wordpress, slideshare, vimeo and a few others. Crucially it doesn’t harvest material from blogger, academia.edu or scribd, both of which are often used to host preprints and grey literature, such as working papers and various reports, for example Rebecca Barley’s work on late Roman copper coins in South India.
In short it’s the first or second pass at a solution. It shows us what the future might be as regards the quantification of ‘impact’, basically le bruit around a researcher and their research work, but it is not the future.
In its current state it probably under reports a researcher’s activity, it can show that a researcher has some impact but it does not show a lack of impact, and we should consequently be wary of including its metrics in any report.
Monday, 25 November 2013
Way back in September this year I blogged about Student Computer use. Slightly to my surprise that post garnered quite a high readership despite its extremely informal basis
- walk round campus at lunch time
- note what computers you see students using
- blog about it
Over the weekend a colleague from York posted some interesting figures from there on computer use
OS stats of connections INTO campus for desktop OSs: Win81 1%, Win8 9%, Win7 50%, Vista 5%, XP 15%, Linux 3%, OS X 17%
OS stats of computers ON campus: Win81 5%, Win8 20%, Win7 19%, Vista 7%, XP 16%, Linux 6%, OS X 26%
Which is kind of interesting. Now I don’t know how the figures were filtered but they probably reflect a mix of staff and student connections, with, given that York is a university where a lot of students live on campus, the second set of figures reflecting student use more accurately than the first.
From this I think it’s fair to say students have a strong preference for OS X. The presence of Vista and XP is interesting, and I think a reflection of my long held suspicion that students buy themselves a laptop at the start of their degree course and never ever upgrade the OS over the course of their three or four years.
If I’m right, it probably means that the end of support for XP is less of a problem than it might be, as the XP and Vista machines will age out of the population by the end of the northern hemisphere academic year (Of course here in Canberra, our academic year is already over).
This also explains why a quarter of the on-campus machines are Windows 8 or 8.1 - they represent new purchases.
Connections into campus probably reflect a mix of staff and graduate student connections - and the dynamics of machine replacement are probably different for them - they probably use a machine for more of its natural life of four to five years, and given the initial distaste for Windows 8, they probably tried to replace machines with Windows 7 where possible.
The numbers of Vista and XP are concerning, but given that most people never upgrade computers anyway one would need to take human factors into account in any upgrade campaign.
Sidegrading to Ubuntu is probably a step too far for most of these users, given the current low penetration of Linux among that community.
However, the key takeaways are that OS X has made substantial inroads into the student computer community, Linux hasn’t, and despite OS X’s advance Windows OSs are still the majority.
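For anyone who wants to check the arithmetic behind these takeaways, here is a minimal sketch that adds up the quoted shares. The numbers are the ones quoted above (taking Windows 8 as 9% of inbound connections); the variable and function names are my own, and note that the on-campus figures as quoted add up to 99 rather than 100, presumably through rounding.

```python
# Percentages as quoted in the post.
on_campus = {"Win8.1": 5, "Win8": 20, "Win7": 19, "Vista": 7,
             "XP": 16, "Linux": 6, "OS X": 26}
into_campus = {"Win8.1": 1, "Win8": 9, "Win7": 50, "Vista": 5,
               "XP": 15, "Linux": 3, "OS X": 17}

WINDOWS = ("Win8.1", "Win8", "Win7", "Vista", "XP")
NEW_WINDOWS = ("Win8.1", "Win8")   # likely recent purchases
LEGACY_WINDOWS = ("Vista", "XP")   # never-upgraded machines


def share(stats, names):
    """Total percentage share of the named operating systems."""
    return sum(stats[name] for name in names)


for label, stats in (("on campus", on_campus), ("into campus", into_campus)):
    print(label,
          "| all Windows:", share(stats, WINDOWS),
          "| Win 8/8.1:", share(stats, NEW_WINDOWS),
          "| Vista+XP:", share(stats, LEGACY_WINDOWS),
          "| OS X:", stats["OS X"],
          "| Linux:", stats["Linux"])
```

On campus this gives Windows at 67% overall, with 25% on Windows 8 or 8.1 and 23% on Vista or XP, which is where the 'quarter' and 'still the majority' claims come from.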
Friday, 22 November 2013
From a researcher’s point of view, measuring impact is about trying to create something akin to a Klout score for academics.
Klout is a social media analytics company that generates a ranking score using a proprietary algorithm that purports to measure influence through the amount of social media engagement generated.
Because I experiment on myself I know what my Klout score is - an average of 43 (+/- 1), which is respectable but not stellar. Now the interesting thing about this score is
- While I’ve connected my Facebook account to it I don’t actively use Facebook
- I have a small band of loyal twitter followers (230 at the last count)
- Google Analytics shows my blogs have a reasonable penetration with an average readership of around 30 per post
It is however interesting as an attempt to measure impact, or in Klout terms social media influence.
Let’s turn to research papers. The citation rates of scientific papers reflect their influence within a particular field, and we all know that a paper in the Journal of Important Things gets you a nice email from the Dean, and one in Proceedings of the Zagistani Cybernetics Institute does not. And this of course is the idea behind bibliometrics and attempting to quantify impact. Crudely, a paper in a well respected journal is likely to be more widely read than one that is not.
Even if it is not widely cited the paper has had more influence than one less widely read.
And of course we know it probably is more or less true. If you’re an ethologist you’re probably going to want some publications in Animal Behaviour on your CV.
So we could see it could sort of work within disciplines, or at least those in which journal publication is common. There are those, such as computer science, where a lot of the material is based around conference proceedings and that’s a slightly different game.
Let’s now think about dataset citation. By its nature data that is widely available is open access and there is no real established infrastructure, with the exception of a few dedicated specialist repositories such as the Archaeological Data Service in the UK and IRIS in the US for Earth Sciences.
These work because they hold a critical mass of data for the disciplines, and thus archaeologists ‘know’ to look at the ADS just as ethologists ‘know’ to look at Animal Behaviour.
Impact is not a function of the dataset, but of some combination of its accessibility and dissemination. In other words it comes down to
- Can I find it?
- Can I get access to it ?
At this stage of the game, there are no competing repositories in the way that there are competing journals, which means that we can simply use raw citation rates to compute influence, and to use citation rates we need to be able to identify individual datasets uniquely - which is where digital object identifiers come in - not only do they make citation simpler, they make counting citations simpler …
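To make the point about counting concrete, here is a rough sketch of why a unique identifier helps. It assumes you have somehow harvested a list of cited DOI strings in the various forms people write them; the DOIs below are made up, and the helper names are hypothetical. Since DOIs are case-insensitive, lower-casing them and stripping the common prefixes is enough to line up different spellings of the same dataset.

```python
from collections import Counter

# Common ways a DOI gets written in a reference list.
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalise_doi(raw):
    """Reduce a cited DOI string to a bare, lower-case DOI."""
    doi = raw.strip().lower()
    for prefix in _PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip()


def count_dataset_citations(cited):
    """Raw citation count per dataset, keyed by normalised DOI."""
    return Counter(normalise_doi(c) for c in cited)


# Made-up DOIs, purely for illustration.
citations = [
    "doi:10.1234/dataset.one",
    "https://doi.org/10.1234/DATASET.ONE",
    "10.1234/dataset.one",
    "http://dx.doi.org/10.1234/dataset.two",
]

counts = count_dataset_citations(citations)
print(counts.most_common())
```

Without the identifier you would be matching free-text dataset titles, which is a much messier problem.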
I’ve recently been looking at buying myself a Raspberry Pi - admittedly more as toy than for any serious purpose - as I’ve been feeling the lack of a linux machine at home ever since my old machines built out of recycled bits died in a rainstorm.
(To explain - I have a bench in the garage where I play with such things, and not unnaturally I had the machines on the floor. We had a rainstorm of tropical dimensions, one of the garage downpipes blocked, and the rainwater backed up and then got in under the tin and flowed across the floor, straight through my linux machines).
Anyway, to the point, I’ve been researching options to buy a Pi, especially as we don’t really have much of a local ecosystem in Australia.
And one thing that became very obvious is that they have a major role in powering media players and displays - which kind of makes sense given that they have HDMI output and run a well known operating system, making them ideal for streaming content off of a local source or powering a display system - run a kiosk app on the Pi, and push your content out onto a display device - wonderful what you can do with cheap technology.
Again, by pure coincidence I came across a post describing the role of cheap Android devices and how in the main they are used as ways of viewing video content or else as embedded devices.
In other words there is a lot of under the radar demand for content viewing which is different from how we think tablets are used - for more engaged activities such as web surfing and email, as well as routine tasks like online banking.
And here we have the key takeaway - tablets like raspberry pi’s are versatile computing devices, just as pc’s are. And just the same way pc’s have a lot of uses other than general purpose computing, tablets and other such devices do.
PC’s became general purpose computing devices in the main because of their open architecture and the fact that various factories in the Far East could make bits for them relatively cheaply, meaning that if you wanted to make a gene sequencer say, rather than having to design embedded hardware, and then have the difficulty of maintaining and upgrading it, you could write software and use the standard interfaces available on a pc - thus significantly reducing your development and delivery costs.
Android, and the Raspberry Pi, both of which are open systems like the original PC, are giving us a similar effect - cutting development and delivery costs for embedded systems as the software environment is already there …