Tuesday, 24 April 2012

Seminar Report: Pragmatic approaches to the Semantic Web


The semantic web, or rather the use of semantic web techniques, seems to engender a fair amount of hype, accompanied by groups of wide-eyed mystics preaching the gospel of open data and open access.
This seminar was different. Mike Bergman, the CEO of Structured Dynamics, is a serious player with a considerable track record in the field, including originating the concept of the ‘deep web’: all the data locked away in databases and in dynamic web pages generated by content management systems, which is inaccessible to standard spidering techniques.


His basic argument was that data could be coerced if it:


a) is structured (either explicitly, or implicitly, say by using a standard document template)
b) can be expressed using a standard canonical model such as RDF (see the sketch below)
c) can use simple ontologies for world views
d) is subject to curation
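
By way of illustration (this is my sketch, not something from the talk), expressing a record in a canonical model like RDF can be as simple as a few lines of Python with rdflib and Dublin Core - the namespace and values here are purely hypothetical:

```python
# A minimal sketch of point (b): a record expressed in RDF using rdflib
# and Dublin Core element terms. The namespace and values are made up.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DC

EX = Namespace("http://example.org/records/")  # hypothetical namespace

g = Graph()
g.bind("dc", DC)

record = EX["finding-aid-42"]
g.add((record, DC.title, Literal("Papers of an example pastoral company")))
g.add((record, DC.creator, Literal("Example & Co")))
g.add((record, DC.date, Literal("1923-1957")))

print(g.serialize(format="turtle"))
```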


It was his view that expecting organisations to provide beautifully marked up data as a public good was unrealistic - it was better to go searching for sources that could be used and coerced to extract data. My personal favourite example is library finding aids: they usually contain a lot of information and follow a standard format, meaning it is reasonably simple to write some code to pull them apart and do MARC lookups as required to assemble a set of standard RDF documents.
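
To make the finding aid example a bit more concrete (again my own toy sketch, not anything shown in the seminar), a finding aid that follows a simple 'Field: value' template can be pulled apart with a few lines of Python before the fields are mapped to MARC or RDF:

```python
import re

# A toy finding aid following a simple "Field: value" template. Real
# finding aids are richer, but the principle - exploit the implicit
# structure of a standard format - is the same.
finding_aid = """\
Title: Papers of an example pastoral company
Date range: 1923-1957
Extent: 12 boxes
Call number: MS 1234
"""

# Pull each "Field: value" line apart into a dictionary.
fields = dict(re.findall(r"^([^:\n]+):\s*(.+)$", finding_aid, re.MULTILINE))

print(fields["Title"])        # Papers of an example pastoral company
print(fields["Call number"])  # MS 1234
```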


As well as linked data being burdensome to produce, Mike pointed out a range of common problems, including the overuse of sameAs when generating mappings, the general lack of agreed vocabulary alignments (something the DCMI is looking at), and the poor curation of data, vocabularies and schemas - it is no use relying on an external schema if the server is always down.
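
The sameAs point is worth a small illustration (my own sketch, with made-up URIs): once two resources are declared owl:sameAs, anything asserted about one is effectively asserted about the other, so an over-enthusiastic mapping quietly conflates things that are not the same at all:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDFS

EX = Namespace("http://example.org/")  # hypothetical URIs
g = Graph()

g.add((EX.Springfield_IL, RDFS.label, Literal("Springfield, Illinois")))
g.add((EX.Springfield_MA, RDFS.label, Literal("Springfield, Massachusetts")))

# An over-enthusiastic mapping: these are not really the same place.
g.add((EX.Springfield_IL, OWL.sameAs, EX.Springfield_MA))

# Naively merging facts across sameAs links conflates the two resources.
same_as = list(g.triples((None, OWL.sameAs, None)))
for s, _, o in same_as:
    for _, p, v in list(g.triples((o, None, None))):
        g.add((s, p, v))

for _, _, label in g.triples((EX.Springfield_IL, RDFS.label, None)):
    print(label)  # both labels now attach to the one resource
```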


To gain true value we need agreement on vocabulary mappings and alignments so that we can meaningfully map and combine data sources - Mike pointed to the LODLAM (Linked Open Data in Libraries, Archives and Museums) community as a good example of this.


He also pointed out the role of Natural Language Processing in helping elucidate structure in unstructured documents, for example automated concordancing to work out which terms in a document relate to which.
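
As a rough illustration of the kind of thing he meant (my own toy example, using a crude co-occurrence count rather than proper concordancing), a few lines of Python can show which terms tend to turn up near each other in a document:

```python
from collections import Counter

# Crude co-occurrence counting: which terms appear near each other?
text = ("The finding aid describes the station records. "
        "The station records include shearing tallies and wool sales.")

words = [w.strip(".,").lower() for w in text.split()]
window = 4  # how close two words must be to count as "related"

pairs = Counter()
for i, w in enumerate(words):
    for other in words[i + 1:i + 1 + window]:
        if w != other:
            pairs[tuple(sorted((w, other)))] += 1

for (a, b), n in pairs.most_common(5):
    print(f"{a} / {b}: {n}")
```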
He was also strong on the need for proper curation of infrastructure and data sources - the schemas, the vocabularies and the data need to be properly managed and have persistence, i.e. not only available now but with a long term guarantee of availability and a persistent URI even if the underlying servers change.


The seminar was non-technical, but valuable for communicating what is required to make this semantic web thing work, from someone who does this for money rather than as a theoretical or academic exercise.


[this seminar was organised by the Canberra Semantic Web Meetup group - slides from the presentation should be online from the group's website in the next day or so]


[slides are now online - 01 May 2012]
