Archive for the 'semantic web' Category

Aug 01 2010

Content Identifiers for Digital Rights Persistence

This is another write-up from the Henry Stewart DAM London conference.

Identity and identification

Robin Wilson discussed the issue of content identifiers, which are vitally important for digital rights management, yet tend to be overlooked. He argued that although people become engaged in debates about titles and the language used in labels and classification systems, they overlook the need to achieve consensus on basic identification.

(I was quite surprised, as I have always thought that people would argue passionately about what something should be called and how using the wrong terminology affects usability, but that they would settle on machine-readable IDs quite happily. Perhaps it is the neutrality of such codes that makes the politics intractable. If you have invested huge amounts of money in a database that demands certain codes, you will argue that those codes should be used by everyone else, to save you the costs of translation or of acquiring a compatible system, and there are no appeals to usability, or brokerage via editorial policy, that can be made. It simply becomes a matter of whoever shouts the loudest gets to spend the least money in the short term.)

Robin argued that the only way to create an efficient digital marketplace is to have a trusted authority oversee a system of digital identifiers that are tightly bound within the digital asset, so they cannot easily be stripped out even when an asset is divided, split, shared, and copied. The authority needs to be trusted by consumers and creators/publishers in terms of political neutrality, stability, etc.

(I could understand how this system would make it easier for people who are willing to pay for content to see what rights they need to buy and who they should pay, but I couldn’t see how the system could help content owners identify plagiarism without an active search mechanism. Presumably a digital watermark would persist throughout copies of an asset, provided that it wasn’t being deliberately stripped, but if the user simply decided not to pay, I don’t see how the system would help identify rights breaches. Robin mentioned in conversation Turnitin’s plagiarism management, which has become more lucrative than their original work on content analysis, but it requires an active process instigated by the content owner to search for unauthorised use of their content. This is fine for the major publishers of the world, who can afford to pay for such services, but is less appealing to individuals, whether professional freelances or amateur content creators, who would need a cheap and easy solution that would alert them to breaches of copyright without their having to spend time searching.)

The identifiers themselves need to be independent of any specific technology. At the moment, DAM systems are often proprietary and therefore identifiers and metadata cannot easily flow from one system to another. Some systems even strip away any metadata associated with a file on import and export.

Robin described five types of identifier currently being used or developed:

  • Uniform Resource Name (URN)
  • Handle System
  • Digital Object Identifier
  • Persistent URL (PURL)
  • ARK (Archival Resource Key).

He outlined three essential qualities for identifiers - that they be unique, globally registered, and locally resolved.
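As a rough illustration of those three qualities, here is a minimal sketch in Python. The `Registry` class, the `10.9999` prefix, the asset table, and the example URL are all hypothetical names invented for this example. None of them is the actual API of the Handle System, DOI, or any other scheme listed above.

```python
class Registry:
    """Toy global registry of naming-authority prefixes (hypothetical)."""

    def __init__(self):
        self._resolvers = {}  # prefix -> local resolver function

    def register_prefix(self, prefix, resolver):
        # Globally registered: each prefix may be claimed only once.
        if prefix in self._resolvers:
            raise ValueError(f"prefix {prefix!r} is already registered")
        self._resolvers[prefix] = resolver

    def resolve(self, identifier):
        # Identifiers take the form "prefix/suffix".
        prefix, _, suffix = identifier.partition("/")
        if prefix not in self._resolvers:
            raise KeyError(f"unknown prefix {prefix!r}")
        # Locally resolved: the prefix owner decides what the suffix means.
        return self._resolvers[prefix](suffix)


# A hypothetical publisher keeps its own table of asset locations.
publisher_assets = {"photo-001": "https://example.org/assets/photo-001.jpg"}

registry = Registry()
registry.register_prefix("10.9999", publisher_assets.get)

location = registry.resolve("10.9999/photo-001")
```

Uniqueness here comes from the combination: the registry refuses duplicate prefixes, and each prefix owner keeps its own suffixes distinct. The registry never needs to know where the publisher stores anything, which is the point of resolving locally.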

So why don’t we share?

Robin argued that it is easier for DAM vendors to build “safe” systems that lock all content within an enterprise environment; only those with a public service/archival remit tend to be collaborative and open. DAM vendors resist a federated approach online and prefer to use a one-to-one or directly intermediated transaction model. Federated identifier management services exist but vendors and customers don’t trust them. The problem is mainly social, not technological.

One of the problems is agreeing to share the costs of services, such as infrastructure, registration and validation, governance and development of the system, administration, and outreach and marketing.

(Efforts to standardise may well benefit the big players more than the small players and so there is a strong argument for them bearing the initial costs and offering support for smaller players to join. Once enough people opt in, the system gains critical mass, joining becomes easier, and the costs of joining become less of an unquantifiable risk – you can benefit from the experiences of others. The semantic web is currently attempting to acquire this “critical mass”. As marketers realise the potential of semantic web technology to make money, no doubt we will see an upsurge in interest. Facebook’s “like” button may well be heralding the advent of the ad-driven semantic web, which will probably drive uptake far faster than the worthy efforts of academics to improve the world by sharing research data!)

3 responses so far

Jun 15 2010

Are you a semantic romantic?

Published by Fran under semantic web

The “semantic web” is an expression that has been used for long enough now that I for one feel I ought to know what it means, but it is hard to know where to start when so much about it is presented in “techspeak”. I am trying to understand it all in my own non-technical terms, so this post is aimed at “semantic wannabes” rather than “semantic aficionados”. It suggests some ways of starting to think about the semantic web and linked open data without worrying about the technicalities.

At a very basic level, the semantic web is something that information professionals have been doing for years. We know about using common formats so that information can be exchanged electronically, from SGML to HTML and then XML. In the 90s, publishers used “field codes” to identify subject areas so that articles could be held in databases and re-used in multiple publications. In the library world, metadata standards like MARC and Dublin Core were devised to make it easier to share cataloguing data. The semantic web essentially just extends these principles.

So, why all the hype?

There is money to be made and lost on semantic web projects, and investors always want to try to predict the future so they can back winning horses. The recent Pew Report (thanks to Brendan for the link) shows the huge variety of opinions about what the semantic web will become.

At one extreme, the semantic evangelists are hoping that we can create a highly sophisticated system that can make sense of our content by itself, with the familiar arguments that this will free humans from mundane tasks so that we can do more interesting things, be better informed and connected, and build a better and more intelligent world. They describe systems that “know” that when you book a holiday you need to get from your house to the airport, that you must remember to reschedule an appointment you made for that week, and that you need to send off your passport tomorrow to renew it in time. This is helpful and can seem spookily clever, but is no more mysterious than making sure my holiday booking system is connected to my diary. There are all sorts of commercial applications of such “convenience data management” and lots of ethical implications about privacy and data security too, but we have had these debates many times in the past.

A more business-focused example might be that a search engine will “realise” that when you search for “orange” you mean the mobile phone company, because it “knows” you are a market analyst working in telecoms. It will then work out that documents that contain the words “orange” and “fruit” are unlikely to be what you are after, and so won’t return them in search results. You will also be able to construct more complex questions, for example to query databases containing information on tantalum deposits and compare them with information about civil conflicts, to advise you on whether the price of mobile phone manufacture is likely to increase over the next five years.

Again, this sort of thing can sound almost magical, but is basically just compiling and comparing data from different data sets. This is familiar ground. The key difference is that for semantically tagged datasets much of the processing can be automated, so data crunching exercises that were simply too time-consuming to be worthwhile in the past become possible. The evangelists can make the semantic web project sound overwhelmingly revolutionary and utopian, especially when people start talking in sci-fi sounding phrases like “extended cognition” and “distributed intelligence”, but essentially this is the familiar territory of structuring content, adding metadata, and connecting databases. We have made the cost-benefit arguments for good quality metadata and efficient metadata management many times.
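To make “compiling and comparing data from different data sets” a little more concrete, here is a minimal sketch in Python. Every value in it is invented for illustration (the `country:A` identifiers, the predicate names, and the `subjects` helper are all hypothetical), and it uses plain tuples rather than any real RDF library.

```python
# Each fact is a (subject, predicate, object) triple. All values are made up.
deposits = [
    ("country:A", "has_tantalum_deposit", "mine:1"),
    ("country:B", "has_tantalum_deposit", "mine:2"),
]
conflicts = [
    ("country:A", "has_conflict", "conflict:x"),
]


def subjects(triples, predicate):
    """Return the set of subjects that appear with the given predicate."""
    return {s for s, p, _ in triples if p == predicate}


# Because both data sets use the same identifier for each country,
# comparing them is a simple set intersection.
at_risk = subjects(deposits, "has_tantalum_deposit") & subjects(conflicts, "has_conflict")
```

The comparison only works because both data sets agree on what to call each country. Without shared identifiers, someone would have to reconcile the names by hand first, which is the time-consuming part that semantic tagging is meant to remove.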

At the other extreme, the semantic web detractors claim that there is no point bothering with standardised metadata, because it is too difficult politically and practically to get people to co-operate and use common standards. In terms familiar to information professionals, you can’t get enough people to add enough good quality metadata to make the system work. Clay Shirky in “Ontology is overrated” argued that there is no point in trying to get commonality up front: it is just too expensive (there are no “tag police” to tidy up), so you just have to let people tag randomly and then try to work out what they meant afterwards. This is a great way of harvesting cheap metadata, but doesn’t help if you need to be sure that you are getting a sensible answer to a question. It only takes one person to have mistagged something, and your dataset is polluted and your complex query will generate false results. Shirky himself declares that he is talking about the web as a whole, which is fun to think about, but how many of us (apart from Google) are actually engaged in trying to sort out the entire web? Most of us just want to sort out our own little corner.

I expect the semantic web to follow all other standardisation projects. There will always be a huge “non-semantic” web that will contain vast quantities of potentially useful information that can’t be accessed by semantic web systems, but that is no different from the situation today where there are huge amounts of content that can’t be found by search engines (the “invisible web” or “dark web”) – from proprietary databases to personal collections in unusual formats. No system has been able to include everything. No archive contains every jotting scrawled on a serviette, no bookshop stocks every photocopied fanzine, no telephone directory lists every phone number in existence. However, they contain enough to be useful for most people most of the time. No standard provides a perfect universal lingua franca, but common languages increase the number of people you can talk to easily. The adoption of XML is not universal, but for everyone who has “opted in” there are commercial benefits. Not everybody uses PDF files, but for many people they have saved hours of time previously spent converting and re-styling documents.

So, should I join in?

What you really need to ask is not “What is the future of the semantic web?” but “Is it worth my while joining in right now?”. How to answer that question depends on your particular context and circumstances. It is much easier to try to think about a project, product, or set of services that is relevant to you than to worry about what everyone else is doing. If you can build a product quickly and cheaply using what is available now, it doesn’t really matter whether the semantic web succeeds in its current form or gets superseded by something else later.

I have made a start by asking myself very basic questions like:

  • What sort of content/data do we have?
  • How much is there?
  • What format is it in at the moment?
  • What proportion of that would we like to share (is it all public domain, do we have some that is commercially sensitive, but some that isn’t, are there data protection or rights restrictions)?

If you have a lot of data in well-structured and open formats (e.g. XML), there is a good chance it will be fairly straightforward to link your own data sets to each other, and link your data to external data. If there are commercial and legal reasons why the data can’t be made public, it may still be worth using semantic web principles, but you might be limited to working with a small data set of your own that you can keep within a “walled garden” – whether or not this is a good idea is another story for another post.

A more creative approach is to ask questions like:

  • What content/data services are we seeking to provide?
  • Who are our key customers/consumers/clients and what could we offer them that we don’t offer now?
  • What new products or services would they like to see?
  • What other sources of information do they access (users usually have good suggestions for connections that wouldn’t occur to us)?

Some more concrete questions would be ones like:

  • What information could be presented on a map?
  • How can marketing data be connected to web usage statistics?
  • Where could we usefully add legacy content to new webpages?

It is also worth investigating what others are already providing:

  • What content/data out there is accessible? (e.g. recently released UK government data)
  • Could any of it work with our content/data?
  • Whose data would it be really interesting to have access to?
  • Who are we already working with who might be willing to share data (even if we aren’t sure yet what sort of joint products/projects we could devise)?

It’s not as scary as it seems

Don’t be put off by talk about RDF, OWL, and SPARQL, how to construct an ontology, and whether or not you need a triple store. The first questions to ask are familiar ones like who you would like to work with, what could you create if you could get your hands on their content, and what new creations might arise if you let them share yours? Once you can see the semantic web in terms of specific projects that make sense for your organisation, you can call on the technical teams to work out the details. What I have found is that the technical teams are desperate to get their hands on high quality structured content – our content – and are more than happy to sort out the practicalities. As content creators and custodians, we are the ones that understand our content and how it works, so we are the ones who ought to be seizing the initiative and starting to be imaginative about what we can create if we link our data.

A bit of further reading:
Linked Data.org
Linked Data is Blooming: Why You Should Care
What can Data.gov.uk do for me?

No responses yet

Apr 27 2010

Web Science 2010

Published by Fran under semantic web

There have been lots of interesting presentations at Web Science 2010 in Raleigh. My metadata meerkats were popular - hard to beat charismatic megafauna. The papers and posters are online at The Journal of Web Science.

No responses yet

Jun 12 2009

Do publications need style guides for data?

Published by Fran under semantic web

Journalism.co.uk :: Do publications need style guides for data? That would be nice - can we tidy them all up and semantic webify everything while we’re at it?

No responses yet

May 14 2009

ISKO UK | Google Ups its Stakes

Published by Fran under search, semantic web

ISKO UK’s KOnnect blog notes that at least Google is taking metadata seriously.

Chatting about Wolfram Alpha the other day, it was pointed out to me that specialist knowledge for a general audience is actually a very niche area, and this is the source of the hype. You need to persuade your VC funders you are revolutionary, when actually you have a very tricky business model. Serious researchers will be using specialised systems already and most people want to look up things like train times rather than atomic weights of elements, so your market is people like students and journalists, who have an intermediate level of interest. Perhaps there are enough of them in the world to generate plenty of advertising revenue, but it seems like a tough call.

I hope the funders are happy with the old reference publishing model - lots of investment up front, in the hope not that the finished product will generate huge initial profits, but that it will have a long steady life. Wolfram Alpha employed 150 people in essentially traditional content creation roles and it will be interesting to see how they get their money back. Google doesn’t have to pay for its own content or metadata creation!

No responses yet

Apr 25 2009

Human-Machine Symbiosis for Data Interpretation

Published by Fran under KO, linguistics, search, semantic web

I went to the ISKO event on Thursday. The speaker, Dave Snowden of Cognitive Edge, was very entertaining. He has already blogged about the lecture himself.

He pointed out that humans are great at pattern recognition (“intuition is compressed experience”) and are great satisficers (computers are great at optimising), and that humans never read or remember the same word in quite the same way (has anyone told Autonomy this?). I suppose this is the accretion of personal context and experience affecting your own understanding of the word. I remember as a child forming very strong associations with names of people I liked or disliked - if I disliked the person, I thought the name itself was horrible. This is clearly a dangerous process (and one I hope I have grown out of!) but presumably is part of the way people end up with all sorts of irrational prejudices and also explains why “reclaiming” words like “queer” eventually works. If you keep imposing new contexts on a word, those contexts will come to dominate. This factors into taxonomy work, as it explains why people feel so intensely about how things should be named, and why they won’t all agree. It must also be connected to why language evolves (and how outdated taxonomies start to cause rather than solve problems - like Wittgenstein’s gods becoming devils).

Snowden also talked about the importance of recognising the weak signal, and has developed a research method based on analysing narratives, using a “light touch” categorisation (to preserve fuzzy boundaries) and allowing people to categorise their own stories. He then plots the points collected from the stories to show the “cultural landscape”. If this is done repeatedly, the “landscapes” can be compared to see if anything is changing. He stressed that his methodology required the selection of the right level of detail in the narratives collected, disintermediation (letting people speak in their own words and categorise in their own way within the constraints), and distributed cognition.

I particularly liked his point that when people self-index and self-title they tend to use words that don’t occur in the text, which is a serious problem for semantic analysis algorithms (although I would comment that third party human indexers/editors will use words not in the text too - “aboutness” is a big problem!). He was also very concerned that computer scientists are not taught to see computers as tools for supporting symbiosis with humans, but as black box systems that should operate autonomously. I completely agree - as is probably quite obvious from many of my previous blog posts - get the computers to do the heavy lifting to free up the humans to sort out the anomalies, make the intuitive leaps, and be creative.

UPDATE: Here’s an excellent post on this talk from Open Intelligence.

4 responses so far

Jan 08 2009

Truevert: What is semantic about semantic search?

Published by Fran under search, semantic web

Truevert: What is semantic about semantic search? is an easy introduction to the thinking behind the Truevert semantic search engine. I was heartened by the references to Wittgenstein and the attention Truevert have paid to the work of linguists and philosophers. So much commercial search seems to have been driven by computer scientists with little interest in philosophy, or if they had any they kept quiet about it (any counter examples out there?)! Perhaps philosophers have not been so good at promoting themselves either. Perhaps the Chomskyan attempt to divide linguistics itself into “hard scientific” linguistics and “fuzzy” linguistic disciplines like sociolinguistics has not helped.

As a believer in interdisciplinary and collaborative approaches, I have always wondered why we seemed to be so bad at building these bridges and information science has always struck me as a natural crossing point. Of course, there has been a lot of collaboration, but my impression is that academia has been rather better at this than the commercial world, with organisations like ISKO UK working hard to forge links. Herbert Roitblat at Truevert is obviously proud of their philosophical and linguistic awareness, and more interestingly, thinks it is worth broadcasting in a promotional blog post.

No responses yet

Dec 16 2008

National Centre for Text Mining

Published by Fran under search, semantic web

The National Centre for Text Mining is “the first publicly-funded text mining centre in the world”. It is an initiative of Manchester and Liverpool universities, working with the University of California at Berkeley and the University of Tokyo. They appear to be working mainly on biology texts at the moment, but I enjoyed the explanations of their techniques and processes, despite the technicality. There are links to events and seminars that are aimed at the scientific community but some would probably be of interest to more general semantic web enthusiasts.

No responses yet

Nov 04 2008

Semantic analysis technology and human beings

Published by Fran under search, semantic web

I very much enjoyed the presentations given at the ISKO UK event on semantic analysis technologies yesterday and was particularly heartened by the emphasis placed by almost all of the speakers on the need for a human factor to train, maintain, and moderate software systems. My overall impression was that you can have complex software systems that work very well, but you need a lot of human input to set them up - feeding them carefully crafted controlled vocabularies, taxonomies, and especially ontologies - and preferably checking their output.

The first presentation by Luca Scagliarini of Expert System outlined their use of large-scale taxonomies to create an enhanced index that included linked concepts and relationships. The “turbo-charged” index - a sort of faceted taxonomy - is then run against content to create a sophisticated filterable search function.

Jeremy Bentley of smartlogic described taxonomies as “semantic middleware”, adding that the notion of having one standard taxonomy has given way to a recognition of the need for multiple taxonomies to reflect multiple viewpoints. He pointed out that shopping in a supermarket would be practically impossible if none of the tins had labels and you had to look in each one to find out what was in it, but that this is essentially what search engines do - they look in all the documents to decide what is in them. He stressed that automatic generation of metadata is essential because of the volume and the need for consistency, but that automated systems are not yet good enough to build well-crafted ontologies, as they cannot allow for the complexities of context-specific requirements and differences between subject domains. Nevertheless, human ontologists can be helped by automatically generated suggestions.

Rob Lee of Rattle Research then described a way of leveraging information in DBpedia (outputs of Wikipedia articles as RDF triples) (muddy it) as a controlled vocabulary to generate links from documents, such as news stories, to other free online resources, such as MusicBrainz. DBpedia contains disambiguation information, which improves relevance of links. By adding a search engine layer (they used Lucene to pick out key words), even more interesting links can be made between resources. However, the systems were most successful when restricted to simple identifiable entities - such as people, places, and companies. Such entities can be matched against dictionary/gazetteer-style authority files, which is harder to do for broad subject areas.
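Matching against a dictionary/gazetteer-style authority file can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not a description of Rattle Research's system: the `AUTHORITY` table, the `org:` identifiers, and the `find_entities` function are all hypothetical, and a real system would draw its names and disambiguation data from a source like DBpedia.

```python
import re

# Hypothetical authority file: surface form -> canonical identifier.
AUTHORITY = {
    "bbc": "org:bbc",
    "british broadcasting corporation": "org:bbc",
    "financial times": "org:ft",
}


def find_entities(text, authority=AUTHORITY):
    """Return the sorted canonical IDs whose surface forms appear in text."""
    lowered = text.lower()
    found = set()
    for surface, canonical in authority.items():
        # Word boundaries stop "bbc" matching inside a longer token.
        if re.search(r"\b" + re.escape(surface) + r"\b", lowered):
            found.add(canonical)
    return sorted(found)


ids = find_entities("The BBC and the Financial Times ran a joint project.")
```

Different forms of a name map to the same identifier, which is what makes links between resources possible. It also shows why broad subject areas are harder: there is no finite list of surface forms to look up.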

Helen Lippell, Karen Loasby, and Silver Oliver then talked about three projects at the BBC. Helen described a joint project involving several organisations including the BBC and the Financial Times to generate metadata automatically to tag news stories. They looked for specific names - such as company names - but encountered problems with company names changing, different forms of names, nicknames, etc. Other problems occurred when company names were ordinary language words (Next, thus, IF) or when company names contained symbols or special characters (e.g. M&S, more th>n).

Karen Loasby described a way of prompting journalists to add metadata to articles being added to the BBC’s content management system. Automated analysis worked best for short, factual content. The journalists themselves were often confused by the suggestions, however, and found it hard to grasp the purpose of the metadata. The system is still used in a modified form, and with some editorial supervision.

Silver Oliver then discussed projects to use statistically-based categorisation to try to pull together related content from different repositories. The method was more successful for some topics than others and in some cases, rules-based methods were more successful than statistical ones. A major strength of rules-based methods is that they tend to be less of a “black box”. When irrelevant connections are made, you can usually look at the rules and see why the system has found a link, and manually adjust the rule, but with a statistically-based approach, it is harder to diagnose why false connections have occurred. A disadvantage of the rules-based approach is that the rules need manual updating from time to time.
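The “less of a black box” point can be illustrated with a minimal rules-based categoriser in Python. The `RULES` table and `categorise` function are hypothetical and far simpler than anything described in the talk. The sketch only aims to show that a rule-based result carries its own explanation.

```python
# Hypothetical rules: category -> keywords that trigger it.
RULES = {
    "finance": {"shares", "profit", "bank"},
    "sport": {"match", "goal", "league"},
}


def categorise(text, rules=RULES):
    """Return {category: sorted matched keywords} for every rule that fired."""
    words = set(text.lower().split())
    result = {}
    for category, keywords in rules.items():
        hits = sorted(words & keywords)
        if hits:
            result[category] = hits
    return result


explained = categorise("The bank reported a record profit after the match")
```

Here the text is wrongly linked to “sport”, but the output shows that the word “match” was responsible, so an editor can see the cause and adjust that rule. The cost is the one noted above: someone has to keep the keyword lists up to date by hand.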

The presentations were followed by a panel session. Issues discussed included granularity - automated systems seem to work well with short pieces of content, but longer items - such as books - might need to be broken into smaller units. Historical archives may need different sorts of semantic analysis to ephemeral content, like news stories, and multilingual mapping may be difficult if languages do not have one-to-one correspondences of concepts so hierarchies have to be re-constituted, rather than simply translated.

No responses yet

Oct 30 2008

Semantic web explained

Published by Fran under search, semantic web

How the Semantic Web Will Change Information Management: Three Predictions makes the semantic web sound so easy! Well worth reading for a very straightforward overview of what’s involved.

No responses yet
