Archive for the 'semantic web' Category

Apr 13 2013

ISKO UK 2013 - provisional programme

Published by Fran under KO, search, semantic web

I will probably be on the other side of the Atlantic when the ISKO UK conference takes place in July in London, UK. I will be sorry to miss it, because the committee have brought together a diverse, topical, and fascinating collection of speakers.

ISKO UK excels in unifying academic and practitioner communities, and the conference promises to investigate the barriers that separate research from practice and to seek out boundary objects that can bring the communities together.

This is demonstrated in person by the keynote speakers Patrick Lambe of Straits Knowledge and Martin White of Intranet Focus Ltd - both respected for their commercial as well as academic contributions to the field of Knowledge Organization.

Amidst what is already shaping up to be a very full and varied programme, the presentations by Jeremy Tarling and Matt Shearer (BBC News) and Jarred McGinnis and Helen Lippell (Press Association) will show how research in semantic techniques is now being put to practical use in managing the fast-flowing oceans of information that news organizations handle.

The programme also includes a whole session on combining ontologies with other tools, as well as papers on facet analysis and construction of controlled vocabularies. There’s even some epistemology to please pure theoreticians.


Jan 06 2013

Tag you’re it - but is your tag the same as my tag?

Published by Fran under KO, semantic web

Lots of people talk about tags, and they all tend to assume they mean the same thing. However, there are many different types of tag, from HTML tags for marking up web pages to labels in databases, and this can lead to all sorts of confusion and problems in projects.

Here are some definitions of “tag” that I’ve heard and that are different in significant ways. If you think my definitions can be improved, please comment, and please let me know of any other usages of that tricksy little word “tag” that you’ve happened upon.

 

 

1) A tag is a free text keyword you add as part of the metadata of something to help search

Free text tags are usually uncontrolled and unstructured (folksonomic) simple strings of characters. Free text tagging functionality is usually no more than a simple text field in a database, so it is very easy to implement technically. For limited collections, collections with low research value, user-generated collections, and collections that are not otherwise catalogued, free text tags provide the ability to do at least some searching (e.g. if you have a small collection of still images that have no other metadata attached, any subject keyword tags are better than none).

Folksonomic tagging was hailed as revolutionary a few years ago because it is cheap. However, it fails to solve numerous information retrieval problems. Most significantly, if you use free text tags, you need to do additional work later on to disambiguate them (apple, apple, or apple - company, record label, fruit?) or add any structure to them, including grouping synonyms to provide a more complete search (a search for “automobile” can’t retrieve items tagged “car” unless you can associate these synonyms in a synset, synonym ring, or thesaurus).
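To make the synonym problem concrete, here is a minimal sketch in Python of a synonym ring sitting on top of free text tags. The item ids, tags, and rings are all invented for illustration, and a real system would hold them in a database or thesaurus tool rather than in code.

```python
# Hypothetical synonym rings: each set groups terms treated as equivalent for search.
SYNONYM_RINGS = [
    {"car", "automobile", "motor car"},
    {"film", "movie"},
]

# Hypothetical items carrying uncontrolled free text tags.
ITEMS = {
    "photo-001": {"car", "street"},
    "photo-002": {"automobile"},
    "photo-003": {"apple"},
}


def expand_query(term):
    """Return the term plus every synonym that shares a ring with it."""
    term = term.lower()
    expanded = {term}
    for ring in SYNONYM_RINGS:
        if term in ring:
            expanded |= ring
    return expanded


def search(term):
    """Return the ids of items tagged with the term or any of its synonyms."""
    wanted = expand_query(term)
    return sorted(item_id for item_id, tags in ITEMS.items() if tags & wanted)


print(search("automobile"))  # finds items tagged "car" as well as "automobile"
print(search("apple"))       # finds the item, but cannot say which "apple" it is
```

The ring fixes the "car" and "automobile" case, but nothing in a plain string tells the system whether "apple" is the company, the record label, or the fruit. That disambiguation still has to be added later.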

 

2) A tag is a keyword that is selected from a controlled vocabulary or authority list

Controlled keywords are more useful than free text tags because they reduce the problems of synonyms and disambiguation by making the person applying the tag choose from a limited set of terms. It is easier to build a thesaurus containing all the controlled keywords, as you are not trying to encompass every possible word in the language (or indeed any string of characters that somebody might make up). Controlled vocabularies also avoid apparently trivial but practically problematic issues such as spelling variants and errors and use of abbreviations. However, flat controlled vocabularies become very unwieldy once you have more than about 50 terms. There may be a numeric identifier associated with a controlled vocabulary keyword, but it is usually only some kind of local internal system identifier.

Tags taken from controlled lists are often used for process-driven functions, as opposed to search or browse functions. So, someone might apply a tag from a controlled list to designate a workflow status of an asset. For such processes, it is usually fairly straightforward to control the vocabulary options available, so that only a few labels are available. Linguistic nuances are not so important in such contexts - people are just taught what the options are and usually it doesn’t occur to them to try to use other terms. If the available terms are inadequate, this often means there is something wrong with the business process or the system design (e.g. we need a workflow state of “pending approval” but we only have the labels “created” and “approved”).
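A controlled list for a workflow can be sketched in a few lines of Python. The three states below are taken from the example above; the class and function names are hypothetical.

```python
from enum import Enum


class WorkflowStatus(Enum):
    """Hypothetical controlled list of workflow states for an asset."""
    CREATED = "created"
    PENDING_APPROVAL = "pending approval"
    APPROVED = "approved"


def tag_status(label):
    """Accept a label only if it appears in the controlled list."""
    try:
        return WorkflowStatus(label.strip().lower())
    except ValueError:
        allowed = ", ".join(status.value for status in WorkflowStatus)
        raise ValueError(
            f"'{label}' is not a controlled term; choose one of: {allowed}"
        ) from None


print(tag_status("Pending approval"))  # accepted, whatever the capitalisation
```

Anything outside the list is rejected, so spelling variants and invented terms never reach the database. If people keep hitting the error, that is a hint that the list, or the process behind it, is missing a state.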

 

3) A tag is a keyword that is selected from a taxonomy

Once a controlled vocabulary becomes too long to be easy to navigate, it can be “chunked up” or “broken down” into a taxonomy.

Keywords in taxonomies are more useful than keywords in flat controlled vocabularies because the taxonomy holds information about the relationships between terms. The simplest relationship is broader>narrower (parent>child). This means you can “chunk up” your flat vocabulary list into sections, e.g. to make it easier to navigate, or to offer ways a researcher can modify their search (didn’t find what you wanted – try a broader search, too many results – try a narrower search). Usually internal IDs are used to connect the label displayed in the UI with the graph that contains the relationships between the concepts.

Often a taxonomy will also hold associative (“see also”) relationships, effectively extending the taxonomy to be a taxonomy-with-thesaurus.
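Here is a small Python sketch of the broader>narrower idea, with internal IDs kept separate from display labels. The concepts and IDs are invented, and this is only one of many ways such a structure could be stored.

```python
# Hypothetical taxonomy: internal concept id -> display label and parent (broader) id.
CONCEPTS = {
    "c1": {"label": "Vehicles", "broader": None},
    "c2": {"label": "Cars", "broader": "c1"},
    "c3": {"label": "Sports cars", "broader": "c2"},
    "c4": {"label": "Bicycles", "broader": "c1"},
}


def broader(concept_id):
    """Return the id of the parent concept, or None at the top of the tree."""
    return CONCEPTS[concept_id]["broader"]


def narrower(concept_id):
    """Return the ids of the direct children of a concept."""
    return sorted(cid for cid, c in CONCEPTS.items() if c["broader"] == concept_id)


def descendants(concept_id):
    """Return every concept below this one, for a search that includes narrower terms."""
    found = []
    for child in narrower(concept_id):
        found.append(child)
        found.extend(descendants(child))
    return found


def labels(concept_ids):
    """Translate internal ids into the labels shown in the UI."""
    return [CONCEPTS[cid]["label"] for cid in concept_ids]


print(labels(descendants("c1")))   # everything under "Vehicles"
print(labels([broader("c3")]))     # one step broader than "Sports cars"
```

A "too few results" prompt can call `broader` and a "too many results" prompt can offer the output of `narrower`, while the user only ever sees labels.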

 

4) A tag is a type of Uniform Resource Identifier (URI)

This is the Linked Open Data approach. There are important differences between tag URIs and other types of tag. URI tags have to conform to various technical conventions and standards that support interoperability. In Linked Open Data contexts, URI tags are usually public and shared, rather than being private IDs. Relationships between URIs are usually expressed in an ontology, rather than a taxonomy (although the ontology may associate taxonomies or the ontology may be derived from pre-existing taxonomies).
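As a rough illustration, URI tags and their relationships can be written as subject, predicate, object triples. In the Python sketch below the `example.org` URIs are made up, while the SKOS and Dublin Core predicate URIs are the published ones. A real project would normally use a triple store or an RDF library rather than a Python set.

```python
# Made-up identifiers for illustration; example.org is a reserved domain.
EX = "http://example.org/"

# Published predicate URIs from SKOS and Dublin Core Terms.
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
SKOS_BROADER = "http://www.w3.org/2004/02/skos/core#broader"
DCT_SUBJECT = "http://purl.org/dc/terms/subject"

TRIPLES = {
    (EX + "concept/apple-fruit", SKOS_PREF_LABEL, "Apple (fruit)"),
    (EX + "concept/apple-company", SKOS_PREF_LABEL, "Apple (company)"),
    (EX + "concept/apple-fruit", SKOS_BROADER, EX + "concept/fruit"),
    (EX + "asset/photo-003", DCT_SUBJECT, EX + "concept/apple-fruit"),
}


def match(subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in TRIPLES
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# Which assets are about the fruit, as opposed to the company?
for s, p, o in match(predicate=DCT_SUBJECT, obj=EX + "concept/apple-fruit"):
    print(s)
```

Because the tag is a URI rather than the string "apple", the fruit and the company can never be confused, and anyone else can point at the same identifier.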

 

5) A tag is metadata added to a web page for search engines to index

It is possible to add any of the above types of tag to a web page (you can say a web page is just another type of asset). Differences between tags on assets and tags on web pages are usually to do with the ways those tags are stored and how they are used by other systems (i.e. a stock management system will need different information to a search engine). Search engine optimisation (SEO) bad practices led to a decline in the use of keyword tagging for search engine indexing, although the Semantic Web returns to the principle that content creators are the best people to index their content (see next section).

For web pages, the tags are often added in the header information, along with other instructions to the browser. On individual assets (e.g. photos, videos) in content or asset management systems, the tags are often held in a particular field in a database. For Linked Open Data systems (whether managing web pages, traditional assets, or combinations of both), the tag URIs and their relationships (triples) are usually stored in a triple store, rather than a conventional database.

With web pages, tagging can become very complex, as there might be a mixture of URI tags and basic labels, and a web page can be a complex information system in its own right, containing sub-elements such as audio and video content that itself might have various tags.

 

6) A tag is a label used to mark up content within a web page that can be used for display purposes and for indexing

The language that is used to write web pages (HTML) is often described as comprising tags. So, you tag up flat text with instructions that tell the browser “this is a heading”, “this is a paragraph” etc. With the advent of HTML5 and vocabularies such as schema.org, more and more semantic information is being included in these tags. Search engines can use this information, for example to create more specific indexes.
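The difference between tags in the page header and semantic markup inside the page can be seen by reading both out of the same document. The sketch below uses Python's standard `html.parser` module on an invented page; the page content is hypothetical, and the markup uses the schema.org `NewsArticle` type with its `headline` property in microdata form.

```python
from html.parser import HTMLParser

# Invented page: keyword tags in the header, schema.org microdata in the body.
PAGE = """<html>
<head>
  <title>Strike coverage</title>
  <meta name="keywords" content="miners' strike, coal, 1984">
</head>
<body>
  <div itemscope itemtype="https://schema.org/NewsArticle">
    <h1 itemprop="headline">Strike enters second month</h1>
    <p>Plain paragraph with no semantic markup.</p>
  </div>
</body>
</html>"""


class TagCollector(HTMLParser):
    """Collect header keywords and itemprop values from a page."""

    def __init__(self):
        super().__init__()
        self.keywords = []
        self.properties = {}
        self._current_prop = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "keywords":
            content = attrs.get("content") or ""
            self.keywords = [k.strip() for k in content.split(",")]
        elif "itemprop" in attrs:
            # Remember the property name until its text content arrives.
            self._current_prop = attrs["itemprop"]

    def handle_data(self, data):
        if self._current_prop and data.strip():
            self.properties[self._current_prop] = data.strip()
            self._current_prop = None


collector = TagCollector()
collector.feed(PAGE)
print(collector.keywords)    # keyword tags from the header
print(collector.properties)  # semantic properties marked up in the body
```

The `h1` tag alone says only "this is a heading". The `itemprop` attribute adds that the heading is the headline of a news article, which is the sort of extra meaning an indexer can use.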

 

So, when you ask someone if the content is tagged, and they say yes, it is always worth checking you both actually mean the same thing!


Dec 02 2012

Libraries, Media, and the Semantic Web meetup at the BBC

In a bit of a blog cleanup, I discovered this post languishing unpublished. The event took place earlier this year but the videos of the presentations are still well worth watching. It was an excellent session with short but highly informative talks by some of the smartest people currently working in the semantic web arena. The videos of the event are available on YouTube.

Historypin

Jon Voss of Historypin was a true “information altruist”, describing libraries as a “radical idea”. The concept that people should be able to get information for free at the point of access, paid for by general taxation, has huge political implications. (Many of our libraries were funded by Victorian philanthropists who realised that an educated workforce was a more productive workforce, something that appears to have been largely forgotten today.) Historypin is seeking to build a new library, based on personal collections of content and metadata – a “memory-sharing” project. Jon eloquently explained how the Semantic Web reflects the principles of the first librarians in that it seeks ways to encourage people to open up and share knowledge as widely as possible.

MIMAS

Adrian Stevenson of MIMAS described various projects including Archives Hub, an excellent project helping archives, and in particular small archives that don’t have much funding, to share content and catalogues.

rNews

Evan Sandhaus of the New York Times explained the IPTC’s rNews – a news markup standard that should help search engines and search analytics tools to index news content more effectively.

schema.org

Dan Brickley’s “compare and contrast” of Universal Decimal Classification with schema.org was wonderful, and he reminded technologists that it is very easy to forget that librarians and classification theorists were attempting to solve search problems far in advance of the invention of computers. He showed an example of “search log analysis” from 1912, queries sent to the Belgian international bibliographic service – an early “semantic question answering service”. The “search terms” were fascinating and not so very different to the sort of things you’d expect people to be asking today. He also gave an excellent overview of Lonclass, the BBC Archive’s largest classification scheme, which is based on UDC.

BBC Olympics online

Silver Oliver described how BBC Future Media is pioneering semantic technologies and using the Olympic Games to showcase this work on a huge and fast-paced scale. By using semantic techniques, dynamic rich websites can be built and kept up to the minute, even once results start to pour in.

World Service audio archives

Yves Raimond talked about a BBC Research & Development project to automatically index World Service audio archives. The World Service, having been a separate organisation to the core BBC, has not traditionally been part of the main BBC Archive, and most of its content has little or no useful metadata. Nevertheless, the content itself is highly valuable, so anything that can be done to preserve it and make it accessible is a benefit. The audio files were processed through speech-to-text software, and then automated indexing was applied to generate suggested tags. The accuracy rate is about 70%, so human help is needed to sort out the good tags from the bad (and occasionally offensive!) tags, but this is still a lot easier than tagging everything from scratch.


Aug 11 2012

SLA Conference in Chicago

Last month I had a wonderful time at the SLA (Special Libraries Association) conference in Chicago. I had never previously been to an SLA conference, even though there is a lively SLA Europe division. SLA is very keen to be seen as “not just for librarians” and the conference certainly spanned a vast range of information professions. The Taxonomy Division is thriving and there seem to be far more American than British taxonomists, which, although not surprising, was a pleasure as I don’t often find myself as one of a crowd! The conference has a plethora of receptions and social events, including the “legendary” IT division dance party.

There were well over 100 presentation sessions, as well as divisional meetings, panel discussions, and networking events that ranged from business breakfasts to tours of Chicago’s architectural sights. There was plenty of scope to avoid or embrace the wide range of issues and areas under discussion and I focused on taxonomies, Linked Data, image metadata, and then took a diversion into business research and propaganda.

I also thoroughly enjoyed the vendor demonstrations, especially the editorially curated and spam-free search engine Blekko, the legal information vendors FastCase and Law360, and EOS library management systems.

My next posts will cover a few of the sessions I attended in more detail. Here’s the first:

Adding Value to Content through Linked Data

Joseph Busch of Taxonomy Strategies offered an overview of the world of Linked Data. The majority of Linked Data available in the “Linked Data Cloud” is US government data, with Life Sciences data in second place, which reflects the communities that are willing and able to make their data freely and publicly available. It is important to keep in mind the distinction between concept schemes - Dublin Core, FOAF, SKOS, which provide structures but no meanings - and semantic schemes - taxonomies, controlled vocabularies, ontologies, which provide meanings. Meanings are created through context and relationships, and many people assume that equivalence is simple and association is complex. However, establishing whether something is the “same” as something else is often far more difficult than simply asserting that two things are related to each other.

Many people also fail to use the full potential of their knowledge organization work. Vocabularies are tools that can be used to help solve problems by breaking down complex issues into key components, giving people ways of discussing ideas, and challenging perceptions.

The presentation by Joel Richard, web developer at the Smithsonian Libraries, focused on their botanic semantic project – digitizing and indexing Taxonomic Literature II. (I assume they have discussed taxonomies of taxonomy at some point!) This is a fifteen-volume guide to the literature of systematic botany published between 1753 and 1940. The International Association for Plant Taxonomy (IAPT) granted permission to the Smithsonian to release the work on the web under an open licence.

The books were scanned and run through OCR, which achieved 99.97% accuracy. That sounds impressive, but it actually means 5,000-12,000 errors – far too many for serious researchers. Errors in general text were less of a concern than errors in citations and other structured information, where, for example, mistaking an 8 for a 3 could be very misleading. After some cleanup work, the team next identified terms such as names and dates that could be parsed and tagged, and selected sets of pre-existing identifiers and vocabularies. They are continuing to look for ontologies that may be suitable for their data set. Other issues to think about are software and storage. They are using Drupal rather than a triplestore, but are concerned about scalability, so are trying to avoid creating billions of triples to manage.
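The arithmetic behind that accuracy figure is worth spelling out. The Python sketch below assumes, and this is an assumption rather than something stated in the talk, that 99.97% is a per-character accuracy rate, which means 3 errors in every 10,000 characters. Under that reading, 5,000-12,000 errors would correspond to very roughly 17 to 40 million characters of text.

```python
# Assumption: 99.97% accuracy is measured per character, i.e. 3 errors per 10,000.
ERRORS_PER_10K = 3


def expected_errors(characters):
    """Estimate how many characters OCR gets wrong in a text of the given length."""
    return round(characters * ERRORS_PER_10K / 10_000)


def implied_characters(errors):
    """Work backwards from an error count to the implied length of the text."""
    return round(errors * 10_000 / ERRORS_PER_10K)


for errors in (5_000, 12_000):
    print(f"{errors:,} errors imply about {implied_characters(errors):,} characters")
```

A tiny error rate multiplied by a very large text still produces thousands of mistakes, and a single wrong digit in a citation is enough to send a researcher to the wrong page or year.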

Joel also outlined some of the benefits of using Linked Data, gave some examples of successful projects, and provided links to further resources.


Jul 31 2012

New York Public Library and metadata

I spent a wonderful afternoon at the New York Public Library on July 20th, thanks to Phil Sutton, reference librarian, who was kind enough to talk to me about his work and introduce me to several of his colleagues in the NYPL Labs, website, and local history teams.

As the Library holds such vast and diverse collections, it is not surprising that the metadata work of the Labs team is varied and wide ranging. One project involves rationalising and mapping metadata across collections that use different standards, another involves creating metadata for content strategy and website navigation, while more experimental work includes looking to use Linked Data techniques to open up and cross reference data sets.

What’s on the Menu? is using crowdsourced help to transcribe the Library’s collection of restaurant menus. So far, 998,899 dishes have been transcribed from 14,872 menus, and the team are investigating ways of linking the data to enable researchers to make interesting connections. At present the data is in a fairly raw form, but it is available through an API.

The Labs team are also working on the Library’s numerous directories, with an emphasis on helping genealogists, starting with census data from 1940 in the DirectMe project.

Previous projects have opened up collections of stereographs and maps, as well as content related to musical theatre, theatrical lighting, and the Shelley-Godwin archive.


May 20 2012

Google goes semantic

Published by Fran under search, semantic web

A happy week for ontologists, taxonomists, and other knowledge organisers as Google reveals its knowledge graph.

Patrick Lambe sums it up wonderfully:
Google Finally Comes Out of the Closet on Taxonomies.

Here’s a great post by Seth Earley:
Google Knowledge Graph and Taxonomy - It’s in There.


Mar 29 2012

On Location - geospatial information

Published by Fran under semantic web

I attended an event co-hosted by ISKO UK and the British Computer Society about location data and have written about it for the ISKO UK blog.


Nov 20 2011

Data Ghosts in the Facebook Machine by Fantasticlife

Published by Fran under culture, search, semantic web

Understanding how data mining works is going to become increasingly important. There is a huge gap in popular and even professional knowledge about what organisations can now do “under the surface” with our data. For a very clear and straightforward explanation of how social graphs work and why we should be paying attention, read Data Ghosts in the Facebook Machine.


Nov 13 2011

Holodecks, marketing, and crime scenes - the DAM link between different worlds

In the last two weeks I have attended three very different conferences, with DAM as the common thread. The first was Media Pro Expo, where I spoke on a panel with the DAM Foundation, alongside Mark Davey, Madi Solomon, and David Lipsey. The second was Createasphere’s first European DAM conference, and the third (co-located with the Createasphere event) was the SPAR Europe Conference on 3D Imaging and Data Management for Engineering, Construction, Manufacturing, and Security.

The contrast between Media Pro and SPAR, and their respective audiences, was striking, but so were the similarities of the problems they faced, such as the common need to manage rich media assets and huge volumes of data. Media Pro was aimed at marketing companies, and had lots of amusing exhibits showcasing ways of using technology to create engaging and entertaining campaigns. (I enjoyed playing with an interactive magazine cover linked to a camera that allowed you to put your picture “on the cover” and select your favourite headlines.) Marketing companies are concerned with keeping, curating and mining data not just about customers’ contact details, but also their likes, social connections, and shopping habits in order to create personalised campaigns, so they have become great consumers of metadata.

3D Imaging and Data Management

SPAR was all about scanning and mapping, not in the sense that I am familiar with, but literally surveying the Earth and making maps. There were companies that use lasers to create roadmaps, others that carry out aerial surveys, and some that create 3-D representations of buildings. There are systems for surveying and modelling building sites to make sure that construction avoids sewers, pipes, and underground cables, and even a system for creating 3-D photosets of crime scenes to help the police in investigation and evidence gathering.

Createasphere

At Createasphere I talked about managing metadata in complex information environments and how we need to treat metadata as content in its own right. There were a range of excellent and diverse presentations, covering topics from the potential of immersive virtual worlds and the huge volumes of data they produce, to descriptions of technical metadata exchange projects.

I began to think about the crossover point between the creativity and imagination of the media and marketing companies and the power and accuracy of the surveying companies and how this is going to bring about hugely powerful fantasy “Holodeck” worlds that will make Second Life and the Sims look quainter than the Mickey Mouse cartoons of the 1930s.

Better than the real world

One challenge for information professionals is to think about how we can create navigation and search systems that do more than just replicate the real-world paradigms we are used to at the moment (I am thinking of things like road signs and timetables), and that instead harness the best of semantic techniques and data mining processes to create reactive, intuitive worlds that work better than the real one. Ed Lantz of Vortex Immersion Media spoke of “intelligent spaces” that automatically access our data, our assets, information about us, and arrange themselves to suit us. How do we prepare for a world in which the likes of Apple’s speech recognition system Siri aren’t genies in bottles, but are the environment around us? We used to worry about ghosts in the machine, but will we end up as the ghosts inside the machine? We worry about putting our assets out there into the cloud, but perhaps we should be thinking more about what it will be like when we step inside the cloud or bring the cloud into our homes?

There was a post circulating on Twitter recently describing the library of the future as a hellish place where characters from books come alive and stalk the readers in the rooms. It was somewhat derided as a childish joke, but if we create Holodecks and then try to live in them, it could well come true. The implicit warning it contains that we could inadvertently trap ourselves in such a hellish place where privacy, rights, control, and manipulation are so hidden from view that we lose our sense of self seems to be very mature and insightful. Another post I read was about how interface designers are currently working on “pictures under glass” and need to start to use the full tactile, haptic, and 360 degree expressivity of our physical bodies, such as we are beginning to with technologies like the Wii and Kinect.

Making work fun

Theresa Regli of the Real Story Group pointed out that the world we are in now is one in which people still don’t grasp the importance of labelling their images, so immersive virtual worlds seem a long way off, but she also talked of the need for corporate interfaces to embrace “gamification”, as employees are far more productive when their jobs are fun. It may take some time, but I like the idea of a Holodeck meeting room where people make presentations and collaborate on plans by dancing around, rather than sitting staidly at a table. Rather than the hellish library where AI brings fictional monsters to life, it might turn out to be a lot of fun and all that movement may even be good for our health!


Oct 09 2011

More than a schedule, give me an index

Published by Fran under KO, archives, semantic web

People have started to talk about the death of the schedule. Some complain that broadcasters are ill-prepared for this inevitability, while schedulers complain that no-one appreciates their skills in placing programmes appropriately and in context. One example is “hammocking” – making sure that viewers receive a “varied diet” across an evening, perhaps placing the news between two lighthearted pop culture programmes.

Meanwhile, the anti-schedulists point out that given the choice, some people will download and watch an entire series in one marathon session (people have “Torchwood weekends”), so that they don’t have to commit to being in front of the TV at 9pm every Thursday, or will watch a film broken down into 20 minute sections on their mobile phone while commuting. Schedulism and anti-schedulism can seem like a major culture clash, but the conflict is easily resolved when you think purely in terms of knowledge organisation.

A schedule is just metadata

A schedule is merely a set of metadata about programmes. It used to be the most important set of metadata for most people (along with the programme title!) as it was the key to not missing the programme. Now that we have catchup services and archives, knowing exactly when a programme will be broadcast or was broadcast may be less significant for finding that programme again, leading some people to claim that schedules are no longer needed. However, there are plenty of people who don’t want to look for specific programmes but want to sit down and be entertained for the evening. For them, schedules remain vital as they outline what is available. Scheduling in this sense is editorial selection, with all the craftsmanship and judgement that implies.

People are fascinated to know what was broadcast on the day they were born, and which programmes went out together, and schedules offer all sorts of socio-political and cultural information, giving snapshots of what were popular topics or contentious issues over time.

Schedule data is less significant in a vast online digital archive, but it is still useful. For example, you might want to find an episode you missed in a long-running series. You probably won’t know that it was episode 12 of 26, but you might remember that the reason you missed it was because you were out celebrating a friend’s birthday, which is a date you know. This may be a lot quicker than reading through the episode descriptions, which are usually too vague to be helpful, as the writers don’t want to give away “spoilers”, such as the final cliffhanger, which is often the part of the episode you remember the best. The programme descriptions are intended to entice you to watch the programme, not help you work out whether or not you have already seen it.
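The birthday example amounts to a lookup on the broadcast date instead of the episode number. Here is a minimal Python sketch with invented series names, episode numbers, and dates, treating the schedule as just another set of metadata fields.

```python
from datetime import date

# Hypothetical schedule records; every name, number, and date here is invented.
SCHEDULE = [
    {"series": "Example Drama", "episode": 11, "broadcast": date(2011, 9, 15)},
    {"series": "Example Drama", "episode": 12, "broadcast": date(2011, 9, 22)},
    {"series": "Example Drama", "episode": 13, "broadcast": date(2011, 9, 29)},
    {"series": "Another Series", "episode": 4, "broadcast": date(2011, 9, 22)},
]


def episodes_broadcast_on(series, remembered_date):
    """Return the episode numbers of a series that went out on a given date."""
    return [
        entry["episode"]
        for entry in SCHEDULE
        if entry["series"] == series and entry["broadcast"] == remembered_date
    ]


def listing_for(day):
    """Rebuild what went out together on one day, across all series."""
    return sorted(
        (entry["series"], entry["episode"])
        for entry in SCHEDULE
        if entry["broadcast"] == day
    )


# "I missed it because I was out for a friend's birthday on 22 September."
print(episodes_broadcast_on("Example Drama", date(2011, 9, 22)))
print(listing_for(date(2011, 9, 22)))
```

The same records answer both questions: which episode did I miss, and what else was on that day.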

Don’t ditch the schedule, add to it

What is important to bear in mind is that digital archives can offer schedule data almost effortlessly, but can offer many more metadata streams as well. These metadata streams are in many ways innovative and can lead to fascinating new ways of grouping programmes and promoting content. Rich subject metadata (such as a subject index) becomes an engine by which you can drive all sorts of automatically created content channels. You can group programmes by theme or topic as well as series and genre. So you don’t have to rely on when something was shown, you can use an index to gather together all programmes about fishing, or harpsichords, or the miners’ strike – bringing together documentaries (Heart of the Matter, Panorama), news and current affairs (also Question Time, Newsnight, even The Money Programme), as well as plays (The Price of Coal), or even comedies (The Comic Strip Presents... The Strike).

Such subject metadata also gives you extra contextual information. For example, in the case of the miners’ strike, it shows you that there were miners’ strikes in 1921, 1926, 1955, 1972, 1974, 1981, as well as in 1984, and that miners around the world have gone on strike at various times. This historic perspective is hard to pick out from schedule data. (Even if you could see programmes about miners’ strikes had been broadcast in these years, you would have to do further research to find out if they were covering contemporary events.) If the programmes have such metadata attached, anyone – any user of the archive – can effectively build rich personalised channels on their favourite topics or themes, and share those with others who have similar interests.
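A subject index of this kind is essentially an inverted index from subject terms to programmes. The Python sketch below uses a few of the programme titles mentioned above, but the genre labels and subject terms attached to them are hypothetical, as is the fishing programme.

```python
from collections import defaultdict

# Titles are from the examples above; genres and subject terms are hypothetical.
PROGRAMMES = [
    {"title": "Panorama", "genre": "documentary", "subjects": {"miners' strike"}},
    {"title": "The Price of Coal", "genre": "play",
     "subjects": {"miners' strike", "coal mining"}},
    {"title": "The Comic Strip Presents... The Strike", "genre": "comedy",
     "subjects": {"miners' strike"}},
    {"title": "Hypothetical Angling Hour", "genre": "documentary",
     "subjects": {"fishing"}},
]


def build_subject_index(programmes):
    """Map each subject term to the titles of the programmes indexed with it."""
    index = defaultdict(list)
    for programme in programmes:
        for subject in programme["subjects"]:
            index[subject].append(programme["title"])
    return index


def channel(index, subject):
    """Return an automatically assembled 'channel' for one subject."""
    return sorted(index.get(subject, []))


index = build_subject_index(PROGRAMMES)
for title in channel(index, "miners' strike"):
    print(title)
```

The channel cuts straight across genre and transmission date, which is exactly what schedule data on its own cannot do.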

Metadata advertises content

If the metadata is in a Linked and Open format, the associative trails can wander beyond your collection to others, reaching new audiences, perhaps via social networks. This releases the “long tail” of content that is otherwise hard to find and re-use, as well as putting popular content into context. Making your metadata available more widely means more people will have more and more routes in to exploring your archive, even if you choose to restrict this to in-house teams or paying subscribers.

Either way, if you want to sell individual programmes or parts of programmes, knowing not just when you transmitted them but knowing exactly what they are about - via the rich semantic metadata you have added - offers a very useful sales and marketing tool.

