Aug 18 2012

Keeping your Taxonomy Fresh and Relevant - SLA Chicago

Published by Fran under KO

Matt Johnson from EMC gave an extremely clear and useful guide to managing change in taxonomy projects.

Matt carried out a survey of taxonomy projects which looked at how many people were involved in processing taxonomy change requests and what business processes there were around making changes.

He stressed that taxonomies may last longer than taxonomists, especially from the point of view of staying within an organization, and so business processes need to be defined and established to make sure the taxonomy doesn’t wither and decay once the people who created it move on. However, business processes themselves need to be dynamic and flexible too. Changes need to be integrated with existing data and workflows, and taxonomy managers need to have good communications with consumers of taxonomy data. Service-level agreements and metrics can help in negotiations and monitoring by clarifying and documenting issues and agreements, especially when liaising with other departments and external consumers of taxonomies.

My presentation gave an overview of the taxonomy migration and revision project I have been working on for the past couple of years.

Matt and I were delighted to have such a big and lively audience for our session, especially as it was at 8 am! Thank you to everyone who joined us, to SLA’s Taxonomy division for organizing the session, to the session sponsor Gale Cengage Learning, and to Larry Lempert for moderating.

Aug 11 2012

SLA Conference in Chicago

Last month I had a wonderful time at the SLA (Special Libraries Association) conference in Chicago. I had never previously been to an SLA conference, even though there is a lively SLA Europe division. SLA is very keen to be seen as “not just for librarians” and the conference certainly spanned a vast range of information professions. The Taxonomy Division is thriving and there seem to be far more American than British taxonomists, which, although not surprising, was a pleasure as I don’t often find myself as one of a crowd! The conference has a plethora of receptions and social events, including the “legendary” IT division dance party.

There were well over 100 presentation sessions, as well as divisional meetings, panel discussions, and networking events that ranged from business breakfasts to tours of Chicago’s architectural sights. There was plenty of scope to avoid or embrace the wide range of issues and areas under discussion and I focused on taxonomies, Linked Data, image metadata, and then took a diversion into business research and propaganda.

I also thoroughly enjoyed the vendor demonstrations, especially the editorially curated and spam-free search engine Blekko, the legal information vendors FastCase and Law360, and EOS library management systems.

My next posts will cover a few of the sessions I attended in more detail. Here’s the first:

Adding Value to Content through Linked Data

Joseph Busch of Taxonomy Strategies offered an overview of the world of Linked Data. The majority of Linked Data available in the “Linked Data Cloud” is US government data, with Life Sciences data in second place, which reflects the communities that are willing and able to make their data freely and publicly available. It is important to keep in mind the distinction between concept schemes - Dublin Core, FOAF, SKOS, which provide structures but no meanings - and semantic schemes - taxonomies, controlled vocabularies, ontologies, which provide meanings. Meanings are created through context and relationships, and many people assume that equivalence is simple and association is complex. However, establishing whether something is the “same” as something else is often far more difficult than simply asserting that two things are related to each other.

Many people also fail to use the full potential of their knowledge organization work. Vocabularies are tools that can be used to help solve problems by breaking down complex issues into key components, giving people ways of discussing ideas, and challenging perceptions.

The presentation by Joel Richard, web developer at the Smithsonian Libraries, focused on their botanic semantic project – digitizing and indexing Taxonomic Literature II. (I assume they have discussed taxonomies of taxonomy at some point!) This is a fifteen-volume guide to the literature of systematic botany published between 1753 and 1940. The International Association for Plant Taxonomy (IAPT) granted permission to the Smithsonian to release the work on the web under an open licence.

The books were scanned and run through OCR, which achieved 99.97% accuracy. That sounds impressive, but it actually means 5,000-12,000 errors – far too many for serious researchers. Errors in general text were less of a concern than errors in citations and other structured information, where, for example, mistaking an 8 for a 3 could be very misleading. After some cleanup work, the team next identified terms such as names and dates that could be parsed and tagged, and selected sets of pre-existing identifiers and vocabularies. They are continuing to look for ontologies that may be suitable for their data set. Other issues to think about are software and storage. They are using Drupal rather than a triplestore, but are concerned about scalability, so are trying to avoid creating billions of triples to manage.
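To put those OCR figures in perspective, here is a rough back-of-the-envelope sketch. It assumes the 99.97% figure is a per-character accuracy rate, which my notes from the talk do not confirm, and the function names are my own.

```python
def expected_errors(total_chars: int, accuracy: float) -> int:
    """Estimate how many characters are misread at a given OCR accuracy."""
    error_rate = 1.0 - accuracy
    return round(total_chars * error_rate)


def chars_for_errors(errors: int, accuracy: float) -> int:
    """Work backwards: how many characters would produce this many errors?"""
    return round(errors / (1.0 - accuracy))


if __name__ == "__main__":
    accuracy = 0.9997  # 99.97%, assumed to be measured per character
    for errors in (5_000, 12_000):
        print(errors, "errors implies about", chars_for_errors(errors, accuracy), "characters")
```

On that assumption, the quoted range of 5,000-12,000 errors corresponds to roughly 17 to 40 million characters of text, which shows how quickly a tiny error rate adds up over a large work.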

Joel also outlined some of the benefits of using Linked Data, gave some examples of successful projects, and provided links to further resources.

Jun 06 2012

Building, visualising and deploying taxonomies and ontologies; the reality - Content Intelligence Forum event

I have been trying to get to the Content Intelligence Forum meetups for some time as they always seem to offer excellent speakers on key topics that don’t tend to get the attention they deserve, so I was delighted to be able to attend Stephen D’Arcy’s talk a little while ago on taxonomies and ontologies.

Stephen has many years of experience designing semantic information systems for large organisations, ranging from health care providers, to banks, to media companies. His career illustrates the transferability and wide demand for information skills.

His 8-point checklist for a taxonomy project was extremely helpful – Define, Audit, Tools, Plan, Build, Deploy, Governance, Documentation – as were his tips for managing stakeholders, IT departments in particular. He warned against the pitfalls of not including taxonomy management early enough in search systems design, and the problems that you can be left with if you do not have a flexible and dynamic way of managing your taxonomy and ontology structures. He also included a lot of examples that illustrated the fun aspects of ontologies when used to create interesting pathways through entertainment content in particular.

The conversation after the talk was very engaging and I enjoyed finding out about common problems that information professionals face, including how best to define terms, how to encourage clear thinking, and how to communicate good research techniques.

May 20 2012

Google goes semantic

Published by Fran under search, semantic web

A happy week for ontologists, taxonomists, and other knowledge organisers as Google reveals its knowledge graph.

Patrick Lambe sums it up wonderfully:
Google Finally Comes Out of the Closet on Taxonomies.

Here’s a great post by Seth Earley:
Google Knowledge Graph and Taxonomy - It’s in There.

May 25 2011

Review of The Accidental Taxonomist

Published by Fran under KO

Rather late to the party on this one, but I finally got around to reading The Accidental Taxonomist by Heather Hedden. I have to confess to bias as I was very pleased to see that my FUMSI article on folksonomies was mentioned in the recommended reading section. Writing in a clear, sensible and readable tone, Hedden gives a very thorough overview of practical taxonomy work. The book works as a textbook, but reads pleasantly, and although I anticipate referring to it as a reference resource, I enjoyed reading it chapter by chapter.

I am very pleased to have so much practical and useful information in one place (for example lists of relevant standards, definitions of taxonomies, ontologies, and thesauri, the functions of taxonomies) as in my day-to-day work, I often have to explain the basics to people. I have been recommending the book to my team, especially those who are new to taxonomies, and they have appreciated its clarity and comprehensiveness as a “field guide”. It covered familiar ground, but for me much of that was my “tacit knowledge” that I had never fully articulated to myself, so I am sure that this “knowledge capture” from the mind of an experienced taxonomy practitioner will be very useful.

Jul 03 2010

Procuring a Digital Asset Management system

Published by Fran under Digital Asset Management

This is the first of a series of summaries of the Henry Stewart DAM London conference on June 30, chaired by David Lipsey. The panels (one of which included me) were a pleasing mix of very practical information and more theoretical discussion.

Classic DAM vendor “overstatements”

Theresa Regli, who does a great job as a “professional sceptic”, stressed the need for a calm and considered approach to procurement with the most important stage being the testing stage. You wouldn’t buy a car without taking it for a test drive, but people buy software without finding out if it can handle their content. Nobody’s assets and business processes are exactly the same, and just because a system suited somebody else perfectly doesn’t mean it is right for you. Vendors will say that they can do anything, but that’s their job so don’t take their word for it. Don’t be distracted by the coolest of the cool new features or other bells and whistles. Cool costs - but may not make - money for your business. On the one hand, if the cool features don’t actually improve your specific business processes, they won’t benefit you, and on the other, vendors have become increasingly adept at marketing the same old features in new ways, so it is very important to dig beneath the surface to find out how they are doing what they claim. Surprisingly little has changed technologically in the DAM vendor landscape over the last five years. So, a wonderful new system for automatically indexing images directly may in fact just be the familiar territory of analysing textual metadata associated with images.

Speech to text

One area that has moved on is the technology to convert speech to text. This means that you can, to an extent, subtitle a film automatically (which isn’t quite the same as a system that can “watch a movie and understand what’s going on scene by scene”). This then gives you a chunk of textual metadata you can search and analyse (“understanding” what’s going on relies on sentiment analysis – looking up words in thesauruses, so, for example, if the dialogue mentions guns, shooting, and bullets a lot, the software could suggest it is a gunfight scene). However, accuracy rates are patchy and the systems require training, which could be labour intensive, so you need to make sure those training costs and the time required are included in budgets and schedules. The systems work best if you can get everything read by someone like Patrick Stewart, as he has very clear and even enunciation. Anyone with an unusual accent or who mumbles is far more difficult to process. As usual, the software is easiest to train if you are working within a specific context, so you can focus on relevant words and accents, rather than anything anyone anywhere in the world might happen to say.
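The thesaurus-lookup idea can be sketched very simply. This is only a toy illustration of counting related terms in a transcript, not how any particular vendor's product works; the term lists, the `suggest_scene` name, and the threshold are all made up for the example.

```python
import re
from collections import Counter

# Hypothetical mini-thesaurus: scene label -> related terms.
SCENE_TERMS = {
    "gunfight": {"gun", "guns", "shoot", "shooting", "bullet", "bullets"},
    "car chase": {"car", "drive", "driving", "brakes", "faster"},
}


def suggest_scene(transcript: str, min_hits: int = 3):
    """Return the label whose terms occur most often, or None if too few hits."""
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter()
    for word in words:
        for label, terms in SCENE_TERMS.items():
            if word in terms:
                counts[label] += 1
    if not counts:
        return None
    label, hits = counts.most_common(1)[0]
    # Only suggest a label when the evidence passes the (arbitrary) threshold.
    return label if hits >= min_hits else None
```

A transcript such as "He pulled a gun. Stop shooting! The bullets hit the wall." would be labelled as a gunfight, while a single passing mention of a gun would not. Working within a specific context, as noted above, amounts to keeping the term lists small and relevant.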

A clever use of the technology is by the car industry to save time analysing focus group interviews. They asked interviewers to “audio index” their interviews by saying a key “trigger” word when somebody in the focus group said something interesting. The technology was set to clip out a section of video a few seconds before and after the trigger word, so the interviewers could then automatically generate “edited” versions of the interviews, saving a lot of time. I can see this being a great tool for anyone processing ethnographic data or conducting UX or similar testing based on interviews.
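The clipping step is easy to picture in code. The sketch below assumes the speech-to-text system returns each recognised word with a timestamp in seconds; that input format, the five-second margins, and the function name are my assumptions, not details from the talk.

```python
def clip_windows(timed_words, trigger, before=5.0, after=5.0):
    """Return merged (start, end) windows in seconds around each trigger word.

    timed_words is a list of (seconds, word) pairs (assumed input format).
    """
    windows = []
    for seconds, word in sorted(timed_words):
        if word.lower() != trigger.lower():
            continue
        start, end = max(0.0, seconds - before), seconds + after
        if windows and start <= windows[-1][1]:
            # Overlaps the previous clip, so extend it instead of duplicating.
            windows[-1] = (windows[-1][0], max(windows[-1][1], end))
        else:
            windows.append((start, end))
    return windows
```

The resulting list of time ranges could then be handed to a video editing tool to cut the “edited” version. Merging overlapping windows avoids repeating footage when the interviewer says the trigger word twice in quick succession.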

Zooming in on the detail

Another feature Theresa demonstrated was a high definition zooming tool, so that you can see very fine detail in your digital images – lovely for museums and art galleries but costly in terms of storage space and bandwidth. I could see it working well as an in-gallery interactive guide to certain collections. It wouldn’t be so good if you were trying to access it externally from a dodgy wifi or bandwidth-limited connection.

(The British Museum’s Magnificent Maps collection – which I saw on a London IA visit – has an interesting interactive zoom feature that works entirely differently, but was very popular. It worked by using a “magnifying glass” – actually a device with some LED transmitters that send an infrared signal to a webcam to trigger a zoom response through a special display interface.)

Procurement process tips

The other panel members talked through various DAM system procurement processes, from a huge global project for Cambridge University Press that began with a list of 452 vendors, through to a very detailed process for adidas with a smaller initial list but a large number of criteria to be fulfilled. It was pleasing that the panel agreed that cultural fit can be as important as any technical specifications. A state of the art or very large vendor who just doesn’t get your world is very unlikely to provide you with a good solution, but a mid-range vendor who really understands your particular context is much more likely to find or develop something that matches your business processes.

Although the use of personas (popular in the UX world) in procurement is quite unusual, Theresa suggested that user stories could be more effective than requirements spreadsheets. Vendors are likely to tick all the boxes in the spreadsheet without getting to grips with the business processes behind them. It is also hard to explain complex interactions as sets of requirements, but telling a story can make it clear what the system as a whole should provide, e.g. Sue has to research images for marketing campaigns and make sure that editors based in offices around the globe can see them to approve them; designers need to be able to access them remotely; and the images then need to be output in a variety of formats for publication both in print and online.

It is also worth making sure that any arrangements with outsourced suppliers are checked. Sometimes vendors will provide case studies of a successful implementation but not mention that they have never worked with your supplier before.

I noted the emphasis panellists placed on making sure taxonomies and vocabularies are user-friendly and effective in order to get the best out of any DAM system.

Manage your metadata

Sarah Saunders of Electric Lane discussed the importance of controlled vocabularies and managed metadata for image search and management. Speech-to-text software can’t help with stills collections, or when part of your collection is video without accompanying audio (e.g. a rushes collection – the “spare” footage that wasn’t used in a broadcast and which often has no associated dialogue or voiceover script). She described advances in visual sorting software that use a combination of textual metadata and content-based image retrieval (CBIR) to refine search results. Although CBIR is still in its infancy, when running over a small image set pre-selected by text searching it can be very helpful. CBIR can identify basic features like the colour that is used the most in an image, which is not much help if you run it over a large image collection with no other metadata (e.g. “give me all the mainly red pictures” will bring up images of everything from fire engines to strawberries – fun if all you want is inspiration, not so good if you have something more specific in mind). However, if you have a set of images of the Eiffel Tower, for example, it could distinguish between close-ups and shots with lots of blue sky. If you like the blue sky ones, you can click on one and ask for “more like this” and be offered other mainly blue sky ones.
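As a very rough illustration of the “more like this” idea, the sketch below ranks a small, already text-filtered set of images by how much of each one is mainly blue. Real CBIR systems use far richer features than this; the pixel-list representation and the function names are simplifications of my own.

```python
def colour_share(pixels, channel):
    """Fraction of pixels whose single strongest RGB channel is `channel` (0=R, 1=G, 2=B)."""
    if not pixels:
        return 0.0
    dominant = sum(1 for p in pixels if p[channel] == max(p) and p.count(max(p)) == 1)
    return dominant / len(pixels)


def more_like_this(chosen, candidates, channel=2):
    """Rank the other image names by how close their colour share is to the chosen image."""
    target = colour_share(candidates[chosen], channel)
    others = [name for name in candidates if name != chosen]
    return sorted(others, key=lambda name: abs(colour_share(candidates[name], channel) - target))
```

Here `candidates` is a dictionary of image name to a list of `(r, g, b)` tuples, standing in for the Eiffel Tower images a text search has already found. Picking a blue-sky shot as `chosen` brings the other blue-sky shots to the top and pushes the close-ups down.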

The second panel will be the subject of my next post.

May 02 2010

The power of parametadata

First we had content, then not long after that we had metadata, although no-one called it that. Now we need parametadata – the metadata about metadata!

Neither metadata nor parametadata is anything new, but what is new is how central they have become to all sorts of business processes. People think there is something modern and techie about metadata, but ever since the first author signed their initials on a piece of work, or added a title, we have had metadata. Librarians are just one group who have been using metadata for centuries.

Thanks to technological advances, there is now a huge amount of processing that can be done with metadata, indeed that needs to be done if we are to have any idea what assets we have available. Metadata has become the active driver of numerous business processes. You couldn’t operate a computer without the metadata that tells you the name of a file, its location, when it was last saved, etc. and this sort of metadata is so ubiquitous that nobody tends to think about it too much. Now metadata is so pervasive, it is becoming increasingly important to talk about it and define different aspects and types.

One key distinction is the one between objective and subjective metadata. Subjective metadata refers to classification, tagging, taxonomies, etc. This metadata is subjective because it is always possible to argue about it. Objective metadata on the other hand is uncontroversial and typically process-driven – a file format is what it is, the time the file was last saved might cause consternation after a PC crash, but is unarguable. However, there is actually surprisingly little uncontroversial metadata. Even something like a title can be edited and changed – what do you do when some content acquires a popular or folk title that is not the same as its official title? This happens a lot with comedy sketches and songs, but can also happen to names of projects, working groups, etc.

Parametadata (or meta-metadata) is another subset of metadata – it is the metadata about the metadata, giving its provenance, date of creation, technical specifications, etc. Once you start to think about metadata as content in its own right, it becomes obvious that just as you wish to track the author, title, and so on of the core content, so too you need to track the author(s), provenance, date of creation and latest update of the metadata as well. For subjective metadata, parametadata becomes hugely useful. Because you can have multiple classifications of an asset, it is very important to track the source – distinguishing between author added keywords, indexer keywords, and folksonomic tags, for example – so that people can tell where a tag has come from.
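One way to picture this is as a tag record that carries its own provenance. The field names and source labels below are illustrative assumptions rather than any standard schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Tag:
    value: str        # the metadata itself
    source: str       # parametadata: e.g. "author", "indexer", "folksonomy"
    created_by: str   # parametadata: who added the tag
    created_on: date  # parametadata: when it was added


def tags_from(tags, trusted_sources):
    """Keep only the tag values whose source the reader has chosen to trust."""
    return [t.value for t in tags if t.source in trusted_sources]


if __name__ == "__main__":
    tags = [
        Tag("Eiffel Tower", "indexer", "picture desk", date(2010, 4, 1)),
        Tag("paris!!", "folksonomy", "anonymous user", date(2010, 4, 20)),
    ]
    print(tags_from(tags, {"author", "indexer"}))
```

Because every tag records where it came from, the same asset can hold author keywords, indexer keywords, and folksonomic tags side by side, and the interface can decide which to show.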

As long as you know where tags have come from, you can decide whether or not you want to trust in their authority. In an increasingly muddled web, it is helpful to be told the source of a comment or an opinion in order to try to distinguish sound information from propaganda or uninformed speculation. Anecdotally, many people who were initially excited about citizen review sites – rating hotels, etc. – have now given up on them on the grounds that the people who contribute to them tend to have some kind of axe – or worse – to grind, so you can’t take them seriously. Even reviews that aim to be fair may not be relevant if the reviewer is too dissimilar to the reader. The perfect holiday for a group of teenagers is unlikely to be what a retired couple are looking for. So any review needs to carry sufficient information so that the reader can work out how relevant the content is to them. A good review site would carry a range of reviews aimed at different audiences.

Similarly, a rich navigation system needs to offer a range of tags and taxonomies, but these will only be useful when there is sufficient parametadata to tell the user where each scheme or tag came from, who created it, how up to date it is, etc. From a user perspective, being able to choose from a range of well-documented navigation systems means they can make an informed choice about whether to have fun with the randomness of folksonomic tags, to follow a specialist taxonomy in order to learn how a subject is handled by experts, or to use a guide constructed by the content curators for a general audience.

Interface designers can use the parametadata to make different sources of metadata distinct – with different visual or other cues, for example, to indicate different navigation environments. This means you can create a range of different “navigation worlds” and let your users wander to and fro while always making sure they know where – in terms of trust and authority – they are.

Dec 30 2009

FUMSI Folio on Taxonomies and Tagging

Published by Fran under KO

FUMSI Shop - FUMSI Folio on Taxonomies and Tagging looks like a useful resource (well, I’m biased - it has one of my articles in it!).

Nov 16 2009

Many to many

Published by Fran under KO

A wise taxonomist once said to me “taxonomies are technology agnostic” and I’ve been thinking about why systems are not taxonomy agnostic. If you underpin a taxonomy with a thesaurus, can you use that to map one taxonomy to another, without altering either taxonomy? You can keep both taxonomies as metadata attached to your asset and expose one or the other depending on user choice. It’s just an interface issue. The mapping would enable cross navigation, so you could wander down one taxonomy, skip to another, then pop back to the first one if you wanted.
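To make the idea concrete, here is a minimal sketch in which each taxonomy keeps its own labels but points at shared thesaurus concepts. The labels and concept IDs are invented for illustration, and it glosses over the terminology issues that a real mapping would have to resolve.

```python
# Hypothetical data: each taxonomy maps its own labels to shared thesaurus concept IDs.
TAXONOMY_A = {"Motor vehicles": "c1", "Bicycles": "c2"}
TAXONOMY_B = {"Cars and trucks": "c1", "Cycling": "c2", "Rail": "c3"}


def crosswalk(source, target):
    """Map each source label to the target labels that share its concept."""
    by_concept = {}
    for label, concept in target.items():
        by_concept.setdefault(concept, []).append(label)
    return {label: by_concept.get(concept, []) for label, concept in source.items()}


if __name__ == "__main__":
    print(crosswalk(TAXONOMY_A, TAXONOMY_B))
    print(crosswalk(TAXONOMY_B, TAXONOMY_A))
```

Neither taxonomy is altered. The crosswalk is computed from the shared concepts, so a user could step from a term in one scheme to its counterpart in the other and back again, and terms with no counterpart simply map to nothing.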

You could attach folksonomies too if you wanted to, and just store those as extra metadata.

I can see that there might be terminology issues that need resolving (no small task), or perhaps software or storage issues, but I can’t see why the system itself couldn’t work in theory.

I’ve spent a lot of time thinking about mediating stakeholder needs to get the best taxonomy, and that is still a valid approach when you need management and control, but I don’t see any reason not to attach other taxonomies to your core taxonomy. Those satellite taxonomies can then serve minority interests or specialised needs. As long as you collect metadata about your taxonomies and make it clear to your user the provenance of the taxonomy or folksonomy they are viewing, you can offer a range of viewpoints.

Perhaps I am missing something obvious, but it seems there is still debate about getting the best taxonomy, or choosing to implement one instead of another. That debate seems to be based on the presumption that you can only have one taxonomy at a time, but why not have lots?

Oct 09 2009

Still trying to please everyone

Published by Fran under KO

I was very flattered to be mentioned by Bob Bater in this KOnnect post: Trying to please everyone. I wanted to spend my research time on something that would be of practical interest to taxonomy professionals, while avoiding the danger of becoming too philosophical. As Bob has such extensive experience in taxonomy work, I am delighted that he found my project interesting.
