Jul 03 2010

Procuring a Digital Asset Management system

Published by Fran under Digital Asset Management

This is the first of a series of summaries of the Henry Stewart DAM London conference on June 30, chaired by David Lipsey. The panels (one of which included me) were a pleasing mix of very practical information and more theoretical discussion.

Classic DAM vendor “overstatements”

Theresa Regli, who does a great job as a “professional sceptic”, stressed the need for a calm and considered approach to procurement, with testing as the most important stage. You wouldn’t buy a car without taking it for a test drive, but people buy software without finding out if it can handle their content. Nobody’s assets and business processes are exactly the same, and just because a system suited somebody else perfectly doesn’t mean it is right for you. Vendors will say that they can do anything, but that’s their job so don’t take their word for it. Don’t be distracted by the coolest of the cool new features or other bells and whistles. Cool costs - but may not make - money for your business. On the one hand, if the cool features don’t actually improve your specific business processes, they won’t benefit you, and on the other, vendors have become increasingly adept at marketing the same old features in new ways, so it is very important to dig beneath the surface to find out how they are doing what they claim. Surprisingly little has changed technologically in the DAM vendor landscape over the last five years. So, a wonderful new system for automatically indexing images directly may in fact just be the familiar territory of analysing textual metadata associated with images.

Speech to text

One area that has moved on is the technology to convert speech to text. This means that you can, to an extent, subtitle a film automatically (which isn’t quite the same as a system that can “watch a movie and understand what’s going on scene by scene”). This then gives you a chunk of textual metadata you can search and analyse (“understanding” what’s going on relies on sentiment analysis – looking up words in thesauruses, so, for example, if the dialogue mentions guns, shooting, and bullets a lot, the software could suggest it is a gunfight scene). However, accuracy rates are patchy and the systems require training, which could be labour intensive, so you need to make sure those training costs and the time required are included in budgets and schedules. The systems work best if you can get everything read by someone like Patrick Stewart, as he has very clear and even enunciation. Anyone with an unusual accent or who mumbles is far more difficult to process. As usual, the software is easiest to train if you are working within a specific context, so you can focus on relevant words and accents, rather than anything anyone anywhere in the world might happen to say.
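The thesaurus lookup described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular vendor does it: the tiny thesaurus, the scene labels, and the `min_hits` threshold are all invented for the example.

```python
import re
from collections import Counter

# Hypothetical mini-thesaurus: scene label -> related words (invented for illustration).
THESAURUS = {
    "gunfight": {"gun", "guns", "shoot", "shooting", "bullet", "bullets"},
    "romance": {"love", "kiss", "darling", "heart"},
}

def suggest_scene(transcript, min_hits=2):
    """Return the label whose related words occur most often, or None if too few hits."""
    words = re.findall(r"[a-z']+", transcript.lower())
    hits = Counter()
    for word in words:
        for label, related in THESAURUS.items():
            if word in related:
                hits[label] += 1
    if not hits:
        return None
    label, count = hits.most_common(1)[0]
    return label if count >= min_hits else None

print(suggest_scene("Drop the gun! He is shooting, the bullets are everywhere"))
```

The suggestion is only as good as the transcript and the word lists behind it, which is why the training effort mentioned above matters.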

A clever use of the technology is by the car industry to save time analysing focus group interviews. They asked interviewers to “audio index” their interviews by saying a key “trigger” word when somebody in the focus group said something interesting. The technology was set to clip out a section of video a few seconds before and after the trigger word, so the interviewers could then automatically generate “edited” versions of the interviews, saving a lot of time. I can see this being a great tool for anyone processing ethnographic data or conducting UX or similar testing based on interviews.
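As a rough sketch of the “audio index” idea, suppose the speech-to-text system hands back a time-ordered list of (seconds, word) pairs. That input format, the trigger word, and the five-second margins are assumptions made for this example rather than details from the talk.

```python
def clip_windows(timed_words, trigger, before=5.0, after=5.0):
    """Return (start, end) times in seconds around each utterance of the trigger word.

    timed_words is a list of (seconds, word) pairs in time order (an assumed
    format); windows that overlap are merged into one clip.
    """
    windows = []
    for seconds, word in timed_words:
        if word.lower() != trigger.lower():
            continue
        start, end = max(0.0, seconds - before), seconds + after
        if windows and start <= windows[-1][1]:
            # Overlaps the previous clip, so extend it instead of adding a new one.
            windows[-1] = (windows[-1][0], max(windows[-1][1], end))
        else:
            windows.append((start, end))
    return windows

words = [(2.0, "hello"), (3.0, "bookmark"), (10.0, "bookmark"), (60.0, "Bookmark")]
print(clip_windows(words, "bookmark"))
```

The resulting time ranges could then be passed to whatever video editing tool is in use to cut the “edited” version.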

Zooming in on the detail

Another feature Theresa demonstrated was a high definition zooming tool, so that you can see very fine detail in your digital images – lovely for museums and art galleries but costly in terms of storage space and bandwidth. I could see it working well as an in-gallery interactive guide to certain collections. It wouldn’t be so good if you were trying to access it externally from a dodgy wifi or bandwidth-limited connection.

(The British Museum’s Magnificent Maps collection – which I saw on a London IA visit – has an interesting interactive zoom feature that works entirely differently, but was very popular. It worked by using a “magnifying glass” – actually a device with some LED transmitters that send an infrared signal to a webcam to trigger a zoom response through a special display interface.)

Procurement process tips

The other panel members talked through various DAM system procurement processes, from a huge global project for Cambridge University Press that began with a list of 452 vendors, through to a very detailed process for adidas with a smaller initial list but a large number of criteria to be fulfilled. It was pleasing that the panel agreed that cultural fit can be as important as any technical specifications. A state of the art or very large vendor who just doesn’t get your world is very unlikely to provide you with a good solution, but a mid-range vendor who really understands your particular context is much more likely to find or develop something that matches your business processes.

Although the use of personas (popular in the UX world) in procurement is quite unusual, Theresa suggested that user stories could be more effective than requirements spreadsheets. Vendors are likely to tick all the boxes in the spreadsheet without getting to grips with the business processes behind them. It is also hard to explain complex interactions as sets of requirements, but telling a story can make it clear what the system as a whole should provide. For example: Sue has to research images for marketing campaigns and make sure that editors based in offices around the globe can see and approve them. Designers need to be able to access the images remotely, and the images then need to be output in a variety of formats for publication both in print and online.

It is also worth making sure that any arrangements with outsourced suppliers are checked. Sometimes vendors will provide case studies of a successful implementation but not mention that they have never worked with your supplier before.

I noted the emphasis panellists placed on making sure taxonomies and vocabularies are user-friendly and effective in order to get the best out of any DAM system.

Manage your metadata

Sarah Saunders of Electric Lane discussed the importance of controlled vocabularies and managed metadata for image search and management. Speech-to-text software can’t help with stills collections, or when part of your collection is video without accompanying audio (e.g. a rushes collection – the “spare” footage that wasn’t used in a broadcast and which often has no associated dialogue or voiceover script). She described advances in visual sorting software that use a combination of textual metadata and content-based image retrieval (CBIR) to refine search results. Although CBIR is still in its infancy, when running over a small image set pre-selected by text searching it can be very helpful. CBIR can identify basic features like the colour that is used the most in an image, not much help if you run it over a large image collection with no other metadata (i.e. “give me all the mainly red pictures” will bring up images of everything from fire engines to strawberries – fun if all you want is inspiration, not so good if you have something more specific in mind). However, if you have a set of images of the Eiffel Tower for example, it could distinguish between close-ups and shots with lots of blue sky. If you like the blue sky ones, you can click on one and ask for “more like this” and be offered other mainly blue sky ones.
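To make the “mainly blue sky” example concrete, here is a toy sketch of colour-based “more like this” ranking. It assumes each image is already available as a non-empty list of (r, g, b) pixel tuples and uses a very crude three-colour profile; real CBIR systems use far richer features, so treat the function names and the method as illustrative only.

```python
from collections import Counter

def colour_profile(pixels):
    """Return the share of pixels that are mainly red, green, or blue.

    pixels is a non-empty list of (r, g, b) tuples; a real system would
    read them from image files.
    """
    counts = Counter()
    for r, g, b in pixels:
        # Pick the strongest channel for this pixel.
        channel = max((r, "red"), (g, "green"), (b, "blue"))[1]
        counts[channel] += 1
    total = len(pixels)
    return {name: counts[name] / total for name in ("red", "green", "blue")}

def more_like_this(chosen, candidates):
    """Rank candidate image names by closeness of colour profile to the chosen image."""
    target = colour_profile(chosen)

    def distance(name):
        profile = colour_profile(candidates[name])
        return sum(abs(profile[c] - target[c]) for c in target)

    return sorted(candidates, key=distance)

# Hypothetical pre-selected set: Eiffel Tower images already found by text search.
sky_shot = [(10, 20, 200)] * 8 + [(120, 110, 100)] * 2
candidates = {
    "close_up": [(120, 110, 100)] * 9 + [(10, 20, 200)] * 1,
    "wide_sky": [(10, 20, 200)] * 7 + [(120, 110, 100)] * 3,
}
print(more_like_this(sky_shot, candidates))
```

Run over a small, text-filtered set like this, the ranking is useful; run over a whole untagged collection, it would simply return everything that happens to be blue.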

The second panel will be the subject of my next post.

Jun 15 2010

Are you a semantic romantic?

Published by Fran under semantic web

The “semantic web” is an expression that has been used for long enough now that I for one feel I ought to know what it means, but it is hard to know where to start when so much about it is presented in “techspeak”. I am trying to understand it all in my own non-technical terms, so this post is aimed at “semantic wannabes” rather than “semantic aficionados”. It suggests some ways of starting to think about the semantic web and linked open data without worrying about the technicalities.

At a very basic level, the semantic web is something that information professionals have been doing for years. We know about using common formats so that information can be exchanged electronically, from SGML to HTML and then XML. In the 90s, publishers used “field codes” to identify subject areas so that articles could be held in databases and re-used in multiple publications. In the library world, metadata standards like MARC and Dublin Core were devised to make it easier to share cataloguing data. The semantic web essentially just extends these principles.

So, why all the hype?

There is money to be made and lost on semantic web projects, and investors always want to try to predict the future so they can back winning horses. The recent Pew Report (thanks to Brendan for the link) shows the huge variety of opinions about what the semantic web will become.

At one extreme, the semantic evangelists are hoping that we can create a highly sophisticated system that can make sense of our content by itself, with the familiar arguments that this will free humans from mundane tasks so that we can do more interesting things, be better informed and connected, and build a better and more intelligent world. They describe systems that “know” that when you book a holiday you need to get from your house to the airport, that you must remember to reschedule an appointment you made for that week, and that you need to send off your passport tomorrow to renew it in time. This is helpful and can seem spookily clever, but is no more mysterious than making sure my holiday booking system is connected to my diary. There are all sorts of commercial applications of such “convenience data management” and lots of ethical implications about privacy and data security too, but we have had these debates many times in the past.

A more business-focused example might be that a search engine will “realise” that when you search for “orange” you mean the mobile phone company, because it “knows” you are a market analyst working in telecoms. It will then work out that documents that contain the words “orange” and “fruit” are unlikely to be what you are after, and so won’t return them in search results. You will also be able to construct more complex questions, for example to query databases containing information on tantalum deposits and compare them with information about civil conflicts, to advise you on whether the price of mobile phone manufacture is likely to increase over the next five years.

Again, this sort of thing can sound almost magical, but is basically just compiling and comparing data from different data sets. This is familiar ground. The key difference is that for semantically tagged datasets much of the processing can be automated, so data crunching exercises that were simply too time-consuming to be worthwhile in the past become possible. The evangelists can make the semantic web project sound overwhelmingly revolutionary and utopian, especially when people start talking in sci-fi sounding phrases like “extended cognition” and “distributed intelligence”, but essentially this is the familiar territory of structuring content, adding metadata, and connecting databases. We have made the cost-benefit arguments for good quality metadata and efficient metadata management many times.

At the other extreme, the semantic web detractors claim that there is no point bothering with standardised metadata, because it is too difficult politically and practically to get people to co-operate and use common standards. In terms familiar to information professionals, you can’t get enough people to add enough good quality metadata to make the system work. Clay Shirky in “Ontology is overrated” argued that there is no point in trying to get commonality up front: it is just too expensive (there are no “tag police” to tidy up), so you just have to let people tag randomly and then try to work out what they meant afterwards. This is a great way of harvesting cheap metadata, but doesn’t help if you need to be sure that you are getting a sensible answer to a question. It only takes one person to have mistagged something, and your dataset is polluted and your complex query will generate false results. Shirky himself declares that he is talking about the web as a whole, which is fun to think about, but how many of us (apart from Google) are actually engaged in trying to sort out the entire web? Most of us just want to sort out our own little corner.

I expect the semantic web to follow all other standardisation projects. There will always be a huge “non-semantic” web that will contain vast quantities of potentially useful information that can’t be accessed by semantic web systems, but that is no different from the situation today where there are huge amounts of content that can’t be found by search engines (the “invisible web” or “dark web”) – from proprietary databases to personal collections in unusual formats. No system has been able to include everything. No archive contains every jotting scrawled on a serviette, no bookshop stocks every photocopied fanzine, no telephone directory lists every phone number in existence. However, they contain enough to be useful for most people most of the time. No standard provides a perfect universal lingua franca, but common languages increase the number of people you can talk to easily. The adoption of XML is not universal, but for everyone who has “opted in” there are commercial benefits. Not everybody uses PDF files, but for many people they have saved hours of time previously spent converting and re-styling documents.

So, should I join in?

What you really need to ask is not “What is the future of the semantic web?” but “Is it worth my while joining in right now?”. How to answer that question depends on your particular context and circumstances. It is much easier to try to think about a project, product, or set of services that is relevant to you than to worry about what everyone else is doing. If you can build a product quickly and cheaply using what is available now, it doesn’t really matter whether the semantic web succeeds in its current form or gets superseded by something else later.

I have made a start by asking myself very basic questions like:

  • What sort of content/data do we have?
  • How much is there?
  • What format is it in at the moment?
  • What proportion of that would we like to share (is it all public domain, do we have some that is commercially sensitive, but some that isn’t, are there data protection or rights restrictions)?

If you have a lot of data in well-structured and open formats (e.g. XML), there is a good chance it will be fairly straightforward to link your own data sets to each other, and link your data to external data. If there are commercial and legal reasons why the data can’t be made public, it may still be worth using semantic web principles, but you might be limited to working with a small data set of your own that you can keep within a “walled garden” – whether or not this is a good idea is another story for another post.

A more creative approach is to ask questions like:

  • What content/data services are we seeking to provide?
  • Who are our key customers/consumers/clients and what could we offer them that we don’t offer now?
  • What new products or services would they like to see?
  • What other sources of information do they access (users usually have good suggestions for connections that wouldn’t occur to us)?

Some more concrete questions would be ones like:

  • What information could be presented on a map?
  • How can marketing data be connected to web usage statistics?
  • Where could we usefully add legacy content to new webpages?

It is also worth investigating what others are already providing:

  • What content/data out there is accessible? (e.g. recently released UK government data)
  • Could any of it work with our content/data?
  • Whose data would it be really interesting to have access to?
  • Who are we already working with who might be willing to share data (even if we aren’t sure yet what sort of joint products/projects we could devise)?

It’s not as scary as it seems

Don’t be put off by talk about RDF, OWL, and SPARQL, how to construct an ontology, and whether or not you need a triple store. The first questions to ask are familiar ones like who you would like to work with, what could you create if you could get your hands on their content, and what new creations might arise if you let them share yours? Once you can see the semantic web in terms of specific projects that make sense for your organisation, you can call on the technical teams to work out the details. What I have found is that the technical teams are desperate to get their hands on high quality structured content – our content – and are more than happy to sort out the practicalities. As content creators and custodians, we are the ones that understand our content and how it works, so we are the ones who ought to be seizing the initiative and starting to be imaginative about what we can create if we link our data.

A bit of further reading:
Linked Data.org
Linked Data is Blooming: Why You Should Care
What can Data.gov.uk do for me?

May 02 2010

The power of parametadata

First we had content, then not long after that we had metadata, although no-one called it that. Now we need parametadata – the metadata about metadata!

Neither metadata nor parametadata are anything new, but what is new is how central they have become to all sorts of business processes. People think there is something modern and techie about metadata, but ever since the first author signed their initials on a piece of work, or added a title, we have had metadata. Librarians are just one group who have been using metadata for centuries.

Thanks to technological advances, there is now a huge amount of processing that can be done with metadata, indeed that needs to be done if we are to have any idea what assets we have available. Metadata has become the active driver of numerous business processes. You couldn’t operate a computer without the metadata that tells you the name of a file, its location, when it was last saved, etc. and this sort of metadata is so ubiquitous that nobody tends to think about it too much. Now metadata is so pervasive, it is becoming increasingly important to talk about it and define different aspects and types.

One key distinction is the one between objective and subjective metadata. Subjective metadata refers to classification, tagging, taxonomies, etc. This metadata is subjective because it is always possible to argue about it. Objective metadata on the other hand is uncontroversial and typically process-driven – a file format is what it is, the time the file was last saved might cause consternation after a PC crash, but is unarguable. However, there is actually surprisingly little uncontroversial metadata. Even something like a title can be edited and changed – what do you do when some content acquires a popular or folk title that is not the same as its official title? This happens a lot with comedy sketches and songs, but can also happen to names of projects, working groups, etc.

Parametadata (or meta-metadata) is another subset of metadata – it is the metadata about the metadata, giving its provenance, date of creation, technical specifications, etc. Once you start to think about metadata as content in its own right, it becomes obvious that just as you wish to track the author, title, and so on of the core content, so too you need to track the author(s), provenance, date of creation and latest update of the metadata as well. For subjective metadata, parametadata becomes hugely useful. Because you can have multiple classifications of an asset, it is very important to track the source – distinguishing between author added keywords, indexer keywords, and folksonomic tags, for example – so that people can tell where a tag has come from.
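One way to picture this is a tag record that carries its own provenance. The sketch below is an assumption about how such a record might look, not a description of any existing system; the field names and source labels (“author”, “indexer”, “folksonomy”) are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Tag:
    """A keyword plus the parametadata needed to judge its authority."""
    term: str
    source: str      # hypothetical labels: "author", "indexer", "folksonomy"
    added_by: str
    added_on: date

def terms_from(tags, trusted_sources):
    """Return the distinct terms that come from sources the reader chooses to trust."""
    return sorted({tag.term for tag in tags if tag.source in trusted_sources})

tags = [
    Tag("meerkats", "author", "fran", date(2010, 4, 27)),
    Tag("mammals", "indexer", "cataloguing team", date(2010, 4, 28)),
    Tag("cute", "folksonomy", "anonymous visitor", date(2010, 5, 1)),
]
print(terms_from(tags, {"author", "indexer"}))
```

Because the source travels with every tag, the same collection can be filtered differently for different audiences without re-tagging anything.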

As long as you know where tags have come from, you can decide whether or not you want to trust in their authority. In an increasingly muddled web, it is helpful to be told the source of a comment or an opinion in order to try to distinguish sound information from propaganda or uninformed speculation. Anecdotally, many people who were initially excited about citizen review sites – rating hotels, etc. – have now given up on them on the grounds that the people who contribute to them tend to have some kind of axe – or worse – to grind, so you can’t take them seriously. Even reviews that aim to be fair may not be relevant if the reviewer is too dissimilar to the reader. The perfect holiday for a group of teenagers is unlikely to be what a retired couple are looking for. So any review needs to carry sufficient information so that the reader can work out how relevant the content is to them. A good review site would carry a range of reviews aimed at different audiences.

Similarly, a rich navigation system needs to offer a range of tags and taxonomies, but these will only be useful when there is sufficient parametadata to tell the user where each scheme or tag came from, who created it, how up to date it is, etc. From a user perspective, being able to choose from a range of well-documented navigation systems means they can make an informed choice about whether to have fun with the randomness of folksonomic tags, to follow a specialist taxonomy in order to learn how a subject is handled by experts, or to use a guide constructed by the content curators for a general audience.

Interface designers can use the parametadata to make different sources of metadata distinct – with different visual or other cues, for example, to indicate different navigation environments. This means you can create a range of different “navigation worlds” and let your users wander to and fro while always making sure they know where – in terms of trust and authority – they are.

Apr 27 2010

Web Science 2010

Published by Fran under semantic web

There have been lots of interesting presentations at Web Science 2010 in Raleigh. My metadata meerkats were popular - hard to beat charismatic megafauna. The papers and posters are online at The Journal of Web Science.

Apr 04 2010

Using taxonomies to support ontologies

Published by Fran under KO, information architecture

What is an ontology?
Ontologies are emerging from the techie background into the knowledge organisation foreground and - as usually happens - being touted as the new panacea to solve all problems from content management to curing headaches. As with any tool, there are circumstances where they work brilliantly and some where they aren’t right for the job.

Basically, an ontology is a knowledge model (like a taxonomy or a flow chart) that describes relationships between things. The main difference between ontologies and taxonomies is that taxonomies are restricted to broader and narrower relationships whereas ontologies can hold any kind of relationship you give them.

One way of thinking about this is to see taxonomies as vertical navigation and ontologies as horizontal. In practice, they usually work together. When you add cross references to a taxonomy, you are adding horizontal pathways and effectively specifying ontological rather than taxonomical relationships.

The flexibility in the type of relationship that can be defined is what gives ontologies their strength, but is also their weakness in that they are difficult to build well and can be time consuming to manage because there are infinite relationships you could specify and if you are not careful, you will specify ones that keep changing. Ontologies can answer far more questions than taxonomies, but if the questions you wish to ask can be answered by a taxonomy, you may find a taxonomy simpler and easier to handle.

What are the differences between taxonomies and ontologies?
A good rule of thumb is to think of taxonomies as being about narrowing down, refining, and zooming in on precise pieces of information and ontologies as being about broadening out, aggregating, and linking information. So, a typical combination of ontologies and taxonomies would be to use ontologies to aggregate content, with taxonomies overlaid to help people drill down through the mass of content you have pulled together.

Ontologies can also be used as links to join taxonomies together. So, if you have a taxonomy of regions, towns, and villages and a taxonomy of birds and their habitats you could use an ontological relationship of “lives in” to show which birds live in which places. By using a taxonomy to support the ontology, you don’t have to define a relationship between every village and the birds that live there, you can link the birds’ habitats to regions via the ontology and the taxonomy will do the work of including all the relevant villages under that region.
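The birds-and-places example can be sketched as follows. The place names, the bird habitats, and the function names are all made up for illustration; the point is only that the “lives in” link is stated once at region level and the taxonomy extends it to every village beneath that region.

```python
# Taxonomy of places: each place points to its broader place (hypothetical names).
BROADER = {
    "Marshby": "Fenland",
    "Reedwick": "Fenland",
    "Crag End": "Upland",
    "Fenland": "Country",
    "Upland": "Country",
}

# Ontological "lives in" relationship, stated once per region (illustrative data).
LIVES_IN = {"bittern": {"Fenland"}, "curlew": {"Upland"}}

def is_within(place, region):
    """Walk up the taxonomy from place to see whether it falls under region."""
    while place is not None:
        if place == region:
            return True
        place = BROADER.get(place)
    return False

def birds_in(place):
    """Return birds whose habitat region contains the given place."""
    return sorted(
        bird
        for bird, regions in LIVES_IN.items()
        if any(is_within(place, region) for region in regions)
    )

print(birds_in("Marshby"))
print(birds_in("Crag End"))
```

Adding a new village only requires one new entry in the taxonomy; no bird relationships need to be touched.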

Programmers love ontologies, because they can envisage a world where all sorts of relationships between pieces of content can be described and these diverse relationships can be used to produce lots of interesting collections of content that can’t easily be brought together otherwise. However, they leave it to other people to provide the content and metadata. Specifying all those relationships can be complicated and time-consuming so it is important to work out in advance what you want to link up and why. A good place to start is to choose a focal point of the network of relationships you need. For example, there are numerous ways you could gather content about films. You could focus on the actors so you can bring together the films they have appeared in to create content collections describing their careers, or focus on genres and release dates to create histories of stylistic developments, or you could link films that are adaptations of books to copies of those books. The choices you make determine the metadata you will need.

Know your metadata
At the moment, in practice, ontologies are typically built to string together pre-existing metadata that has been collected for navigational or archival taxonomies, but this is just because that metadata already exists to be harvested. There is a danger in this approach that you end up making connections just because you can, not because they are useful to anybody. As with all metadata-based automated systems, you also need to be careful with the “garbage in garbage out” problem. If the metadata you are harvesting was created for a different purpose, you need to make sure that you do not build false assumptions about its meaning or quality into your ontology - for example, if genre metadata has been created according to the department the commissioning editor worked for, instead of describing the content of the actual programme itself. That may not have been a problem when the genre metadata was used only by audience research to gather ratings information, but does not translate properly when you want to use it in an ontology for content-defining purposes.

Feeding your ontology with accurate and clearly defined taxonomies is likely to give you better results than using whatever metadata just happens to be lying about. Well-defined sets of provenance metadata – parametadata – about your taxonomies and ontologies are becoming more and more valuable so that you can understand what metadata sets were built for, when they were last updated, and who manages them.

Why choose one when you can have both?
Ontologies are very powerful. They perform different work to taxonomies, but ontologies and taxonomies can support and enhance each other. Don’t throw away your taxonomies just because you are moving into the ontology space. Ontologies can be (they aren’t always - see Steve’s comment below) big, tricky, and complicated, so use your taxonomies to support them.

Mar 14 2010

Taxonomy as an application for an open world

Published by Fran under KO, information architecture

This post is based on the notes I made for the talk I gave at the LIKE dinner on February 25th. It covers a lot of themes I have discussed elsewhere on this blog, but I hope it will be useful as an overview.

Taxonomies have been around for ages
Pretty much the oldest form of recorded human writing is the list. The Sumerian King List from ancient Sumer, for example, is about 4,000 years old. By the time of the ancient Greeks, taxonomies were familiar. We understand that something is a part of something else, and the notion of zooming in or narrowing down on the information we want is instinctive.
I am frequently frustrated by the limitations of free text search (see my earlier post Google is not perfect). The main limitation is to knowledge discovery - you can’t browse sensibly around a topic area and get any sense of overview of the field. Following link trails can be fun, but they leave out the obscure but important, the non-commercial, the unexpected.

The very brilliant Google staff are working on refining their algorithms all the time, but Google is a big commercial organisation and they are going to follow the money, which isn’t always where we need to be going. Other free text search issues include disambiguation/misspellings – so you need hefty synonym control, “aboutness” – you can’t find something with free text search if it doesn’t mention the word you’ve searched for, and audio-visual retrieval. The killer for heritage archives (and for highly regulated companies like pharmaceutical and law firms) is comprehensiveness – we don’t just want something on the subject, we want to know that we have retrieved everything on a particular subject.

Another myth is that search engines don’t use classification – they do, they use all sorts of classifications, it’s just that you don’t tend to notice them, partly because they are constantly being updated in response to user behaviour, giving the illusion that they don’t really exist. What is Google doing when it serves you up its best guesses, if not classifying the possible search results and serving you the categories it calculates are closest to what you want?

I’m a big fan of Google, it’s a true modern cathedral of intellectual power and I use Google all the time, but I seem to be unusual in that I don’t expect it to solve all my problems.
I am also aware that the fact that we can’t look at Google’s taxonomic processes arguably makes Google more political, more manipulable, and more big brother-ish than traditional open library classifications. We may not totally agree with the library classifications nor the viewpoints of their creators, but at least we know what those viewpoints are!

There was a lot of fuss about the rise of folksonomies and free tagging as being able to supersede traditional information management – and in an information overloaded world we need all the help we can get – the trouble is that folksonomies expand, coalesce, and collapse into taxonomies in the end. If they are to be effective – rather than just cheap – they need to do this – and either become self-policing or very frustrating. They are a great way of gathering information, but then you need to do something with it.

Folksonomies, just as much as taxonomies, represent a process of understanding what everyone else is talking about and negotiating some common ground. It may not be easy, but it is a necessary and indispensable part of human communication - not something we can simply outsource or computerise – algorithms just won’t do that for us. Once everything has been tagged with every term associated with every viewpoint, it might as well not have been tagged at all. Folksonomies, just as much as taxonomies, collapse into giving a single viewpoint – it’s just that it is a viewpoint that is some obscure algorithmic calculation of popularity.

So, despite free text search and folksonomies, structured classification remains a very powerful and necessary part of your information strategy.

It’s an open world
Any information system - whatever retrieval methods it offers - has to meet the needs of its users. Current users can be approached, surveyed, talked to, but how do you meet the needs of future users? The business environment is not a closed, knowable, constrained domain, but an “open world” where change is the only certainty. (“Open world” is an expression from logic. It presumes that you can never have complete knowledge of truth or falsity. It is the opposite of the closed world, which works for constrained domains or tasks where rules can be applied - e.g. rules within a database.)
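To make the open world / closed world distinction concrete, here is a minimal Python sketch. The facts and the function names are invented purely for illustration; the point is only that a closed-world lookup treats a missing fact as false, while an open-world lookup admits that it does not know.

```python
# Invented facts for illustration only.
KNOWN_TRUE = {("report-42", "is_about", "taxonomies")}
KNOWN_FALSE = {("report-42", "is_about", "cookery")}


def closed_world(fact):
    """Anything not recorded as true is treated as false."""
    return fact in KNOWN_TRUE


def open_world(fact):
    """Return True, False, or None when the answer is simply not known."""
    if fact in KNOWN_TRUE:
        return True
    if fact in KNOWN_FALSE:
        return False
    return None
```

Asked whether report-42 is about folksonomies, the closed-world function answers "no", whereas the open-world function answers "unknown", which is closer to the position of anyone planning for future users.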

So, how do you find the balance between stability, so your knowledge workers can learn and build on experience over time, while being able to react rapidly to changes?

Once upon a time, not much happened
The early library scientists such as Cutter, Kelley, Ranganathan, and Bliss, argued about which classification methods were the best, but they essentially presumed that it was possible to devise a system that maximised “user friendliness” and that once established, it would remain usable well into the future. By and large, that turned out to be the case, as it took many years for their assumptions about users to be seriously challenged.

Physical constraints tended to dictate the amount of updating that a system could handle. The time and labour required to re-mark books and update a card catalogue meant that it was worth making a huge effort to simply select or devise a classification and stick to it. It was easier to train staff to cope with the clunky technology of the time than adapt the technology to suit users. No doubt in the future, people will say exactly the same things about the clunky Internet and how awful it must have been to have to use keyboards to enter information.

So, it was sensible to plan your information project as one big chunk of upfront effort that would then be left largely alone. It is much easier to build systems based on the assumption that you can know everything in advance - you can have a simple linear project plan and fixed costs. However, it is very rare for this assumption to hold for very long, and the bigger the project, the messier it all gets.

Change now, change more
Everything is changing far more rapidly than it used to – from the development of new technologies to the rapid spread of ideas promoted by the emergence of social media and an “always on” culture. It’s harder than ever to stay cutting edge!

We all like to speak our own language and use our own names for things, and specialists and niche workers as well as fashionistas and trendsetters expect to be able to describe and discuss information in ways that make sense to them. The open philosophy of the Web 2.0 world means that they increasingly take this to be their right, but this is where folksonomic approaches can really help us.

What you need to do is to create a system that can include different pace layers so that you get the benefits of a stable taxonomy, with the rapid reactiveness of folksonomy as well as quick and easy free text search. You can think of your taxonomy as the centre of a coral reef, but coral is alive and grows following the currents and the behaviour of all the crazy fish and other organisms that dart about around it. It’s hard to pin down the crazy fish and other creatures, but they feed the central coral and keep it strong. In practice, this means incorporating multiple taxonomies and folksonomies and mapping them to one another, so that everyone can use the taxonomy and the terminology that they prefer. Taxonomy mapping tools require human training and human supervision, but they can lighten the load of the labour intensive process of mapping one taxonomy to another.
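As a rough illustration of mapping vocabularies to one another through a central taxonomy, here is a minimal Python sketch. The vocabulary names, the terms, and the `to_core` and `translate` helpers are all invented for the example, and a real mapping tool would hold far richer relationships than a simple lookup table.

```python
# Hypothetical crosswalk: each (vocabulary, term) pair points at a concept
# in the central taxonomy. All names here are invented for illustration.
CROSSWALK = {
    ("marketing", "snaps"): "Photographs",
    ("archive", "still images"): "Photographs",
    ("folksonomy", "pics"): "Photographs",
    ("archive", "moving images"): "Video",
    ("folksonomy", "clips"): "Video",
}


def to_core(vocabulary, term):
    """Return the central taxonomy concept for a local term, or None."""
    return CROSSWALK.get((vocabulary, term.lower()))


def translate(term, source, target):
    """List the terms in the target vocabulary that share a core concept."""
    concept = to_core(source, term)
    if concept is None:
        return []  # unmapped term: a candidate for human review
    return sorted(
        t for (vocab, t), c in CROSSWALK.items()
        if vocab == target and c == concept
    )
```

Because every local term is mapped to the centre rather than to every other vocabulary, adding a new vocabulary means adding one set of mappings, not one set per existing vocabulary.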

This means that taxonomy strategy does not have to be determined at a fixed point; instead, taxonomy creation becomes dynamic and organic. Folksonomies and new taxonomies can be harvested to feed back updates into the central taxonomy, breaking the traditional cycle of expensive major revision, gradual decline until the point of collapse, followed by subsequent expensive major revision…
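One simple way to picture harvesting a folksonomy is to count the free tags that users apply and put the popular ones that the central taxonomy lacks in front of an editor. The Python sketch below assumes a plain list of tags as input; the function name and the threshold of three uses are arbitrary choices made for the example.

```python
from collections import Counter


def candidate_terms(tag_events, taxonomy_terms, min_uses=3):
    """Return popular free tags that the central taxonomy does not yet cover.

    tag_events: iterable of tags as typed by users (hypothetical input).
    taxonomy_terms: terms already in the central taxonomy.
    min_uses: arbitrary threshold chosen for this example.
    """
    known = {t.lower() for t in taxonomy_terms}
    counts = Counter(tag.strip().lower() for tag in tag_events)
    return [
        (tag, n) for tag, n in counts.most_common()
        if n >= min_uses and tag not in known
    ]
```

The output is only a list of suggestions. Deciding whether a popular tag becomes a new term, a synonym of an existing term, or nothing at all remains an editorial judgement.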

There is a convergence between semi-automated mapping (we’ll be needing human editorial oversight for some time) and the semantic web project. This is the realisation of the “many perspectives, one repository” approach that should get round many problems of the subjective/objective divide. If you can’t agree on which viewpoint to adopt, why not have them all? Any arguments then become part of the mapping process – which is a bit of a fudge, but within organisations has the major benefit of removing a lot of politicking that surrounds information and knowledge management. It all becomes “something technical” to do with mapping that nobody other than information professionals is very interested in. Despite this, there is huge cultural potential when it comes to opening up public repositories and making them interoperate. The Europeana project is a good example.

Modern users demand that content is presented to them in a way that they feel comfortable with. The average search is a couple of words typed into Google, but they are willing to browse if they feel that they are closing in on what they want. To increase openness and usage means providing rich search and navigation experiences in a user-friendly way. If your repository is to be promoted to a wider audience in future, the classification that will enable the creation of a rich navigation experience needs to be put in place now.

Your users should be able to wander about through the archive collections horizontally and vertically and to leave and delve into other collections, or to arrive at and move through the archive using their own organisation’s taxonomy and to tag where they want to tag, using whatever terms they like. The link points in the mappings provide crossroads in the navigation system for the users.

In this way the taxonomies are leveraged to become “hypertextual taxonomies” that provide rich links both horizontally and vertically.

Taxonomy as a spine
A core taxonomy that acts as an indexing language is the central spine to which other taxonomies can be attached and, crucially, detached as necessary. The automation of the bulk of the mapping process means that incorporating a new taxonomic view becomes a task of checking the machine output for errors. Automated mapping processes can provide statistical calculations of likelihood of accuracy and so humans only need to examine those with a low likelihood of being correct.
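The idea of only examining the doubtful mappings can be sketched in a few lines of Python. This assumes the mapping tool returns a score between 0 and 1 for each proposed match; the tuple layout, the `triage` name and the 0.8 cut-off are assumptions made for the example rather than features of any particular product.

```python
def triage(mappings, threshold=0.8):
    """Split machine-proposed mappings into accepted and needs-review lists.

    mappings: list of (source_term, target_term, score) tuples, where score
    is the tool's own estimate of the likelihood that the match is correct.
    threshold: an arbitrary cut-off chosen for this example.
    """
    accepted, review = [], []
    for source, target, score in mappings:
        (accepted if score >= threshold else review).append((source, target, score))
    # Show the least certain proposals to the editor first.
    review.sort(key=lambda m: m[2])
    return accepted, review
```

Where the threshold sits is a workflow decision: the finer the mapping needs to be, the higher you would set it and the more proposals end up in the review pile.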

Mapping software has the same problems as autoclassification software, so a mapping methodology, including workflow and approval processes, has to be defined and supported. The more important it is to get a fine-grained mapping, the more effort you will need to make, but a broad level mapping is easier to achieve.

Conclusion
If you start thinking of the taxonomy as an organic system in its own right – more like an open application that you can interact with – bolting on and removing elements as you choose, you do not need to attempt to account for every user viewpoint in the creation of the taxonomy, and omitting a viewpoint at one stage does not preclude that viewpoint from being incorporated later. Conversely, the mapping process allows “outsiders” to view your assets through their own taxonomies.

Our taxonomies represent huge edifices of intellectual effort. However, we can’t preserve them in aspic – hide them away as locked silos or like grand stately homes that won’t open their doors to the public. If we want them to thrive and grow we need to open them up to the light to let them expand, change and interact with other taxonomies and take in ideas from the outside.

Once you open up your taxonomy, share it and map it to other taxonomies, it becomes stronger. Rather than an isolated knowledge system that seems like a drain on resources, it becomes an embedded part of the information infrastructure, powering interactions between multiple systems. It ceases to be a part of document management, and becomes the way that the organisation interacts with knowledge globally. This means that the taxonomy gains strength from its associations but also gains prestige.
So our taxonomies can remain our friends for a little while longer. We won’t be hand cataloguing as we did in the past because all the wonders of the Google and automated world can be harnessed to help us.

Feb 15 2010

More on mapping

Published by Fran under KO

When trying to integrate diverse vocabularies and repositories, the way to go is mapping – metadata crosswalks as they are known in the US. I’ve been looking for software that can handle mappings between taxonomies, of which there is a range on the market, but what is really exciting is the development of automated mapping tools to take much of the “heavy lifting” out of the work (for example Synaptica’s AutoMatch).

It seems to me that there is a convergence between semi-automated mapping (we’ll be needing human editorial oversight for some time) and the semantic web project. A combination of auto-mapping and RDF/OWL/SKOS should enable us to cross-navigate repositories using our own terminologies. This is the realisation of the “many perspectives, one repository” approach that should get round many problems of the subjective/objective divide. If you can’t agree on which viewpoint to adopt, why not have them all and save the arguments for the nuances of the mapping process? Within organisations this has immediate benefits in removing a lot of the politicking that surrounds information and knowledge management. However, there is also huge cultural potential when it comes to opening up public repositories and making them interoperate. The Europeana project is a good example.

Jan 31 2010

Re-intermediating research

Published by Fran under KO, libraries and museums, search

A fine example of how much inspiration you can get from randomly talking to the people who are actually engaging with customers was given to me by our Research Guide last week.

She wants a video-tagging tool that includes chat functionality, some kind of interactive “pointing” facility, and plenty of metadata fields for adding and describing tags. When she is helping a customer to find the perfect bit of footage, she often finds herself in quite detailed discussions trying to explain why she thinks a shot meets their needs or in trying to understand what it is they don’t like about a particular scene. If they could both view the same footage in real time linked by some sort of online meeting functionality, they would be able to show each other what they meant and discuss and explain requirements far more easily and precisely.

This struck me as exactly how we, as information professionals, should be seizing new technologies to “re-intermediate” ourselves into the search process. Discussing bits of video footage is a particularly rich example, but what if an expert information professional could have a look at your search results and give you guidance via a little instant chat window? You could call up a real person to help you when you needed it without leaving your desk, in just the same way that online tech support chats work (I’ve had mixed experiences with those, but the principle is sound). I’m thinking especially of corporate settings, but wouldn’t it be a fantastic service for public libraries to offer?

It seems such a good idea I can’t believe it’s not already being done and would be very pleased to hear from anyone out there who is offering those sorts of services and in particular if there are any tools that support real time remote discussion around audio visual research.

Jan 10 2010

‘We Like Lists Because We Don’t Want to Die’

Published by Fran under KO, culture

I heard Umberto Eco lecture on the search for a perfect language about 20 years ago and still find myself referencing him (trying to create a taxonomy that suits everyone would seem to be a similar quest). The lectures were nothing to do with my course really, so I benefited from that serendipitous knowledge discovery that just happens when you have time and space to explore ideas. So I was pleased when a few weeks ago this interview with Eco in Der Spiegel happened upon me in the twittersphere (what’s the protocol for referencing tweets?). In the interview, Eco asserts that ‘We Like Lists Because We Don’t Want to Die’.

It’s arguable that we do most things because we don’t want to die, but I was struck by the depiction of how fundamental the urge to collect and classify is to culture. At the LIKE dinner in early December, Cerys Hearsy said “we like hierarchies. We understand how they work” and she was talking about modern records management. Jan Wyllie in Taxonomies: Frameworks for Corporate Knowledge points out that taxonomies have been used for millennia (something I also reference frequently). Perhaps we like dualities because our brain has two hemispheres and we dream of a taxonomy of everything because then we would have conquered infinity and death itself, but such ideas are way beyond what I can speculate sensibly about. What I can say is that lists and taxonomies have been useful for so long that anyone who bets they are going to vanish anytime soon is facing very long odds. We will create them differently as technology advances, and we will manage without them in many situations where they would be helpful (if New Scientist had a taxonomy, I might have found the article about duality and the brain), but when we really need to be sure, we will create them.

Dec 30 2009

FUMSI Folio on Taxonomies and Tagging

Published by Fran under KO

FUMSI Shop - FUMSI Folio on Taxonomies and Tagging looks like a useful resource (well, I’m biased - it has one of my articles in it!).
