Dec 14 2009
Perhaps I am starting to suffer from “déformation professionnelle”, but I am constantly surprised by how often I am still asked “Why do we need classification now that we have free text search and Google?”. This post is designed to answer the question. If you are an info pro, it won’t tell you anything you don’t already know, but as always I’d appreciate suggestions and additions.
The question seems to me a bit like asking “Why do we need scalpels now we have invented scissors?”. Scissors are a brilliant invention and they do many wonderful things - just like Google - they make all sorts of cutting quick and easy, but there are also many situations when they are not the right tool for the job. I don’t want a surgeon cutting me open with scissors except in a real emergency.
Google is excellent when searching text for something specific and known: a PDF of a tube map of London; “Ode to Autumn by John Keats”; documents that contain the phrase “small furry creatures from alpha centauri”. However, you may get poor results if you don’t spell all the words correctly (or they have not been spelled correctly in your source material) or you get the form of the words wrong (“The Tales of the Arabian Nights”; “The Tales of the Arabian Knights”; “1001 Arabian Nights”; “A Thousand and One Arabian Nights”; etc.). So in order to get good results, you already need to know quite a lot about what you are looking for.
Of course, most people chuck in the first couple of words that occur to them and hope for the best. This works fine if you have plenty of time to wade through lots of irrelevant results, can think up alternative words when the first ones you tried don’t work, are prepared to chase around to get to where you are trying to go (sometimes misspellings are linked to correct spellings), and are not particularly fussy about the source. That is enough if you just want a rough idea of what the main exports of Ecuador are to settle a pub bet, but not if you need the most up-to-date analysis to help you decide whether or not to invest a large sum in a trading company. The sheer volume of information in Google means that almost every search throws up far more results than the casual searcher will need. They may not be the best results, but they’ll usually do.
It gets messier when the words you are searching on refer to a number of different things (do you mean Titanic the ship, the film, the song, etc.? Do you mean “budget” and “Spain” as in the Spanish economy, or budget holidays in Spain?). This sort of search can produce thousands, if not millions, of irrelevant results, so classification that can provide disambiguation - sorting Spanish holiday pages from Spanish economy pages - has real value in terms of saved time. This is why enterprise search solutions - where employees’ wasted time is an expense to the company - offer classification as a fundamental aspect of the service. It is also why dictionaries and encyclopedias make clear the difference between Mercury the metal, the Roman god, and the planet, or between depression in economics, meteorology, geography, and psychiatry, and why Wikipedia’s disambiguation pages are so useful.
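To make the disambiguation point concrete, here is a minimal sketch in Python. The documents, titles, and category labels are all invented for illustration, and the matching is deliberately naive; it is not how any real search engine works.

```python
# Hypothetical mini-collection: each document carries its text plus one
# category assigned by a (human or automated) classifier.
DOCS = [
    {"title": "Cheap flights to Malaga",
     "text": "budget holidays in Spain", "category": "travel"},
    {"title": "Spanish government spending plans",
     "text": "Spain budget deficit", "category": "economics"},
    {"title": "Costa del Sol on a shoestring",
     "text": "Spain budget hotels", "category": "travel"},
]


def free_text_search(docs, terms):
    """Return documents whose text contains every search term (whole words)."""
    wanted = [t.lower() for t in terms]
    return [d for d in docs
            if all(t in d["text"].lower().split() for t in wanted)]


def classified_search(docs, terms, category):
    """Same text match, but restricted to one category."""
    return [d for d in free_text_search(docs, terms)
            if d["category"] == category]


# Free text alone returns all three documents; adding the category
# leaves only the economics page.
everything = free_text_search(DOCS, ["budget", "Spain"])
economy_only = classified_search(DOCS, ["budget", "Spain"], "economics")
```

The words “budget” and “Spain” appear in every document, so only the category separates the holiday pages from the economy page.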
Imperfect prior knowledge
Google is not very helpful when you don’t know the exact title or an exact phrase in a document (was it Birmingham City Council’s guide to recycling, Birmingham Council guide to waste and recycling, West Midlands waste management policy…?) and practically no help at all when you only have circumstantial information relating to a subject area (What’s that story where they are captured by aliens and only get let out when they build a cage and catch a little animal in it to prove they are intelligent too? Are there any laws about importing pet parrots from France? What was that sad music I heard on the radio last night?).
It is a laborious process of elimination to try different sets of search terms in Google, but a classification narrows the scope of your search, making it more likely you will find what you need (short stories>science fiction immediately means you are not searching the whole of literature; a set of documents under the heading EU>laws>animals>pets means you don’t have to wade through all EU agricultural law; radio>date of broadcast>soundtracks means you are not trawling through all the recorded music available on the Internet).
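The narrowing effect of a path such as EU>laws>animals>pets can be sketched as a simple prefix match. This is only an illustration: the collection below, its headings, and the `browse` helper are all made up.

```python
# Hypothetical collection: each item is filed under a path of headings,
# written here as a tuple, e.g. EU>laws>animals>pets.
COLLECTION = [
    (("EU", "laws", "animals", "pets"), "Rules on importing pet birds"),
    (("EU", "laws", "animals", "livestock"), "Cattle transport rules"),
    (("EU", "laws", "agriculture", "crops"), "Cereal subsidy rules"),
    (("short stories", "science fiction"), "A story about captured humans"),
]


def browse(collection, *path):
    """Return the titles filed at or below the given classification path."""
    return [title for node, title in collection
            if node[:len(path)] == path]


# Each extra level of the path shrinks the set you have to read through.
all_eu_law = browse(COLLECTION, "EU", "laws")                  # 3 titles
animal_law = browse(COLLECTION, "EU", "laws", "animals")       # 2 titles
pet_law = browse(COLLECTION, "EU", "laws", "animals", "pets")  # 1 title
```

Note that no search terms are involved at all: you reach the parrot rules by walking down the headings, without having to guess which words the document uses.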
If you are researching an unfamiliar topic you probably don’t know the sort of words that are likely to have been used, so classifications are invaluable in showing you what other things are related to that topic, whether or not they use the only words or phrases you have previously encountered. Educational products have always used classification to aid knowledge discovery.
The words contained within the text may not give a full sense of what that text is about. If you are looking for a poem to read at a wedding, the best poems may never use the word “wedding” or “marriage” or even “love”. You’d be more likely to find a suitable poem using a classification poems>weddings. Synonym and thesaurus functions offer associated results as well as direct searching. Ontologies cluster vocabularies and taxonomies to create concept-based classifications.
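The wedding poem problem can be shown with a small sketch. The poems, subject headings, and thesaurus entries here are invented, and the functions are hypothetical helpers rather than any real product’s API.

```python
# Hypothetical thesaurus: a search word expands to its related terms.
THESAURUS = {"wedding": {"wedding", "marriage", "nuptials"}}

# Hypothetical poems: the text of each, plus subject headings added by an indexer.
POEMS = [
    {"title": "Poem A",
     "text": "two roads joined beneath one sky",
     "subjects": {"weddings"}},
    {"title": "Poem B",
     "text": "the wedding guests had all gone home in tears",
     "subjects": {"loss"}},
]


def text_search(poems, word, thesaurus):
    """Match the word, or any thesaurus synonym, against the poem text."""
    terms = thesaurus.get(word, {word})
    return [p["title"] for p in poems
            if terms & set(p["text"].split())]


def subject_search(poems, subject):
    """Match against the indexer's subject headings instead of the text."""
    return [p["title"] for p in poems if subject in p["subjects"]]


by_text = text_search(POEMS, "wedding", THESAURUS)   # finds only Poem B
by_subject = subject_search(POEMS, "weddings")       # finds only Poem A
```

Even with synonym expansion, the text search finds the poem that merely mentions a wedding and misses the one that is suitable for reading at one; the subject heading gets it the right way round.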
Free text search on its own cannot provide the richness of suggestions that a classified system can offer. As far as I know, Google relies on source material to provide useful synonyms. (Incidentally I’ve found it remarkably tricky to find good references to how Google works via searching on Google…)
Google is also not helpful at answering complex queries (what is the fourth largest city in the EU by population? how many countries have majority Muslim populations?) that require combinations of sources. This is a gap spotted by “answer engines” such as True Knowledge and Wolfram Alpha, but both of their systems depend on highly crafted classifications (taxonomies and ontologies). Google Squared is Google’s own version.
Google is not a management system. Because of the vagaries described above, you can’t use Google to tell you how many documents you hold about a particular subject, or which document is the most authoritative or up to date, unless you have been very careful to add consistent metadata to each one. Even then, Google might miss the most up-to-date document because its PageRank is mainly based on popularity (as measured by incoming links), and popularity takes time to cultivate, especially in niche areas. This is why digital asset management systems have metadata functions that provide controlled and filtered searching.
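As a rough sketch of what consistent metadata buys you, the two management questions above become trivial once every record has a subject and a date. The records, field names, and dates below are invented for illustration.

```python
from datetime import date

# Hypothetical asset records with consistent metadata fields.
RECORDS = [
    {"title": "Recycling guide", "subject": "recycling",
     "updated": date(2007, 3, 1)},
    {"title": "Recycling guide (revised)", "subject": "recycling",
     "updated": date(2009, 6, 15)},
    {"title": "Parking policy", "subject": "transport",
     "updated": date(2008, 1, 10)},
]


def count_by_subject(records, subject):
    """How many documents do we hold on this subject?"""
    return sum(1 for r in records if r["subject"] == subject)


def most_recent(records, subject):
    """Which document on this subject was updated last?"""
    matching = [r for r in records if r["subject"] == subject]
    return max(matching, key=lambda r: r["updated"])


holdings = count_by_subject(RECORDS, "recycling")
latest = most_recent(RECORDS, "recycling")["title"]
```

Neither answer depends on how popular a document is or which words it happens to contain, only on the metadata having been applied consistently.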
Sound and vision
Google is still a bit patchy in still image, video, and audio search. Technologies are improving all the time, but we have to be patient. Most still rely on text attached to images or captured from audio tracks, so all the problems already mentioned with free text searching apply. Companies such as imense are using an interesting range of options in generating keywords to tag images, but still use taxonomies for specialist terminology.
In short, Google is great when you know what you are looking for, when it’s not that important, and when you have plenty of time. In other words, for casual leisure searching. For any search that requires discovery and exploration, certainty, completeness, and precision, and when you want the right results quickly, you need classification.
The future of classification will be one of increasing automation, but that means the indexer or cataloguer’s job becomes more sophisticated and complex. Indexers of the future will be constructing rules for ontology and taxonomy building, training systems for specialised domains, and investigating errors in the automated systems. This may mark a change in the nature of traditional jobs, but it certainly does not mean the end of classification. Taxonomies have been around for millennia; they aren’t likely to disappear overnight.
The very fact that Google engineers are busily working on content analysis, language processing, and other new methods in order to increase the amount of classification Google can apply to its results (e.g. “How can we improve our understanding of low level representations of images that goes beyond bag of words modeling?”) shows that even the master of the free text search recognises more can be done.