Perhaps I am starting to suffer from “déformation professionnelle”, but I am constantly surprised by how often I am still asked “Why do we need classification now we have free text search and Google?”. This post is designed to answer the question. If you are an info pro, it won’t tell you anything you don’t already know, but as always I’d appreciate suggestions and additions.
The question seems to me a bit like asking “Why do we need scalpels now we have invented scissors?”. Scissors are a brilliant invention and they do many wonderful things – just like Google – they make all sorts of cutting quick and easy, but there are also many situations when they are not the right tool for the job. I don’t want a surgeon cutting me open with scissors except in a real emergency.
Google is excellent when searching text for something specific and known – pdf of a tube map of London, “Ode to Autumn by John Keats”; documents that contain the phrase “small furry creatures from alpha centauri”. However, you may get poor results if you don’t spell all the words correctly (or they have not been spelled correctly in your source material) or you get the form of the words wrong (“The Tales of the Arabian Nights”; “The Tales of the Arabian Knights”; “1001 Arabian Nights”; “A Thousand and One Arabian Nights”; etc.). So in order to get good results, you already need to know quite a lot about what you are looking for.
Of course most people chuck in the first couple of words that occur to them and hope for the best. This works fine if you have plenty of time to wade through lots of irrelevant results, think up lots of alternative words if the first ones you tried didn’t work, are prepared to chase around to get to where you are trying to go (sometimes misspellings are linked to correct spellings), and are not particularly fussy about the source (if you just want a rough idea of what the main exports of Ecuador are to settle a pub bet, rather than the most up-to-date analysis to help you to decide whether or not to invest a large sum in a trading company). The sheer volume of information in Google means that almost every search throws up far more results than the casual searcher will need. They may not be the best results, but they’ll usually do.
Disambiguation
It gets messier when the words you are searching on refer to a number of different things (do you mean Titanic the ship, the film, the song, etc.; “budget” and “Spain” as in the Spanish economy, not budget holidays in Spain). This sort of search can produce thousands, if not millions, of irrelevant results, so classification that can provide disambiguation – sorting Spanish holiday pages from Spanish economy pages – has real value in terms of saved time. This is why enterprise search solutions – where employees’ wasted time is an expense to the company – offer classification as a fundamental aspect of the service. This is also why dictionaries and encyclopedias make clear the difference between Mercury the metal, the Roman god, the planet, etc., and between depression in economics, meteorology, geography, psychiatry, etc., and why Wikipedia’s disambiguation pages are so useful.
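To make the “budget” and “Spain” example concrete, here is a minimal sketch in Python. The three pages, their category labels, and the function names are all invented for illustration; it is not how any real search engine works, just the basic idea of filtering keyword matches by an assigned category.

```python
# Hypothetical mini-collection: each page has free text plus a category
# assigned by a human indexer or an automated classifier.
PAGES = [
    {"title": "Cheap flights to Malaga",
     "text": "Budget holidays in Spain from 99 pounds",
     "category": "travel"},
    {"title": "Madrid spending plans",
     "text": "Spain approves its budget for next year",
     "category": "economy"},
    {"title": "Costa Brava on a shoestring",
     "text": "See Spain on a small budget",
     "category": "travel"},
]


def keyword_search(pages, words):
    """Return pages whose text contains every query word (case-insensitive)."""
    wanted = [w.lower() for w in words]
    return [p for p in pages
            if all(w in p["text"].lower().split() for w in wanted)]


def classified_search(pages, words, category):
    """Same keyword match, then keep only pages filed under one category."""
    return [p for p in keyword_search(pages, words)
            if p["category"] == category]


all_hits = keyword_search(PAGES, ["budget", "Spain"])
economy_hits = classified_search(PAGES, ["budget", "Spain"], "economy")
```

In this toy collection the plain keyword search matches all three pages, while the classified search keeps only the one about the Spanish economy.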
Imperfect prior knowledge
Google is not very helpful when you don’t know the exact title or an exact phrase in a document (was it Birmingham City Council’s guide to recycling, Birmingham Council guide to waste and recycling, West Midlands waste management policy…?) and practically no help at all when you only have circumstantial information relating to a subject area (what’s that story where they are captured by aliens and only get let out when they build a cage and catch a little animal in it to prove they are intelligent too? are there any laws about importing pet parrots from France? what was that sad music I heard on the radio last night?).
It is a laborious process of elimination to try different sets of search terms in Google, but a classification narrows the scope of your search, making it more likely you will find what you need (short stories>science fiction immediately means you are not searching the whole of literature; a set of documents under the heading EU>laws>animals>pets means you don’t have to wade through all EU agricultural law; radio>date of broadcast>soundtracks means you are not trawling through all the recorded music available on the Internet).
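The narrowing effect of a heading such as short stories>science fiction can be sketched as a simple prefix match on a classification path. This is only an illustration: the items, their paths, and the helper name `under` are made up.

```python
# Hypothetical catalogue: each item is filed under a path of headings,
# from the broadest heading to the narrowest.
ITEMS = [
    {"title": "Story 1 (invented)", "path": ("short stories", "science fiction")},
    {"title": "Story 2 (invented)", "path": ("short stories", "romance")},
    {"title": "Novel 1 (invented)", "path": ("novels", "science fiction")},
]


def under(items, *prefix):
    """Return items whose classification path starts with the given headings."""
    return [i for i in items if i["path"][:len(prefix)] == prefix]


everything = under(ITEMS)
short_stories = under(ITEMS, "short stories")
sf_short_stories = under(ITEMS, "short stories", "science fiction")
```

Each extra heading shrinks the set you have to look through before you type a single search word, which is the whole point of browsing a classification.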
If you are researching an unfamiliar topic you probably don’t know the sort of words that are likely to have been used, so classifications are invaluable in showing you what other things are related to that topic, whether or not they use the only words or phrases you have previously encountered. Educational products have always used classification to aid knowledge discovery.
Aboutness
The words contained within the text may not give a full sense of what that text is about. If you are looking for a poem to read at a wedding, the best poems may never use the word “wedding” or “marriage” or even “love”. You’d be more likely to find a suitable poem using a classification poems>weddings. Synonym and thesaurus functions offer associated results as well as direct searching. Ontologies cluster vocabularies and taxonomies to create concept-based classifications.
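As a rough sketch of the wedding poem problem, compare a word match against the text with a lookup on an assigned subject heading. Both “poems” below are invented placeholder lines, and the subject headings are assumed to have been added by an indexer.

```python
# Hypothetical poems: the text is what free text search sees,
# the subjects are what an indexer decided the poem is about.
POEMS = [
    {"title": "Poem A (invented)",
     "text": "two rivers meet and run as one toward the sea",
     "subjects": ["poems>weddings"]},
    {"title": "Poem B (invented)",
     "text": "the wedding guests went home and the hall stood empty",
     "subjects": ["poems>loss"]},
]


def text_search(poems, word):
    """Return poems whose text contains the word (case-insensitive)."""
    return [p for p in poems if word.lower() in p["text"].lower().split()]


def by_subject(poems, heading):
    """Return poems an indexer has filed under the given subject heading."""
    return [p for p in poems if heading in p["subjects"]]


text_hits = text_search(POEMS, "wedding")
subject_hits = by_subject(POEMS, "poems>weddings")
```

Here the word search finds only Poem B, which mentions a wedding but is filed under loss, while the subject lookup finds Poem A, which never uses the word at all.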
Free text search on its own cannot provide the richness of suggestions that a classified system can offer. As far as I know, Google relies on source material to provide useful synonyms. (Incidentally I’ve found it remarkably tricky to find good references to how Google works via searching on Google…)
Complex queries
Google is also not helpful at answering complex queries (what is the fourth largest city in the EU by population? how many countries have majority Muslim populations?) that require combinations of sources. This is a gap spotted by “answer engines” such as True Knowledge and Wolfram Alpha, but both their systems depend on highly crafted classifications (taxonomies and ontologies). Google Squared is Google’s own version.
Comprehensiveness
Google is not a management system. Because of the vagaries described above, you can’t use Google to tell you how many documents you hold about a particular subject, or which document is the most authoritative or up to date, unless you have been very careful to add consistent metadata to each one. Even then, Google might miss the most up-to-date document because its PageRank is mainly based on popularity, and popularity takes time to cultivate, especially in niche areas. This is why digital asset management systems have metadata functions that provide controlled and filtered searching.
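With consistent metadata, the management questions above become simple lookups. The sketch below assumes every document carries a controlled `subject` value and an `updated` date; the records and field names are invented and do not represent any particular digital asset management product.

```python
from collections import Counter
from datetime import date

# Hypothetical document records with consistently applied metadata.
DOCS = [
    {"title": "Recycling guide 2008", "subject": "waste and recycling",
     "updated": date(2008, 3, 1)},
    {"title": "Recycling guide 2010", "subject": "waste and recycling",
     "updated": date(2010, 6, 15)},
    {"title": "Parking permits", "subject": "transport",
     "updated": date(2009, 1, 20)},
]


def count_by_subject(docs):
    """How many documents do we hold on each subject?"""
    return Counter(d["subject"] for d in docs)


def most_recent(docs, subject):
    """Return the latest document on a subject, or None if there is none."""
    matching = [d for d in docs if d["subject"] == subject]
    return max(matching, key=lambda d: d["updated"]) if matching else None


counts = count_by_subject(DOCS)
latest = most_recent(DOCS, "waste and recycling")
```

Neither answer depends on popularity or on which words happen to appear in the text, only on the metadata having been applied carefully.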
Sound and vision
Google is still a bit patchy at searching still images, video, and audio. Technologies are improving all the time, but we still have to be patient. Most of these searches still rely on text attached to images or captured from audio tracks, so all the problems already mentioned with free text searching apply. Companies such as imense are using an interesting range of options in generating keywords to tag images, but still use taxonomies for specialist terminology.
Summary
In short, Google is great when you know what you are looking for, when it’s not that important, and when you have plenty of time. In other words, for casual leisure searching. For any search that requires discovery and exploration, certainty, completeness, and precision, and when you want the right results quickly, you need classification.
The future of classification will be one of increasing automation, but that means the indexer or cataloguer’s job becomes more sophisticated and complex. Indexers of the future will be constructing rules for ontology and taxonomy building, training systems for specialised domains, and investigating errors in the automated systems. This may mark a change in the nature of traditional jobs, but it certainly does not mean the end of classification. Taxonomies have been around for millennia; they aren’t likely to disappear overnight.
The very fact that Google engineers are busily working on content analysis, language processing, and other new methods in order to increase the amount of classification Google can apply to its results (e.g. How can we improve our understanding of low level representations of images that goes beyond bag of words modeling?) shows that even the master of the free text search recognises more can be done.
Excellent summary of the strengths and weaknesses of Google and a rousing call to arms to all metadata and information management professionals to rush to the vanguard of the information revolution.
I think you have summarized this situation nicely. As a librarian who is looking for opportunities in digital asset management and metadata design, I’m glad to have come across your post. It has become commonplace to rely on search, but you articulate the shortcomings inherent in that model — structured information makes learning so much easier, since it acts as a road map to a subject area. One of the more valuable lessons I learned while completing my MLS was that in working with an information seeker, they often cannot put their query into meaningful words because they don’t yet know what they are seeking.
Regards,
Daniel
Great summary, but what do you envision the end result being? I think Google is sufficient for the majority of searches performed on it, and people looking for more specialized information will most likely have access to more specialized databases. The average searcher seems to be just barely within the comfort zone on Google, so do you think it’s possible for them to navigate a more complex classification system? Or will it be more of an Ask-a-Librarian situation, with a librarian behind every searcher?
Thank you for your kind words Kaylin!
I don’t think there is an “end result” or any one-size-fits-all solution, but that we have been encouraged to think there is by those software vendors who are the most anxious to hype their latest product (Google fans included!). Most marketing includes a hefty dose of oversimplifying and overstating. Combine this with companies looking for savings, and you end up with products bought in the belief that they can meet all your search needs, only to discover too late what they cannot do.
I think that the concept of the “average searcher” doesn’t help as it is misleading without putting it in context. If we are talking about the general public performing leisure searches, then Google is fabulous. However, I think there is a danger in underestimating people’s ability to navigate classifications. The main reason mass classification projects fell out of favour on the web was not because people couldn’t navigate them, but because they are labour intensive and hence slow and expensive to create, so web resources were being produced way faster than they could be classified. Most people actually feel comforted by following navigation links when they know they are “zooming in” on what they need. Automated classification is worth pursuing because it can provide at least some rough and ready classification cheaply. It is the cost of classifying well, not the usability of good classifications, that is the real problem. (People may mistake poor usability for poor classification, but that’s an issue for another day…)
However, if our “average searcher” is a knowledge worker, they may have access to specialised databases at the moment, but the threat to them is that their employers will say “oh you only need Google, let’s stop paying for our specialist resources and sack our librarians”. They may realise too late that this is a false economy, but as with all complex systems, it’s much easier to pull them apart than reconstruct them. Proving ROI on specialist research services is notoriously difficult and claims that “information is free” and “people will organise it for free” don’t help.
The term “average searcher” may cover a huge range of different people with different needs. Types of search differ, contexts differ, risks differ, and so what is worth paying for will differ. If your searchers are dealing with a high-cost, high-risk area such as pharmaceutical law, perhaps you do need enough of a library service to support almost every search. If your company sells a small range of novelty postcards, a handful of simple categories for your folder structure or website will probably suffice. In most cases, you will probably need a selection of tools and systems.
I believe that we need to increase information literacy throughout society generally, which means increasing public understanding of the complexities of search. It should be taught in schools as a fundamental part of education and as information professionals we should be helping people to understand that Google won’t solve everything.
No-one except the most cynical of bankers would argue that kids shouldn’t be taught about debt and interest, mortgages, and taxation because the “average person” finds these topics hard to understand. In whose interests is it to limit people’s skills and knowledge? It’s precisely because the digital world is getting more complex that we need more education, more discussion, and more thinking about search and retrieval than ever before.