Sep 07 2008
Theme: Culture and Identity in Knowledge Organisation
Pre-conference workshop, August 5th:
Everything need not be miscellaneous: controlled vocabularies and classification in a web world
The pre-conference workshop was on SKOS – a “hot topic” for 2008.
After a welcome address by Clément Arsenault, Joan Mitchell, Editor-in-Chief of the Dewey Decimal Classification (DDC), gave the opening address, explaining that even in a web world the DDC remains a valuable tool for KO: it offers a large body of existing categorised content, existing mappings that give it interoperability with other systems, and pre-built relationships and well-defined structures, all curated and responsive to user feedback. She attempted to dispel some common misconceptions: firstly, the notion that you should choose one system and use it for all collections and circumstances – in fact, no one tool will be the best in every context, so interoperability in all tools is important; secondly, that the DDC is mainly a physical locator – a use that sells short the richness of the relationships that can be constructed within it; and thirdly, that it is “stuck in the past”, when in fact it is constantly monitored, updated, and modified, especially through associations with other KO systems.
Although librarians debate the relative merits of DDC, UDC, and faceted classifications such as Bliss, I was struck by the sense that the custodians of the large schemes like DDC feel “under attack”. I hope the address was “preaching to the choir”, with the aim of helping librarians argue their corner against outsiders, rather than responding to a loss of faith and sense of history within the community itself.
Marcia Lei Zeng (professor, School of Library and Information Science, Kent State University, USA) opened the workshop proper by discussing the problems encountered in attempting to transfer the Chinese Classified Thesaurus (CCT) into SKOS. A key issue was the loss of information as the thesaurus was converted to SKOS.
Specific problems Zeng raised were the assumption that there is a one-to-one translation of concepts across languages; differences in hierarchical understandings – in the CCT, Marxism is a higher-level category than philosophy, which in turn is higher than social science; and SKOS's inability to incorporate facets. The mapping in SKOS is apparently concept-based, but it all depends on choosing specific terms to represent those concepts.
It seems to me that in keeping SKOS “simple” you necessarily can’t do as much with it as you can with a more elaborate system. Loss of detail is inevitable. However, it also struck me that this is true of any index or indexing language, and is really the point of an index – to pick out some key features to act as signposts to the full text, rather than reproduce the full text in all its complexity. However, this leads to a problem – is SKOS meant to be an “index” to resources, or an alternative way of expressing the resources (like XML)? If it is just an index, it seems to me the “lossiness” is acceptable, but if it is supposed to be more than that – which the semantic web evangelists seem to suggest – the “lossiness” presents real problems. Someone remarked to me that unless we have a “tag police”, interoperability is only a dream, because the combination of the simplification required by “SKOSsy indexing” with the variations that different people will use, will render cross-linking by SKOS-index terms highly unreliable. For example, something tagged “artists, modern, British” may or may not contain the same information as something tagged “painters, 20th century, UK”. (I am not actually sure you can use multiple terms within SKOS labels at all.)
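The cross-linking worry can be made concrete with a toy sketch (plain Python, not real SKOS; the two tag sets are hypothetical): two descriptions that a human would judge near-equivalent share no terms at all, so naive term-level matching cannot link them without an explicit mapping layer.

```python
# Illustrative sketch only: two hypothetical tag sets describing (arguably)
# the same kind of resource.
tags_a = {"artists", "modern", "british"}
tags_b = {"painters", "20th century", "uk"}

def naive_overlap(a, b):
    """Jaccard overlap of terms -- the crude basis for term-level cross-linking."""
    return len(a & b) / len(a | b)

# Without a mapping declaring e.g. "british" ~ "uk", the overlap is zero,
# so index-term matching would treat the two resources as unrelated.
print(naive_overlap(tags_a, tags_b))  # 0.0
```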
Vanda Broughton explained to me that indexing languages tend to work in one direction only – you can turn full text into an indexing language quite effectively and capture the essence of its meaning very well, but it is not possible to turn the indexing language back into full text form. I took this to mean you can turn many into one, but once you’ve done that, you can’t go back from one to many (e.g. it is acceptable to index Bridget Riley and Lucian Freud as “artists, modern, British” but there is no logical way to derive those two names specifically from the phrase “artists, modern, British”).
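The one-way nature of indexing languages can be sketched in miniature (the index entries below are invented for illustration): the forward mapping is many-to-one, so "inverting" it yields only a set of candidates, never the original specifics.

```python
# Hypothetical toy index: many specific subjects map to one index term.
index = {
    "Bridget Riley": "artists, modern, British",
    "Lucian Freud": "artists, modern, British",
}

# Forward direction (specific subject -> index term) is a simple lookup...
term = index["Bridget Riley"]

# ...but the inverse is one-to-many: the index term alone cannot tell you
# which specific artist a document discussed.
inverse = {}
for name, t in index.items():
    inverse.setdefault(t, []).append(name)

print(inverse[term])  # a list of candidates, not a unique answer
```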
I am also worried that SKOS is based on the assumption that there are a finite number of concepts in the world and that we can devise a universal controlled vocabulary that will give us a term for everything. When challenged, the SKOS advocates at the workshop admitted that SKOS was created as a technical framework with little thought given to how it would be applied in practice, and that it was intended to work within limited and controlled subject domains, rather than linking the entire Internet.
Diane Vizine-Goetz (senior research scientist, OCLC research) and Michael Panzer (global taxonomy product manager, OCLC) talked about the web services for managing controlled vocabularies that they have been developing. They are attempting to make DDC accessible as linked data using uniform resource identifiers (URIs).
They explained that there is a need for software that can combine and manage existing controlled vocabularies and create new tools. SKOS and other systems facilitate information sharing, but only when there are software applications that make this possible. A number of issues arise from this work, including version control – how do you deliver updated versions of a resource to semantic web agents? – and how opaque you want your URIs to be.
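The URI-opacity question can be illustrated with two hypothetical patterns (the example.org URIs and the identifier scheme are invented, not OCLC's actual design):

```python
# Hypothetical URI patterns for a DDC class, to make the opacity trade-off concrete.
ddc_number = "641.5"  # cooking

# Transparent: the notation is human-readable in the URI, but the URI then
# breaks (or silently misleads) if the notation is ever revised.
transparent = f"http://example.org/ddc/class/{ddc_number}"

# Opaque: a stable surrogate identifier; the notation lives in the data the
# URI resolves to, so it can be versioned without changing the identifier.
opaque = "http://example.org/ddc/concept/c0012345"

print(transparent)
print(opaque)
```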
Eric Childress (consulting project manager, OCLC research) gave an overview of FAST (Faceted Application of Subject Terminology): a vocabulary to facilitate faceted browsing. Essentially, FAST is a simplified version of the Library of Congress Subject Headings (LCSH), designed to give quick and easy search and browsing facilities to collections.
This prompted some concern in the audience that the creation of a “cheap and easy” version of LCSH would lead to “dumbing down” with librarians being pushed to make do with anything that would save money.
Michèle Hudon (associate professor, EBSI, Université de Montréal) talked about practical issues and problems encountered in building a faceted classification structure for web-based libraries for the education field.
The day was closed by Joseph T. Tennis (assistant professor, iSchool, University of Washington).
There were over 50 papers presented at the conference in two tracks. I have only written about a few of them. The full conference proceedings are available from the publisher, Ergon-Verlag.
After welcoming remarks, the conference keynote was given by Jonathan Furner (associate professor at UCLA; slides available online). Entitled “Interrogating identity: a philosophical approach to an enduring issue in knowledge organisation”, the talk urged us to investigate ways of bringing semantic complexity into information systems and to consider methods of evaluating KO systems. Pointing out that KO systems need to link diverse communities with differing languages, Furner considered different levels of identity – cultural, ethnic, numerical, disciplinary, identicality – and issues such as identity theft, privacy, and identity over time (the ship of Theseus problem). He discussed the difference between numerical identity and qualitative indiscernibility, which in more basic terms concerns how we decide that one document is “the same” as another. He argued that identity is something constructed by a subjective agent and that it is a human right to choose one's own identity.
He pointed out that different people see reality in different ways and that every KOS is necessarily “biased”, reflecting the view of reality of its designer. This raises the question of whether a KOS can represent the views of everyone, or at least of all its users.
A communitarian view of social justice requires that KOSs support the distribution of cultural resources without violating the rights or liberties of any particular groups and communities.
KO has traditionally been evaluated according to effectiveness, efficiency, and usability. Effectiveness appears to be an objective measure, relying on the degree of correspondence between the model and reality, whereas user-oriented measurements consider the correspondence between reality as perceived by the designer and as perceived by the user, without having to worry about reality itself. In a user-oriented evaluation, the system doesn't have to describe what a document is really about; it only needs to identify the terms the user would use to search for that document.
He proposed a “bill of rights” for autonomous KOS users, which would include the right to use and expect others to understand one's own vocabulary, and to describe identities differently in different contexts without affecting retrieval effectiveness.
He ended by calling on information professionals to look for new ways of organising knowledge that make use of new kinds of relationship and a better understanding of how categories actually work (after Rosch and Lakoff), in order to meet the needs of future users.
Knowledge organisation in the cross-cultural and multicultural society
Ágnes Hajdu Barát (University of Szeged) has been working on a Hungarian thesaurus. She pointed out that a “nation” is not the same as a “culture”, but that KO systems need to specify cultural equivalences. Cross-cultural issues that need to be taken into account include different interpretations of colours, form, symbols, metaphors, and language use.
Mixed Translation Models for the Dewey Decimal Classification System (DDC)
Joan Mitchell (OCLC) talked about her work with Ingebjorg Rype (National Library of Norway) and Magdalena Svanberg (National Library of Sweden) on a mixed-translation version of the DDC. The mixed translation model was proposed on the grounds that it would be cheaper to incorporate local-language sections and aspects into a main body of the DDC in English, rather than producing a translation of the entire DDC or translating local-language sections into English. One concern was that the mixed-language system would be hard to use. When questioned about the mixed-translation version, Norwegian respondents generally liked having sections in their own language but found the overall mixed-language system confusing.
Another concern was that the local viewpoint would actually be lost in a mixed-translation. There are various “localisation” issues that affect the DDC, for example in Norway there is a wide range of school health services – such as school dental services – that are not covered by the DDC, presumably because there is no similar provision in other countries.
Discovery and access systems for websites and cultural heritage sites: reconsidering the practical application of facets
Kathryn La Barre (University of Illinois at Urbana-Champaign) discussed facet analysis, content analysis, and faceted ontologies. She observed that there is little evidence of facet analysis having been undertaken on most e-commerce sites – they appear simply to have used whatever existing database headings happened to be available. She contrasted this with the work being done by Flamenco on a faceted search for art objects in the museums and galleries of San Francisco, and with the Koha Zoom search tool. She also referred to the importance of tracing pathways through a collection and using facets as a steering device.
Language related problems in the construction of faceted terminologies and their automatic management by Vanda Broughton (UCL).
Moving from concept-based to word-based systems produces a number of problems. Vocabulary control is practically non-existent in a classification, because it is not a priority in the way a classification is traditionally used, but it becomes an issue once the classification is spun out into thesaurus format. There are problems with the treatment of very specific individual items which are not present in the classification but would be useful in a thesaurus, which leads to the temptation to add more and more items and related terms. A decision on limits has to be made in a thesaurus, as it cannot include everything in the entire world, whereas a classification does not have to confront this so directly.
The problems Vanda and her colleagues have been tackling in this project foreshadow a lot of the issues that appear to be cropping up in the practical applications of SKOS.
Knowledge Organisation Pro and Retrospective by Peter Ohly (GESIS, Social Science Information Centre, Germany).
Ohly talked about concepts as knowledge units and the differences between phylogenetic and ontogenetic content, as well as the difference between knowledge and a representation of knowledge. Considering the nature of knowledge, he proposed that it is more than instinct: it is something constructed. Creating a shared viewpoint involves an inherent conflict – the process of sharing inevitably sacrifices some of the individuality of the viewpoint. Compression is another compromise, necessary to prevent information overload. Information seekers are typically too modest and fail to ask for what they really need.
Knowledge and Trust in Epistemology and Social Software/Knowledge Technologies by Judith Simon (University of Vienna).
Knowledge itself changes as epistemic environments change and trust is essential in arbitrating knowledge. However, trust may let you down so you need to know when trust is warranted. Even something or someone who was trustworthy at one point may let you down in the future.
Scholarship is a collaborative process that depends on trust, so scientists need to be assessed epistemically as well as morally. Evidence is something that requires trust. The notion of trust has become very important on the Internet. Sites such as epinions.com give ratings to authors as a way of attempting to establish trust. Such systems suffer from “cold starts” – the system has to be populated in the first place – and they are vulnerable to manipulation and “attacks” by people attempting to influence the ratings of themselves or others.
There are various ways of rating on websites, but it is possible to detect underlying epistemological assumptions that are embedded in the software. This becomes clear when controversial users are considered. If half the readers trust the controversial user and half don’t, how do you assess the controversial user’s trustworthiness? If global metrics are used, the opinions are averaged out – so someone totally trusted by one half and totally distrusted by the other half would be rated as “averagely trustworthy”. If local metrics are used, information is provided about who is making the judgement. This means it is far easier to see that one group totally trusts someone and another group totally distrusts them, rather than everybody partially trusting them. This also applies to assessments of relevance – a philosopher might rate a paper by a biochemist as completely irrelevant, but to another biochemist the same paper could be extremely relevant. The global metrics view is a universalist position, assuming that the “truth” can be arrived at by a process of averaging out. The local metrics view is relativist and assumes that knowledge has to be situated within a particular context to be meaningful. This view is widespread in disciplines like cultural anthropology.
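The difference between global and local metrics is easy to show in miniature (the ratings below are invented to match the controversial-user case described above):

```python
from statistics import mean

# Hypothetical ratings of one controversial user, on a 0-1 trust scale:
# group A trusts them completely, group B distrusts them completely.
ratings = {"a1": 1.0, "a2": 1.0, "a3": 1.0, "b1": 0.0, "b2": 0.0, "b3": 0.0}
group_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B", "b3": "B"}

# Global metric: average everything -> "averagely trustworthy",
# a verdict that matches nobody's actual judgement.
global_trust = mean(ratings.values())

# Local metric: keep the judgements situated by rater group.
by_group = {}
for rater, score in ratings.items():
    by_group.setdefault(group_of[rater], []).append(score)
local_trust = {g: mean(scores) for g, scores in by_group.items()}

print(global_trust)  # 0.5
print(local_trust)   # {'A': 1.0, 'B': 0.0}
```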
Software developers are choosing one or other of these positions – probably without realising – and that viewpoint is then embedded in the software. Once embedded, it is very hard for users to recognise the underlying epistemological assumptions, let alone challenge them. As online ratings become more widespread, such issues of trust are going to become more important, so the fact that different trust metrics produce different results is a pressing issue.
Issues of dominance are also relevant. Kierkegaard asserted that the majority is always wrong, and Marx argued that the “trusted idea” in society is always that of the ruling class. Dominant voices in society or in a group situation – as in an online user community – will typically outweigh distinct voices when a universalist position is taken. This means that what is accepted as truth will simply reflect the view of the largest group. Including distinct voices – the minority – in assessments of truth therefore becomes an epistemological issue as well as an ethical one. The “fetishisation of the average”, as opposed to the situated view of “knowledge in context”, is happening through choices software developers make unawares.
The concept of trust changes through time. Currently there is a fashion for using statistics to try to explain social phenomena but new paradigms may emerge in the future.
4W vocabulary mapping across diverse reference genres by Michael Buckland (University of California, Berkeley).
There seems to be a gap in US research into digital reference collections. The 4Ws are “what, where, when, who”, and they present certain issues for automatic retrieval. Place is a cultural context, not a GPS reference (e.g. Prussia). People rarely talk about specific dates – they tend to refer to eras (e.g. World War II) – but retrieval systems tend to expect people to search for specific dates.
Facets should be separate and mutually exclusive (the 4Ws work well). These four facets tend to be included in reference works such as dictionaries and encyclopedia entries. Vertical mappings can be used to extend semantic links in vocabularies, and horizontal mappings can provide more context.
At the moment digital libraries tend to replicate the failings of the physical environment and are failing to exploit the potential offered by digital resources.
Indexes that exploit faceted structures would be an improvement.
Top-level categories on the structure of corporate websites by Abdus Sattar Chaudhry (Wee Kim Wee School of Communication and Information, Singapore).
In communicating knowledge management, placing too much emphasis on structures and standards is less helpful than building ontologies that can support information systems.
The information architecture of organisational sites was studied. The word “taxonomy” was deliberately avoided and the term “categories” used instead: many people do not recognise the word taxonomy, and there is much debate over its precise meaning.
Categories that were common across the sites studied were:
Contact us (Email; Locations; Offices etc.)
User groups (where the companies wanted to promote products to customers and initiatives to employees)
Amongst hardware, software, and automobile companies there were many categories, and the taxonomies and ontologies were very specific.
On consumer electronics and furniture sites there were few top-level categories and few customer support options, as support is less important for such sites than for computer software sites.
The first 7 websites that were examined had very similar categories, but when the sample was expanded to 28 sites, 54 new categories were found. These were mainly things like “Careers and Jobs” sections, corporate culture, finance, etc., and served a PR function for the companies as corporate entities, rather than simply selling through e-commerce.
“Contact us” could also be divided into contact us directly for sales, and contact us to access press rooms for the media.
Only one of the companies surveyed was making extensive use of metadata.
Taxonomy projects tend to be about anticipated benefits and can be highly pressured. When developing a corporate taxonomy, what matters most is that user interests are supported. Multiple approaches that can provide context-specific information are typically successful. The number of categories also needs to be considered – 600 categories is far too many for easy navigation.
Each different objective has its own costs and compromise is necessary. There are different focuses depending on the context – so, for example – local records management does not have to perform a PR function.
Systems need to be usable across organisations. Information systems are usually big investments, so a strategy of compromise based on initial projects is advisable.
There are differences between industries, but users want stability so that they can easily compare products.
There is always a trade-off between keeping a taxonomy flexible and adaptive and maintaining stability so that people can learn and remember how to use it effectively.
Testing the Assumptions of Folksonomies: Reality in del.icio.us by Nicolas George (Indiana University).
The power of folksonomies is that keywords are self-defining. The ambiguity of tags is seen as a strength. Vocabulary tends to stabilise around a common perspective. A study of certain tags relating to technical subjects was undertaken. It showed that the top tags tended to stabilise over time, but there was a huge variation from month to month. Short snapshots of vocabulary used in folksonomies tend to be unrepresentative of overall vocabulary usage. It was not possible to determine whether variations in frequency of a term represented a variation in terminology used to describe a concept or variation in the concepts that were being tagged.
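The snapshot problem can be sketched with invented monthly tag counts (these numbers are purely illustrative, not the study's data): each single month's top tag can differ from the aggregate ranking, so a short window misrepresents overall usage.

```python
from collections import Counter

# Hypothetical monthly tag counts for one pool of bookmarked resources.
months = [
    Counter({"ajax": 5, "javascript": 4, "web": 1}),
    Counter({"javascript": 6, "js": 3, "ajax": 1}),
    Counter({"web2.0": 4, "ajax": 3, "javascript": 2}),
]

# Aggregate over the whole period vs. per-month snapshots.
overall = sum(months, Counter())
top_overall = [tag for tag, _ in overall.most_common(2)]
top_by_month = [m.most_common(1)[0][0] for m in months]

print(top_overall)   # ['javascript', 'ajax']
print(top_by_month)  # ['ajax', 'javascript', 'web2.0'] -- no single month matches
```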
Noesis: Perception and Every Day Classification by Richard P. Smiraglia (Long Island University).
Knowledge Organisation systems require perceptual conformance. Tag clouds create a “bandwagon” effect but there is typically much variability around the core.
Rigid categories prevent cross-disciplinary communication.
An object becomes a “noeme” when we have perceived it. “Noesis” is the process of realising what an object is – what happens between perceiving the object and working out what it is. This is most obvious in an unfamiliar culture: for example, letter boxes in different countries look very different, so a tourist may see a letter box but, because it is an unfamiliar shape and colour, not recognise it immediately.
There is the possibility of error in perception and it is also influenced by the ego. According to semiotic theory, what we know is dynamic. What we know is therefore culturally moderated and affected by who we are and what we already know.
Deliberate Bias in Knowledge Organisation? by Birger Hjørland (Royal School of Library and Information Science, Denmark).
(The entire presentation is available online, thanks to the University of Arizona.)
Bias is usually viewed negatively, and people try to eliminate it in pursuit of the “noble dream” of objectivity. Novick asserts that objectivity has been conceptualised differently at different times in history.
The field of Knowledge Organisation suffers from a shortage of philosophical analysis but it should be based on arguments which include arguments about epistemology. Olson, Campbell, Furner, Andersen, and Feinberg have taken pragmatist stances.
“Knowledge Organisation has by tradition treated its bibliographic tools as more or less value free.” (Dahlström, 2006, p. 291)
A positivist stance emphasises being “unbiased” as a goal, whereas a pragmatist stance emphasises creating a rhetorical argument based on explicated values and goals. A pragmatic view looks at which questions are best served by which KO system. However, pragmatism tends to be imposed upon another approach, like a process of lamination.
Librarians and other knowledge organisers have a moral duty to reveal the hidden biases in KO systems. No system will be free of bias, so a pragmatic approach will assess the amount of bias. It is important to remember the distinction between pragmatism as a philosophical position and being pragmatic in the sense of doing something because it works.
Positivism stands in contrast to hermeneutics. The logical positivists began to impose a positivist approach on the humanities, but recently – post-Kuhn – a hermeneutic approach has started to be imposed on scientists.
Structure in Information Organisation by Joseph Tennis.
It is the discourses and work practices of a domain that determine the structure of its information organisation framework. It is important to begin with a robust theory. A lack of definition will result in a lack of structure.
Examples of specific KO systems include Connotea and Mary Daly’s Wickedary.
Knowledge Organisation as a Cultural Form by Jack Andersen (Royal School of Library and Information Science, Denmark).
Until very recently the dominant cultural form has been the narrative, but now the language of new media is starting to emerge. The database is now crucial in digital media both as a mode of media production and cultural expression. When new media is just a bunch of items, the database becomes the centre of the creative process.
Performativity of digital media is now increasingly relevant. Knowledge Organisation has become everyday practice in digital media. So KO in new media is an act of performance. Modern life is now audienced and performed. KO is media design, presentation, cultural expression, creative cultural and aesthetic practice. There is therefore an aesthetics of databases.
I also thoroughly enjoyed the ISKO banquet. In fact, the food throughout the conference was fantastic.
Panel session – Conceptual Models of Aboutness
Moderator and organiser: Marcia Lei Zeng (Kent State University)
Presenters: Maja Žumer (University of Ljubljana); Athena Salaba (Kent State University); Leda Bultrini (ARPA Lazio)
Reactors: Joseph T. Tennis (University of Washington); Hemalata Iyer (State University of New York at Albany); Kevin Wei Liu (Shanghai Library); Haiqing Lin (University of Auckland)
A notion of “aboutness” is needed in certain (all?) knowledge organisation systems – for example in the Functional Requirements for Subject Authority Records (FRSAR).
Users find divisions between concrete and abstract very difficult.
FRSAR introduced “thema” and “nomen” as terms to distinguish between a subject and the name by which it is known, partly because “aboutness” is not easily translated into many languages. Using a made-up term (thema) enabled FRSAR to avoid importing the local associations connected to particular translations of aboutness.
Does every work have to have a thema?
The Italian model looks at “units of thought” and is a positivist view of subject analysis. Its purpose is to determine the assignment of a subject.
“Caveat fractal” was an expression adopted as a warning that classification schemes can become ever more complex and intertwined on themselves.
Genre is not a subject.
Tags in the Code4Lib community were compared. The tags were divided into seven semantic categories. There seemed to be no evidence that people tag “socially” – they simply don't mind that other people can see their tags. Slides available online.
Tags used by enthusiasts of the Tektonik killer dance craze, mainly popular amongst teenagers in France, were studied. Dancers typically video themselves dancing, then upload the videos to video sharing sites such as YouTube for other people to comment on and judge. Differences between the tags used by the people uploading the videos and those used by visitors watching them were compared.
People uploaded videos accompanied by text commentary and tags. Visitors added their views, comments, and ratings. Tags applied by the uploader to their own video were described as “endo-tags” and those applied by others as “exo-tags”.
Uploaders were clearly aware of the advertising function of self-tagging and had a desire to be found, particularly within the Tektonik killer community. There was an average of 13 endo-tags per video, and these were typically very detailed. There was a clear awareness of the need to add variant spellings in order to be found, and mentioning famous or “cool” members of the community to gain reputation by association was also common.
In contrast, there were typically 2.5 exo-tags and they were generally short and casual.
The group exhibited a “reputation economy” where someone’s reputation within the community was considered a valuable commodity. Exo-taggers were typically tagging for themselves. There was no evidence that they tagged with a view to benefiting or educating the community. Exo-taggers rarely used variant spellings.
However, there was also evidence of efforts to “hide” by endo-taggers for example by using shibboleths and teenage slang. By using obscure terminology that only a few people would know, “cool” members of the community could form an “inner circle” that could not easily be accessed by outsiders.
Do Tags help users find things? by Margaret Kipp (Faculty of Information and Media Studies, University of Western Ontario)
Assumptions were that users would use similar terminology from one information retrieval attempt to the next. If convergence occurs, other users will benefit. If convergence is assumed, the more people that use a system, the better it will get.
Questions:
1. Do tags appear to enhance resource discovery?
2. How does the site with the tagging facility compare with the traditional site?
3. How useful are the tags?
The two sites compared were PubMed and CiteULike. Screen captures and semi-structured interviews were used, and trials were randomised to avoid learning effects.
It was found that five keywords were used frequently but there was a long tail of variants. Quite a lot of people used the tags as a kind of cross reference linking system.
Problems include false drops – a lot of noisy tags drowning the signal – and “tag spam”.
However it did appear that taggers used similar tags to both indexers and authors.
That was the last presentation I attended!
Some personal notes on objectivity (cf. Novick and Hjørland – DLIS archives). The dominant voice will always be the most usable. There should be a “benevolent dictatorship” of the librarian, unearthing minority viewpoints and presenting them regardless of whether this is considered “difficult” for the user – users' preferences may need to be overridden for their own good. Is this paradoxical?