Oct 09 2011

More than a schedule, give me an index

Published by Fran under KO, archives, semantic web

People have started to talk about the death of the schedule. Often this comes with complaints that broadcasters are ill-prepared for this inevitability, while schedulers complain that no-one appreciates their skills in placing programmes appropriately and in context. One example is “hammocking” – making sure that viewers receive a “varied diet” across an evening, perhaps placing the news between two lighthearted pop culture programmes.

Meanwhile, the anti-schedulists point out that given the choice, some people will download and watch an entire series in one marathon session (people have “Torchwood weekends”), so that they don’t have to commit to being in front of the TV at 9pm every Thursday, or will watch a film broken down into 20-minute sections on their mobile phone while commuting. Schedulism and anti-schedulism can seem like a major culture clash, but the conflict is easily resolved when you think purely in terms of knowledge organisation.

A schedule is just metadata

A schedule is merely a set of metadata about programmes. It used to be the most important set of metadata for most people (along with the programme title!) as it was the key to not missing the programme. Now that we have catchup services and archives, knowing exactly when a programme will be broadcast or was broadcast may be less significant for finding that programme again, leading some people to claim that schedules are no longer needed. However, there are plenty of people who don’t want to look for specific programmes but want to sit down and be entertained for the evening. For them, schedules remain vital as they outline what is available. Scheduling in this sense is editorial selection, with all the craftsmanship and judgement that implies.

People are fascinated to know what was broadcast on the day they were born, and which programmes went out together, and schedules offer all sorts of socio-political and cultural information, giving snapshots of what were popular topics or contentious issues over time.

Schedule data is less significant in a vast online digital archive, but it is still useful. For example, you might want to find an episode you missed in a long-running series. You probably won’t know that it was episode 12 of 26, but you might remember that the reason you missed it was because you were out celebrating a friend’s birthday, which is a date you know. This may be a lot quicker than reading through the episode descriptions, which are usually too vague to be helpful, as the writers don’t want to give away “spoilers”, such as the final cliffhanger, which is often the part of the episode you remember the best. The programme descriptions are intended to entice you to watch the programme, not help you work out whether or not you have already seen it.
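As a rough illustration of the birthday example, here is a minimal Python sketch of looking up an episode by broadcast date rather than by episode number. The series title, episode numbers, and dates are invented for illustration; a real archive would hold this in its own catalogue format.

```python
from datetime import date

# Hypothetical schedule records: (series title, episode number, broadcast date).
SCHEDULE = [
    ("Example Drama", 11, date(2011, 9, 15)),
    ("Example Drama", 12, date(2011, 9, 22)),
    ("Example Drama", 13, date(2011, 9, 29)),
]


def episodes_on(series, day, schedule=SCHEDULE):
    """Return the episode numbers of `series` that were broadcast on `day`."""
    return [ep for title, ep, shown in schedule if title == series and shown == day]


# The viewer remembers the date of the birthday party, not the episode number.
missed = episodes_on("Example Drama", date(2011, 9, 22))
print(missed)
```

The point is only that the date acts as a reliable key when the episode description is too vague to help.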

Don’t ditch the schedule, add to it

What is important to bear in mind is that digital archives can offer schedule data almost effortlessly, but can offer many more metadata streams as well. These metadata streams are in many ways innovative and can lead to fascinating new ways of grouping programmes and promoting content. Rich subject metadata (such as a subject index) becomes an engine by which you can drive all sorts of automatically created content channels. You can group programmes by theme or topic as well as series and genre. So you don’t have to rely on when something was shown; you can use an index to gather together all programmes about fishing, or harpsichords, or the miners’ strike – bringing together documentaries (Heart of the Matter, Panorama), news and current affairs (also Question Time, Newsnight, even The Money Programme), as well as plays (The Price of Coal), or even comedies (The Comic Strip Presents... The Strike).

Such subject metadata also gives you extra contextual information. For example, in the case of the miners’ strike, it shows you that there were miners’ strikes in 1921, 1926, 1955, 1972, 1974, 1981, as well as in 1984, and that miners around the world have gone on strike at various times. This historic perspective is hard to pick out from schedule data. (Even if you could see programmes about miners’ strikes had been broadcast in these years, you would have to do further research to find out if they were covering contemporary events.) If the programmes have such metadata attached, anyone – any user of the archive – can effectively build rich personalised channels on their favourite topics or themes, and share those with others who have similar interests.
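To make the idea of an index-driven channel concrete, here is a small Python sketch. It assumes each programme record carries a set of subject terms; the records, genre labels, and subject assignments below are invented for illustration and are not taken from any real catalogue.

```python
from collections import defaultdict

# Hypothetical catalogue records; the subject terms are assumptions for illustration.
PROGRAMMES = [
    {"title": "Panorama", "genre": "documentary", "subjects": {"miners' strike"}},
    {"title": "The Price of Coal", "genre": "play",
     "subjects": {"miners' strike", "coal mining"}},
    {"title": "Example Angling Show", "genre": "factual", "subjects": {"fishing"}},
]


def build_subject_index(programmes):
    """Map each subject term to the titles indexed under it, in catalogue order."""
    index = defaultdict(list)
    for programme in programmes:
        for subject in sorted(programme["subjects"]):
            index[subject].append(programme["title"])
    return dict(index)


def channel(index, subject):
    """Return the automatically assembled 'channel' for one subject term."""
    return index.get(subject, [])


index = build_subject_index(PROGRAMMES)
strike_channel = channel(index, "miners' strike")
print(strike_channel)
```

Because the grouping is driven by the subject term rather than the transmission date or genre, a documentary and a play end up in the same channel.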

Metadata advertises content

If the metadata is in a Linked and Open format, the associative trails can wander beyond your collection to others, reaching new audiences, perhaps via social networks. This releases the “long tail” of content that is otherwise hard to find and re-use, as well as putting popular content into context. Making your metadata available more widely means more people will have more and more routes in to exploring your archive, even if you choose to restrict this to in-house teams or paying subscribers.

Either way, if you want to sell individual programmes or parts of programmes, knowing not just when you transmitted them but knowing exactly what they are about - via the rich semantic metadata you have added - offers a very useful sales and marketing tool.


Mar 06 2011

UK Archives Discovery Forum

Published by Fran under KO, archives, cataloguing, search, semantic web

I very much enjoyed the UKAD UK Archives Discovery Forum event at the National Archives. There were three tracks as well as plenary sessions, so I couldn’t attend everything.

Linked Data and archives

After an introduction from Oliver Morley, John Sheridan opened by talking about the National Archives and Linked Data. Although not as detailed as the talk he gave at the Online Information Conference last December, he still gave the rallying call for opening up data and spoke of a “new breed” of IT professionals who put the emphasis on the I rather than the T. He spoke about Henry Maudslay who invented the screw-cutting lathe, which enabled standardisation of nuts and bolts. This basically enabled the industrial revolution to happen. Previously, all nuts and bolts were made individually as matching pairs, but because the process was manual, each pair was unique and not interchangeable. If you lost the bolt, you needed a new pair. This created huge amounts of management and cataloguing of individual pairs, especially if a machine had to be taken apart and re-assembled, and meant interoperability of machinery was almost impossible. Sheridan asserted that we are at that stage with data – all our data ought to fit together but at the moment, all the nuts and bolts have to be hand crafted. Linked Data is a way of standardising so that we can make our data interchangeable with other people’s. (I like the analogy because it makes clear the importance of interoperability, but obviously getting the nuts and bolts to fit is only a very small part of what makes a successful machine, let alone a whole factory or production line. Similarly Linked Data isn’t going to solve broad publishing or creative and design problems, but it makes those big problems easy to work on collaboratively.)

Richard Wallis from Talis spoke about Linked Data. He likes to joke that you haven’t been to a Linked Data presentation unless you’ve seen the Linked Open Data cloud diagram. My version is that you haven’t been to a Linked Data event unless at least one of the presenters was from Talis! Always an engaging speaker, his descriptions of compartmentalisation of content and distinctions between Linked Data, Open Data, and Linked Open Data were very helpful. He likes to predict evangelically that the effects of linking data will be more profound for the way we do business than the changes brought about by the web itself. When I chatted to him over tea, he said his impression is that a year ago people were curious about Linked Data and just wanted to find out what it could do, but this year they are feeling a bit more comfortable with the concepts and are starting to ask about how they can put them into practice. There certainly seemed to be a lot of enthusiasm in the archive sector, which is generally cash-strapped, but highly co-operative, with a lot of people passionate about their collections and their data and eager to reach as wide an audience as possible.

A Vision of Britain

Humphrey Southall introduced us to A Vision of Britain, which is a well-curated online gazetteer of Britain, with neat functions for providing alternative spellings of placenames, and ways of tackling the problems of boundaries, especially of administrative divisions, that move over time. I’m fascinated by maps, and they have built in some interesting historical map functionality too.

JISC and federated history archives

David Flanders from JISC talked about how JISC and its Resource Discovery Task Force can provide help and support to educational collections especially in federation and Linked Data projects. He called on archives managers to use hard times to skill up, so that when more money becomes available staff are full of knowledge, skills, and ideas and ready to act. He also pointed out how much can be done in the Linked Data arena with very little investment in technology.

I really enjoyed Mike Pidd’s talk about the JISC-funded Connected Histories Project. They have adopted a very pragmatic approach to bringing together various archives and superimposing a federated search system based on metadata rationalisation. Although all they are attempting in terms of search and browse functionality is a simple set of concept extractions to pick out people, places, and dates, they are having numerous quality control issues even with those. However, getting all the data into a single format is a good start. I was impressed that one of their data sets took 27 days to process and they still take delivery of data on drives through the post. They found this was much easier to manage than FTP or other electronic transfer, just because of the terabyte volumes involved (something that many people tend to forget when scaling up from little pilot projects to bulk processes). Mike cautioned against using RDF and MySQL as processing formats. They found that MySQL couldn’t handle the volumes, and RDF they found too “verbose”. They chose to use a fully Lucene solution, which enabled them to bolt in new indexes, rather than reprocess whole data sets when they wanted to make changes. They can still publish out to RDF.

Historypin

Nick Stanhope enchanted the audience with Historypin, an offering from wearewhatwedo.org. Historypin allows people to upload old photos, and soon also audio and video, and set them in Google Street View. Although Flickr has some similar functions, Historypin has volunteers who help to place the image in exactly the right place, and Google have been offering support and are working on image recognition techniques to help place photos precisely. This allows rich historical street views to be built up. What impressed me most, however, was that Nick made the distinction between subjective and objective metadata, defining objective metadata as metadata that can be corrected and subjective metadata as metadata that can’t. So, he sees objective metadata as the time and the place that a photo was taken - if it is wrong someone might know better and be able to correct it - and subjective metadata as the stories, comments, and opinions that people have about the content, which others cannot correct - if you upload a story or a memory, no-one else can tell you that it is wrong. We could split hairs over this definition, but the point is apposite when it comes to provenance tracking. He also made the astute observation that people very often note the location that a photo is “of”, but it is far more unusual for them to note where it was taken “from”. However, where it was taken from is often more use for augmented reality and other applications that try to create virtual models or images of the world. Speaking to him afterwards, I asked about parametadata, provenance tracking, etc. and he said these are important issues they are striving to work through.

Women’s history

Theresa Doherty from the Women’s Library ended the day with a call to stay enthusiastic and committed despite the recession. She pointed out that it is an achievement that archives are still running despite the cuts, and that this shows how valued data and archives are in the national infrastructure and how important recording our history is. While archivists continue to value their collections, enjoy their visitors and users, and want their data to reach a wider audience, the sector will continue to progress. She described how federating the Genesis project within the Archives Hub had boosted use of their collections, but pointed out that funders of archives need to recognise that online usage of collections is just as valid as getting people to physically turn up. At the moment funding is typically allocated on visitor numbers through the doors, which puts too much emphasis on trying to drag people in off the street at the expense of trying to reach a potentially vast global audience online.


Nov 02 2009

World Audio Visual Archives Heritage day

Published by Fran under culture

I went to an interesting event last Monday night for UNESCO World Audio Visual Archives Heritage day, held at BAFTA in London.

Professor John Ellis (Department of Media Arts, Royal Holloway, University of London) talked about the growing use of TV archives, particularly news footage, in academia, pointing out that over time such material becomes increasingly valuable in such diverse areas as physiology - for example in studying the effects of ageing by analysing footage of presenters and actors who have had long careers, and town planning, as footage can reveal the buildings that previously occupied a site being considered for redevelopment.

As UK law permits academic institutions to record and keep TV and radio broadcasts for purely educational purposes, a database of material has been collected. Academia remains currently a verbal rather than visual culture, but this seems to be changing. All politicians, for example, are now so TV literate that to study them without reference to their TV appearances would be strange.

Fiona Maxwell (Director of Operations at ITV Global Entertainment), then talked about the painstaking restoration of the 1948 film The Red Shoes. She provided lots of technical details about removing mould and correcting registration errors, but also showed “before and after” clips so we could see the huge improvements.


Oct 05 2008

Web archiving

Published by Fran under culture, libraries and museums

I went to an excellent Anglo-French scientific discussion seminar on web archiving on Friday at the Institut Français Cultural Centre in London. The speakers were Gildas Illien of the Bibliothèque Nationale de France (BnF) (Paris) and Dr Stephen Bury of the British Library (BL).

Gildas Illien described the web archiving project being undertaken by the BnF, using the Heritrix open source crawler to harvest the web from “seeds” (URLs). The crawler was charmingly illustrated with a picture of a “robot” (as people like to be able to see the “robot”), but the “robot” is a bit stupid - he sometimes misses things out and sometimes falls into traps and collects the same thing over and over again. The “robot” generates a lot of code for the librarians to assess. Problems include the short lifespan of websites - one figure puts this as only 44 days (although whether that refers to sites disappearing altogether or just changing through updates wasn’t clear) - and the “twilight zone” of what is public and what is private. In France the Legal Deposit Act was extended in 2006 to cover the web, so the BnF can collect any French website it wants to without having to ask permission. However, librarians have to choose whether to try to collect everything or just sites that are noteworthy in some way. It is also hard to guess who the future users of the archive will be and what sort of site they will want to access.
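To show what harvesting from seeds means, and why a crawler can end up collecting the same thing repeatedly, here is a toy Python sketch. It is not how Heritrix works internally; it is a generic breadth-first crawl over an invented in-memory link graph (the URLs are hypothetical), with a "seen" set as the simplest guard against looping.

```python
from collections import deque

# Hypothetical link graph standing in for the live web; note the pages link in a loop.
LINKS = {
    "http://example.fr/": ["http://example.fr/a", "http://example.fr/b"],
    "http://example.fr/a": ["http://example.fr/b", "http://example.fr/"],
    "http://example.fr/b": ["http://example.fr/a"],
}


def crawl(seeds, links, max_pages=1000):
    """Breadth-first harvest from seed URLs, visiting each URL at most once."""
    seen = set()
    queue = deque()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append(seed)
    harvested = []
    while queue and len(harvested) < max_pages:
        url = queue.popleft()
        harvested.append(url)  # a real crawler would fetch and store the page here
        for target in links.get(url, []):
            if target not in seen:  # without this check the loop would never end
                seen.add(target)
                queue.append(target)
    return harvested


pages = crawl(["http://example.fr/"], LINKS)
print(pages)
```

Real traps are subtler than a simple loop (for example, pages that generate endless new URLs), which is why a page limit such as `max_pages` is included as a second, cruder safeguard.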

So far some 130 terabytes of data have been collected, and some 12 billion files stored.

Harvesting is done in three stages - bulk harvesting once a year; focused crawls of specific sites; and collections of e-deposits (such as e-books) directly from publishers. The frequency of collection varies: some sites, such as the website of the Festival de Cannes, only need to be collected once per year, whereas newspaper sites are collected more frequently.

The archive can be searched by URL or by text, although the text search is rudimentary at present.

Classification is another challenge, as traditional library classifications are not appropriate for much web content. For example, election campaign websites were classified by what the politicians were saying about themselves and by what the public were saying about them, as this was thought to be a useful distinction.

However, the problems of how to provide full and useful access to the collection and how to catalogue it properly remain unresolved.

The process was an interesting merging of traditional library skills and software engineering skills, with some stages clearly being either one or the other but a number of stages being “midway” requiring a cross-skilled approach.

Dr Stephen Bury explained that the BL is somewhat of a latecomer to web archiving, with the BnF, the Internet Archive, and the national libraries of Sweden and Australia all having more advanced web archiving programmes. Partly this is due to the state of UK legal deposit law, which has not yet been extended to include websites.

Just as there are many books about books and books about libraries, so there are many websites about the web. It is a very self-referential medium. However, there is a paradox in the BL’s current programme. Because the BL has to seek permission to collect each and every site, it may collect sites that it cannot then provide access to at all, and it cannot provide any access to sites except to readers in its reading rooms. To be able to collect the web but then not to be able to serve it back up to people through the web seems very strange.

Another issue of preservation is that the appearance of websites is browser-dependent, so a site may not look the same to people using different technology.

It is important that online information is preserved, as now websites are considered to be authentic sources of information - cited in PhDs for example - and so some way of verifying that they existed and what content they contained is needed.

Reports have been produced by JISC and the Wellcome Trust: Collecting and Preserving the World Wide Web (2002) and Legal issues relating to the archiving of Internet resources in the UK, EU, USA and Australia (2002) by Andrew Charlesworth.

The BL undertook a Domain UK project to establish what the scope of a web archiving project might be. The BL used Australian PANDAS software. The UK Web Archiving Consortium (UKWAC) was set up in 2003 but the need to obtain permissions has seriously limited its scope, as most website owners simply do not respond to permissions requests. Very few actively refuse permission; presumably most ignore the request as spam or simply fail to reply.

The data has now been migrated from the PANDAS format to WARC and an access tool is in development. There are some 6 million UK websites, growing at a rate of 16% per year, and they are also growing in size (on average they are about 25 MB, increasing at a rate of 5% per year).
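To get a feel for what those figures imply for storage, here is a back-of-envelope Python sketch. It assumes the average size is in megabytes, that both growth rates compound annually and stay constant, and that decimal units apply (1 TB = 1,000,000 MB). These are simplifying assumptions, not figures from the talk.

```python
SITES = 6_000_000    # approximate number of UK websites quoted in the talk
SITE_GROWTH = 0.16   # annual growth in the number of sites
AVG_SIZE_MB = 25     # average site size, assumed to be in megabytes
SIZE_GROWTH = 0.05   # annual growth in average site size


def projected_total_tb(years):
    """Estimate the total size of the UK web in TB after `years` of compound growth."""
    sites = SITES * (1 + SITE_GROWTH) ** years
    avg_mb = AVG_SIZE_MB * (1 + SIZE_GROWTH) ** years
    return sites * avg_mb / 1_000_000  # convert MB to TB (decimal units)


for years in (0, 1, 5):
    print(years, round(projected_total_tb(years)))
```

On these assumptions a single complete snapshot today is about 150 TB, and the combined growth factor is 1.16 × 1.05 ≈ 1.22 per year, which is why scalability comes up so quickly.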

Decisions have to be made on frequency of collection, depth of collection, and quality. There are other peripheral legal issues, such as sites that fall under terrorism-related legislation. At present the BL can collect these sites but not provide access to them.

Resource discovery remains a major challenge, including how to combine cataloguing and search engine technology. So far, a thematic approach to organisation has been taken. Scalability is also a big issue. What works for a few thousand sites will not necessarily work for a few million.

This means that the nature of the “collecting institution” is changing. It is much harder to decide if a site is in or out of scope. A site may have parts that are clearly in scope and parts that clearly aren’t or it may change through time, sometimes being in scope and sometimes not.

The Digital Lives Project in association with UCL and the University of Bristol is looking at how the web is becoming an everyday part of our social and personal lifestyles.

The talks were followed by a question and answer session. I asked for more detail about the “twilight zone” of public and private websites. Both speakers agreed that there is a great need for more education on digital awareness, so that young people appreciate that putting things up on the Internet really is a form of publishing and their blogs and comments in public forums are not just private “chats” with friends. However, in France there has been little resistance to such personal material being collected. Most people are proud to have their sites considered to be part of the national heritage. A lot of outreach work has been done by the BnF to explain the aims of the archive and discuss any concerns. Gildas Illien also pointed out that people do not necessarily have “the right to be forgotten” and that this is in fact not new. It has happened in the past that people have asked for books and other information to be removed from libraries, perhaps because they have changed their political viewpoint, but a library would not simply remove a book from its shelves because the author decided that they had changed their mind about something in it.

There is a recent interview with Gildas Illien (in French) on YouTube called L’archivage d’Internet, un défi pour les bibliothécaires.
