Apr 23 2012

To embed or not to embed – metadata and IDs

One of the problems with the word metadata (apart from the fact that no-one can decide whether it should be singular or plural - as a former classicist I am quite happy to use it in the Anglicized singular form!) is that the word covers such a wide range of data required for a huge variety of uses.

At a recent presentation I gave as part of a “knowledge share” session at the digital design agency Tobias and Tobias, I was rightly challenged by Patrick from Golant Media Ventures, when I said that you should not embed metadata in your content, but manage it separately. He pointed out that for copyright and rights management purposes embedded metadata is extremely useful and in fact many content creators are actively campaigning to make sure that software and service providers do not strip metadata out of content when it is transferred or transcoded.

Embedding information versus embedding IDs

He is quite right, but I was right too – just in a different sense. It is a complex and important point, so I thought it was worth expanding on. I was talking about not embedding metadata structures in assets when you can manage structures of primarily semantic metadata separately. You can do this by embedding only IDs in the assets, and then using those as lookups to access the structure as and when you need to, picking up the structure “on the fly”. The principle remains the same whether you are talking about “private” localised IDs or “public” IDs, such as Linked Open Data dereferenceable URIs (i.e. website addresses you can look up). Such an approach allows you to manage the structures and meanings contextualising those IDs separately from managing the assets themselves.

The reason is mainly technical. If you wish to add to or edit the structure of your taxonomy (or ontology) or change the information your URI points to, it is far easier to do this in one place than it is to find all the assets containing that metadata and re-index them all individually every time you make a change. So, if you store taxonomy pathways as hard-coded text strings in a piece of content, but then you decide to alter the hierarchy, you have to go back to each and every occurrence of that text string applied to content and update it, in each and every asset record that contains it. Sometimes this might be fine – if you know that you are hardly ever going to change the structure or if you have very few assets, or if you have a very powerful and sophisticated re-indexing service. Generally, however, given that language is constantly evolving and asset collections are constantly growing and changing, the “hard-coding” approach is going to require an awful lot of processing and so will be very resource hungry.

If, on the other hand, all you embed in your asset record is an ID, you can use an external system to provide the context for that ID – the pathways of the taxonomy, the relationships of an ontology, the semantic sense of a URI. You can then alter your taxonomy’s hierarchies (e.g. adding and moving concept nodes) or develop your ontology (e.g. adding new classes and relationships) in one centralised system without having to go back to every individual indexed asset in turn. This also means that you can de-couple your taxonomy or ontology management system from your digital asset management or content management system. This is important if you want more sophisticated metadata management than standard DAM, search, or CMS software provides, or if you want to future proof your semantic structures.
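To make the contrast concrete, here is a minimal sketch of the ID-only approach. Everything in it is hypothetical and for illustration only: the concept IDs, the labels, the `taxonomy` and `assets` dictionaries and the helper names are not taken from any particular DAM or taxonomy product. The asset record holds only a concept ID, and the pathway is built "on the fly" from a separately managed taxonomy.

```python
# Hypothetical in-memory taxonomy: concept ID -> (label, parent ID or None).
# In practice this would live in a separate taxonomy management system.
taxonomy = {
    "c1": ("Animals", None),
    "c2": ("Mammals", "c1"),
    "c3": ("Dogs", "c2"),
    "c4": ("Pets", None),
}

# Asset records embed only concept IDs, never the pathway text itself.
assets = {
    "photo-001": {"title": "Puppy in the park", "concept_ids": ["c3"]},
}


def pathway(concept_id, taxonomy):
    """Build the pathway for a concept by walking up through its parents."""
    labels = []
    current = concept_id
    while current is not None:
        label, parent = taxonomy[current]
        labels.append(label)
        current = parent
    return " > ".join(reversed(labels))


def describe(asset_id):
    """Resolve an asset's embedded IDs to pathways at the moment of lookup."""
    return [pathway(cid, taxonomy) for cid in assets[asset_id]["concept_ids"]]


before = describe("photo-001")  # ['Animals > Mammals > Dogs']

# Restructure the taxonomy in one place: move "Dogs" under "Pets".
taxonomy["c3"] = ("Dogs", "c4")

after = describe("photo-001")   # ['Pets > Dogs']
```

The asset record is never touched: only the one entry in the taxonomy changes, and every asset tagged with that ID picks up the new pathway the next time it is looked up. With hard-coded pathway strings, the same move would mean rewriting every asset record that contained "Animals > Mammals > Dogs".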

Modular systems are more future proof

By keeping asset management and metadata management separate you can upgrade either part without having to upgrade the other. As semantic technologies – such as ontology editing systems – are going through a rapid phase of development, and in general evolving faster than search, DAM, and other consuming systems, maintaining your semantic structures in as transferable and system-agnostic a form as possible shows foresight. Conversely, you may want to invest only a little in a DAM system at first, in the hope that business will grow and you will be able to upgrade as your content collections increase. If you have a separate metadata management system you should be able to keep that while changing your DAM system.

Rights management is different

However, all this primarily concerns internal content and metadata management. Embedding metadata in the asset itself makes most sense when you want that metadata to remain fixed to the asset and be published with it - for example, details of where a photo was taken, who owns the copyright and how to get in touch with them to license re-use of that photo. When your asset wanders out into the public world of the Internet and frequent uncontrollable copying, you want users to be able to find out easily the origins of the image and its ownership, and making that information hard to strip out helps to ensure that they can.

A huge problem for collection of royalties and licensing payments is that people who would be willing to pay simply don’t know who to pay. Deliberate piracy will always be a cost – just as shops will always have to allow for a certain amount of “shrinkage” due to shoplifting, but physical shops tend to be pretty good at making sure customers who are willing to pay can find plenty of checkout tills, self-service checkouts, or sales assistants. Keeping rights information embedded in assets is the equivalent of the checkout, not the security camera.

How important is being up to date?

Of course, the problem of updating remains – so if copyrights are transferred, all those assets that have gone out with old embedded metadata contain out of date information. So, rights managers are increasingly moving towards a system of embedding dereferenceable IDs as well. One example is the EIDR system that uses this method (as well as other techniques) to manage rights. By embedding an ID that links to a centralised rights registry, information can be updated once within that central registry, and then whenever someone looks up that ID, they get the most up to date details.
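As a rough illustration of why a dereferenceable ID stays current where embedded text goes stale, here is a small sketch. It does not use or imitate the real EIDR service; the `registry` dictionary, the `rights-42` ID, the company names and the field names are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class RightsRecord:
    owner: str
    contact: str


# Hypothetical central rights registry, keyed by the ID embedded in assets.
registry = {
    "rights-42": RightsRecord("Original Owner Ltd", "licensing@original.example"),
}

# Two copies of the same photo already circulating on the Internet:
# one carries the owner's name as fixed text, the other only an ID.
copy_with_text = {"embedded_owner": "Original Owner Ltd"}
copy_with_id = {"embedded_rights_id": "rights-42"}


def current_owner(copy):
    """Report the owner as a user would see it when inspecting this copy."""
    if "embedded_rights_id" in copy:
        # Look the ID up in the registry at the time of asking.
        return registry[copy["embedded_rights_id"]].owner
    # Otherwise all we have is whatever text was embedded at publication.
    return copy["embedded_owner"]


# The copyright is transferred: one update, made once, in the registry.
registry["rights-42"] = RightsRecord("New Owner plc", "rights@newowner.example")

stale = current_owner(copy_with_text)  # 'Original Owner Ltd'
fresh = current_owner(copy_with_id)    # 'New Owner plc'
```

The copy with embedded text still names the old owner and cannot be recalled, whereas the copy with an embedded ID reflects the transfer without any of the circulating files being changed.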

So, we are both right in a way. Embedding IDs and managing metadata separately to managing assets has many advantages. Embedding the metadata itself can also be useful, especially if it is rights information of assets that will be released onto the public Internet and is information that you may not need to update, but that you do not want to be lost when the asset is copied.


Feb 12 2012

Data: The New Black Gold?

Published by Fran under culture, information management

Last week I attended a seminar organised by The British Screen Advisory Council and Intellect, the technology trade association, and hosted by the law firm SNR Denton. The panellists included Derek Wyatt, internet visionary and former politician; Dr Rob Reid, Science Policy Adviser, Which?; Nick Graham, of SNR Denton; Steve Taylor, creative mentor; Donna Whitehead, Government Affairs Manager, Microsoft; Theo Bertram, UK Policy Manager, Google; David Boyle, Head of Insight, Zeebox; and Louisa Wong, Aegis Media.

Data as oil

The event was chaired by Adam Singer, BSAC chairman, who explored the metaphor of “data as oil”. Like oil, raw data is a valuable commodity, but usually needs processing and refining before it can be used, especially by individual consumers. Like oil, data can leak and spill, and if mishandled can be toxic.

It struck me during the course of the evening that, just as with oil, we are in danger of allowing control of data to fall into the hands of a very small number of companies, who could easily form cartels and lock out competition. It became increasingly obvious during the seminar that Google has immense power because of the size of the “data fields” it controls, with Facebook and others trying to stake their claims. All the power Big Data offers - through data mining, analytics, etc. - is dependent on scale. If you don’t have access to data on a huge scale, you cannot get statistically significant results, so you cannot fine tune your algorithms in the way that Google can. The implication is that individual companies will never be able to compete in the Big Data arena, because no matter how much data they gather on their customers, they will only ever have data on a comparatively small number of people.

How much is my data worth?

At an individual level, people seemed to think that “their” data had a value, but could not really see how they could get any benefit from it, other than by trading it for “free” services in an essentially hugely asymmetrical arrangement. The value of “my” data on its own - i.e. what I could sell it for as an individual - is little, but when aggregated, as on Facebook, the whole becomes worth far more than the sum of its parts.

At the same time, the issue of who actually owns data becomes commercially significant. Do I have any rights to data about my shopping habits, for example? There are many facts about ourselves that are simply public, whether we like it or not. If I walk down a public street, anybody can see how tall I am, guess my age, weight, probably work out my gender, social status, where I buy my clothes, even such “personal” details as whether I am confident or nervous. If they then observe that I go into a certain supermarket and purchase several bags of shopping, do I have any right to demand that they “forget” or do not use such observations?

New data, new laws?

It was repeatedly stated that the law as it stands is not keeping up with the implications of technological change. It was suggested that we need to re-think laws about privacy, intellectual property, and personal data.

It occurred to me that we may need laws that deal with malicious use of data, rather than ownership of data. I don’t mind people merely seeing me when I walk down the street, but I don’t want them shouting out observations about me, following me home, or trying to sell me things, as in the “Minority Report” scenario of street signs acting like market hawkers, calling out your name as you walk by.

What sort of a place is the Internet?

Technological change has always provoked psychological and political unease, and some speakers mentioned that younger people are simply adapting to the idea that the online space is a completely open public space. The idea that “on the Internet, no-one knows you are a dog” will be seen as a temporary quirk - a rather quaint notion amongst a few early idealists. Nowadays, not only does everyone know you are a dog, they know which other dogs you hang out with, what your favourite dog food is, and when you last went to the vet.

The focus of the evening seemed to be on how to make marketing more effective, with a few mentions of using Big Data to drive business process efficiencies. A few examples of how Big Data analytics can be used to promote social goods, such as monitoring outbreaks of disease, were also offered.

There were clear differences in attitudes. Some people wanted to keep their data private, and accept in return less personalised marketing. They also seemed to be more willing to pay for ad-free services. Others were far more concerned that data about them should be accurate and they wanted easy ways of correcting their own records. This was not just to ensure factual accuracies, but also because they wanted targeted, personalised advertising and so actively wanted to engage with companies to tell them their preferences and interests. They were quite happy with “Minority Report” style personalisation, provided that it was really good at offering them products they genuinely wanted. They were remarkably intolerant of “mistakes”. The complaint “I bought a book as a present for a friend on Amazon about something I have no interest in, now all it recommends to me are more books on that subject” was common. Off-target recommendations seemed to upset people far more than the thought of companies amassing vast data sets in the first place.

Lifting the lid of the Big Data black box

The issue that I like to raise in these discussions is one that Knowledge Organisation theorists have been concerned about for some time - that we build hidden biases so deeply into our data collection methods, our algorithms, and processes, that our analyses of Big Data only ever give us answers we already knew.

We already know you are more likely to sell luxury cars to people who live in affluent areas, and we already know where those areas are. If all our Big Data analysis does is refine the granularity of this information, it probably won’t gain us that many more sales or improve our lives. If we want Big Data to do more for us, we need to ask better questions - questions that will challenge rather than confirm our existing prejudices and assumptions and promote innovation and creativity, not easy questions that merely consolidate the status quo.


Jun 08 2011

There’s no such thing as originality

Published by Fran under culture

Back in 1995 my brother wrote his MA dissertation on copyright: Of Cows and Calves: An Analysis of Copyright and Authorship (with Implications for Future Developments in Communications Media) or How I Learned To Stop Worrying and Love Home-Taping. It is interesting how relevant it remains today, especially in the light of the Hargreaves Report delivered to the government in May. Essentially, nothing much has changed in the intervening 16 years. My brother reflected the predictions of the profound changes that digital technologies would make - and were already making - to the creative industries. Although details such as the excitement over ISDN lines and the absence of any mention of mobile technologies date his work, the core issues he covers - who owns an idea and who should get paid for it – remain remarkably current.

I’ve written a brief overview of the Hargreaves Report for Information Today, Europe. The two aspects of most interest to me are the proposals for a Digital Copyright Exchange and for handling of orphan works.

Ideas as objects

My brother argues that creative works are all part of an ongoing cultural dialogue that no one individual can really “own” and that copyright only made sense for the short period of time where technology reified ideas as artefacts that could be traded as commodities (like potatoes or coal). The business model of “content” as “physical item” started to fail with the invention of the printing press, as the process of copying ceased to be a creative act, so each individual copy was not a “new” work in its own right. Copyright law was developed to commodify the “idea” within a book, not the physical book and was enforceable for only as long as access to the copying technologies - printing presses - could be limited. The digital age has made control of the copying process impossible, as the computer replaced the printing press, so one could exist on every desk, and now, thanks to mobile technology, in every pocket. He notes that in pre-literate societies, authorship of a myth or a folklore was not important, and I find it interesting that crowd sourcing (e.g. Wikipedia, citizen journalism) has in some ways returned us to the notion of a culturally held store of knowledge contributed and curated by volunteers, rather than by paid professionals.

Music is not the only art

Many of the discussions of free v. paid for content seem to run to extremes, and seem to be coloured by the popular music industry’s taste for excess. The music industry inflated commodity prices far beyond what consumers were willing to pay just as cheap copying technologies became widely available, making the pirates feel morally justified. It is hard to feel sympathy for people living a millionaire rockstar lifestyle. The inevitable increase in piracy was met not by lower prices, but by the industry issuing alarmist statements about home taping killing music. It didn’t! Music industry profits have risen steadily. The industry has simply turned its attention to charging more for merchandise and live events. The lesson to be learned is that people are willing to pay for experiences, services, and commodities that they perceive as being worth the price and better than the alternatives. Most music fans would rather pay for an easy, virus-free reliable download service than deal with illegal download sites, just as back in the 1980s sticking a microphone in front of the radio to record the charts wasn’t as good as buying the vinyl. The effects on the reference publishing industry were very different, affecting many small businesses and people on far less than rockstar wages, but most displaced people found ways to transfer their skills to numerous new areas of work – obvious examples are content strategy or user experience design – that simply didn’t exist in the 1990s.

It has become a bit of a cliche that there aren’t any business models that work, or have been shown to work, in the new digital economy, but are things really so different? In order to have a business somebody somewhere has to be persuaded to pay for something. Everything else is just a complication. If your free content is supported by advertising, it just means that someone needs to be persuaded to pay for the advertised product, instead of the free bit that appears on the way. Similarly, “freemium” is really just old-style free samples and loss leaders. You can’t have the “free” bit without someone paying for the “premium”. The two key questions for producers remain, as they have always been, how do you produce content that is so useful, entertaining, or attractive that people are willing to pay for it, and how do you deliver it in ways that make the buying process as easy as possible?

Hargreaves suggests a light touch towards enforcement of rights and anti-piracy, firmly supporting the view that if content and services are good enough, people will pay, and that education about why artists have a right to be paid for their work is as important as catching the pirates. Attempts to “lock” copies with Digital Rights Management systems certainly don’t seem to have been very successful. Such systems are expensive to implement, unpopular, and pirates always manage to hack them. Watermarking doesn’t attempt to prevent copying but does help prove origin if a breach is discovered. Piracy is less of a worry for business-to-business trade, as most legitimate businesses want to be sure they have the correct rights and licences for content they use, rather than face embarrassing and expensive lawsuits, and a simplified, secure Digital Copyright Exchange would presumably be in their interests.

Digital Copyright Exchange

Hargreaves proposes the Digital Copyright Exchange as a mechanism to make the buying and selling of rights far easier. At the moment, piracy can be a temptation because of the time and effort required to attempt to purchase rights. Collections agencies and the law form layers of bureaucracy that hamper start-ups from developing new products and simply confuse ordinary users. This represents real lost revenue to the content providers.

Metadata analyst and music fan Sam Kuper made an interesting proposal for setting fair prices – that artists should put a “reserve price” on their work, with an initial fee for purchasers. Once the “reserve price” has been reached, the income from any subsequent purchase is shared between the artist and the early purchasers. This would guarantee a level of income for artists, allow keen fans to get hold of new material quickly, and allow those less sure to wait to see if the price drops before purchase. Such a system sounds complex, but could work through some kind of centralised system, so that “returns” to early purchasers would be paid as credits to their accounts.
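A small worked example may help show how such a scheme could settle up. The proposal as described leaves the details open, so the following are purely my assumptions for the sake of illustration: a fixed price per purchase, the artist keeping all income until the reserve is met, and every later purchase being split half to the artist and half equally among the early purchasers as account credits. The function name and the buyers are invented.

```python
from fractions import Fraction


def settle(buyers, price, reserve, artist_share=Fraction(1, 2)):
    """Process purchases in order and return (artist_income, credits).

    Assumed rules (not part of the original proposal's wording):
    - until the artist's income reaches the reserve, the artist keeps
      the full price and the buyer counts as an early purchaser;
    - afterwards, the artist keeps artist_share of each purchase and the
      rest is divided equally among the early purchasers as credits.
    """
    artist_income = Fraction(0)
    early_buyers = []
    credits = {}
    for buyer in buyers:
        if artist_income < reserve or not early_buyers:
            artist_income += price
            early_buyers.append(buyer)
            credits[buyer] = Fraction(0)
        else:
            artist_income += price * artist_share
            refund = price * (1 - artist_share) / len(early_buyers)
            for early in early_buyers:
                credits[early] += refund
    return artist_income, credits


# Price of 12 per purchase, reserve of 36: the first three buyers are "early".
income, credits = settle(["ann", "bob", "cat", "dev", "eve"], price=12, reserve=36)
# income is 48; ann, bob and cat each hold a credit of 4.
```

Under these assumptions the artist is guaranteed the reserve of 36 and then earns a further 6 from each of the two later sales, while each of the three early purchasers ends up having paid a net 8 rather than 12.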

From an archive point of view, Hargreaves’s call to allow the digitisation and release of orphan works without endless detective work in trying to trace origins would be a huge boon.

So, there is much to think about in the Hargreaves report and some very sensible practical suggestions, but much detail to be worked out as well. I wonder if in another 16 years, my brother and I will have seen any real change or will we still be going through the same debate?


Jan 27 2011

Identifiers for asset management

Published by Fran under Digital Asset Management

I met Rob Wilson of RCWS consulting at the DAM London conference last summer, where he talked about digital identifiers. He has had a long career working on information architecture for a range of organisations and industry standards bodies, including the ultimately doomed e-GMS project, which was an attempt to unify government metadata standards for sharing and interoperability.

Identifiers are a hot topic at the moment, and so I was pleased Rob was willing to talk to me and some of my colleagues about some of the problems and attempts at solving them in the wider industry. Rob described the need for stable and persistent IDs for asset management and that this is not a new problem but one that has been worked on by various groups for many years. He distinguished between ID management and metadata management, pointing out that metadata may change but the ID of an asset can be kept stable. In just the same way that managing metadata as a distinct element of content is important, so IDs need to be managed as distinct from other metadata.

Rob is an independent consultant and has no commercial affiliation with the EIDR content asset identifier system, but suggested it is a robust model in many ways. The EIDR system embeds IDs within asset files, using a combination of steganography, strong encryption, and watermarking and “triangulation” of combined aspects to create a single ID. The idea is that the ID is so deeply ingrained and fragmented within the structure of the file that it cannot easily be removed and can be recovered from a small fraction of the file - e.g. if someone takes a clip from a video, the “parent” ID is discoverable from the clip. Rob thought the EIDR system was better than digital rights management (DRM) methods, which rely on trying to prevent distribution, and “locking” content after a certain amount of time or use, which gives an incentive to everyone who has such content to try to break the DRM system. If an individual can still access their illegally held content despite the ID, they have little incentive to try to remove the ID.

The system does not try to prevent theft from source, but helps to prove when copyright has been breached, because the ID - the ownership - remains within the file. It is intended as a tool to deter systematic copyright theft by large organisations, by making legal cases easy to win. Large organisations typically have the funds, or insurances, to cover large claims, unlike individuals, so it is unlikely to be cost-effective to pursue individuals.

The EIDR system also has an ID resolver that manages rights and supply authentication (as well as ownership) so that when content is accessed, the system checks that the appropriate rights and licences have been obtained before delivery.

Rob also outlined the elements of information architecture that need to be considered for unified organisational information management - establish standards, develop models, devise policies, select tools, maintain governance, etc. He emphasised that IDs are not a “silver bullet” to solve all issues, but that if a few problematic key use cases are known, they can be investigated to see if robust federated ID management architecture would help.


Aug 01 2010

Content Identifiers for Digital Rights Persistence

This is another write-up from the Henry Stewart DAM London conference.

Identity and identification

Robin Wilson discussed the issue of content identifiers, which are vitally important for digital rights management, yet tend to be overlooked. He argued that although people become engaged in debates about titles and the language used in labels and classification systems, they overlook the need to achieve consensus on basic identification.

(I was quite surprised, as I have always thought that people would argue passionately about what something should be called and how using the wrong terminology affects usability, but that they would settle on machine-readable IDs quite happily. Perhaps it is the neutrality of such codes that makes the politics intractable. If you have invested huge amounts of money in a database that demands certain codes, you will argue that those codes should be used by everyone else, to save you the costs of translation or of acquiring a compatible system, and there are no appeals to usability, or brokerage via editorial policy, that can be made. It simply becomes a matter of whoever shouts the loudest gets to spend the least money in the short term.)

Robin argued that the only way to create an efficient digital marketplace is to have a trusted authority oversee a system of digital identifiers that are tightly bound within the digital asset, so they cannot easily be stripped out even when an asset is divided, split, shared, and copied. The authority needs to be trusted by consumers and creators/publishers in terms of political neutrality, stability, etc.

(I could understand how this system would make it easier for people who are willing to pay for content to see what rights they need to buy and who they should pay, but I couldn’t see how the system could help content owners identify plagiarism without an active search mechanism. Presumably a digital watermark would persist throughout copies of an asset, provided that it wasn’t being deliberately stripped, but if the user simply decided not to pay, I don’t see how the system would help identify rights breaches. Robin mentioned in conversation Turnitin’s plagiarism management, which has become more lucrative than their original work on content analysis, but it requires an active process instigated by the content owner to search for unauthorised use of their content. This is fine for the major publishers of the world, who can afford to pay for such services, but is less appealing to individuals, whether professional freelances or amateur content creators, who would need a cheap and easy solution that would alert them to breaches of copyright without their having to spend time searching.)

The identifiers themselves need to be independent of any specific technology. At the moment, DAM systems are often proprietary and therefore identifiers and metadata cannot easily flow from one system to another. Some systems even strip away any metadata associated with a file on import and export.

Robin described five types of identifier currently being used or developed:

  • Uniform Resource Name (URN)
  • Handle System
  • Digital Object Identifier
  • Persistent URL (PURL)
  • ARK (Archival Resource Key).

He outlined three essential qualities for identifiers - that they be unique, globally registered, and locally resolved.

So why don’t we share?

Robin argued that it is easier for DAM vendors to build “safe” systems that lock all content within an enterprise environment; only those with a public service/archival remit tend to be collaborative and open. DAM vendors resist a federated approach online and prefer to use a one-to-one or directly intermediated transaction model. Federated identifier management services exist but vendors and customers don’t trust them. The problem is mainly social, not technological.

One of the problems is agreeing to share the costs of services, such as infrastructure, registration and validation, governance and development of the system, administration, and outreach and marketing.

(Efforts to standardise may well benefit the big players more than the small players and so there is a strong argument for them bearing the initial costs and offering support for smaller players to join. Once enough people opt in, the system gains critical mass and it becomes both easier to join and costs of joining become less of an unquantifiable risk – you can benefit from the experiences of others. The semantic web is currently attempting to acquire this “critical mass”. As marketers realise the potential of semantic web technology to make money, no doubt we will see an upsurge in interest. Facebook’s “like” button may well be heralding the advent of the ad-driven semantic web, which will probably drive uptake far faster than the worthy efforts of academics to improve the world by sharing research data!)
