Taxing taxonomies

M-iD, February 2005

A growing number of information managers are implementing taxonomies in a bid to improve customer retention and employee efficiency.

Categorise every document in an organisation? It sounds like a lot of work for little return. Yet proponents of “taxonomies” would have many information managers do just that, arguing that the savings made in employee and customer time will more than compensate for the time, effort and money spent implementing corporate information taxonomies.

Until now, taxonomies have primarily been adopted by companies in highly regulated environments and those with enormous amounts of information, such as pharmaceutical companies. However, with all companies facing increasing levels of corporate governance, and with customers expecting ever more personalised and responsive service, the use of taxonomies is spreading.

Engineering company Arup, for example, which was responsible for the Sydney Opera House and the Millennium Bridge, had an information overload only a taxonomy could sort out. When it bids for business, it needs to know if it has worked on a similar project in the same or a different location; who has experience of the challenges it will face on the project; and whether it has ever designed a similar system.

But with 120,000 project records, searching would have been almost impossible without a taxonomy. Having applied a taxonomy to its project management and financial systems, Arup can now search, for example, for all the projects done in UK universities over the last three years and locate experts within minutes.
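As a rough illustration of why tagged records make such a search quick, the sketch below filters a handful of project records by two taxonomy facets and a completion date, then lists the experts attached to the matches. The record fields, facet names and sample projects are all invented for the example; nothing here describes Arup's actual systems.

```python
from dataclasses import dataclass
from datetime import date
from typing import Tuple


@dataclass(frozen=True)
class Project:
    # All field names are hypothetical, chosen only for this sketch.
    name: str
    country: str            # location facet
    sector: str             # client-sector facet
    completed: date
    experts: Tuple[str, ...]


def find_projects(projects, *, country, sector, since):
    """Return projects matching both facets and completed on or after `since`."""
    return [
        p for p in projects
        if p.country == country and p.sector == sector and p.completed >= since
    ]


# Invented sample records.
PROJECTS = [
    Project("Library refurbishment", "UK", "University", date(2004, 3, 1), ("A. Example",)),
    Project("Footbridge", "UK", "Transport", date(2003, 6, 1), ("B. Example",)),
    Project("Lecture theatre", "UK", "University", date(1999, 9, 1), ("C. Example",)),
]

matches = find_projects(PROJECTS, country="UK", sector="University", since=date(2002, 2, 1))
experts = sorted({name for p in matches for name in p.experts})
```

The point is that once every record carries the same controlled facet values, the search becomes an exact match on those values rather than a guess at free-text keywords.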

But to implement a useful taxonomy takes several things: a good understanding of both the information within the organisation and how people will try to access it; properly categorised documents; processes to ensure that the taxonomy adapts to organisational changes; and an appreciation of the complexities of taxonomies.

It is not without reason that many of the organisations that implement taxonomies hire information professionals, such as librarians, to develop them. If the project is not managed carefully, it is very easy to create an irrelevant or excessively large categorisation system.

Choosing a taxonomy
The first step is to conduct a content audit. This will categorise the types of content (email, business documents and so on) and their data types (text, graphics, video). It will also provide an understanding of the technical scale of the problem.
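A first pass at such an audit can be as simple as counting files by data type. The sketch below assumes a list of file paths and a hand-written mapping from file extension to data type; both the mapping and the sample paths are illustrative, and a real audit would also need to cover email stores, databases and other repositories.

```python
from collections import Counter
from pathlib import PurePath

# Hypothetical mapping from file extension to data type.
DATA_TYPES = {
    ".txt": "text", ".doc": "text", ".pdf": "text",
    ".png": "graphics", ".jpg": "graphics",
    ".avi": "video", ".mpg": "video",
}


def audit(paths):
    """Count how many files fall into each data type."""
    counts = Counter()
    for path in paths:
        suffix = PurePath(path).suffix.lower()
        counts[DATA_TYPES.get(suffix, "unknown")] += 1
    return counts


# Invented sample paths.
sample = ["reports/q1.doc", "site/plan.PNG", "training/intro.avi", "notes.txt", "archive.zip"]
summary = audit(sample)
```

The "unknown" count is worth watching: a large figure suggests the mapping, and therefore the audit, is missing whole classes of content.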

The next step is to decide how to generate the taxonomy: start from scratch, create a partial taxonomy and use automatic indexing to add to it, or buy in a pre-existing taxonomy. For many organisations, this is the most daunting step. Unless they have a skills base in information sciences or the project is very small, most analysts advise companies not to try to generate their own taxonomy from scratch. In most situations, starting with a pre-existing taxonomy and customising it either manually or automatically is the best way forward, say analysts.

A number of vendors sell specialised taxonomies. Some have been developed by information management specialists, although these are typically intended for use by librarians rather than in IT systems, and some have been developed by consortia of companies from a specific vertical industry sector, such as the oil and gas industry. More commonly now, however, taxonomies are being sold with search engines and enterprise content management (ECM) systems - the results of work by the vendors for previous customers.

According to Peter Ahearn, senior IT consultant at PA Consulting, organisations should evaluate these commercial taxonomies in the same way as they would any other software. “I would put this out with a formal procurement process. Go to tender, look at a number of options, and get the companies to prove to you that their taxonomy works using a part of the taxonomy that's difficult or different.”

Usually, a bought-in taxonomy will either be too large, since it is intended for a range of potential customers, or will not quite match the organisation's business, so some customisation will be necessary. There is also likely to be a need for some customisation on an ongoing basis, to create a more relevant subset of the taxonomy or to expand it slightly.

“Some companies are put off by any kind of manual work to add bits to taxonomies. But I do think that once a taxonomy is there, it can be updated by various content experts around an organisation. It doesn't have to be centralised, so it's not really such a huge job,” says Ahearn.

In particular, employees outside the IT department need to be involved in constructing taxonomies if they are to reflect the way the company actually does business, he adds.

The downside to this manual approach to construction, however, is that an organisation may need more than one taxonomy. A PC maker, for instance, would need one taxonomy to filter results for its call centres, another for its hardware repairs and yet another for its sales department.

Buying in a larger base taxonomy can, therefore, often prove more cost-effective, since it can be broken down into several potential classification systems. But as the number of different classes of worker needing to access corporate data grows, so the number of taxonomies could increase.

Personal agents
This is where personal agent software from companies such as Autonomy comes in. This can generate personalised taxonomies by learning from the kinds of search each user performs: for technical details on products it will create a taxonomy suitable for hardware repairs, while searches for product features will create a more sales-friendly profile.

Torstein Thorsen, vice president of technical sales at search engine company Fast Search & Transfer, says that many companies prefer this computer-generated approach to taxonomies. “People are moving away from huge, manually created, structured taxonomies towards computer-created, flatter taxonomies: they provide more ease-of-use for normal information users.”

However, Ahearn argues that personalisation of taxonomies creates a problem when guiding others to the same documents: a user can no longer be sure a document will be in the same place in another user's search, or even in his or her own search on a different machine. So automated taxonomy generation almost always needs to be allied with manual taxonomies, typically with the manual taxonomy providing the higher-level categories and the automated system generating sub-categories beneath them, where possible.

The other main aspect of implementing a taxonomy is document categorisation. Each document needs to be classified as belonging to particular categories in any given taxonomy. With an ECM system, it can be quite easy to ensure that all new documents are automatically categorised, since the system can enforce document metadata tagging by staff.
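One way a system might enforce tagging is to refuse to save a document until its metadata is complete and its category exists in the taxonomy. The sketch below is a minimal illustration of that check; the required fields, the category names and the `validate` function are assumptions made for the example and do not correspond to any particular ECM product.

```python
# Hypothetical taxonomy categories and required metadata fields.
TAXONOMY = {"sales", "repairs", "call-centre"}
REQUIRED_FIELDS = ("title", "author", "category")


def validate(metadata):
    """Return a list of problems; an empty list means the document may be saved."""
    problems = [f"missing {field}" for field in REQUIRED_FIELDS if not metadata.get(field)]
    category = metadata.get("category")
    if category and category not in TAXONOMY:
        problems.append(f"unknown category: {category}")
    return problems


ok = validate({"title": "Fan replacement guide", "author": "J. Smith", "category": "repairs"})
rejected = validate({"title": "Holiday rota", "category": "hr"})
```

Here `ok` is empty, so the first document would be accepted, while `rejected` lists a missing author and a category that is not in the taxonomy.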

But this does not help classify existing documents. While manual tagging of existing documents is a possible solution, it is usually highly impractical for organisations of any size and age, particularly those with a high information investment. Forrester Research analyst Laura Ramos says that consequently, organisations that adopt a manual tagging approach rarely attempt to deal with legacy content.

Automated tagging, with some degree of human oversight, is therefore the most effective approach, she says. Like Google and other modern search engines, automated tagging systems attempt to 'understand' the content of documents to see how they fit into the taxonomy. 'Bayesian' techniques, from Autonomy and others, work on the premise that the most infrequent words in a document give the best indication of its meaning, and use those words to categorise the document. Others, such as Fast, use linguistic analysis to try to get at the documents' meaning.
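The idea that rare words say most about a document can be illustrated with a simple inverse-document-frequency weighting: a word that appears in every document scores nothing, while a word that appears in only one scores highly. The sketch below is not Autonomy's algorithm, nor a full Bayesian classifier; it is a simplified stand-in for the general principle, and the corpus, category keywords and function names are all invented.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Split text into lower-case words."""
    return re.findall(r"[a-z]+", text.lower())


def idf_weights(corpus):
    """Weight each word by how rare it is across the corpus."""
    n = len(corpus)
    doc_freq = Counter(word for doc in corpus for word in set(tokens(doc)))
    return {word: math.log(n / df) for word, df in doc_freq.items()}


def categorise(doc, weights, categories):
    """Pick the category whose keywords carry the most rarity weight in `doc`."""
    words = set(tokens(doc))
    scores = {
        name: sum(weights.get(w, 0.0) for w in words & keywords)
        for name, keywords in categories.items()
    }
    return max(scores, key=scores.get)


# Invented corpus and category keywords.
CORPUS = [
    "the customer reported that the fan is noisy",
    "the customer asked about the price of the laptop",
    "the motherboard and the fan were replaced",
]
CATEGORIES = {
    "repairs": {"fan", "motherboard", "replaced"},
    "sales": {"price", "laptop", "discount"},
}

weights = idf_weights(CORPUS)
label = categorise(CORPUS[2], weights, CATEGORIES)
```

In this toy corpus "the" appears in every document and so carries no weight, whereas "motherboard" appears only once and pulls the third document firmly towards the repairs category.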

Both approaches also take into account document metadata, which can often prove more enlightening than the document content itself. Although this automated approach works best with flatter taxonomies that have few sub-levels, it is often surprisingly effective when combined with a suitable thesaurus that can match synonyms of words to their appropriate taxonomies.

But the thesaurus needs to be developed by the organisation to suit its own taxonomy, since it needs to take into account the organisation's culture and processes. For example, while “UK” could potentially be a synonym of “Europe” for the European Commission, it would not be for the anti-European UK Independence Party.
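In code terms, a thesaurus of this kind is a lookup from a term to the taxonomy category it belongs under, and two organisations can hold different lookups for the same term. The sketch below shows the idea with two small, invented thesauri; the entries and the `category_for` helper are illustrative assumptions only.

```python
def category_for(term, thesaurus):
    """Look up a term's taxonomy category; returns None if the term is unknown."""
    return thesaurus.get(term.strip().lower())


# Invented entries: organisation A files the UK under Europe,
# organisation B keeps it as a category of its own.
THESAURUS_A = {"uk": "Europe", "united kingdom": "Europe", "france": "Europe"}
THESAURUS_B = {"uk": "United Kingdom", "united kingdom": "United Kingdom", "france": "Europe"}

in_a = category_for("UK", THESAURUS_A)
in_b = category_for("UK", THESAURUS_B)
```

The same search term lands in "Europe" for one organisation and "United Kingdom" for the other, which is why a thesaurus bought off the shelf is unlikely to fit without adjustment.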

Keeping up-to-date
Even the best taxonomy will lose relevance and become ineffective if it is not kept up-to-date, so regular updating is a core part of taxonomy management. Systems that generate taxonomies and taxonomy maintenance tools can reduce the workload, although the former need monitoring to ensure the taxonomy has not lost touch with business requirements.

Taxonomies intended to reflect businesses will need to be changed whenever changes to the business occur, such as mergers and new product releases. These changes will need human intervention, but by distributing responsibility for certain aspects of the taxonomies to relevant staff, the workload can be kept to a minimum.

While the workload involved may initially seem large, most organisations are more than capable of implementing and maintaining their own classification systems - and the savings, proponents say, can handsomely repay the effort.

Rob Buckley – Freelance Journalist and Editor
