Testing taxonomies

M-iD, May 2005

A taxonomy can help an organisation to classify and later find information. But how should it be implemented?


Fortunately, there is a growing community of taxonomy specialists and consultants, such as TFPL and Factiva, that can advise on taxonomy development.

Pre-fabricated

Savings can be made by using a pre-built taxonomy. Many taxonomies are available commercially, or even free of charge, from an increasing number of companies and organisations.

Unless the taxonomy is intended for generic information classification, organisations should look for taxonomies specific to their industry rather than all-encompassing classification systems. Information-rich industries such as the legal and pharmaceutical markets are particularly well catered for by industry taxonomies.

Time should be spent evaluating any pre-built taxonomy's fit with the organisation and its data. "My advice would be to look at this from two angles," says Imam Hoque, head of the technology innovation group at consultancy Detica.

"What are you trying to get from the business processes and what is the data telling you about itself?" he asks. Although some degree of change to processes may be necessary and even desirable, trying to implement changes at the same time as a taxonomy can make the project more likely to fail.

A taxonomy that already fits existing processes, and that groups data in much the same way as the organisation already groups it, is therefore desirable. "Try to get an evaluation of the taxonomy," Hoque advises. "Take a subset of your data, as random as possible and see if you can create rules for automatically classifying your documents within this taxonomy."

This is the only way in which the advantages and shortcomings of a given taxonomy can be tested. "It's only when you try that you'll discover the issues. For example, you may design a taxonomy where a couple of nodes simply end up with too many documents and others are extremely sparse," adds Hoque.
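The trial Hoque describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the node names, the keyword rules, and the thresholds for "too many" and "extremely sparse" are all hypothetical, and real classification rules would be far richer than keyword matching.

```python
import random
from collections import Counter

# Hypothetical keyword rules: taxonomy node -> words that signal it.
RULES = {
    "Contracts": {"contract", "agreement", "clause"},
    "Litigation": {"court", "claim", "appeal"},
    "Patents": {"patent", "invention"},
}

def classify(text, rules=RULES):
    """Return the node whose keywords match most often, or None if no rule fires."""
    words = text.lower().split()  # naive tokenising; punctuation is not stripped
    scores = {node: sum(w in keywords for w in words)
              for node, keywords in rules.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def evaluate(documents, sample_size, rules=RULES, crowded=0.5, sparse=0.05, seed=0):
    """Classify a random sample and report how it spreads across the nodes.

    `crowded` and `sparse` are assumed thresholds, expressed as a share
    of the sample; they are not figures from the article.
    """
    if not documents:
        raise ValueError("no documents to sample")
    rng = random.Random(seed)
    sample = rng.sample(documents, min(sample_size, len(documents)))
    counts = Counter(classify(doc, rules) for doc in sample)
    total = len(sample)
    unclassified = counts.pop(None, 0)  # documents that no rule matched
    return {
        "counts": {node: counts[node] for node in rules},
        "unclassified": unclassified,
        "crowded": [n for n in rules if counts[n] / total > crowded],
        "sparse": [n for n in rules if counts[n] / total < sparse],
    }
```

Calling `evaluate(documents, 200)` on a list of document texts returns the per-node counts together with the nodes that look overloaded or nearly empty, which is the kind of issue Hoque says only emerges once the taxonomy is tried on real data.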

After a suitable taxonomy is found or developed, it can then be deployed. The organisation must choose at this point whether to classify existing documents using the taxonomy or to classify only new documents. Classifying existing documents can be highly expensive, and many organisations avoid it for this reason. Various automated systems can classify documents within the taxonomy, but many organisations are uncertain of their accuracy and efficiency.

APR Smartlogik director Richard Pinder argues that neither manual nor automated classification is superior to the other: both require work. "If you ask ten people to classify a document against a taxonomy, some will tag it as three separate things. One client told me that he bought our automated tagger for consistency. 'It may be consistently wrong to start, but at least it's consistent. I can raise the bar later and there will never be any lack of consistency due to interpretation,'" says Pindar.

Depending on the application of the taxonomy, errors in classification may only have small side effects, so organisations need to balance the cost of perfect classification against the cost of errors before deciding how much time and money to spend improving classification.
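The trade-off can be made concrete with some simple arithmetic. The sketch below is a rough model, not a method from the article: it assumes that every document costs a fixed amount to classify and that every misclassified document carries the same fixed cost. The function names and all figures in the usage note are invented for illustration.

```python
def total_cost(n_docs, cost_per_doc, error_rate, cost_per_error):
    """Classification effort plus the expected cost of the errors that remain."""
    classification = n_docs * cost_per_doc
    expected_errors = n_docs * error_rate
    return classification + expected_errors * cost_per_error

def cheapest(options, n_docs, cost_per_error):
    """Return the name of the option with the lowest total cost.

    `options` maps an option name to (cost_per_doc, error_rate).
    """
    return min(
        options,
        key=lambda name: total_cost(n_docs, *options[name], cost_per_error),
    )
```

For example, with hypothetical options `{"manual": (2.0, 0.02), "automated": (0.1, 0.15)}`, the automated route comes out cheaper when each error costs little, while the manual route wins once errors become expensive enough to outweigh the extra effort.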


Rob Buckley – Freelance Journalist and Editor
