Testing taxonomies

M-iD, May 2005

A taxonomy can help an organisation to classify and later find information. But how should it be implemented?

The more information an organisation or web site has, the harder it becomes for people to find exactly what they want. And while search engines can help, without some sort of context for the engines to 'understand' the collected data, the results they return can be poor. What is needed is a map of the information for both people and search engines to cut down search time and improve the results.

'Taxonomies' provide such maps. Looking like inverted trees with a single root at the top and successive branches and leaves (or 'nodes') cascading down, taxonomies are hierarchies of categories for classifying information and objects.

A taxonomy of company processes might, for example, start with the company at the top, sub-departments such as human resources, finance and administration on the next level, posts within each department on the next level and jobs performed by each post on the last level.
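As a rough illustration, the company example above can be written down as a nested structure in which each category holds its sub-categories. This is only a sketch: the posts and jobs named here are invented for the example, and `find_path` and `leaves` are hypothetical helper names, not part of any taxonomy product.

```python
# Hypothetical taxonomy mirroring the company example:
# company -> departments -> posts -> jobs. All names are illustrative.
TAXONOMY = {
    "Company": {
        "Human resources": {
            "Recruiter": {"Screen applications": {}, "Arrange interviews": {}},
        },
        "Finance": {
            "Accounts clerk": {"Process invoices": {}},
        },
        "Administration": {
            "Office manager": {"Order supplies": {}},
        },
    }
}


def find_path(tree, target, trail=()):
    """Return the root-to-node path for `target`, or None if it is absent."""
    for name, children in tree.items():
        path = trail + (name,)
        if name == target:
            return path
        found = find_path(children, target, path)
        if found is not None:
            return found
    return None


def leaves(tree):
    """Yield every node that has no children (the 'leaves' of the tree)."""
    for name, children in tree.items():
        if children:
            yield from leaves(children)
        else:
            yield name
```

Calling `find_path(TAXONOMY, "Process invoices")` walks down from the root and returns the chain of categories above that job, which is the 'context' a search engine can use to narrow its results.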

When used in the right context and done correctly, taxonomies can bring much benefit. Yell.co.uk, for instance, started off as the familiar Yellow Pages phone directory.

However, the Yellow Pages only classified advertisers according to a simple system: 2,000 categories and alphabetical order. This was inadequate for a web-based search system where expectations are higher. Users might, for example, expect the directory to provide information on all the Corgi-accredited plumbers in Reading available 24 hours a day to fix boilers. By creating a rich taxonomy that matches the kind of categories people use for search, Yell has been able to make searching faster and more accurate - and therefore more profitable for both itself and its customers.

To create a useful taxonomy, however, requires a significant investment of both time and money, as well as expertise in information science. Nevertheless, there are steps that organisations can take to ensure that their taxonomies match their needs without stretching budgets too much.

Asking questions

The most important step is for the organisation to decide what it really wants the taxonomy for in the first place.

"You'd be surprised how many organisations say, 'Ah, we hadn't thought of that'," says Simon Alterman, vice president of content at Factiva. "A taxonomy is a tool that suits a particular business purpose or problem, so it's really important to get to the main business purpose before embarking on taxonomy design."

A good place to start is with surveys and workshops involving existing users of information systems. These should investigate how users perform their jobs, their frustrations with the existing system and the kind of information they search for. If there is a need for a taxonomy then, as with most major IT projects, buy-in should be obtained at a senior level. This almost goes without saying, but justifying a taxonomy project can be harder than justifying projects that offer more tangible results.

Tales of deals that were lost because of failures in search - not just raw statistics - can also be useful catalysts in obtaining the necessary high-level buy-in. It is also wise to avoid creating a taxonomy entirely from scratch. Without help from specialists, many organisations will create a taxonomy that simply mirrors their file plan, which will not help search in the slightest.

Fortunately, there is a growing community of taxonomy specialists and consultants, such as TFPL and Factiva, that can advise on taxonomy development.

Pre-fabricated

Savings can be made by using a pre-built taxonomy. There are many taxonomies that are commercially available or even available for free, from an increasing number of companies and organisations.

Unless the taxonomy is intended for generic information classification, organisations should look for taxonomies specific to their industry rather than all-encompassing classification systems. Information-rich industries such as the legal and pharmaceutical markets are particularly well catered for by industry taxonomies.

Time should be spent evaluating any pre-built taxonomy's fit with the organisation and its data. "My advice would be to look at this from two angles," says Imam Hoque, head of the technology innovation group at consultancy Detica.

"What are you trying to get from the business processes and what is the data telling you about itself?" he asks. Although some degree of change to processes may be necessary and even desirable, trying to implement changes at the same time as a taxonomy can make the project more likely to fail.

So a taxonomy that already works with existing processes and which clusters data similarly to any existing clustering is desirable. "Try to get an evaluation of the taxonomy," Hoque advises. "Take a subset of your data, as random as possible and see if you can create rules for automatically classifying your documents within this taxonomy."

This is the only way in which the advantages and shortcomings of a given taxonomy can be tested. "It's only when you try that you'll discover the issues. For example, you may design a taxonomy where a couple of nodes simply end up with too many documents and others are extremely sparse," adds Hoque.
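The kind of trial Hoque describes might be sketched as follows: take a random sample of documents, apply simple classification rules, then count how many documents land in each node. Everything here is an assumption made for illustration: the keyword rules, the node names, the function names and the thresholds for 'too many' and 'too sparse' are hypothetical, and commercial classifiers use far richer rules than keyword matching.

```python
import random
import re
from collections import Counter

# Hypothetical keyword rules: node name -> words that suggest the node.
RULES = {
    "Finance": {"invoice", "budget", "payroll"},
    "Human resources": {"recruitment", "appraisal", "leave"},
    "Administration": {"stationery", "facilities"},
}


def classify(text, rules=RULES):
    """Return the node whose keywords overlap most with the text, or None."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {node: len(words & keywords) for node, keywords in rules.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def evaluate(documents, sample_size, rules=RULES, seed=0):
    """Classify a random sample and count the documents landing in each node."""
    rng = random.Random(seed)  # fixed seed so a trial run can be repeated
    sample = rng.sample(documents, min(sample_size, len(documents)))
    counts = Counter({node: 0 for node in rules})
    unclassified = 0
    for doc in sample:
        node = classify(doc, rules)
        if node is None:
            unclassified += 1
        else:
            counts[node] += 1
    return counts, unclassified


def flag_imbalance(counts, crowded_share=0.5, sparse_share=0.05):
    """Flag nodes holding a very large or very small share of classified documents."""
    total = sum(counts.values())
    if total == 0:
        return [], []
    crowded = [n for n, c in counts.items() if c / total > crowded_share]
    sparse = [n for n, c in counts.items() if c / total < sparse_share]
    return crowded, sparse
```

A high count of unclassified documents suggests the rules or the taxonomy do not fit the data, while the crowded and sparse lists point to nodes that may need splitting or merging.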

After a suitable taxonomy is found or developed, it can then be deployed. The organisation must choose at this point whether to classify existing documents using the taxonomy or to classify only new documents. Classifying existing documents can be a highly expensive procedure, and many organisations avoid it for this reason. Various automated systems can classify documents within the taxonomy, but many organisations are uncertain of their accuracy and efficiency.

APR Smartlogik director Richard Pinder argues that neither manual nor automated classification is superior to the other - both require work. "If you ask ten people to classify a document against a taxonomy, some will tag it as three separate things. One client told me that he bought our automated tagger for consistency. 'It may be consistently wrong to start, but at least it's consistent. I can raise the bar later and there will never be any lack of consistency due to interpretation,'" says Pinder.

Depending on the application of the taxonomy, errors in classification may only have small side effects, so organisations need to balance the cost of perfect classification against the cost of errors before deciding how much time and money to spend improving classification.

With a taxonomy in place, a process needs to be developed for maintaining the taxonomy and ensuring it remains current. Maintenance can be manual, with key people in the organisation suggesting changes, which are then approved by someone responsible for maintaining the taxonomy.

'Prompted maintenance' uses technology to re-cluster documents and see how the general nature of the content changes over time. This information can then be reviewed by the person maintaining the taxonomy and incorporated where appropriate. Richard Roth, chief research officer of The Hackett Group, which develops a taxonomy used by companies such as networking giant Cisco Systems for corporate performance monitoring, says that most organisations come back for updates every two or three years.

When they do, they make suggestions as to how the taxonomy should be changed. Hackett keeps a record of those recommendations and reviews them after a year. When they make suggestions for new nodes, Roth says, the group will gather 15 to 20 company executives and work out with them what processes need to be in the node, as well as their activities and sub-processes.

Specialist taxonomy maintenance software can be helpful when making changes to the taxonomy. "Specialist software really changes the lives of people doing this," says Factiva's Alterman. But if the taxonomy will only change slightly over time, no specialist software is necessary.

Implementing a taxonomy is an activity best suited to information specialists. By using their expertise and the investment already made by organisations in industry taxonomies, development can often be made cheaper and easier. The rewards of implementation will be much greater as a result.

Tying the knot

Kent Connects has deployed a taxonomy to tie together the web sites of many disparate public sector bodies in Kent.

Kent Connects is a partnership of Kent's local authorities and emergency services, but its fragmented nature meant that Kent residents would visit many different members' web sites looking for information on services.

However, they would often visit the wrong site, expecting, for instance, to find out about education at their local council's site when it was the county council that provided the service. As a result, residents were not able to get the information they wanted.

In response, Kent Connects decided to implement a portal that would aggregate the information contained on the individual web sites in an attempt to make it easier for Kent residents to access the information.

But it soon became clear that this was almost impossible: each of the sites used a different content management system - if it used one at all - and few used any metadata to mark up their content. Those that did had used incompatible systems. The partners concluded that to provide a consistent and thorough search experience for Kent residents, they would need to tag all content according to a standard metadata framework and implement a taxonomy that they could apply to both existing and new documents.

After putting out a request for interest in November 2003, Kent Connects trialled APR Smartlogik's Semaphore product. "One of the requirements was to provide a search engine, but with the rigour of a taxonomy underneath, as well as automatic tagging," says Ralph Sperring, project manager at Kingshurst Consulting, which provided consultancy services for the project. Automated tagging was important, since one member authority alone estimated it would take one employee a month to tag its existing pages manually.

By March 2004, the pilot programme was complete and the various partners all agreed to implement the system. The partnership then created a centralised taxonomy management service to avoid duplication and to achieve a common language the partners could use. Based partly on a taxonomy provided by APR Smartlogik, the taxonomy developed was a mixture of the Government Category List and the Local Government Category List, as well as additional, local categories and synonyms particular to Kent. For metadata tagging the organisation adopted the government's eGIF standard.
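The role that synonyms play in a taxonomy like this can be sketched in a few lines: each preferred category carries a list of alternative terms, and a search term is resolved to the preferred category before the search runs. The categories and synonyms below are invented for illustration and are not taken from the Government Category List, the Local Government Category List or the Kent Connects taxonomy; `build_index` and `resolve` are hypothetical names.

```python
# Hypothetical preferred terms and synonyms; not taken from any real category list.
SYNONYMS = {
    "Waste collection": ["bin collection", "rubbish", "refuse"],
    "Schools": ["education", "primary school", "secondary school"],
}


def build_index(synonyms):
    """Map every preferred term and synonym (lower-cased) to its preferred term."""
    index = {}
    for preferred, alternatives in synonyms.items():
        for term in [preferred, *alternatives]:
            index[term.lower()] = preferred
    return index


def resolve(query, index):
    """Return the preferred category for a search term, or None if unknown."""
    return index.get(query.strip().lower())
```

With an index built by `build_index(SYNONYMS)`, a resident searching for "rubbish" and one searching for "refuse" are both directed to the same category, whichever partner's site holds the pages.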

With the taxonomy developed, the partners then had to implement it. The number of different content management systems used meant that the partners had to go back to their content management software suppliers to obtain application programming interfaces (APIs) to connect them with the APR Smartlogik system.

Many partners, however, had no content management system and have delayed entering the project until they have one, rather than re-index all their content. To avoid system overloads, individual partner web sites are 'spidered' at night - searched by the master system - with batches of pages being categorised at a time until the whole site is within the taxonomy. About one-fifth of pages cannot be automatically categorised, mainly because they have too little text or stray too far from corporate publishing standards to be categorised correctly. Sperring hopes this figure will decrease in time.
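The batch approach can be illustrated with a short sketch. It is not a description of how Semaphore works: the batch size, the minimum word count and the function names are assumptions, and `classify` stands in for whatever automatic tagger is in use. In practice each batch would run in its own overnight window rather than in a single loop.

```python
def batches(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def categorise_site(pages, classify, batch_size=100, min_words=30):
    """Categorise pages batch by batch, setting aside those that cannot be tagged.

    `pages` maps URL -> page text; `classify` is any callable returning a
    category name or None. Both thresholds are illustrative assumptions.
    """
    categorised, skipped = {}, []
    for batch in batches(list(pages.items()), batch_size):
        for url, text in batch:
            if len(text.split()) < min_words:
                skipped.append(url)  # too little text to categorise reliably
                continue
            category = classify(text)
            if category is None:
                skipped.append(url)  # the tagger found no matching category
            else:
                categorised[url] = category
    return categorised, skipped
```

The `skipped` list corresponds to the pages that need manual attention, so tracking its length over time would show whether the proportion of uncategorised pages is falling.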

As well as improving search facilities for web site users and ensuring the partnership is well on course to meet its e-government targets by the end of the year, the taxonomy has also made it quicker for council call centre staff to locate information when taking calls from the public. Each operative can now take more calls per hour, reducing costs.

With more than two-thirds of partners' sites included in the portal search, Sperring hopes the addition of the remaining one-third by the end of the year will reduce costs further. "Maintaining a taxonomy can be an incredibly large overhead. But as we work on developing a more regional taxonomy, the more people we put in the pot, the more widely it's spread and the cheaper it gets. And customers will still be able to search over a wider area and get similar returns," he says.

Rob Buckley – Freelance Journalist and Editor
