The Bayesian haze
- Infoconomist, March 2001
Knowledge management is an area in which Europe has been doing very well. The star of the sector is Autonomy, but is its heady valuation deserved?
THE BAYESIAN WARS
The atmosphere between SmartLogik and Autonomy is now icy. “I've dealt a lot with analysts and some journalists,” says SmartLogik CEO Stephen Hill. “They all say certain other companies are extraordinarily arrogant and rather difficult to do business with. I'd hate to be labelled as that.”
In spite of their common roots, Autonomy's and SmartLogik's technologies have diverged over the years, which means there is plenty of scope for these companies – and others joining the fray – to get involved in the Holy War over which search technology is best. The issue is not so much speed, or the volume of documents retrieved, but which approach returns the most relevant pages for any given search.
Among those on the opposite side of the court to Autonomy is Verity, a former market leader and the biggest proponent of standard 'keyword' searches over Bayesian 'pattern recognition'. “The problem with Bayesian is it's automated,” says UK CEO Simon Atkinson. “You can't control what it does.”
Autonomy spokesman Simon Fletcher counters that Verity's claims are not very meaningful. “Why would you want to control the system? When you do that, you get in all sorts of trouble. It's not economical, having a team of 50 tagging up documents.”
But John Western, a technical consultant with Verity, argues that there are other problems too: “Bayesian has difficulties with fine-grain distinctions, particularly as you add more documents to the system. Every new document you add depends on what's happened to the previous documents. If you order your work differently, you'll get different results. That implies you have to do more work [to organise the database] and therefore the system doesn't scale as well.” Categorisation of documents by humans or by rules is a better system, believes Verity, because the intelligent agents used by Autonomy et al. cannot understand the documents — “only we can”.
Fletcher, of course, fundamentally disagrees. “How many times do you want a specific document, when you really want something that gives you information?” he counters. “The only reason we're fixated by the idea of the exact result is because of years of conditioning by keyword search systems.” On this issue, fellow Bayesian John Snyder, who now runs an Internet search company called Webtop, sides with his rival at Autonomy. The keyword approach means that people “spend about 10 minutes a day searching, according to our research. But 75% of searches don't deliver useful information. People have learnt that if they type in 10 keywords, they get nothing, but if they type in one keyword, they get lots. It's not really helpful. Probabilistic information retrieval, particularly with relevance feedback from the user, gets better results, as proved by the TREC tests.”
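Snyder's point about probabilistic retrieval with relevance feedback can be sketched in a few lines. The following is an illustrative toy, not Autonomy's or Webtop's code: it contrasts an all-terms-must-match keyword search with a ranking that weights each query term by a smoothed log-odds estimate (in the style of the classic probabilistic relevance weights used in the TREC literature) of how strongly that term predicts the documents a user has marked relevant.

```python
import math

def keyword_match(query, docs):
    """Plain keyword search: a document matches only if it
    contains every query term - the behaviour Snyder criticises."""
    terms = set(query.lower().split())
    return [d for d in docs if terms <= set(d.lower().split())]

def probabilistic_rank(query, docs, relevant_ids=()):
    """Rank documents by summing, for each query term they contain,
    a smoothed log-odds weight estimated from relevance feedback
    (which documents the user marked relevant)."""
    N, R = len(docs), len(relevant_ids)
    doc_words = [set(d.lower().split()) for d in docs]
    ranked = []
    for i, words in enumerate(doc_words):
        score = 0.0
        for t in query.lower().split():
            n = sum(1 for w in doc_words if t in w)          # docs containing t
            r = sum(1 for j in relevant_ids if t in doc_words[j])
            # smoothed log-odds that t indicates relevance
            w = math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                         ((n - r + 0.5) * (R - r + 0.5)))
            if t in words:
                score += w
        ranked.append((score, i))
    return [i for _, i in sorted(ranked, reverse=True)]
```

With feedback, a term shared by the marked-relevant documents pulls similar documents up the ranking even when a strict keyword match would have excluded them.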
Another issue, he adds, is that when a new concept crops up that cannot be included in current categorisations, Bayesian methods can still cope while keyword methods have to wait until the system is updated. Autonomy's own research suggests that the time wasted by employees on searches amounts to nearly an hour a day, or £17 billion a year.
There is a third way of approaching the problem: semantic information retrieval – trying to break down queries into parts of speech in an attempt to understand the question. “For example, 'murder of a child', 'murder by a child' and 'murder with a child' are three very different legal concepts, yet Boolean, keyword and even probabilistic searches probably won't be able to pick up the difference.” So far, research has had limited results, but a key exponent, 3F, another UK company, was acquired in March 2000 by Mindmaker, a Californian company that specialises in speech technology and intelligent assistant products.
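The weakness the example highlights is easy to demonstrate: once a keyword engine lowercases a query and strips common stopwords, the three legally distinct phrases collapse to the same bag of words, so nothing is left to distinguish them. A minimal sketch (the stopword list is illustrative):

```python
STOPWORDS = {"of", "by", "with", "a", "the"}

def keyword_terms(query):
    """Typical keyword/bag-of-words preprocessing: lowercase,
    drop stopwords, ignore word order."""
    return {w for w in query.lower().split() if w not in STOPWORDS}

queries = ["murder of a child", "murder by a child", "murder with a child"]
# All three legally distinct queries collapse to the same term set:
assert all(keyword_terms(q) == {"murder", "child"} for q in queries)
```

The prepositions that carry the legal meaning are exactly the words such systems discard, which is the gap semantic parsing tries to close.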
“Academically, you always want a pure solution, but a business solution needs to be timely, accurate and fast. Semantic analysis is important, but it's expensive in processing terms,” says Snyder. SmartLogik CTO John Challis believes that semantic technology, while it works well in some niche areas, is not ready for prime time yet. “It has problems when working with documents it hasn't come across before.” And, of course, with documents in a different language, it has as much difficulty understanding them as a human would. Bayesian, in contrast, “is language independent and automatically recognises terms it hasn't seen before.”
BAYESIANS MULTIPLY
While academic interest in the semantic approach continues, the Bayesian camp continues to expand. Cambridge-based NCorp, another spin-off from Lynch's Cambridge Neurodynamics, specialises in searching heavily structured databases using Bayesian techniques. Another company, Applied Psychological Research, set up in 1998 by a group of academics from London's City University, uses Bayesian techniques to build up profiles of users by seeing how they rate documents. “We focus heavily on the error in any search and try to reduce the consequences of that error as much as possible by finding out as much as possible about the individual making the search,” says CEO Daniel Brown.
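Brown's description – learning a profile of a user from how that user rates documents – can be sketched as a simple count-based model. This is a hypothetical illustration of the general idea, not APR's actual method: it tallies up/down ratings per term and converts them into a smoothed probability that a document containing the term would please this user.

```python
def update_profile(profile, document, liked):
    """Record one rating: for each distinct term in the document,
    bump its up-count (liked) or down-count (disliked).
    `liked` is a bool; True counts as 1, False as 0."""
    for term in set(document.lower().split()):
        up, down = profile.get(term, (0, 0))
        profile[term] = (up + liked, down + (not liked))
    return profile

def term_affinity(profile, term):
    """Smoothed (Laplace) probability that a document containing
    `term` would be rated up by this user."""
    up, down = profile.get(term, (0, 0))
    return (up + 1) / (up + down + 2)

profile = {}
update_profile(profile, "bayesian search methods", True)
update_profile(profile, "keyword search tips", False)
```

After two ratings the profile already prefers 'bayesian' over 'keyword', while a term seen in both liked and disliked documents ('search') stays neutral at 0.5 – a toy version of reducing search error by learning about the searcher.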
