Rob Buckley – Freelance Journalist and Editor

10,100,000 documents returned*


Enterprise search engines need more work if they are to satisfy user demands for accurate, relevant results.


Dissatisfaction with enterprise search engines is rife. Users searching corporate web sites or their employer's intranet system frequently find that they get results that are irrelevant or too numerous to be helpful. Organisations that invest in search technology, meanwhile, end up with disappointing online sales, web sites that appear under-populated and intranets that 'hide' much-needed information from employees.

At the heart of the problem is a simple fact: enterprise search engines on web sites and intranets need to solve different problems from those used to perform more general Internet searches. The Google search engine, for instance, does not have to produce a specific web page in response to a general query; it only has to produce a web page that answers the query. For users interested in Manchester United, for example, Google does not necessarily have to send them to the official Manchester United web site, only a site that provides information about Manchester United.

An intranet or public-facing corporate search engine is expected to provide a more tailored set of results, but therein lie a number of problems: this kind of search engine may have fewer pages to choose from, but many of them will be on similar topics. It has to look at far more than just web pages, too, since most corporate documents are in PDF, Word or other non-HTML formats. It must respect the security of documents that are intended only for authorised readers and not for general consumption. It must understand a range of terms that might be used by different users seeking the same information; a user searching a council web site for “dole” needs to be pointed to documents that refer to the jobseekers' allowance, for example. And unlike Google, which uses the number of links to a page to determine how important or useful that page is, a corporate search engine has no such human vetting of documents from the outset.
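The “dole” example above amounts to thesaurus-based query expansion. A minimal sketch of the idea, with an invented thesaurus (the terms and synonyms here are illustrative, not from any vendor's product):

```python
# Hypothetical thesaurus mapping colloquial search terms to the
# official vocabulary used in the documents being indexed.
THESAURUS = {
    "dole": ["jobseekers' allowance", "unemployment benefit"],
    "council tax": ["local taxation"],
}

def expand_query(query: str) -> list[str]:
    """Return the user's term plus any synonyms the thesaurus knows,
    so documents using the official terminology still match."""
    terms = [query]
    terms += THESAURUS.get(query.lower(), [])
    return terms

print(expand_query("Dole"))
# ['Dole', "jobseekers' allowance", 'unemployment benefit']
```

A real engine would apply this expansion per keyword and weight synonym matches lower than exact matches, but the principle is the same: the thesaurus bridges the gap between how users speak and how documents are written.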

Destined for disappointment?
Search-engine developers have used a variety of techniques to produce useful corporate search results to meet these challenges (see table, Understanding search). Natural language querying (sometimes known as natural language processing), which enables queries to be composed in regular English rather than using just keywords, is arguably the newest, state-of-the-art addition to the tools available to meet the first challenge. It works by applying grammatical rules to find and understand words in a particular category, such as product names. The ultimate goal of NLP is to design and build software that can analyse, understand and generate language that humans use naturally, so that, eventually, users can address their computer in the same way that they would address another person.

But this approach has flaws, according to Simon Harvey, Open Text's product marketing manager in Europe, the Middle East and Africa. “If I asked for something like 'discussions I had with James about tax returns', I would still get back a mass of results talking about 'discussions' and 'James' and 'tax returns',” he says. Open Text is one of several vendors looking at making searches 'context-aware', says Harvey, so that the search engine knows that the user is looking for something about tax returns; that a discussion is a type of activity, so it should exclude content from documents, news headlines and so on; and that James is a person, so it should only look for items that he has posted in a discussion forum, for instance.
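The context-aware idea Harvey describes can be sketched as turning the query into typed constraints rather than a bag of keywords. The data model and field names below are hypothetical, not Open Text's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Item:
    content_type: str   # e.g. "discussion", "document", "news"
    author: str
    topic: str

def context_aware_search(items, content_type, author, topic):
    """Keep only items satisfying every typed constraint: the right
    kind of content, by the right person, about the right topic."""
    return [i for i in items
            if i.content_type == content_type
            and i.author == author
            and topic in i.topic]

corpus = [
    Item("discussion", "James", "tax returns 2004"),
    Item("document", "James", "tax returns"),    # wrong type: excluded
    Item("discussion", "Anna", "tax returns"),   # wrong person: excluded
]
hits = context_aware_search(corpus, "discussion", "James", "tax returns")
print(len(hits))  # 1
```

Keyword matching would return all three items here; the typed query returns only the discussion James actually posted.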

According to Glenn Kelman, founder and vice president of product management and marketing at portal software vendor Plumtree, most search techniques will produce similar results, regardless of their approach, and the enthusiasm of academic researchers does not translate into a radically better real-world experience. “Search engine guys always talk about algorithms. They talk about 60GB of this with an index of 20GB, Bayesian this and that. But text search is a standard PhD project. The pragmatic reality is that algorithms are a dime a dozen.” Independent benchmarking, moreover, seems to back this up (see box, Similar but different?).

Perhaps acknowledging that, no matter what the algorithm, most searches produce the same results (or that the perfect search is simply too hard a problem for current or foreseeable technology), the majority of vendors are focused on two main areas for improving their software further: categorisation and interface.

Two-pronged approach
Categorisation exists at two levels: the initial categorisation of the documents that are being searched; and the representation of these documents in the results of searches. Creating an initial taxonomy or categorisation scheme for documents enables organisations to 'teach' their search engines about the different kinds of documents they have and how they relate to each other. In conjunction with a 'thesaurus' of words that may indicate to which category a document belongs, the search engine can return documents that it decides are relevant. Creating an interface that makes it easy for users to navigate through these categories so that they can discard irrelevant documents or results goes hand-in-hand with this approach. “Rather than be presented with a hit list of 100,000 items [from a two-keyword search], of which only the top 50 will be read, it is far more beneficial to have the hit list fragmented into logical groups,” says Phil Lewis, international technical director at search engine company Convera.
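Fragmenting a hit list into logical groups, as Lewis suggests, is straightforward once each result carries a category. A minimal sketch, with invented documents and categories:

```python
from collections import defaultdict

def group_hits(hits):
    """Group (title, category) result pairs into a dict keyed by
    category, so the interface can show one block per group rather
    than a single undifferentiated list."""
    groups = defaultdict(list)
    for title, category in hits:
        groups[category].append(title)
    return dict(groups)

results = [
    ("Annual report 2003", "Finance"),
    ("Pension scheme FAQ", "HR"),
    ("Q2 accounts", "Finance"),
]
print(group_hits(results))
# {'Finance': ['Annual report 2003', 'Q2 accounts'],
#  'HR': ['Pension scheme FAQ']}
```

The hard part, as the article goes on to explain, is not this grouping step but assigning the categories in the first place.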

Taxonomies can be useful in enabling an organisation to construct an initial search facility, but they can also require substantial amounts of ongoing manual work to make them useful. Jack Jia, chief technology officer of content management software company Interwoven, highlights the case of IBM, which he says developed a relatively simple taxonomy with 50 primary categories but took a year and a half to arrive at a final list.

It is possible to generate taxonomies automatically - for example, by looking at the contents of a database field and generating a list of all the possible entries - but Jia says there are pros and cons to both automatic and manual or semi-manual approaches. “The beauty of the automatic solution is that it's quick, and does not need human intervention. But the problem is the quality suffers so you may not get a very good taxonomy. A computer can certainly do algorithms and do all kinds of information retrieval based on the taxonomy, but at the end, it may not match with what people want to see or what the organisation is willing to accept,” he says. While the development of an off-the-shelf search engine that can generate taxonomies may one day be possible (it may even be specific to the purchasing company's vertical industry), he adds, at least 50% of getting it to work properly will still come from fine-tuning and developing manual taxonomies in conjunction with information science specialists such as librarians.
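The automatic approach Jia mentions - deriving categories from the distinct values of a database field - can be sketched in a few lines. The table, column and sample data here are hypothetical:

```python
import sqlite3

# An in-memory table standing in for a real document repository.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (title TEXT, department TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [("Holiday policy", "HR"),
     ("Q2 accounts", "Finance"),
     ("Payroll guide", "HR")],
)

# Each distinct field value becomes a candidate taxonomy category.
categories = sorted(
    row[0] for row in
    conn.execute("SELECT DISTINCT department FROM documents")
)
print(categories)  # ['Finance', 'HR']
```

This illustrates both sides of Jia's point: the list is produced instantly with no human effort, but it is only as good as the underlying field - misspelt or inconsistent entries become spurious categories, which is where the manual fine-tuning comes in.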

