
Double trouble
- M-iD, September 2005
Duplicate records are a nightmare for any records manager. Not only do they call into question the authority and reliability of the records, they also demonstrate that records management processes within the company are failing, increase storage requirements, make it difficult to demonstrate compliance with legislation and make it harder for users to find the information they want.
The problem is sufficiently big for Gartner Research analysts Debra Logan and Mark Gilbert to maintain that having duplicate records in a records management system is as bad as not having a system at all.
Avoiding duplication and eliminating it when it is found should, therefore, be high on any records manager's priority list. But with no well-documented standards or methods for de-duplication, it is hard to know where to begin.
The biggest concern with removing duplicate records is that records should only be deleted according to deletion schedules. Convincing an auditor that a record had to be deleted because it was identical to another can be a very hard task, and depends on the reliability of the audit log, the hashing algorithm used, the process followed and so on.
It is far better to remove duplicates before anyone in the organisation declares them as records. Experts differ, however, on whether duplications are better prevented through technology or process.
Mark J Lewis, storage company EMC's marketing manager in EMEA for its Centera system, naturally argues that the solution must be technological. "You can introduce a process, but human nature dictates that people will naturally save the same file under a different name. Process is a way, but it's not really foolproof."
However, Computacenter consultancy practice leader Simon Gay maintains that processes do need to be put in place to avoid duplication. "Even with the smartest of tools, if your processes are wrong or the staff are not dealing with issues correctly and records aren't being put in in the right way, then tools won't catch up with things. You should always start with people and procedures."
Start simply
The kinds of processes that organisations need to put in place differ according to the types of documents produced and processed and the size of the organisation. If the organisation scans paper documents for eventual storage in a records management system, it will need a process to ensure that no document is scanned twice: any document, no matter how perfect the scanner, will produce a different digital document each time it is scanned.
This process might involve forms management. By adding a barcode or ID that an OCR system can read, systems will be able to flag up documents that have already been scanned. A barcode-based system needs a bigger investment in software than a simple ID system, but is better suited to larger volumes of scanning.
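Whatever capture product is used, the underlying check is simple enough to sketch. The register, function and field names below are hypothetical, standing in for whatever database or forms-management system actually holds the list of IDs already captured:

```python
# Minimal sketch: flag documents that have already been scanned by checking
# the ID read from the page (barcode or OCR) against a register of
# previously captured document IDs. All names here are illustrative.
scanned_register = set()          # in practice, a table in the capture system

def capture_document(document_id: str, image_path: str) -> bool:
    """Return True if the document is new and was captured, False if it is a re-scan."""
    if document_id in scanned_register:
        print(f"Duplicate scan flagged: {document_id} ({image_path}) already captured")
        return False
    scanned_register.add(document_id)
    # ... hand the scanned image on to the records management system here ...
    return True
```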
This system won't work with documents that did not originate within the organisation, however, so other processes will be needed. Smaller businesses can ensure that all documents are scanned in a single mail room, for example, but a large multi-national has no such option, so the processes chosen will vary from organisation to organisation. Typically, they will involve some form of inventory system that tracks documents. If maintained at a departmental level or in a small business, the inventory may be something as simple as a piece of paper or a spreadsheet.
Says Computacenter's Gay, "It's easy to criticise, but if people can implement it and make it work for a small team of people, very, very quickly, actually there's a lot to be said for it. Scale will determine the outcome there."
The main requirement for these capture-related processes is that all the staff involved know about them, and that they are well documented and easy to follow.
Content management control
When dealing with electronic documents, a content management system will be able to restrict users' abilities to duplicate documents. It can also impose versioning controls and a taxonomy system. This will enable users to classify documents according to their themes so that duplicates, if they do arise, will be easier to spot.
Many content management systems, such as Open Text's LiveLink and IBM DB2 Common Store, will also use linking to prevent document duplication. Tracy Caughell, product manager at Open Text, explains: "If a user is unaware that a document with the exact same data is already in the system but under a different classification, LiveLink will only store one version and then use a pointer to the other document instead. That all happens without the user even knowing."
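Open Text does not publish the internals of that mechanism, but the general technique, often called single-instance storage, boils down to a lookup on a hash of the document's content: if the bytes are already held, only a pointer is written. The stores and function below are assumptions for illustration, not any vendor's actual API:

```python
import hashlib

# Minimal sketch of single-instance storage: documents are keyed by a hash of
# their content, and a second save of identical bytes only records a pointer.
# The dictionaries stand in for whatever repository the CMS actually uses.
blob_store = {}    # content hash -> document bytes (stored once)
catalogue = {}     # (classification, name) -> content hash (pointer)

def save_document(classification: str, name: str, content: bytes) -> str:
    digest = hashlib.sha256(content).hexdigest()
    if digest not in blob_store:
        blob_store[digest] = content            # first copy: store the bytes
    catalogue[(classification, name)] = digest  # later copies: pointer only
    return digest
```

The user sees a document filed under their own classification; the repository sees one set of bytes.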
Content management systems will often allow users to check out documents so they can use them on their laptops. Without appropriate controls, this can often lead to duplication as different people work on copies of the same document before checking them back in. Tight policies that only allow a single instance of the document (or none at all) to be checked out can prevent this.
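A single-instance check-out policy amounts to a simple lock on the document record. The check_out/check_in pair below is a hypothetical sketch of that policy, not a particular product's interface:

```python
# Minimal sketch of a single-instance check-out policy: a document can be
# checked out to only one user at a time, so parallel copies never come back in.
checked_out = {}      # document id -> user currently holding it

def check_out(doc_id: str, user: str) -> bool:
    holder = checked_out.get(doc_id)
    if holder is not None:
        print(f"{doc_id} is already checked out to {holder}; refusing a second copy")
        return False
    checked_out[doc_id] = user
    return True

def check_in(doc_id: str, user: str) -> None:
    if checked_out.get(doc_id) == user:
        del checked_out[doc_id]
```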
Imam Hoque, head of the technology innovation group at Detica, advises organisations to use read-only document formats such as PDF to prevent alterations being made to documents if they're checked out. "You should try where possible to change business processes so that people find it easier not to send original versions of documents."
Andy Maurice, head of consulting at Iron Mountain, recommends creating a list of 'registered' and 'unregistered' documents. The registered documents are the master documents that need to be kept, whereas unregistered documents are working versions that can be deleted. The important thing, he says, is that people are aware that the unregistered versions carry no weight and that any copy they make of a registered document is unregistered and should be destroyed when finished with; certainly, the copies should almost never be declared as records.
Similar versus same
Deciding how similar documents need to be before they are deemed duplicates is something every organisation needs to consider, sometimes case by case in highly regulated environments. Documents identical in every way are clearly duplicates, but documents that are identical in content yet differ in metadata could be duplicates for organisations working in one market, but not for organisations in another. "Theoretically, if the metadata associated with a document is used as part of a business process, that could change the context and meaning in which the document is being used," says Hoque. "Something like the date of approval might be an important event for auditors."
Certain systems will throw up these fuzzy duplicates as a matter of course. Emails sent to multiple recipients will create different copies of the same mail for each recipient. These might differ in metadata, such as the exact path taken to reach the recipient, the order of recipients in the to: field and so on. Yet many organisations will regard them as duplicates that take up vital space on a mail server.
Some systems can take care of this duplication automatically. IBM's Common Store system provides additional folders in Outlook and Lotus Notes, into which users can drag any mail messages they want to declare as records. At the point of declaration, the system will check for an existing copy of the message and warn the user if there is one. LiveLink, by contrast, won't try to prevent duplication of mail messages, but will use hashing algorithms to work out if email attachments in the system are identical and then consolidate them.
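Open Text has not documented the exact algorithm, but consolidating identical attachments generally comes down to hashing the attachment bytes and keeping each distinct hash only once, with every message holding a reference rather than its own copy. The message and store structures below are purely illustrative:

```python
import hashlib

# Minimal sketch of attachment consolidation: each distinct attachment is
# stored once, keyed by a hash of its bytes; messages keep only the hash.
attachment_store = {}   # sha-256 hash -> attachment bytes

def archive_attachment(data: bytes) -> str:
    digest = hashlib.sha256(data).hexdigest()
    if digest not in attachment_store:
        attachment_store[digest] = data   # first occurrence: keep the bytes
    return digest                         # every message references the hash

def archive_message(message: dict) -> dict:
    """Replace each attachment's bytes with a reference to the consolidated copy."""
    return {
        "headers": message["headers"],
        "body": message["body"],
        "attachments": [archive_attachment(a) for a in message["attachments"]],
    }
```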
Deletion dangers
Sooner or later, however, duplicates will arise and organisations need to consider how best to deal with them. An automated approach with some manual intervention is the best option for most organisations. Some of the newer content management systems have de-duplication functions built in, and there are products such as Mobius's ViewDirect-ABS that claim to reduce duplication as well. In older systems, reporting tools that sort documents by metadata may be able to produce an inventory of similar documents.
Iron Mountain's Maurice advises his clients to set up the inventory of master documents first, using metadata attributes such as creator, creation date, modification date and so on to identify the master documents. They can then be sure which is the duplicate or the redundant version and which is the document they should retain.
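In a system without built-in de-duplication, that inventory can be approximated by a report that groups documents on a few metadata attributes and treats the earliest version in each group as the candidate master. The field names below are assumptions, chosen only to illustrate the grouping:

```python
from collections import defaultdict

# Minimal sketch of a duplicate inventory: group documents on metadata that
# should match for true duplicates, then treat the earliest-created item in
# each group as the candidate master. Field names are illustrative.
def duplicate_inventory(documents):
    groups = defaultdict(list)
    for doc in documents:
        key = (doc["title"].strip().lower(), doc["size"], doc["creator"])
        groups[key].append(doc)

    report = []
    for docs in groups.values():
        if len(docs) > 1:
            docs.sort(key=lambda d: d["created"])   # oldest first = candidate master
            report.append({"master": docs[0], "candidates": docs[1:]})
    return report
```

The output is a starting point for manual review, not a deletion list: someone still has to confirm which copy is the record to retain.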
Grant Edgar, managing consultant at Open Text Global Services, says that most organisations should use the identification of duplicates as a chance to understand where processes are going wrong, rather than try to fix the problem retroactively unless it is a significant storage problem or an impediment to navigation. "They can then change the process and instruct the staff accordingly," he says.
If two duplicate documents are both declared records, organisations need to be very careful before deleting them. If the two documents are identical, then creating a hash of one of the documents and storing that along with an audit log of the deletion should be enough to convince an auditor that the organisation deleted the record properly. For absolute peace of mind, the record metadata of the deleted document might be kept with a pointer indicating the remaining document contains the file content.
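The evidence an auditor would want to see can be captured at the moment of deletion. The sketch below, with all names hypothetical, records a hash of the deleted record, a pointer to the retained copy and a timestamp, and refuses to proceed if the two records are not byte-identical:

```python
import hashlib
from datetime import datetime, timezone

audit_log = []   # stands in for a write-once audit store

def delete_duplicate(duplicate_id: str, duplicate_bytes: bytes,
                     retained_id: str, retained_bytes: bytes) -> None:
    """Delete a duplicate record only if it is byte-identical to the retained one."""
    dup_hash = hashlib.sha256(duplicate_bytes).hexdigest()
    kept_hash = hashlib.sha256(retained_bytes).hexdigest()
    if dup_hash != kept_hash:
        raise ValueError("Records differ; deletion must follow the retention schedule")
    audit_log.append({
        "deleted_record": duplicate_id,
        "content_hash": dup_hash,
        "retained_record": retained_id,   # pointer to the surviving copy
        "deleted_at": datetime.now(timezone.utc).isoformat(),
    })
    # ... remove the duplicate from the repository here ...
```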
However, once documents start to differ in metadata or content, it becomes far harder to argue convincingly that they and their record data needed to be deleted. Open Text's Caughell says that the key thing auditors look for is consistency in how organisations conduct processes, and provided that consistency is maintained, deletion of 'fuzzy' duplicates may be possible.
However, Mike Blake, of IBM's information management software division, says that by the time a duplicate has been declared a record, it's too late to delete it. "In records management, you really are trying to remove the power of deletion except through policy. Once something's declared a record, it's sacrosanct, it's in the system and it's not going away."
De-duping software vendors

| Product | Vendor | URL | Comments |
| --- | --- | --- | --- |
| AXS-One compliance platform | AXS-One | www.axsone.com | While not removing duplicates, unifies search so that duplicate items are merged into a single view |
| Content Migrator | Active Navigation | www.activenavigation.com | Uses natural language processing to create summaries of documents and tag terms, then compares the results from different documents to identify duplicates |
| DB2 Common Store | IBM | www.ibm.com | Identifies duplicates using metadata and hashing |
| EMC Centera | EMC | www.emc.com | Uses content addressed storage to avoid duplicate documents being saved |
| Livelink Records Server | Open Text | www.opentext.com | Uses hashing to identify duplicate documents |
| matchIT | helpIT Systems | www.helpit.com | Uses fuzzy matching algorithms to spot duplicate data that may differ only by spelling mistakes, etc |
| ViewDirect-ABS | Mobius | www.mobius.com | Has a duplicate item detection facility that can spot duplicate emails in different email systems; also works in other applications, such as duplicate cheque processing |
| Vignette Records & Documents | Vignette | www.vignette.com | Provides hashing, records pointers and structured language for taxonomies |
The benefits of de-duping
The Crown Prosecution Service was recently able to learn the benefits of de-duping first hand. The organisation had first tried to identify duplicates by cleaning up its taxonomies, but had abandoned the project after dedicating 120 staff hours to the task. It then deployed Active Navigation's Content Migrator to analyse the tens of thousands of documents it had. After 20 days, it had discovered that up to 30% of the documents it had stored were potential duplicates.
The deletions cut the time taken to make back-ups by 15% and improved workplace efficiency, with one worker's productivity rising 20% thanks to the system's improved information retrieval.