Rob Buckley – Freelance Journalist and Editor

Double trouble

M-iD, September 2005
Duplicate records undermine the credibility of any records management system, but preventing slip-ups is a complex challenge.


Sooner or later, however, duplicates will arise and organisations need to consider how best to deal with them. An automated approach with some manual intervention is the best option for most organisations. Some of the newer content management systems have de-duplication functions built in, and there are products such as Mobius's ViewDirect-ABS that claim to reduce duplication as well. In older systems, reporting tools that sort documents by metadata may be able to produce an inventory of similar documents.

Iron Mountain's Maurice advises his clients to set up the inventory of master documents first, using metadata attributes such as creator, creation date, modification date and so on to identify the master documents. They can then be sure which is the duplicate or the redundant version and which is the document they should retain.
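A minimal sketch of the metadata-driven approach Maurice describes, grouping documents by shared attributes and keeping the most recently modified copy as the master. The field names and grouping key are illustrative assumptions, not taken from any particular records management product:

```python
from dataclasses import dataclass
from datetime import datetime
from collections import defaultdict

# Hypothetical metadata record -- real systems expose many more attributes.
@dataclass
class DocMeta:
    doc_id: str
    title: str
    creator: str
    created: datetime
    modified: datetime

def build_master_inventory(docs):
    """Group documents sharing a title and creator, treat the most
    recently modified copy as the master, and flag the rest as
    candidate duplicates for manual review."""
    groups = defaultdict(list)
    for doc in docs:
        groups[(doc.title.lower(), doc.creator.lower())].append(doc)

    masters, duplicates = [], []
    for group in groups.values():
        group.sort(key=lambda d: d.modified, reverse=True)
        masters.append(group[0])       # newest copy becomes the master
        duplicates.extend(group[1:])   # older copies queued for review
    return masters, duplicates
```

In practice the grouping key would be tuned to the organisation's own metadata schema, and the candidate list would go to a person rather than straight to deletion.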

Grant Edgar, managing consultant at Open Text Global Services, says that most organisations should use the identification of duplicates as a chance to understand where processes are going wrong, rather than try to fix the problem retroactively unless it is a significant storage problem or an impediment to navigation. "They can then change the process and instruct the staff accordingly," he says.

If two duplicate documents are both declared records, organisations need to be very careful before deleting them. If the two documents are identical, then creating a hash of one of the documents and storing that along with an audit log of the deletion should be enough to convince an auditor that the organisation deleted the record properly. For absolute peace of mind, the record metadata of the deleted document might be kept with a pointer indicating the remaining document contains the file content.
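The hash-plus-audit-log idea can be sketched in a few lines. This is an assumed shape for the audit entry, not any specific product's format; the field names are hypothetical:

```python
import hashlib
from datetime import datetime, timezone

def record_duplicate_deletion(content: bytes, doc_id: str,
                              surviving_doc_id: str, audit_log: list) -> str:
    """Before deleting a duplicate record, store a hash of its content
    plus an audit entry pointing at the surviving document, so an
    auditor can verify what was deleted and where the content remains."""
    digest = hashlib.sha256(content).hexdigest()
    audit_log.append({
        "deleted_doc": doc_id,
        "content_sha256": digest,            # proof of what was deleted
        "surviving_doc": surviving_doc_id,   # pointer to the retained copy
        "deleted_at": datetime.now(timezone.utc).isoformat(),
    })
    return digest
```

Because the two documents are byte-identical, the stored hash also matches the surviving copy, which is what lets the organisation demonstrate that no content was lost.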

However, once documents start to differ in metadata or content, it becomes far harder to argue convincingly that they and their record data needed to be deleted. Open Text's Caughell says that the key thing auditors look for is consistency in how organisations conduct processes, and provided that consistency is maintained, deletion of 'fuzzy' duplicates may be possible.
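One common way to flag such 'fuzzy' duplicates is a text-similarity ratio with a fixed threshold. This sketch uses Python's standard-library sequence matcher; the 0.9 threshold is an illustrative policy choice, and, as Caughell notes, whatever value is chosen matters less than applying it consistently:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; 1.0 means the texts are identical."""
    return SequenceMatcher(None, a, b).ratio()

def is_fuzzy_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    # Threshold is a policy decision; consistency is what auditors look for.
    return similarity(a, b) >= threshold
```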

However, Mike Blake, of IBM's information management software division, says that by the time a duplicate has been declared a record, it's too late to delete it. "In records management, you really are trying to remove the power of deletion except through policy. Once something's declared a record, it's sacrosanct, it's in the system and it's not going away."

De-duping software vendors

Product | Vendor | URL | Comments
AXS-One compliance platform | AXS-One | www.axsone.com | While not removing duplicates, unifies search so that duplicate items are merged into a single view
Content Migrator | Active Navigation | www.activenavigation.com | Uses natural language processing to create summaries of documents and tag terms, then compares the results from different documents to identify duplicates
DB2 Common Store | IBM | www.ibm.com | Identifies duplicates using metadata and hashing
EMC Centera | EMC | www.emc.com | Uses content-addressed storage to prevent duplicate documents from being saved
Livelink Records Server | Open Text | www.opentext.com | Uses hashing to identify duplicate documents
matchIT | helpIT Systems | www.helpit.com | Uses fuzzy matching algorithms to spot duplicate data that may differ only by spelling mistakes, etc
ViewDirect-ABS | Mobius | www.mobius.com | Has a duplicate item detection facility that can spot duplicate emails in different email systems; also works in other applications, such as duplicate cheque processing
Vignette Records & Documents | Vignette | www.vignette.com | Provides hashing, record pointers and structured language for taxonomies

The benefits of de-duping

The Crown Prosecution Service was recently able to learn the benefits of de-duping first hand. The organisation had first tried to identify duplicates by cleaning up its taxonomies, but had abandoned the project after dedicating 120 staff hours to the task. It then deployed Active Navigation's Content Migrator to analyse the tens of thousands of documents it had. After 20 days, it had discovered that up to 30% of the documents it had stored were potential duplicates.

The deletions cut the time taken to make back-ups by 15% and improved workplace efficiency, with one worker's productivity rising 20% thanks to the system's faster information retrieval.

