Raw e-data, as collected, contains a tremendous amount of duplication. Emails are sent to multiple recipients. Copies of the same file are found in multiple locations, such as an employee’s desktop, a file server, a flash drive, and backup tapes. Identifying duplicates so that just one copy of each item progresses through the process, and especially so that only one copy is reviewed, is key to a timely and cost-effective project.
Our approach is to remove redundant items from the workflow while keeping the information about those duplicates for auditing, for defensibility, and for flexibility in review and production later on. We find exact duplicates by comparing the hash values of your documents and emails. We also find near duplicates, which differ only in minor, non-substantive ways.
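A minimal sketch of exact-duplicate detection by hash follows. The text above does not name a hash algorithm, so SHA-256 over the raw bytes of each item is an assumption here, and `content_hash` and `group_exact_duplicates` are hypothetical names used only for illustration. Near-duplicate detection is not covered by this sketch.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def content_hash(data: bytes) -> str:
    """Return a hex digest of the item's bytes (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def group_exact_duplicates(items: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Map each hash value to the IDs of the items that share that content."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for item_id, data in items.items():
        groups[content_hash(data)].append(item_id)
    return dict(groups)


# Hypothetical collection: the same file found in two locations, plus one other file.
items = {
    "desktop/report.doc": b"Q3 forecast",
    "fileserver/report.doc": b"Q3 forecast",
    "flashdrive/notes.txt": b"meeting notes",
}

groups = group_exact_duplicates(items)

# Any hash shared by more than one item marks a set of exact duplicates.
duplicate_sets = [ids for ids in groups.values() if len(ids) > 1]
```

In practice, emails are often hashed over a normalized set of fields rather than raw bytes, because copies of the same message can differ in transport details. That normalization is outside the scope of this sketch.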
De-duplication can be performed globally across all project data, or within the scope of a single custodian’s records or any other dataset.
Keeping the information about all copies of globally de-duped data is important. For example, suppose both Paul and Roger had in their collections a certain document deemed to be responsive. Through de-duplication, Roger’s copy is reviewed and Paul’s is not. However, the fact that the item had duplicates, and the identity of each duplicate’s custodian, are preserved. This information is available to the reviewer, and it can be used during production, when it may be necessary to produce the de-duped item from Paul’s collection even though it was reviewed only in Roger’s.
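The Paul and Roger example, together with the choice between global and custodian scope, can be pictured with a short sketch. Everything in it is an assumption made for illustration: `Item`, `ReviewRecord`, and `dedupe` are hypothetical names, the digests are placeholders standing in for real hash values, and keeping the first copy encountered as the reviewed copy is an arbitrary rule.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class Item:
    item_id: str
    custodian: str
    digest: str  # content hash, assumed to be computed elsewhere


@dataclass
class ReviewRecord:
    primary: Item  # the one copy that goes to review
    duplicates: List[Item] = field(default_factory=list)  # suppressed copies, kept for audit

    def custodians(self) -> List[str]:
        """All custodians who held a copy, reviewed copy first."""
        return [self.primary.custodian] + [d.custodian for d in self.duplicates]


def dedupe(items: Iterable[Item], scope: str = "global") -> List[ReviewRecord]:
    """Keep the first copy per hash; scope is 'global' or 'custodian'."""
    if scope not in ("global", "custodian"):
        raise ValueError(f"unknown scope: {scope!r}")
    records: Dict[Tuple[str, ...], ReviewRecord] = {}
    for item in items:
        # Global scope compares hashes across everyone; custodian scope
        # only compares hashes within the same custodian's records.
        key = (item.digest,) if scope == "global" else (item.custodian, item.digest)
        if key in records:
            records[key].duplicates.append(item)  # suppressed, but not forgotten
        else:
            records[key] = ReviewRecord(primary=item)
    return list(records.values())


collection = [
    Item("R-001", "Roger", "abc123"),
    Item("P-014", "Paul", "abc123"),  # same content as Roger's copy
    Item("P-015", "Paul", "def456"),
]

global_records = dedupe(collection, scope="global")
per_custodian_records = dedupe(collection, scope="custodian")
```

With global scope, Roger’s copy `R-001` goes to review and Paul’s copy `P-014` is suppressed but stays attached to the same record, so it can still be located and produced from Paul’s collection. With custodian scope, the two copies belong to different custodians, so both go to review.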