De-duplication (“de-duping”) is the process of comparing electronic records based on their content and characteristics and removing duplicate records from the data set, so that only one instance of an electronic record is produced when there are two or more identical copies. De-duplicating a data set reduces volume and makes review more efficient. There are three types of de-duplication: case, custodian, and production de-duplication.
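
As an illustration of comparing records by content, the sketch below fingerprints each record with a SHA-256 hash and keeps only the first record seen for each fingerprint. The use of a hash, the function names, and the sample records are assumptions made for illustration; the text above does not say how any particular processing tool identifies duplicates, and real tools may also compare metadata fields.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return a SHA-256 hex digest used as the record's fingerprint."""
    return hashlib.sha256(data).hexdigest()


def dedupe(records):
    """Keep the first record ID for each distinct content fingerprint.

    `records` is an iterable of (record_id, content_bytes) pairs
    (a hypothetical structure chosen for this sketch).
    """
    seen = set()
    unique = []
    for record_id, data in records:
        digest = content_hash(data)
        if digest not in seen:
            seen.add(digest)
            unique.append(record_id)
    return unique


# Hypothetical sample data: doc-2 is byte-identical to doc-1.
sample = [
    ("doc-1", b"Quarterly report"),
    ("doc-2", b"Quarterly report"),
    ("doc-3", b"Quarterly report v2"),
]
print(dedupe(sample))  # ['doc-1', 'doc-3']
```

Note that a hash of the raw bytes only catches exact copies; two files that differ by a single character are treated as distinct records.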

Case de-duplication involves retaining only a single copy of each document per case, irrespective of custodian. This is sometimes referred to as de-duplication across custodians. For example, if an identical document resides with Mr. A, Ms. B, and Miss C, only the first occurrence of the file (Mr. A’s) will be processed for review and production. Assuming those same facts, if one were to apply custodian-level de-duplication (i.e., de-duplication within a custodian), the system would maintain one copy for each of Mr. A, Ms. B, and Miss C, that is, one copy per custodian. Finally, if multiple copies of a document reside within the same production set, de-duplication at the production level ensures that only one of those documents is produced.
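
The difference between case-level and custodian-level de-duplication comes down to what counts as "the same document". The sketch below is one way to express that, assuming each document already carries a content fingerprint; the `Doc` type, the field names, and the sample fingerprint "h1" are hypothetical and chosen only to mirror the Mr. A, Ms. B, and Miss C example.

```python
from typing import NamedTuple


class Doc(NamedTuple):
    doc_id: str
    custodian: str
    digest: str  # content fingerprint, e.g. a hash (assumed precomputed)


def dedupe_by(docs, key):
    """Keep the first document seen for each distinct key."""
    seen = set()
    kept = []
    for doc in docs:
        k = key(doc)
        if k not in seen:
            seen.add(k)
            kept.append(doc)
    return kept


def case_level(docs):
    # One copy per case: the custodian is ignored.
    return dedupe_by(docs, key=lambda d: d.digest)


def custodian_level(docs):
    # One copy per custodian: duplicates are removed only within a custodian.
    return dedupe_by(docs, key=lambda d: (d.custodian, d.digest))


# The same document held by A, B, and C, plus a second copy held by A.
docs = [
    Doc("1", "A", "h1"),
    Doc("2", "B", "h1"),
    Doc("3", "C", "h1"),
    Doc("4", "A", "h1"),
]

print([d.doc_id for d in case_level(docs)])       # ['1']
print([d.doc_id for d in custodian_level(docs)])  # ['1', '2', '3']
```

Under these assumptions, production-level de-duplication would be the same content-only comparison as `case_level`, applied to just the documents in a given production set.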

De-duplication is an important step to implement because file systems can contain many copies of the same document. For example, each time an email is sent, it typically creates two copies of the email and its attachments: one in the sender’s sent-items folder and one in the recipient’s inbox. An email may also be sent to multiple recipients, thereby creating more copies. Reviewing each of these documents, coding them consistently, and producing multiple copies of an identical document creates inefficiencies and avoidable costs. It is therefore important to evaluate de-duplication efforts.