Lock Down Your eDiscovery Deduplication Policy

Posted by Jeremy Greer | Tue, Feb 19, 2019

What is deduping?

De-duplication (deduping) is the removal of duplicate documents from your data set. On its face, deduping is an easy decision: do I want to review five documents or one? In most cases, deduping keeps reviewers from reading the same document many times, and many of our clients do not want to be overwhelmed by the number of documents they have.

However, you must fully understand your deduping policy and its implications in order to maintain the integrity of your document review. For example, depending on how you define a dupe, three dupes of the same document could have slightly different metadata. Who viewed each version of the dupe? When was each dupe last modified? This topic can be as hard as understanding privilege itself, and it is most difficult when you are reviewing for privilege.

All these questions play into how you create your attorney work product, that is, your marks, issue codes and protective order designations (PODs). When you set up marks and issue codes, you decide whether each type of mark should propagate to duplicate documents. We call this your "propagation policy". The value of deduping in Digital WarRoom is that you can mark a document once and propagate that decision to all identical documents.

Your goal is to set up a defensible deduping and propagation policy that provides the highest degree of efficiency to avoid duplicative review. With that said, you must make good decisions with your data. If not, deduping and propagation can work against you.

How to define your deduping policy

Before you put your documents in review, Digital WarRoom will recognize the duplicate documents in your corpus.


Decide how you would like to define a dupe.

There are two choices: forensic fingerprint and pith. A forensic fingerprint dupe is an exact dupe in every aspect. A pith dupe is a near-dupe. Here is how a pith dupe is defined:

1) Pith dupes are not required to have the same “To” or “From”. This takes into account different email viewers with slightly different formatting of those metadata fields.

2) Pith dupes ignore the time at which the email was sent. This means that pith dupes take into account different time zones, or long load times for receiving an email.

3) The subject and content of two pith dupes must have exactly the same plain text. For example, the word “Hello” has the same plain text however it is styled, for instance in bold or italics. Pith dupes take into account that different email viewers may format your content differently with tags, links, carriage returns, etc. If you send an email from Outlook, it may appear differently on your smartphone. You can be confident that two pith dupes have the same words, even if they are formatted differently in their native viewers.
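The three rules above can be sketched in a few lines of Python. This is only an illustration of the idea, not Digital WarRoom's actual algorithm: the `Email` record, the `plain_text` normalizer and the choice of SHA-256 are all assumptions made for the example. A forensic key hashes the raw bytes, so any difference breaks the match; a pith key hashes only the normalized plain text of the subject and content, and ignores “To”, “From” and the sent time.

```python
import hashlib
import html
import re
from dataclasses import dataclass


@dataclass
class Email:
    """Hypothetical record for illustration; not a Digital WarRoom type."""
    sender: str
    recipients: str
    sent: str
    subject: str
    body: str  # may contain HTML tags or extra line breaks


def plain_text(text: str) -> str:
    """Crude normalizer: drop tags, decode entities, collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return " ".join(html.unescape(no_tags).split())


def forensic_key(raw: bytes) -> str:
    """Exact-dupe key: any byte-level difference changes the key."""
    return hashlib.sha256(raw).hexdigest()


def pith_key(msg: Email) -> str:
    """Near-dupe key: only the plain text of subject and content counts.

    Sender, recipients and sent time are deliberately left out.
    """
    text = plain_text(msg.subject) + "\n" + plain_text(msg.body)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# The same message as seen by two different email viewers.
outlook_copy = Email(
    "Jeremy Greer <jeremy@example.com>", "bill@example.com",
    "2019-02-19T09:00:00-08:00", "Contract draft",
    "<p><b>Hello</b> team,<br>\r\nsee attached.</p>",
)
phone_copy = Email(
    "jeremy@example.com", "Bill <bill@example.com>",
    "2019-02-19T12:00:03-05:00", "Contract draft",
    "Hello team,\nsee attached.",
)
same_pith = pith_key(outlook_copy) == pith_key(phone_copy)
```

Here `same_pith` is true even though the two copies differ in formatting, address display and timestamp, while their raw bytes would produce different forensic keys.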

Strict Recommendation: The default deduping policy is set to Pith. In almost all cases, you should NOT change this setting.

How do you want your dupes to populate in your review environment? There are three options.

1) Dedupe All

This option will put only one dupe in review. The advantage is that you save the most time, because every document is reviewed only once. The downside of this setting is that you lose the recorded information about who has touched the document and the different contexts in which it may have been referenced. For example, if you leave only one dupe in review, you won’t know that Joe had custody of that document in the 2nd dupe. Context is crucial to building your claim or counterclaim. Deduping across the entire matter may also create issues with file paths. If you have a preferred version of the document with a specific file path, such as the “my docs” folder, there is a chance that this specific dupe is not in your review set.

2) Dedupe within custodians

This option will return one copy of every document for each custodian. Let’s say you collected from three sources: Jeremy’s desktop, Jeremy’s email and Bill’s email, and you are reviewing a PDF contract. How many times might you have to review this document if you dedupe within custodians? Suppose the custodian “Jeremy” has one copy on his desktop and attached another copy to an email, and Bill was the recipient of that email. Digital WarRoom will return one document for Jeremy and one for Bill. If you propagate marks, you will only need to apply decisions to the document once. Because you deduped within custodians, you will be able to filter by either the custodian “Jeremy” or “Bill” and know that both people had access to that contract. If the custodian Jeremy had ten dupes of that same document, only one would go into review. In this case, you would know that Jeremy had the document, but you may not know the last time he modified it.

This workflow is especially useful if you take advantage of assigning custodians in Digital WarRoom. A simple way to segment your review is to review documents from one custodian at a time. In this way, you will have a linear understanding of the timeline and knowledge of each custodian. What are all the documents this person was involved with? What did they know and when?
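To make the difference between the first two options concrete, here is a minimal Python sketch using the contract example above. The `Doc` record, the `dedupe` helper and the key value `k1` are hypothetical names invented for this illustration; they are not part of Digital WarRoom. The only thing that changes between the options is the scope within which a dupe key must be unique.

```python
from typing import NamedTuple


class Doc(NamedTuple):
    """Hypothetical review record; dupe_key stands in for a pith key."""
    doc_id: str
    custodian: str
    source: str
    dupe_key: str


def dedupe(docs, scope_fields=()):
    """Keep the first document seen for each (scope, dupe_key) pair.

    scope_fields=()             -> dedupe across the whole matter
    scope_fields=("custodian",) -> dedupe within custodians
    """
    seen = set()
    kept = []
    for doc in docs:
        scope = tuple(getattr(doc, name) for name in scope_fields)
        bucket = scope + (doc.dupe_key,)
        if bucket not in seen:
            seen.add(bucket)
            kept.append(doc)
    return kept


# Three copies of the same PDF contract from three sources.
collected = [
    Doc("D1", "Jeremy", "desktop", "k1"),
    Doc("D2", "Jeremy", "email", "k1"),
    Doc("D3", "Bill", "email", "k1"),
]

dedupe_all = dedupe(collected)                          # one copy in review
within_custodians = dedupe(collected, ("custodian",))   # one copy per custodian
```

With these inputs, `dedupe_all` keeps a single copy, so the link to Bill is lost, while `within_custodians` keeps one copy for Jeremy and one for Bill.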

3) Dedupe across selected (advanced)

You have the ability to dedupe across a custom selection using any of the standard filters available in Digital WarRoom. This is for advanced users who have a strong idea of the implications of deduping.



We recommend deduping on pith and deduping within custodians. If you choose this, you can rely on your propagation policy to ensure that you do not review documents multiple times. Here is an example:

When you get to the point where you may come across the 2nd or 3rd dupe of a document, you should set up your columns so you can quickly see whether a decision has already been applied. To view the status of your work product, there are two relevant fields: marks and issues. We recommend you put these columns next to each other and order them so that they are highly visible. If you like, you can pin these columns to the left. This is similar to “freezing” rows or columns in Excel. Once these fields are pinned, you will easily see whether documents in the document list have already had a decision made on them.

If propagation is not used, we can run a report on collisions, meaning different decisions applied to identical documents. If you propagate marks or issue codes across dupes, you will never have collisions.
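The relationship between propagation and collisions can be sketched as follows. This is an illustrative model only: the function names, the document IDs and the mark values are assumptions made for the example, and the real collision report in Digital WarRoom may work differently.

```python
from collections import defaultdict


def find_collisions(dupe_keys, marks):
    """Return dupe keys whose documents carry more than one distinct mark.

    dupe_keys: doc_id -> dupe key
    marks:     doc_id -> mark applied by a reviewer
    """
    by_key = defaultdict(set)
    for doc_id, mark in marks.items():
        by_key[dupe_keys[doc_id]].add(mark)
    return {key: sorted(found) for key, found in by_key.items() if len(found) > 1}


def propagate(dupe_keys, marks, doc_id, mark):
    """Apply a mark to one document and to every dupe of that document."""
    target = dupe_keys[doc_id]
    updated = dict(marks)
    for other_id, key in dupe_keys.items():
        if key == target:
            updated[other_id] = mark
    return updated


# D1 and D2 are dupes of each other; D3 is a different document.
dupe_keys = {"D1": "k1", "D2": "k1", "D3": "k2"}

# Without propagation, two reviewers disagree on the same content.
manual_marks = {"D1": "Responsive", "D2": "Not Responsive", "D3": "Responsive"}
collisions = find_collisions(dupe_keys, manual_marks)

# With propagation, marking D1 also marks D2, so no collision can arise.
propagated_marks = propagate(dupe_keys, {"D3": "Responsive"}, "D1", "Responsive")
```

In this sketch, `collisions` reports the key shared by D1 and D2, while running `find_collisions` on `propagated_marks` returns nothing.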

Deduping and Propagation When Reviewing For Privilege

Deduping within custodians is particularly useful for determining privilege because you will get the full context of everyone who had access to each document. Furthermore, we recommend using propagation unless the mark you are propagating is in the category of privilege. In this case, you should examine your workflow, review your legal requirements and decide if propagation makes sense.

Here is an example of why you should not always propagate a privilege mark:

Let’s say there is a privileged email correspondence between a client and an attorney. The client attaches a document to that email. Based on this situation, you may be tempted to also mark the attached document as privileged. Upon further review, you notice a dupe of this attached document that reveals the document is completely public. Therefore, the document (and all of its dupes) cannot be privileged.

Context matters for privilege. Therefore, you may need to review the context of all dupes of a document in order to safely determine privilege.
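One way to picture a propagation policy that treats privilege differently is a per-category switch. The sketch below is hypothetical: the category names, the `PROPAGATION_POLICY` table and the `apply_mark` helper are invented for illustration and do not describe Digital WarRoom's settings. Marks in a propagating category are copied to every dupe, while privilege marks stay on the single document that was reviewed.

```python
# Hypothetical policy: which categories of marks propagate to dupes.
PROPAGATION_POLICY = {
    "responsiveness": True,
    "issue": True,
    "privilege": False,  # review each dupe in its own context
}


def apply_mark(dupe_keys, marks, doc_id, category, value, policy=PROPAGATION_POLICY):
    """Return a new marks table with the mark applied per the policy.

    dupe_keys: doc_id -> dupe key
    marks:     doc_id -> {category: value}
    """
    targets = [doc_id]
    if policy.get(category, False):
        # Propagating category: every document sharing the dupe key is marked.
        targets = [d for d, key in dupe_keys.items() if key == dupe_keys[doc_id]]
    updated = {d: dict(existing) for d, existing in marks.items()}
    for target in targets:
        updated.setdefault(target, {})[category] = value
    return updated


dupe_keys = {"D1": "k1", "D2": "k1"}
marks = apply_mark(dupe_keys, {}, "D1", "issue", "Contract")
marks = apply_mark(dupe_keys, marks, "D1", "privilege", "Privileged")
```

After these two calls, both D1 and D2 carry the issue code, but only D1 carries the privilege mark, leaving D2 to be assessed in its own context.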



