eDiscovery Document Triage Guide - Exceptions, File Types & OCR

Posted by Jeremy Greer | Mon, Jan 21, 2019

What steps should you consider directly after processing?

  • Review exceptions and triage documents

  • Know which files are viewable

  • Analyze the makeup of unindexed documents

  • Consider unindexed multimedia


To become self-sufficient in eDiscovery, you will have to accept two facts.

1) Regardless of eDiscovery platform, not all documents can be opened.

2) Regardless of eDiscovery platform, not all documents can be indexed.

 


Intro to Exceptions

An exception is an error message that explains why Digital WarRoom (DWR) was unable to open a file and extract the expected metadata or text. Exceptions are typical in any processing job, because password-protected, corrupted and encrypted files tend to accumulate in any set of documents. When DWR comes across such a file, the tool will acknowledge the file, extract the keywords it can, throw an exception and make a best guess at why it could not finish processing that file. If you have a significant number of exceptions, please understand that the tool is not “broken”. For instance, if you were to process a large collection in which 40% of documents were password-protected or encrypted, DWR would technically complete the job of processing, but you would have many exceptions. Similarly, if you throw rocks in a laundry machine and press start, the machine will run, but you will probably not get the desired result.

Tip: After processing a collection of documents, immediately look at your exceptions. If you skip this step, you might overlook documents missing from your collection.

Here are some example exceptions from our demo database, which uses files from a marketing email box. There are 32 exceptions. About 15 of them are emails involving a “send and sign” software for Digital WarRoom contracts. These emails are rightly encrypted, since they are designed solely for the use of the recipient. About 6 of the exceptions are password-protected health care documents and about 11 are password-protected pay stubs.

This begins the triage step. Ask yourself: what is the nature of my exceptions? Then decide on next steps for opening each document that threw an exception. Can you tell from the path and file names that these files are unlikely to have probative value? If so, you can make a legal risk management decision not to include those documents in review. If you would like to explore options to open files that threw an exception, you may need to be a bit creative. There is no consistent workflow for dealing with exceptions. However, here are some recommended courses of action for different issues:

The most common exceptions are password-protected files, encrypted files and corrupted files. If you notice one of these issues, you may have to talk to the custodian to retrieve a password, decrypt the file or determine the source of the corruption.

Tip: Export the exceptions report to an Excel sheet and send it to the supervising attorney or the attorney in charge of preservation, along with your questions: “I have these problems. Should we recollect? Should we get passwords from clients?”
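Before sending the report on, it can help to see how the exceptions break down by reason. Here is a minimal Python sketch of that tally. It assumes the report has been saved as CSV with columns named "Path" and "Exception"; those column names and the sample rows are made up for illustration and will differ from a real export.

```python
import csv
import io
from collections import Counter


def summarize_exceptions(csv_text, reason_column="Exception"):
    """Count exceptions by reported reason in an exported report (CSV text)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[reason_column].strip() for row in reader)


# Hypothetical export contents; a real report will have different columns.
sample = """Path,Exception
mail/contracts/sign1.msg,Encrypted
hr/benefits/plan.pdf,Password protected
hr/payroll/stub_01.pdf,Password protected
"""

# Print the most frequent reasons first.
for reason, count in summarize_exceptions(sample).most_common():
    print(f"{reason}: {count}")
```

A short summary like this gives the attorney the big picture before they open the full spreadsheet.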

Recommendation: Acknowledge that your exceptions exist and document what you’ve done about them. The best way to avoid sanctions is to keep detailed, timestamped notes.
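If you keep your notes in a script or spreadsheet rather than by hand, a sketch like the following shows one way to stamp each entry. The function name, field names and sample entry are assumptions for illustration, not a Digital WarRoom feature.

```python
from datetime import datetime, timezone


def log_exception_action(log, path, action, now=None):
    """Append a timestamped note about what was done for one exception."""
    # Use UTC so notes from different people sort consistently.
    stamp = (now or datetime.now(timezone.utc)).isoformat(timespec="seconds")
    entry = {"timestamp": stamp, "path": path, "action": action}
    log.append(entry)
    return entry


notes = []
log_exception_action(
    notes,
    "hr/payroll/stub_01.pdf",  # hypothetical file path
    "Asked custodian for password",
)
print(notes[0]["timestamp"], notes[0]["action"])
```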

Tip: Try opening the native files in their original applications. For example, check whether you can open a native PST file in Outlook, or a Word document reported as corrupt in MS Word.

 

Here are some exception examples:

  • Password-protected files
  • Encrypted files
  • Corrupted files
  • Container files that cannot be opened
  • Files that were locked or inaccessible
  • A document location to catalog that was not accessible

 

File Types

1) Check the file type makeup of your processed collection.

What are the file types of the documents you just processed? These statistics are available in the reports view. You may be asked to produce the files in native format. If an exception is unusual and you need access to the affected files, some additional triage work may be needed, and sometimes you need consultants to do that.
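If you only have a file listing to work from, the same kind of breakdown can be sketched in a few lines of Python. This is an illustration, not how the reports view computes its statistics; the function name and the sample paths are made up.

```python
from collections import Counter
from pathlib import PurePath


def file_type_makeup(paths):
    """Return {extension: (count, percent of collection)}, most common first."""
    counts = Counter(PurePath(p).suffix.lower() or "(none)" for p in paths)
    total = sum(counts.values())
    return {ext: (n, round(100 * n / total, 1)) for ext, n in counts.most_common()}


# Hypothetical file listing.
listing = ["memo.DOCX", "budget.xlsx", "notes.docx", "archive.pst", "README"]

for ext, (count, percent) in file_type_makeup(listing).items():
    print(f"{ext}: {count} files ({percent}%)")
```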

 

2) After processing, consider that not all of your data will make it through to review.

You can edit your policy to post any file type you want into review. Typical files such as documents, images and multimedia are always included by default. Digital WarRoom Cloud offerings can view 300+ file types using the industry-standard Avantstar viewing software. For individual DWR Pro clients, we recommend QuickView Plus, from the same software development company.

In rare cases, you may have file types outside of the list of viewable files. You will only be able to open such a file if you have installed a viewer that supports the file type in question. For example, if you have a set of unusual, proprietary CAD files, you may need to install (or have us install) a specific viewer from that company into your environment. Once the viewer is installed, double-clicking the CAD file in the document list will open it.

By default, Digital WarRoom will exclude from review any file types that are not viewable by nature, such as binary files. Before you get to review, you can use the policy wizard to create a custom list of file types to exclude. If you wanted, you could edit the defaults and post a database file into review, but you wouldn’t be able to review much content.

If a document has an unusual file type or is a zero-KB file, you may be able to recognize immediately that it is unlikely to have probative value. For example, a file like this may have been created by a computer program rather than by a person. Digital WarRoom uses anti-virus software, but even so, you don’t want to download a virus in the middle of review. In most cases, you won't need to edit the list of file types that will be posted into review.
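To spot these low-value candidates in bulk, you could run a quick check over a file listing. The Python sketch below is only an illustration: the extension list is a made-up example, not DWR's default exclusion list, and the sample files are hypothetical.

```python
from pathlib import PurePath

# Hypothetical exclusion list, not DWR's defaults.
LOW_VALUE_EXTENSIONS = {".dll", ".exe", ".db", ".tmp"}


def flag_low_value(files, excluded=LOW_VALUE_EXTENSIONS):
    """Return (path, reason) for zero-byte files and excluded file types.

    `files` is an iterable of (path, size_in_bytes) pairs.
    """
    flagged = []
    for path, size_bytes in files:
        ext = PurePath(path).suffix.lower()
        if size_bytes == 0:
            flagged.append((path, "zero-byte file"))
        elif ext in excluded:
            flagged.append((path, f"excluded type {ext}"))
    return flagged


sample_files = [
    ("docs/report.docx", 20480),
    ("docs/empty.log", 0),
    ("system/driver.dll", 1024),
]

for path, reason in flag_low_value(sample_files):
    print(f"{path}: {reason}")
```

A flagged file is only a candidate for exclusion. The decision to leave it out of review is still a judgement call.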

 


Index and OCR

During processing, Digital WarRoom will attempt to index all files. Successfully indexed files are files in which the tool can find plain text, such as a Word document or an Excel sheet. If you have an unusual document like a QuickBooks file, Digital WarRoom will still attempt to index the words in that document, but the results may come out as gibberish. In the case of an image, a scanned PDF, or a Photoshop file, you will have to OCR the document in order for the text to be searchable in review.

After you process a collection, look at the percentage of unindexed documents in that collection. The more multimedia or nonstandard file types in your collection, the higher the unindexed percentage will be.

Recommended workflow: Should you OCR those documents right away? It depends on the data itself. We recommend adding the data to review first, so you can view and check the makeup of those unindexed documents. Once in review, search using the “Unindexed Files” filter (nested under “Special Filters”) and use your judgement to determine why those documents were not indexed. If you see something you did not expect, you may need to pursue the issue further.

Generally, if a significant portion of your documents is unindexed (more than roughly 1-3%), you should OCR that collection. Before you OCR, remember that you cannot OCR video and audio files (otherwise known as multimedia). Here is an example filter that will return the files you should OCR: include “Unindexable Files” and exclude “Multimedia” extensions. After searching on this filter, select the documents you want to OCR, or select all documents using Ctrl+A. Then right-click one of your highlighted documents and select “OCR/Language Analysis” from the menu.
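The logic of that filter, and of the percentage check, can be sketched in Python. This illustrates the reasoning only and is not how DWR implements its filters. The multimedia extension list, the 3% default threshold and the sample documents are assumptions.

```python
from pathlib import PurePath

# Assumed multimedia extensions; the real "Multimedia" list may differ.
MULTIMEDIA_EXTENSIONS = {".mp3", ".wav", ".mp4", ".mov", ".avi", ".wmv"}


def unindexed_percentage(documents):
    """Percent of documents that were not indexed.

    `documents` is a list of (path, is_indexed) pairs.
    """
    if not documents:
        return 0.0
    unindexed = sum(1 for _, indexed in documents if not indexed)
    return 100 * unindexed / len(documents)


def ocr_candidates(documents):
    """Unindexed documents, excluding audio and video, which cannot be OCRed."""
    return [
        path
        for path, indexed in documents
        if not indexed
        and PurePath(path).suffix.lower() not in MULTIMEDIA_EXTENSIONS
    ]


def should_ocr(documents, threshold_pct=3.0):
    """True if the unindexed share exceeds the threshold (3% is an assumption)."""
    return unindexed_percentage(documents) > threshold_pct


docs = [
    ("memo.docx", True),
    ("scan.pdf", False),
    ("call.mp3", False),
    ("photo.jpg", False),
]

if should_ocr(docs):
    print("OCR these:", ocr_candidates(docs))
```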

Unindexed Multimedia

Audio and video files, like any other file, can be opened and reviewed individually when conducting eDiscovery. If these clips are particularly long, Digital WarRoom consultants can assist by using voice-to-text software to transcribe and index the words in each clip. It is up to you to make the call on the value of these audio files to your investigation.

Topics: Best Practices

