The e-discovery term “de-duplication” is understandably popular among lawyers and legal services providers. It conjures images of fewer identical documents to review, the associated time and cost savings, and maybe even a weekend off. It is no surprise, then, that de-duplication is a vital selling point for e-discovery providers.
Unfortunately, in their eagerness to avoid redundant review, many firms request “de-duplication” without considering all of the choices or potential implications.
This article reviews the basics of de-duplication: the types of de-duplication, the strategic options available to optimize efficiencies through de-duplication, and some of the common missteps that occur during the de-duplication stage of e-discovery.
De-duplication is not a single process, but rather a series of processes designed to manage duplicate documents efficiently. For many practitioners, failure to maximize the benefits of de-duplication begins with the failure to plan.
Handing several hard drives to your e-discovery provider and asking for “de-duplication” is analogous to presenting a patient to a surgeon with a simple request for an “operation”. At the initial stage of every document production, determining what is to be de-duplicated is an essential, yet often overlooked, starting point.
If the client requires comprehensive email de-duplication, rather than a narrower focus on attachments and files, the client must be informed of the potential for loss of relevant data. A thorough understanding of the risks and rewards of de-duplication will be vital to the client’s determination of the best course of action.
Lawyers benefit most from de-duplication when they spend time before e-discovery doing some predictive analysis of how the discovery process will progress. The resulting forecast may draw on perspectives about which individuals at a firm are key players, which issues are likely foci, or even which keywords may be used in searches. Such foresight, together with close collaboration with e-discovery professionals, can result in valuable prioritization and organization of the discovery process, which in turn can minimize costs while enhancing efficiency in production.
As an integral part of discovery planning, lawyers need to be clear on the level of de-duplication appropriate for each individual whose data is being examined. De-duplication can be performed at an individual level, which involves removing duplicates found within one individual’s collection of email, attachments, and files. It can also be conducted on a case-wide basis, which removes duplicate data found when the collections of a number of individuals are compared against one another.
The level of de-duplication a client chooses may be contingent on the quality of the information-gathering to that point. If the client is unaware of the key players being investigated, broader parameters may be set, and a “case-wide” de-duplication could best serve the discovery goals by optimally reducing the data universe for efficient review. At the opposite end of the spectrum, individual de-duplication (not de-duplicating across a number of individuals’ data collections) may not reduce that data universe enough to maximize efficiency. To visualize this, think about group emails, the number of times certain emails are forwarded, and how many times a reviewer would come across the same email.
This is a juncture at which lawyers may decide to attempt to hybridize de-duplication processes, thereby inserting their own forecast of the value of different evidence sources. Lawyers may request that certain individuals’ data be de-duplicated on individual bases, while other data is de-duplicated on a case-wide basis. This can often result in costly time delays, delays that could be avoided through better planning and communication between e-discovery professionals and clients.
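The difference between the individual and case-wide scopes described above can be sketched in a few lines of Python. This is an illustration only: the custodian names, the message strings, and the `dedupe_per_custodian`, `dedupe_case_wide`, and `review_count` helpers are hypothetical, and plain string equality stands in for whatever duplicate test a provider actually applies.

```python
from typing import Dict, List

# Hypothetical collections: custodian name -> list of message texts.
collections: Dict[str, List[str]] = {
    "alice": ["Q3 forecast", "Lunch Friday?", "Q3 forecast"],
    "bob": ["Q3 forecast", "Board agenda"],
    "carol": ["Q3 forecast", "Board agenda", "Lunch Friday?"],
}


def dedupe_per_custodian(data: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Individual scope: remove repeats only within each custodian's own set."""
    result: Dict[str, List[str]] = {}
    for custodian, messages in data.items():
        seen = set()  # reset for every custodian
        kept = []
        for message in messages:
            if message not in seen:
                seen.add(message)
                kept.append(message)
        result[custodian] = kept
    return result


def dedupe_case_wide(data: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Case-wide scope: keep only the first copy found across all custodians."""
    seen = set()  # shared across every custodian
    result: Dict[str, List[str]] = {}
    for custodian, messages in data.items():
        kept = []
        for message in messages:
            if message not in seen:
                seen.add(message)
                kept.append(message)
        result[custodian] = kept
    return result


def review_count(data: Dict[str, List[str]]) -> int:
    """Total number of documents a reviewer would have to read."""
    return sum(len(messages) for messages in data.values())


individual = dedupe_per_custodian(collections)
case_wide = dedupe_case_wide(collections)
print(review_count(collections), review_count(individual), review_count(case_wide))
# prints: 8 7 3
```

With this sample data the review set drops from 8 documents to 7 under individual de-duplication and to 3 under case-wide de-duplication, at the cost that the last custodian processed ends up with an empty collection.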
Precise Definitions Lead to Success
The definition of what constitutes a duplicate must be established at the outset of the e-discovery process. The grading of duplicates from “exact” to “near” to “partial”, and the different treatment each type requires, can have enormous implications for the effectiveness and efficiency of the de-duplication process.
A focus on “near and partial duplicates” rather than on “exact duplicates” requires a more expert understanding of de-duplication, because exact duplicates are more clearly evident and therefore easier to eliminate. A near duplicate is identified by similar data such as date, file name, or other metadata (e.g., file size or time stamp). Partial duplicates, on the other hand, may be identical emails with different recipients, different dates, forwards, or cc:’s. Partial duplicates can even be as remotely linked as emails that share common headers (e.g., legal disclaimers), footers (e.g., standard salutations), or blocks of text.
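One way to picture the “identical email with different recipients” case is to compare message bodies while ignoring the header fields. The sketch below is an illustration built on assumptions: the `Email` record, its fields, the sample addresses, and the `group_partial_duplicates` helper are all hypothetical, and real tools use far more tolerant comparisons than matching a whitespace-normalized body.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Email:
    """Hypothetical, simplified email record used only for illustration."""
    sender: str
    recipients: Tuple[str, ...]
    sent: str  # date as text, e.g. "2005-03-20"
    body: str


def group_partial_duplicates(emails: List[Email]) -> List[List[Email]]:
    """Group emails whose bodies match after case and whitespace normalization.

    Groups with more than one member are candidate partial duplicates:
    the text is the same, but recipients or dates may differ.
    """
    groups: Dict[str, List[Email]] = defaultdict(list)
    for email in emails:
        normalized_body = " ".join(email.body.split()).lower()
        groups[normalized_body].append(email)
    return [group for group in groups.values() if len(group) > 1]


emails = [
    Email("ann@example.com", ("bob@example.com",), "2005-03-20", "Please review the draft."),
    Email("ann@example.com", ("carol@example.com",), "2005-03-21", "Please  review the draft."),
    Email("bob@example.com", ("ann@example.com",), "2005-03-22", "Draft looks fine."),
]

for group in group_partial_duplicates(emails):
    # Show only the fields that differ between the grouped messages.
    print([(email.recipients, email.sent) for email in group])
```

Here the first two messages land in one group even though their recipients and dates differ, which is exactly why they would not be caught by a test for exact duplicates.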
The most common form of de-duplication eliminates exact duplicates. Exact duplicates are files or blocks of text that are determined to be identical because they share the same “hash code”. A hash code is a numerical value computed from a file by a mathematical operation that transforms the data in the file into a shorter, fixed-length value. For practical purposes the value is unique to the file’s content: with a well-designed hash algorithm, two different files are extremely unlikely to share a hash code. The value of hash codes lies in the resulting precision of document identification. Changing even one character of a file’s content will result in a completely different hash code. Thus, using hash codes as the basis for de-duplication means that, in practice, only exact duplicates will be eliminated.
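As a rough illustration of hash-based exact de-duplication, the Python sketch below uses SHA-256 from the standard `hashlib` module. The choice of SHA-256, the sample byte strings, and the `hash_code` and `remove_exact_duplicates` helpers are assumptions made for this example; the article does not name a specific algorithm, and providers may use others.

```python
import hashlib
from typing import Dict, List


def hash_code(content: bytes) -> str:
    """Return a fixed-length hexadecimal digest of the file content."""
    return hashlib.sha256(content).hexdigest()


def remove_exact_duplicates(documents: List[bytes]) -> List[bytes]:
    """Keep the first document seen for each distinct hash code."""
    seen: Dict[str, bytes] = {}
    for content in documents:
        digest = hash_code(content)
        if digest not in seen:
            seen[digest] = content
    return list(seen.values())


# Hypothetical sample documents.
documents = [
    b"Quarterly results attached.",
    b"Quarterly results attached.",  # exact duplicate of the first
    b"Quarterly results attached!",  # differs by a single character
]

unique = remove_exact_duplicates(documents)
print(len(unique))  # prints: 2
for content in unique:
    # The two surviving documents have unrelated-looking hash codes.
    print(hash_code(content)[:16], content)
```

The second document is dropped because its hash code matches the first, while the third survives because a one-character change produces a different hash code.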
Although most legal professionals may not see the value of adding a hash code field, this one step can provide significant advantages as the data collection within a case grows. Imagine overseeing a large data production with a well-conceived de-duplication plan, only to have unknown data, not similarly managed, added to the database at the eleventh hour by an alternate provider. Having the hash code as a hidden field within the database means that new data can be easily incorporated without compromising the de-duplicated data or the de-duplication effort.
Understanding Production
A large part of optimizing efficiency in complex de-duplication productions is deciding how duplicates will be handled during production. One option is blind duplicate elimination. Here, the final production includes no reference to duplicates that were identified and eliminated during production. Thus, in a global de-duplication, an individual email user’s data may be significantly lacking in content, because duplicates were detected elsewhere.
Although blind elimination saves large amounts of reviewer time, paper, and storage space, its drawbacks are obvious, especially where partial duplicates are concerned. Certain areas of interest may only become clear as a review progresses. Did a given email contain the standard company header? Who in the office received the forwarded email with its attachment? If such questions only arise midway through a review, finding answers can be costly and time-consuming.
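The information loss described above can be made concrete with a small Python sketch. Everything in it is hypothetical: the mailbox contents, the document identifiers, and the `blind_elimination` and `holders` helpers are invented for illustration and do not describe any particular provider's process.

```python
from typing import Dict, List, Set

# Hypothetical input: custodian -> document identifiers found in that mailbox.
mailboxes: Dict[str, List[str]] = {
    "alice": ["memo-17", "invoice-02"],
    "bob": ["memo-17", "contract-09"],
    "carol": ["memo-17", "invoice-02", "contract-09"],
}


def blind_elimination(data: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Global de-duplication that keeps no record of what was removed."""
    seen: Set[str] = set()
    production: Dict[str, List[str]] = {}
    for custodian, documents in data.items():
        kept = []
        for document in documents:
            if document not in seen:
                seen.add(document)
                kept.append(document)
        production[custodian] = kept
    return production


def holders(data: Dict[str, List[str]], document: str) -> List[str]:
    """Custodians who appear to hold the document, judging only from data."""
    return [custodian for custodian, documents in data.items() if document in documents]


production = blind_elimination(mailboxes)
print(production)
print(holders(mailboxes, "memo-17"))   # prints: ['alice', 'bob', 'carol']
print(holders(production, "memo-17"))  # prints: ['alice']
```

After elimination, the production alone can no longer answer who received the memo: only the first custodian processed still appears to hold it, and the last custodian's collection is empty.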
Because of the risks inherent in duplicate elimination, most de-duplication productions use duplicate identification instead. Here, duplicates are identified and logged as such, including a reference to the original file. Logs are also essential for finding mistakes within de-duplication productions and for locating files that did not open, error messages, and corrupted data. More complex operations require more complex logs. Many different styles of logs exist within the duplicate identification option, and lawyers should understand exactly how the production will look and function before production begins.
Lawyers also need to be concerned with the processing power required both for de-duplication and for the review process. The greater the scope of de-duplication requested, the more reliant the reviewers will be on a central index or master log. As part of the point-and-click generation, most lawyers envision being able to scroll through the production, effortlessly clicking on documents that refer to other documents that refer to still more documents. Such functionality, however, requires tremendous disk space and processing power, far greater than most firms have in-house.
For this reason, some legal services providers are starting to offer online review tools to increase the functionality of their indices. Most productions and reviews, however, still take place using a local database application at the law firm. Understanding the limitations of review prior to production can be critically important to lawyers when determining the scope of review appropriate for their particular investigation.
Successful De-Duplication Requires a Uniform Strategy
As the above discussion demonstrates, the complexity inherent in de-duplication operations quickly becomes apparent. The underlying message in all this complexity, however, is quite clear: begin planning for de-duplication as early as possible. Most problems arise when lawyers do not fully understand what they have, what they want from the data, how the output will look and function, and the resources required for that type of production. These problems can be avoided by assembling a team of project managers, attorneys, and legal services providers for a planning period before production begins.
A practice that is gradually taking hold in the industry is the requirement that clients agree to a “statement of de-duplication”, or contract, that establishes the exact type of de-duplication to be performed, what the output will look like, and the order in which the data will be processed. This minimizes the chance of costly surprises during and after the production process.
Currently, there are no industry-wide standards for de-duplication and, as noted, there is a great degree of inconsistency in how de-duplication is implemented. Thus, it is up to lawyers and legal services providers to lay out a clear roadmap for each case. Legal services providers must be clear about how output will look and function, the order in which data will be processed, and the implications of the chosen strategy.
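Because no standard format exists, the following Python sketch is only one hypothetical way to write such a statement down so that gaps are caught before processing starts. The `DeduplicationStatement` class, its fields, the enum values, and the checks in `problems` are all assumptions made for illustration; in particular, the processing order is modeled here simply as a list of custodian names.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Scope(Enum):
    INDIVIDUAL = "individual"
    CASE_WIDE = "case-wide"


class DuplicateDefinition(Enum):
    EXACT = "exact"
    NEAR = "near"
    PARTIAL = "partial"


class Handling(Enum):
    BLIND_ELIMINATION = "blind elimination"
    IDENTIFICATION = "identification with log"


@dataclass
class DeduplicationStatement:
    """Hypothetical record of the choices agreed before production begins."""
    default_scope: Scope
    definition: DuplicateDefinition
    handling: Handling
    processing_order: List[str]  # custodian names, in processing order
    scope_overrides: Dict[str, Scope] = field(default_factory=dict)

    def scope_for(self, custodian: str) -> Scope:
        """Scope applied to one custodian, honoring any hybrid override."""
        return self.scope_overrides.get(custodian, self.default_scope)

    def problems(self) -> List[str]:
        """Return plain-language issues that should be resolved up front."""
        issues: List[str] = []
        if not self.processing_order:
            issues.append("processing order is empty")
        if len(set(self.processing_order)) != len(self.processing_order):
            issues.append("a custodian is listed twice in the processing order")
        for custodian in self.scope_overrides:
            if custodian not in self.processing_order:
                issues.append(f"scope override for unknown custodian: {custodian}")
        if (self.handling is Handling.BLIND_ELIMINATION
                and self.definition is not DuplicateDefinition.EXACT):
            issues.append("blind elimination of near or partial duplicates risks losing relevant data")
        return issues


statement = DeduplicationStatement(
    default_scope=Scope.CASE_WIDE,
    definition=DuplicateDefinition.EXACT,
    handling=Handling.IDENTIFICATION,
    processing_order=["alice", "bob", "carol"],
    scope_overrides={"alice": Scope.INDIVIDUAL},
)
print(statement.scope_for("alice").value, statement.scope_for("bob").value)
print(statement.problems())  # prints: []
```

Writing the choices down in one place, in whatever form, makes the scope, the duplicate definition, the handling method, and the processing order explicit before any data is touched.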
The level of service e-discovery providers can offer is largely defined by the information they receive from their clients. Therefore, lawyers need to spend adequate time educating themselves about de-duplication, forecasting how the review process will proceed, and fitting de-duplication into their overall litigation strategy.
The clear communication of objectives, desired output, and timelines during the planning process is essential. If those foundations for success are present, de-duplication will be a vital asset to firms’ e-discovery efforts.
------------------------------------
Copyright ©2007 discover-e Legal, LLC All Rights Reserved