Introduction
As the volume of data used in corporate activities continues to grow at an accelerating pace, and as new data preservation tools are developed, the volume of data involved in the eDiscovery process is also multiplying. Consequently, many organizations are focused on reducing the cost of document review, which typically accounts for more than half of the total cost of eDiscovery. For more than a decade, companies and legal professionals have been exploring the use of Technology Assisted Review (TAR) and other advanced technologies.
TAR is a method of document review in which machine learning algorithms classify documents based on human coding decisions; it has been accepted by courts in many countries. As the technology evolved from Sample Based Learning to the more robust Continuous Active Learning, it has enabled increasingly automated review workflows.
Why is it difficult to set an industry standard?
While TAR has been part of our industry for almost twenty years[1], its operational methods are not standardized, due to the following challenges:
- Continuous development of new tools. eDiscovery tools, from forensic tools for collecting data to platforms for reviewing documents, are continuously being developed and improved, and analytic and TAR tools are no exception. As these tools gained acceptance in the courts, more advanced tools were developed to further improve efficiency and accuracy. What is considered best practice at one time may quickly be superseded as existing tools are enhanced and new tools are invented.
- Standards can differ by region. Because acceptance of TAR and analytic technology varies worldwide, how TAR is utilized depends on the country and the case. For example, in Japan, where no discovery procedure comparable to common-law discovery exists, most lawyers are unfamiliar with TAR unless they have been engaged in foreign litigation or a regulatory investigation. While there is room to utilize TAR for document review in internal investigation matters, not all lawyers and in-house counsel are willing to adopt new technology.
Due to rapid changes in TAR and advanced eDiscovery technologies, experts in the field can provide valuable insight into current best practices and how to tailor them to a particular project.
What technology is currently used?
Despite the complexity of advanced eDiscovery technologies, it is beneficial for lawyers and in-house counsel to understand the underlying concepts. With that knowledge, they can direct external eDiscovery experts toward the overall objectives, freeing the experts to focus on designing and troubleshooting specific aspects of a workflow. Several general TAR concepts follow:
- Sample Based Training. The initial form of TAR is Sample Based Training, sometimes referred to as TAR 1.0, simple passive learning (SPL), or simple active learning (SAL). This process starts with the review of a “control set”: a few thousand randomly sampled documents reviewed by a reviewer knowledgeable about the case. The control set serves as a representative sample of the review pool and is used to benchmark the model at each step. Additional random samples, called “training sets,” are then reviewed and used to train the model, which classifies all the documents. The model's predictions are then checked in a “QC round,” where metrics such as precision, recall, and overturns are assessed (see the first sketch after this list); the model's performance is either validated, or additional review rounds are completed until it is.
- Continuous Active Learning. The second generation of TAR (TAR 2.0) is Continuous Active Learning. In this method, there is no separation between training and review: the computer actively learns while the reviewer codes, so that all documents predicted to be relevant will have been reviewed by the end of the process. An active learning review usually requires a certain number of documents to be binary-coded before the tool propagates a relevance score; for example, Relativity Active Learning needs at least five documents marked positive and five marked negative before scoring. Once the model has a suitable number of manually coded relevant documents, it begins training and continuously updates the relevance scores based on the documents coded by reviewers, while serving up the documents predicted to be relevant. Each coded document improves the accuracy of the model's predictions and helps it serve up similar relevant documents. This continues until the model finds no more relevant documents (a minimal version of this loop is sketched in the second example after this list). Like Sample Based Training, Continuous Active Learning typically ends with a validation test, in which a sample of unreviewed documents is reviewed to estimate how many, if any, relevant documents would be missed.
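To make the QC-round metrics mentioned above concrete, here is a minimal sketch of how precision, recall, and an overturn rate could be computed from a QC sample. The record layout and the function itself are illustrative assumptions, not any vendor's implementation.

```python
# Illustrative only: the standard definitions of precision and recall,
# plus an overturn rate, computed from (model prediction, human call) pairs.
def validation_metrics(qc_sample):
    """qc_sample: list of (predicted_relevant, human_relevant) booleans."""
    tp = sum(1 for p, h in qc_sample if p and h)        # true positives
    fp = sum(1 for p, h in qc_sample if p and not h)    # false positives
    fn = sum(1 for p, h in qc_sample if not p and h)    # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0    # correctness of "relevant" calls
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # share of relevant docs found
    overturn_rate = sum(1 for p, h in qc_sample if p != h) / len(qc_sample)
    return precision, recall, overturn_rate

# Example: four QC documents, one model prediction overturned by the reviewer.
print(validation_metrics([(True, True), (True, False),
                          (False, False), (False, False)]))
# -> (0.5, 1.0, 0.25)
```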
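And here is a minimal Continuous Active Learning loop, sketched with scikit-learn on a synthetic corpus. The toy documents, the 5-and-5 seed rule, the batch size, and the stopping threshold are all assumptions for illustration; commercial tools implement their own models, queues, and stopping logic.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic corpus: "relevant" documents discuss the disputed contract.
relevant_terms = ["contract", "breach", "payment", "termination"]
other_terms = ["lunch", "meeting", "holiday", "printer", "weather"]
truth = rng.random(200) < 0.2                 # ~20% truly relevant
docs = [" ".join(rng.choice(relevant_terms if t else other_terms, size=8))
        for t in truth]

X = TfidfVectorizer().fit_transform(docs)
coded = {}                                    # doc index -> human coding

# Seed: code random documents until at least 5 positives and 5 negatives
# exist (mirroring the 5-and-5 minimum mentioned above).
for j in rng.permutation(len(docs)):
    coded[j] = bool(truth[j])                 # stand-in for a human call
    if sum(coded.values()) >= 5 and len(coded) - sum(coded.values()) >= 5:
        break

BATCH, STOP_SCORE = 10, 0.5
while True:
    # Retrain on everything coded so far, then score the uncoded pool.
    model = LogisticRegression().fit(X[list(coded)],
                                     [coded[j] for j in coded])
    uncoded = [j for j in range(len(docs)) if j not in coded]
    if not uncoded:
        break
    scores = model.predict_proba(X[uncoded])[:, 1]   # relevance scores
    ranked = sorted(zip(scores, uncoded), reverse=True)[:BATCH]
    if ranked[0][0] < STOP_SCORE:             # nothing left looks relevant
        break
    for score, j in ranked:                   # reviewer codes the batch
        coded[j] = bool(truth[j])

print(f"Reviewed {len(coded)} of {len(docs)} documents; "
      f"found {sum(coded.values())} relevant.")
```

Note how the loop embodies both ideas in the text: reviewer coding and model training are interleaved, and the process ends on its own when the highest-scoring unreviewed documents no longer look relevant.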
Each method has its pros and cons. For example, Sample Based Learning can effectively predict the whole review scope at an earlier stage, while Continuous Active Learning may generate smaller review populations because the model is continuously updated. In most cases, Continuous Active Learning is chosen when a client adopts the TAR process, partly because it can be set up relatively easily, without additional cost or extensive expert input up front, while reaping the following benefits:
- Reducing the Scope. The main reason for using Continuous Active Learning is to reduce the review scope by eliminating non-relevant documents. As set out above, the AI model tries to surface the likely relevant documents for review and stops when it can find no more. In this way, it reduces review cost and time by removing most presumably non-relevant documents from manual review.
- Prioritization. Once the AI model has seen enough documents to stabilize its judgment, it prioritizes for human review the remaining documents it deems most likely relevant. In other words, as the model is updated, documents with a high Relevance Score are served first. This prioritization is especially useful for detecting hot documents at an early stage and is often utilized in internal investigations, where the overall picture of the case may not be clear at the beginning.
- Quality Control. Continuous Active Learning can also be used for quality control during and after a linear review, with the Relevance Score assigned to each document playing an important role. For example, a document with a high Relevance Score but marked as not relevant during the linear review, or a document with a low Relevance Score but tagged as relevant, would fall within the quality control scope (as in the sketch below).
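As a sketch of how that quality control scope could be assembled, the snippet below flags documents where the model's score and the reviewer's coding disagree. The thresholds, field names, and sample records are illustrative assumptions, not any platform's actual schema.

```python
# Flag QC candidates: high score coded not relevant, or low score coded
# relevant. The 0.8 / 0.2 thresholds are arbitrary illustrations.
review_log = [
    {"id": "DOC-001", "score": 0.92, "coded_relevant": False},  # mismatch
    {"id": "DOC-002", "score": 0.08, "coded_relevant": True},   # mismatch
    {"id": "DOC-003", "score": 0.88, "coded_relevant": True},
    {"id": "DOC-004", "score": 0.11, "coded_relevant": False},
]
HIGH, LOW = 0.8, 0.2
qc_scope = [d for d in review_log
            if (d["score"] >= HIGH and not d["coded_relevant"])
            or (d["score"] <= LOW and d["coded_relevant"])]
print([d["id"] for d in qc_scope])   # -> ['DOC-001', 'DOC-002']
```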
As set out above, the benefits of utilizing Continuous Active Learning are 1) reducing cost and 2) improving accuracy. One can either continue to review documents until the machine stops serving up likely relevant documents or manually stop reviewing once satisfactory results are obtained. Either way, time and cost are reduced by eliminating a large part of the review pool from manual review.
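The validation test mentioned earlier reduces to simple arithmetic. Below is a hypothetical elusion-style estimate; every figure is invented for illustration.

```python
# Hypothetical validation sample over the documents CAL would leave
# unreviewed: extrapolate the sample's relevance rate to the whole pool.
unreviewed = 80_000          # documents below the relevance cutoff
sample_size = 1_500          # random sample drawn for validation review
relevant_found = 6           # coded relevant during the validation review

elusion_rate = relevant_found / sample_size       # 0.4%
estimated_missed = elusion_rate * unreviewed      # ~320 documents
print(f"Elusion rate {elusion_rate:.2%}; roughly "
      f"{estimated_missed:.0f} relevant documents would be left behind.")
```

If the estimate is acceptable to the parties, the review can stop; if not, the model serves more documents and the test is repeated.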
Continuous Active Learning also has several limitations:
- Base Volume. TAR methodologies in general, and Continuous Active Learning in particular, are not effective for small volumes of documents. While using TAR for a small document population is possible, the resulting cost and time savings would likely be insignificant compared to a manual review of the entire scope.
- Document Type. Continuous Active Learning is built based on the textual information of each document. Documents without textual information, such as images, audio files, and movies, are not suitable for the active learning process. Further, the AI models have difficulty handling documents with an extremely small or large amount of text. Moreover, spreadsheets such as Excel files are often excluded from the process because the AI models have difficulty grasping the context of highly structured documents. Documents excluded for the above reasons must be manually reviewed separately from the Continuous Active Learning process.
- Document Family. A “document family” is a group of associated files (e.g., an email and its attachments belong to the same document family). In most litigation cases and regulatory investigations, a party is requested to produce documents on a family basis, so linear document review is often conducted by family. With Continuous Active Learning, however, documents are scored and served for review on an individual document basis. While a reviewer may access and review a document's family members on the review platform, doing so compromises the efficiency of the review. Some Continuous Active Learning tools can present entire families in the review queue while still presenting the AI's relevance prediction on a document-by-document basis (see the sketch after this list), but the accuracy and usability of a by-family queue may not be as good as those of a by-document queue; most experts would recommend running Continuous Active Learning on a by-document queue.
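To illustrate the by-family workaround described above, here is a sketch that keeps relevance scores per document but serves whole families, ranked by the highest score among their members. The data layout and field names are assumptions for illustration.

```python
from collections import defaultdict

# Each record: document id, family id, and the model's relevance score.
docs = [
    {"id": "EML-01", "family": "F1", "score": 0.91},  # parent email
    {"id": "ATT-01", "family": "F1", "score": 0.30},  # its attachment
    {"id": "EML-02", "family": "F2", "score": 0.55},  # standalone email
]

families = defaultdict(list)
for d in docs:
    families[d["family"]].append(d)

# Serve whole families, highest-scoring family first, so reviewers see
# attachments in context while scores stay per-document.
queue = sorted(families.values(),
               key=lambda fam: max(d["score"] for d in fam),
               reverse=True)
for fam in queue:
    print([(d["id"], d["score"]) for d in fam])
# -> family F1 (max 0.91) is served before F2 (0.55)
```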
Conclusion
Several tools offer Continuous Active Learning solutions, including Relativity, Brainspace, NexLP, Everlaw, and others. Despite minor differences in their detailed features, these tools have much in common. The bottom line is not which tool to use, but whether Continuous Active Learning suits a particular case and how it should be utilized to meet the review's needs. These decisions require insight into the technology and experience with actual cases. Trained professionals can assist law firms and in-house departments in designing a predictive coding workflow strategy and utilizing technology to achieve best practice.
[1] Anne Kershaw authored “Automated Document Review Proves Its Reliability” in 2005; the National Institute of Standards and Technology (NIST) began providing information about TAR in 2006.