A number of analytic tools can streamline a document review project, improving both efficiency and accuracy. We’ve already discussed some of these tools (see the April 26 post on predictive coding and the May 16 post on email threading). Today’s post focuses on another, related tool: document clustering.
What is Document Clustering?
As you can imagine, the way a cache of documents is organized for review can make a tremendous difference in both the efficiency and the accuracy of the review. Clustering software examines the text in documents, determines which documents are related to each other, and groups them into clusters. It is the electronic equivalent of putting your documents into labeled boxes, so that things end up in the same box only if they belong together. Because the clusters reflect the structure that arises naturally from the documents, no query terms are needed. Each cluster can then be assigned to the same reviewer(s), so that related documents are reviewed together.
The software labels each cluster with a set of keywords that tell you, the project lead, what the documents have in common at a conceptual level, allowing you to easily identify the themes of your document set. For example, if you are a litigator looking for information about a particular contract and a cluster is about the company’s summer softball team, the documents in that cluster are not relevant. During review, you can, with a single mouse click, categorize or tag a single document, a cluster of documents, or a set of clusters containing a specific combination of keywords. *
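For readers curious about what happens under the hood, here is a minimal sketch of the idea in Python. It is an illustration only, not how any particular review platform works: the function names, the tiny stop-word list, the 0.3 similarity threshold, and the four sample documents are all assumptions made up for this example. It compares documents by the words they share, groups similar ones, and labels each group with its most frequent words.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real tools use far larger ones.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "for", "is", "in", "on", "at", "with"}


def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_documents(docs, threshold=0.3):
    """Single-pass clustering: each document joins the most similar existing
    cluster, or starts a new cluster if none reaches the threshold."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [doc ids]}
    for doc_id, text in docs.items():
        vec = Counter(tokenize(text))
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": vec.copy(), "members": [doc_id]})
        else:
            best["centroid"].update(vec)  # add this document's word counts
            best["members"].append(doc_id)
    return clusters


def label_cluster(cluster, n=3):
    """Label a cluster with its n most frequent words."""
    return [word for word, _ in cluster["centroid"].most_common(n)]


# Made-up sample documents, for illustration only.
docs = {
    "DOC-001": "Draft supply contract with indemnification clause attached for review",
    "DOC-002": "Softball team practice Thursday, bring gloves",
    "DOC-003": "Revised supply contract: indemnification clause updated",
    "DOC-004": "Softball team roster and practice schedule",
}

clusters = cluster_documents(docs)
for cluster in clusters:
    print(label_cluster(cluster), cluster["members"])
```

With these four sample documents, the two contract emails land in one cluster and the two softball emails in another, each labeled by the words its members share. Commercial tools typically rely on more sophisticated text analysis, but the workflow is the same: group first, then read the labels to decide where to look.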
*Some clustering software also offers automatic categorization: all documents sufficiently similar to a set of documents can be categorized the same way. This greatly reduces the labor needed when new documents are added to a case, because it lets you leverage the work you’ve already put into categorizing the earlier documents.
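The automatic categorization described in the note above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not any vendor's actual method: the function names, the 0.5 similarity threshold, the tag names, and the sample documents are all hypothetical. Each new document inherits the tag of the most similar already-reviewed document, and anything that is not similar enough is left untagged for a human reviewer.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn text into a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest_tags(reviewed, incoming, threshold=0.5):
    """For each incoming document, suggest the tag of the most similar
    reviewed document, or None if nothing reaches the threshold."""
    reviewed_vecs = [(vectorize(text), tag) for text, tag in reviewed]
    suggestions = {}
    for doc_id, text in incoming.items():
        vec = vectorize(text)
        best_tag, best_sim = None, threshold
        for reviewed_vec, tag in reviewed_vecs:
            sim = cosine(vec, reviewed_vec)
            if sim >= best_sim:
                best_tag, best_sim = tag, sim
        suggestions[doc_id] = best_tag  # None means "needs human review"
    return suggestions


# Made-up examples: documents a reviewer has already tagged...
reviewed = [
    ("Draft supply contract with indemnification clause", "Responsive"),
    ("Softball team practice schedule for Thursday", "Not Responsive"),
]
# ...and documents newly added to the case.
incoming = {
    "DOC-101": "Updated supply contract indemnification clause",
    "DOC-102": "Softball team practice schedule moved",
    "DOC-103": "Quarterly parking garage notice",
}

suggestions = suggest_tags(reviewed, incoming)
for doc_id, tag in suggestions.items():
    print(doc_id, tag)
```

In this example the new contract email picks up the "Responsive" tag, the new softball email picks up "Not Responsive", and the parking notice, which resembles nothing already reviewed, is left for a person to categorize. The threshold controls that trade-off: a higher value means fewer automatic tags but fewer mistakes.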