Data-driven

Natural language processing

Natural language processing is a field on the intersection of linguistics and computer science that focuses on computational techniques that allow the processing of (i.e analysis and generation of) natural language in the form of text or speech.

Related terms: regular expressions, machine learning

Typology examples: Jus Mundi, Automatic Catchphrase Identification from Legal Court Case Documents (Mandal et al. 2017), Exploring the Use of Text Classification in the Legal Domain (Sulea et al. 2017), JURI SAYS

Further reading

Bird, S., Klein, E., & Loper, E. Natural language processing with Python: analyzing text with the natural language toolkit. “ O’Reilly Media, Inc.”. (2009)
Jurafsky, D., & Martin, J. H. ‘Speech and Language Processing: An introduction to speech recognition, computational linguistics and natural language processing.’ Upper Saddle River, NJ: Prentice Hall. (2008) (3rd edition draft available online)
Indurkhya, N., & Damerau, F. J. Handbook of natural language processing. Chapman and Hall/CRC. (2010)

Top

Machine learning

Machine learning is an omnibus term for a computer program or model that uses historical data to make predictions for new (i.e. unseen) data. In more technical terms, machine learning is the process of approximating a function that maps the model’s input (e.g., the text of legal documents, converted into a computer-suitable numerical representation) to the labels (e.g., ‘defendant won’ or ‘prosecution won’, or the amount of a fine). There are different types of machine learning, e.g., supervised, unsupervised, reinforcement learning. The most common approach in the legal domain is supervised machine learning.

Typology examples: Moonlit, Lex Machina, Chinese AI and Law dataset

Further reading

Ashley, K. D. Artificial intelligence and legal analytics: new tools for law practice in the digital age. Cambridge University Press. (2017)

Top

Supervised machine learning

Supervised machine learning is a subcategory of machine learning, that requires labelled data. Within supervised machine learning there are two types of predictive models: classification models, and regression models. Classification is the task of predicting a discrete class label (e.g., whether or not a case will result in an eviction) for a specific input, whereas regression is the task of predicting a continuous quantity (e.g., the length of a prison sentence, in months).

Typology examples: Predicting Brazilian court decisions (Lage-Freitas et al. 2019), Exploring the Use of Text Classification in the Legal Domain (Sulea et al. 2017), JURI SAYS, JusticeBRD

Further reading

Chapter 6 of Bird, S., Klein, E., & Loper, E. Natural language processing with Python: analyzing text with the natural language toolkit. “ O’Reilly Media, Inc.”. (2009)

Top

Unsupervised machine learning

Unsupervised machine learning is a subcategory of machine learning for analysing and clustering unlabelled data. Such algorithms are design to find patterns and group data together without having assigned labels to rely on.

Related terms: supervised machine learning

Typology examples: Statutory Article Retrieval Dataset (BSARD), Automatic Catchphrase Identification from Legal Court Case Documents (Mandal et al. 2017), Squirro

Further reading

Blogpost by Divish Dayal. Unsupervised NLP : Methods and Intuitions behind working with unstructured texts

Top

Features

Features are the input which a machine learning model uses to determine its classification. In the case of legal judgement classification, features are extracted from the text and may be (for example) individual words or sequences of words. In order to make these interpretable for a machine learning algorithm, each feature is converted into a series of numbers (i.e. a vector). The features are most commonly extracted automatically (using a large piece of text), but can also be chosen manually. For instance, if one wants to predict court decisions these could be specific variables, such as judges, precedents, articles of law, etc.

Related terms: machine learning, tf-idf

Typology examples: Exploring the Use of Text Classification in the Legal Domain (Sulea et al. 2017), JURI SAYS

Further reading

Blogpost by Danny Butvinik Feature Selection — Exhaustive Overview

Top

N-grams

N-gram is a sequence of words (or characters). Single words are called unigrams, sequences of two words are dubbed bigrams, and sequences of three consecutive words are called trigrams. If we split the sentence By a decision of 4 March 2003 the Chamber declared the application admissible. into bigrams (i.e. two consecutive words) the extracted features consist of:

By a, a decision, decision of, of 4, 4 March, March 2003, 2003 the, the Chamber, Chamber declared, declared the, the application, application admissible, admissible .

For trigrams, the features consist of:

By a decision, a decision of, decision of 4, of 4 March, 4 March 2003, March 2003 the, 2003 the Chamber, the Chamber declared, Chamber declared the, declared the application, the application admissible, application admissible .

Related terms: features, tf-idf

Typology examples: Exploring the Use of Text Classification in the Legal Domain (Sulea et al. 2017), JURI SAYS

Further reading

Chapter 3 of Jurafsky, D., & Martin, J. H. ‘Speech and Language Processing: An introduction to speech recognition, computational linguistics and natural language processing.’ Upper Saddle River, NJ: Prentice Hall. (2008) (3rd edition draft available online)

Top

Tf-idf

The underlying idea is that the n-grams characteristic of a certain case will only occur in a few cases, whereas common, uncharacteristic n-grams will occur in many cases.

Related terms: features, n-grams

Typology examples: Jus Mundi, Exploring the Use of Text Classification in the Legal Domain (Sulea et al. 2017), JURI SAYS

Further reading

Chapter 6 of Jurafsky, D., & Martin, J. H. ‘Speech and Language Processing: An introduction to speech recognition, computational linguistics and natural language processing.’ Upper Saddle River, NJ: Prentice Hall. (2008) (3rd edition draft available online)

Top

Support Vector Machines

Support Vector Machines or SVMs is a supervised machine learning classifier commonly used in natural language processing (i.e. when working with text). To classify the data an SVM tries to split the data points, based on their labels in the dataset (i.e. the training data). Specifically, it will determine the simplest (linear) equation that separates differently-labelled data points from each other with the least amount of error. To make the prediction the decides on the best hyperplane (i.e. a line in multiple dimensions) to separate the data. The support vectors are the data points nearest to this line. The goal of the SVM algorithm is to choose the position of the hyperplane in such a way that the largest possible margin with respect to the support vectors is achieved. This allows for a greater chance to classify new (i.e. unseen) data correctly. While non-linear SVMs are possible, linear SVMs are most commonly used in text classification.

Related terms: supervised machine learning

Typology examples: Exploring the Use of Text Classification in the Legal Domain (Sulea et al. 2017), JURI SAYS

Further reading

Blogpost by Rushikesh Pupale. Support Vector Machines(SVM) — An Overview

Top

Naive Bayes

Naive Bayes (NB) is a frequently-used classification technique in NLP. The NB classifier is based on Bayes’ rule. This rule states that the posterior probability of an outcome O given a word W, is equal to the likelihood of a word being present in the text of a certain outcome, multiplied by the (prior) probability of that outcome. This value then is divided by the probability of the word. However, this latter probability is usually held constant, as the goal is to determine the optimal outcome given a predetermined set of words. The NB classifier calculates the probability of each feature using Bayes rule, but under the (naive) assumption that all features are independent. This means that the NB algorithm does not take the word order into account.

Related terms: supervised machine learning

Typology examples: JusticeBRD

Further reading

Chapter 6 of Bird, S., Klein, E., & Loper, E. Natural language processing with Python: analyzing text with the natural language toolkit. “ O’Reilly Media, Inc.”. (2009)
Blogpost by By Nagesh Singh Chauhan. Naïve Bayes Algorithm: Everything You Need to Know

Top

Neural Networks

A neural network consists of nodes, which together form layers, and weights (connecting the nodes), which are learned during training. A neural network has to contain an input layer, an output layer, and a hidden layer. However, neural networks can have many more hidden layers; the more layers, the deeper the network. The feature vectors from the training dataset are fed into the model. They are then adjusted in each layer of the model, using weights (i.e. values by which the input from the previous layer is multiplied) and activation functions, which are mathematical functions that convert the nodes input into a new value. The activation function may differ, depending on the specific architecture of the model. Moreover, the size of the hidden layer does not have to be the same as that of the input layer, and the amount of nodes in the hidden layer is chosen before training. Each value from the input layer is adjusted according to the weights, which contributes to each node in the hidden layer.

An example of multilayer perceptron neural network

Related terms: machine learning, supervised machine learning

Typology examples: Mapping BITs RNN framework and dedicated tool, Polisis, Casetext

Further reading

Chapter 7 of Jurafsky, D., & Martin, J. H. ‘Speech and Language Processing: An introduction to speech recognition, computational linguistics and natural language processing.’ Upper Saddle River, NJ: Prentice Hall. (2008) (3rd edition draft available online)
Aikenhead, M. (1996). Uses and abuses of neural networks in law, the. Santa Clara Computer & High Tech. LJ, 12, 31.

Top

Language model

A language model is a probability distribution over words in a text. It calculates these probabilities by using two strategies at the same time. The first strategy is masking words (i.e. replacing random words with a placeholder, such as [MASK]), trying to predict the missing word based on other words in the sentence, and then calculating its probability. The second strategy uses two sentences, and tries to predict whether the second one follows the first in the text. The advantage of this model is that it can be fine-tuned (i.e. adapted) for a specific task, such as classifying court decisions, by adding an additional layer on top.

Related terms: transformers

Typology examples: Della, Casetext, Contract Understanding Atticus Dataset

Further reading

Blogpost by Synced. Language Model: A Survey of the State-of-the-Art Technology
Blogpost by Ömer Faruk Tuna. Introduction to Language Modelling and Deep Neural Network Based Text Generation

Top

Attention

Attention allows the neural network to focus on, or pay attention to, more important parts of the input sequence by assigning higher weights, and thereby suppressing other (less important) parts.

Related terms: transformers

Typology examples: Della, Casetext, Contract Understanding Atticus Dataset

Further reading

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. ‘Attention is all you need.’ Advances in neural information processing systems, 30. (2017)
Blogpost by Shashank Yadav. Understanding Attention Mechanism

Top

Transformers

Transformers consist of a sequence-to-sequence architecture combined with attention (see above). This sequence-to-sequence approach is fundamental for tasks such as machine translation, since both the input and output must be a sequence (of words). An example of a popular legal domain task that often uses a sequenceto-sequence approach is legal text summarisation. One of the most well-known transformer models is BERT.

Related terms: language model

Typology examples: Westlaw Edge

Further reading

Blogpost by Samia Khalid. BERT Explained: A Complete Guide with Theory and Tutorial
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. ‘BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding’ In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Top

F1-score

F1-score is the harmonic mean (a type of average) of the two measures precision and recall. For a class, precision is the number of documents for which the assigned label is correct. Recall is the percentage of documents with a specific outcome, which are classified correctly by the system.

Related terms: machine learning

Typology examples: Predicting Brazilian court decisions (Lage-Freitas et al. 2019), Della, Automatic Catchphrase Identification from Legal Court Case Documents (Mandal et al. 2017)

Further reading

Chapter 4 of Jurafsky, D., & Martin, J. H. ‘Speech and Language Processing: An introduction to speech recognition, computational linguistics and natural language processing.’ Upper Saddle River, NJ: Prentice Hall. (2008) (3rd edition draft available online)

Top

Elasticsearch

Elasticsearch is a search engine that is commonly used as the basis for legal search engines. Elasticsearch index document is represented as a combination of tf-idf word representations (see above) and meta-information that has been created by different machine learning systems (e.g. citations, thematic tags, names/dates, etc.) Each type of information is stored in a different ‘field’. When a query is compared to the Elasticsearch index, it is matched against the relevant fields. The sum of all the matching scores constitutes the document score and thus its relevant rank compared to the other matched documents. Elasticsearch (https://www.elastic.co/) is an open-source framework to construct fast and highly scalable search engines.

Related terms: tf-idf

Typology examples: Casetext, Squirro

Further reading

Elasticsearch

Top

Decision tree

A decision tree is a tree where each node represents a feature, each branch represents a decision rule, and each leaf represents the predicted label. A decision tree can be created manually or devised using machine learning.

Related terms: supervised machine learning

Typology examples: TreeAge, Moonlit

Further reading

Chapter 6 of Bird, S., Klein, E., & Loper, E. Natural language processing with Python: analyzing text with the natural language toolkit. “ O’Reilly Media, Inc.”. (2009)
Blogpost by Prashant Gupta. Decision Trees in Machine Learning

Top

Regular expressions

Regular expression is a series of text and special characters that specifies a pattern to be searched in the text. Regular expressions (often abbreviated as grep or regex) can be used in most text editors.

Related terms: natural language processing

Typology examples: Lex Machina, Chinese AI and Law dataset, Statutory Article Retrieval Dataset (BSARD)

Further reading

RegExr: Learn, Build, & Test RegEx
Chapter 3 of Bird, S., Klein, E., & Loper, E. Natural language processing with Python: analyzing text with the natural language toolkit. “ O’Reilly Media, Inc.”. (2009)

Top

Named Entity Recognition

Named entity recognition is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc. Named entity recognition aims to find highly variable information (such as names) that tend to occur in similar contexts. For example the word ‘Mr.’ is often followed by a name.

Related terms: machine learning, supervised machine learning

Typology examples: Lex Machina

Further reading

Chapter 7 of Bird, S., Klein, E., & Loper, E. Natural language processing with Python: analyzing text with the natural language toolkit. “ O’Reilly Media, Inc.”. (2009)
Vardhan, H., Surana, N., & Tripathy, B. K. Named-entity recognition for legal documents. In International Conference on Advanced Machine Learning Technologies and Applications (pp. 469-479). Springer, Singapore. (2020)

Network Analysis

Network Analysis is a technique that is used to understand large-scale complex connected systems. A network is represented by nodes and edges, where the nodes represent objects (e.g. court decisions), and edges represent relationship between them (e.g. one court decision refers to another court decision). Network Analysis is then used to analyse the structure within networks and find patterns. For instance, one may be able to find most influential (i.e. most cited cases) in a specific legal domain. A network also allows to explore the path from one node to another rather than a single nodes and its edges. This can, for instance, be used to explore how the law evolved over time.

Network analysis can, but does not necessarily have to, involve machine learning.

Typology examples: Moonlit, Westlaw Edge

Further reading

van Kuppevelt, D., & van Dijck, G. Answering Legal Research Questions About Dutch Case Law with Network Analysis and Visualization. In A. Wyner, & G. Casini (Eds.), Legal Knowledge and Information Systems (Vol. 302, pp. 95-100). IOS Press. Frontiers in Artificial Intelligence and Applications. (2017)
Coupette, C., Beckedorf, J., Hartung, D., Bommarito, M., & Katz, D. M. Measuring Law Over Time: A network analytical framework with an application to statutes and regulations in the United States and Germany. Frontiers in Physics, 9, 658463. (2021)

Top

Project Publications

Data-driven

On this page

Natural language processing

Machine learning

Supervised machine learning

Unsupervised machine learning

Features

N-grams

Tf-idf

Support Vector Machines

Naive Bayes

Neural Networks

Language model

Attention

Transformers

F1-score

Elasticsearch

Decision tree

Regular expressions

Named Entity Recognition

Network Analysis