Chinese AI and Law dataset (CAIL2018)
- What does it claim to do?
- Substantiation of claims & potential issues
- Is it currently in use?
- The creators
What does it claim to do?
CAIL2018 is the Chinese AI and Law challenge dataset. It was created for the purposes of encouraging research into how :machine learning can assist in the process of Legal Judgment Prediction (LJP). For the authors, LJP is about enabling machines to predict the outcome of legal cases by reference to the descriptions of fact set out in those cases. The dataset was released in 2018 as part of the CAIL2018 competition. The competition, which attracted more than 200 participants, focussed on how :natural language processing improves performance in LJP tasks. It presented competitors with three subtasks. These were the (1) prediction of applicable law articles, (2) charges, and (3) prison terms by reference to the descriptions of facts for the cases forming part of the training data of the CAIL2018 dataset.
Claimed essential features
- Create a large-scale dataset contaning processed data of China Judgments Online, an online repository established by the Supreme People’s Court of China.
- Provide a dataset of charges, law articles and prison terms used in Chinese criminal cases.
“CAIL2018 contains 2,676,075 criminal cases, which are annotated with 183 criminal law articles and 202 criminal charges.” (Xiao, C. et al, 2018)
Claimed rationale and benefits
- To facilitate further research in the field of legal judgment prediction.
” … there is not a publicly accessible high-quality dataset for LJP yet. Therefore, we collect and release the first large-scale dataset for LJP, i.e., CAIL2018, to encourage further explorations on this task and other advanced legal intelligence algorithms.” (Xiao, C. et al, 2018)
“Comparing with other datasets used by existing LJP works, CAIL2018 is on a larger scale and reserves richer annotations of judgment results.” (Xiao, C. et al, 2018)
Claimed design choices
- Each datapoint consists of a case description and three target attributes (labels) the law article cited, charges, and the prison term. The three target attributes correspond to the three subtasks in the CAIL competition. The target attributes are extracted from the original case description using :regular expressions.
- Law article prediction and charge prediction are framed as text classification tasks, prison term prediction is framed as a regression task in the CAIL competition.
- Only criminal cases were selected from China Judgments Online.
- The cases that would have very infrequent charge or law articles labels are filtered out.
- Cases with multiple defendants were also filtered out to reduce the complexity of the LJP task.
- The dataset includes the fact description (used as input in the LJP task) and the target attributes namely applicable law articles, charges, and prison terms.
Substantiation of claims & potential issues
- Much research in the field of ‘legal judgment prediction’ does not tackle prediction (in the sense of forecasting) at all. The CAIL2018 dataset does not offer data which enables the prediction of court decisions that have not yet been made. The term ‘prediction’ may mislead lawyers and policymakers into thinking the field of forecasting judgments is more advanced than it in fact is.
- The original documents already contain information about the labels (legal norms cited, charges, and prison term), so the value to legal practitioners of predicting those existing labels is not evident.
- The descriptions of the facts come from the court judgments, which may not be representative of the facts as set out prior to judgment. They may therefore be an incomplete or partial account of what actually happened.
- The dataset does not include the time period in which the judgments were made, suggesting that predictions made using this dataset cannot take into account that legal norms and their interpretations change over time.
The dataset is described in two papers (Xiao, C. et al, 2018; Zhong, H. et al, 2018) and on the Github page for the 2018 Chinese AI and Law Challenge Competition, where the dataset can be downloaded. A preview of the dataset is available on Hugging Face.
- The dataset consists of data collected from China Judgments Online, published by the Supreme People’s Court of China.
- The time span of the data is not specified.
- The data are stored in a JSON dataset format.
- A preview is available on Hugging Face (archived).
- The full dataset is available on Github (archived).
- “There are two parts of our dataset called CAIL2018-Small and CAIL2018-Large.” (Chinese AI and Law Challenge Competition; archived), that contain 196,000 and 1.5 million cases respectively.
The authors provide some information about how the dataset was constructed. However, no information is provided about how the data was collected (whether, for example, it was scraped from China Judgments Online or downloaded in batch). No information is provided about whether, and if so how, the data was cleaned. The authors provide no information about the completeness of China Judgments Online as a data source.
The dataset has been constructed as follows:
- 5,730,302 criminal documents were collected from Chinese judgments.
- The data is filtered on ‘judgment’ documents, using available metadata.
- The data was filtered to remove cases with more than a single defendant; cases “with those charges and law articles whose frequency is smaller than 30”; and law articles and charges associated with the “top 102 law articles” in Chinese criminal law. (Xiao, C. et al, 2018)
- The target attributes (law articles, charges and prison terms) are constructed using :regular expressions on the text. It is not known if there is a quality assessment step in case of contradictory candidates or if these data samples were automatically excluded.
- The final dataset: “CAIL2018 contains 2,676,075 criminal cases, which are annotated with 183 criminal law articles [labels] and 202 criminal charge [labels].” The prison terms range from 0 to 25 with information on
death_penalty. (Xiao, C. et al, 2018)
- One of the attributes is the name of the defendant. The data does not appear to have been anonymised.
- It is not known on which criteria the final set is split into the CAIL2018-small and CAIL2018-large datasets.
The attributes of the dataset, along with a short textual description, are set out in Figure 1 below.
An example of the data is shown in Figure 2 below.
The authors also provide an example in tabular form (Figure 3):
Figure 3: an example of the data in tabular form (Xiao, C. et al, 2018)
The dataset is used for a Chinese AI and Law Competition in predicting charges, relevant articles and term of penalty.
- The examples of the data don’t show a specific focus on the time period in which the judgment is made. This suggests that any system used to make predictions using this dataset cannot take into account that the laws and interpretations of law change over time.
The original documents already contain the information about the labels, so it is not clear how predicting those labels is helpful for a legal professional.
The authors do not provide an explanantion of how this experiment could be used to predict actual decisions that will be made by the Chinese courts in the future.
Court judgments are generally compiled after the decision has been made, therefore the facts of the case are not necessarily representative of the description of the facts prior to the final judgment.
- The authors do not provide any data to be able to predict decisions of the court that have not been made yet.
Rationale and benefits
- Given the data used for this text classification task it is clear that the system is unable to actually predict future cases. The authors present a dataset of facts from already made judgments. In order to actually forecast future decisions of the court the system would require data that was available before the ‘predicted’ judgment was made (e.g. case law from a lower court).
- Xiao, C., Zhong, H., Guo, Z., Tu, C., Liu, Z., Sun, M., Feng, Y., Han, X., Hu, Z., Wang, H. and Xu, J., 2018. Cail2018: A large-scale legal dataset for judgment prediction. arXiv preprint arXiv:1807.02478.
- Zhong, H., Xiao, C., Guo, Z., Tu, C., Liu, Z., Sun, M., Feng, Y., Han, X., Hu, Z., Wang, H. and Xu, J., 2018. Overview of CAIL2018: Legal judgment prediction competition. arXiv preprint arXiv:1810.05851.
- Chinese AI and Law Challenge Competition; archived
Is it in current use?
Under active maintenance?
Available on Huggingface datasets?
Available on PapersWithCode?
Citations of the dataset’s paper in Google Scholar
94 citations (Google Scholar; archived).Top
Authors and affiliation, year of construction
The dataset was constructed in 2018.
The authors of the dataset are
- Chaojun Xiao, Department of Computer Science and Technology, Tsinghua University, China
- Haoxi Zhong, Department of Computer Science and Technology, Tsinghua University, China
- Zhipeng Guo, Department of Computer Science and Technology, Tsinghua University, China
- Cunchao Tu, Department of Computer Science and Technology, Tsinghua University, China
- Zhiyuan Liu, Department of Computer Science and Technology, Tsinghua University, China
- Maosong Sun, Department of Computer Science and Technology, Tsinghua University, China
- Yansong Feng Institute of Computer Science and Technology, Peking University, China
- Xianpei Han, Institute of Software, Chinese Academy of Sciences, China
- Zhen Hu, China Justice Big Data Institute
- Heng Wang, China Justice Big Data Institute
- Jianfeng Xu, Supreme People’s Court, China
According to the Astana International Financial Centre, Dr Jianfeng Xu is ‘the Director of the Information Center of the Supreme People’s Court of China, the Chairman of China Judicial Big Data Research Institute, and co-dean of the School of Information Management for Law of China University of Political Science and Law.’ His background is in Computer Science. (Jianfeng Xu; archived) The dataset is associated with the CAIL2018 competition.Top
Background of developers
Target legal domains