Multilabel Thai property-related offences

Name: Multilabel Thai property-related offences
Creator: Sirawit Chokphantavee
License: https://creativecommons.org/licenses/by/4.0/
Keywords: Artificial Intelligence

Citation Author(s):
Sirawit Chokphantavee (Sirindhorn International Institute of Technology, Thammasat University)
Sorawit Chokphantavee (Sirindhorn International Institute of Technology, Thammasat University)
Submitted by: Sirawit Chokphantavee
Last updated: Mon, 02/10/2025 - 11:44
DOI: 10.21227/chhq-e465
Data Format: *.csv

Categories: Artificial Intelligence

Keywords: NLP, artificial intelligence, law

Abstract

Legal analysis using natural language processing (NLP) and machine learning (ML) is a difficult undertaking that has recently attracted interest from both academia and industry. Using a human-annotated dataset of Supreme Court decisions summarized in colloquial Thai, this work investigates different combinations of NLP, ML, and rule-based techniques for accurate legal case analysis under Thai law, specifically property-related offences, with the aim of imitating a lawyer's cognitive process. We experimented with two tasks, binary and multi-label classification, each evaluated using five-fold cross-validation. For the binary task, we achieved an average accuracy of 94.2% and an average F1-score of 96.7%, and found that vanilla fastText, a static embedding, is sufficient on its own for this task. For multi-label classification, we obtained 82% average zero-one accuracy and 92% average Hamming accuracy using a fine-tuned joint embedding classification pipeline with rule-based post-processing, an improvement over the pipeline without the rule-based step. This highlights the potential of combining symbolic information from a rule-based algorithm with statistical computation from machine learning when performing complex legal analysis.
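
The two multi-label metrics named in the abstract can be illustrated with a minimal sketch. It assumes the usual definitions: zero-one accuracy is the share of cases whose entire label vector is predicted exactly, and Hamming accuracy is the share of individual label slots predicted correctly (one minus the Hamming loss). The function names and the toy label vectors are illustrative only and are not taken from the authors' evaluation code.

```python
from typing import Sequence


def zero_one_accuracy(y_true: Sequence[Sequence[int]],
                      y_pred: Sequence[Sequence[int]]) -> float:
    """Fraction of samples whose full label vector matches exactly."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true)


def hamming_accuracy(y_true: Sequence[Sequence[int]],
                     y_pred: Sequence[Sequence[int]]) -> float:
    """Fraction of individual label slots predicted correctly."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = 0
    total = 0
    for t, p in zip(y_true, y_pred):
        if len(t) != len(p):
            raise ValueError("label vectors must have equal length")
        correct += sum(1 for a, b in zip(t, p) if a == b)
        total += len(t)
    return correct / total


# Toy example (made-up labels), ordered as Sections 334, 336, 339, 340.
y_true = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1]]
y_pred = [[1, 0, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1]]

print(zero_one_accuracy(y_true, y_pred))  # 2 of 3 cases match exactly
print(hamming_accuracy(y_true, y_pred))   # 11 of 12 label slots match
```

Zero-one accuracy is the stricter of the two, since a single wrong label makes the whole case count as an error. That is consistent with the lower figure reported for it in the abstract.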

Instructions:

Dataset Name: Thai Property-Offence Dataset

Description:
This dataset contains 120 legal case descriptions related to property-related offences in Thailand. Each entry includes a reference to a Supreme Court decision, a text prompt describing the case, and binary labels indicating the presence of specific legal provisions. The dataset is useful for legal NLP tasks such as classification and case analysis.

Columns:

Supreme Court Decision No. – The reference number of the Supreme Court ruling.
Prompt – A textual description of the case, written in Thai.
Section 334 – Binary indicator (1 or 0) for whether the case involves theft under Section 334 of the Thai Criminal Code.
Section 336 – Binary indicator for whether the case involves snatching under Section 336.
Section 339 – Binary indicator for whether the case involves robbery under Section 339.
Section 340 – Binary indicator for whether the case involves gang robbery under Section 340.
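
The column layout above can be read with Python's standard `csv` module, as in the sketch below. It assumes the CSV header row uses exactly the column names listed above, which has not been verified against the actual file. The `load_cases` helper and the two sample rows are hypothetical placeholders, not real entries from the dataset.

```python
import csv
import io

# Assumed header names, copied from the column list above.
ID_COLUMN = "Supreme Court Decision No."
TEXT_COLUMN = "Prompt"
LABEL_COLUMNS = ["Section 334", "Section 336", "Section 339", "Section 340"]


def load_cases(handle):
    """Read cases from an open CSV handle into a list of dicts."""
    cases = []
    for row in csv.DictReader(handle):
        labels = [int(row[name]) for name in LABEL_COLUMNS]
        if any(value not in (0, 1) for value in labels):
            raise ValueError(f"non-binary label in case {row[ID_COLUMN]}")
        cases.append({
            "decision_no": row[ID_COLUMN],
            "prompt": row[TEXT_COLUMN],
            "labels": labels,
        })
    return cases


# Placeholder rows for illustration only (not real dataset content).
SAMPLE = (
    "Supreme Court Decision No.,Prompt,"
    "Section 334,Section 336,Section 339,Section 340\n"
    "0000/0000,placeholder case text A,1,0,0,0\n"
    "0000/0001,placeholder case text B,1,1,0,0\n"
)

cases = load_cases(io.StringIO(SAMPLE))
for case in cases:
    print(case["decision_no"], case["labels"])
```

To read the downloaded file instead of the in-memory sample, pass a handle opened with `encoding="utf-8"` and `newline=""`, so that the Thai text in the Prompt column and any quoted line breaks are decoded correctly.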