Multilabel Thai property-related offences
- Submitted by: Sirawit Chokpha...
- Last updated: Mon, 02/10/2025 - 06:44
- DOI: 10.21227/chhq-e465
Abstract
Legal analysis using natural language processing (NLP) and machine learning (ML) is a difficult undertaking that has recently attracted interest from both academia and industry. Using a human-annotated dataset of Supreme Court decisions summarized into colloquial Thai, this work investigates different combinations of NLP, ML, and rule-based techniques for accurate legal case analysis under Thai law, specifically for property-related offences, with the aim of imitating a lawyer's cognitive process. We experimented with two major tasks, binary and multi-label classification, each evaluated with five-fold cross-validation. For the binary task, we achieved an average accuracy of 94.2% and an average F1-score of 96.7%, and found that vanilla fastText, a static embedding, is sufficient on its own for this task. For multi-label classification, we obtained 82% average zero-one accuracy and 92% average Hamming accuracy with a fine-tuned joint embedding classification pipeline that incorporates rule-based post-processing, an improvement over the same pipeline without the rule-based step. This highlights the possibility of integrating the symbolic information of a rule-based algorithm with the statistical computation of machine learning when performing a complex legal analysis task.
Dataset Name: Thai Property-Offence Dataset
Description:
This dataset contains 120 legal case descriptions concerning property-related offences in Thailand. Each entry includes a reference to a Supreme Court decision, a text prompt describing the case, and binary labels indicating which legal provisions apply. The dataset is useful for legal NLP tasks such as classification and case analysis.
Columns:
- Supreme Court Decision No. – The reference number of the Supreme Court ruling.
- Prompt – A textual description of the case, written in Thai.
- Section 334 – Binary indicator (1 or 0) for whether the case involves theft under Section 334 of the Thai Criminal Code.
- Section 336 – Binary indicator for whether the case involves snatching under Section 336.
- Section 339 – Binary indicator for whether the case involves robbery under Section 339.
- Section 340 – Binary indicator for whether the case involves gang robbery under Section 340.
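The sketch below illustrates how the four label columns could be read into multi-label vectors and scored with the two metrics named in the abstract. It rests on several assumptions: the file format is not stated on this page, so a CSV whose header matches the column names above is assumed; the sample rows, the placeholder decision identifiers, and the predictions are invented for illustration and are not taken from the dataset; and zero-one and Hamming accuracy are implemented with their usual definitions (exact match of the whole label vector, and per-label agreement), which may differ in detail from the authors' evaluation code.

```python
import csv
import io

# Label columns as described in the column list above (header names are assumed).
LABEL_COLUMNS = ["Section 334", "Section 336", "Section 339", "Section 340"]

# Hypothetical stand-in for the dataset file; real prompts are written in Thai.
SAMPLE_CSV = """Supreme Court Decision No.,Prompt,Section 334,Section 336,Section 339,Section 340
example-1,placeholder case text A,1,0,0,0
example-2,placeholder case text B,1,1,0,0
example-3,placeholder case text C,1,0,1,1
"""


def load_labels(handle):
    """Return one list of 0/1 integers per row, ordered as LABEL_COLUMNS."""
    reader = csv.DictReader(handle)
    return [[int(row[col]) for col in LABEL_COLUMNS] for row in reader]


def zero_one_accuracy(y_true, y_pred):
    """Fraction of cases whose entire label vector is predicted exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return exact / len(y_true)


def hamming_accuracy(y_true, y_pred):
    """Fraction of individual label slots that are predicted correctly."""
    correct = sum(ti == pi for t, p in zip(y_true, y_pred) for ti, pi in zip(t, p))
    total = sum(len(t) for t in y_true)
    return correct / total


y_true = load_labels(io.StringIO(SAMPLE_CSV))

# Invented predictions: the second case misses the Section 336 label.
y_pred = [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 1, 1]]

print("zero-one accuracy:", zero_one_accuracy(y_true, y_pred))
print("Hamming accuracy:", hamming_accuracy(y_true, y_pred))
```

Because a single wrong label makes the whole case count as incorrect under zero-one accuracy but costs only one slot out of four under Hamming accuracy, the first metric is never higher than the second, which is consistent with the 82% and 92% figures reported in the abstract.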