QAmultilabelEURLEXsamples

Citation Author(s):
WANG
LI
Submitted by:
WANG LI
Last updated:
Mon, 07/08/2024 - 15:58
DOI:
10.21227/pwcg-7b84
License:

Abstract 

This dataset is a sample drawn from EURLEX57k, built for a multi-answer questioning task with EUROVOC. Each legal document in EURLEX57k is assigned several labels from the European Vocabulary (EUROVOC), which maintains thousands of concepts such as "export industry" and "organic acid". Before building the data, samples are drawn: a Z-score-based online sample size calculator is used to determine the sample sizes, with a 95% confidence level and a 5% margin of error. The computation yields a sample size of 381 out of the 45,000 training documents, and 362 out of 6,000 for each of the validation and test sets. After data building, the train, validation, and test splits contain 1708, 1650, and 1648 examples, respectively. This particular dataset is the validation sample.
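The abstract does not name the exact online calculator used; the reported sizes (381 of 45,000 and 362 of 6,000) are consistent with Cochran's Z-score formula with a finite population correction, sketched here as an assumption:

```python
import math

def sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Cochran's sample size formula with finite population correction.

    z=1.96 corresponds to a 95% confidence level, margin is the
    margin of error, and p=0.5 is the most conservative proportion.
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))

print(sample_size(45_000))  # 381, the train sample size
print(sample_size(6_000))   # 362, the validation/test sample size
```

Rounding up with `math.ceil` reproduces both figures quoted in the abstract.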

Data building is the initial stage in preparing the dataset for the multi-answer questioning task with a label hierarchy. The simulation data are obtained by sampling the EURLEX dataset via the Z-score-based calculation. The labels for the multiple answers are obtained by mapping the labelled EUROVOC concepts to the subdomain trees (categories list) in the EUROVOC hierarchy. Labels and titles (text) are then combined as the inputs for an extractive multi-answer questioning task. Titles have been shown to yield performance similar to the full legal documents (Chalkidis et al., 2019), which helps address the long-input problem of pre-trained models with restricted input lengths. In the second step, tokenization and label alignment are applied to process the inputs. The third step fine-tunes pre-trained BERT-based models for the multi-answer question task on the pre-processed data. Finally, the performance of the fine-tuned models is assessed on the validation and test samples using seqeval and the suggested auxiliary classification metric. The key elements of the methodology are presented in the subsections.
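The label-alignment step is not spelled out above; a common convention with subword tokenizers (shown here over a hypothetical `word_ids` mapping of the kind produced by HuggingFace fast tokenizers) is to copy each word's label to its first subword and mark continuation subwords and special tokens with -100 so the loss ignores them:

```python
def align_labels(word_labels, word_ids):
    """Align word-level labels with subword tokens.

    word_ids maps each subword token to the index of the word it
    came from (None for special tokens such as [CLS]/[SEP]).
    Only the first subword of each word keeps the label; the rest
    get -100, which cross-entropy loss ignores.
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None or wid == previous:
            aligned.append(-100)
        else:
            aligned.append(word_labels[wid])
        previous = wid
    return aligned

# Hypothetical example: "organic" (label 1) splits into two subwords,
# "acid" (label 2) stays whole: [CLS] org ##anic acid [SEP]
word_ids = [None, 0, 0, 1, None]
print(align_labels([1, 2], word_ids))  # [-100, 1, -100, 2, -100]
```

This keeps the label sequence the same length as the tokenized input, as required for token classification.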

Instructions: 

The dataset has five columns: celex_id, input_ids, token_type_ids, attention_mask, and labels.

When using the validation data for the token classification task, remove the celex_id column and collate the remaining columns into batches to feed directly into the model.
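As a minimal plain-Python sketch (assuming each example is a dict keyed by the five columns; the example values are hypothetical), dropping celex_id and collating the rest might look like:

```python
def collate(batch):
    """Drop the celex_id column and gather the remaining columns
    (input_ids, token_type_ids, attention_mask, labels) into
    per-column lists, ready to convert to tensors."""
    keep = ("input_ids", "token_type_ids", "attention_mask", "labels")
    return {key: [example[key] for example in batch] for key in keep}

batch = [
    {"celex_id": "example_id", "input_ids": [101, 7592, 102],
     "token_type_ids": [0, 0, 0], "attention_mask": [1, 1, 1],
     "labels": [-100, 1, -100]},
]
collated = collate(batch)
print(sorted(collated))  # ['attention_mask', 'input_ids', 'labels', 'token_type_ids']
```

In practice a padding-aware collator (e.g. one that pads input_ids and labels to the batch maximum) would replace the plain list gathering shown here.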

The pretrained models are BERT-based legal-domain models. For many NLP classification tasks, transformer-based pretrained models such as BERT are recommended, and BERT is the pretrained model chosen in this research for multi-label classification. Huggingface offers developers a place to share their models, including domain-adapted ones. There are many BERT-based pretrained models in the legal domain; Legal BERT by Chalkidis et al. (2020) and by Zheng et al. (2021) are both prominent instances. Chalkidis et al. (2020) attempt to demonstrate the value of domain pretraining, while Zheng et al. (2021) investigate the circumstances in which domain pretraining is beneficial.