Readability Classifier with Linguistic Characteristics

Citation Author(s): Chao Zhang (College of Foreign Languages, Qufu Normal University)
Submitted by: Chao Zhang
Last updated: Sat, 05/18/2024 - 01:34
DOI: 10.21227/j3as-y649
Data Format: *.avi; *.csv; *.txt; *.zip

Keywords:

readability grading

multidimensional linguistics features

Chinese as a second language

BERT

BCRC

Abstract

This data repository contains test data and corresponding test code for evaluating the performance of a machine learning model. The dataset includes 950 labeled samples across 7 different classes. The test code provides implementations of several common evaluation metrics, including accuracy, precision, recall, and F1-score. This resource is intended to facilitate the benchmarking and comparison of different machine learning algorithms on a standardized task. Researchers and practitioners in the field of artificial intelligence and pattern recognition may find this data and test code useful for their work.
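The four metrics named above can be illustrated with a minimal sketch. This is not the repository's test code: the function name `classification_metrics` is hypothetical, and macro averaging (an unweighted mean over classes) is an assumption, since the abstract does not say how per-class scores are combined.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1.

    Macro averaging is an assumption made for this sketch; the
    repository's own scripts may aggregate per-class scores differently.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    labels = set(y_true) | set(y_pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }
```

Calling `classification_metrics(gold_labels, predicted_labels)` with two equal-length label sequences returns a dictionary with the four scores. Classes that are never predicted contribute a precision of 0 here, which is one common convention among several.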

Note that the training data used in this study was extracted from published textbooks. Due to copyright considerations, the training data is not included in this public repository. Interested parties who wish to access the training data can contact the authors directly. The test data and evaluation code, however, are available under an open-source license to encourage reproducibility and further research in this area.

Instructions

The dataset includes the following files:

BRCCResults.txt: Performance results for the BCRC model.

model_LSTM.pth: Saved PyTorch model checkpoint for the BCRC model.

test_data1.txt: Test dataset in text format for the BCRC model.

TestCode_LSTM.py: Python script to evaluate the BCRC model on the test data.

BERTOnly_Test.py: Python script to evaluate the baseline model on the test data.

model_basebert.pth: Saved PyTorch model checkpoint for the baseline model.

Results2E5_32.txt: Performance results for the baseline model.

test.txt: Test dataset in text format for the baseline model.

To use this dataset, follow these steps:

Download all the files to your local machine.

Review the data documentation to understand the dataset.

Use the Python scripts (TestCode_LSTM.py and BERTOnly_Test.py) to load the test data and evaluate the pre-trained models.

The performance results for the BCRC model and the baseline model are printed and stored in the BRCCResults.txt and Results2E5_32.txt files, respectively.
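Before running the evaluation scripts, it can help to confirm that the download is complete. The following is a minimal sketch, assuming all files sit in one directory; the helper name `missing_files` is hypothetical, and the file names are taken from the list above.

```python
from pathlib import Path

# File names as listed in the instructions above.
EXPECTED_FILES = [
    "BRCCResults.txt",
    "model_LSTM.pth",
    "test_data1.txt",
    "TestCode_LSTM.py",
    "BERTOnly_Test.py",
    "model_basebert.pth",
    "Results2E5_32.txt",
    "test.txt",
]


def missing_files(directory="."):
    """Return the expected files that are not present in `directory`."""
    base = Path(directory)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_files(".")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected files are present.")
```

If the list comes back empty, the two scripts should at least find the checkpoints and test files they are paired with, provided they look for them in the same directory.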

Please note that the training data used to generate the provided model checkpoints is not included in this public dataset due to copyright restrictions. Interested users who wish to access the training data should contact the original authors directly.

If you have any questions or need further assistance, feel free to reach out to us.

Funding Agency

International Chinese Language Education Research Program; Higher Education Youth Innovation Team Project of Shandong Province

Grant Number

23YH82C; 2023RW050