Dataset for Word Difficulty Prediction

Name: Dataset for Word Difficulty Prediction
Creator: Avishek Garain
License: https://creativecommons.org/licenses/by/4.0/

Citation Author(s): Avishek Garain (Jadavpur University), Arpan Basu (Jadavpur University), Sudip Kumar Naskar (Jadavpur University)
Submitted by: Avishek Garain
Last updated: Sun, 10/04/2020 - 16:00
DOI: 10.21227/w0av-f618
Data Format: *.csv
Links: Word Difficulty Prediction Using Convolutional Neural Networks

Keywords: difficulty, nlp, simplification, classification

Abstract

Most text-simplification systems require an indicator of word complexity. The prevalent approaches to word difficulty prediction are based on manual feature engineering. Deep learning based models have been left largely unexplored because of their comparatively poor performance. We explore the use of one such model for predicting the difficulty of words, treating the task as a binary classification problem. We trained traditional machine learning models and evaluated their performance on the task. One of our primary aims was to remove the dependency on the frequency of previously acquired words when measuring difficulty. We then analyzed a convolutional neural network based prediction model that operates at the character level and evaluated its efficiency compared to the others.

This dataset contains 40,481 data instances. The column headers are as follows:

Word
Length
Freq_HAL
Log_Freq_HAL
I_Mean_RT
I_Zscore
I_SD
Obs
I_Mean_Accuracy

Further details of the dataset, and the method used to obtain the difficulty labels, are given in the linked research publication.

For open access to the publication, visit https://garain.codes

If you use the dataset, please cite both the dataset and the conference paper.

Instructions:

The data is in CSV format. Please see the research paper for how to obtain the difficulty label from the I_Zscore column.
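As a starting point, the CSV could be read and checked along the following lines. This is a minimal sketch, not the authors' code: it assumes the file has a header row with the column names listed above, and the helper names read_rows and label_rows are hypothetical. The rule that turns an I_Zscore value into a difficulty label is defined in the paper and is deliberately not assumed here, so it has to be supplied by the caller.

```python
import csv
from typing import Callable, Dict, List, Optional, TextIO

# Column headers as listed in the dataset description.
EXPECTED_COLUMNS = [
    "Word", "Length", "Freq_HAL", "Log_Freq_HAL",
    "I_Mean_RT", "I_Zscore", "I_SD", "Obs", "I_Mean_Accuracy",
]


def read_rows(handle: TextIO) -> List[Dict[str, str]]:
    """Read the CSV from an open text handle and verify its header.

    Assumes the first line is a header row (not confirmed by the source).
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    return list(reader)


def label_rows(
    rows: List[Dict[str, str]],
    rule: Callable[[float], int],
) -> List[Optional[int]]:
    """Apply a caller-supplied labelling rule to each I_Zscore value.

    The actual rule comes from the research paper; none is built in.
    Rows whose I_Zscore cannot be parsed as a number get None.
    """
    labels: List[Optional[int]] = []
    for row in rows:
        try:
            labels.append(rule(float(row["I_Zscore"])))
        except (TypeError, ValueError):
            labels.append(None)
    return labels
```

To use it, open the downloaded file with `open(path, newline="", encoding="utf-8")`, pass the handle to read_rows, and then call label_rows with a function implementing the rule described in the paper. The path and the encoding are assumptions to adjust for the actual download.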

To select the BibTeX contents, double-click the IEEE contents, then use Ctrl+C to copy. This is a bug, and we need to wait until it is fixed. Until then, this is how you can cite.

Avishek Garain Tue, 10/06/2020 - 07:42

The bug has been fixed.

Avishek Garain Tue, 11/24/2020 - 10:10