ColBERT dataset - 200k short texts for humor detection

Citation Author(s): Issa Annamoradnejad (Sharif University of Technology)
Submitted by: Issa Annamoradnejad
Last updated: Tue, 03/09/2021 - 15:18
DOI: 10.21227/fw8e-z983
Data Format: CSV
Links: Related preprint


Categories:

Machine Learning

Keywords:

classification; humor; short text


Abstract

Automatic humor detection has interesting use cases in modern technologies, such as chatbots and virtual assistants. Existing humor detection datasets usually combine formal non-humorous texts with informal jokes whose surface statistics (text length, word count, etc.) are incompatible. As a result, simple analytical models can detect humor from these surface differences alone, without capturing the underlying latent linguistic features and structures.

We introduce a new combined dataset for the task of humor detection, entitled “ColBERT dataset”, which contains 200k labeled short texts, equally distributed between humor and non-humor. The new dataset reduces or completely removes the issues of the existing datasets: it is much larger than previous datasets, and its texts share similar textual features. The correlation between character count and the target is insignificant (+0.09), and there is no notable connection between the target value and sentiment features (correlation coefficients of -0.09 and +0.02 for polarity and subjectivity, respectively).
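The character-count check described above can be reproduced along the following lines. This is a minimal sketch, not the author's script: the column names `text` and `humor` and the helper `length_target_correlation` are assumptions made for illustration, and the four-row frame is made-up sample data standing in for the real file.

```python
import pandas as pd


def length_target_correlation(df, text_col="text", target_col="humor"):
    """Pearson correlation between character count and the binary target.

    The column names are assumptions; adjust them to the actual CSV header.
    """
    char_count = df[text_col].astype(str).str.len()
    target = df[target_col].astype(int)  # True/False -> 1/0
    return char_count.corr(target)       # Series.corr defaults to Pearson


# Made-up rows for illustration only (not taken from the dataset).
sample = pd.DataFrame(
    {
        "text": [
            "Why did the chicken cross the road?",
            "Stocks closed higher on Tuesday.",
            "I told my computer a joke.",
            "Rain is expected this weekend.",
        ],
        "humor": [True, False, True, False],
    }
)

r = length_target_correlation(sample)
print(f"correlation between character count and target: {r:+.2f}")
```

On the full dataset, a value near zero would be consistent with the reported +0.09; the sample above is too small for its result to mean anything.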

Instructions:

If you already have Microsoft Excel installed, just double-click a CSV file to open it in Excel. After double-clicking the file, you may see a prompt asking which program you want to open it with. Select Microsoft Excel. If you are already in Microsoft Excel, you can choose File > Open and select the CSV file.

CSV files can also be read with Python's pandas library, which is highly recommended if you have a lot of data to analyze. pandas is an open-source Python library that provides high-performance data analysis tools and easy-to-use data structures. It can be installed in any Python environment and is included in the Anaconda distribution; it works well in Jupyter notebooks, which let you share data, code, analysis results, visualizations, and narrative text.
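Loading the file with pandas might look like the sketch below. The column names `text` and `humor` and the helper `load_dataset` are assumptions for illustration; an in-memory CSV with made-up rows stands in for the downloaded file so that the example runs on its own. To read the real data, pass the path of the downloaded CSV instead.

```python
import io

import pandas as pd


def load_dataset(source):
    """Read a CSV (file path or file-like object) into a DataFrame."""
    return pd.read_csv(source)


# In-memory stand-in for the downloaded file; rows are made up.
demo_csv = io.StringIO(
    "text,humor\n"
    '"Why did the chicken cross the road?",True\n'
    '"Stocks closed higher on Tuesday.",False\n'
)

df = load_dataset(demo_csv)

print(df.head())                   # first rows of the table
print(df["humor"].value_counts())  # class balance (column name assumed)
```

`pd.read_csv` parses `True`/`False` values as booleans, so the label column can be used directly for counting or filtering.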