ColBERT dataset - 200k short texts for humor detection

Sharif University of Technology
Issa Annamoradnejad
Tue, 03/09/2021 - 10:18
Automatic humor detection has interesting use cases in modern technologies, such as chatbots and virtual assistants. Existing humor detection datasets usually combine formal non-humorous texts with informal jokes whose statistics (text length, word count, etc.) are incompatible. As a result, simple analytical models can detect humor from these surface statistics alone, without understanding the underlying latent linguistic features and structures.

We introduce a new combined dataset for the task of humor detection, entitled “ColBERT dataset”, which contains 200k labeled short texts, equally distributed between humor and non-humor. In the new dataset, we reduced or completely removed the issues of the existing datasets. It is much larger than the previous datasets, and its texts have similar textual features. The correlation between character count and the target is negligible (+0.09), and there is no notable connection between the target value and sentiment features (correlation coefficients of -0.09 and +0.02 for polarity and subjectivity, respectively).
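
As an illustration of the kind of check behind the correlation figures quoted above, the following is a minimal sketch of a Pearson correlation between character count and a binary humor label. It is not the authors' analysis code. The function names, the (text, is_humor) row layout, and the toy rows are assumptions made for the example; the toy rows are not taken from the dataset, so the printed value says nothing about the reported +0.09.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def length_label_correlation(rows):
    """Correlate character count with the humor label.

    `rows` is a list of (text, is_humor) pairs; this layout is an
    assumption for the sketch, not the dataset's documented schema.
    """
    lengths = [len(text) for text, _ in rows]
    labels = [1 if is_humor else 0 for _, is_humor in rows]
    return pearson(lengths, labels)


# Hypothetical toy rows, written for this example only.
toy_rows = [
    ("Why did the chicken cross the road? To get to the other side.", True),
    ("I told my computer a joke, but it did not get the byte.", True),
    ("The city council approved the new budget on Monday evening.", False),
    ("Heavy rain is expected across the region later this week.", False),
]

if __name__ == "__main__":
    print(f"length vs. label correlation: {length_label_correlation(toy_rows):+.2f}")
```

A value near zero on the full data would indicate that text length alone carries little information about the label, which is the property the dataset description emphasizes.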


