Combined rumor and non-rumor dataset

Citation Author(s):
Mansor Alohali
Submitted by:
Mansor Alohali
Last updated:
Mon, 03/31/2025 - 03:33
DOI:
10.21227/9kkw-x403
Data Format:
License:

Abstract 

This dataset, comprising 103,806 text entries, is a comprehensive resource for rumor detection on social media, constructed by merging benchmark collections including PHEME, LIAR Fake News, Twitter15, Twitter16, and ISOT Fake News. It features a binary classification schema (47% rumor, 53% non-rumor) and integrates original and adversarially augmented samples to enhance model robustness. Augmentation, applied selectively to the rumor class, employs the TextAttack framework with EmbeddingAugmenter (20% word swaps) and CharSwapAugmenter (character-level perturbations), preserving semantic integrity while introducing realistic textual variations. Preprocessing includes text normalization (e.g., lowercase conversion, URL/user placeholders).

Instructions: 

This dataset supports the paper "Transparent and Resilient Misinformation Detection with Multi-Level Explainability and Adversarial Training." It merges multiple publicly available rumor detection datasets to provide a comprehensive benchmark for training and evaluating robust, explainable models in the context of social media misinformation.

The dataset contains the cleaned, preprocessed text (103,806 entries) with binary labels (rumor or non-rumor) and standardized text fields.
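For readers who want to apply comparable cleaning to their own text before combining it with this dataset, the normalization described in the abstract (lowercase conversion, URL/user placeholders) might look roughly like the sketch below. The placeholder tokens `<url>` and `<user>`, the regular expressions, and the function name `normalize` are assumptions made for illustration; the exact rules used to build the dataset are not specified here.

```python
import re

# Assumed patterns: the dataset's actual rules are not documented on this page.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")


def normalize(text):
    """Lowercase the text and replace URLs and @mentions with placeholders."""
    text = text.lower()
    text = URL_RE.sub("<url>", text)    # hypothetical placeholder token
    text = USER_RE.sub("<user>", text)  # hypothetical placeholder token
    # Collapse runs of whitespace left behind by the substitutions.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(normalize("Breaking: @BBC says http://t.co/Abc  NOW"))
```

Lowercasing happens before substitution so that the placeholder tokens keep a single, consistent spelling.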

Sources:

 

  • PHEME Dataset (event-based rumors)

  • LIAR Fake News Dataset

  • Twitter15 / Twitter16

  • ISOT Fake News Dataset

Adversarial Augmentation

Techniques used:

  • Embedding-based word substitution (TextAttack)

  • Character-level perturbations (random swaps, insertions)

Purpose: to improve model robustness and simulate real-world adversarial misinformation.
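The character-level perturbation idea can be illustrated with a small standard-library sketch. This is not TextAttack's CharSwapAugmenter and not the code used to build the dataset; it is a simplified stand-in that swaps adjacent characters or inserts a random letter in a fraction of the words, and applies the change only to the rumor class, as the abstract describes. The function names, the `pct_words` default, and the label string "rumor" are assumptions.

```python
import random
import string


def char_perturb(text, pct_words=0.2, seed=None):
    """Apply one random character edit to roughly pct_words of the words.

    Note: whitespace is normalized to single spaces as a side effect.
    """
    rng = random.Random(seed)
    words = text.split()
    # Only alphabetic words of 4+ characters are edited, so short tokens stay readable.
    candidates = [i for i, w in enumerate(words) if len(w) >= 4 and w.isalpha()]
    if not candidates:
        return text
    n_edit = max(1, round(pct_words * len(words)))
    for i in rng.sample(candidates, min(n_edit, len(candidates))):
        w = words[i]
        pos = rng.randrange(1, len(w) - 2)  # first and last characters stay fixed
        if rng.random() < 0.5:
            # Swap two adjacent interior characters.
            w = w[:pos] + w[pos + 1] + w[pos] + w[pos + 2:]
        else:
            # Insert a random lowercase letter.
            w = w[:pos] + rng.choice(string.ascii_lowercase) + w[pos:]
        words[i] = w
    return " ".join(words)


def augment_rumors(rows, seed=0):
    """Return rows plus one perturbed copy of each row labeled 'rumor' (assumed label)."""
    extra = [(char_perturb(t, seed=seed + k), y)
             for k, (t, y) in enumerate(rows) if y == "rumor"]
    return list(rows) + extra


if __name__ == "__main__":
    sample = [("officials confirm the bridge has collapsed", "rumor"),
              ("council publishes meeting minutes", "non-rumor")]
    for text, label in augment_rumors(sample):
        print(label, "|", text)
```

Keeping the first and last character of each edited word fixed is a design choice that tends to preserve readability, in the spirit of the "preserving semantic integrity" goal stated in the abstract. For the embedding-based word substitution, the TextAttack library itself would be needed, since it relies on pretrained word embeddings.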