Combined rumor and non-rumor dataset

- Submitted by: Mansor Alohali
- Last updated: Mon, 03/31/2025 - 03:33
- DOI: 10.21227/9kkw-x403
Abstract
This dataset comprises 103,806 text entries for rumor detection on social media, constructed by merging benchmark collections including PHEME, LIAR Fake News, Twitter15, Twitter16, and ISOT Fake News. It uses a binary classification schema (47% rumor, 53% non-rumor) and combines original samples with adversarially augmented ones to enhance model robustness. Augmentation is applied only to the rumor class and uses the TextAttack framework with EmbeddingAugmenter (20% word swaps) and CharSwapAugmenter (character-level perturbations), preserving semantic integrity while introducing realistic textual variations. Preprocessing includes text normalization (e.g., lowercase conversion, URL/user placeholders).
This dataset supports the paper "Transparent and Resilient Misinformation Detection with Multi-Level Explainability and Adversarial Training." It merges multiple publicly available rumor detection datasets to provide a comprehensive benchmark for training and evaluating robust, explainable models in the context of social media misinformation.
The dataset contains the cleaned, preprocessed data (103,806 entries) with binary labels (rumor or non-rumor) and standardized text fields.
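The text normalization mentioned in the abstract (lowercase conversion, URL/user placeholders) could look roughly like the sketch below. The placeholder tokens `<url>` and `<user>`, the regular expressions, and the function name `normalize_text` are assumptions for illustration; the dataset page does not specify the exact rules used.

```python
import re

# Assumed placeholder tokens; the dataset's actual tokens are not documented here.
URL_TOKEN = "<url>"
USER_TOKEN = "<user>"

_URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
_USER_RE = re.compile(r"@\w+")
_WS_RE = re.compile(r"\s+")


def normalize_text(text: str) -> str:
    """Replace URLs and @mentions with placeholders, lowercase, collapse whitespace."""
    text = _URL_RE.sub(URL_TOKEN, text)
    text = _USER_RE.sub(USER_TOKEN, text)
    text = text.lower()
    return _WS_RE.sub(" ", text).strip()
```

For example, `normalize_text("BREAKING: @BBCNews says  see https://t.co/abc123 NOW")` yields `"breaking: <user> says see <url> now"` under these assumed rules.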
Sources:
- PHEME Dataset (event-based rumors)
- LIAR Fake News Dataset
- Twitter15 / Twitter16
- ISOT Fake News Dataset
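Merging sources with different label vocabularies into one binary schema might be sketched as follows. The mapping in `LABEL_MAP`, the record layout, and the helper names are hypothetical; the page does not document how each source's labels were mapped to rumor or non-rumor.

```python
from collections import Counter

# Hypothetical mapping from source-specific labels to the binary schema.
# Illustrative assumption only, not the dataset authors' documented rules.
LABEL_MAP = {
    "rumor": "rumor",
    "non-rumor": "non-rumor",
    "fake": "rumor",
    "real": "non-rumor",
}


def merge_sources(sources):
    """Combine {source_name: [(text, raw_label), ...]} into one list of records."""
    merged = []
    for name, rows in sources.items():
        for text, raw_label in rows:
            label = LABEL_MAP.get(raw_label.strip().lower())
            if label is None:
                continue  # this sketch skips labels with no binary equivalent
            merged.append({"text": text, "label": label, "source": name})
    return merged


def class_balance(records):
    """Return the fraction of records per label."""
    counts = Counter(r["label"] for r in records)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}
```

Running `class_balance` on the merged records is a quick way to compare a local copy against the stated 47% rumor / 53% non-rumor split.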
Adversarial Augmentation
Techniques used:
- Embedding-based word substitution (TextAttack)
- Character-level perturbations (random swaps, insertions)
Purpose: to improve model robustness and simulate real-world adversarial misinformation.
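To make the character-level perturbations concrete, here is a minimal standard-library sketch of random adjacent swaps and insertions. It is a stand-in for illustration and does not call TextAttack's CharSwapAugmenter; the function names, the `rate` parameter, and its default value are assumptions.

```python
import random
import string


def swap_adjacent(word, rng):
    """Swap two neighbouring characters at a random position."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def insert_char(word, rng):
    """Insert one random lowercase letter at a random position."""
    i = rng.randrange(len(word) + 1)
    return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]


def perturb(text, rate=0.2, seed=None):
    """Perturb roughly `rate` of the whitespace-separated words (assumed rate)."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        if rng.random() < rate:
            op = rng.choice((swap_adjacent, insert_char))
            word = op(word, rng)
        out.append(word)
    return " ".join(out)
```

Passing a fixed `seed` makes the perturbations reproducible, which helps when augmented rumor samples need to be regenerated for comparison.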