Abstract

This dataset, comprising 103,806 text entries, is a comprehensive resource for rumor detection on social media, constructed by merging benchmark collections including PHEME, LIAR Fake News, Twitter15, Twitter16, and ISOT Fake News. It features a binary classification schema (47% rumor, 53% non-rumor) and integrates original and adversarially augmented samples to enhance model robustness. Augmentation, applied selectively to the rumor class, employs the TextAttack framework with EmbeddingAugmenter (20% word swaps) and CharSwapAugmenter (character-level perturbations), preserving semantic integrity while introducing realistic textual variations. Preprocessing includes text normalization (e.g., lowercase conversion, URL/user placeholders).

Instructions:

This dataset supports the paper "Transparent and Resilient Misinformation Detection with Multi-Level Explainability and Adversarial Training." It merges multiple publicly available rumor detection datasets to provide a comprehensive benchmark for training and evaluating robust, explainable models in the context of social media misinformation.

The dataset contains the cleaned, preprocessed data (103,806 entries) with binary labels (rumor or non-rumor) and standardized text fields.
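
The text normalization mentioned in the abstract (lowercase conversion, URL and user placeholders) could look roughly like the sketch below. This is an illustration only, not the contents of clean_rumor_dataset.py: the placeholder tokens `<url>` and `<user>`, the regular expressions, and the function name `normalize_text` are all assumptions.

```python
import re

# Placeholder tokens are assumptions; the released cleaning script may use others.
URL_TOKEN = "<url>"
USER_TOKEN = "<user>"

_URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
_USER_RE = re.compile(r"@\w+")
_WS_RE = re.compile(r"\s+")


def normalize_text(text: str) -> str:
    """Lowercase, replace URLs and @mentions with placeholders, collapse whitespace."""
    text = text.lower()
    text = _URL_RE.sub(URL_TOKEN, text)    # URLs first, so "@" inside a URL is not touched
    text = _USER_RE.sub(USER_TOKEN, text)  # then @mentions
    return _WS_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(normalize_text("BREAKING: @BBCNews confirms  https://t.co/abc123 NOW"))
```

Replacing URLs and handles with fixed tokens keeps the signal that a link or mention was present while removing identifiers a model could otherwise memorize.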

Sources:

PHEME Dataset (event-based rumors)
LIAR Fake News Dataset
Twitter15 / Twitter16
ISOT Fake News Dataset

Adversarial Augmentation

Techniques used:

Embedding-based word substitution (TextAttack)

Character-level perturbations (random swaps, insertions)

Purpose: to improve model robustness and simulate real-world adversarial misinformation.
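
As a rough illustration of class-selective, character-level augmentation, the sketch below perturbs only rows labeled rumor and appends the perturbed copies to the originals. It is a standard-library stand-in, not TextAttack's CharSwapAugmenter and not the code in augment_rumor_dataset.py. The function names, the per-word `rate` parameter, and the `(text, label)` row layout are assumptions made for the example.

```python
import random
import string


def perturb_chars(text, rng, rate=0.2):
    """Perturb some words by swapping two adjacent inner characters or
    inserting a random letter, keeping each word's first and last character."""
    out = []
    for word in text.split():
        if len(word) > 3 and rng.random() < rate:
            i = rng.randrange(1, len(word) - 2)  # inner position only
            if rng.random() < 0.5:
                # swap the characters at positions i and i + 1
                word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
            else:
                # insert a random lowercase letter before position i
                word = word[:i] + rng.choice(string.ascii_lowercase) + word[i:]
        out.append(word)
    return " ".join(out)


def augment_rumor_rows(rows, seed=0, rate=0.2):
    """Return the original rows plus perturbed copies of the rumor rows only."""
    rng = random.Random(seed)  # seeded for reproducibility
    augmented = list(rows)
    for text, label in rows:
        if label == "rumor":
            new_text = perturb_chars(text, rng, rate)
            if new_text != text:  # skip copies that came out unchanged
                augmented.append((new_text, label))
    return augmented


if __name__ == "__main__":
    rows = [
        ("vaccine causes outbreak claims", "rumor"),
        ("council approves annual budget", "non-rumor"),
    ]
    for row in augment_rumor_rows(rows, seed=1, rate=1.0):
        print(row)
```

Because only one class is augmented, the class balance of the output depends on how many perturbed copies are kept, so it is worth re-checking the label distribution afterwards.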

Dataset Files

Dataset: augmented_twitter_rumor_dataset.zip (6.90 MB)
Cleaning script: clean_rumor_dataset.py (1.64 kB)
Augmentation script: augment_rumor_dataset.py (3.22 kB)


Combined rumor and non-rumor dataset
