COVID-19 on YouTube: A Data-Driven Analysis of Sentiment, Toxicity, and Content Recommendations
- Submitted by: Nirmalya Thakur
- Last updated: Tue, 01/21/2025 - 21:01
- DOI: 10.21227/sbj6-pt91
- Keywords: YouTube, Covid, covid-19, coronavirus, Data Mining, Sentiment Analysis, Machine Learning, Supervised Learning, Unsupervised Learning, Social Media, data science, Data Analysis, Pattern Recognition, Information Retrieval, web mining, Public Health, social media platforms, social media mining, social networks, virus outbreak, emotion analysis, multilingual dataset, Natural Language Processing, NLP, AI, artificial intelligence, online misinformation, toxic content detection, public attitudes, syndromic surveillance, neural networks, WHO, epidemic, pandemic, classification, Google Translate, language detection, language translation, public perception, public discourse, misinformation analysis, online behavior, health communication, user-generated content, social contagion, online hate, Text Classification, toxic language, Text Mining, pandemic studies, health misinformation, Dataset
Abstract
Please cite the following paper when using this dataset:
Vanessa Su and Nirmalya Thakur, “COVID-19 on YouTube: A Data-Driven Analysis of Sentiment, Toxicity, and Content Recommendations”, Proceedings of the IEEE 15th Annual Computing and Communication Workshop and Conference 2025, Las Vegas, USA, Jan 06-08, 2025 (Paper accepted for publication, Preprint: https://arxiv.org/abs/2412.17180).
Abstract:
This dataset comprises metadata and analytical attributes for 9,325 publicly available YouTube videos related to COVID-19, published between January 1, 2023, and October 25, 2024. The dataset was created using the YouTube API and refined through rigorous data cleaning and preprocessing.
Key Attributes of the Dataset:
- Video URL: The full URL linking to each video.
- Video ID: A unique identifier for each video.
- Title: The title of the video.
- Description: A detailed textual description provided by the video uploader.
- Publish Date: The date the video was published, ranging from January 1, 2023, to October 25, 2024.
- View Count: The total number of views per video, ranging from 0 to 30,107,100 (mean: ~59,803).
- Like Count: The number of likes per video, ranging from 0 to 607,138 (mean: ~1,413).
- Comment Count: The number of comments, varying from 1 to 25,000 (mean: ~147).
- Duration: Video length in seconds, ranging from 0 to 42,900 seconds (median: 137 seconds).
- Categories: Categorization of videos into 15 unique categories, with "News & Politics" being the most common (4,035 videos).
- Tags: Tags associated with each video.
- Language: The language of the video, predominantly English ("en").
Please refer to the above-mentioned paper for details about this dataset.
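As an illustration of how the attributes listed above might be explored, the following is a minimal sketch that computes the same kinds of summary statistics quoted in the attribute list (mean view count, median duration, most common category). It assumes the data is available as CSV; the column names (`view_count`, `duration_seconds`, `category`, and so on) and the three sample rows are hypothetical and are not taken from the dataset, so they should be adjusted to match the actual file.

```python
import csv
import io
from collections import Counter
from statistics import mean, median

# Hypothetical column names and made-up rows for illustration only;
# the real dataset file may use different headers and formats.
SAMPLE_CSV = """video_id,view_count,like_count,comment_count,duration_seconds,category,language
a1,1200,40,12,137,News & Politics,en
b2,0,0,1,60,Education,en
c3,30000,900,250,600,News & Politics,es
"""


def summarize(rows):
    """Return basic summary statistics for a list of row dicts."""
    views = [int(r["view_count"]) for r in rows]
    durations = [int(r["duration_seconds"]) for r in rows]
    categories = Counter(r["category"] for r in rows)
    top_category, top_count = categories.most_common(1)[0]
    return {
        "videos": len(rows),
        "mean_views": mean(views),
        "median_duration": median(durations),
        "top_category": top_category,
        "top_category_count": top_count,
    }


# Parse the in-memory sample; for a real file, pass an open file object
# to csv.DictReader instead of io.StringIO.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
summary = summarize(rows)
print(summary)
```

Only the standard library is used here so the sketch runs without extra dependencies; the same aggregation could equally be done with a data-frame library.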