Abstract 

Please cite the following paper when using this dataset:

Vanessa Su and Nirmalya Thakur, “COVID-19 on YouTube: A Data-Driven Analysis of Sentiment, Toxicity, and Content Recommendations”, Proceedings of the IEEE 15th Annual Computing and Communication Workshop and Conference 2025, Las Vegas, USA, Jan 06-08, 2025 (Paper accepted for publication, Preprint: https://arxiv.org/abs/2412.17180).

Abstract:

This dataset comprises metadata and analytical attributes for 9,325 publicly available YouTube videos related to COVID-19, published between January 1, 2023, and October 25, 2024. The dataset was created using the YouTube API and refined through rigorous data cleaning and preprocessing. 

Key Attributes of the Dataset:

  • Video URL: The full URL linking to each video.
  • Video ID: A unique identifier for each video.
  • Title: The title of the video.
  • Description: A detailed textual description provided by the video uploader.
  • Publish Date: The date the video was published, ranging from January 1, 2023, to October 25, 2024.
  • View Count: The total number of views per video, ranging from 0 to 30,107,100 (mean: ~59,803).
  • Like Count: The number of likes per video, ranging from 0 to 607,138 (mean: ~1,413).
  • Comment Count: The number of comments per video, ranging from 1 to 25,000 (mean: ~147).
  • Duration: Video length in seconds, ranging from 0 to 42,900 seconds (median: 137 seconds).
  • Categories: Categorization of videos into 15 unique categories, with "News & Politics" being the most common (4,035 videos).
  • Tags: Tags associated with each video.
  • Language: The language of the video, predominantly English ("en").
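
The summary figures listed above (ranges, means, the median duration, and the most common category) can be recomputed from the data file. The following is a minimal sketch using only the Python standard library. It assumes the dataset is distributed as a CSV file whose header uses column names such as "View Count", "Duration", and "Categories"; those names, the helper functions, and the tiny sample data are illustrative assumptions, not taken from the dataset itself, and should be adjusted to match the actual file.

```python
import csv
import io
from collections import Counter
from statistics import mean, median

# Assumed column names (hypothetical); change them to match the real header.
VIEW_COL = "View Count"
DURATION_COL = "Duration"
CATEGORY_COL = "Categories"


def summarize(rows):
    """Compute a few descriptive statistics from an iterable of row dicts."""
    rows = list(rows)
    views = [int(r[VIEW_COL]) for r in rows]
    durations = [int(r[DURATION_COL]) for r in rows]
    categories = Counter(r[CATEGORY_COL] for r in rows)
    return {
        "n_videos": len(rows),
        "view_min": min(views),
        "view_max": max(views),
        "view_mean": mean(views),
        "duration_median": median(durations),
        # (category name, number of videos) for the most frequent category
        "top_category": categories.most_common(1)[0],
    }


def summarize_csv(path):
    """Read a CSV file at `path` (hypothetical location) and summarize it."""
    with open(path, newline="", encoding="utf-8") as f:
        return summarize(csv.DictReader(f))


# Made-up sample rows, used only to show the expected input shape.
SAMPLE = """Video ID,View Count,Duration,Categories
a1,100,60,News & Politics
b2,300,120,Education
c3,200,180,News & Politics
"""

if __name__ == "__main__":
    print(summarize(csv.DictReader(io.StringIO(SAMPLE))))
```

Calling `summarize_csv` on the downloaded file should give values comparable to those listed above, provided the column names match. Rows with missing or non-numeric counts would raise a `ValueError` in this sketch and would need to be filtered first.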

Instructions:

Please refer to the paper cited above for details about this dataset.

Data Descriptor Article DOI: