YouTube
Please cite the following paper when using this dataset:
Vanessa Su and Nirmalya Thakur, “COVID-19 on YouTube: A Data-Driven Analysis of Sentiment, Toxicity, and Content Recommendations”, Paper submitted to the IEEE 15th Annual Computing and Communication Workshop and Conference 2025, Las Vegas, USA, Jan 06-08, 2025.
- Categories:
# Top 100 YouTube Channels Dataset
## Overview
This dataset describes the top 100 YouTube channels ranked by subscriber count. It covers the platform's most popular content creators, their performance metrics, and channel details.
## Dataset Contents
The dataset includes the following information for each channel:
- Channel ID
- Title
- Custom URL
- Subscriber Count
- Video Count
- View Count
- Category
- Country
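
As a rough illustration of how these fields might be used, the sketch below loads channel rows and ranks them by subscriber count. The snake_case column names, the CSV layout, and the sample values are assumptions made for this example; the dataset's actual file format and headers may differ.

```python
import csv
import io

# Hypothetical sample rows: the column names are assumed from the field list
# above, and the values are invented for illustration only.
SAMPLE_CSV = """channel_id,title,custom_url,subscriber_count,video_count,view_count,category,country
UC_a,Channel A,@channela,1500,10,90000,Music,US
UC_b,Channel B,@channelb,3200,45,120000,Education,IN
UC_c,Channel C,@channelc,800,5,15000,Music,GB
"""

NUMERIC_FIELDS = ("subscriber_count", "video_count", "view_count")


def load_channels(handle):
    """Read channel rows and convert the count columns to integers."""
    rows = []
    for row in csv.DictReader(handle):
        for field in NUMERIC_FIELDS:
            row[field] = int(row[field])
        rows.append(row)
    return rows


def top_by_subscribers(rows, n):
    """Return the n rows with the highest subscriber count, largest first."""
    return sorted(rows, key=lambda r: r["subscriber_count"], reverse=True)[:n]


channels = load_channels(io.StringIO(SAMPLE_CSV))
top_two = top_by_subscribers(channels, 2)
print([c["title"] for c in top_two])
```

To run this against a real file, replace `io.StringIO(SAMPLE_CSV)` with an opened file handle and adjust `NUMERIC_FIELDS` to match the actual headers.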
- Categories:
This dataset contains audio recordings and transcriptions of toxic speech taken from Indonesian conversations in YouTube videos in which scammers are confronted. It captures two separate interactions that escalate into toxic exchanges. Each interaction has been verified by native Indonesian speakers and labeled with one of two classes: toxic or non-toxic. The dataset includes both the original and preprocessed versions of the speech and text data. The original speech files total 136 MB, while the preprocessed speech files total 111.7 MB.
- Categories:
The dataset includes Pakistan's most popular YouTube videos in each category from 2021 to 2023. There are two kinds of data files: one contains video statistics, and the other contains the comments on those videos. They are linked by the unique video_id field. Both are merged into a final videos file, which contains all video statistics together with the sentiment extracted from the comments. Here’s a breakdown of each column:
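
The video_id link between the two files can be sketched as a simple keyed join. Only `video_id` comes from the description above; the other field names, the numeric sentiment scores, and the choice of a per-video mean are assumptions for illustration, since the description does not say how the sentiment was extracted or aggregated.

```python
from collections import defaultdict

# Hypothetical rows: only video_id comes from the dataset description;
# every other field name and value is invented for illustration.
videos = [
    {"video_id": "v1", "category": "News", "views": 1000},
    {"video_id": "v2", "category": "Music", "views": 2500},
]
comments = [
    {"video_id": "v1", "sentiment": 1},
    {"video_id": "v1", "sentiment": -1},
    {"video_id": "v1", "sentiment": 1},
    {"video_id": "v2", "sentiment": 1},
]


def merge_videos_with_sentiment(videos, comments):
    """Attach the comment count and mean sentiment to each video by video_id."""
    scores = defaultdict(list)
    for comment in comments:
        scores[comment["video_id"]].append(comment["sentiment"])

    merged = []
    for video in videos:
        video_scores = scores.get(video["video_id"], [])
        # Videos without comments get None rather than a misleading zero.
        mean = sum(video_scores) / len(video_scores) if video_scores else None
        merged.append(
            {**video, "comment_count": len(video_scores), "mean_sentiment": mean}
        )
    return merged


final_videos = merge_videos_with_sentiment(videos, comments)
for row in final_videos:
    print(row)
```

Grouping the comments first keeps the join to a single pass over each file, which matters more as the comment file grows relative to the video file.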
- Categories:
The number of Fifth Generation (5G) cellular network users is increasing exponentially, and providing 5G coverage that gives end users maximum Quality of Experience (QoE) is a challenge for global telecommunications. 5G New Radio (NR) technology was developed to address the high-bandwidth, low-latency, and massive-connectivity requirements of enhanced Mobile Broadband (eMBB) compared to Fourth Generation (4G) Long-Term Evolution (LTE).
- Categories:
The dataset is oriented toward encrypted traffic classification problems. It contains three classes of flows: web flows, YouTube flows, and Netflix flows. These classes were chosen because web and video traffic account for 90% of global traffic, while YouTube and Netflix are the largest video services. The dataset is structured as follows: it includes download traces of the 100 most popular web pages according to https://httparchive.org, the 100 most popular YouTube videos, and 50 Netflix series and movies.
- Categories: