110K Sensitive Video Dataset
ATTENTION: THIS DATASET DOES NOT HOST ANY SOURCE VIDEOS. WE PROVIDE ONLY HIDDEN FEATURES GENERATED BY PRE-TRAINED DEEP MODELS AS DATA
Massive amounts of video are uploaded to video-hosting platforms every minute. This volume of data makes it challenging to control the type of content uploaded to these services, yet the platforms are responsible for any sensitive media uploaded by their users. In this context, we propose the 110K Sensitive Video Dataset for binary video classification (whether or not a video contains sensitive content), containing more than 110 thousand tagged videos. Additionally, we set aside an exclusive subset of 11 thousand videos for testing on Kaggle.
To compose the sensitive video subset, we collected videos containing sex, violence, and gore from various internet sources. To compose the subset of safe videos, we collected videos of everyday life, online courses, tutorials, sports, etc. We also deliberately included challenging examples in each class. For the sensitive class, we collected sex videos with people wearing full-body clothes (e.g., latex and cosplay). For the safe class, we collected videos that could be misclassified as sensitive, such as MMA, breastfeeding, pool party, beach, and other videos with a higher amount of skin exposure.
This dataset comprises 53,683 safe videos and 53,683 videos with sensitive content. The sensitive videos consist of 51,563 pornographic videos and 2,120 gore videos. Additionally, each video class contains a list of related tags.
We extracted visual and audio embeddings, concatenated them, and saved each video's labels and features. The visual features were extracted with Inception V3, generating 1024-d embeddings. The audio features were extracted with the VGGish network, generating 128-d embeddings. Concatenated, they form a 1152-d feature vector.
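As a rough illustration of the concatenation described above, the following sketch joins per-window visual and audio embeddings into 1152-d vectors and splits them back. The function names are hypothetical, and the ordering (visual features first, audio features last) is an assumption that should be checked against the scripts in the repository.

```python
import numpy as np

VISUAL_DIM = 1024  # Inception V3 embedding size stated in this description
AUDIO_DIM = 128    # VGGish embedding size stated in this description


def fuse_embeddings(visual, audio):
    """Concatenate per-window visual and audio embeddings along the feature axis.

    ASSUMPTION: visual features come first, then audio features.
    """
    visual = np.asarray(visual)
    audio = np.asarray(audio)
    if visual.shape[0] != audio.shape[0]:
        raise ValueError("visual and audio must have the same number of windows")
    return np.concatenate([visual, audio], axis=1)  # shape (N, 1152)


def split_embeddings(features):
    """Split fused features back into (visual, audio) under the same assumption."""
    features = np.asarray(features)
    return features[:, :VISUAL_DIM], features[:, VISUAL_DIM:]
```

Splitting the features this way can be useful for experiments that use only one modality, provided the ordering assumption holds.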
The dataset has two variations:
Sequential: Each video is sampled into windows of 0.96 s, resulting in an array of shape (N, 1152), where N depends on the video duration.
Non-Sequential: The entire video is globally aggregated as a single sample, generating an array of shape (1, 1152). Additionally, we compute the mean, median, max, min, and std of each feature, resulting in a final array of shape (1, 5760) for each video (5 statistics x 1152 features).
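The relationship between the two variations can be sketched as follows. This is a minimal example, not the authors' extraction script: it assumes the five statistics are computed over the window axis of the sequential array and concatenated in the order mean, median, max, min, std. The function name aggregate_video is hypothetical.

```python
import numpy as np


def aggregate_video(windows):
    """Collapse a sequential (N, 1152) array into a single (1, 5760) sample.

    ASSUMPTION: statistics are taken across the N windows and concatenated
    in the order mean, median, max, min, std.
    """
    windows = np.asarray(windows)
    if windows.ndim != 2:
        raise ValueError("expected a 2-D array of shape (N, features)")
    stats = [
        windows.mean(axis=0),
        np.median(windows, axis=0),
        windows.max(axis=0),
        windows.min(axis=0),
        windows.std(axis=0),
    ]
    # Five vectors of length 1152 become one row of length 5760.
    return np.concatenate(stats)[np.newaxis, :]


# Usage: a synthetic video with 10 windows of 0.96 s each.
sequential = np.random.default_rng(0).normal(size=(10, 1152))
print(aggregate_video(sequential).shape)
```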
We structured this dataset into chunks with a maximum size of 4 GB. Each chunk is stored as an NPZ file. Chunks are composed of keys and values; the keys are strings in the format (label)_(video id) (for instance, "improper_29024487", "proper_MqnZqzAxQTk", "improper_gore122"). Videos labeled "improper" are sensitive, and videos labeled "proper" are safe. The values are the audio-visual embeddings stored as NumPy arrays. A CSV file in the root directory contains all video indexes, metadata, and tags.
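A chunk in the layout described above could be read along these lines. This is a sketch only: the helper names parse_key and iter_chunk are hypothetical, the mapping of "improper" to 1 and "proper" to 0 is an assumed convention, and the official loading scripts are in the GitHub repository. The key is split on the first underscore only, since a video id may itself contain underscores.

```python
import numpy as np


def parse_key(key):
    """Split a key such as "improper_29024487" into (label, video_id, target).

    ASSUMPTION: target is 1 for "improper" (sensitive) and 0 for "proper" (safe).
    """
    label, video_id = key.split("_", 1)  # split once: ids may contain "_"
    if label not in ("proper", "improper"):
        raise ValueError(f"unexpected label in key: {key!r}")
    return label, video_id, int(label == "improper")


def iter_chunk(path):
    """Yield (video_id, target, features) for every video stored in one NPZ chunk."""
    with np.load(path) as chunk:
        for key in chunk.files:
            _, video_id, target = parse_key(key)
            yield video_id, target, chunk[key]
```

Because NPZ archives load each array lazily on access, iterating this way avoids holding a whole 4 GB chunk in memory at once.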
The scripts and more info about the dataset are available on the GitHub repository: https://github.com/TeleMidia/Sensitive-Video-Dataset