Text Mining | IEEE DataPort

COVID-19 on YouTube: A Data-Driven Analysis of Sentiment, Toxicity, and Content Recommendations

Please cite the following paper when using this dataset:

Vanessa Su and Nirmalya Thakur, “COVID-19 on YouTube: A Data-Driven Analysis of Sentiment, Toxicity, and Content Recommendations”, Proceedings of the IEEE 15th Annual Computing and Communication Workshop and Conference 2025, Las Vegas, USA, Jan 06-08, 2025 (Paper accepted for publication, Preprint: https://arxiv.org/abs/2412.17180).

Five Years of COVID-19 Discourse on Instagram: A Labeled Instagram Dataset of Over Half a Million Posts for Multilingual Sentiment Analysis

To download this dataset without purchasing an IEEE DataPort subscription, please visit: https://zenodo.org/records/13896353

Please cite the following paper when using this dataset:

Mpox Narrative on Instagram: A Labeled Multilingual Dataset of Instagram Posts on Mpox for Sentiment, Hate Speech, and Anxiety Analysis

To download this dataset without purchasing an IEEE DataPort subscription, please visit: https://zenodo.org/records/13738598

Please cite the following paper when using this dataset:

N. Thakur, “Mpox narrative on Instagram: A labeled multilingual dataset of Instagram posts on mpox for sentiment, hate speech, and anxiety analysis,” arXiv [cs.LG], 2024, URL: https://arxiv.org/abs/2409.05292

Biographies of literature writers

The biographies_EN dataset contains 1000 biographies of literature writers retrieved from the English version of Wikipedia: 500 biographies of women writers extracted from the category “19th-century_women_writers” (https://en.wikipedia.org/wiki/Category:19th-century_women_writers) and 500 biographies of male writers extracted from the category “19th-century_male_writers” (https://en.wikipedia.org/wiki/Category:19th-century_male_writers).

Categories: Machine Learning

Age dataset: A structured general-purpose dataset on life, work, and death of 1.22 million distinguished people

Several fields of study can benefit from a large, structured, and accurate dataset of historical figures. Because no such dataset exists, in this paper we aim to use machine learning and text mining models to collect, predict, and cleanse online data, with a focus on age and gender. We developed a five-step method and inferred birth and death years, binary gender, and occupation from community-submitted data across all language versions of the Wikipedia project.

Korean stock trading app review dataset

This dataset contains Android app user reviews crawled from https://play.google.com/store/apps between 2022/4/2 and 2022/4/14. Reviews of 24 Korean trading apps were collected from the Google Play Store, for a total of 41,705 reviews. For each review, the app name, user ID, review content, rating, and date were collected by web crawling. The entire dataset is in Korean.

Categories: Artificial Intelligence

Job-Skills

This dataset contains jobs and their associated skills extracted from job advertisements.

Categories: Artificial Intelligence

RetroRevMatchEvalICIP16: A retrospective reviewer matching dataset and evaluation for IEEE ICIP 2016

The "RetroRevMatchEvalICIP16" dataset provides a retrospective reviewer recommendation dataset and evaluation for IEEE ICIP 2016. The methodology used to obtain the recommendations and perform the evaluation is described in the associated paper:

Y. Zhao, A. Anand, and G. Sharma, “Reviewer recommendations using document vector embeddings and a publisher database: Implementation and evaluation,” IEEE Access, vol. 10, pp. 21798–21811, 2022. https://doi.org/10.1109/ACCESS.2022.3151640

USA Nov.2020 Election 20 Mil. Tweets (with Sentiment and Party Name Labels) Dataset

This dataset includes 24,201,654 tweets related to the US Presidential Election on November 3, 2020, collected between July 1, 2020, and November 11, 2020. Each tweet is labeled with the related party name and a sentiment score, along with the words that affected the score.
