Text Mining

Please cite the following paper when using this dataset:

Vanessa Su and Nirmalya Thakur, “COVID-19 on YouTube: A Data-Driven Analysis of Sentiment, Toxicity, and Content Recommendations”, Proceedings of the IEEE 15th Annual Computing and Communication Workshop and Conference 2025, Las Vegas, USA, Jan 06-08, 2025 (Paper accepted for publication, Preprint: https://arxiv.org/abs/2412.17180).

106 Views

The biographies_EN dataset contains 1000 biographies of literary writers retrieved from the English version of Wikipedia.

101 Views

Several fields of study can benefit from a large, structured, and accurate dataset of historical figures. Because such a dataset is lacking, in this paper we aim to use machine learning and text mining models to collect, predict, and cleanse online data, with a focus on age and gender. We developed a five-step method and inferred birth and death years, binary gender, and occupation from community-submitted data across all language versions of the Wikipedia project.

1156 Views

This dataset contains Android app user reviews crawled from https://play.google.com/store/apps between 2022/4/2 and 2022/4/14. User reviews of 24 Korean trading apps were collected from the Google Play Store, for a total of 41,705 reviews. The app name, user ID, review content, rating, and date were collected for each review by web crawling. The entire dataset is in Korean.
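
As a rough illustration of the five fields listed above, the following Python sketch models one review record and computes the mean rating per app. The class name, field names, types, and sample values are all assumptions made for illustration; the dataset's actual column names and file format are not stated here.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Review:
    """One user review. Field names are hypothetical, not the dataset's actual columns."""
    app_name: str
    user_id: str
    content: str   # review text (Korean in the actual dataset)
    rating: int    # assumed to be an integer star rating
    date: str      # assumed to be stored as a date string


def mean_rating_by_app(reviews):
    """Group reviews by app name and return the mean rating for each app."""
    ratings = defaultdict(list)
    for review in reviews:
        ratings[review.app_name].append(review.rating)
    return {app: mean(values) for app, values in ratings.items()}


# Made-up sample records, not taken from the dataset.
sample_reviews = [
    Review("AppA", "user-1", "sample text", 5, "2022-04-03"),
    Review("AppA", "user-2", "sample text", 3, "2022-04-05"),
    Review("AppB", "user-3", "sample text", 2, "2022-04-10"),
]

per_app = mean_rating_by_app(sample_reviews)
print(per_app)
```

A per-app aggregate of this kind is only one possible first look at the reviews; any real analysis would depend on how the released files are actually structured.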

182 Views

This dataset contains jobs and their associated skills extracted from job advertisements.

2453 Views

The "RetroRevMatchEvalICIP16" dataset provides a retrospective reviewer recommendation dataset and evaluation for IEEE ICIP 2016. The methodology used to obtain the recommendations and perform the evaluation is described in the associated paper.

Y. Zhao, A. Anand, and G. Sharma, “Reviewer recommendations using document vector embeddings and a publisher database: Implementation and evaluation,” IEEE Access, vol. 10, pp. 21 798–21 811, 2022. https://doi.org/10.1109/ACCESS.2022.3151640

282 Views

This dataset includes 24,201,654 tweets related to the US Presidential Election on November 3, 2020, collected between July 1, 2020, and November 11, 2020. The related party name, the sentiment score of each tweet, and the words that affect the score were added to the dataset.

7359 Views