Mpox Narrative on Instagram: A Labeled Multilingual Dataset of Instagram Posts on Mpox for Sentiment, Hate Speech, and Anxiety Analysis

Citation Author(s): Nirmalya Thakur

Department of Electrical Engineering and Computer Science, South Dakota School of Mines and Technology
Submitted by: Nirmalya Thakur
Last updated: Tue, 01/21/2025 - 20:53
DOI: 10.21227/7fvc-y093
Research Article Link: Data Descriptor
License: Creative Commons Attribution

Categories: Artificial Intelligence, Education and Learning Technologies, Machine Learning, Social Sciences, Biomedical and Health Sciences, Communications, Computational Intelligence, COVID-19, Demographic, Education, Age
Keywords: Instagram, Mpox, Monkeypox, Data Mining, Sentiment Analysis, hate speech, anxiety detection, stress analysis, Machine Learning, Supervised Learning, Unsupervised Learning, Social Media, data science, Data Analysis, Pattern Recognition, Information Retrieval, web mining, Public Health, social media platforms, social media mining, social networks, virus outbreak, emotion analysis, multilingual dataset, Natural Language Processing, NLP, AI, artificial intelligence, online misinformation, toxic content detection, public attitudes, syndromic surveillance, neural networks, WHO, epidemic, pandemic, classification, Google Translate, language detection, language translation, public perception, public discourse, misinformation analysis, online behavior, health communication, user-generated content, social contagion, online hate, Text Classification, toxic language, LGBTQ+ stigma, mpox stigma, Text Mining, pandemic studies, health misinformation, Dataset

Abstract

To download the dataset without purchasing an IEEE Dataport subscription, please visit: https://zenodo.org/records/13738598

Please cite the following paper when using this dataset:

N. Thakur, “Mpox narrative on Instagram: A labeled multilingual dataset of Instagram posts on mpox for sentiment, hate speech, and anxiety analysis,” arXiv [cs.LG], 2024, URL: https://arxiv.org/abs/2409.05292

The world is currently experiencing an outbreak of mpox, which the World Health Organization (WHO) has declared a Public Health Emergency of International Concern. During recent virus outbreaks, social media platforms have played a crucial role in keeping the global population informed about the outbreaks. As a result, in the last few years, researchers from different disciplines have developed social media datasets on various virus outbreaks. However, no prior work in this field has developed a dataset of Instagram posts about the mpox outbreak. The paper cited above aims to address this research gap. It presents this multilingual dataset of 60,127 Instagram posts about mpox, published between July 23, 2022, and September 5, 2024, in 52 languages. For each post, the Post ID, Post Description, date of publication, language, and translated version of the post (translation to English was performed using the Google Translate API) are presented as separate attributes in the dataset.

After developing this dataset, sentiment analysis, hate speech detection, and anxiety or stress detection were also performed. This process included classifying each post into:

one of the fine-grained sentiment classes, i.e., fear, surprise, joy, sadness, anger, disgust, or neutral;
hate or not hate; and
anxiety/stress detected or no anxiety/stress detected.

These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for sentiment, hate speech, and anxiety or stress detection, as well as for other applications.
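
As a rough illustration of how the three label attributes could be sanity-checked before training, the sketch below tallies labels and rejects unexpected values. The label vocabularies are taken from the description above, but the exact strings stored in the dataset files (spelling, capitalization, word order) are an assumption, and the helper names `normalize` and `count_labels` are hypothetical.

```python
from collections import Counter

# Label vocabularies as described in the text. The exact strings stored in the
# dataset files are an assumption and may differ in spelling or word order.
SENTIMENT_LABELS = {"fear", "surprise", "joy", "sadness", "anger", "disgust", "neutral"}
HATE_LABELS = {"hate", "not hate"}
ANXIETY_LABELS = {"anxiety/stress detected", "no anxiety/stress detected"}


def normalize(label):
    """Lowercase and collapse whitespace so minor formatting differences still match."""
    return " ".join(label.lower().split())


def count_labels(labels, allowed):
    """Count normalized labels, raising ValueError on any value outside `allowed`."""
    counts = Counter()
    for raw in labels:
        label = normalize(raw)
        if label not in allowed:
            raise ValueError(f"unexpected label: {raw!r}")
        counts[label] += 1
    return counts
```

Running `count_labels` once per label attribute gives the class distribution and surfaces any value that does not match the documented classes.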

The 52 distinct languages in which Instagram posts are present in the dataset are English, Portuguese, Indonesian, Spanish, Korean, French, Hindi, Finnish, Turkish, Italian, German, Tamil, Urdu, Thai, Arabic, Persian, Tagalog, Dutch, Catalan, Bengali, Marathi, Malayalam, Swahili, Afrikaans, Panjabi, Gujarati, Somali, Lithuanian, Norwegian, Estonian, Swedish, Telugu, Russian, Danish, Slovak, Japanese, Kannada, Polish, Vietnamese, Hebrew, Romanian, Nepali, Czech, Modern Greek, Albanian, Croatian, Slovenian, Bulgarian, Ukrainian, Welsh, Hungarian, and Latvian.

The following is a description of the attributes present in this dataset:

Post ID: Unique ID of each Instagram post
Post Description: Complete description of each post in the language in which it was originally published
Date: Date of publication in MM/DD/YYYY format
Language: Language of the post as detected using the Google Translate API
Translated Post Description: Translated version of the post description. All posts which were not in English were translated into English using the Google Translate API. No language translation was performed for English posts.
Sentiment: Results of sentiment analysis (using the preprocessed version of the translated Post Description) where each post was classified into one of the sentiment classes: fear, surprise, joy, sadness, anger, disgust, and neutral
Hate: Results of hate speech detection (using the preprocessed version of the translated Post Description) where each post was classified as hate or not hate
Anxiety or Stress: Results of anxiety or stress detection (using the preprocessed version of the translated Post Description) where each post was classified as stress/anxiety detected or no stress/anxiety detected.
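
The attribute list above can be mirrored as a small record type. The following is a minimal sketch, assuming the data is available as CSV text whose header row uses the attribute names exactly as listed; the actual file format and header spelling may differ. The `Post` class and `parse_posts` function are hypothetical names introduced for illustration.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class Post:
    post_id: str
    description: str             # original-language text
    published: date
    language: str
    translated_description: str  # English text (unchanged for English posts)
    sentiment: str
    hate: str
    anxiety_or_stress: str


def parse_posts(csv_text):
    """Parse CSV text whose header row uses the attribute names listed above.

    The column names are an assumption based on the attribute descriptions.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    posts = []
    for row in reader:
        posts.append(Post(
            post_id=row["Post ID"],
            description=row["Post Description"],
            # Dates are documented as MM/DD/YYYY.
            published=datetime.strptime(row["Date"], "%m/%d/%Y").date(),
            language=row["Language"],
            translated_description=row["Translated Post Description"],
            sentiment=row["Sentiment"],
            hate=row["Hate"],
            anxiety_or_stress=row["Anxiety or Stress"],
        ))
    return posts
```

Parsing the date into a `date` object makes it straightforward to filter posts by publication period.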

All the Instagram posts collected during the data mining process to develop this dataset were publicly available on Instagram and did not require a user to log in to Instagram to view them (at the time of writing this paper).

Instructions:

The dataset can be directly used for training and testing of machine learning algorithms for sentiment, hate speech, and anxiety or stress detection, as well as for other applications.
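
One possible way to prepare the data for training and testing is a stratified split, which keeps the share of each class roughly the same in both partitions. This is a sketch using only the Python standard library, not a procedure described in the paper; the function name `stratified_split`, the 20% test fraction, and the seed are illustrative assumptions.

```python
import random
from collections import defaultdict


def stratified_split(records, label_of, test_fraction=0.2, seed=0):
    """Split records into (train, test), keeping each label's share roughly equal.

    `label_of` is a function returning the class label of a record, for example
    the value of the Hate attribute.
    """
    by_label = defaultdict(list)
    for record in records:
        by_label[label_of(record)].append(record)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```

Stratifying on the target attribute matters most when one class is much rarer than the others, since a purely random split could leave very few examples of that class in the test set.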

Data Descriptor Article Link:

https://arxiv.org/pdf/2409.05292