Five Years of COVID-19 Discourse on Instagram: A Labeled Instagram Dataset of Over Half a Million Posts for Multilingual Sentiment Analysis
- Submitted by: Nirmalya Thakur
- Last updated: Mon, 10/21/2024 - 18:45
- DOI: 10.21227/d46p-v480
- Keywords: Instagram, Covid, covid-19, coronavirus, Data Mining, Sentiment Analysis, Machine Learning, Supervised Learning, Unsupervised Learning, Social Media, data science, Data Analysis, Pattern Recognition, Information Retrieval, web mining, Public Health, social media platforms, social media mining, social networks, virus outbreak, emotion analysis, multilingual dataset, Natural Language Processing, NLP, AI, artificial intelligence, online misinformation, toxic content detection, public attitudes, syndromic surveillance, neural networks, WHO, epidemic, pandemic, classification, Google Translate, language detection, language translation, public perception, public discourse, misinformation analysis, online behavior, health communication, user-generated content, social contagion, online hate, Text Classification, toxic language, LGBTQ+ stigma, Text Mining, pandemic studies, health misinformation, Dataset
Abstract
To download this dataset without purchasing an IEEE Dataport subscription, please visit: https://zenodo.org/records/13896353
Please cite the following paper when using this dataset:
N. Thakur, “Five Years of COVID-19 Discourse on Instagram: A Labeled Instagram Dataset of Over Half a Million Posts for Multilingual Sentiment Analysis”, Proceedings of the 7th International Conference on Machine Learning and Natural Language Processing (MLNLP 2024), Chengdu, China, October 18-20, 2024 (Paper accepted for publication, Preprint available at: https://arxiv.org/abs/2410.03293)
Abstract
The outbreak of COVID-19 served as a catalyst for content creation and dissemination on social media platforms, as such platforms serve as virtual communities where people can connect and communicate with one another seamlessly. While there have been several works related to the mining and analysis of COVID-19-related posts on social media platforms such as Twitter (or X), YouTube, Facebook, and TikTok, there is still limited research that focuses on the public discourse on Instagram in this context. Furthermore, the prior works in this field have only focused on the development and analysis of datasets of Instagram posts published during the first few months of the outbreak. The work presented in this paper aims to address this research gap and presents a novel multilingual dataset of 500,153 Instagram posts about COVID-19 published between January 2020 and September 2024. This dataset contains Instagram posts in 161 different languages. After the development of this dataset, multilingual sentiment analysis was performed using VADER and twitter-xlm-roberta-base-sentiment. This process involved classifying each post as positive, negative, or neutral. The results of sentiment analysis are presented as a separate attribute in this dataset.
For each of these posts, the Post ID, Post Description, Date of publication, language code, full version of the language, and sentiment label are presented as separate attributes in the dataset.
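As an illustration of the three-way labeling step, the sketch below maps a VADER-style compound score (a value between -1 and 1) to a label. The cut-offs of 0.05 and -0.05 follow the convention suggested in the VADER documentation; whether the dataset's author used these exact thresholds, and how the VADER and twitter-xlm-roberta-base-sentiment outputs were combined, is not stated here, so treat the function name and the threshold as assumptions.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a sentiment label.

    The threshold is an assumed value (the commonly cited VADER
    convention), not one confirmed for this dataset.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hypothetical compound scores, for illustration only.
example_scores = [0.62, -0.41, 0.0, 0.03]
labels = [label_from_compound(score) for score in example_scores]
print(labels)
```

Scores that fall strictly between the two cut-offs are treated as neutral, which is why a weakly positive score such as 0.03 does not receive the positive label.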
The Instagram posts in this dataset span 161 different languages. The top 10 languages by frequency are English (343,041 posts), Spanish (30,220 posts), Hindi (15,832 posts), Portuguese (15,779 posts), Indonesian (11,491 posts), Tamil (9,592 posts), Arabic (9,416 posts), German (7,822 posts), Italian (5,162 posts), and Turkish (4,632 posts).
There are 535,021 distinct hashtags in this dataset. The top 10 hashtags by frequency are #covid19 (169,865 posts), #covid (132,485 posts), #coronavirus (117,518 posts), #covid_19 (104,069 posts), #covidtesting (95,095 posts), #coronavirusupdates (75,439 posts), #corona (39,416 posts), #healthcare (38,975 posts), #staysafe (36,740 posts), and #coronavirusoutbreak (34,567 posts).
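Per-hashtag post counts of this kind can be approximated directly from the post descriptions. The following is a minimal sketch, not the author's procedure: it assumes a hashtag is a "#" followed by word characters, that matching is case-insensitive, and that a hashtag repeated within one post is counted once for that post. The function name and the sample posts are hypothetical.

```python
import re
from collections import Counter

# Assumed hashtag pattern: "#" followed by one or more word characters.
# In Python 3, \w also matches non-Latin letters, which suits multilingual text.
HASHTAG_RE = re.compile(r"#(\w+)")


def hashtag_post_counts(descriptions):
    """Count, for each hashtag, the number of posts that contain it."""
    counts = Counter()
    for text in descriptions:
        # A set ensures a hashtag repeated in one post is counted once.
        tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
        counts.update(tags)
    return counts


# Hypothetical post descriptions, for illustration only.
sample_posts = [
    "Stay home! #COVID19 #StaySafe",
    "Testing site open today #covid19 #covidtesting #covid19",
    "Sin hashtags aquí",
]
counts = hashtag_post_counts(sample_posts)
print(counts.most_common(3))
```

Calling `counts.most_common(10)` on the full set of descriptions would give a top-10 ranking, although the figures may differ from those reported if the original extraction used different rules.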
The following is a description of the attributes present in this dataset:
- Post ID: Unique ID of each Instagram post
- Post Description: Complete description of each post in the language in which it was originally published
- Date: Date of publication in MM/DD/YYYY format
- Language code: Language code (for example: “en”) that represents the language of the post as detected using the Google Translate API
- Full Language: Full form of the language (for example: “English”) that represents the language of the post as detected using the Google Translate API
- Sentiment: Results of sentiment analysis (using the preprocessed version of each post) where each post was classified as positive, negative, or neutral
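The attributes above can be read into typed records with a few lines of code. The sketch below assumes the data is available as a CSV file whose column headers match the attribute names exactly; the file format and header spelling are not stated on this page, so adjust them to the actual download. The sample rows are invented for illustration.

```python
import csv
import io
from datetime import datetime

# Hypothetical two-row sample; the header names are assumed to match
# the attribute names listed in the dataset description.
SAMPLE_CSV = """Post ID,Post Description,Date,Language code,Full Language,Sentiment
1,"Stay safe everyone #covid19",03/15/2020,en,English,positive
2,"Cuidado con el virus",11/02/2021,es,Spanish,neutral
"""


def load_posts(handle):
    """Read posts from an open CSV handle and parse MM/DD/YYYY dates."""
    posts = []
    for row in csv.DictReader(handle):
        # The Date attribute is documented as MM/DD/YYYY.
        row["Date"] = datetime.strptime(row["Date"], "%m/%d/%Y").date()
        posts.append(row)
    return posts


posts = load_posts(io.StringIO(SAMPLE_CSV))
print(posts[0]["Date"], posts[0]["Full Language"], posts[0]["Sentiment"])
```

For the real file, replace `io.StringIO(SAMPLE_CSV)` with `open(path, newline="", encoding="utf-8")`, where `path` points to the downloaded data.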
Open Research Questions:
This dataset is expected to support the investigation of research questions such as the following, among others:
- How does sentiment toward COVID-19 vary across different languages?
- How has public sentiment toward COVID-19 evolved from 2020 to the present?
- How do cultural differences affect social media discourse about COVID-19 across various languages?
- How has COVID-19 impacted mental health, as reflected in social media posts across different languages?
- How effective were public health campaigns in shifting public sentiment in different languages?
- What patterns of vaccine hesitancy or support are present in different languages?
- How did geopolitical events influence public sentiment about COVID-19 in multilingual social media discourse?
- What role does social media discourse play in shaping public behavior toward COVID-19 in different linguistic communities?
- How does the sentiment of minority or underrepresented languages compare to that of major world languages regarding COVID-19?
- What insights can be gained by comparing the sentiment of COVID-19 posts in widely spoken languages (e.g., English, Spanish) to those in less common languages?
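Several of these questions reduce to comparing the share of positive, negative, and neutral posts across groups, such as languages or years. The following is a minimal sketch of that aggregation under stated assumptions: the record keys (`language`, `year`, `sentiment`) and the sample values are hypothetical and would need to be derived from the dataset's Full Language, Date, and Sentiment attributes.

```python
from collections import Counter, defaultdict


def sentiment_shares(records, key):
    """Return, for each value of `key`, the fraction of posts per sentiment."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec[key]][rec["sentiment"]] += 1
    shares = {}
    for group, label_counts in counts.items():
        total = sum(label_counts.values())
        shares[group] = {label: n / total for label, n in label_counts.items()}
    return shares


# Hypothetical records, for illustration only.
sample = [
    {"language": "English", "year": 2020, "sentiment": "negative"},
    {"language": "English", "year": 2020, "sentiment": "positive"},
    {"language": "English", "year": 2023, "sentiment": "neutral"},
    {"language": "Spanish", "year": 2020, "sentiment": "negative"},
]
by_language = sentiment_shares(sample, "language")
by_year = sentiment_shares(sample, "year")
print(by_language)
print(by_year)
```

Shares rather than raw counts are used because the languages differ greatly in size, so raw counts would mostly reflect the dominance of English posts.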
All the Instagram posts collected during the data mining process used to develop this dataset were publicly available on Instagram and could be viewed without logging in to Instagram (at the time of writing this paper).
Please refer to the above-mentioned paper for more information about the dataset development.