MultiModal dataset from Instagram

Citation Author(s):
Qi Yang
Submitted by:
Qi Yang
Last updated:
Tue, 05/17/2022 - 22:17
DOI:
10.21227/j1rf-fa09
Research Article Link:
License:

Abstract 

We collected 248,166 public microblogs from Instagram using 97 hashtags selected from the "Top 100" hashtags. We removed duplicate hashtags within each sample and dropped microblogs that have no text. The final collection, called MultiModal data from Instagram (MM-INS), contains 56,861 microblogs, each of which includes both text and an image.

Instructions: 

This dataset is a collection of microblogs crawled from Instagram with Instaloader, https://instaloader.github.io/. Because the raw dataset is too large to upload in full, we provide three sub-datasets without preprocessing, "#beach", "#cat", and "#dog", together with the corresponding preprocessed sub-datasets, "beach", "cat", and "dog", from which images without text have been removed. We hope these samples are helpful for your research, and we are open to academic cooperation.
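
The preprocessing described here (removing duplicate hashtags within a sample and dropping microblogs without text) could be sketched roughly as follows. This is a minimal illustration, not the authors' actual pipeline: the sample layout (a dict with hypothetical "image" and "caption" keys), the function names, the case-insensitive hashtag comparison, and the choice to treat an empty or whitespace-only caption as "no text" are all assumptions.

```python
import re
from typing import Dict, List, Optional

# Assumption: a hashtag is "#" followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")


def unique_hashtags(caption: str) -> List[str]:
    """Return the hashtags in a caption, lowercased, without duplicates,
    in order of first appearance."""
    seen = set()
    tags = []
    for tag in HASHTAG_RE.findall(caption):
        key = tag.lower()
        if key not in seen:
            seen.add(key)
            tags.append(key)
    return tags


def preprocess(samples: List[Dict[str, Optional[str]]]) -> List[Dict[str, object]]:
    """Keep only samples that have both an image and non-empty text,
    and attach the de-duplicated hashtag list to each kept sample."""
    cleaned = []
    for sample in samples:
        caption = (sample.get("caption") or "").strip()
        image = sample.get("image")
        # Drop microblogs without text (or without an image).
        if not caption or not image:
            continue
        cleaned.append(
            {
                "image": image,
                "caption": caption,
                "hashtags": unique_hashtags(caption),
            }
        )
    return cleaned


if __name__ == "__main__":
    raw = [
        {"image": "a.jpg", "caption": "Sunny day #beach #Beach #sea"},
        {"image": "b.jpg", "caption": "   "},  # no text: dropped
    ]
    for item in preprocess(raw):
        print(item["image"], item["hashtags"])
```

Keeping the first occurrence of each hashtag preserves the order in which the author tagged the post, which may matter if hashtags are later used as ordered labels.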