This dataset has been developed based on the work of the GeoCOV19Tweets Dataset. The original work by Lamsal, R. runs network analysis on a similar dataset to understand the underlying relationship between countries and hashtags. The work did an analysis on roughly 300k number of [country, hashtag] relations from 190 countries and territories, and 5055 unique hashtags. This work pushes the number of relationships by 3 times.

If you're using this dataset, please cite the original work: Design and analysis of a large-scale COVID-19 tweets dataset. The tweets collected in the GeoCOV19Tweets Dataset have been used for extracting more than 900k [place, hashtag] relationships. To be specific, 936,570 relationships. Each relationship is provided in a single row where the place and the hashtag are separated by a comma.

As an example, consider this tweet tweeted from Frankfurt, Germany - I am doing just fine in this COVID-19 pandemic. #quarantine #life. Since the example tweet has two hashtags, there are two possible relationships, (i) Frankfurt, quarantine (ii) Frankfurt, life. This way, all the tweets in the GeoCOV19Tweets Dataset have been processed for extracting similar relationships between places and hashtags. The cover image (or thumbnail) of this dataset is the result of importing the relationships as adjacency lists and running a network analysis on the resulting nodes and edges. Nodes are basically the places and hashtags, while edges are respective relationships.

The place entity in this dataset is the ["place"]["name"] string object received from Twitter's API payload. The ['name'] field in the place data dictionary of Twitter payload provides access to a short human-readable representation of the place’s name. You can learn more about Twitter's data dictionary here. The place entity can be easily changed to country or city as per needs. This decision should be taken during the hydration of the tweet IDs.

You can peek into this dataset from here: screenshot 1 and screenshot 2

Instructions:

This dataset provides [place, hashtag] relationships in a Comma-separated values (CSV) file. Each line represents a relationship. You can simply use the CSV file as per your research needs.

However, if you need to change the place entity from city (currently the dataset uses ["place"]["name"] object) to country, you'll have to consider the ["place"]["country"] object instead. The sample script is provided with this dataset. The script takes in a list of tweet IDs present in a CSV file and hydrates the IDs to extract places and hashtags relationships. The script is written for twarc.