GeoCoV19: A Dataset of Hundreds of Millions of Multilingual COVID-19 Tweets with Location Information

Name: GeoCoV19: A Dataset of Hundreds of Millions of Multilingual COVID-19 Tweets with Location Information
Creator: Muhammad Imran
License: https://creativecommons.org/licenses/by/4.0/

Citation Author(s): Umair Qazi (Qatar Computing Research Institute), Muhammad Imran (Qatar Computing Research Institute), Ferda Ofli (Qatar Computing Research Institute)
Submitted by: Muhammad Imran
Last updated: Wed, 06/24/2020 - 19:39
DOI: 10.21227/et8d-w881
Data Format: *.JSON (ZIP)
Links: CrisisNLP GeoCoV19 repo, arXiv (paper)

Keywords:

Health-related tweets

Abstract:

We present GeoCoV19, a large-scale Twitter dataset related to the ongoing COVID-19 pandemic. The dataset was collected over a period of 90 days, from February 1 to May 1, 2020, and consists of more than 524 million multilingual tweets. Because geolocation information is essential for many tasks such as disease tracking and surveillance, we employed a gazetteer-based approach to extract toponyms from user location and tweet content and to derive their geolocation information, at different granularity levels, using Nominatim (OpenStreetMap) data. In terms of geographical coverage, the dataset spans 218 countries and 47K cities. The tweets come from more than 43 million Twitter users, including around 209K verified accounts. These users posted tweets in 62 different languages.

The dataset was collected using more than 800 multilingual keywords and hashtags. The complete list of keywords can be downloaded from here: https://crisisnlp.qcri.org/covid19

For more details, please refer to this paper: https://arxiv.org/abs/2005.11177

Explore interesting trends in GeoCoV19 dataset using our new service: https://covid19-trends.qcri.org/

Instructions:

GeoCoV19 Dataset Description

The GeoCoV19 Dataset comprises several TAR files, which contain zip files representing daily data. Each zip file contains a JSON with the following format:

{
  "tweet_id": "122365517305623353",
  "created_at": "Sat Feb 01 17:11:42 +0000 2020",
  "user_id": "335247240",
  "geo_source": "user_location",
  "user_location": { "country_code": "br" },
  "geo": {},
  "place": {},
  "tweet_locations": [
    { "country_code": "it", "state": "Trentino-Alto", "county": "Pustertal - Val Pusteria" },
    { "country_code": "us" },
    { "country_code": "ru", "state": "Voronezh Oblast", "county": "Petropavlovsky District" },
    { "country_code": "at", "state": "Upper Austria", "county": "Braunau am Inn" },
    { "country_code": "it", "state": "Trentino-Alto", "county": "Pustertal - Val Pusteria" },
    { "country_code": "cn" },
    { "country_code": "in", "state": "Himachal Pradesh", "county": "Jubbal" }
  ]
}
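
As a rough illustration of reading one daily zip file, the Python sketch below yields the records it contains. The function name iter_records is hypothetical, and the layout inside each JSON file is an assumption: the description above does not say whether a file holds a single JSON document or one record per line, so the sketch tries both.

```python
import json
import zipfile
from typing import Iterator


def iter_records(zip_source) -> Iterator[dict]:
    """Yield tweet records from one daily zip file.

    zip_source may be a file path or a binary file-like object.
    Assumption: each member is either a single JSON document
    (an object or an array of objects) or JSON Lines.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            text = archive.read(name).decode("utf-8")
            try:
                parsed = json.loads(text)
            except json.JSONDecodeError:
                # Not a single document: fall back to one object per line.
                for line in text.splitlines():
                    if line.strip():
                        yield json.loads(line)
                continue
            if isinstance(parsed, list):
                yield from parsed
            else:
                yield parsed
```

This version loads each member fully into memory, which keeps the sketch short; for very large daily files, reading the member line by line through archive.open(name) may be preferable.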

Description of all the fields in the above JSON

Each JSON in the Geo file has the following eight keys:

1. tweet_id: the Twitter-provided ID of a tweet.

2. created_at: the Twitter-provided "created_at" date and time in UTC.

3. user_id: the Twitter-provided user ID.

4. geo_source: this field takes one of four values: (i) coordinates, (ii) place, (iii) user_location, or (iv) tweet_text. The value depends on which of these fields are available, with priority given to the most accurate one. The priority order is coordinates, place, user_location, and tweet_text. For instance, when a tweet has GPS coordinates, the value will be "coordinates" even if all other location fields are present. If a tweet has no GPS, place, or user_location information, the value will be "tweet_text", provided the tweet text mentions a location.

The remaining keys can have the following location_json inside them. Sample location_json: {"country_code":"us","state":"California","county":"San Francisco","city":"San Francisco"}. Depending on the available granularity, country_code, state, county or city keys can be missing in the location_json.

5. user_location: it can hold a "location_json" as described above or an empty JSON {}. This field uses the "location" profile metadata of a Twitter user, which is the user-declared location in text form. We resolve that text to a location.

6. geo: it represents the "geo" field provided by Twitter. We resolve the provided latitude and longitude values to a location. It can hold a "location_json" as described above or an empty JSON {}.

7. tweet_locations: this field can hold an array of "location_json" objects as described above, [location_json1, location_json2], or an empty array []. It uses the tweet content (i.e., the actual tweet message) to find toponyms. A tweet message can mention several different locations (i.e., toponyms), which is why the field is an array representing all toponyms in a tweet. For instance, a tweet like "The UK has over 65,000 #COVID19 deaths. More than Qatar, Pakistan, and Norway." contains four location mentions, and the tweet_locations array should represent these four separately.

8. place: it can hold a "location_json" as described above or an empty JSON {}. It represents the Twitter-provided "place" field.
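
The field descriptions above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the names SOURCE_TO_KEY, primary_locations, and finest_level are hypothetical, and the mapping from each geo_source value to a record key (for example "coordinates" to "geo") is an assumption inferred from the descriptions rather than stated explicitly.

```python
from typing import List

# Assumed mapping from each geo_source value to the record key that
# holds the resolved location(s). The descriptions list the values and
# their priority order but do not spell this mapping out.
SOURCE_TO_KEY = {
    "coordinates": "geo",
    "place": "place",
    "user_location": "user_location",
    "tweet_text": "tweet_locations",
}


def primary_locations(record: dict) -> List[dict]:
    """Return the location_json objects selected by the record's geo_source."""
    key = SOURCE_TO_KEY.get(record.get("geo_source", ""))
    if key is None:
        return []
    value = record.get(key)
    if not value:
        return []  # empty {} or []
    # tweet_locations is already a list; the other fields hold one object.
    return value if isinstance(value, list) else [value]


def finest_level(location: dict) -> str:
    """Name the most specific key present in a location_json."""
    for level in ("city", "county", "state", "country_code"):
        if level in location:
            return level
    return "unknown"
```

For the sample record shown earlier, whose geo_source is "user_location", primary_locations returns the single user_location object, and finest_level reports "country_code" for it because no state, county, or city key is present.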

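Tweet hydrators:

The records contain tweet IDs but not tweet text, so the full tweets must be retrieved (hydrated) from Twitter with a tool such as one of the following.

CrisisNLP (Java): https://crisisnlp.qcri.org/#resource8

Twarc (Python): https://github.com/DocNow/twarc#dehydrate

DocNow Hydrator (desktop application): https://github.com/docnow/hydrator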
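
Hydration tools generally take a plain-text file with one tweet ID per line; twarc's hydrate command, for instance, reads such a file. The Python sketch below writes that file from an iterable of records. The function name write_tweet_ids and the output path are hypothetical, and the records are assumed to be dictionaries with a "tweet_id" key as in the sample JSON.

```python
from typing import Iterable


def write_tweet_ids(records: Iterable[dict], out_path: str) -> int:
    """Write unique tweet IDs, one per line, and return how many were written."""
    seen = set()
    with open(out_path, "w", encoding="utf-8") as out:
        for record in records:
            tweet_id = str(record["tweet_id"])
            if tweet_id not in seen:
                seen.add(tweet_id)
                out.write(tweet_id + "\n")
    return len(seen)
```

The in-memory set used for de-duplication is reasonable for a single daily file; for the full collection it would likely be better to process one day at a time.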

If you have doubts or questions, feel free to contact us at: uqazi@hbku.edu.qa and mimran@hbku.edu.qa

Can you please tell me how I can get the sentiment labels in this dataset?

imran khan Tue, 07/07/2020 - 15:04 Permalink

Thanks for your question. This dataset does not have sentiment labels. However, you can use any multilingual sentiment classifier to determine tweets' sentiment polarity.

Muhammad Imran Tue, 07/07/2020 - 20:03 Permalink

If you don't mind, can you give some references for the "sentiment classifier"? I searched all over the internet, and the references I found were not as good as I wanted.

Thank you.

imran khan Wed, 07/08/2020 - 11:05 Permalink

Probably the following references would be helpful:

Severyn, A., & Moschitti, A. (2015, August). Twitter sentiment analysis with deep convolutional neural networks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 959-962).

Giachanou, A., & Crestani, F. (2016). Like it or not: A survey of twitter sentiment analysis methods. ACM Computing Surveys (CSUR), 49(2), 1-41.

Muhammad Imran Tue, 07/14/2020 - 19:21 Permalink

How much time will it take me to hydrate this complete dataset?

Somodo Non Thu, 07/23/2020 - 06:20 Permalink

I think it depends on how many parallel threads one uses to call the Twitter API. Parallel calls will significantly reduce the rehydration time.

Muhammad Imran Tue, 08/18/2020 - 14:15 Permalink

Do you have datasets for May and June as well?

Hyun Kim Thu, 07/23/2020 - 08:08 Permalink

Yes, we have been collecting data for May, June, July, and onwards. We need to process it before sharing it. It may take some time though.

Muhammad Imran Tue, 08/18/2020 - 14:17 Permalink

How can I get the sentiment labels for this dataset?

Abdullah Matin Sun, 08/23/2020 - 08:47 Permalink

Hi, I didn't see the original tweets in this dataset; without them I cannot apply sentiment analysis. Could you also include them in your dataset?

Yimei Fan Fri, 11/13/2020 - 02:45 Permalink

Do you have any statistics on COVID-related tweets per country that you can share?

Davide Morselli Wed, 01/27/2021 - 10:42 Permalink

Thank you Mohammed for this dataset, but I did not find the tweet text. Does the dataset have tweet text?

Soha Mohamed Wed, 03/17/2021 - 22:40 Permalink
