GeoCoV19: A Dataset of Hundreds of Millions of Multilingual COVID-19 Tweets with Location Information
- Submitted by: Muhammad Imran
- Last updated: Wed, 06/24/2020 - 15:39
- DOI: 10.21227/et8d-w881
Abstract
We present GeoCoV19, a large-scale Twitter dataset related to the ongoing COVID-19 pandemic. The dataset was collected over a period of 90 days, from February 1 to May 1, 2020, and consists of more than 524 million multilingual tweets. Because geolocation information is essential for many tasks such as disease tracking and surveillance, we employed a gazetteer-based approach to extract toponyms from the user location and the tweet content, and resolved them to geolocations at different granularity levels using Nominatim (OpenStreetMap) data. In terms of geographical coverage, the dataset spans over 218 countries and 47K cities in the world. The tweets in the dataset are from more than 43 million Twitter users, including around 209K verified accounts. These users posted tweets in 62 different languages.
The dataset was collected using more than 800 multilingual keywords and hashtags. The complete list of keywords can be downloaded from here: https://crisisnlp.qcri.org/covid19
For more details, please refer to this paper: https://arxiv.org/abs/2005.11177
Explore interesting trends in the GeoCoV19 dataset using our new service: https://covid19-trends.qcri.org/
GeoCoV19 Dataset Description
The GeoCoV19 Dataset comprises several TAR files, which contain zip files representing daily data. Each zip file contains a JSON with the following format:
{
  "tweet_id": "122365517305623353",
  "created_at": "Sat Feb 01 17:11:42 +0000 2020",
  "user_id": "335247240",
  "geo_source": "user_location",
  "user_location": { "country_code": "br" },
  "geo": {},
  "place": {},
  "tweet_locations": [
    { "country_code": "it", "state": "Trentino-Alto", "county": "Pustertal - Val Pusteria" },
    { "country_code": "us" },
    { "country_code": "ru", "state": "Voronezh Oblast", "county": "Petropavlovsky District" },
    { "country_code": "at", "state": "Upper Austria", "county": "Braunau am Inn" },
    { "country_code": "it", "state": "Trentino-Alto", "county": "Pustertal - Val Pusteria" },
    { "country_code": "cn" },
    { "country_code": "in", "state": "Himachal Pradesh", "county": "Jubbal" }
  ]
}
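The nested layout described above (TAR files holding daily zip files, each holding JSON) can be read without unpacking everything to disk. The following is a minimal sketch, not an official loader. It assumes that each file inside a zip stores one JSON object per line; the description does not state this, so adjust the parsing if a file holds a single JSON document instead. The function name iter_records is hypothetical.

```python
import io
import json
import tarfile
import zipfile
from typing import Iterator


def iter_records(tar_path: str) -> Iterator[dict]:
    """Yield tweet records from one TAR file of daily zip files.

    Assumption: every file inside a zip holds one JSON object per line.
    """
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if not (member.isfile() and member.name.endswith(".zip")):
                continue
            # ZipFile needs a seekable file, so the daily zip is read into memory.
            zip_bytes = tar.extractfile(member).read()
            with zipfile.ZipFile(io.BytesIO(zip_bytes)) as daily:
                for name in daily.namelist():
                    if name.endswith("/"):
                        continue  # skip directory entries
                    with daily.open(name) as fh:
                        for line in io.TextIOWrapper(fh, encoding="utf-8"):
                            line = line.strip()
                            if line:
                                yield json.loads(line)
```

Calling iter_records("geo_feb_01_10.tar") would then yield one dictionary per tweet. Because each daily zip is loaded into memory, the larger archives may call for extracting the zip files to disk first.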
Description of all the fields in the above JSON
Each JSON in the Geo file has the following eight keys:
1. tweet_id: the Twitter-provided id of the tweet.
2. created_at: the Twitter-provided "created_at" date and time in UTC.
3. user_id: the Twitter-provided user id.
4. geo_source: this field takes one of four values: (i) coordinates, (ii) place, (iii) user_location, or (iv) tweet_text. The value depends on which of these fields are available, with priority given to the most accurate one. The priority order is coordinates, place, user_location, and tweet_text. For instance, when a tweet has GPS coordinates, the value will be "coordinates" even if all other location fields are present. If a tweet has no GPS, place, or user_location information, the value will be "tweet_text", provided the tweet text mentions a location.
The remaining keys can have the following location_json inside them. Sample location_json: {"country_code":"us","state":"California","county":"San Francisco","city":"San Francisco"}. Depending on the available granularity, country_code, state, county or city keys can be missing in the location_json.
5. user_location: a "location_json" as described above, or an empty JSON {}. This field uses the "location" profile metadata of a Twitter user, which is the user-declared location in free-text form. We resolve that text to a location.
6. geo: represents the "geo" field provided by Twitter. We resolve the provided latitude and longitude values to a location. It can hold a "location_json" as described above or an empty JSON {}.
7. tweet_locations: an array of "location_json" objects as described above, [location_json1, location_json2], or an empty array []. This field uses the tweet content (i.e., the actual tweet message) to find toponyms. A tweet message can mention several different locations, which is why this field is an array covering all the toponyms in a tweet. For instance, the tweet "The UK has over 65,000 #COVID19 deaths. More than Qatar, Pakistan, and Norway." contains four location mentions, and the tweet_locations array should represent each of the four separately.
8. place: a "location_json" as described above, or an empty JSON {}. It represents the Twitter-provided "place" field.
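The priority rule for geo_source and the optional keys of a location_json can be sketched in a few lines. This is an illustration under assumptions, not the code used to build the dataset. In particular, the mapping from the geo_source value "coordinates" to the "geo" key and from "tweet_text" to "tweet_locations" is inferred from the field descriptions, and all function names are hypothetical.

```python
from typing import Optional

# Assumed mapping from each geo_source value to the key that holds its data,
# listed in the stated priority order (most accurate first).
SOURCE_TO_KEY = {
    "coordinates": "geo",
    "place": "place",
    "user_location": "user_location",
    "tweet_text": "tweet_locations",
}

# location_json keys from coarsest to finest granularity.
GRANULARITY = ["country_code", "state", "county", "city"]


def derive_geo_source(record: dict) -> Optional[str]:
    """Return the highest-priority source whose field is non-empty."""
    for source, key in SOURCE_TO_KEY.items():
        if record.get(key):
            return source
    return None


def primary_location(record: dict) -> Optional[dict]:
    """Return the location_json that geo_source points to, or None."""
    key = SOURCE_TO_KEY.get(record.get("geo_source"))
    value = record.get(key) if key else None
    if isinstance(value, list):
        # tweet_locations is an array; the first mention is an arbitrary default.
        value = value[0] if value else None
    return value or None


def finest_level(location: dict) -> Optional[str]:
    """Return the most specific key present in a location_json."""
    for level in reversed(GRANULARITY):
        if location.get(level):
            return level
    return None
```

For the sample record shown earlier, "geo" and "place" are empty and "user_location" is not, so the derived source agrees with the stored geo_source value "user_location".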
Tweet hydrators:
CrisisNLP (Java): https://crisisnlp.qcri.org/#resource8
Twarc (Python): https://github.com/DocNow/twarc#dehydrate
Docnow (Desktop application): https://github.com/docnow/hydrator
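The JSON records carry tweet ids and location fields but no tweet text, so the full tweets have to be retrieved with a hydrator. Tools such as twarc and the DocNow Hydrator read a plain-text file with one tweet id per line. A minimal sketch for producing such a file from parsed records follows; the function name write_tweet_ids and the output path are hypothetical.

```python
from typing import Iterable


def write_tweet_ids(records: Iterable[dict], out_path: str) -> int:
    """Write one tweet id per line and return the number of ids written."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for record in records:
            tweet_id = record.get("tweet_id")
            if tweet_id:
                out.write(f"{tweet_id}\n")
                count += 1
    return count
```

The records argument can be any iterable of dictionaries parsed from the daily JSON files. Writing one id file per day or per TAR file keeps each hydration job small enough to resume if it is interrupted.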
If you have doubts or questions, feel free to contact us at: uqazi@hbku.edu.qa and mimran@hbku.edu.qa
Dataset Files
- geo_feb_01_10.tar (799.59 MB)
- geo_feb_11_20.tar (579.87 MB)
- geo_feb_21_29.tar (1.96 GB)
- geo_march_01_10.tar (3.26 GB)
- geo_march_11_20.tar (4.51 GB)
- geo_march_21_31.tar (6.11 GB)
- geo_april_01_10.tar (6.75 GB)
- geo_april_11_20.tar (7.12 GB)
- geo_april_21_30.tar (6.47 GB)
Open Access dataset files are accessible to all logged-in users with a free IEEE account. IEEE membership is not required.
Comments
Can you please tell me how I can get the sentiment labels in this dataset?
Thanks for your question. This dataset does not have sentiment labels. However, you can use any multilingual sentiment classifier to determine tweets' sentiment polarity.
If you don't mind, can you give some references for the "sentiment classifier"? I searched all over the internet, and the references I found were not as good as I wanted.
Thank you.
The following references may be helpful:
Severyn, A., & Moschitti, A. (2015, August). Twitter sentiment analysis with deep convolutional neural networks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 959-962).
Giachanou, A., & Crestani, F. (2016). Like it or not: A survey of twitter sentiment analysis methods. ACM Computing Surveys (CSUR), 49(2), 1-41.
How much time will it take me to hydrate this complete dataset?
I think it depends on how many parallel threads one uses to call the Twitter API. Parallel calls will significantly reduce the rehydration time.
Do you have datasets for May and June as well?
Yes, we have been collecting data for May, June, July, and onwards. We need to process it before sharing it. It may take some time though.
How can I get the sentiment labels for this dataset?
Hi, I didn't see the original tweets in this dataset; without them I cannot apply sentiment analysis. Could you also include them in your dataset?
Do you have any statistics on COVID-related tweets per country that you can share?
Thank you Mohammed for this dataset, but I did not find the tweet text. Does the dataset have tweet text?