Coronavirus (COVID-19) Geo-tagged Tweets Dataset


Abstract 

This dataset contains the IDs and sentiment scores of geo-tagged tweets related to the COVID-19 pandemic. The tweets are captured by an ongoing project deployed at https://live.rlamsal.com.np. The model monitors the real-time Twitter feed for coronavirus-related tweets, using 90+ keywords and hashtags that are commonly used while referencing the pandemic. Complying with Twitter's content redistribution policy, only the tweet IDs are shared; you can re-construct the dataset by hydrating these IDs. The tweet IDs in this dataset belong to tweets that provide an exact (point) location.

-------------------------------------

Related: Coronavirus (COVID-19) Tweets Sentiment Trend (Global), Coronavirus (COVID-19) Tweets Dataset, and Tweets Originating from India During COVID-19 Lockdowns 1, 2, 3, 4

-------------------------------------

Below is a quick overview of this dataset.

— Dataset name: GeoCOV19Tweets Dataset

— Number of tweets: 246,117

— Coverage: Global

— Language: English (EN)

— Primary dataset: Coronavirus (COVID-19) Tweets Dataset (COV19Tweets Dataset)

— Dataset updates: Every day

— Usage policy: As per Twitter's Developer Policy

— Active keywords and hashtags: keywords.tsv

Please visit this page (primary dataset) for details regarding the collection date and time of each CSV file present in this dataset.

Instructions: 

Each CSV file contains a list of tweet IDs. You can use these tweet IDs to download fresh data from Twitter (hydrating the tweet IDs). To make it easy for NLP researchers to access the sentiment analysis of each collected tweet, the sentiment score computed by TextBlob has been appended as the second column. To hydrate the tweet IDs, you can use applications such as Hydrator (available for OS X, Windows, and Linux) or twarc (a Python library).

Getting the CSV files of this dataset ready for hydrating the tweet IDs:

import pandas as pd

# read the two-column file (tweet ID, sentiment score); there is no header row
dataframe = pd.read_csv("april28_april29.csv", header=None)

# keep only the tweet ID column
dataframe = dataframe[0]

# export the IDs without an index or a header row
dataframe.to_csv("ready_april28_april29.csv", index=False, header=False)

The above example code takes in the original CSV file (i.e., april28_april29.csv) from this dataset and exports just the tweet ID column to a new CSV file (i.e., ready_april28_april29.csv). The newly created CSV file can now be consumed by the Hydrator application for hydrating the tweet IDs. To export the tweet ID column into a TXT file, just replace ".csv" with ".txt" in the to_csv function (last line) of the above example code.

If you are not comfortable with Python and pandas, you can upload these CSV files to your Google Drive and use Google Sheets to delete the second column. Once finished with the deletion, download the edited CSV files: File > Download > Comma-separated values (.csv, current sheet). These downloaded CSV files are now ready to be used with the Hydrator app for hydrating the tweet IDs.
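Once a file containing only tweet IDs is ready, the hydration step itself can also be scripted. Below is a minimal sketch using twarc as a Python library; the credential values are placeholders for your own Twitter developer keys, and the file names follow the example above.

import json
from twarc import Twarc

# placeholder credentials from your Twitter developer account
consumer_key = ""
consumer_secret = ""
access_token = ""
access_token_secret = ""

t = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)

# hydrate the prepared ID file and archive the full tweets as JSON lines
with open("april28_april29.jsonl", "w") as out:
    for tweet in t.hydrate(open("ready_april28_april29.txt")):
        out.write(json.dumps(tweet) + "\n")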

Comments

Great Work!

 

Submitted by Sadiksha sharma on Sun, 04/26/2020 - 04:14

Thanks, Sadiksha!

Submitted by Rabindra Lamsal on Fri, 05/08/2020 - 02:46

Thank you very much for providing this dataset and your support

Submitted by hanaa hammad on Tue, 05/05/2020 - 09:39

My pleasure, Hanaa!

Submitted by Rabindra Lamsal on Tue, 05/05/2020 - 12:39

I created an IEEE account just to download this dataset. There are numerous tweet datasets currently floating around, but none of them had specifically the list of tweet IDs with pin locations. Thanks for your efforts.

Submitted by Curran White on Fri, 05/08/2020 - 02:45

Thanks, Curran! I am glad that you found the dataset useful.

Submitted by Rabindra Lamsal on Fri, 05/08/2020 - 03:20

Hi, I hydrated the IDs file using twarc (https://github.com/echen102/COVID-19-TweetIDs/pull/2/commits/7d16ff3f29acf15af88c0d27424041b711865be3).

But when I tried to add the condition you used to get geolocation data, it gives me an invalid syntax error.

It would be nice if you could share the twarc code you used so that I can edit the variable names properly.

You have done great work!

Submitted by WonSeok Kim on Sat, 05/09/2020 - 15:17

Hey Kim. I think you meant using twarc (https://github.com/DocNow/twarc). That was just pseudo-code that I had mentioned in the abstract (I've now replaced it with an excerpt of the real code to avoid confusion).

It does not matter how you are getting your JSON archived. Just make sure to add the following "if" clause in whatever way you're trying to pull the tweets. The "if" clause below is TRUE only if the tweet contains an exact pin location.

import json

data = json.loads(data)  # "data" holds a single tweet's raw JSON

if data["coordinates"]:  # non-null only when the tweet carries an exact pin location
    longitude, latitude = data["coordinates"]["coordinates"]  # GeoJSON order: [long, lat]

Now you can store the longitude and latitude values as per your convenience. I hope this helps!

Submitted by Rabindra Lamsal on Sun, 05/24/2020 - 12:53

Hey, I want to download the full data, not only the IDs. How can I do so? Please respond.

 

Submitted by charu v on Wed, 05/20/2020 - 13:46

Hello Charu. Twitter's data sharing policy does not allow anyone to share tweet information other than tweet ID and/or user ID. The list of IDs should be hydrated to re-create a full fresh tweet dataset. For this purpose, you can use applications such as DocNow's Hydrator or QCRI's Tweets Downloader.

Submitted by Rabindra Lamsal on Fri, 05/29/2020 - 22:28

Thanks for the data. I am not sure if this is just at my end, but the CSV files have an issue with the tweet ID fields due to a 15-digit limit. The values are different from the ones in the JSON. Maybe export them to .txt files rather than .csv.

Submitted by Abhay Singh on Tue, 06/02/2020 - 21:38

Hello Abhay. Yes, I have heard from a couple of people about the tweet IDs getting truncated on their machines. That is why I am also uploading the JSON for those experiencing this issue.

Can you confirm whether the IDs are truncated even when opened in a text editor (Notepad or Sublime)? I think you're opening the CSV files with MS Excel. I've seen multiple posts on Stack Exchange about Excel truncating digits after the 15th.

Submitted by Rabindra Lamsal on Tue, 06/02/2020 - 22:18

Hello Rabindra,

 

No, it doesn't happen if you open the dataset using some other editor. Reading the data in different systems (R/Python) can lead to different results, as they may not convert it properly. Also, if someone is using the Hydrator app and converts the CSV to TXT with just the IDs, then it will have errors. Anyway, it's fairly straightforward to convert the JSON to a TXT file containing the IDs, but some users may benefit from having just .txt files.

 

Cheers

Submitted by Abhay Singh on Tue, 06/02/2020 - 22:43

Thanks for getting back.

If you use DocNow's Hydrator app, you can straightaway import the downloaded CSV file for hydrating (after removing the sentiment column). However, QCRI's Tweets Downloader requires a TXT file (with a single tweet ID per line). So you'll have to play around with the CSV file, to some extent, for the task to be done.

A handful of people have reached out to me with an issue similar to this. Most of them were opening the CSV files with MS Excel to remove the sentiment column. The problem did not occur when the downloaded CSV was imported as a pandas data frame, the sentiment column was dropped, and the final data frame was exported as a CSV file ready to be hydrated.
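If you'd rather sidestep the truncation risk entirely, a minimal sketch of the pandas route that reads the IDs as strings so no numeric conversion ever happens (file names match the example in the instructions above):

import pandas as pd

# dtype=str keeps the 19-digit tweet IDs as text, so no float rounding occurs
dataframe = pd.read_csv("april28_april29.csv", header=None, dtype=str)
dataframe[0].to_csv("ready_april28_april29.txt", index=False, header=False)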

Submitted by Rabindra Lamsal on Wed, 06/03/2020 - 00:41

Thanks Rabindra. All good. As I said, it's not that hard to deal with. I mentioned it so that someone else having a similar issue could benefit. Cheers.

Submitted by Abhay Singh on Wed, 06/03/2020 - 00:51

Roger-that.

Submitted by Rabindra Lamsal on Wed, 06/03/2020 - 11:34

I need 2,000 Twitter messages relevant to COVID-19 for my coursework, where I need to plot the distribution of these tweets on a world map. Can someone help me get the Twitter messages?

Submitted by Gayathri Parame... on Tue, 07/07/2020 - 01:28

[updated on August 7, 2020] Hello Gayathri. You'll have to hydrate the tweet IDs provided in this dataset to get your work done. I'd suggest you use twarc for this purpose. I am guessing you'll only need the tweet text and geo-location for your work.

#import libraries
from twarc import Twarc
import sqlite3

#create a database
connection = sqlite3.connect('database.db')
c = connection.cursor()

#creating a table
def table():
    try:
        c.execute("CREATE TABLE IF NOT EXISTS geo_map(tweet TEXT, longitude REAL, latitude REAL)")
        connection.commit()
    except Exception as e:
        print(str(e))

table()

#initializing Twitter API keys
consumer_key = ""
consumer_secret = ""
access_token = ""
access_token_secret = ""

t = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)

#hydrating the tweet IDs; IDs in this dataset are geo-tagged,
#so "coordinates" is expected to be non-null
for tweet in t.hydrate(open('ready_july5_july6.csv')):
    text = tweet["full_text"]
    longitude, latitude = tweet["coordinates"]["coordinates"]
    c.execute("INSERT INTO geo_map (tweet, longitude, latitude) VALUES (?, ?, ?)", (text, longitude, latitude))
    connection.commit()

Now you can simply make a connection to the above database table to read its contents and plot the tweets using libraries such as Plotly. I hope this helps. Good luck!
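For instance, a minimal sketch with plotly.express, assuming the database.db file created by the snippet above:

import sqlite3
import pandas as pd
import plotly.express as px

# read the stored tweets back from the geo_map table
connection = sqlite3.connect('database.db')
dataframe = pd.read_sql_query("SELECT tweet, longitude, latitude FROM geo_map", connection)

# plot each tweet as a point on a world map
fig = px.scatter_geo(dataframe, lon="longitude", lat="latitude", hover_name="tweet")
fig.show()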

Submitted by Rabindra Lamsal on Fri, 08/07/2020 - 07:54

How can I filter out the geo-tagged tweets from the rest? I have a tweet ID dataset which has tweets from before March 20, and I only want to separate the geo-tagged tweets from the other tweets. And amazing work you have done here with the two datasets having daily files. Thanks.

Submitted by Mohit Singh on Wed, 07/08/2020 - 10:17

Hello Mohit. Filtering geo-tagged tweets from the rest is quite straightforward if you use twarc for hydrating the tweet IDs. You'll have to add a condition to the "coordinates" Twitter object. 

for tweet in t.hydrate(open('/path/to/tweet/file.csv')):
    if tweet["coordinates"]:
        #now you can extract whichever information you want
        longitude, latitude = tweet["coordinates"]["coordinates"]  #for getting geo-coordinates

You can go through the code snippet in my reply to the comment thread just above this one to get a head start with storing the extracted information in a database.

Submitted by Rabindra Lamsal on Wed, 07/08/2020 - 11:00

Thank you for the instant reply. May I ask which database you use in your project running at live.rlamsal.com.np?

Submitted by Mohit Singh on Wed, 07/08/2020 - 10:50

The project uses SQLite.

Submitted by Rabindra Lamsal on Wed, 07/08/2020 - 10:55

Hey, sorry if I'm being dense, but I can't find the JSON files?

Submitted by Lucas Nakach on Thu, 07/09/2020 - 21:38

Hello Lucas. The JSON files were initially present in this dataset and were later removed as they seemed redundant; they included the same content as the CSV files.

Submitted by Rabindra Lamsal on Fri, 07/10/2020 - 01:35

I have downloaded the data. What is the total number of rows in all the datasets taken together?

Submitted by Moonis Shakeel on Sat, 07/18/2020 - 05:15

There are more than 140k tweet IDs in the dataset altogether.

Submitted by Rabindra Lamsal on Sat, 07/18/2020 - 12:20

It appears to be just a few thousand rows in all the datasets taken together.

Submitted by Moonis Shakeel on Sat, 07/18/2020 - 05:24

Yes, there are 140k geo-tagged tweets in this dataset. These are the tweets that have "point" location information. If you are okay with having a boundary location instead, you'll have to hydrate the tweets in this dataset (https://ieee-dataport.org/open-access/coronavirus-covid-19-tweets-dataset) and consider conditioning on the ["place"] Twitter object, as sketched below. The Coronavirus (COVID-19) Tweets Dataset has more than 310 million tweets, and I guess you'll be able to come up with a few million tweets with the boundary condition enabled.
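A minimal sketch of that "place" conditioning, reusing the twarc setup from the earlier snippets (the file name is illustrative):

for tweet in t.hydrate(open('covid_tweet_ids.txt')):
    if tweet["place"]:  #bounding-box location; coarser than a "point"
        country = tweet["place"]["country"]
        bounding_box = tweet["place"]["bounding_box"]["coordinates"]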

Submitted by Rabindra Lamsal on Sat, 07/18/2020 - 12:27

Is the geo-tagging from India alone?

Submitted by Moonis Shakeel on Sat, 07/18/2020 - 07:18

No. This is a global dataset.

Submitted by Rabindra Lamsal on Sat, 07/18/2020 - 12:21

Thanks. I was looking for day by day geo data.

Submitted by Somodo Non on Thu, 07/23/2020 - 02:28

Glad to be of help.

Submitted by Rabindra Lamsal on Thu, 07/23/2020 - 04:50

Thank you a lot for the dataset!

I'm trying to hydrate the tweets for 7/26, but it seems too slow since there are over 3 million tweets. Is there some faster way to hydrate them?

Submitted by Danqing Wang on Tue, 07/28/2020 - 01:33

Hello Danqing. Twitter has rate limits for its APIs. Both the Hydrator app and twarc handle the rate limits and pull the JSON accordingly. If you're looking for a way to expedite the hydration process, I'd recommend involving some other person who has Twitter developer access and asking them to hydrate a portion of the IDs.
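If you do split the work across several people, a minimal sketch for dividing an ID file into chunks (the file name and chunk size are illustrative):

# split a large ID file so several credential holders can hydrate in parallel
with open("ready_july26_july27.txt") as f:
    ids = f.read().splitlines()

chunk_size = 1000000  # IDs per chunk; tune to taste
for i in range(0, len(ids), chunk_size):
    with open("chunk_%d.txt" % (i // chunk_size), "w") as out:
        out.write("\n".join(ids[i:i + chunk_size]))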

Submitted by Rabindra Lamsal on Tue, 07/28/2020 - 06:21

How do I filter the tweets according to a particular country, e.g., India?

 

Submitted by Trupti Kachare on Thu, 08/06/2020 - 14:39

Hello Trupti. Just to give you a head start: if I were you, I would play around with the location-specific Twitter objects at three different levels. First, I would check if the tweet is geo-tagged (if it contains an exact location). Second, if the tweet is not geo-tagged, chances are that it might have a region or a country bounding box defined. Third, if neither criterion is satisfied, I would simply try to extract location information from the user's profile.

Here's an example of using twarc as a Python library for this purpose.

from twarc import Twarc

consumer_key = ""
consumer_secret = ""
access_token = ""
access_token_secret = ""

t = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)

for tweet in t.hydrate(open('tweet_ids.txt')):
    if tweet["coordinates"]:
        loc = tweet["place"]["country"]  #place based on the "point" location
        '''check the value in "loc" if it is from a country of your interest;
        however, do check if tweet["place"] is of NoneType. In that case, get the
        long, lat from tweet["coordinates"]["coordinates"] and convert them to a
        human-readable format.'''
    elif tweet["place"]:
        loc = tweet["place"]["country"]  #bounding box region
        '''check the value in "loc" if it is from a country of your interest'''
    else:
        loc_profile = tweet["user"]["location"]  #location from profile
        '''check the value in "loc_profile" if it is from a country of your interest'''

However, this dataset contains the geo-tagged tweet IDs. I'd suggest you use the Coronavirus (COVID-19) Tweets Dataset, which contains more than 386 million tweet IDs. Applying these geo-specific conditions on that dataset would help you extract more tweets for your work. I hope this helps.
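For the "convert to a human-readable format" step mentioned in the first branch, one option is reverse geocoding, e.g., with the geopy library. This is only a sketch; geopy is not part of this dataset's tooling, and the Nominatim service enforces its own usage limits.

from geopy.geocoders import Nominatim

# identify your application to the Nominatim service (name is a placeholder)
geolocator = Nominatim(user_agent="geocov19-example")

# longitude/latitude come from tweet["coordinates"]["coordinates"] above;
# note that reverse() expects (latitude, longitude) order
location = geolocator.reverse((latitude, longitude), language="en")
if location:
    country = location.raw["address"].get("country")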

Submitted by Rabindra Lamsal on Thu, 08/06/2020 - 22:57

Great work!

Which API do you use, the Twitter Search API or the Twitter Streaming API? Does the data include retweets?

Submitted by antony zzr on Sat, 08/08/2020 - 11:09

Thanks, Antony. It's the Streaming API. Retweets have NULL geo and place objects; therefore, retweets won't make their way into this dataset. However, quote tweets are included, as they can have their own geo and place objects.
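In the hydrated JSON, this distinction can be checked directly. A minimal sketch, reusing the twarc setup from the snippets above (the file name is illustrative):

for tweet in t.hydrate(open('tweet_ids.txt')):
    if "retweeted_status" in tweet:
        continue  #a retweet: its geo and place objects are NULL
    if tweet["coordinates"]:  #originals and quote tweets can carry a point location
        longitude, latitude = tweet["coordinates"]["coordinates"]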

Submitted by Rabindra Lamsal on Sat, 08/08/2020 - 13:39

Hi, what algorithm are you using to calculate the sentiment scores, e.g., VADER? Thank you!

Submitted by Molu Shi on Mon, 08/10/2020 - 09:00

Hello Molu. The TextBlob library has been used to compute the sentiment scores.
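For reference, a minimal sketch of computing such a score with TextBlob (the exact preprocessing applied to the tweets in this dataset is not documented here):

from textblob import TextBlob

# polarity lies in [-1.0, 1.0]; subjectivity is also available
score = TextBlob("Stay safe and wash your hands!").sentiment.polarity
print(score)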

Submitted by Rabindra Lamsal on Tue, 08/11/2020 - 01:09

How do I use it? Can you share the tweets and the sentiment labels, so that I can use them in training my model?

Submitted by Vaibhav Kumar on Wed, 09/16/2020 - 07:24

Please refer to my previous comments.

Submitted by Rabindra Lamsal on Tue, 09/29/2020 - 23:55

Hi, thank you so much for the tremendous work. I appreciate it very much. Two quick questions: first, how do you know whether those tweets are by bots? Have you applied any filtering techniques?

Second, if I would like to replicate your data collection from Twitter myself, could you share your code for collecting geo-tagged tweets?

Submitted by yi yang on Sat, 10/03/2020 - 21:03

Hello Yang. Glad to know that you found this dataset useful.
(i) To curate this dataset, the real-time Twitter stream is filtered by tracking 90+ COVID-19-specific keywords (view the attached keywords.tsv file). All the tweets received from the stream make their way into the primary dataset. The primary dataset can be considered a comprehensive collection for all kinds of analyses (sentiment, geo, fact-check, trend, etc.).

(ii) Please refer to my previous comments: https://ieee-dataport.org/open-access/coronavirus-covid-19-geo-tagged-tw... and https://ieee-dataport.org/open-access/coronavirus-covid-19-geo-tagged-tw...
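As a supplement to (i), here is a minimal sketch of such keyword tracking with twarc's streaming filter; the keyword list is a small illustrative subset of keywords.tsv, not the project's actual code.

from twarc import Twarc

consumer_key = ""
consumer_secret = ""
access_token = ""
access_token_secret = ""

t = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)

# track a small illustrative subset of the 90+ keywords
for tweet in t.filter(track="covid19,coronavirus,pandemic"):
    print(tweet["id_str"])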

Submitted by Rabindra Lamsal on Sun, 10/04/2020 - 00:45

Hello sir,

twarc hydrate is not working in spite of my giving correct Twitter API credentials for configuring twarc. It's creating a blank JSON file. To test twarc, I used search, and it is able to pull out tweets.

Submitted by Jayshree Ravi on Fri, 10/16/2020 - 04:30

Hello Jayshree. Please create an issue at twarc's GitHub. And FYI, I am able to hydrate tweets at my end without any problem.

Submitted by Rabindra Lamsal on Sat, 10/17/2020 - 01:10

Thanks for your response. Only the hydrate command is not working. All other commands like search, filter, users, and dehydrate are able to connect to Twitter and give me the requisite information. The hydrate command does not throw any error; it just produces a blank JSON file. I even tried with only one tweet ID in the txt file. Your guidance would be of great help.

Submitted by Jayshree Ravi on Sun, 10/18/2020 - 01:20

It's working now. Thanks!

Submitted by Jayshree Ravi on Sun, 10/18/2020 - 02:19

That's great.

Submitted by Rabindra Lamsal on Mon, 10/19/2020 - 00:53
