Coronavirus (COVID-19) Geo-tagged Tweets Dataset


Abstract 

This dataset contains the IDs and sentiment scores of geo-tagged tweets related to the COVID-19 pandemic. The tweets are captured by an ongoing project deployed at https://live.rlamsal.com.np. The model monitors the real-time Twitter feed for coronavirus-related tweets, using 90+ keywords and hashtags that are commonly used while referencing the pandemic. Complying with Twitter's content redistribution policy, only the tweet IDs are shared; you can reconstruct the dataset by hydrating these IDs. The tweet IDs in this dataset belong to tweets that were posted with an exact location attached. Below is a quick overview of this dataset.

— Number of tweets : 182,486

— Coverage : Global

— Language : English (EN)

— Keywords and hashtags (last updated on August 11, 2020) : "corona", "#corona", "coronavirus", "#coronavirus", "covid", "#covid", "covid19", "#covid19", "covid-19", "#covid-19", "sarscov2", "#sarscov2", "sars cov2", "sars cov 2", "covid_19", "#covid_19", "#ncov", "ncov", "#ncov2019", "ncov2019", "2019-ncov", "#2019-ncov", "pandemic", "#pandemic", "#2019ncov", "2019ncov", "quarantine", "#quarantine", "flatten the curve", "flattening the curve", "#flatteningthecurve", "#flattenthecurve", "hand sanitizer", "#handsanitizer", "#lockdown", "lockdown", "social distancing", "#socialdistancing", "work from home", "#workfromhome", "working from home", "#workingfromhome", "ppe", "n95", "#ppe", "#n95", "#covidiots", "covidiots", "herd immunity", "#herdimmunity", "pneumonia", "#pneumonia", "chinese virus", "#chinesevirus", "wuhan virus", "#wuhanvirus", "kung flu", "#kungflu", "wearamask", "#wearamask", "wear a mask", "vaccine", "vaccines", "#vaccine", "#vaccines", "corona vaccine", "corona vaccines", "#coronavaccine", "#coronavaccines", "face shield", "#faceshield", "face shields", "#faceshields", "health worker", "#health worker", "health workers", "#healthworkers", "#stayhomestaysafe", "#coronaupdate", "#frontlineheroes", "#coronawarriors", "#homeschool", "#homeschooling", "#hometasking", "#masks4all", "#wfh", "wash ur hands", "wash your hands", "#washurhands", "#washyourhands", "#stayathome", "#stayhome", "#selfisolating", "self isolating", "bars closed", "restaurants closed"

— Dataset updates : Every day

— Primary dataset : Coronavirus (COVID-19) Tweets Dataset

— Usage policy : As per Twitter's Developer Policy

Dataset Files (the local time mentioned below is GMT+5:45)

march20_march21.csv: March 20, 2020 01:37 AM - March 21, 2020 09:25 AM

march21_march22.csv: March 21, 2020 09:27 AM - March 22, 2020 07:46 AM

march22_march23.csv: March 22, 2020 07:50 AM - March 23, 2020 09:08 AM

march23_march24.csv: March 23, 2020 09:11 AM - March 24, 2020 11:35 AM

march24_march25.csv: March 24, 2020 11:42 AM - March 25, 2020 11:43 AM

march25_march26.csv: March 25, 2020 11:47 AM - March 26, 2020 12:46 PM 

march26_march27.csv: March 26, 2020 12:51 PM - March 27, 2020 11:53 AM 

march27_march28.csv: March 27, 2020 11:56 AM - March 28, 2020 01:59 PM

march28_march29.csv: March 28, 2020 02:03 PM - March 29, 2020 04:01 PM

March 29, 2020 04:02 PM - March 30, 2020 02:00 PM: NOT AVAILABLE

march30_march31.csv: March 30, 2020 02:01 PM - March 31, 2020 10:16 AM

march31_april1.csv: March 31, 2020 10:20 AM - April 01, 2020 10:59 AM

april1_april2.csv: April 01, 2020 11:02 AM - April 02, 2020 12:19 PM

april2_april3.csv: April 02, 2020 12:21 PM - April 03, 2020 02:38 PM

april3_april4.csv: April 03, 2020 02:40 PM - April 04, 2020 11:54 AM

april4_april5.csv: April 04, 2020 11:56 AM - April 05, 2020 12:54 PM

april5_april6.csv: April 05, 2020 12:56 PM - April 06, 2020 10:57 AM

april6_april7.csv: April 06, 2020 10:58 AM - April 07, 2020 12:28 PM

april7_april8.csv: April 07, 2020 12:29 PM - April 08, 2020 12:34 PM

april8_april9.csv: April 08, 2020 12:37 PM - April 09, 2020 12:18 PM

april9_april10.csv: April 09, 2020 12:20 PM - April 10, 2020 09:20 AM

april10_april11.csv: April 10, 2020 09:22 AM - April 11, 2020 10:22 AM

april11_april12.csv: April 11, 2020 10:24 AM - April 12, 2020 10:53 AM

april12_april13.csv: April 12, 2020 10:57 AM - April 13, 2020 11:43 AM

april13_april14.csv: April 13, 2020 11:46 AM - April 14, 2020 12:49 AM

april14_april15.csv: April 14, 2020 11:09 AM - April 15, 2020 12:38 PM

april15_april16.csv: April 15, 2020 12:40 PM - April 16, 2020 10:03 AM

april16_april17.csv: April 16, 2020 10:04 AM - April 17, 2020 10:38 AM

april17_april18.csv: April 17, 2020 10:40 AM - April 18, 2020 10:17 AM

april18_april19.csv: April 18, 2020 10:19 AM - April 19, 2020 09:34 AM

april19_april20.csv: April 19, 2020 09:43 AM - April 20, 2020 10:45 AM

april20_april21.csv: April 20, 2020 10:56 AM - April 21, 2020 10:47 AM

april21_april22.csv: April 21, 2020 10:54 AM - April 22, 2020 10:33 AM

april22_april23.csv: April 22, 2020 10:45 AM - April 23, 2020 10:49 AM

april23_april24.csv: April 23, 2020 11:08 AM - April 24, 2020 10:39 AM

april24_april25.csv: April 24, 2020 10:51 AM - April 25, 2020 11:50 AM

april25_april26.csv: April 25, 2020 12:20 PM - April 26, 2020 09:13 AM

april26_april27.csv: April 26, 2020 09:16 AM - April 27, 2020 10:21 AM

april27_april28.csv: April 27, 2020 10:33 AM - April 28, 2020 10:09 AM

april28_april29.csv: April 28, 2020 10:20 AM - April 29, 2020 08:48 AM

april29_april30.csv: April 29, 2020 09:09 AM - April 30, 2020 10:33 AM

april30_may1.csv: April 30, 2020 10:53 AM - May 01, 2020 10:18 AM

may1_may2.csv: May 01, 2020 10:23 AM - May 02, 2020 09:54 AM

may2_may3.csv: May 02, 2020 10:18 AM - May 03, 2020 09:57 AM 

may3_may4.csv: May 03, 2020 10:09 AM - May 04, 2020 10:17 AM

may4_may5.csv: May 04, 2020 10:32 AM - May 05, 2020 10:17 AM

may5_may6.csv: May 05, 2020 10:38 AM - May 06, 2020 10:26 AM

may6_may7.csv: May 06, 2020 10:35 AM - May 07, 2020 09:33 AM

may7_may8.csv: May 07, 2020 09:55 AM - May 08, 2020 09:35 AM

may8_may9.csv: May 08, 2020 09:39 AM - May 09, 2020 09:49 AM

may9_may10.csv: May 09, 2020 09:55 AM - May 10, 2020 10:11 AM

may10_may11.csv: May 10, 2020 10:23 AM - May 11, 2020 09:57 AM

may11_may12.csv: May 11, 2020 10:08 AM - May 12, 2020 09:52 AM

may12_may13.csv: May 12, 2020 09:59 AM - May 13, 2020 10:14 AM

may13_may14.csv: May 13, 2020 10:24 AM - May 14, 2020 11:21 AM

may14_may15.csv: May 14, 2020 11:38 AM - May 15, 2020 09:58 AM

may15_may16.csv: May 15, 2020 10:13 AM - May 16, 2020 09:43 AM

may16_may17.csv: May 16, 2020 09:58 AM - May 17, 2020 10:34 AM

may17_may18.csv: May 17, 2020 10:36 AM - May 18, 2020 10:07 AM 

may18_may19.csv: May 18, 2020 10:08 AM - May 19, 2020 10:07 AM 

may19_may20.csv: May 19, 2020 10:08 AM - May 20, 2020 10:06 AM

may20_may21.csv: May 20, 2020 10:06 AM - May 21, 2020 10:15 AM

may21_may22.csv: May 21, 2020 10:16 AM - May 22, 2020 10:13 AM

may22_may23.csv: May 22, 2020 10:14 AM - May 23, 2020 10:08 AM

may23_may24.csv: May 23, 2020 10:08 AM - May 24, 2020 10:02 AM

may24_may25.csv: May 24, 2020 10:02 AM - May 25, 2020 10:10 AM 

may25_may26.csv: May 25, 2020 10:11 AM - May 26, 2020 10:22 AM 

may26_may27.csv: May 26, 2020 10:22 AM - May 27, 2020 10:16 AM

may27_may28.csv: May 27, 2020 10:17 AM - May 28, 2020 10:35 AM

may28_may29.csv: May 28, 2020 10:36 AM - May 29, 2020 10:07 AM

may29_may30.csv: May 29, 2020 10:07 AM - May 30, 2020 10:14 AM

may30_may31.csv: May 30, 2020 10:15 AM - May 31, 2020 10:13 AM 

may31_june1.csv: May 31, 2020 10:13 AM - June 01, 2020 10:14 AM

june1_june2.csv: June 01, 2020 10:15 AM - June 02, 2020 10:07 AM

june2_june3.csv: June 02, 2020 10:08 AM - June 03, 2020 10:26 AM

june3_june4.csv: June 03, 2020 10:27 AM - June 04, 2020 10:23 AM

june4_june5.csv: June 04, 2020 10:26 AM - June 05, 2020 10:03 AM

june5_june6.csv: June 05, 2020 10:11 AM - June 06, 2020 10:16 AM

june6_june7.csv: June 06, 2020 10:17 AM - June 07, 2020 10:24 AM

june7_june8.csv: June 07, 2020 10:25 AM - June 08, 2020 10:13 AM

june8_june9.csv: June 08, 2020 10:13 AM - June 09, 2020 10:12 AM

june9_june10.csv: June 09, 2020 10:12 AM - June 10, 2020 10:13 AM

june10_june11.csv: June 10, 2020 10:14 AM - June 11, 2020 10:11 AM

june11_june12.csv: June 11, 2020 10:12 AM - June 12, 2020 10:10 AM

june12_june13.csv: June 12, 2020 10:11 AM - June 13, 2020 10:10 AM

june13_june14.csv: June 13, 2020 10:11 AM - June 14, 2020 10:08 AM

june14_june15.csv: June 14, 2020 10:09 AM - June 15, 2020 10:10 AM

june15_june16.csv: June 15, 2020 10:10 AM - June 16, 2020 10:10 AM

june16_june17.csv: June 16, 2020 10:11 AM - June 17, 2020 10:10 AM 

june17_june18.csv: June 17, 2020 10:10 AM - June 18, 2020 10:09 AM

june18_june19.csv: June 18, 2020 10:10 AM - June 19, 2020 10:10 AM

june19_june20.csv: June 19, 2020 10:10 AM - June 20, 2020 10:10 AM

june20_june21.csv: June 20, 2020 10:10 AM - June 21, 2020 10:10 AM

june21_june22.csv: June 21, 2020 10:10 AM - June 22, 2020 10:10 AM

june22_june23.csv: June 22, 2020 10:10 AM - June 23, 2020 10:09 AM

june23_june24.csv: June 23, 2020 10:10 AM - June 24, 2020 10:09 AM

june24_june25.csv: June 24, 2020 10:10 AM - June 25, 2020 10:09 AM

june25_june26.csv: June 25, 2020 10:10 AM - June 26, 2020 10:09 AM

june26_june27.csv: June 26, 2020 10:09 AM - June 27, 2020 10:10 AM

june27_june28.csv: June 27, 2020 10:11 AM - June 28, 2020 10:10 AM

june28_june29.csv: June 28, 2020 10:10 AM - June 29, 2020 10:10 AM

june29_june30.csv: June 29, 2020 10:10 AM - June 30, 2020 10:10 AM

june30_july1.csv: June 30, 2020 10:10 AM - July 01, 2020 10:10 AM

july1_july2.csv: July 01, 2020 10:11 AM - July 02, 2020 12:28 PM

july2_july3.csv: July 02, 2020 12:29 PM - July 03, 2020 10:10 AM

july3_july4.csv: July 03, 2020 10:10 AM - July 04, 2020 07:00 AM 

july4_july5.csv: July 04, 2020 07:01 AM - July 05, 2020 09:16 AM

july5_july6.csv: July 05, 2020 09:17 AM - July 06, 2020 10:10 AM

july6_july7.csv: July 06, 2020 10:10 AM - July 07, 2020 10:10 AM

july7_july8.csv: July 07, 2020 10:11 AM - July 08, 2020 10:10 AM

july8_july9.csv: July 08, 2020 10:10 AM - July 09, 2020 10:10 AM

july9_july10.csv: July 09, 2020 10:10 AM - July 10, 2020 10:12 AM

july10_july11.csv: July 10, 2020 10:12 AM - July 11, 2020 10:20 AM

july11_july12.csv: July 11, 2020 10:20 AM - July 12, 2020 10:09 AM

july12_july13.csv: July 12, 2020 10:10 AM - July 13, 2020 10:09 AM

july13_july14.csv: July 13, 2020 10:10 AM - July 14, 2020 10:09 AM

july14_july15.csv: July 14, 2020 10:10 AM - July 15, 2020 10:25 AM

july15_july16.csv: July 15, 2020 10:26 AM - July 16, 2020 10:10 AM

july16_july17.csv: July 16, 2020 10:11 AM - July 17, 2020 10:10 AM

july17_july18.csv: July 17, 2020 10:10 AM - July 18, 2020 10:25 AM

july18_july19.csv: July 18, 2020 10:25 AM - July 19, 2020 10:30 AM

july19_july20.csv: July 19, 2020 10:30 AM - July 20, 2020 10:10 AM

july20_july21.csv: July 20, 2020 10:11 AM - July 21, 2020 10:10 AM

july21_july22.csv: July 21, 2020 10:11 AM - July 22, 2020 10:10 AM

july22_july23.csv: July 22, 2020 10:10 AM - July 23, 2020 10:10 AM

july23_july24.csv: July 23, 2020 10:10 AM - July 24, 2020 10:10 AM

july24_july25.csv: July 24, 2020 10:10 AM - July 25, 2020 10:20 AM

july25_july26.csv: July 25, 2020 10:20 AM - July 26, 2020 10:10 AM

july26_july27.csv: July 26, 2020 10:11 AM - July 27, 2020 10:10 AM

july27_july28.csv: July 27, 2020 10:10 AM - July 28, 2020 10:10 AM

july28_july29.csv: July 28, 2020 10:10 AM - July 29, 2020 10:10 AM

july29_july30.csv: July 29, 2020 10:10 AM - July 30, 2020 10:10 AM

july30_july31.csv: July 30, 2020 10:10 AM - July 31, 2020 10:10 AM

july31_august1.csv: July 31, 2020 10:10 AM - August 01, 2020 10:12 AM

august1_august2.csv: August 01, 2020 10:12 AM - August 02, 2020 10:10 AM

august2_august3.csv: August 02, 2020 10:10 AM - August 03, 2020 10:10 AM

august3_august4.csv: August 03, 2020 10:10 AM - August 04, 2020 10:12 AM

august4_august5.csv: August 04, 2020 10:12 AM - August 05, 2020 10:10 AM

august5_august6.csv: August 05, 2020 10:10 AM - August 06, 2020 10:10 AM

august6_august7.csv: August 06, 2020 10:10 AM - August 07, 2020 10:10 AM

august7_august8.csv: August 07, 2020 10:11 AM - August 08, 2020 10:10 AM

august8_august9.csv: August 08, 2020 10:11 AM - August 09, 2020 10:10 AM

august9_august10.csv: August 09, 2020 10:10 AM - August 10, 2020 10:10 AM

august10_august11.csv: August 10, 2020 10:10 AM - August 11, 2020 10:10 AM

august11_august12.csv: August 11, 2020 10:10 AM - August 12, 2020 10:10 AM

august12_august13.csv: August 12, 2020 10:10 AM - August 13, 2020 10:10 AM

august13_august14.csv: August 13, 2020 10:10 AM - August 14, 2020 10:10 AM

Why are only tweet IDs being shared?

Twitter's content redistribution policy restricts the sharing of tweet information other than tweet IDs and/or user IDs. Twitter wants researchers to always pull fresh data, because a user might delete a tweet or make their profile protected. If that tweet has already been pulled and shared in a public domain, the user or community could be exposed to inferences drawn from data that no longer exists publicly or is now private.

Instructions: 

Each CSV file contains a list of tweet IDs. You can use these tweet IDs to download fresh data from Twitter (hydrating the tweet IDs). To make it easy for NLP researchers to access the sentiment of each collected tweet, the sentiment score computed by TextBlob has been appended as the second column. To hydrate the tweet IDs, you can use applications such as Hydrator (available for OS X, Windows, and Linux), twarc (a Python library), or QCRI's Tweets Downloader (Java-based).

Getting the CSV files of this dataset ready for hydrating the tweet IDs:

import pandas as pd

# read the original two-column CSV (tweet ID, sentiment score); the file has no header row
dataframe = pd.read_csv("april28_april29.csv", header=None)

# keep only the first column (the tweet IDs)
dataframe = dataframe[0]

# write the IDs to a new CSV file, one ID per line, without an index or header
dataframe.to_csv("ready_april28_april29.csv", index=False, header=False)

The above example code takes the original CSV file from this dataset (i.e., april28_april29.csv) and exports just the tweet ID column to a new CSV file (i.e., ready_april28_april29.csv). The newly created CSV file can now be consumed by the Hydrator application for hydrating the tweet IDs. To export the tweet ID column into a TXT file instead, just replace ".csv" with ".txt" in the to_csv call (last line) of the above example code.
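If you prefer twarc over the Hydrator app, here is a minimal sketch of hydrating the newly created file with twarc used as a Python library (the API credentials are placeholders you must fill in with your own keys):

from twarc import Twarc

# Twitter API credentials (placeholders)
consumer_key = ""
consumer_secret = ""
access_token = ""
access_token_secret = ""

t = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)

# each line of the prepared file contains a single tweet ID
for tweet in t.hydrate(open('ready_april28_april29.csv')):
    print(tweet["id_str"], tweet["full_text"])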

If you are not comfortable with Python and pandas, you can upload these CSV files to your Google Drive and use Google Sheets to delete the second column. Once finished with the deletion, download the edited CSV files: File > Download > Comma-separated values (.csv, current sheet). These downloaded CSV files are now ready to be used with the Hydrator app for hydrating the tweet IDs.

Comments

Great Work!

 

Submitted by Sadiksha sharma on Sun, 04/26/2020 - 04:14

Thanks, Sadiksha!

Submitted by Rabindra Lamsal on Fri, 05/08/2020 - 02:46

Thank you very much for providing this dataset and your support

Submitted by hanaa hammad on Tue, 05/05/2020 - 09:39

My pleasure, Hanaa!

Submitted by Rabindra Lamsal on Tue, 05/05/2020 - 12:39

I created an IEEE account just to download this dataset. There are numerous tweet datasets currently floating around, but none of them had a list of tweet IDs with pin locations. Thanks for your efforts.

Submitted by Curran White on Fri, 05/08/2020 - 02:45

Thanks, Curran! I am glad that you found the dataset useful.

Submitted by Rabindra Lamsal on Fri, 05/08/2020 - 03:20

Hi, I hydrated the IDs file using twarc (https://github.com/echen102/COVID-19-TweetIDs/pull/2/commits/7d16ff3f29acf15af88c0d27424041b711865be3).

But when I tried to add the condition you used to get the geolocation data, it gives me an invalid syntax error.

It would be nice if you could share the twarc code you used so that I can edit the variable names properly.

You have done great work!

Submitted by WonSeok Kim on Sat, 05/09/2020 - 15:17

Hey Kim. I think you meant using twarc (https://github.com/DocNow/twarc). That was just pseudo-code that I had mentioned in the abstract (I've now replaced it with an excerpt of the real code to avoid confusion).

It does not matter how you are getting your JSON archived. Just make sure to add the following "if" clause in whatever way you're pulling the tweets. The "if" clause below will only be TRUE if the tweet contains an exact pin location.

data = json.loads(data)  # requires "import json"; here "data" is the JSON string of a single tweet

if data["coordinates"]:
    # the "coordinates" object is non-null only for tweets with an exact pin location
    longitude, latitude = data["coordinates"]["coordinates"]

Now you can store the longitude and latitude values as per your convenience. I hope this helps!

Submitted by Rabindra Lamsal on Sun, 05/24/2020 - 12:53

Hey, I want to download the full data, not only the IDs. How can I do so? Please respond.

 

Submitted by charu v on Wed, 05/20/2020 - 13:46

Hello Charu. Twitter's data sharing policy does not allow anyone to share tweet information other than tweet IDs and/or user IDs. The list of IDs should be hydrated to re-create a full, fresh tweet dataset. For this purpose, you can use applications such as DocNow's Hydrator or QCRI's Tweets Downloader.

Submitted by Rabindra Lamsal on Fri, 05/29/2020 - 22:28

Thanks for the data. I am not sure if this is just at my end, but the CSV files have an issue with the tweet ID fields due to the 15-digit limit. The values are different from the ones in the JSON. Maybe export them to .txt files rather than .csv.

Submitted by Abhay Singh on Tue, 06/02/2020 - 21:38

Hello Abhay. Yes, I have heard from a couple of people about the tweet IDs getting distorted on their machines. That is why I am also uploading the JSON for those experiencing this issue.

Can you confirm whether the IDs remain intact when the files are opened in a text editor (Notepad or Sublime)? I think you're opening the CSV files with MS Excel. I've seen multiple posts on Stack Exchange about Excel truncating digits after the 15th.
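If the distortion shows up in pandas or R rather than Excel, a minimal sketch that avoids it (not necessarily the exact code used elsewhere on this page) is to read the ID column as strings so the large integers are never rounded:

import pandas as pd

# read the tweet IDs as strings so they are never converted to (and rounded by) floating-point numbers
dataframe = pd.read_csv("april28_april29.csv", header=None, dtype=str)

# export only the ID column, one ID per line
dataframe[0].to_csv("ready_april28_april29.txt", index=False, header=False)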

Submitted by Rabindra Lamsal on Tue, 06/02/2020 - 22:18

Hello Rabindra,

 

No, it doesn't happen if you open the dataset using some other editor. Reading the data in different systems (R/Python) can also lead to different results if the IDs are not converted properly. Also, if someone is using the Hydrator app and converts the CSV to TXT with just the IDs, it will have errors. Anyway, it's fairly straightforward to convert the JSON to a TXT file containing the IDs, but some users may benefit from plain .txt files.

 

Cheers

Submitted by Abhay Singh on Tue, 06/02/2020 - 22:43

Thanks for getting back.

If you use DocNow's Hydrator app, you can straightaway import the downloaded CSV file for hydrating (after removing the sentiment column). However, QCRI's Tweets Downloader requires a TXT file (with a single tweet ID per line), so you'll have to play around with the CSV file a bit to get the task done.

A handful of people have reached out to me with a similar issue. Most of them were opening the CSV files in MS Excel to remove the sentiment column. The problem did not occur when the downloaded CSV was imported as a pandas data frame, the sentiment column was dropped, and the final data frame was exported as a CSV file ready to be hydrated.

Submitted by Rabindra Lamsal on Wed, 06/03/2020 - 00:41

Thanks Rabindra. All good. As I said, it's not that hard to deal with. I mentioned it so that someone else having a similar issue could benefit. Cheers.

Submitted by Abhay Singh on Wed, 06/03/2020 - 00:51

Roger that.

Submitted by Rabindra Lamsal on Wed, 06/03/2020 - 11:34

I need 2,000 Twitter messages relevant to COVID-19 for my coursework, where I need to plot the distribution of these tweets on a world map. Can someone help me get the Twitter messages?

Submitted by Gayathri Parame... on Tue, 07/07/2020 - 01:28

[updated on August 7, 2020] Hello Gayathri. You'll have to hydrate the tweet IDs provided in this dataset to get your work done. I'd suggest you use twarc for this purpose. I am guessing you'll only need the tweet text and the geo-location for your work.

#import libraries
from twarc import Twarc
import sqlite3

#create a database
connection = sqlite3.connect('database.db')
c = connection.cursor()

#creating a table
def table():
    try:
        c.execute("CREATE TABLE IF NOT EXISTS geo_map(tweet TEXT, longitude REAL, latitude REAL)")
        connection.commit()
    except Exception as e:
        print(str(e))

table()

#Initializing Twitter API keys
consumer_key = ""
consumer_secret = ""
access_token = ""
access_token_secret = ""

t = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)

#hydrating the tweet IDs
for tweet in t.hydrate(open('ready_july5_july6.csv')):
    text = tweet["full_text"]
    longitude, latitude = tweet["coordinates"]["coordinates"]
    c.execute("INSERT INTO geo_map (tweet, longitude, latitude) VALUES (?, ?, ?)", (text, longitude, latitude))
    connection.commit()

Now you can simply connect to the above database table, read its contents, and plot the tweets using libraries such as Plotly. I hope this helps. Good luck!
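For example, a minimal sketch of reading the table back with pandas and plotting the points with Plotly Express (assuming the pandas and plotly packages are installed; the column names match the geo_map table created above):

import sqlite3
import pandas as pd
import plotly.express as px

# read the hydrated tweets back from the SQLite table
connection = sqlite3.connect('database.db')
dataframe = pd.read_sql_query("SELECT tweet, longitude, latitude FROM geo_map", connection)

# plot every tweet as a point on a world map
fig = px.scatter_geo(dataframe, lat="latitude", lon="longitude", hover_name="tweet")
fig.show()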

Submitted by Rabindra Lamsal on Fri, 08/07/2020 - 07:54

How can I filter out the geo-tagged tweets from the rest? I have a tweet ID dataset which has tweets from before March 20, and I only want to separate the geo-tagged tweets from the other tweets. And amazing work you have done here with the two datasets having daily files. Thanks.

Submitted by Mohit Singh on Wed, 07/08/2020 - 10:17

Hello Mohit. Filtering geo-tagged tweets from the rest is quite straightforward if you use twarc for hydrating the tweet IDs. You'll have to add a condition on the "coordinates" Twitter object.

for tweet in t.hydrate(open('/path/to/tweet/file.csv')):
    if tweet["coordinates"]:
        #now you can extract whichever information you want
        longitude, latitude = tweet["coordinates"]["coordinates"] #for getting the geo-coordinates

You can go through the code snippet in the reply to the comment thread just above this one to get a head start with storing the extracted information in a database.

Submitted by Rabindra Lamsal on Wed, 07/08/2020 - 11:00

Thank you for the instant reply. May I ask which database you use in your project running at live.rlamsal.com.np?

Submitted by Mohit Singh on Wed, 07/08/2020 - 10:50

The project uses SQLite.

Submitted by Rabindra Lamsal on Wed, 07/08/2020 - 10:55

Hey, sorry if I'm being dense, but I can't find the JSON files?

Submitted by Lucas Nakach on Thu, 07/09/2020 - 21:38

Hello Lucas. The JSON files were initially present in this dataset but were later removed as they seemed redundant; they contained the same content as the CSV files.

Submitted by Rabindra Lamsal on Fri, 07/10/2020 - 01:35

I have downloaded the data. What is the total number of rows in all the datasets taken together?

Submitted by Moonis Shakeel on Sat, 07/18/2020 - 05:15

There are more than 140k tweet IDs in the dataset altogether.

Submitted by Rabindra Lamsal on Sat, 07/18/2020 - 12:20

It appears to be just a few thousand rows in all the datasets taken together.

Submitted by Moonis Shakeel on Sat, 07/18/2020 - 05:24

Yes, there are 140k geo-tagged tweets in this dataset. These are the tweets that have "point" location information. If you are okay with having a boundary location instead, you'll have to hydrate the tweets in the primary dataset (https://ieee-dataport.org/open-access/coronavirus-covid-19-tweets-dataset) and consider conditioning on the ["place"] Twitter object. The Coronavirus (COVID-19) Tweets Dataset has more than 310 million tweets, and I guess you'll be able to come up with a few million tweets with the boundary condition enabled.
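Here is a minimal sketch of that condition while hydrating with twarc (t is a Twarc instance as in the earlier snippets; the file name is just a placeholder):

for tweet in t.hydrate(open('primary_dataset_ids.txt')):
    if tweet["place"]:
        country = tweet["place"]["country"]            # e.g. "India"
        bounding_box = tweet["place"]["bounding_box"]  # boundary (not exact point) location
        # keep the tweet if this boundary-level location is good enough for your use case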

Submitted by Rabindra Lamsal on Sat, 07/18/2020 - 12:27

Is the geo-tagging from India alone?

Submitted by Moonis Shakeel on Sat, 07/18/2020 - 07:18

No. This is a global dataset.

Submitted by Rabindra Lamsal on Sat, 07/18/2020 - 12:21

Thanks. I was looking for day-by-day geo data.

Submitted by Somodo Non on Thu, 07/23/2020 - 02:28

Glad to be of help.

Submitted by Rabindra Lamsal on Thu, 07/23/2020 - 04:50

Thank you a lot for the dataset!

I'm trying to hydrate the tweets for July 26, but it seems too slow since there are over 3 million tweets. Is there a faster way to hydrate them?

Submitted by Danqing Wang on Tue, 07/28/2020 - 01:33

Hello Danqing. Twitter has rate limits for its APIs. Both the Hydrator app and twarc handle the rate limits and pull the JSON accordingly. If you're looking for a way to expedite the hydration process, I'd recommend involving another person who has access to the Twitter developer APIs and asking them to hydrate a portion of the IDs.
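For example, a small sketch (file names are placeholders) of splitting one day's ID file into equal parts so that each person hydrates only a portion:

# split a large tweet-ID file into n_parts smaller files, one per collaborator
n_parts = 3
with open('ready_july25_july26.csv') as f:
    ids = f.read().splitlines()

chunk_size = len(ids) // n_parts + 1
for i in range(n_parts):
    chunk = ids[i * chunk_size:(i + 1) * chunk_size]
    with open('part_%d.txt' % (i + 1), 'w') as out:
        out.write('\n'.join(chunk))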

Submitted by Rabindra Lamsal on Tue, 07/28/2020 - 06:21

How do I filter the tweets for a particular country, e.g., India?

 

Submitted by Trupti Kachare on Thu, 08/06/2020 - 14:39

Hello Trupti. Just to give you a head start: if I were you, I would play around with the location-specific Twitter objects at three different levels. First, I would check whether the tweet is geo-tagged (i.e., whether it contains an exact location). Second, if the tweet is not geo-tagged, chances are that it has a region or country bounding box defined. Third, if neither criterion is satisfied, I would simply try to extract location information from the user's profile.

Here's an example of using twarc as a Python library for this purpose.

from twarc import Twarc

consumer_key = ""
consumer_secret = ""
access_token = ""
access_token_secret = ""

t = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)

for tweet in t.hydrate(open('tweet_ids.txt')):
    if tweet["coordinates"]:
        loc = tweet["place"]["country"]  #place based on the "point" location
        '''check the value in "loc" to see if it is from a country of your interest;
        however, do check if tweet["place"] is of NoneType. In that case, get the long/lat
        from tweet["coordinates"]["coordinates"] and convert it to a human-readable format.'''
    elif tweet["place"]:
        loc = tweet["place"]["country"]  #bounding box region
        '''check the value in "loc" to see if it is from a country of your interest'''
    else:
        loc_profile = tweet["user"]["location"]  #location from the user's profile
        '''check the value in "loc_profile" to see if it is from a country of your interest'''

However, this dataset contains only the geo-tagged tweet IDs. I'd suggest you use the Coronavirus (COVID-19) Tweets Dataset, which contains more than 386 million tweet IDs. Applying these geo-specific conditions to that dataset would help you extract more tweets for your work. I hope this helps.

Submitted by Rabindra Lamsal on Thu, 08/06/2020 - 22:57

Great work!

Which API do you use - the Twitter Search API or the Twitter Streaming API? Does the data include retweets?

Submitted by antony zzr on Sat, 08/08/2020 - 11:09

Thanks, Antony. It's the Streaming API. Retweets have NULL geo and place objects, so retweets won't make their way into this dataset. However, quote tweets are included, as they can have their own geo and place objects.
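If you want to check this yourself while hydrating, a minimal sketch using the standard v1.1 tweet fields could look like this (t is a Twarc instance as in the earlier snippets; the file name is just a placeholder):

for tweet in t.hydrate(open('ready_august13_august14.csv')):
    is_retweet = "retweeted_status" in tweet        # plain retweets carry null geo/place objects
    is_quote = tweet.get("is_quote_status", False)  # quote tweets can carry their own geo/place objects
    if not is_retweet and tweet["coordinates"]:
        longitude, latitude = tweet["coordinates"]["coordinates"]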

Submitted by Rabindra Lamsal on Sat, 08/08/2020 - 13:39

Hi, what algorithm are you using to calculate the sentiment scores, e.g. VADER? Thank you!

Submitted by Molu Shi on Mon, 08/10/2020 - 09:00

Hello Molu. The TextBlob library has been used to compute the sentiment scores.
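For reference, a minimal sketch of computing a TextBlob polarity score for a piece of tweet text (any preprocessing applied for this dataset is not detailed here; polarity ranges from -1.0 to +1.0):

from textblob import TextBlob

text = "I hope everyone stays safe and healthy during this pandemic."
score = TextBlob(text).sentiment.polarity  # a value between -1.0 and +1.0
print(score)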

Submitted by Rabindra Lamsal on Tue, 08/11/2020 - 01:09
