Corona Virus (COVID-19) Tweets Dataset

Citation Author(s):
Rabindra
Lamsal
JNU, New Delhi
Submitted by:
Rabindra Lamsal
Last updated:
Tue, 04/07/2020 - 03:27
DOI:
10.21227/781w-ef42

Tweets Counter: 19,838,935

This dataset consists of CSV files containing tweet IDs. The tweets were collected by the LSTM model deployed at sentiment.live. The model monitors the real-time Twitter feed for coronavirus-related tweets using two filters: language “en” and keyword “corona”. Under the Twitter Developer Policy, I cannot share any information other than the tweet IDs (this dataset was completely redesigned on March 20, 2020, to comply with Twitter's data sharing policies). Note: This dataset should be used solely for non-commercial research purposes (ignore every other LICENSE category given on this page).

Schema of the CSV files: First column: tweet ID, Second column: Sentiment score for the particular tweet.
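
A minimal sketch of reading this two-column layout with Python's standard csv module. It assumes the files have no header row (the page does not say either way) and keeps the tweet ID as a string so that no digits are lost. The sample rows and the read_rows name are made up for illustration.

```python
import csv
import io


def read_rows(handle):
    """Yield (tweet_id, sentiment) pairs from a two-column CSV stream."""
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        tweet_id = row[0].strip()    # kept as text: IDs are 19 digits long
        sentiment = float(row[1])    # second column: sentiment score
        yield tweet_id, sentiment


# Made-up sample in the documented layout: tweet ID, sentiment score.
sample = io.StringIO("1240000000000000001,0.25\n1240000000000000002,-0.5\n")
rows = list(read_rows(sample))
print(rows)
```

For a real file, the same function can be fed open("corona_tweets_01.csv", newline="") in place of the in-memory sample.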

File details (tweets collected in GMT+0; the local times listed below are in GMT+5:45):

corona_tweets_01.csv: 831,327 tweets (March 20, 2020 01:37 AM - March 20, 2020 10:28 AM)

corona_tweets_02.csv: 870,924 tweets (March 20, 2020 10:31 AM - March 20, 2020 09:43 PM)

corona_tweets_03.csv: 773,729 tweets (March 20, 2020 09:49 PM - March 21, 2020 09:25 AM)

corona_tweets_04.csv: 1,233,340 tweets (March 21, 2020 09:27 AM - March 22, 2020 07:46 AM)

corona_tweets_05.csv: 1,782,157 tweets (March 22, 2020 07:50 AM - March 23, 2020 09:08 AM)

corona_tweets_06.csv: 1,771,295 tweets (March 23, 2020 09:11 AM - March 24, 2020 11:35 AM)

corona_tweets_07.csv: 1,479,651 tweets (March 24, 2020 11:42 AM - March 25, 2020 11:43 AM)

corona_tweets_08.csv: 1,272,592 tweets (March 25, 2020 11:47 AM - March 26, 2020 12:46 PM)

corona_tweets_09.csv: 1,091,429 tweets (March 26, 2020 12:51 PM - March 27, 2020 11:53 AM)

corona_tweets_10.csv: 1,172,013 tweets (March 27, 2020 11:56 AM - March 28, 2020 01:59 PM)

corona_tweets_11.csv: 1,141,210 tweets (March 28, 2020 02:03 PM - March 29, 2020 04:01 PM)

----- March 29, 2020 04:05 PM - March 30, 2020 12:30 PM -- Someone tampered with the server, so tweets for this period are not available. I will continue adding new tweet IDs, and some preventive measures have been taken. Sorry for the inconvenience. -----

corona_tweets_12.csv: 793,417 tweets (March 30, 2020 02:01 PM - March 31, 2020 10:16 AM)

corona_tweets_13.csv: 1,029,294 tweets (March 31, 2020 10:20 AM - April 01, 2020 10:59 AM)

corona_tweets_14.csv: 920,076 tweets (April 01, 2020 11:02 AM - April 02, 2020 12:19 PM)

corona_tweets_15.csv: 826,271 tweets (April 02, 2020 12:21 PM - April 03, 2020 02:38 PM)

corona_tweets_16.csv: 612,512 tweets (April 03, 2020 02:40 PM - April 04, 2020 11:54 AM)

corona_tweets_17.csv: 685,560 tweets (April 04, 2020 11:56 AM - April 05, 2020 12:54 PM)

corona_tweets_18.csv: 717,301 tweets (April 05, 2020 12:56 PM - April 06, 2020 10:57 AM)

corona_tweets_19.csv: 722,921 tweets (April 06, 2020 10:58 AM - April 07, 2020 12:28 PM)
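
Because the collection windows above are given in local time (GMT+5:45) while the tweets themselves are timestamped in GMT+0, it can help to convert a window boundary before comparing it with a hydrated tweet's creation time. The following is a small sketch using only the standard library; the to_utc helper is a hypothetical name, and the parsing assumes English month names as printed above.

```python
from datetime import datetime, timedelta, timezone

# Offset stated in the file details: local time is GMT+5:45.
LOCAL = timezone(timedelta(hours=5, minutes=45))


def to_utc(stamp):
    """Parse a local timestamp such as 'March 20, 2020 01:37 AM' and return it in UTC."""
    local = datetime.strptime(stamp, "%B %d, %Y %I:%M %p").replace(tzinfo=LOCAL)
    return local.astimezone(timezone.utc)


# Start of corona_tweets_01.csv, taken from the list above.
start = to_utc("March 20, 2020 01:37 AM")
print(start.isoformat())  # 2020-03-19T19:52:00+00:00
```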

To give NLP researchers easy access to a sentiment analysis of each collected tweet, the sentiment score computed by TextBlob [1] has been appended as the second column. New files will be added to this dataset every day. Bookmark this page for further updates.

 [1] https://textblob.readthedocs.io/en/dev/
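
As a rough illustration of how the second column might be used, the sketch below groups scores into positive, negative, and neutral. It assumes the score is TextBlob's polarity value, which ranges from -1.0 to 1.0 (the page does not name the exact attribute), and it uses zero as the cut-off, which is a common convention rather than anything prescribed by the dataset. The label and summarize names and the sample scores are invented for the example.

```python
from collections import Counter


def label(score):
    """Map a sentiment score to a coarse class using zero as the threshold."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def summarize(scores):
    """Count how many scores fall into each class."""
    return Counter(label(s) for s in scores)


# Invented scores standing in for the second CSV column.
counts = summarize([0.5, -0.2, 0.0, 0.8])
print(counts)
```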

Instructions: 

Each CSV file contains a list of Tweet IDs. You can use these lists to download fresh data from Twitter.

######################################################

Here's a quick example of using Tweepy to view a particular tweet using its tweet ID.

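import tweepy

# Replace the four credentials with those of your own Twitter developer app.
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
api = tweepy.API(auth)

# tweet_id is an ID read from the first column of one of the CSV files.
tweet = api.get_status(tweet_id)
print(tweet.text)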

######################################################

Downloading the tweets

To hydrate the tweet IDs, you can use applications such as DocNow's Hydrator (available for OS X, Windows and Linux) or QCRI's Tweets Downloader (Java-based).
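
Hydration tools generally take a plain list of tweet IDs rather than the two-column CSV, so the sentiment column has to be dropped first. The sketch below shows one way to do that, assuming the tool accepts a text file with one ID per line (check the documentation of the tool you use). It also splits the IDs into groups of 100, on the assumption that a bulk lookup request accepts at most 100 IDs. All function names and the sample rows are illustrative.

```python
import csv
import io


def extract_ids(csv_handle):
    """Return the first column (tweet ID) of every non-empty row, as strings."""
    return [row[0].strip() for row in csv.reader(csv_handle) if row and row[0].strip()]


def write_id_file(ids, out_handle):
    """Write one tweet ID per line."""
    for tweet_id in ids:
        out_handle.write(tweet_id + "\n")


def batches(ids, size=100):
    """Split the ID list into consecutive groups of at most `size` IDs."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


# Made-up rows in the documented layout: tweet ID, sentiment score.
sample = io.StringIO(
    "1240000000000000001,0.25\n1240000000000000002,-0.5\n1240000000000000003,0.0\n"
)
ids = extract_ids(sample)

out = io.StringIO()  # stands in for a file opened with open("ids.txt", "w")
write_id_file(ids, out)
print(out.getvalue())
print(batches(ids, size=2))
```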

Comments

I am getting this error

 

DatabaseError: database disk image is malformed

Submitted by Junaid khan on Mon, 03/16/2020 - 15:23

Can you tell me the name of the file that gives you this error? I recommend first opening the downloaded file in an SQLite DB viewer to check whether it is corrupted.

My personal suggestion: open the databases that generate the malformed-image error in any DB viewer, then re-save them on your machine or export them to SQL or to any tabular file format you prefer.

Submitted by Rabindra Lamsal on Thu, 03/19/2020 - 10:50

Hi! Could you mention which filters you are using to get the tweets? Thanks

Submitted by Victor Tavares on Tue, 03/17/2020 - 00:21

keyword: corona, language: en

A significant number of tweets use the word 'corona' without the word 'virus', so I had to track tweets using the most generic word: just 'corona'. As a result, a few tweets about 'corona beer' might also be present in the databases.

Submitted by Rabindra Lamsal on Tue, 03/17/2020 - 00:45

Hi! I cannot access the LSTM model.

Submitted by islam sadat on Wed, 03/18/2020 - 10:41

Try refreshing. Maybe the server was busy while you were trying to access the site. I can hardly believe it, but more than 338,500 requests have been made to the model within the last 24 hours, and that volume of requests is more than my model can handle. Sorry for the inconvenience!

Submitted by Rabindra Lamsal on Wed, 03/18/2020 - 11:09

Please fix these two datasets:

1. corona_tweets_2M.db.zip
2. corona_tweets_2M_2.zip

Both show this error: DatabaseError: database disk image is malformed

Submitted by imran khan on Thu, 03/19/2020 - 08:19

I downloaded the very same compressed files from this page and loaded both the databases on an SQLite DB viewer. The databases work just fine. See the screenshot here: https://i.ibb.co/SyQ7ff1/Screen-Shot-2020-03-19-at-8-21-46-PM.png

I recommend opening the databases that generate the malformed-image error in any DB viewer, then re-saving them on your machine or exporting them to SQL or to any tabular file format you prefer.

Submitted by Rabindra Lamsal on Thu, 03/19/2020 - 10:49

Hi, thanks for providing these datasets to the public. I have one question: do all these files have the same structure? I wish they had the other fields Twitter provides with tweets, so that we could do our research directly.

I wonder if the other files all have three columns only, unix, text and sentiment.

Submitted by ali ALdulaimi on Thu, 03/19/2020 - 11:54

Hello there! Yes, all the files have the same structure (unix, text, sentiment score). However, starting March 20 the collected tweets will also have one additional column, viz. tweet ID.

This is because, initially, the purpose of the deployed web app was not just to collect the tweets; it was more like an optimization project. However, when the corona outbreak started in China, I decided to release the collected tweets rather than just keeping them with me.

Submitted by Rabindra Lamsal on Thu, 03/19/2020 - 14:05

Hi Rabindra,

Have the SQLite DBs been replaced with CSV files containing only the time and sentiment score? Thanks

Submitted by Bevan Ward on Sat, 03/21/2020 - 23:50

Hello Bevan. No, the first column in the CSV files is the tweet ID. You'll have to automate the extraction of tweets using the list of tweet IDs. Because of Twitter's policy, I had to remove every field except the tweet ID and sentiment score.

Submitted by Rabindra Lamsal on Sun, 03/22/2020 - 02:25

Thanks Rabindra for the reply - take care Bevan

Submitted by bevan ward on Sun, 03/29/2020 - 18:24

Hi, can you please upload the tweet IDs and sentiment scores from the old files covering February and early March?

 

Thank you

Submitted by Rabia batool on Tue, 03/24/2020 - 05:17

Hello Rabia! Unfortunately, I had to take down all the tweets collected between Feb 1, 2020, and Mar 19, 2020, because the old DB files did not include tweet IDs. Initially, the purpose of the deployed web app was not just to collect tweets; it was more of an optimization project. However, when the corona outbreak started in China, I decided to release the collected tweets rather than just keeping them to myself. Because of Twitter's data sharing policies, I am not authorized to share the old files. Sorry for the inconvenience.

Submitted by Rabindra Lamsal on Tue, 03/24/2020 - 10:42

Thank you for your response. I completely understand this. 

 

Submitted by Rabia batool on Wed, 03/25/2020 - 03:50

Hi, I'm trying to view a particular tweet using the tweet IDs you provided and the piece of Python code you gave above, after adding my credentials (CONSUMER_KEY, CONSUMER_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET). However, it always gives me the following error message:

 

tweepy.error.TweepError: [{'code': 144, 'message': 'No status found with that ID.'}]

 

Have you hashed those tweet ids that you uploaded? Any advice is appreciated. 

 

Best regards, 

 

Submitted by Basheer Qolomany on Mon, 03/30/2020 - 18:26

Maybe the particular tweet which you're trying to view has been either removed or hidden by the user.

Submitted by Rabindra Lamsal on Mon, 03/30/2020 - 19:56

Thanks for replying. Actually, I don't think those tweets have been removed or hidden by the users, because I tried hundreds of different tweet IDs in a for loop and all of them gave me the same error message, while some tweet IDs I got from another source worked just fine.

Here are some of the tweet IDs that I used, from file number 10 for example:

 

1243420522592910000

1243420476824640000

1243420477235660000

1243420477646720000

1243420477894190000

1243420478238150000

1243420478535890000

1243420478829510000

1243420478951180000

1243420479706150000

1243420479844530000

1243420479982990000

1243420479924250000

1243420478837900000

1243420480205280000

1243420481744560000

1243420482075930000

1243420482201770000

1243420482222730000

1243420482084270000

1243420482814100000

1243420482935760000

1243420482629590000

 

 

Thanks, 

Submitted by Basheer Qolomany on Mon, 03/30/2020 - 20:42

I double-checked corona_tweets_10.csv, but I could not find any of these IDs in the file. However, I can see one pattern in the tweet IDs you've listed above: they all end with a run of zeros. Use Sublime Text or a simple text editor to open the CSV files. It looks like the application you're using to open these files is chopping off the trailing digits and replacing them with zeros.

For example, the last ID you've listed 1243420482629590000 should have been 1243420482629591040. See that the last 4 digits are zeroes at your end. Same is the case with all other IDs you've mentioned above.

Submitted by Rabindra Lamsal on Tue, 03/31/2020 - 02:13
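
The digit loss described in the reply above can be reproduced with a few lines of Python. This sketch assumes the offending application keeps only the first 15 significant digits and zeroes the rest, which matches the IDs quoted in the thread but is not confirmed by it. The helper names are hypothetical, and the trailing-zeros check is only a heuristic.

```python
def keep_15_digits(tweet_id):
    """Zero out every digit after the 15th, mimicking the damage seen in the thread."""
    extra = len(str(tweet_id)) - 15
    if extra <= 0:
        return tweet_id
    return (tweet_id // 10**extra) * 10**extra


def looks_truncated(tweet_id_text):
    """Heuristic: a long ID that ends in four or more zeros is suspicious."""
    return len(tweet_id_text) > 15 and tweet_id_text.endswith("0000")


original = 1243420482629591040  # the correct ID quoted in the reply above
damaged = keep_15_digits(original)
print(damaged)  # 1243420482629590000
print(looks_truncated(str(damaged)), looks_truncated(str(original)))  # True False
```

Reading the ID column as text, rather than as a number, avoids the problem regardless of the tool.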

Yes, that's right. When I read the CSV files with R, the numbers came out correctly.

Also, if you have the tweet IDs for March 13 to March 19, it would be great if you could upload them here.

 

Thanks; 

Submitted by Basheer Qolomany on Tue, 03/31/2020 - 17:35

The model has been collecting corona-related tweets since Jan 27, 2020. However, it was designed as part of an optimization project and was therefore built to extract only the tweets, not the tweet IDs. Because of Twitter's data sharing policy, I am not allowed to share those tweets. I therefore started extracting and uploading tweet IDs only on March 20, 2020.

Submitted by Rabindra Lamsal on Tue, 03/31/2020 - 21:59

Thank you,

Submitted by Basheer Qolomany on Wed, 04/01/2020 - 18:30

Hi 

I am trying to download all the data from Twitter using the user id, but the Hydrator app always stops downloading.

Does that mean the tweet downloads have reached the rate limit?

 

thanks

Submitted by JINGLI SHI on Fri, 04/03/2020 - 00:09

Can you please elaborate? Also, I recommend writing to the app's author about the issue.

Submitted by Rabindra Lamsal on Sun, 04/05/2020 - 22:50

[1] Rabindra Lamsal, "Corona Virus (COVID-19) Tweets Dataset", IEEE Dataport, 2020. [Online]. Available: http://dx.doi.org/10.21227/781w-ef42. Accessed: Apr. 08, 2020.