Corona Virus (COVID-19) Tweets Dataset
- Citation Author(s): Rabindra Lamsal
- Submitted by: Rabindra Lamsal
- Last updated: Tue, 04/28/2020 - 02:52
- DOI: 10.21227/781w-ef42
- Data Format: CSV
This dataset includes CSV files containing tweet IDs. The tweets have been collected by the model deployed at https://live.rlamsal.com.np, which monitors the real-time Twitter feed for coronavirus-related tweets using the language filter "en" and the keywords "corona", "coronavirus", "covid", "covid19", and variants of "sarscov2". As per the Twitter Developer Policy, I cannot provide any information other than the tweet IDs (this dataset was completely redesigned on March 20, 2020, to comply with Twitter's data-sharing policies). Note: this dataset should be used solely for non-commercial research purposes (ignore every other license category given on this page).
If you're looking for geolocation-based COVID-19 sentiment data: http://dx.doi.org/10.21227/fpsb-jz61
#####################
Tweets count: 47,985,709
#####################
Schema: first column, tweet ID; second column, sentiment score for that tweet.
(Tweets were collected in GMT+0; the local times listed below are GMT+5:45):
corona_tweets_01.csv: 831,327 tweets (March 20, 2020 01:37 AM - March 20, 2020 10:28 AM)
corona_tweets_02.csv: 870,924 tweets (March 20, 2020 10:31 AM - March 20, 2020 09:43 PM)
corona_tweets_03.csv: 773,729 tweets (March 20, 2020 09:49 PM - March 21, 2020 09:25 AM)
corona_tweets_04.csv: 1,233,340 tweets (March 21, 2020 09:27 AM - March 22, 2020 07:46 AM)
corona_tweets_05.csv: 1,782,157 tweets (March 22, 2020 07:50 AM - March 23, 2020 09:08 AM)
corona_tweets_06.csv: 1,771,295 tweets (March 23, 2020 09:11 AM - March 24, 2020 11:35 AM)
corona_tweets_07.csv: 1,479,651 tweets (March 24, 2020 11:42 AM - March 25, 2020 11:43 AM)
corona_tweets_08.csv: 1,272,592 tweets (March 25, 2020 11:47 AM - March 26, 2020 12:46 PM)
corona_tweets_09.csv: 1,091,429 tweets (March 26, 2020 12:51 PM - March 27, 2020 11:53 AM)
corona_tweets_10.csv: 1,172,013 tweets (March 27, 2020 11:56 AM - March 28, 2020 01:59 PM)
corona_tweets_11.csv: 1,141,210 tweets (March 28, 2020 02:03 PM - March 29, 2020 04:01 PM)
----- March 29, 2020 04:05 PM - March 30, 2020 12:30 PM -- Someone tampered with the server, so tweets for this period are not available. I'll continue adding new tweet IDs; preventive measures have been taken. -----
corona_tweets_12.csv: 793,417 tweets (March 30, 2020 02:01 PM - March 31, 2020 10:16 AM)
corona_tweets_13.csv: 1,029,294 tweets (March 31, 2020 10:20 AM - April 01, 2020 10:59 AM)
corona_tweets_14.csv: 920,076 tweets (April 01, 2020 11:02 AM - April 02, 2020 12:19 PM)
corona_tweets_15.csv: 826,271 tweets (April 02, 2020 12:21 PM - April 03, 2020 02:38 PM)
corona_tweets_16.csv: 612,512 tweets (April 03, 2020 02:40 PM - April 04, 2020 11:54 AM)
corona_tweets_17.csv: 685,560 tweets (April 04, 2020 11:56 AM - April 05, 2020 12:54 PM)
corona_tweets_18.csv: 717,301 tweets (April 05, 2020 12:56 PM - April 06, 2020 10:57 AM)
corona_tweets_19.csv: 722,921 tweets (April 06, 2020 10:58 AM - April 07, 2020 12:28 PM)
corona_tweets_20.csv: 554,012 tweets (April 07, 2020 12:29 PM - April 08, 2020 12:34 PM)
corona_tweets_21.csv: 589,679 tweets (April 08, 2020 12:37 PM - April 09, 2020 12:18 PM)
corona_tweets_22.csv: 517,718 tweets (April 09, 2020 12:20 PM - April 10, 2020 09:20 AM)
corona_tweets_23.csv: 601,199 tweets (April 10, 2020 09:22 AM - April 11, 2020 10:22 AM)
corona_tweets_24.csv: 497,655 tweets (April 11, 2020 10:24 AM - April 12, 2020 10:53 AM)
corona_tweets_25.csv: 477,182 tweets (April 12, 2020 10:57 AM - April 13, 2020 11:43 AM)
corona_tweets_26.csv: 288,277 tweets (April 13, 2020 11:46 AM - April 14, 2020 12:49 AM)
corona_tweets_27.csv: 515,739 tweets (April 14, 2020 11:09 AM - April 15, 2020 12:38 PM)
corona_tweets_28.csv: 427,088 tweets (April 15, 2020 12:40 PM - April 16, 2020 10:03 AM)
corona_tweets_29.csv: 433,368 tweets (April 16, 2020 10:04 AM - April 17, 2020 10:38 AM)
corona_tweets_30.csv: 392,847 tweets (April 17, 2020 10:40 AM - April 18, 2020 10:17 AM)
----- Additional keywords: "coronavirus", "covid", "covid19" and variants of "sarscov2". With these keywords added, the volume reaches beyond a couple of million tweets per day, so the CSV files hereafter will be zipped. Let's save some bandwidth. -----
corona_tweets_31.csv: 2,671,818 tweets (April 18, 2020 10:19 AM - April 19, 2020 09:34 AM)
corona_tweets_32.csv: 2,393,006 tweets (April 19, 2020 09:43 AM - April 20, 2020 10:45 AM)
corona_tweets_33.csv: 2,227,579 tweets (April 20, 2020 10:56 AM - April 21, 2020 10:47 AM)
corona_tweets_34.csv: 2,211,689 tweets (April 21, 2020 10:54 AM - April 22, 2020 10:33 AM)
corona_tweets_35.csv: 2,265,189 tweets (April 22, 2020 10:45 AM - April 23, 2020 10:49 AM)
corona_tweets_36.csv: 2,201,138 tweets (April 23, 2020 11:08 AM - April 24, 2020 10:39 AM)
corona_tweets_37.csv: 2,338,713 tweets (April 24, 2020 10:51 AM - April 25, 2020 11:50 AM)
corona_tweets_38.csv: 1,981,835 tweets (April 25, 2020 12:20 PM - April 26, 2020 09:13 AM)
corona_tweets_39.csv: 2,348,827 tweets (April 26, 2020 09:16 AM - April 27, 2020 10:21 AM)
corona_tweets_40.csv: 2,212,216 tweets (April 27, 2020 10:33 AM - April 28, 2020 10:09 AM)
To make it easy for NLP researchers to access the sentiment of each collected tweet, the sentiment score computed by TextBlob [1] has been appended as the second column. A new list of tweet IDs is added to this dataset every day; bookmark this page for updates.
Each CSV file contains a list of tweet IDs, which you can use to download fresh data from Twitter. Note that the header row is absent from corona_tweets_12.csv onward.
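Tweet IDs are 19-digit integers, beyond the roughly 15 significant digits that a spreadsheet-style float parse preserves, so the ID column should be read as text. A minimal Python sketch of reading the two-column schema (tweet ID, sentiment score); the sample rows below are illustrative, not taken from the actual files:

```python
import csv
import io

# Illustrative rows in the dataset's schema: tweet ID, TextBlob sentiment
# score. These IDs are made up for the example.
sample = "1243420482629591040,0.125\n1255000000000000001,-0.2\n"

tweet_ids, scores = [], []
for tweet_id, score in csv.reader(io.StringIO(sample)):
    tweet_ids.append(tweet_id)      # keep the ID as a string, never a float
    scores.append(float(score))     # sentiment scores are small floats

print(tweet_ids[0])  # all 19 digits intact
```

The same rule applies in R or pandas: force the ID column to a character/string type rather than letting it be inferred as numeric.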
Downloading the tweets
To hydrate the tweet IDs, you can use applications such as DocNow's Hydrator (available for OS X, Windows, and Linux) or QCRI's Tweets Downloader (Java-based). Refer to each tool's documentation for the download process.
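If you prefer scripting the hydration yourself instead of using those tools, the core pattern is to batch the IDs (Twitter's statuses-lookup endpoint accepts up to 100 IDs per request) and pass each batch to your API client. A hypothetical sketch: `fetch` stands in for whatever client call you use (e.g. tweepy's lookup method) and is not a real API here:

```python
def batches(ids, size=100):
    """Yield successive chunks of at most `size` tweet IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def hydrate(ids, fetch):
    """Hydrate IDs batch by batch; `fetch` is your API client call.

    Deleted or hidden tweets simply won't appear in the responses,
    so the result can be shorter than `ids`.
    """
    tweets = []
    for chunk in batches(ids):
        tweets.extend(fetch(chunk))
    return tweets
```

With a real client, `fetch` would wrap the 100-ID lookup call and handle authentication and rate limiting; consult the library's documentation for the exact method name and signature.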
BibTeX:
@data{10.21227/781w-ef42,
  doi = {10.21227/781w-ef42},
  url = {http://dx.doi.org/10.21227/781w-ef42},
  author = {Rabindra Lamsal},
  publisher = {IEEE Dataport},
  title = {Corona Virus (COVID-19) Tweets Dataset},
  year = {2020}
}
RIS:
TY - DATA
T1 - Corona Virus (COVID-19) Tweets Dataset
AU - Rabindra Lamsal
PY - 2020
PB - IEEE Dataport
UR - http://dx.doi.org/10.21227/781w-ef42
ER -
Comments
I am getting this error
DatabaseError: database disk image is malformed
Can you tell me the name of the file you're experiencing this error with? I would recommend first opening the downloaded file in an SQLite DB viewer to check whether it is corrupted.
My personal suggestion: open the databases that produce the "disk image is malformed" error in any DB viewer, then re-save them on your machine or export them to SQL or any tabular file format, as you prefer.
Hi! Could you mention which filters you are using to get the tweets? Thanks
keyword: corona, language: en
A significant number of tweets used the word 'corona' without the word 'virus', so I had to track tweets using the most generic keyword: just 'corona'. Therefore, a few tweets relating to 'corona beer' might also be present in the databases.
Hi! I cannot access the LSTM model.
Try refreshing; the server may have been busy while you were trying to access the site. I can hardly believe that more than 338,500 requests were made to the model within the last 24 hours, a load my model cannot handle. Sorry for the inconvenience!
Please fix these two datasets:
1. corona_tweets_2M.db.zip 2. corona_tweets_2M_2.zip
They show this error: DatabaseError: database disk image is malformed
I downloaded the very same compressed files from this page and loaded both databases in an SQLite DB viewer; they work just fine. See the screenshot here: https://i.ibb.co/SyQ7ff1/Screen-Shot-2020-03-19-at-8-21-46-PM.png
I recommend opening the databases that produce the "disk image is malformed" error in any DB viewer, then re-saving them on your machine or exporting them to SQL or any tabular file format, as you prefer.
Hi, thanks for providing these datasets to the public. I have one question: do all these files share the same structure? I wish they had the other fields Twitter provides with tweets, so we could do our research directly.
I wonder if the other files all have only three columns: unix, text, and sentiment.
Hello there! Yes, all the files have the same structure (unix, text, sentiment score). However, starting March 20, the collected tweets also have one additional column: the tweet ID.
This is because, initially, the purpose of the deployed web app was not just to collect tweets; it was more of an optimization project. However, when the corona outbreak started in China, I decided to release the collected tweets rather than just keep them to myself.
Hi Rabindra, have the SQLite DBs been replaced with CSVs containing only time and sentiment score? Thanks
Hello Bevan. No, the first column in the CSV files is the tweet ID; you'll have to automate the extraction of tweets using the list of tweet IDs. Because of Twitter policy, I had to remove all information except the tweet ID and sentiment score.
Thanks Rabindra for the reply - take care Bevan
Hi, can you please upload the tweet IDs and sentiment scores of the old files from February and early March?
Thank you
Hello Rabia! Unfortunately, I had to take down all the tweets collected between Feb 1, 2020, and Mar 19, 2020, because the old DB files didn't include tweet IDs. This was because, initially, the purpose of the deployed web app was not just to collect tweets; it was more of an optimization project. However, when the corona outbreak started in China, I decided to release the collected tweets rather than just keep them to myself. Therefore, because of Twitter's data-sharing policies, I am not authorized to share the old files. Sorry for the inconvenience.
Thank you for your response. I completely understand this.
Hi, I'm trying to view particular tweets using the tweet IDs you provided, with the piece of Python code you provided above, after adding my credentials (CONSUMER_KEY, CONSUMER_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET). However, it always gives me the following error message:
tweepy.error.TweepError: [{'code': 144, 'message': 'No status found with that ID.'}]
Have you hashed the tweet IDs you uploaded? Any advice is appreciated.
Best regards,
Maybe the particular tweet which you're trying to view has been either removed or hidden by the user.
Thanks for replying. Actually, I don't think those tweets have been removed or hidden by the users, because I tried hundreds of different tweet IDs in a for loop and all of them gave me the same error message, while some tweet IDs I got from another source worked just fine.
Here are some of the tweet IDs I used, from file number 10 for example:
1243420522592910000
1243420476824640000
1243420477235660000
1243420477646720000
1243420477894190000
1243420478238150000
1243420478535890000
1243420478829510000
1243420478951180000
1243420479706150000
1243420479844530000
1243420479982990000
1243420479924250000
1243420478837900000
1243420480205280000
1243420481744560000
1243420482075930000
1243420482201770000
1243420482222730000
1243420482084270000
1243420482814100000
1243420482935760000
1243420482629590000
Thanks,
I double-checked corona_tweets_10.csv, but I could not find any of these IDs in the file. However, I can see one pattern in the tweet IDs you've listed above: they all end in a run of zeros. Use Sublime Text or a simple text editor to open the CSV files. It looks like the application you're using to open these files is chopping off some trailing digits and replacing them with zeros.
For example, the last ID you've listed, 1243420482629590000, should have been 1243420482629591040. Note that the last four digits are zeros at your end; the same is true of all the other IDs you've mentioned above.
Yes, that's right. I read the CSV files with R, and that fixed the numbers.
Also, if you have the tweet IDs for March 13 to March 19, it would be great if you could upload them here.
Thanks;
The model has been collecting corona-related tweets since Jan 27, 2020. However, it was designed as part of an optimization project and was therefore made to extract only the tweets, not the tweet IDs. Because of Twitter's data-sharing policy, I am not allowed to share those. I therefore started extracting and uploading tweet IDs only from March 20, 2020.
Thank you,
I'm having the exact same issue, i.e., all IDs end with four zeros where the zeros should in fact be other digits. I was just opening it as a CSV file.
Could you please let me know how to fix it? Thank you very much!
Are you trying to write a script to hydrate the tweet IDs, or something else? Please see the instructions given in the dataset description field.
Thank you for the reply! I've tried using QCRI's Tweets Downloader to hydrate the tweet IDs, but, as with the tweepy API, the first step is to get a list of correct tweet IDs, which I don't have because of the zeros at the end of the tweet_id column in the original dataset.
I saw in the previous discussion that you mentioned, "For example, the last ID you've listed 1243420482629590000 should have been 1243420482629591040". Could you please let me know how you got the correct tweet ID ending in 1040? Many thanks!
Can you do one thing? Download a CSV file, open it with Notepad or Sublime Text, and let me know if the last four digits are represented properly.
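The trailing zeros described in this thread are what happens when a tool parses the ID column as a double-precision number and then displays roughly 15 significant digits. A small demonstration of the effect, using the ID quoted earlier in the thread:

```python
tweet_id = "1243420482629591040"  # full 19-digit ID quoted above

# Spreadsheet-style handling: parse as a float and render ~15 significant
# digits. The trailing digits are lost and zero-filled on display.
rounded = f"{float(tweet_id):.15g}"
print(rounded)  # only 15 significant digits survive

# String handling round-trips the ID exactly.
assert str(int(tweet_id)) == tweet_id
```

This is why opening the CSVs in a plain text editor shows the correct IDs while spreadsheet applications show IDs ending in zeros: the file itself is fine, and reading the column as text avoids the corruption.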
Hi, I'm trying to download all the data from Twitter using the IDs, but the Hydrator app always stops downloading. Does that mean the downloads have reached the rate limit? Thanks
Can you please elaborate? Also, I would recommend writing to the app's author about the issue.
Congratulations for this work!
Thank you, Thiago.
Can someone share a code snippet to get the tweet text from a tweet ID?
Use Hydrator (https://github.com/DocNow/hydrator) or QCRI's Tweet Downloader tool (https://crisisnlp.qcri.org/data/tools/TweetsRetrievalTool-v2.0.zip) for downloading the tweets.