Coronavirus (COVID-19) Geo-tagged Tweets Dataset


Abstract 

This dataset contains the IDs and sentiment scores of geo-tagged tweets related to the COVID-19 pandemic. The tweets are captured by an ongoing project deployed at https://live.rlamsal.com.np. The model monitors the real-time Twitter feed for coronavirus-related tweets using 90+ different keywords and hashtags that are commonly used when referring to the pandemic. To comply with Twitter's content redistribution policy, only the tweet IDs are shared. You can reconstruct the dataset by hydrating these IDs. The tweet IDs in this dataset belong to tweets that were posted with an exact location.

The paper associated with this dataset is available here: Design and analysis of a large-scale COVID-19 tweets dataset

-------------------------------------

Related datasets: 

(a) Coronavirus (COVID-19) Tweets Sentiment Trend (Global)

(b) Tweets Originating from India During COVID-19 Lockdowns

-------------------------------------

Below is a quick overview of this dataset.

— Dataset name: GeoCOV19Tweets Dataset

— Number of tweets : 328,915 tweets

— Coverage : Global

— Language : English (EN)

— Dataset usage terms : By using this dataset, you agree to (i) use the content of this dataset and the data generated from the content of this dataset for non-commercial research only, (ii) remain in compliance with Twitter's Developer Policy and (iii) cite the following paper:

Lamsal, R. Design and analysis of a large-scale COVID-19 tweets dataset. Applied Intelligence (2020). https://doi.org/10.1007/s10489-020-02029-z

— Primary dataset : Coronavirus (COVID-19) Tweets Dataset (COV19Tweets Dataset)

— Dataset updates : Every day

— Active keywords and hashtags: keywords.tsv

Please visit this page (primary dataset) for details regarding the collection date and time (and other notes) of each CSV file present in this dataset.

Instructions: 

Each CSV file contains a list of tweet IDs. You can use these tweet IDs to download fresh data from Twitter (hydrating the tweet IDs). To make it easy for NLP researchers to access the sentiment analysis of each collected tweet, the sentiment score computed by TextBlob has been appended as the second column. To hydrate the tweet IDs, you can use applications such as Hydrator (available for OS X, Windows and Linux) or twarc (a Python library).

Getting the CSV files of this dataset ready for hydrating the tweet IDs:

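import pandas as pd

# Load the original file; it has no header row (column 0 = tweet ID, column 1 = sentiment score)
dataframe = pd.read_csv("april28_april29.csv", header=None)

# Keep only the tweet ID column
dataframe = dataframe[0]

# Write one tweet ID per line, without an index or a header
dataframe.to_csv("ready_april28_april29.csv", index=False, header=False)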

The above example code takes in the original CSV file (i.e., april28_april29.csv) from this dataset and exports just the tweet ID column to a new CSV file (i.e., ready_april28_april29.csv). The newly created CSV file can now be consumed by the Hydrator application for hydrating the tweet IDs. To export the tweet ID column into a TXT file, just replace ".csv" with ".txt" in the to_csv function (last line) of the above example code.

If you are not comfortable with Python and pandas, you can upload these CSV files to your Google Drive and use Google Sheets to delete the second column. Once you have deleted it, download the edited CSV files: File > Download > Comma-separated values (.csv, current sheet). These downloaded CSV files are now ready to be used with the Hydrator app for hydrating the tweet IDs.

Comments

Hello sir,

twarc hydrate is not working even though I gave the correct Twitter API credentials when configuring twarc. It's creating a blank JSON file. To test twarc, I used search, and it is able to pull tweets.

Submitted by Jayshree Ravi on Fri, 10/16/2020 - 04:30

Hello Jayshree. Please create an issue at twarc's github. And FYI, I am able to hydrate tweets at my end without any problem.

Submitted by Rabindra Lamsal on Sat, 10/17/2020 - 01:10

Thanks for your response. Only the hydrate command is not working. All other commands, such as search, filter, users and dehydrate, are able to connect to Twitter and give me the requisite information. The hydrate command does not throw any error. It just produces a blank JSON file. I even tried with only one tweet ID in the TXT file. Your guidance would be of great help.

Submitted by Jayshree Ravi on Sun, 10/18/2020 - 01:20

It's working now. Thanks.

Submitted by Jayshree Ravi on Sun, 10/18/2020 - 02:19

That's great.

Submitted by Rabindra Lamsal on Mon, 10/19/2020 - 00:53

How do you get permission to access the S3 bucket? I'm getting "access denied" errors when I try to access through the aws app or the web. Thanks!

Submitted by Adam Dalton on Thu, 11/05/2020 - 10:55

For anyone looking to use the aws cli, here's what I did

1. Click on the "Access on AWS" link
2. Click "View AWS Security Credentials"
3. in "~/.aws/credentials" create an ieee profile
[ieee]
aws_access_key_id = ********
aws_secret_access_key = ********
4. Copy the files listed in "Access on AWS" into a file like covid19-geotagged.txt
5. Run `while read -r line;do aws s3 --profile=ieee cp "$line" .;done < covid19-geotagged.txt`

This should work on most unix machines. Windows will probably be slightly different.
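For machines without a Unix shell, the loop in step 5 could be sketched in Python along the following lines. This is only a sketch under assumptions: it presumes the AWS CLI is installed and that the "ieee" profile from step 3 exists. The function names (build_copy_command, parse_uri_list, download_all) are made up for illustration, and the script has not been verified on Windows.

```python
import subprocess
from pathlib import Path


def build_copy_command(s3_uri, profile="ieee", dest="."):
    """Return the argument list for one `aws s3 cp` call, mirroring step 5."""
    return ["aws", "s3", f"--profile={profile}", "cp", s3_uri, dest]


def parse_uri_list(text):
    """Return the non-blank lines of the file list, stripped of whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def download_all(list_file, profile="ieee", dest="."):
    """Copy every listed object; raises CalledProcessError on the first failure."""
    text = Path(list_file).read_text(encoding="utf-8")
    for uri in parse_uri_list(text):
        subprocess.run(build_copy_command(uri, profile, dest), check=True)


# Example call (needs the AWS CLI and the credentials file from step 3):
# download_all("covid19-geotagged.txt")
```

The command is passed as a list rather than a single string, so file names containing spaces do not need extra quoting.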

Submitted by Adam Dalton on Thu, 11/05/2020 - 11:12

Great! Thanks for the follow-up.

Submitted by Rabindra Lamsal on Fri, 11/06/2020 - 00:04

Please make sure you have the exact AWS access ID and AWS Secret Access Key copied from your IEEE-DataPort profile.

things to note:
(a) protocol: not always required (Amazon S3)
(b) Address: not always required (s3.amazonaws.com)
(c) bucket: ieee-dataport
(d) Access Key id: enter your AWS access ID
(e) Secret: enter your AWS Secret Access Key

I hope this helps.

Submitted by Rabindra Lamsal on Sun, 11/08/2020 - 23:43

Sir, have tweets in local languages typed in English script been filtered out, for example a Hindi message written in English letters?

Submitted by GONGATI REDDY on Fri, 11/13/2020 - 05:08

Hello Gongati. Twitter adds a language identifier based on the machine-detected language of the tweet body. The tweets in this dataset are those which had the "en" language identifier in their metadata.
Tweets composed in, e.g., romanized Hindi cannot be assumed to be in English even though they use the English alphabet. I believe those kinds of tweets fall under the undetermined ('und') language category.
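After hydrating, the language identifier can be inspected directly. The sketch below assumes the hydrated tweets are available as Python dictionaries (for example, parsed from twarc's JSON output) and that each carries the "lang" field described above. The function names are illustrative only.

```python
from collections import Counter


def language_counts(tweets):
    """Count tweets per language identifier; 'missing' marks tweets without one."""
    return Counter(tweet.get("lang", "missing") for tweet in tweets)


def english_only(tweets):
    """Keep only the tweets whose machine-detected language is 'en'."""
    return [tweet for tweet in tweets if tweet.get("lang") == "en"]


# Tiny made-up sample standing in for hydrated tweets
sample = [{"id": 1, "lang": "en"}, {"id": 2, "lang": "und"}, {"id": 3, "lang": "en"}]
print(language_counts(sample))
print(len(english_only(sample)))
```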

Submitted by Rabindra Lamsal on Sat, 11/14/2020 - 00:02

Thank You.....

Submitted by GONGATI REDDY on Sat, 11/14/2020 - 00:05

Glad to be of help.

Submitted by Rabindra Lamsal on Sat, 11/14/2020 - 23:53

Sir, is it true that we can fetch only 100 tweets at a time? If so, is there a chance that tweets from the previous 100 will be repeated in the next 100?

Submitted by GONGATI REDDY on Sun, 11/15/2020 - 22:58

(i) Yes, 100 tweets can be fetched in a single request (v1.1 streaming API). However, Twitter puts limits on the number of requests that can be made per window period.
(ii) No, tweets do not repeat. The tweets are available via the streaming API as soon as they are tweeted (in near real-time).
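If you send the requests yourself, the 100-per-request figure above translates into a simple batching step. The following is a minimal sketch that takes that limit as given. Tools such as twarc are expected to do this batching for you, so it only matters for hand-written clients. The helper names are hypothetical.

```python
def dedupe(ids):
    """Drop repeated IDs while keeping the original order."""
    return list(dict.fromkeys(ids))


def batches(ids, size=100):
    """Split a list of IDs into consecutive lists of at most `size` items."""
    return [ids[start:start + size] for start in range(0, len(ids), size)]


# 250 made-up IDs produce three requests' worth of batches
example_ids = [str(n) for n in range(250)]
print([len(batch) for batch in batches(dedupe(example_ids))])
```

Because every ID lands in exactly one batch, no tweet is requested twice as long as the input list itself has no duplicates.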

Submitted by Rabindra Lamsal on Tue, 11/17/2020 - 10:33

Dear Rabindra,
thanks for the amazing work in helping make researchers’ work easier and faster.
I need your help, though: can I download the entire dataset without subscribing to IEEE DataPort or AWS? And if yes, how?

Please help me know if anyone else knows about it.
Thanks.

Submitted by Vibhu Kumar on Thu, 12/03/2020 - 07:19

Hello Vibhu. You don't need any kind of subscription to IEEE DataPort to download the dataset. All you need is a normal IEEE.org account. The dataset is open access; therefore, no subscription is required.

You can access the AWS S3 bucket via the command line (this comment may help you out here: https://ieee-dataport.org/open-access/coronavirus-covid-19-geo-tagged-tw...).

Else, you can write to IEEE DataPort via this page (https://ieee-dataport.org/contact) and ask for other ways to download the entire dataset.

Submitted by Rabindra Lamsal on Sat, 12/05/2020 - 00:43

Dear Rabindra,
Thank you for developing this dataset and the Hydrator app.
However, when I download the "ready dataset" from Google Drive, the Hydrator app warns of an "invalid line 1 from the file".
Any solutions on this matter? Thanks!

Submitted by Xuanyi Zhao on Wed, 12/09/2020 - 19:30

Hello Xuanyi. The aim of using Google Spreadsheet is to remove the second column (i.e. sentiment score) and only keep the first column (tweet id). Just make sure that there are only tweet ids (one id per line) in the ready file. Maybe the file which you've made ready for hydration contains extra spaces or any sort of characters (by mistake) other than digits (ids). Please double-check this.
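One way to run the double-check described above is a small script that reports every line that is not a bare numeric ID. This is a sketch only. The function name is made up, and the file name in the usage comment is taken from the instructions above.

```python
def find_invalid_lines(lines):
    """Return (line_number, raw_line) pairs for lines that are not a bare numeric ID."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        # Spaces, header text, quotes or a byte-order mark all fail this check
        if not (line.isascii() and line.isdigit()):
            problems.append((number, raw))
    return problems


# Possible usage with a file prepared for the Hydrator app:
# with open("ready_april28_april29.csv", encoding="utf-8") as handle:
#     for number, raw in find_invalid_lines(handle):
#         print(f"line {number}: {raw!r}")
```

Printing the line with `!r` makes invisible characters, such as trailing spaces or a byte-order mark at the start of line 1, show up in the output.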

Submitted by Rabindra Lamsal on Wed, 12/09/2020 - 22:56

Hi Rabindra,

Could I ask for some advice, please? Is it possible to filter the tweets to isolate those coming from one country, e.g. the United Kingdom? If so, how would I go about doing that?

Thanks.

Submitted by Craig Cowan on Sun, 01/24/2021 - 10:46

Hello Craig. Yes, it is possible to filter the tweets coming from one country. Please refer to my previous comments:

(i) https://ieee-dataport.org/open-access/coronavirus-covid-19-geo-tagged-tw...
(ii) https://ieee-dataport.org/open-access/coronavirus-covid-19-geo-tagged-tw...

I hope this helps.

Submitted by Rabindra Lamsal on Sun, 01/24/2021 - 23:56

Hi Rabindra,

Thanks a lot, that's great information.

Submitted by Craig Cowan on Mon, 01/25/2021 - 07:50

I am, however, struggling to hydrate the tweets using the hydrator you recommended, DocNow's Hydrator.

I noticed that in a previous comment you said not to use Excel's .csv files when hydrating, because the numbers get truncated, and to use Google Sheets instead. How do I go about doing this? When I upload the file to Google Sheets, the numbers are still truncated.

Any help is appreciated, thanks.

Submitted by Craig Cowan on Mon, 01/25/2021 - 08:03

I have not experienced the truncation issue with Google Sheets. If you are still having the issue, I would recommend using Python, or any language you're comfortable with, to drop the second column of the CSV files.
(or you can simply make use of online "column drop" applications; there are multiple of them).
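A possible way to drop the second column without a spreadsheet is Python's built-in csv module. The sketch below reads every ID as text, so the long IDs never pass through a numeric type and cannot be rounded or truncated. The function names are illustrative, and the file names in the usage comment follow the example in the instructions.

```python
import csv


def iter_ids(rows):
    """Yield the first field of each CSV row as text, skipping non-numeric rows."""
    for row in rows:
        if row and row[0].strip().isdigit():
            yield row[0].strip()


def write_id_file(src_path, dst_path):
    """Write one tweet ID per line to dst_path and return how many were written."""
    count = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for tweet_id in iter_ids(csv.reader(src)):
            dst.write(tweet_id + "\n")
            count += 1
    return count


# Possible usage:
# write_id_file("april28_april29.csv", "ready_april28_april29.txt")
```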

Submitted by Rabindra Lamsal on Tue, 01/26/2021 - 06:59

Hello,

first of all, thank you for your effort; I greatly appreciate it.

I'm a little bit confused about the Twitter API; maybe you can help me. When a tweet has a value for "coordinates", does this automatically assign the correct value for "place"? The Twitter documentation (and also your paper) says that "place" does not necessarily mean that the tweet was posted from that particular location.

Thank you

Submitted by Sebastian Dueker on Wed, 02/17/2021 - 11:22

Hello Sebastian. If you go through the Geo objects documentation (https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/obj...), you'll see that you can extract location data in two different levels (I'm excluding the third one i.e. location from a user's profile).
If the "coordinates" object is NOT NULL, the "place" object will have exact location information (point location). However, if the "coordinates" object is NULL, you can still have place information (Twitter place, if available). But keep in mind that location information extracted in the latter case might not represent the tweet's origin location.
I hope this clarifies your doubt.
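The two levels described above can be read from a hydrated tweet roughly as follows. This sketch assumes the v1.1 tweet JSON layout, in which "coordinates" is a GeoJSON point ordered as [longitude, latitude] and "place" carries a "country_code". It applies to JSON output such as twarc's, and the column names in Hydrator's CSV export may differ. The function names are hypothetical.

```python
def exact_point(tweet):
    """Return (latitude, longitude) when the tweet has a point location, else None."""
    coords = tweet.get("coordinates")
    if coords and coords.get("type") == "Point":
        longitude, latitude = coords["coordinates"]  # GeoJSON order: lon, lat
        return latitude, longitude
    return None


def country_code(tweet):
    """Return the two-letter country code of the tweet's place, if any."""
    place = tweet.get("place") or {}
    return place.get("country_code")


def from_country(tweets, code):
    """Keep tweets that have a point location and a place in the given country."""
    return [t for t in tweets if exact_point(t) and country_code(t) == code]
```

For example, `from_country(tweets, "GB")` would keep only point-located tweets whose place is in the United Kingdom.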

Submitted by Rabindra Lamsal on Fri, 02/19/2021 - 01:00

Hello Rabindra,

thanks for clarifying. Still, I have another question:

I used your dataset with the "Hydrator" application. In the resulting dataset, there are no tweets where the "coordinates" column is empty. Does this mean that all tweets in your dataset represent an exact location? Or does Twitter assign coordinates according to a "place" which was selected by a user afterwards?

For my research, I'm looking for tweets from New York and London, so I'll need the exact location of tweets that were actually posted from these places.

Submitted by Sebastian Dueker on Fri, 02/19/2021 - 04:36

Yes. The tweets in this dataset have exact locations. Tweets with "Twitter place" do not make their way to this dataset.

Submitted by Rabindra Lamsal on Fri, 02/19/2021 - 10:49

Hi Rabindra,
First of all, thank you so much for this dataset. I have downloaded the data for a particular day onto my PC, but the total number of tweets for that day is very low, only 1289. I am getting a very low accuracy and F1 score for my machine learning model. Can I get more data covering a longer span, such as one month, so that I can feed the model a larger volume of data?
Please help me.

Submitted by SUBHADIP MAITY on Tue, 02/23/2021 - 08:12

Hello Subhadip. I have emailed IEEE to implement a Combined_Files section to this dataset. I believe the section will be implemented by tomorrow.

And once you download all the CSV files in a zip (once the implementation is finished), you can concatenate all the files using python (or consider only the files you're interested in). Dropping the second column should be easy. Then you can easily hydrate the tweets. I would suggest you use twarc to hydrate the tweets, and while you're hydrating, you can also extract the corresponding sentiment scores via your custom-built sentiment classifier/regressor or third-party libraries.
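The concatenation step could look roughly like the sketch below, which merges the first column of every CSV file in a folder into a single ID-per-line file and skips IDs it has already seen. The function names and the folder and output names in the usage comment are assumptions for illustration. Writing the output as a .txt file keeps it from being picked up as an input on a second run.

```python
import csv
from pathlib import Path


def unique_ids(rows, seen=None):
    """Yield first-column IDs not seen before; `seen` is updated in place."""
    seen = set() if seen is None else seen
    for row in rows:
        if not row:
            continue
        tweet_id = row[0].strip()
        if tweet_id.isdigit() and tweet_id not in seen:
            seen.add(tweet_id)
            yield tweet_id


def combine_csv_files(csv_dir, out_path):
    """Merge every *.csv in csv_dir into one ID-per-line file; return the ID count."""
    files = sorted(Path(csv_dir).glob("*.csv"))
    seen = set()
    with open(out_path, "w", encoding="utf-8") as out:
        for csv_file in files:
            with open(csv_file, newline="", encoding="utf-8") as src:
                for tweet_id in unique_ids(csv.reader(src), seen):
                    out.write(tweet_id + "\n")
    return len(seen)


# Possible usage (folder and output names are placeholders):
# combine_csv_files("combined_files", "all_ids.txt")
```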

I hope this helps.

[update] the Combined_Files section is now available.

Submitted by Rabindra Lamsal on Thu, 02/25/2021 - 11:11

Hi Rabindra, I have got the combined files. Thank you so much.

Submitted by SUBHADIP MAITY on Thu, 02/25/2021 - 21:30

Glad to be of help.

Submitted by Rabindra Lamsal on Fri, 02/26/2021 - 12:41

Hi Rabindra,

I noticed that a lot of the tweets in the dataset are cut off. Do you have any idea what the reason for that is?

Submitted by Sebastian Dueker on Wed, 02/24/2021 - 07:14

Do you mean that the tweets are "not available"?
Yes, you will not be able to hydrate the IDs of tweets that have been deleted or made private. I have discussed this in the paper. However, the number of "not available" tweets is not that significant in comparison to the primary dataset.

Submitted by Rabindra Lamsal on Thu, 02/25/2021 - 00:32

I'm not sure if we're talking about the same thing. I'm able to hydrate these tweets, but when I try to look at the text, I can only see a small part of it, and the rest is replaced by three dots at the end of the tweet.

In your paper, you wrote that about 2.80% of the tweets were either private or deleted. For my research I looked at a lot of the tweets manually. Maybe I just got the wrong tweets, but about 80% of the tweets were affected this way.

Submitted by Sebastian Dueker on Fri, 02/26/2021 - 03:05

Ohh, okay. You are talking about tweets getting truncated. Don't worry, you're getting the correct tweets. There is a Twitter object "truncated" which indicates whether a tweet is truncated. Truncated tweets end in an ellipsis (...).

Dealing with truncated tweets: check the "truncated" object first. If it is "false", just pull the tweet text with data['text']; if it is "true", use data['extended_tweet']['full_text'] to get the full tweet text.

if data['truncated']:
    tweet = data['extended_tweet']['full_text']
else:
    tweet = data['text']

And for a retweeted tweet (if there is truncation), the full text is placed under 'retweeted_status'.
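Putting the truncation check and the retweet case together, a helper might look like the sketch below. It is not a definitive implementation. The top-level "full_text" branch is an assumption about payloads fetched in extended mode, and the function name is made up.

```python
def full_text(tweet):
    """Return the longest available text for a tweet dictionary."""
    # For a retweet, read from the original tweet nested under 'retweeted_status';
    # note that this drops the leading "RT @user:" prefix.
    source = tweet.get("retweeted_status", tweet)
    # Assumption: extended-mode payloads expose 'full_text' at the top level
    if "full_text" in source:
        return source["full_text"]
    if source.get("truncated") and "extended_tweet" in source:
        return source["extended_tweet"]["full_text"]
    return source.get("text", "")
```

The final fallback to "text" means a tweet marked as truncated but delivered without an "extended_tweet" object still returns its shortened text instead of raising an error.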

I hope this helps.

Submitted by Rabindra Lamsal on Fri, 02/26/2021 - 09:22

Thanks for replying. But I think the problem I'm facing lies somewhere else: a lot of these truncated tweets just contain links to other social media posts (especially Instagram) at the end. Can the full text of these kinds of tweets be shown as well? Or is there any other way to resolve this?

Submitted by Sebastian Dueker on Fri, 02/26/2021 - 09:46

You can always check a tweet online using this URL: http://twitter.com/check/status/tweet_id.

Just replace "tweet_id" in the above URL with the numeric ID. You can then check whether the tweet is really truncated or was written that way.

Submitted by Rabindra Lamsal on Fri, 02/26/2021 - 11:19

Yeah, sure. Here are some example IDs:
1245248640421179393
1245289072987443201
1245304684237090817

Submitted by Sebastian Dueker on Fri, 02/26/2021 - 11:21

You can use the above URL pattern to see the full tweet body of any tweet using its ID. I went through the tweet body of the IDs you shared. In the case of these tweets, there is nothing much we can do. It is due to character limitation on Twitter's side. And this case is pretty common when people share their Insta posts on Twitter.

Submitted by Rabindra Lamsal on Fri, 02/26/2021 - 12:40
