Coronavirus (COVID-19) Geo-tagged Tweets Dataset


Abstract 

This dataset contains IDs and sentiment scores of geo-tagged tweets related to the COVID-19 pandemic. The tweets are captured by an ongoing project deployed at https://live.rlamsal.com.np. The model monitors the real-time Twitter feed for coronavirus-related tweets using 90+ different keywords and hashtags that are commonly used while referencing the pandemic. Complying with Twitter's content redistribution policy, only the tweet IDs are shared. You can re-construct the dataset by hydrating these IDs. The tweet IDs in this dataset belong to tweets that were tweeted with an exact location attached.

The paper associated with this dataset is available here: Design and analysis of a large-scale COVID-19 tweets dataset

-------------------------------------

Related datasets: 

(a) Coronavirus (COVID-19) Tweets Sentiment Trend (Global)

(b) Tweets Originating from India During COVID-19 Lockdowns

-------------------------------------

Below is a quick overview of this dataset.

— Dataset name: GeoCOV19Tweets Dataset

— Number of tweets: 272,404

— Coverage: Global

— Language: English (EN)

— Dataset usage terms: By using this dataset, you agree to (i) use the content of this dataset and the data generated from the content of this dataset for non-commercial research only, (ii) remain in compliance with Twitter's Developer Policy, and (iii) cite the following paper:

Lamsal, R. Design and analysis of a large-scale COVID-19 tweets dataset. Applied Intelligence (2020). https://doi.org/10.1007/s10489-020-02029-z

— Primary dataset: Coronavirus (COVID-19) Tweets Dataset (COV19Tweets Dataset)

— Dataset updates: Every day

— Active keywords and hashtags: keywords.tsv

Please visit this page (primary dataset) for details regarding the collection date and time (and other notes) of each CSV file present in this dataset.

Instructions: 

Each CSV file contains a list of tweet IDs. You can use these tweet IDs to download fresh data from Twitter (hydrating the tweet IDs). To make it easy for NLP researchers to access the sentiment analysis of each collected tweet, the sentiment score computed by TextBlob has been appended as the second column. To hydrate the tweet IDs, you can use applications such as Hydrator (available for OS X, Windows and Linux) or twarc (Python library).
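To illustrate the two-column layout described above (tweet ID, TextBlob polarity score), here is a minimal standard-library sketch that reads one of the dataset's CSV files and buckets the sentiment scores; the filename and sample values are placeholders, not part of the dataset:

```python
import csv

def summarize_sentiment(path):
    """Read a two-column CSV (tweet ID, TextBlob polarity) and
    count positive, neutral and negative tweets."""
    counts = {"positive": 0, "neutral": 0, "negative": 0}
    with open(path, newline="") as f:
        for tweet_id, score in csv.reader(f):
            polarity = float(score)
            if polarity > 0:
                counts["positive"] += 1
            elif polarity < 0:
                counts["negative"] += 1
            else:
                counts["neutral"] += 1
    return counts

# Example (hypothetical filename from this dataset):
# summarize_sentiment("april28_april29.csv")
```

TextBlob polarity ranges from -1.0 (most negative) to +1.0 (most positive), with 0.0 treated as neutral here.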

Getting the CSV files of this dataset ready for hydrating the tweet IDs:

import pandas as pd

# Load the original two-column CSV (no header row)
dataframe = pd.read_csv("april28_april29.csv", header=None)

# Keep only the first column (the tweet IDs)
dataframe = dataframe[0]

# Export the tweet IDs to a new CSV, ready for hydration
dataframe.to_csv("ready_april28_april29.csv", index=False, header=False)

The above example code takes in the original CSV file (i.e., april28_april29.csv) from this dataset and exports just the tweet ID column to a new CSV file (i.e., ready_april28_april29.csv). The newly created CSV file can now be consumed by the Hydrator application for hydrating the tweet IDs. To export the tweet ID column into a TXT file, just replace ".csv" with ".txt" in the to_csv function (last line) of the above example code.
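If pandas is unavailable, the same conversion can be done with the standard library alone. The following sketch (filenames follow the example above) copies only the tweet ID column into a plain-text file, one ID per line:

```python
import csv

def export_ids(src, dst):
    """Copy the first column (tweet IDs) of a dataset CSV into a
    plain-text file, one ID per line, ready for hydration."""
    with open(src, newline="") as fin, open(dst, "w") as fout:
        for row in csv.reader(fin):
            fout.write(row[0] + "\n")

# Example (hypothetical filenames):
# export_ids("april28_april29.csv", "ready_april28_april29.txt")
```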

If you are not comfortable with Python and pandas, you can upload these CSV files to your Google Drive and use Google Sheets to delete the second column. Once finished with the deletion, download the edited CSV files: File > Download > Comma-separated values (.csv, current sheet). These downloaded CSV files are now ready to be used with the Hydrator app for hydrating the tweet IDs.

Comments

Hello sir,

twarc hydrate is not working in spite of giving correct Twitter API credentials for configuring twarc. It's creating a blank JSON file. To test twarc, I used search; it is able to pull out tweets.

Submitted by Jayshree Ravi on Fri, 10/16/2020 - 04:30

Hello Jayshree. Please create an issue at twarc's GitHub. And FYI, I am able to hydrate tweets at my end without any problem.

Submitted by Rabindra Lamsal on Sat, 10/17/2020 - 01:10

Thanks for your response. Only the hydrate command is not working. All other commands like search, filter, users and dehydrate are able to connect to Twitter and give me the requisite information. The hydrate command does not throw any error; it just produces a blank JSON file. I even tried with only one tweet ID in the txt file. Your guidance would be of great help.

Submitted by Jayshree Ravi on Sun, 10/18/2020 - 01:20

It's working now. Thanks.

Submitted by Jayshree Ravi on Sun, 10/18/2020 - 02:19

That's great.

Submitted by Rabindra Lamsal on Mon, 10/19/2020 - 00:53

How do you get permission to access the S3 bucket? I'm getting "access denied" errors when I try to access through the aws app or the web. Thanks!

Submitted by Adam Dalton on Thu, 11/05/2020 - 10:55

For anyone looking to use the aws cli, here's what I did

1. Click on the "Access on AWS" link
2. Click "View AWS Security Credentials"
3. In "~/.aws/credentials", create an ieee profile:
[ieee]
aws_access_key_id = ********
aws_secret_access_key = ********
4. Copy the files listed in "Access on AWS" into a file like covid19-geotagged.txt
5. Run `while read -r line;do aws s3 --profile=ieee cp "$line" .;done < covid19-geotagged.txt`

This should work on most unix machines. Windows will probably be slightly different.

Submitted by Adam Dalton on Thu, 11/05/2020 - 11:12

Great! Thanks for the follow-up.

Submitted by Rabindra Lamsal on Fri, 11/06/2020 - 00:04

Please make sure you have the exact AWS access ID and AWS Secret Access Key copied from your IEEE-DataPort profile.

Things to note:
(a) protocol: not always required (Amazon S3)
(b) Address: not always required (s3.amazonaws.com)
(c) bucket: ieee-dataport
(d) Access Key id: enter your AWS access ID
(e) Secret: enter your AWS Secret Access Key

I hope this helps.

Submitted by Rabindra Lamsal on Sun, 11/08/2020 - 23:43

Sir, are these tweets filtered to exclude local-language tweets typed in English letters, for example a Hindi message written in English?

Submitted by GONGATI REDDY on Fri, 11/13/2020 - 05:08

Hello Gongati. Twitter adds a language identifier based on the machine-detected language of the tweet body. The tweets in this dataset are those which had the "en" language identifier in their metadata.
Tweets composed in, for example, romanized Hindi cannot be assumed to be in English even though they use the English alphabet. I believe those kinds of tweets fall under the undefined ('und') language category.

Submitted by Rabindra Lamsal on Sat, 11/14/2020 - 00:02

Thank you.

Submitted by GONGATI REDDY on Sat, 11/14/2020 - 00:05

Glad to be of help.

Submitted by Rabindra Lamsal on Sat, 11/14/2020 - 23:53

Sir, is it true that we can fetch only 100 tweets at a time? If true, is there a chance that tweets will repeat in the next 100 from the previous 100?

Submitted by GONGATI REDDY on Sun, 11/15/2020 - 22:58

(i) Yes, 100 tweets can be fetched in a single request (v1.1 streaming API). However, Twitter puts limits on the number of requests that can be made per window period.
(ii) No, tweets do not repeat. The tweets are available via the streaming API as soon as they are tweeted (in near real-time).

Submitted by Rabindra Lamsal on Tue, 11/17/2020 - 10:33


Dataset Files
