Coronavirus (COVID-19) Tweets Sentiment Trend (Global)

Abstract 

This dataset gives a cursory glimpse at the overall sentiment trend of the public discourse regarding the COVID-19 pandemic on Twitter. The live scatter plot of this dataset is available as The Overall Trend block at https://live.rlamsal.com.np. The trend graph reveals multiple peaks and drops that need further analysis. The n-grams during those peaks and drops can prove beneficial for better understanding the discourse.

The dataset is updated weekly and will continue to be updated for as long as the Coronavirus (COVID-19) Tweets Dataset remains under development.

Dataset usage terms: By using this dataset, you agree to (i) use the content of this dataset and the data generated from the content of this dataset for non-commercial research only, (ii) remain in compliance with Twitter's Developer Policy, and (iii) cite the following paper:

Lamsal, R. Design and analysis of a large-scale COVID-19 tweets dataset. Applied Intelligence (2020). https://doi.org/10.1007/s10489-020-02029-z

-------------------------------------

Related datasets:

(a) Coronavirus (COVID-19) Tweets Dataset

(b) Coronavirus (COVID-19) Geo-tagged Tweets Dataset

(c) Tweets Originating from India During COVID-19 Lockdowns

-------------------------------------

A quick overview of the dataset

The sentiment scores are defined in the ranges [-1,0), 0, and (0,+1] for negative, neutral, and positive sentiment, respectively. Since the number of negative sentiment tweets is always less than the combined number of neutral and positive sentiment tweets, the average sentiment stays close to +0.05 most of the time. If we treat +0.05 as the neutral point for the average sentiment score, then any score greater than +0.1 (a peak) or smaller than 0 (a drop) can be regarded as a point of interest for further scrutiny. The following are the dates on which the Twitter stream (based on the tweets present in the Coronavirus (COVID-19) Tweets Dataset) experienced those peaks and drops:
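The thresholds described above can be applied mechanically. The following is a minimal sketch, assuming the data is already available as (timestamp in ms, average score) pairs. The function name flag_points and the sample values are hypothetical and not part of the dataset. The source does not state whether the listed dates were derived from individual 10-minute points or from a daily aggregate; this sketch flags a date if any single point crosses a threshold.

```python
from datetime import datetime, timezone

PEAK_THRESHOLD = 0.1   # scores above this are treated as peaks (from the text above)
DROP_THRESHOLD = 0.0   # scores below this are treated as drops (from the text above)

def flag_points(rows):
    """Return (peak_dates, drop_dates) as sorted lists of UTC dates (ISO strings).

    rows: iterable of (unix_ms, avg_sentiment); avg_sentiment may be None (NULL).
    """
    peaks, drops = set(), set()
    for unix_ms, score in rows:
        if score is None:          # skip NULL scores
            continue
        day = datetime.fromtimestamp(unix_ms / 1000, tz=timezone.utc).date().isoformat()
        if score > PEAK_THRESHOLD:
            peaks.add(day)
        elif score < DROP_THRESHOLD:
            drops.add(day)
    return sorted(peaks), sorted(drops)

# Made-up sample values, for illustration only
sample = [
    (1588204800000, 0.12),   # 2020-04-30 00:00:00 UTC
    (1588205400000, 0.05),   # near the neutral point: not flagged
    (1590624000000, -0.02),  # 2020-05-28 00:00:00 UTC
    (1590624600000, None),   # NULL score: skipped
]
peak_dates, drop_dates = flag_points(sample)
```

A date can appear in both lists if it contains both a peak and a drop, which is consistent with some dates being listed under both headings below.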

Positive peaks: In 2020 (April 30, May 3, May 23, May 24, May 25, May 26, June 2, June 22, June 28, July 3, July 12, July 26, August 15, August 16, August 18, August 21, August 24, August 31, September 1, September 2, September 4, September 5, September 9, September 21, September 23, October 2, October 9, October 18, October 22, November 4, November 6, November 7, November 8, November 9, November 10, November 16, November 18, November 19, November 23, November 24, November 26, November 30, December 13, December 14, December 15, December 18, December 24, December 25, December 27). In 2021 (January 1, January 3)

Negative peaks (drops): In 2020 (May 28, May 30, May 31, June 1, June 2, June 7, June 8, June 12, June 13, June 14, June 15, June 21, June 24, June 25, July 6, July 7, July 10, August 26, September 1, September 3, September 13, September 17, September 25, September 26, September 28, October 5, October 9, October 10, October 15, October 26, November 1, November 8, November 9, November 13, November 15, November 21, November 22, December 1, December 6, December 19). In 2021 (January 7)

What's inside the dataset files?

Tweets collected in each 10-minute window are grouped together, and an average sentiment score is computed for that window. This dataset contains TXT files, each with two columns: (i) date/time (in UTC) and (ii) average sentiment. The date/time column is a Unix timestamp in milliseconds by default. To convert it to a human-readable format, use the formula =cell/1000/60/60/24 + DATE(1970,1,1) in a spreadsheet, or pd.to_datetime(dataframe_name[column], unit='ms') in Python with pandas. Note that in multiple instances the average sentiment score is NULL because of technical issues (networking at the cloud service, and the API).
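For readers who prefer not to depend on pandas, the same conversion can be done with the Python standard library. This is a minimal sketch only: the comma delimiter, the absence of a header row, and the literal string NULL for missing scores are assumptions about the TXT layout, and parse_trend is a hypothetical helper name. Adjust the delimiter argument to match the actual files.

```python
import csv
import io
from datetime import datetime, timezone

def parse_trend(text, delimiter=","):
    """Parse two-column trend data into (datetime in UTC, score or None) tuples."""
    rows = []
    for record in csv.reader(io.StringIO(text), delimiter=delimiter):
        if len(record) < 2:
            continue                       # skip blank or malformed lines
        try:
            unix_ms = int(record[0])
        except ValueError:
            continue                       # skip a header line, if any
        raw = record[1].strip()
        # Treat empty or NULL scores as missing (assumed representation)
        score = None if raw.upper() in ("", "NULL") else float(raw)
        when = datetime.fromtimestamp(unix_ms / 1000, tz=timezone.utc)
        rows.append((when, score))
    return rows

# Made-up sample content, for illustration only
sample_text = "1588204800000,0.0512\n1588205400000,NULL\n"
parsed = parse_trend(sample_text)
```

To read a real file, pass the file's contents, for example parse_trend(open(path, encoding="UTF-8").read()), where path is the location of one of the TXT files.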

Instructions: 

The TXT files in this dataset can be used to generate the trend graph. The peaks and drops in the trend graph can be made more meaningful by computing n-grams for those periods. To compute the n-grams, the tweet IDs of the Coronavirus (COVID-19) Tweets Dataset should be hydrated (that is, the full tweets retrieved from the Twitter API by ID) to form a tweets corpus.

Pseudo-code for generating a similar trend dataset

import time

current = int(time.time() * 1000)   # current time in ms; Twitter timestamps are in ms
off = 600 * 1000                    # 10-minute (600-second) window, in ms
past = current - off                # timestamp 10 minutes before the current time

# Pseudo-step: load the 60,000 most recent rows of the primary tweets dataset into df.
# Even at 100 tweets per second, a 10-minute window holds at most 60,000 tweets.
df = select most recent 60,000

new_df = df[df.unix > past]         # "unix" is the timestamp column in the primary tweets dataset
avg_sentiment = new_df["sentiment"].mean()   # mean sentiment over the window

# Pseudo-step: store (current, avg_sentiment) in a database.
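The windowing step in the pseudo-code above can be exercised on in-memory data without the primary tweets dataset. The following is a minimal sketch in plain Python; window_average and the sample tweets are hypothetical. Two choices here are assumptions of the sketch rather than the author's method: the window is also bounded above by the current time, and an empty window returns None.

```python
def window_average(tweets, current_ms, window_ms=600 * 1000):
    """Mean sentiment of tweets with timestamp in (current_ms - window_ms, current_ms].

    tweets: iterable of (unix_ms, sentiment). Returns None if the window is empty.
    """
    past = current_ms - window_ms
    scores = [s for ts, s in tweets if past < ts <= current_ms]
    if not scores:
        return None
    return sum(scores) / len(scores)

# Made-up sample values, for illustration only
now = 1_600_000_600_000
tweets = [
    (1_600_000_000_000, 0.9),    # exactly 10 minutes old: excluded (must be newer than past)
    (1_600_000_300_000, 0.5),
    (1_600_000_500_000, -0.25),
]
avg = window_average(tweets, now)
```

Returning None for an empty window gives a natural place for a missing value to arise when no tweets were captured in an interval.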

Pseudo-code for extracting the top 100 "unigrams" and "bigrams" from a tweets corpus

import nltk
from collections import Counter

# Load the tweets corpus as one string, with newlines replaced by spaces
with open("/path/to/the/tweets/corpus", "r", encoding="UTF-8") as myfile:
    data = myfile.read().replace('\n', ' ')

# Pseudo-step: preprocess your data (use regular expressions to perform find-and-replace operations).
# "preprocess" is a placeholder for your own function.
data = preprocess(data)

data = data.split()   # split on any whitespace, which avoids empty tokens

# Requires a one-time nltk.download('stopwords')
stopwords = set(nltk.corpus.stopwords.words('english'))

# Remove stopwords from the corpus (the NLTK stopword list is lowercase)
clean_data = [w for w in data if w.lower() not in stopwords]

# Extract the top 100 n-grams
unigram = Counter(clean_data)
unigram_top = unigram.most_common(100)
bigram = Counter(zip(clean_data, clean_data[1:]))
bigram_top = bigram.most_common(100)
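The preprocessing step in the pseudo-code above is left open. One possible shape for it is sketched below, under the assumption that URLs, @mentions, and punctuation should be discarded before counting n-grams. The function name preprocess and the specific patterns are illustrative choices, not the author's method.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")   # ASCII-only: also strips non-English letters

def preprocess(text):
    """Lowercase, drop URLs and @mentions, strip punctuation, collapse whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text)      # also removes '#', keeping the hashtag word
    return " ".join(text.split())

# Made-up sample tweet, for illustration only
cleaned = preprocess("Stay SAFE! @WHO says wash hands: https://example.com #COVID19")
```

For a multilingual corpus, the ASCII-only pattern would need to be replaced with one that keeps Unicode letters.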

Comments

Hi,

How can I get tweet IDs from timestamps?

Thank you,
Carolina

Submitted by Carolina Marreiros on Wed, 12/09/2020 - 05:53

Hello Carolina. The following steps let you convert a timestamp to a tweet ID.

ms_epoch = ...   # convert the preferred date & time to a millisecond epoch
epoch = ms_epoch - 1288834974657   # subtract the Twitter epoch offset (in ms)
tweet_id = epoch << 22   # apply the left-shift operator

However, please note that the timestamps given in this dataset are not associated with tweet IDs. You'll have to make use of the tweet IDs given in the primary dataset for hydration. The timestamps in this dataset give a cursory view of the tweets captured in the primary dataset.
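A minimal Python sketch of the steps above, assuming a timezone-aware datetime as input. The function names are hypothetical, and the reverse conversion is an addition not described in the steps. The ID produced is a lower bound for tweets created at that millisecond, which suits range filtering; it is not the ID of an actual tweet.

```python
from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657   # offset used in the steps above
TIMESTAMP_SHIFT = 22               # the low 22 bits are not part of the timestamp

def datetime_to_tweet_id(when):
    """Smallest tweet ID whose embedded timestamp equals `when` (timezone-aware)."""
    ms_epoch = int(when.timestamp() * 1000)
    return (ms_epoch - TWITTER_EPOCH_MS) << TIMESTAMP_SHIFT

def tweet_id_to_ms(tweet_id):
    """Recover the millisecond epoch embedded in a tweet ID."""
    return (tweet_id >> TIMESTAMP_SHIFT) + TWITTER_EPOCH_MS

# Example input chosen for illustration only
when = datetime(2020, 12, 9, 5, 53, tzinfo=timezone.utc)
lower_bound_id = datetime_to_tweet_id(when)
```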

Submitted by Rabindra Lamsal on Fri, 12/11/2020 - 22:54