Coronavirus (COVID-19) Tweets Sentiment Trend (Global)


Abstract 

This dataset gives a cursory glimpse at the overall sentiment trend of the public discourse regarding the COVID-19 pandemic on Twitter. The live scatter plot of this dataset is available as The Overall Trend block at https://live.rlamsal.com.np. The trend graph reveals multiple peaks and drops that need further analysis. The n-grams during those peaks and drops can prove beneficial for better understanding the discourse.

The dataset will be updated weekly and will continue for as long as the development of the Coronavirus (COVID-19) Tweets Dataset is ongoing.

-------------------------------------

Related: Coronavirus (COVID-19) Tweets Dataset, Coronavirus (COVID-19) Geo-tagged Tweets Dataset and Tweets Originating from India During COVID-19 Lockdowns 1, 2, 3, 4

-------------------------------------

A quick overview of the dataset

The sentiment scores lie in [-1, +1]: scores in [-1, 0) are negative, 0 is neutral, and scores in (0, +1] are positive. Since the number of negative-sentiment tweets is always smaller than the combined number of neutral and positive tweets, the average sentiment falls close to +0.05 most of the time. Taking +0.05 as the neutral point for the average sentiment score, any score greater than +0.1 (a peak) or smaller than 0 (a drop) can be regarded as a point of interest for further scrutiny. Below are the dates when the Twitter stream (based on the tweets present in the Coronavirus (COVID-19) Tweets Dataset) experienced those peaks and drops; a short sketch for flagging such points programmatically follows the lists.

Positive peaks: In 2020 (April 30, May 3, May 23, May 24, May 25, May 26, June 2, June 22, June 28, July 3, July 12, July 26, August 15, August 16, August 18, August 21, August 24, August 31, September 1, September 2, September 4, September 5, September 9, September 21, September 23, October 2)

Negative peaks: In 2020 (May 28, May 30, May 31, June 1, June 2, June 7, June 8, June 12, June 13, June 14, June 15, June 21, June 24, June 25, July 6, July 7, July 10, August 26, September 1, September 3, September 13, September 17, September 25, September 26, September 28, October 5)
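As a quick illustration of these thresholds, the sketch below flags peaks and drops in a pandas DataFrame; the toy scores stand in for one of the dataset's TXT files (see the loading snippet in the next section):

import pandas as pd

# Toy scores; in practice, load them from one of the dataset's TXT files.
df = pd.DataFrame({"sentiment": [0.04, 0.12, -0.02, 0.06]})
peaks = df[df["sentiment"] > 0.1]   # average sentiment above +0.1
drops = df[df["sentiment"] < 0]     # average sentiment below 0
print(len(peaks), "peak windows;", len(drops), "drop windows")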

What's inside the dataset files?

Tweets collected every 10 minutes are sampled together, and an average sentiment score is computed. This dataset contains TXT files, each with two columns: (i) date/time (in UTC) and (ii) average sentiment. The first column is a Unix timestamp in milliseconds. To convert it to a human-readable format, you can use the formula =cell/1000/60/60/24 + DATE(1970,1,1) in a spreadsheet, or pd.to_datetime(dataframe_name[column], unit='ms') in Python. There are 44 instances where the average sentiment score is NULL because of networking issues with the cloud service.
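A minimal loading sketch, assuming one of the TXT files is saved as trend.txt with whitespace-separated columns (the actual file name and delimiter may differ):

import pandas as pd

# Assumed file name and delimiter; adjust to the actual TXT file.
df = pd.read_csv("trend.txt", sep=r"\s+", header=None, names=["unix", "sentiment"])
df["datetime"] = pd.to_datetime(df["unix"], unit="ms")  # Unix ms -> UTC datetime
df = df.dropna(subset=["sentiment"])                    # drop the NULL instances
print(df.head())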

Instructions

The TXT files in this dataset can be used to generate the trend graph. The peaks and drops in the trend graph can be made more meaningful by computing n-grams for those periods. To compute the n-grams, the tweet IDs in the Coronavirus (COVID-19) Tweets Dataset should be hydrated to form a tweets corpus.
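Hydration can be done with any tweet-ID hydration tool. A minimal sketch using the twarc library (version 1.x; the file names and credentials are placeholders):

from twarc import Twarc

# Placeholder credentials; substitute your own Twitter API keys.
t = Twarc("consumer_key", "consumer_secret", "access_token", "access_token_secret")

# ids.txt holds one tweet ID per line; write the hydrated text out as a corpus.
with open("ids.txt") as ids, open("corpus.txt", "w", encoding="UTF-8") as out:
    for tweet in t.hydrate(ids):
        out.write(tweet["full_text"].replace("\n", " ") + "\n")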

Pseudo-code for generating a similar trend dataset

The original pseudo-code is shown here as a runnable Python sketch; the SQLite database, table, and column names are assumptions, so adapt them to your own storage.

import sqlite3
import time

import pandas as pd

conn = sqlite3.connect("tweets.db")  # assumed database holding the primary tweets dataset

current = int(time.time() * 1000)    # we receive the timestamp in ms from Twitter
past = current - 600 * 1000          # timestamp 10 minutes (600 s) before the current time

# Select the most recent 60,000 tweets; even at 100 tweets per second,
# the number of tweets does not cross this figure in a 10-minute interval.
df = pd.read_sql("SELECT unix, sentiment FROM tweets ORDER BY unix DESC LIMIT 60000", conn)

new_df = df[df.unix > past]                 # "unix" is the timestamp column in the primary tweets dataset
avg_sentiment = new_df["sentiment"].mean()  # 10-minute average sentiment

# Store the (timestamp, average) pair that feeds the trend graph.
conn.execute("INSERT INTO trend (unix, sentiment) VALUES (?, ?)", (current, float(avg_sentiment)))
conn.commit()
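In production, this snippet would run on a 10-minute schedule (for example, via cron or a simple sleep loop) so that each run appends one new point to the trend.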

Pseudo-code for extracting top 100 "unigrams" and "bigrams" from a tweets corpus

import re
from collections import Counter

import nltk

nltk.download("stopwords", quiet=True)  # needed on the first run only

# Load a tweets corpus into a single string.
with open("/path/to/the/tweets/corpus", "r", encoding="UTF-8") as myfile:
    data = myfile.read().replace('\n', ' ')

# Preprocess the data; one possible cleaning step is to lowercase
# the text and keep letters only (adjust the regular expression as needed).
data = re.sub(r"[^a-z\s]", " ", data.lower())
tokens = data.split()

# Remove English stopwords from the token list.
stopwords = set(nltk.corpus.stopwords.words('english'))
clean_data = [w for w in tokens if w not in stopwords]

# Extract the top 100 unigrams.
unigram = Counter(clean_data)
unigram_top = unigram.most_common(100)

# Extract the top 100 bigrams by pairing each token with its successor.
bigram = Counter(zip(clean_data, clean_data[1:]))
bigram_top = bigram.most_common(100)