Open Access
Coronavirus (COVID-19) Tweets Sentiment Trend
- Citation Author(s): Rabindra Lamsal
- Submitted by: Rabindra Lamsal
- Last updated: Fri, 11/04/2022 - 08:05
- DOI: 10.21227/t263-8x74
Abstract
This dataset gives a cursory glimpse of the overall sentiment trend of the public discourse regarding the COVID-19 pandemic on Twitter. A live scatter plot of this dataset is available as the Overall Trend block at https://live.rlamsal.com.np. The trend graph reveals multiple peaks and drops that warrant further analysis, and the n-grams during those peaks and drops can prove beneficial for better understanding the discourse.
Announcement: We also release a million-scale COVID-19-specific geotagged tweets dataset—MegaGeoCOV (on GitHub). The dataset is introduced in the paper "Twitter conversations predict the daily confirmed COVID-19 cases".
Related publications:
- Lamsal, R. (2021). Design and analysis of a large-scale COVID-19 tweets dataset. Applied Intelligence, 51(5), 2790-2804.
- Lamsal, R., Harwood, A., & Read, M. R. (2022). Socially Enhanced Situation Awareness from Microblogs using Artificial Intelligence: A Survey. ACM Computing Surveys.
- Lamsal, R., Harwood, A., & Read, M. R. (2022). Twitter conversations predict the daily confirmed COVID-19 cases. Applied Soft Computing, 109603.
- Lamsal, R., Harwood, A., & Read, M. R. (2022). Addressing the location A/B problem on Twitter: the next generation location inference research. In Proceedings of the 6th ACM SIGSPATIAL LocalRec (pp. 1-4).
Dataset usage terms: By using this dataset, you agree to (i) use the content of this dataset and the data generated from the content of this dataset for non-commercial research only, (ii) remain in compliance with Twitter's Developer Policy, and (iii) cite the following paper:
Lamsal, R. (2020). Design and analysis of a large-scale COVID-19 tweets dataset. Applied Intelligence, 1-15. DOI: https://doi.org/10.1007/s10489-020-02029-z
BibTeX:
@article{lamsal2020design,
  title={Design and analysis of a large-scale COVID-19 tweets dataset},
  author={Lamsal, Rabindra},
  journal={Applied Intelligence},
  pages={1--15},
  year={2020},
  publisher={Springer}
}
-------------------------------------
Related datasets:
(a) Coronavirus (COVID-19) Tweets Dataset
(b) Coronavirus (COVID-19) Geo-tagged Tweets Dataset
(c) Tweets Originating from India During COVID-19 Lockdowns
-------------------------------------
A quick overview of the dataset
Note: This dataset is no longer maintained.
The sentiment scores lie in the range [-1, +1]: scores in [-1, 0) indicate negative sentiment, 0 indicates neutral sentiment, and scores in (0, +1] indicate positive sentiment. Since the number of negative-sentiment tweets is always smaller than the combined number of neutral- and positive-sentiment tweets, the average sentiment falls close to +0.05 most of the time. So if we treat +0.05 as the neutral point for the average sentiment score, any score greater than +0.1 (a peak) or smaller than 0 (a drop) can be regarded as a point of interest worth further scrutiny; a minimal sketch for flagging such points follows the lists below. The following are the dates on which the Twitter stream (based on the tweets present in the Coronavirus (COVID-19) Tweets Dataset) experienced those peaks and drops:
Positive peaks: In 2020 (April 30, May 3, May 23, May 24, May 25, May 26, June 2, June 22, June 28, July 3, July 12, July 26, August 15, August 16, August 18, August 21, August 24, August 31, September 1, September 2, September 4, September 5, September 9, September 21, September 23, October 2, October 9, October 18, October 22, November 4, November 6, November 7, November 8, November 9, November 10, November 16, November 18, November 19, November 23, November 24, November 26, November 30, December 13, December 14, December 15, December 18, December 24, December 25, December 27). In 2021 (January 1, January 3, January 12, January 18, January 22, January 25, January 26, January 28, January 29)
Negative peaks: In 2020 (May 28, May 30, May 31, June 1, June 2, June 7, June 8, June 12, June 13, June 14, June 15, June 21, June 24, June 25, July 6, July 7, July 10, August 26, September 1, September 3, September 13, September 17, September 25, September 26, September 28, October 5, October 9, October 10, October 15, October 26, November 1, November 8, November 9, November 13, November 15, November 21, November 22, December 1, December 6, December 19). In 2021 (January 7)
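As a minimal sketch of this thresholding (the scores below are made up for illustration):

import pandas as pd

# hypothetical 10-minute average sentiment scores
scores = pd.Series([0.04, 0.12, 0.06, -0.02, 0.05])

peaks = scores[scores > 0.1]    # points of interest above +0.1
drops = scores[scores < 0.0]    # points of interest below 0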
What's inside the dataset files?
Tweets collected within each 10-minute window are grouped together, and an average sentiment score is computed for the window. This dataset contains TXT files, each with two columns: (i) date/time (in UTC), given as a Unix timestamp in milliseconds, and (ii) average sentiment. To convert the Unix timestamp to a human-readable format, you can use the formula =cell/1000/60/60/24 + DATE(1970,1,1) in a spreadsheet, or pd.to_datetime(dataframe_name[column], unit='ms') if you're comfortable with Python. Note that there are multiple instances where the average sentiment score is NULL because of technical issues (networking problems at the cloud service and API interruptions).
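For example, a trend file can be loaded and converted with pandas as follows (the comma delimiter and column names are assumptions; adjust them to match the actual files):

import pandas as pd

# assuming two comma-separated columns and no header row
df = pd.read_csv("trend_april24_april30.txt", header=None,
                 names=["unix_ms", "avg_sentiment"])

# convert the millisecond Unix timestamps to UTC datetimes
df["datetime"] = pd.to_datetime(df["unix_ms"], unit="ms")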
The TXT files in this dataset can be used to generate the trend graph. The peaks and drops in the trend graph can be made more meaningful by computing n-grams for those periods. To compute the n-grams, the tweet IDs of the Coronavirus (COVID-19) Tweets Dataset should first be hydrated to form a tweets corpus.
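As a sketch, hydration can be done with a library such as twarc (this assumes twarc v1 and valid Twitter API credentials; consult the library's documentation for the exact interface):

from twarc import Twarc

# hypothetical placeholders; substitute your own Twitter API credentials
consumer_key, consumer_secret = "...", "..."
access_token, access_token_secret = "...", "..."
t = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)

with open("tweet_ids.txt") as ids:    # one tweet ID per line
    for tweet in t.hydrate(ids):      # yields full tweet objects as dicts
        print(tweet["id_str"], tweet["full_text"])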
Pseudo-code for generating a similar trend dataset
import time

current = int(time.time() * 1000)          # Twitter delivers timestamps in ms
off = 600 * 1000                           # 10-minute (600-second) window offset
past = current - off                       # timestamp 10 minutes before the current time

# fetch the most recent 60,000 tweets as a DataFrame; even at 100 tweets per
# second, a 10-minute interval never exceeds 60,000 tweets
df = fetch_recent_tweets(limit=60000)      # placeholder for your datastore query

new_df = df[df.unix > past]                # "unix" is the timestamp column in the primary tweets dataset
avg_sentiment = new_df["sentiment"].mean() # average sentiment for the window
store(current, avg_sentiment)              # placeholder: persist to a database
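In a live pipeline, this routine would presumably run once per window; a minimal scheduling sketch, where compute_and_store_window_average is a hypothetical wrapper around the steps above:

import time

while True:
    compute_and_store_window_average()     # the steps sketched above
    time.sleep(600)                        # wait one 10-minute window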
Pseudo-code for extracting the top 100 "unigrams" and "bigrams" from a tweets corpus
import re
import nltk
from collections import Counter

# loading a tweets corpus
with open("/path/to/the/tweets/corpus", "r", encoding="UTF-8") as myfile:
    data = myfile.read().replace('\n', ' ')

# preprocessing via regular-expression find-and-replace: lowercase, then strip
# URLs, @mentions, and non-alphabetic characters
data = data.lower()
data = re.sub(r"https?://\S+|@\w+", " ", data)
data = re.sub(r"[^a-z\s]", " ", data)

# removing stopwords from the corpus
nltk.download("stopwords", quiet=True)     # required once per environment
stopwords = set(nltk.corpus.stopwords.words('english'))
clean_data = [w for w in data.split() if w not in stopwords]

# extracting the top 100 n-grams
unigram = Counter(clean_data)
unigram_top = unigram.most_common(100)
bigram = Counter(zip(clean_data, clean_data[1:]))
bigram_top = bigram.most_common(100)
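As a quick usage example, the bigram tuples can be joined for readability when inspecting the results:

for (w1, w2), count in bigram_top[:10]:
    print(f"{w1} {w2}: {count}")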
Dataset Files
- trend_april24_april30.txt (32.81 kB)
- trend_may1_may7.txt (34.42 kB)
- trend_may8_may14.txt (34.54 kB)
- trend_may15_may21.txt (34.56 kB)
- trend_may22_may31.txt (49.36 kB)
- trend_june1_june7.txt (34.80 kB)
- trend_june8_june14.txt (34.69 kB)
- trend_june15_june21.txt (34.75 kB)
- trend_june22_june30.txt (44.45 kB)
- trend_july1_july7.txt (34.73 kB)
- trend_july8_july14.txt (34.64 kB)
- trend_july15_july21.txt (34.69 kB)
- trend_july22_july31.txt (49.50 kB)
- trend_august1_august7.txt (34.61 kB)
- trend_august8_august14.txt (34.61 kB)
- trend_august15_august21.txt (34.53 kB)
- trend_august22_august31.txt (49.23 kB)
- trend_september1_september7.txt (34.62 kB)
- trend_september8_september14.txt (34.70 kB)
- trend_september15_september21.txt (34.48 kB)
- trend_september22_september30.txt (44.46 kB)
- trend_october1_october7.txt (34.67 kB)
- trend_october8_october14.txt (34.58 kB)
- trend_october15_october21.txt (34.40 kB)
- trend_october22_october31.txt (46.17 kB)
- trend_november1_november7.txt (34.44 kB)
- trend_november8_november14.txt (34.61 kB)
- trend_november15_november21.txt (34.51 kB)
- trend_november22_november30.txt (44.52 kB)
- trend_december1_december7.txt (34.59 kB)
- trend_december8_december14.txt (33.53 kB)
- trend_december15_december21.txt (33.55 kB)
- trend_december22_december31.txt (48.05 kB)
- trend_2021_january1_january7.txt (33.66 kB)
- trend_2021_january8_january14.txt (33.67 kB)
- trend_2021_january15_january21.txt (33.42 kB)
- trend_2021_january22_january31.txt (47.09 kB)
Comments
Hi,
How can I get tweet IDs from timestamps?
Thank you,
Carolina
Hello Carolina. The following steps convert a timestamp to a tweet ID.
millisecond_epoch = ...                        # convert the preferred date & time to a millisecond epoch
epoch = millisecond_epoch - 1288834974657      # subtract Twitter's snowflake epoch (2010-11-04 01:42:54.657 UTC)
tweet_id = epoch << 22                         # left-shift into the ID's timestamp bits
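As a concrete sketch in Python (the date below is purely illustrative):

from datetime import datetime, timezone

dt = datetime(2020, 4, 24, tzinfo=timezone.utc)       # illustrative date & time
millisecond_epoch = int(dt.timestamp() * 1000)
tweet_id = (millisecond_epoch - 1288834974657) << 22  # snowflake conversion
print(tweet_id)                                       # smallest possible tweet ID for that moment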
However, please note that the timestamps given in this dataset are not associated with tweet IDs. You'll have to use the tweet IDs given in the primary dataset for hydration. The timestamps in this dataset give only a cursory view of the tweets captured in the primary dataset.
I am unable to get tweets from tweet_id.
Whenever I do the search, it shows "Sorry, that tweet has been deleted."
Hello Himanshu. This dataset does not contain tweet IDs; only (i) date/time (in UTC) and (ii) average sentiment data are made available through this dataset.
If you are looking for COVID-19 specific tweet ids, you can get them from:
(i) (COVID-19 Tweets Dataset) https://ieee-dataport.org/open-access/coronavirus-covid-19-tweets-dataset, and
(ii) (Geo-version) https://ieee-dataport.org/open-access/coronavirus-covid-19-geo-tagged-tw...
I hope this helps.