This India-specific COVID-19 tweets dataset has been developed using the large-scale Coronavirus (COVID-19) Tweets Dataset, which currently contains more than 700 million COVID-19 specific English language tweets. This dataset contains tweets originating from India during the first week of each four phases of nationwide lockdowns initiated by the Government of India.

Instructions: 

The zipped files contain .db (SQLite database) files. Each .db file has a table 'geo'. To hydrate the IDs you can import the .db file as a pandas dataframe and then export it to .CSV or .TXT for hydration. For more details on hydrating the IDs, please visit the primary dataset page.

conn = sqlite3.connect('/path/to/the/db/file')

c = conn.cursor()

data = pd.read_sql("SELECT tweet_id FROM geo", conn)

Categories:
848 Views

This dataset gives a cursory glimpse at the overall sentiment trend of the public discourse regarding the COVID-19 pandemic on Twitter. The live scatter plot of this dataset is available as The Overall Trend block at https://live.rlamsal.com.np. The trend graph reveals multiple peaks and drops that need further analysis. The n-grams during those peaks and drops can prove beneficial for better understanding the discourse.

Instructions: 

The TXT files in this dataset can be used in generating the trend graph. The peaks and drops in the trend graph can be made more meaningful by computing n-grams for those periods. To compute the n-grams, the tweet IDs of the Coronavirus (COVID-19) Tweets Dataset should be hydrated to form a tweets corpus.

Pseudo-code for generating similar trend dataset

current = int(time.time()*1000)     #we receive the timestamp in ms from twitter

off = 600*1000    #we're looking for 10-minute (600 seconds) average data (offset)

past = current - off     #getting timestamp of 10-minute past the current time

df = select recent most 60,000    #even if we receive 100 tweets per second, the no. of tweets do not cross this number in an interval of 10 minutes

new_df = df[df.unix > past]     #here "unix" is the timestamp column name in the primary tweets dataset

avg_sentiment = new_df["sentiment"].mean()    #calculate mean

store current, avg_sentiment into a database

Pseudo-code for extracting top 100 "unigrams" and "bigrams" from a tweets corpus

import nltk

from collections import Counter

#loading a tweet corpus

with open ("/path/to/the/tweets/corpus", "r", encoding="UTF-8") as myfile:

     data=myfile.read().replace('\n', ' ')

data = preprocess your data (use regular expression-perform find and replace operations)

data = data.split(' ')

stopwords = nltk.corpus.stopwords.words('english')

clean_data=[]

#removing stopwords from each tweet

for w in data:

     if w not in stopwords:

          clean_data.append(w)

#extracting top 100 n-grams

unigram = Counter(clean_data)

unigram_top = unigram.most_common(100)

bigram = Counter(zip(clean_data, clean_data[1:]))

bigram_top = bigram.most_common(100)

Categories:
1885 Views

This dataset contains IDs and sentiment scores of the geo-tagged tweets related to the COVID-19 pandemic. The tweets are captured by an on-going project deployed at https://live.rlamsal.com.np. The model monitors the real-time Twitter feed for coronavirus-related tweets using 90+ different keywords and hashtags that are commonly used while referencing the pandemic. Complying with Twitter's content redistribution policy, only the tweet IDs are shared. You can re-construct the dataset by hydrating these IDs.

Instructions: 

Each CSV file contains a list of tweet IDs. You can use these tweet IDs to download fresh data from Twitter (hydrating the tweet IDs). To make it easy for the NLP researchers to get access to the sentiment analysis of each collected tweet, the sentiment score computed by TextBlob has been appended as the second column. To hydrate the tweet IDs, you can use applications such as Hydrator (available for OS X, Windows and Linux) or twarc (python library).

Getting the CSV files of this dataset ready for hydrating the tweet IDs:

import pandas as pd

dataframe=pd.read_csv("april28_april29.csv", header=None)

dataframe=dataframe[0]

dataframe.to_csv("ready_april28_april29.csv", index=False, header=None)

The above example code takes in the original CSV file (i.e., april28_april29.csv) from this dataset and exports just the tweet ID column to a new CSV file (i.e., ready_april28_april29.csv). The newly created CSV file can now be consumed by the Hydrator application for hydrating the tweet IDs. To export the tweet ID column into a TXT file, just replace ".csv" with ".txt" in the to_csv function (last line) of the above example code.

If you are not comfortable with Python and pandas, you can upload these CSV files to your Google Drive and use Google Sheets to delete the second column. Once finished with the deletion, download the edited CSV files: File > Download > Comma-separated values (.csv, current sheet). These downloaded CSV files are now ready to be used with the Hydrator app for hydrating the tweets IDs.

Categories:
21457 Views

This dataset includes CSV files that contain IDs and sentiment scores of the tweets related to the COVID-19 pandemic. The tweets have been collected by an on-going project deployed at https://live.rlamsal.com.np. The model monitors the real-time Twitter feed for coronavirus-related tweets using 90+ different keywords and hashtags that are commonly used while referencing the pandemic. This dataset has been wholly re-designed on March 20, 2020, to comply with the content redistribution policy set by Twitter.

Instructions: 

Each CSV file contains a list of tweet IDs. You can use these tweet IDs to download fresh data from Twitter (hydrating the tweet IDs). To make it easy for the NLP researchers to get access to the sentiment analysis of each collected tweet, the sentiment score computed by TextBlob has been appended as the second column. To hydrate the tweet IDs, you can use applications such as Hydrator (available for OS X, Windows and Linux) or twarc (python library).

Getting the CSV files of this dataset ready for hydrating the tweet IDs:

import pandas as pd

dataframe=pd.read_csv("corona_tweets_10.csv", header=None)

dataframe=dataframe[0]

dataframe.to_csv("ready_corona_tweets_10.csv", index=False, header=None)

The above example code takes in the original CSV file (i.e., corona_tweets_10.csv) from this dataset and exports just the tweet ID column to a new CSV file (i.e., ready_corona_tweets_10.csv). The newly created CSV file can now be consumed by the Hydrator application for hydrating the tweet IDs. To export the tweet ID column into a TXT file, just replace ".csv" with ".txt" in the to_csv function (last line) of the above example code.

If you are not comfortable with Python and pandas, you can upload these CSV files to your Google Drive and use Google Sheets to delete the second column. Once finished with the deletion, download the edited CSV files: File > Download > Comma-separated values (.csv, current sheet). These downloaded CSV files are now ready to be used with the Hydrator app for hydrating the tweets IDs.

Categories:
88257 Views