Open Access
Corona Virus (COVID-19) Turkish Tweets Dataset
- Submitted by: Ibrahim Sabuncu
- Last updated: Tue, 05/19/2020 - 07:48
- DOI: 10.21227/0wf0-0792
Abstract
This dataset contains Covid-19 related tweets written in Turkish that include at least one of four keywords (Covid, Kovid, Corona, Korona). These keywords are the terms used in Turkey to refer to the Covid-19 virus. The collection starts from 11 March 2020, the date the first Covid-19 case was detected in Turkey.
The dataset currently contains 4.8 million tweets sent from 9 March 2020 until 6 May 2020, with 6 attributes for each tweet.
The data file is in comma-separated values (CSV) format. It contains the following 6 columns for each tweet:
Created-At: exact creation time of the tweet
From-User-Id: user ID of the sender
To-User-Id: user ID of the recipient, if the tweet was sent to a user
Language: language of the tweet (Turkish for all tweets)
Retweet-Count: number of retweets
Id: tweet ID, unique across all tweets
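As a rough illustration of the column layout above, the following Python sketch reads such a file with the standard `csv` module. It assumes the header row uses exactly the six column names listed here, which may not match the actual file; the function name `read_tweets` and the sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io

# Column names as listed in the dataset description; the real header may differ.
EXPECTED_COLUMNS = ["Created-At", "From-User-Id", "To-User-Id",
                    "Language", "Retweet-Count", "Id"]

def read_tweets(handle):
    """Yield one dict per tweet row, after checking the header."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    for row in reader:
        # Retweet-Count is converted to int; IDs and timestamps stay as strings.
        row["Retweet-Count"] = int(row["Retweet-Count"])
        yield row

# Made-up sample rows, for illustration only (not taken from the dataset).
SAMPLE = (
    "Created-At,From-User-Id,To-User-Id,Language,Retweet-Count,Id\n"
    "2020-03-30 10:00:00,111,,tr,5,1001\n"
    "2020-03-30 10:01:00,222,111,tr,0,1002\n"
)

rows = list(read_tweets(io.StringIO(SAMPLE)))
print(len(rows), rows[0]["Retweet-Count"])
```

For the real file, an open file handle (for example `open(path, newline="", encoding="utf-8")`) would replace the in-memory sample; the encoding is also an assumption.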
The Search Twitter operator of the RapidMiner software was used to collect tweets via the Twitter API. Because the keywords used changed over the collection period and because of some technical constraints, the number of tweets collected per day was lower than normal on some days until 30 March. After 30 March, all Turkish tweets about Covid-19 were collected continuously. The details are explained below.
Data collection started on 17 March. The Twitter API used imposes an upper limit of 10,000 tweets per search and only returns tweets from the past week. The oldest tweets that could be collected therefore date from 11 March. The first cases in Turkey were also detected on 11 March, so tweets have been collected since the first cases were detected in Turkey.
RapidMiner data mining software was used to collect the data, and a maximum of 10,000 tweets were collected for each day from 11 March until 17 March. Once the past week's data had been collected in this way, the 10,000 most recent tweets were retrieved at twenty-minute intervals (the Twitter API can be used at 15-minute intervals; 5 more minutes were added as a precaution). Thus, as long as no more than 10,000 tweets were posted within 20 minutes, all tweets could be gathered. When fewer than 10,000 tweets were sent within 20 minutes, the same tweets were retrieved repeatedly in successive iterations. For this reason, duplicate records were deleted using the tweet ID. The RapidMiner Turbo Prep application was used for this step.
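The de-duplication step described above was done in RapidMiner Turbo Prep. The Python sketch below is only a stand-in showing the same idea: overlapping polls are merged and the first copy of each tweet ID is kept. The function name `dedupe_by_id` and the sample polls are hypothetical. The last two constants work out the upper bound implied by the polling schedule (10,000 tweets per poll, one poll every 20 minutes).

```python
def dedupe_by_id(batches):
    """Merge overlapping poll results, keeping the first copy of each tweet ID."""
    seen = set()
    unique = []
    for batch in batches:
        for tweet in batch:
            if tweet["Id"] not in seen:
                seen.add(tweet["Id"])
                unique.append(tweet)
    return unique

# Two hypothetical polls, 20 minutes apart, that overlap on tweet 1002.
poll_1 = [{"Id": "1001"}, {"Id": "1002"}]
poll_2 = [{"Id": "1002"}, {"Id": "1003"}]
merged = dedupe_by_id([poll_1, poll_2])
print([t["Id"] for t in merged])

# Upper bound implied by the schedule: one poll of up to 10,000 tweets every 20 minutes.
POLLS_PER_DAY = 24 * 60 // 20
MAX_TWEETS_PER_DAY = 10_000 * POLLS_PER_DAY
print(POLLS_PER_DAY, MAX_TWEETS_PER_DAY)
```

Under this schedule, tweets are only lost on days when more than 10,000 matching tweets are posted within a single 20-minute window.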
Turkish tweets containing the word "Corona" were collected from 11 March, while those containing the word "Covid" were collected only after 16 March. As the words "Kovid" and "Korona" came into widespread use, from 30 March all Turkish tweets containing at least one of the 4 keywords were collected using the search phrase "Covid OR Kovid OR Korona OR Corona".
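The search phrase above can be built from the keyword list, and a similar filter can be applied to any tweet text held locally. The sketch below is an approximation: it uses case-insensitive substring matching, which is an assumption and may not reproduce the Twitter search semantics exactly. The names `KEYWORDS`, `SEARCH_PHRASE`, and `mentions_keyword` are illustrative only.

```python
import re

# The four keywords named in the dataset description.
KEYWORDS = ["Covid", "Kovid", "Korona", "Corona"]

# Reproduces the search phrase quoted in the description.
SEARCH_PHRASE = " OR ".join(KEYWORDS)

# Case-insensitive substring match on any keyword (an approximation).
_PATTERN = re.compile("|".join(map(re.escape, KEYWORDS)), re.IGNORECASE)

def mentions_keyword(text):
    """Return True if the text contains at least one of the four keywords."""
    return _PATTERN.search(text) is not None

print(SEARCH_PHRASE)
print(mentions_keyword("Korona vakaları artıyor"))
print(mentions_keyword("Bugün hava güzel"))
```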
The original CSV data file was compressed with WinRAR for easier upload and download. The compressed file size is 76 MB.
This data can be used for text mining tasks such as topic modelling and sentiment analysis.
Dataset Files
- Covid19_TR_Tweets_6May_ID_Only.zip (81.01 MB)
Comments
Thanks your effort
Thanks a lot dear İbrahim Sabuncu and Zeynep Yurek