
This India-specific COVID-19 tweets dataset has been developed using the large-scale Coronavirus (COVID-19) Tweets Dataset, which currently contains more than 700 million COVID-19-specific English-language tweets. This dataset contains tweets originating from India during the first week of each of the four phases of the nationwide lockdown initiated by the Government of India.


The zipped files contain .db (SQLite database) files. Each .db file has a table 'geo'. To hydrate the IDs you can import the .db file as a pandas dataframe and then export it to .CSV or .TXT for hydration. For more details on hydrating the IDs, please visit the primary dataset page.

import sqlite3

import pandas as pd

conn = sqlite3.connect('/path/to/the/db/file')

data = pd.read_sql("SELECT tweet_id FROM geo", conn)

data.to_csv('tweet_ids.csv', index=False)    # export the IDs for hydration


This dataset gives a cursory glimpse at the overall sentiment trend of the public discourse regarding the COVID-19 pandemic on Twitter. The live scatter plot of this dataset is available as The Overall Trend block at https://live.rlamsal.com.np. The trend graph reveals multiple peaks and drops that need further analysis. The n-grams during those peaks and drops can prove beneficial for better understanding the discourse.


The TXT files in this dataset can be used in generating the trend graph. The peaks and drops in the trend graph can be made more meaningful by computing n-grams for those periods. To compute the n-grams, the tweet IDs of the Coronavirus (COVID-19) Tweets Dataset should be hydrated to form a tweets corpus.

Pseudo-code for generating a similar trend dataset

current = int(time.time()*1000)     #Twitter provides timestamps in milliseconds

off = 600*1000    #we're looking for 10-minute (600-second) average data (offset)

past = current - off     #timestamp 10 minutes before the current time

df = select the most recent 60,000 rows    #even at 100 tweets per second, the number of tweets cannot exceed this count in a 10-minute interval

new_df = df[df.unix > past]     #here "unix" is the timestamp column name in the primary tweets dataset

avg_sentiment = new_df["sentiment"].mean()    #mean sentiment over the window

store current, avg_sentiment into a database
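The pseudo-code above can be sketched as runnable Python. The sketch below uses an in-memory SQLite database with synthetic rows; the table and column names ("geo", "unix", "sentiment") follow the primary dataset description but may differ in practice, and the "trend" table is a hypothetical store for the output.

```python
import sqlite3
import time

import pandas as pd

# Self-contained sketch: in-memory database with synthetic rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE geo (unix INTEGER, sentiment REAL)")

current = int(time.time() * 1000)              # timestamps are in milliseconds
rows = [(current - 1_000, 0.5),                # 1 s ago  -> inside the window
        (current - 2_000, 0.7),                # 2 s ago  -> inside the window
        (current - 700_000, 0.1)]              # ~11.7 min ago -> outside
conn.executemany("INSERT INTO geo VALUES (?, ?)", rows)

off = 600 * 1000                               # 10-minute (600-second) offset
past = current - off                           # start of the averaging window

# Even at 100 tweets/second, 10 minutes cannot exceed 60,000 rows.
df = pd.read_sql(
    "SELECT unix, sentiment FROM geo ORDER BY unix DESC LIMIT 60000", conn)

new_df = df[df.unix > past]                    # keep only the last 10 minutes
avg_sentiment = new_df["sentiment"].mean()     # mean sentiment for the window

# Persist the (timestamp, average) pair for the trend graph.
conn.execute("CREATE TABLE IF NOT EXISTS trend (ts INTEGER, avg REAL)")
conn.execute("INSERT INTO trend VALUES (?, ?)", (current, float(avg_sentiment)))
conn.commit()
```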

Pseudo-code for extracting the top 100 unigrams and bigrams from a tweet corpus

import re

import nltk
from collections import Counter

# run nltk.download('stopwords') once if the stopword list is not installed

#loading a tweet corpus
with open("/path/to/the/tweets/corpus", "r", encoding="UTF-8") as myfile:
    data = myfile.read().replace('\n', ' ')

#preprocess the data (adapt these find-and-replace regular expressions to your corpus)
data = re.sub(r"[^a-z0-9#@' ]", " ", data.lower())

data = data.split()

stopwords = nltk.corpus.stopwords.words('english')

#removing stopwords
clean_data = [w for w in data if w not in stopwords]

#extracting the top 100 n-grams
unigram = Counter(clean_data)
unigram_top = unigram.most_common(100)

bigram = Counter(zip(clean_data, clean_data[1:]))
bigram_top = bigram.most_common(100)
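Counter keys the bigrams as (w1, w2) tuples; for reporting, the pairs can be joined into readable phrases. A minimal sketch, with a toy token list standing in for the cleaned corpus:

```python
from collections import Counter

# Toy token list standing in for the cleaned corpus (clean_data above).
clean_data = ["covid", "cases", "covid", "cases", "covid", "vaccine"]

bigram = Counter(zip(clean_data, clean_data[1:]))
bigram_top = bigram.most_common(100)

# Join each (w1, w2) tuple into a single readable phrase.
phrases = [(" ".join(pair), count) for pair, count in bigram_top]
```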


Recently, the coronavirus pandemic has made the use of facial masks and respirators common: the former reduce the likelihood of spreading saliva droplets, while the latter serve as Personal Protective Equipment (PPE). This has created problems for existing face detection algorithms. For this reason, and to support more sophisticated systems able to recognize the type of facial mask or respirator worn and react to this information, we created the Facial Masks and Respirators Database (FMR-DB).


For reasons related to the copyright of the images, we cannot publish the entire database here. If you are a student, a professor, or a researcher and you want to use it for research purposes, send an email to antonio.marceddu@polito.it attaching the license, duly completed, which you can find here on IEEE DataPort.



The dataset corresponds to a survey of students and professors of the introductory Biological Engineering course at the Department of Biological Engineering, University of the Republic, Uruguay.


The dataset is meant purely for academic and non-commercial use.

For queries, please consult the corresponding author (Parag Chatterjee, paragc@ieee.org).


Urban informatics and social geographic computing, spatial and temporal big data processing and spatial measurement, map service and natural language processing.




This dataset has the following data about the COVID-19 pandemic in the State of Maranhão, Brazil:

  • Number of daily cases
  • Number of daily deaths

In addition, this dataset contains Google Trends data on pandemic-related topics, based on searches carried out in the State of Maranhão.

The data follow a timeline that begins on March 20, 2020, the date of the first case of COVID-19 in the State of Maranhão, and ends on July 9, 2020.


The last decade saw a number of pandemics [1]. The current outbreak of COVID-19 is creating havoc globally. The daily incidences of COVID-19 from 11 January 2020 to 9 May 2020 were collected from the official COVID-19 dashboard of the World Health Organization (WHO) [2], i.e. https://covid19.who.int/explorer. The data are supplemented with the population of each country, from which the Case Fatality Rate, Basic Attack Rate (BAR), and Household Secondary Attack Rate (HSAR) are computed for all countries.


The data will be used by epidemiologists, statisticians, and data scientists for assessing the global risk of COVID-19, and can serve as a model to predict the case fatality rate along with the possible spread of the disease and its attack rate. The data were originally in raw format; a detailed analysis was carried out from an epidemiological point of view, and a datasheet was prepared through the identification of the risk factor in a defined population. The daily incidences were compiled in Excel 2016 to create a database, which was updated with the population of each country so that the Case Fatality Rate, Basic Attack Rate (BAR), and Household Secondary Attack Rate (HSAR) could be computed for all countries.
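As a sketch of the computations described above, using the standard definitions (case fatality rate as deaths over confirmed cases, attack rate as cases over the population at risk; the exact BAR and HSAR formulas used in the datasheet may differ, and the figures below are illustrative only):

```python
import pandas as pd

# Toy figures; the actual datasheet is compiled from WHO daily incidence data.
df = pd.DataFrame({
    "country": ["A", "B"],
    "cases": [1000, 250],
    "deaths": [50, 10],
    "population": [1_000_000, 500_000],
})

# Case fatality rate (%): deaths among confirmed cases.
df["cfr"] = df["deaths"] / df["cases"] * 100

# Basic attack rate (%): cases relative to the population at risk.
df["bar"] = df["cases"] / df["population"] * 100
```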



A set of chest CT datasets from multi-centre hospitals, comprising five categories.


We present GeoCoV19, a large-scale Twitter dataset related to the ongoing COVID-19 pandemic. The dataset was collected over a period of 90 days, from February 1 to May 1, 2020, and consists of more than 524 million multilingual tweets. As geolocation information is essential for many tasks such as disease tracking and surveillance, we employed a gazetteer-based approach to extract toponyms from the user location and tweet content and derive geolocation information at different granularity levels, using Nominatim (OpenStreetMap) data. In terms of geographical coverage, the dataset spans 218 countries and 47K cities in the world. The tweets in the dataset are from more than 43 million Twitter users, including around 209K verified accounts. These users posted tweets in 62 different languages.


GeoCoV19 Dataset Description 

The GeoCoV19 Dataset comprises several TAR files, which contain zip files representing daily data. Each zip file contains a JSON with the following format:

{
  "tweet_id": "122365517305623353",
  "created_at": "Sat Feb 01 17:11:42 +0000 2020",
  "user_id": "335247240",
  "geo_source": "user_location",
  "user_location": { "country_code": "br" },
  "geo": {},
  "place": {},
  "tweet_locations": [
    { "country_code": "it", "state": "Trentino-Alto", "county": "Pustertal - Val Pusteria" },
    { "country_code": "us" },
    { "country_code": "ru", "state": "Voronezh Oblast", "county": "Petropavlovsky District" },
    { "country_code": "at", "state": "Upper Austria", "county": "Braunau am Inn" },
    { "country_code": "it", "state": "Trentino-Alto", "county": "Pustertal - Val Pusteria" },
    { "country_code": "cn" },
    { "country_code": "in", "state": "Himachal Pradesh", "county": "Jubbal" }
  ]
}

Description of all the fields in the above JSON 

Each JSON in the Geo file has the following eight keys:

1. tweet_id: the Twitter-provided ID of the tweet

2. created_at: the Twitter-provided "created_at" date and time in UTC

3. user_id: the Twitter-provided user ID

4. geo_source: this field takes one of four values: (i) coordinates, (ii) place, (iii) user_location, or (iv) tweet_text. The value depends on the availability of these fields; priority is given to the most accurate field available, in the order coordinates, place, user_location, tweet_text. For instance, when a tweet has GPS coordinates, the value will be "coordinates" even if all other location fields are present. If a tweet has no GPS, place, or user_location information, the value will be "tweet_text" whenever there is a location mention in the tweet text.

The remaining keys can have the following location_json inside them. Sample location_json: {"country_code":"us","state":"California","county":"San Francisco","city":"San Francisco"}. Depending on the available granularity, country_code, state, county or city keys can be missing in the location_json.

5. user_location: It can have a "location_json" as described above or an empty JSON {}. This field uses the "location" profile meta-data of a Twitter user and represents the user declared location in the text format. We resolve the text to a location.

6. geo: represents the "geo" field provided by Twitter. We resolve the provided latitude and longitude values to locations. It can have a "location_json" as described above or an empty JSON {}.

7. tweet_locations: This field can have an array of "location_json" as described above [location_json1, location_json2] or an empty array []. This field uses the tweet content (i.e., actual tweet message) to find toponyms. A tweet message can have several mentions of different locations (i.e., toponyms). That is why we have an array of locations representing all those toponyms in a tweet. For instance, in a tweet like "The UK has over 65,000 #COVID19 deaths. More than Qatar, Pakistan, and Norway.", there are four location mentions. Our tweet_locations array should represent these four separately.

8. place: It can have a "location_json" described above or an empty JSON {}. It represents the Twitter-provided "place" field.
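A minimal sketch of reading one such JSON record and pulling out the geolocation fields described above (the record below is abridged from the sample; the key names are as documented, and .get() is used because country_code can be absent at coarser granularity):

```python
import json

# One record in the documented format (abridged from the sample above).
record = json.loads("""{
  "tweet_id": "122365517305623353",
  "geo_source": "user_location",
  "user_location": {"country_code": "br"},
  "geo": {},
  "place": {},
  "tweet_locations": [
    {"country_code": "it", "state": "Trentino-Alto"},
    {"country_code": "us"}
  ]
}""")

# Collect the country codes of all toponyms mentioned in the tweet text.
mentioned = [loc.get("country_code") for loc in record["tweet_locations"]]

# The user-declared profile location, resolved to a country code.
user_country = record["user_location"].get("country_code")
```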


Tweet hydrators:

CrisisNLP (Java): https://crisisnlp.qcri.org/#resource8

Twarc (Python): https://github.com/DocNow/twarc#dehydrate

Docnow (Desktop application): https://github.com/docnow/hydrator

If you have any doubts or questions, feel free to contact us at uqazi@hbku.edu.qa or mimran@hbku.edu.qa.