MCData was designed and produced for mouth cavity detection and segmentation. It can be used to train and test mouth cavity instance segmentation networks. To the best of the authors’ knowledge, it is the first available dataset for detecting and segmenting the main components of the mouth cavity.


The BIMCV-COVID19+ dataset is a large dataset of chest X-ray images (CXR; CR, DX) and computed tomography (CT) images of COVID-19 patients, along with their radiographic findings, pathologies, polymerase chain reaction (PCR), immunoglobulin G (IgG) and immunoglobulin M (IgM) diagnostic antibody test results, and radiographic reports from the Medical Imaging Databank in the Valencian Region Medical Image Bank (BIMCV).


Once all the compressed files have been downloaded, use for their correct decompression. For more information, see the links on this page.


Automatic detection of COVID-19 and community-acquired pneumonia on CT images with artificial intelligence


This data resource is an outcome of the NSF RAPID project titled "Democratizing Genome Sequence Analysis for COVID-19 Using CloudLab", awarded to the University of Missouri-Columbia.

The resource contains the output of variant analysis (along with CADD scores) on human genome sequences obtained from the COVID-19 Data Portal. The variants include single nucleotide polymorphisms (SNPs) and short insertions and deletions (indels).


1. Download a .zip file.

2. Unzip the file and extract it into a folder. 

3. There will be two folders, VCF and CADD_Scores, containing the compressed .vcf and .tsv files. The .vcf files are filtered VCF files produced by the GATK best-practice workflow for RNA-seq data, using the hg19 reference genome. There is also a .xlsx file containing the run accession IDs (e.g., SRR12095153) and the URLs from which the paired-end sequences were downloaded. A complete description of the sequences can be found via these URLs.

4. Check for new .zip files.
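
Once the files are extracted, the variants in a .vcf file can be inspected with a few lines of Python. The following is a minimal sketch, not part of the resource itself. It assumes the compressed files are gzip archives (the source only says "compressed"), and the helper names `read_vcf_records` and `open_vcf` are hypothetical. The SNP/indel split is a crude length check that ignores multi-nucleotide and symbolic alleles.

```python
import gzip
import io


def read_vcf_records(handle):
    """Yield (chrom, pos, ref, alt, kind) for each data line of a VCF text stream."""
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines, meta lines (##...) and the column header (#CHROM...)
        if not line or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        # A VCF line may list several comma-separated alternate alleles
        for allele in alt.split(","):
            # Crude classification: single-base substitution vs. anything else
            kind = "SNP" if len(ref) == 1 and len(allele) == 1 else "indel"
            yield chrom, int(pos), ref, allele, kind


def open_vcf(path):
    """Open a VCF file as text; assumes gzip compression if the name ends in .gz."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


# Tiny in-memory example with made-up variants (not taken from the resource)
example = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t12345\t.\tA\tG\t50\tPASS\t.\n"
    "chr2\t67890\t.\tCT\tC\t60\tPASS\t.\n"
)
records = list(read_vcf_records(example))
for record in records:
    print(record)
```

For a real file, pass the handle returned by `open_vcf` to `read_vcf_records` instead of the in-memory example.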


The 3DLSC-COVID dataset includes a total of 1,805 3D chest CT scans with more than 570,000 CT slices, collected from 2 standard CT scanners at Liyuan Hospital, i.e., UIH uCT 510 and GE Optima CT600. Among all CT scans, 794 were positive cases of COVID-19, confirmed by clinical symptoms and RT-PCR from January 16 to April 16, 2020.


This dataset consists of news articles related to COVID-19 from newspapers in the UK, India, Japan and South Korea.


The dataset contains data on ICU-transferred (N=100) and Stable (N=131) patients with COVID-19 (N=156) and Non-COVID-19 viral pneumonia (N=75). Among the COVID-19 patients in this study, 82 developed Refractory Respiratory Failure (RRF) or Severe Acute Respiratory Distress Syndrome (SARDS) and were transferred to the Intensive Care Unit (ICU), while 74 had a Stable course of disease and were not transferred to the ICU.
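
The group sizes above imply a full 2x2 breakdown. The short sketch below derives the Non-COVID-19 cells by subtraction; these two values are not stated in the description and are valid only under the assumption that every patient falls into exactly one outcome group and one diagnosis group.

```python
# Totals stated in the dataset description
icu_total, stable_total = 100, 131
covid_total, non_covid_total = 156, 75
covid_icu, covid_stable = 82, 74

# Derived by subtraction (assumption: groups are mutually exclusive and exhaustive)
non_covid_icu = icu_total - covid_icu
non_covid_stable = stable_total - covid_stable

# Consistency checks: rows and columns must add up to the stated totals
assert covid_icu + covid_stable == covid_total
assert non_covid_icu + non_covid_stable == non_covid_total
assert icu_total + stable_total == covid_total + non_covid_total

print("Non-COVID-19, ICU-transferred:", non_covid_icu)
print("Non-COVID-19, Stable:", non_covid_stable)
```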




This repository contains:

  • age-stratified Covid-19 case and fatality data for different countries and at different points in time, and
  • an interactive Jupyter notebook for mediation analysis of age-related causal effects on case fatality rates,

published as part of the following paper:

"Simpson's paradox in Covid-19 case fatality rates: a mediation analysis of age-related causal effects". J von Kügelgen*, L Gresele*, B Schölkopf. (*equal contribution).

We provide the following three separate datasets:

  • a dataset containing only the most recent numbers from: Argentina, China, Colombia, Italy, Netherlands, Portugal, South Africa, Spain, Sweden, Switzerland, South Korea and the Diamond Princess cruise ship (last checked: end of May 2020)
  • a longitudinal dataset containing several reports from Italy (9 March - 26 May 2020)
  • a longitudinal dataset containing several reports from Spain (22 March - 29 May 2020)

All numbers of confirmed cases and fatalities are stratified by age into groups of 10 years (0-9, 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80+). Each record also includes the date and country of reporting, as well as links to the corresponding sources (generally health agencies/ministries, or scientific publications).

Please consult the paper and notebook for further details.
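
To illustrate why age stratification matters, the sketch below computes case fatality rates (fatalities divided by confirmed cases) per age group and overall for two fictional countries. All counts are made up for illustration and are not taken from the datasets. The function names are hypothetical, and this is not the mediation analysis from the paper, only the basic arithmetic behind Simpson's paradox.

```python
def cfr(cases, deaths):
    """Case fatality rate: fatalities as a fraction of confirmed cases."""
    return deaths / cases


def summarize(records):
    """Return (per-age-group CFRs, overall CFR) for {age_group: (cases, deaths)}."""
    per_age = {age: cfr(c, d) for age, (c, d) in records.items()}
    total_cases = sum(c for c, _ in records.values())
    total_deaths = sum(d for _, d in records.values())
    return per_age, cfr(total_cases, total_deaths)


# Hypothetical counts: country A's cases are concentrated in the older group
country_a = {"30-39": (100, 1), "70-79": (900, 90)}
country_b = {"30-39": (900, 18), "70-79": (100, 12)}

per_age_a, overall_a = summarize(country_a)
per_age_b, overall_b = summarize(country_b)

for age in country_a:
    print(age, f"A: {per_age_a[age]:.1%}", f"B: {per_age_b[age]:.1%}")
print("overall", f"A: {overall_a:.1%}", f"B: {overall_b:.1%}")
```

In this made-up example, country A has the lower rate within each age group but the higher rate overall, because its cases are concentrated in the older group.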


Leaderboard (numbers are kW MAE):

Teams with more than 5 missing submissions are eliminated from the leaderboard.


Last Updated On: 
Mon, 05/10/2021 - 20:31

This India-specific COVID-19 tweets dataset has been developed using the large-scale Coronavirus (COVID-19) Tweets Dataset, which currently contains more than 700 million COVID-19 specific English language tweets. This dataset contains tweets originating from India during the first week of each of the four phases of the nationwide lockdown initiated by the Government of India.


The zipped files contain .db (SQLite database) files. Each .db file has a table 'geo'. To hydrate the IDs you can import the .db file as a pandas dataframe and then export it to .CSV or .TXT for hydration. For more details on hydrating the IDs, please visit the primary dataset page.

import sqlite3
import pandas as pd

# Open the database and load the tweet IDs from the 'geo' table
conn = sqlite3.connect('/path/to/the/db/file')
data = pd.read_sql("SELECT tweet_id FROM geo", conn)
conn.close()
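
The export step mentioned above can also be done without pandas. The following is a minimal standard-library sketch that writes the IDs to a .txt file, assuming the hydration tool expects one tweet ID per line. The function name `export_tweet_ids`, the file names and the demo table contents are hypothetical; only the table name 'geo' and the column 'tweet_id' come from the description.

```python
import sqlite3
import tempfile
from pathlib import Path


def export_tweet_ids(db_path, out_path):
    """Write every tweet_id in the 'geo' table to out_path, one ID per line."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("SELECT tweet_id FROM geo").fetchall()
    finally:
        conn.close()
    with open(out_path, "w", encoding="utf-8") as fh:
        for (tweet_id,) in rows:
            fh.write(f"{tweet_id}\n")
    return len(rows)


# Demo on a throwaway database with made-up IDs (the real 'geo' table
# may contain additional columns)
workdir = Path(tempfile.mkdtemp())
demo_db = workdir / "demo.db"
conn = sqlite3.connect(demo_db)
conn.execute("CREATE TABLE geo (tweet_id INTEGER)")
conn.executemany("INSERT INTO geo VALUES (?)", [(1001,), (1002,), (1003,)])
conn.commit()
conn.close()

count = export_tweet_ids(demo_db, workdir / "ids.txt")
exported = (workdir / "ids.txt").read_text(encoding="utf-8").split()
print(count, "IDs written")
```

For a real file, call `export_tweet_ids` with the path to a downloaded .db file in place of the demo database.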