Variant Analysis of Human Genome Sequences for COVID-19 Research

Citation Author(s):
Praveen Rao, University of Missouri-Columbia
Arun Zachariah, University of Missouri-Columbia
Deepthi Rao, University of Missouri-Columbia
Peter Tonellato, University of Missouri-Columbia
Wesley Warren, University of Missouri-Columbia
Eduardo Simoes, University of Missouri-Columbia
Submitted by: Praveen Rao
Last updated: Sat, 12/04/2021 - 12:39
DOI: 10.21227/b0ph-s175

Abstract 

This data resource is an outcome of the NSF RAPID project titled "Democratizing Genome Sequence Analysis for COVID-19 Using CloudLab", awarded to the University of Missouri-Columbia.

The resource contains the output of variant analysis (along with CADD scores) on human genome sequences obtained from the COVID-19 Data Portal. The variants include single nucleotide polymorphisms (SNPs) and short insertions and deletions (indels).

For variant analysis, we used the GATK Best Practices workflow for RNA-seq data published by the Broad Institute. This workflow was executed on CloudLab, an NSF-funded experimental testbed.

We will periodically release the variant analysis output for human genome sequences. More sequences are also being made available on the COVID-19 Data Portal. Please visit this page regularly for updates.

If you have comments or questions, please post them in the comments section below.

Acknowledgments

This work is supported by the National Science Foundation under Grant No. 2034247.

Instructions: 

1. Download a .zip file.

2. Extract the .zip file into a folder.

3. The extracted contents include two folders, VCF and CADD_Scores, which contain the compressed .vcf and .tsv files. The .vcf files are the filtered VCF files produced by the GATK Best Practices workflow for RNA-seq data, with hg19 as the reference genome. There is also an .xlsx file listing the run accession IDs (e.g., SRR12095153) and the URLs (e.g., https://www.ebi.ac.uk/ena/browser/view/SRR12095153) from which the paired-end sequences were downloaded. A complete description of each sequence can be found via these URLs.

4. Check for new .zip files.
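
As a quick sanity check after extracting the files in step 3, the SNP and indel records in a single VCF file can be tallied from the command line. The following is a minimal sketch and not part of the dataset's tooling. The function name count_variant_types is hypothetical, and the heuristic is a simplification: a record counts as a SNP when REF and every ALT allele are exactly one character long, and everything else is counted as an indel.

```shell
# Count SNP and indel records in VCF text read from standard input.
# Heuristic (a simplification): a record is a SNP when REF and every ALT
# allele are exactly one character long; all other records count as indels.
count_variant_types() {
  awk -F'\t' '
    /^#/ { next }                          # skip meta and header lines
    {
      n = split($5, alts, ",")             # column 5 = ALT, comma-separated
      is_snp = (length($4) == 1)           # column 4 = REF
      for (i = 1; i <= n; i++)
        if (length(alts[i]) != 1) is_snp = 0
      if (is_snp) snps++; else indels++
    }
    END { printf "snps=%d indels=%d\n", snps, indels }
  '
}
```

Assuming the files are gzip-compressed (the instructions only say "compressed", so both the extension and the path here are assumptions), a file could be checked with: gzip -dc VCF/SRR13113910.unmapped.variant_filtered.vcf.gz | count_variant_types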

Comments

CADD score files have been updated.

Submitted by Praveen Rao on Mon, 04/12/2021 - 21:49

If you would like to perform variant annotation on the VCF files, please use SnpEff (http://pcingola.github.io/SnpEff/). See https://pcingola.github.io/SnpEff/se_running/ for instructions on downloading and running it.

Here is an example of how to annotate variants using snpEff.jar:

$ java -Xmx8g -jar snpEff.jar hg19 SRR13113910.unmapped.variant_filtered.vcf > SRR13113910.unmapped.variant_filtered_ann_hg19.vcf

Submitted by Praveen Rao on Mon, 11/01/2021 - 14:56
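
The snpEff command in the comment above annotates one file at a time. To annotate every file in a folder, the same command can be wrapped in a loop. The sketch below is one possible way to do this, not the authors' procedure. It assumes the .vcf files have already been decompressed; the SNPEFF_JAR variable and the function names annotated_name and annotate_all are hypothetical.

```shell
# Path to snpEff.jar (hypothetical default; adjust to your installation).
SNPEFF_JAR="${SNPEFF_JAR:-snpEff.jar}"

# Derive the output name from an input VCF name: X.vcf -> X_ann_hg19.vcf,
# following the naming used in the single-file example.
annotated_name() {
  printf '%s_ann_hg19.vcf\n' "${1%.vcf}"
}

# Annotate every uncompressed .vcf file in a directory (default: VCF).
# Each output file is written next to its input file.
annotate_all() {
  dir="${1:-VCF}"
  for vcf in "$dir"/*.vcf; do
    [ -e "$vcf" ] || continue                        # glob matched nothing
    case "$vcf" in *_ann_hg19.vcf) continue ;; esac  # skip earlier outputs
    java -Xmx8g -jar "$SNPEFF_JAR" hg19 "$vcf" > "$(annotated_name "$vcf")"
  done
}
```

After defining these functions, running annotate_all VCF would process each .vcf file in the VCF folder with the same hg19 database and memory setting as the single-file command.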
