MinION Nanopore Sequencing Data and SNV and Indel Variant Calls Obtained Using BEI Resources' Metrology Standard RNA for Zaire Mayinga Ebola Virus

Citation Author(s):
Robert Boissy, SeqStream PBC
Marilynn Larson, University of Nebraska Medical Center
Steve Hinrichs, University of Nebraska Medical Center
Submitted by:
Robert Boissy
Last updated:
Tue, 01/28/2020 - 16:35
DOI:
10.21227/79ne-6693

Abstract 

In an infectious disease outbreak, the identification of pathogen genome sequence variants provides epidemiologists with high-resolution transmission diagnostics that can help cluster patients; identify cohorts of individuals who need testing; and identify new variants that may compromise existing vaccines, therapeutics, and low-resolution detection diagnostics. The Oxford Nanopore MinION™ is a uniquely portable nucleic acid sequencing device that has been used in limited-resource settings for this purpose, e.g., during the 2014-2016 outbreak of Ebolavirus (EBOV) disease in Africa. We describe prototype software (metrovar) designed to support the reliable identification of RNA virus transmission diagnostic variants (TDV) in nanopore sequencing data. We obtained single-molecule nanopore sequencing reads spanning a 4-kbp RT-PCR amplicon from the EBOV RNA polymerase (L) gene using a clonal EBOV RNA metrology standard from BEI Resources that is traceable to a cognate RefSeq genome (Zaire Mayinga, NCBI accession NC_002549). Metrovar features a protocol graph for compute task parallelism, and a periodic “sweep and tranche” strategy for concurrent real-time nucleotide sequence variant detection using uncorrected or corrected reads from multiple MinION devices, multiple samples, multiple amplicons, and up to 49 (7x7) different combinations of nucleotide sequence aligner and variant caller. Nanopore sequencing read base-call error correction (using the LoRDEC software from Prof. Eric Rivals’ research group) was found to reduce NC_002549 alignment mismatch and indel error rates by up to 40- and 9-fold, respectively. We defined three Makona-strain-relevant “divergence sets” (1, 2, or 3% mismatch and 0.5% indel vis-à-vis NC_002549) during the creation of non-cognate (mock) reference sequences and TDV call truthsets, which enabled MPG to optimize TDV filtering criteria selection and threshold value determination (training), and independent confirmation (checking).
For each of these three divergence sets (n=60), we obtained apparent F-measure optima for single-nucleotide and indel TDV calls, as documented in the detailed and summary files found within this dataset. This strategy enables investigators and public health responders to run Metrovar’s training and checking modes in a better-equipped central laboratory to establish rigorous, metrology standard-based variant call filtering parameter values (and determine their preferred combination of aligner and variant caller software), and thereby be prepared in advance to deploy and use Metrovar’s scoring mode to call variants accurately and efficiently in limited-resource settings during an infectious disease outbreak. Metrovar supports many important features, including: the use of SeqAn’s mason_variator software to automate the creation of mock reference sequences for an amplicon of interest; the logging of all input parameter values in machine-readable JSON files for reproducibility; the logging of intermediate output summary files for debugging; efficient, automatic sample and amplicon barcode trimming (using SeqAn's flexbar software); nanopore sequence read depth normalization; optional read error correction; auto-tuned sequence alignment parameters (using the last-train software from Prof. Martin Frith’s research group); and auto-tuned variant call criteria selection and variant call criteria filtering (using Real-Time Genomics’ vcfeval software).
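
The training step described above, in which a filtering threshold is chosen to maximize the F-measure of variant calls against a truthset, can be illustrated with a minimal Python sketch. It is not Metrovar's code and does not use vcfeval; the function names, the variant keys, the quality scores, and the candidate thresholds are all hypothetical and serve only to show how such an optimum might be found.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-measure from true-positive, false-positive, and false-negative counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def best_threshold(calls, truth, thresholds):
    """Return (threshold, F-measure) for the candidate threshold with the highest F-measure.

    calls: dict mapping a variant key (position, ref, alt) to a quality score.
    truth: set of variant keys known to be present (the truthset).
    """
    best = None
    for t in thresholds:
        # Keep only calls whose score passes the candidate threshold.
        kept = {v for v, q in calls.items() if q >= t}
        tp = len(kept & truth)   # called and in the truthset
        fp = len(kept - truth)   # called but not in the truthset
        fn = len(truth - kept)   # in the truthset but filtered out or never called
        score = f_measure(tp, fp, fn)
        if best is None or score > best[1]:
            best = (t, score)
    return best


# Hypothetical toy data: four calls with made-up scores, three true variants.
calls = {
    (101, "A", "G"): 50,
    (250, "C", "T"): 12,
    (377, "G", "GA"): 35,
    (512, "T", "C"): 8,
}
truth = {(101, "A", "G"), (377, "G", "GA"), (900, "C", "A")}

threshold, score = best_threshold(calls, truth, thresholds=[0, 10, 20, 40])
print(threshold, round(score, 3))
```

With this toy data, raising the threshold first removes false positives and then starts to discard true variants, so the F-measure peaks at an intermediate value. The checking mode described above corresponds to confirming a threshold chosen this way on independent data.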

Instructions: 

Multiple README files are found within the compressed archives in this dataset.  Most files are self-explanatory for biomedical research scientists who are familiar with the analysis of variants in nucleotide sequence data.

Dataset Files
