Adaptive Conformation Sampling Dataset Zaman_TCBB21

Citation Author(s):
Ahmed Bin Zaman (George Mason University)
Toki Tahmid Inan (George Mason University)
Kenneth De Jong (George Mason University)
Amarda Shehu (George Mason University)

Submitted by: Ahmed Bin Zaman
Last updated: Fri, 11/19/2021 - 22:05
DOI: 10.21227/jff2-x257
Data Format: .zip

Keywords:

protein structure

conformation sampling

stochastic optimization

Abstract

The characterization of protein three-dimensional structure has long been known to be key to a detailed understanding of protein function. Computational approaches to protein structure characterization have largely addressed a narrow formulation of the problem, where the goal is to determine a single structure, known as the native structure, from a given protein amino-acid sequence. However, many researchers have argued over the years for broadening this view to account for the multiplicity of native structures. Our understanding of proteins has become more nuanced, and we now know of many protein molecules that use large motions, often of several angstroms, to switch between different structures. These switches allow them to tune and regulate interactions with diverse molecular partners and so engage in complex cellular reactions. Elucidating such structures de novo is considered an exceptionally difficult problem, as it requires exploring a possibly very large structure space in search of competing, near-optimal energy minima.

This dataset is associated with our paper titled "Adaptive Stochastic Optimization to Improve Protein Conformation Sampling", where we report a novel stochastic optimization method capable of revealing very distinct structures for a given protein from knowledge of its amino-acid sequence. The method leverages evolutionary search techniques and adapts its search of the vast structure space to balance exploration and exploitation under a computational budget. This dataset provides the biologically active conformations of the protein targets used for evaluation, together with the data (sequence and fragment files) needed for conformation sampling. It also includes a benchmark dataset of metamorphic targets for researchers to continue advancing work on this problem. The paper is under review, and we will update the link to the paper once it is published.

The code associated with this dataset can be found at https://github.com/psp-codes/adaptive-conformation-sampling

Instructions:

The .zip file contains 3 folders when unzipped. We provide the details of each folder below.

“monomorphic_benchmark_targets” folder: Contains 20 protein targets organized into 20 subfolders. Data for each protein is provided in a subfolder named with its PDB ID. Each subfolder contains the following 4 files.

A .fasta file containing the amino-acid sequence of the protein.
A .pdb file containing the native tertiary conformation coordinates. The detailed format of a .pdb file is described at http://www.wwpdb.org/documentation/file-format
A .frag3 file containing the fragments of length 3 for the protein sequence generated from http://old.robetta.org/
A .frag9 file containing the fragments of length 9 for the protein sequence generated from http://old.robetta.org/
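
As a starting point for working with one target subfolder, the sketch below groups its files by extension and reads the amino-acid sequence from the .fasta file. It is a minimal example, not part of the released code: the function names `read_fasta` and `target_files` are hypothetical, the file names inside each subfolder are not assumed (only the extensions listed above), and the .fasta file is assumed to hold a single record with an optional ">" header line.

```python
from pathlib import Path


def read_fasta(path):
    """Return (header, sequence) for the first record in a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    break  # a second record starts here; keep only the first
                header = line[1:]
            else:
                chunks.append(line)
    return header, "".join(chunks)


def target_files(target_dir):
    """Group the files in one target subfolder by lowercase extension.

    Assumption: files are identified by extension only
    (.fasta, .pdb, .frag3, .frag9); their base names are not relied on.
    """
    groups = {}
    for path in sorted(Path(target_dir).iterdir()):
        if path.is_file():
            groups.setdefault(path.suffix.lower(), []).append(path)
    return groups
```

For example, `target_files("monomorphic_benchmark_targets/<pdb id>")` would return a dictionary keyed by ".fasta", ".pdb", ".frag3" and ".frag9", and `read_fasta(groups[".fasta"][0])` would give the sequence to sample conformations for.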

“monomorphic_casp_targets” folder: Contains 10 protein targets organized into 10 subfolders. Data for each protein is provided in a subfolder named with its CASP ID. Each subfolder contains the following 4 files.

A .fasta file containing the amino-acid sequence of the protein.
A .pdb file containing the native tertiary conformation coordinates.
A .frag3 file containing the fragments of length 3 for the protein sequence generated from http://old.robetta.org/
A .frag9 file containing the fragments of length 9 for the protein sequence generated from http://old.robetta.org/
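
The native coordinates in a .pdb file can be read directly from its fixed-column ATOM records. The sketch below extracts the C-alpha coordinates of the first model, which is a common starting point for comparing a sampled conformation against the native one. It is an illustrative helper (the name `ca_coordinates` is hypothetical), it relies only on the standard PDB column layout (atom name in columns 13-16, x, y, z in columns 31-54), and it assumes that keeping the blank or "A" alternate location is an acceptable way to avoid duplicate atoms.

```python
def ca_coordinates(pdb_path):
    """Return a list of (x, y, z) tuples for C-alpha atoms in the first model."""
    coords = []
    with open(pdb_path) as handle:
        for line in handle:
            record = line[:6].strip()
            if record == "ENDMDL":
                break  # keep only the first model of a multi-model file
            if record != "ATOM":
                continue  # skips HETATM, so calcium ions named CA are ignored
            if line[12:16].strip() != "CA":
                continue
            if line[16:17] not in (" ", "A"):
                continue  # assumption: keep one alternate location only
            x = float(line[30:38])
            y = float(line[38:46])
            z = float(line[46:54])
            coords.append((x, y, z))
    return coords
```

The number of C-alpha atoms returned may be smaller than the sequence length in the .fasta file if the native structure has unresolved residues, so it is worth comparing the two before computing any per-residue measure.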

“metamorphic_benchmark_targets” folder: Contains 18 pairs of protein targets organized into 18 subfolders. Data for each target pair is provided in a subfolder named with its pair ID (as indicated in the paper). Each subfolder contains the following 5 files.

A .fasta file containing the amino-acid sequence common to the pair of target proteins.
A .pdb file containing the native tertiary conformation coordinates for the first target in the target pair.
A .pdb file containing the native tertiary conformation coordinates for the second target in the target pair.
A .frag3 file containing the fragments of length 3 for the protein sequence generated from http://old.robetta.org/
A .frag9 file containing the fragments of length 9 for the protein sequence generated from http://old.robetta.org/
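
After unzipping, the layout described above can be checked with a short script. The sketch below encodes the folder names, target counts, and per-subfolder file counts stated in these instructions and reports any mismatch. The function name `check_layout` is hypothetical, and the script assumes that the three folders sit directly under the directory passed as `root`; adjust the path if the archive unpacks into an extra top-level folder.

```python
from collections import Counter
from pathlib import Path

MONOMORPHIC = {".fasta": 1, ".pdb": 1, ".frag3": 1, ".frag9": 1}
METAMORPHIC = {".fasta": 1, ".pdb": 2, ".frag3": 1, ".frag9": 1}

# folder name -> (number of subfolders, expected file count per extension)
EXPECTED = {
    "monomorphic_benchmark_targets": (20, MONOMORPHIC),
    "monomorphic_casp_targets": (10, MONOMORPHIC),
    "metamorphic_benchmark_targets": (18, METAMORPHIC),
}


def check_layout(root):
    """Return a list of problems found; an empty list means the layout matches."""
    problems = []
    root = Path(root)
    for folder, (n_targets, per_target) in EXPECTED.items():
        base = root / folder
        if not base.is_dir():
            problems.append(f"missing folder: {folder}")
            continue
        subdirs = sorted(p for p in base.iterdir() if p.is_dir())
        if len(subdirs) != n_targets:
            problems.append(
                f"{folder}: expected {n_targets} subfolders, found {len(subdirs)}"
            )
        for sub in subdirs:
            # Count files by extension, ignoring hidden files such as .DS_Store.
            counts = Counter(
                p.suffix.lower()
                for p in sub.iterdir()
                if p.is_file() and not p.name.startswith(".")
            )
            if dict(counts) != per_target:
                problems.append(f"{folder}/{sub.name}: unexpected files {dict(counts)}")
    return problems
```

Calling `check_layout(".")` from the directory that holds the three folders returns an empty list when every subfolder has the expected files, and one message per mismatch otherwise.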