First Name: Ahmed Bin
Last Name: Zaman

Datasets & Competitions

We have long known that the characterization of protein three-dimensional structure is key to obtaining a detailed understanding of protein function. Computational approaches to protein structure characterization have largely addressed a narrow formulation of the problem, where the goal is the determination of one structure, also known as the native structure, from a given protein amino-acid sequence. However, many researchers over the years have argued for broadening our view of proteins to account for the multiplicity of native structures.

Instructions: 

The .zip file contains 3 folders when unzipped. We provide the details of each folder below.

 

“monomorphic_benchmark_targets” folder: Contains 20 protein targets organized into 20 subfolders. Data for each protein is provided in a subfolder named with its PDB ID. Each such subfolder contains the following 4 files.

  1. A .fasta file containing the amino-acid sequence of the protein.

  2. A .pdb file containing the native tertiary conformation coordinates. The detailed format of a .pdb file is documented at http://www.wwpdb.org/documentation/file-format

  3. A .frag3 file containing the fragments of length 3 for the protein sequence generated from http://old.robetta.org/

  4. A .frag9 file containing the fragments of length 9 for the protein sequence generated from http://old.robetta.org/
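The per-target layout above can be walked programmatically. The following is a minimal Python sketch, not part of the dataset: the function names `read_fasta`, `collect_target_files`, and `iter_targets` are hypothetical, and because the description does not state the file names inside each subfolder, the sketch groups files by extension only.

```python
from pathlib import Path


def read_fasta(path):
    """Return (header, sequence) for the first record in a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    break  # stop at the start of a second record
                header = line[1:]
            else:
                chunks.append(line)
    return header, "".join(chunks)


def collect_target_files(target_dir):
    """Group the files of one target subfolder by lowercase extension."""
    groups = {}
    for path in sorted(Path(target_dir).iterdir()):
        if path.is_file():
            groups.setdefault(path.suffix.lower(), []).append(path)
    return groups


def iter_targets(dataset_dir):
    """Yield (target_id, files_by_extension) for every subfolder."""
    for sub in sorted(Path(dataset_dir).iterdir()):
        if sub.is_dir():
            yield sub.name, collect_target_files(sub)
```

Calling `iter_targets` on the unzipped “monomorphic_benchmark_targets” folder should yield one entry per target, each with `.fasta`, `.pdb`, `.frag3`, and `.frag9` keys if the layout matches the description.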

 

“monomorphic_casp_targets” folder: Contains 10 protein targets organized into 10 subfolders. Data for each protein is provided in a subfolder named with its CASP ID. Each such subfolder contains the following 4 files.

  1. A .fasta file containing the amino-acid sequence of the protein.

  2. A .pdb file containing the native tertiary conformation coordinates.

  3. A .frag3 file containing the fragments of length 3 for the protein sequence generated from http://old.robetta.org/

  4. A .frag9 file containing the fragments of length 9 for the protein sequence generated from http://old.robetta.org/

 

“metamorphic_benchmark_targets” folder: Contains 18 pairs of protein targets organized into 18 subfolders. Data for each target pair is provided in a subfolder named with its pair ID (as indicated in the paper). Each such subfolder contains the following 5 files.

  1. A .fasta file containing the amino-acid sequence common to the pair of target proteins.

  2. A .pdb file containing the native tertiary conformation coordinates for the first target in the target pair.

  3. A .pdb file containing the native tertiary conformation coordinates for the second target in the target pair.

  4. A .frag3 file containing the fragments of length 3 for the protein sequence generated from http://old.robetta.org/

  5. A .frag9 file containing the fragments of length 9 for the protein sequence generated from http://old.robetta.org/
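Since each metamorphic subfolder holds two native .pdb files for one sequence, a first step in comparing them is to extract coordinates from both. The sketch below is one hedged way to do that in Python: it reads C-alpha positions from the fixed-column `ATOM` records defined by the wwPDB format. The function name `read_ca_coordinates` is hypothetical, and keeping only the first model and the first alternate location is an assumption of this sketch, not something the dataset description specifies.

```python
def read_ca_coordinates(pdb_path):
    """Return a list of (x, y, z) tuples for C-alpha atoms in a PDB file.

    Assumptions of this sketch: only the first MODEL is read, and for
    atoms with alternate locations only the blank or 'A' entry is kept.
    """
    coords = []
    with open(pdb_path) as handle:
        for line in handle:
            if line.startswith("ENDMDL"):
                break  # stop after the first model
            if not line.startswith("ATOM  "):
                continue
            # Columns 13-16 hold the atom name, column 17 the altLoc flag.
            if line[12:16].strip() != "CA":
                continue
            if line[16:17] not in (" ", "A"):
                continue
            # Columns 31-38, 39-46, 47-54 hold x, y, z in Angstroms.
            x = float(line[30:38])
            y = float(line[38:46])
            z = float(line[46:54])
            coords.append((x, y, z))
    return coords
```

The two coordinate lists of a pair may differ in length if the deposited structures resolve different residues, so it is worth checking both lengths against the sequence in the .fasta file before any comparison.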


Instructions: 

The .zip file contains 6 folders when unzipped. We provide the details of each folder below.

 

“Proteins” folder: Contains 20 protein targets organized into two folders (Benchmark and CASP) depending on the family each target belongs to. Data for each protein is provided in a subfolder named with its ID. Each such subfolder contains the following 4 files.

  1. A .fasta file containing the amino-acid sequence of the protein.

  2. A .pdb file containing the native tertiary structure coordinates. The detailed format of a .pdb file is documented at http://www.wwpdb.org/documentation/file-format

  3. A .frag3 file containing the fragments of length 3 for the protein sequence generated from http://old.robetta.org/

  4. A .frag9 file containing the fragments of length 9 for the protein sequence generated from http://old.robetta.org/

 

“Generation” folder: Contains the generated ensembles for the protein targets in 20 subfolders, one for each target, named with their IDs. Each subfolder contains 5 files, each containing the generated ensemble for one run. Each such file contains 14 columns, and each row represents one generated structure. The first column provides the Rosetta score4 energy, the second column provides the lRMSD to the native structure, and each of the remaining 12 columns provides one USR (ultrafast shape recognition) feature for the structure.
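The 14-column layout of a generated-ensemble file can be split into its three parts as follows. This is a minimal Python sketch under stated assumptions: the columns are taken to be whitespace-separated with no header row (the description does not state the delimiter), and the function name `load_generated_ensemble` is hypothetical.

```python
def load_generated_ensemble(path):
    """Split a generated-ensemble file into energies, lRMSDs, and USR features.

    Assumes whitespace-separated numeric columns and no header line.
    Returns three parallel lists; each USR entry is a list of 12 floats.
    """
    energies, lrmsds, usr_features = [], [], []
    with open(path) as handle:
        for line_no, line in enumerate(handle, start=1):
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            if len(fields) != 14:
                raise ValueError(
                    f"{path}:{line_no}: expected 14 columns, got {len(fields)}"
                )
            values = [float(field) for field in fields]
            energies.append(values[0])       # column 1: Rosetta score4 energy
            lrmsds.append(values[1])         # column 2: lRMSD to the native
            usr_features.append(values[2:])  # columns 3-14: USR features
    return energies, lrmsds, usr_features
```

If the files turn out to be comma-separated, replacing `line.split()` with `line.split(",")` should be the only change needed.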

 

“Reduced” folder: Contains the reduced ensembles for each clustering technique in separate folders. Each such folder contains 20 subfolders, one for each target, named with their ids. Each such subfolder contains 5 files, each containing the reduced ensemble for one run. Each such file contains 2 columns and each row represents one structure in the reduced ensemble. The first column provides the Rosetta score4 energy and the second column provides the lRMSD to the native structure.

 

“Truncation” folder: Contains the reduced ensembles via truncation for the protein targets in 20 subfolders, one for each target, named with their ids. Each such subfolder contains 5 files, each containing the reduced ensemble for one run. Each such file contains 2 columns and each row represents one structure in the reduced ensemble. The first column provides the Rosetta score4 energy and the second column provides the lRMSD to the native structure.
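The two-column files in the “Reduced” and “Truncation” folders lend themselves to per-run summaries of lRMSD (minimum, average, standard deviation). The Python sketch below shows one way to compute them. It assumes whitespace-separated columns with no header, and it uses the population standard deviation; the description does not say which estimator the authors used, so that choice is an assumption. The function names are hypothetical.

```python
import statistics


def read_reduced_ensemble(path):
    """Read (energy, lRMSD) pairs from a two-column reduced-ensemble file."""
    rows = []
    with open(path) as handle:
        for line_no, line in enumerate(handle, start=1):
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            if len(fields) != 2:
                raise ValueError(
                    f"{path}:{line_no}: expected 2 columns, got {len(fields)}"
                )
            rows.append((float(fields[0]), float(fields[1])))
    return rows


def summarize_lrmsd(rows):
    """Return min, mean, and population std of the lRMSD column."""
    lrmsds = [lrmsd for _energy, lrmsd in rows]
    return {
        "min": min(lrmsds),
        "mean": statistics.mean(lrmsds),
        # Assumption: population std; use statistics.stdev for the sample std.
        "std": statistics.pstdev(lrmsds),
    }
```

Running `summarize_lrmsd` on each of the 5 run files of a target gives per-run values that can then be aggregated across runs as needed.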

 

“Ks” folder: Contains 4 separate files, one for each clustering technique, containing the number of clusters for each run of each protein target. These files can be used to plot the distributions for the number of clusters.

 

“Bars” folder: Contains 3 subfolders with the data needed to plot bar charts of the minimum, average, and standard deviation of lRMSDs to the native structure for the CASP targets. Each subfolder contains 10 files, one for each target. Each file contains 6 rows that provide, in order, the lRMSD value for the original ensemble and for the ensembles reduced by hierarchical clustering, k-means clustering, GMM clustering, gmx-cluster clustering, and truncation.
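Because the 6 rows of each “Bars” file are identified only by position, it can help to attach labels when loading them. The following Python sketch does so under the assumption that each row starts with a single numeric value; the label strings and the function name `read_bar_file` are hypothetical and simply mirror the row order given above.

```python
# Row order as listed in the folder description; label text is illustrative.
BAR_LABELS = [
    "original",
    "hierarchical",
    "k-means",
    "GMM",
    "gmx-cluster",
    "truncation",
]


def read_bar_file(path):
    """Map each of the 6 rows of a Bars file to its ensemble label.

    Assumes the first whitespace-separated token on each row is the value.
    """
    with open(path) as handle:
        values = [float(line.split()[0]) for line in handle if line.strip()]
    if len(values) != len(BAR_LABELS):
        raise ValueError(
            f"{path}: expected {len(BAR_LABELS)} rows, got {len(values)}"
        )
    return dict(zip(BAR_LABELS, values))
```

The returned dictionary preserves the row order, so its keys and values can be passed directly to a plotting routine as bar labels and heights.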
