Standard Dataset
Proteins that can be secreted into bronchoalveolar lavage fluid
- Citation Author(s):
- Submitted by: Guangzhao Zhang
- Last updated: Thu, 10/10/2024 - 21:01
- DOI: 10.21227/pypf-en48
- License:
- Categories:
- Keywords:
Abstract
The positive dataset, derived from the HBFP database, comprises 3,434 proteins. The initial negative dataset was built by selecting 8,029 proteins from Pfam families that have no overlap with the positive dataset. This set was then refined using protein-protein interaction (PPI) networks from several databases, which expanded it to 13,912 proteins before exclusions narrowed it to 6,740. After curation to remove sequence redundancy, the final datasets contain 3,319 positive and 6,599 negative proteins. Because structural data available from the Protein Data Bank (PDB) are scarce, AlphaFold v2.0 was used to predict high-quality 3D structures, adding structural details for 9,702 proteins.
This document describes the data used in SecProGNN.
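The counts in the abstract imply a class imbalance of roughly two negatives per positive across 9,918 finalized proteins. The short sketch below works through that arithmetic. The inverse-frequency class weights are a common heuristic shown only for illustration; they are not stated to be what SecProGNN uses. The structure coverage figure assumes that the 9,702 proteins with predicted structures are a subset of the finalized set, which the abstract does not state explicitly.

```python
# Counts quoted in the abstract.
N_POSITIVE = 3_319        # finalized positive proteins
N_NEGATIVE = 6_599        # finalized negative proteins
N_WITH_STRUCTURE = 9_702  # proteins with AlphaFold-predicted structures


def summarize(n_pos: int, n_neg: int, n_struct: int) -> dict:
    """Derive simple summary statistics from the dataset counts."""
    total = n_pos + n_neg
    return {
        "total": total,
        "positive_fraction": n_pos / total,
        "negative_to_positive": n_neg / n_pos,
        # Assumption: structures were predicted only for finalized proteins.
        "structure_coverage": n_struct / total,
        # Illustrative inverse-frequency weights: total / (n_classes * n_class).
        "weight_positive": total / (2 * n_pos),
        "weight_negative": total / (2 * n_neg),
    }


stats = summarize(N_POSITIVE, N_NEGATIVE, N_WITH_STRUCTURE)
for name, value in stats.items():
    print(f"{name}: {value:.4f}" if isinstance(value, float) else f"{name}: {value}")
```

Weights of this kind give the minority (positive) class a larger contribution to a loss function, which is one way to compensate for the imbalance.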
Dataset Files
- proteins samples.csv (5.69 MB)
- proteins data.tar.gz (695.23 MB)
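A minimal sketch for taking a first look at the two files after download, using only the Python standard library. The file names come from the list above. The internal layout of the archive and the CSV column names are not described here, so the code assumes nothing about them and only prints what it finds; `peek_csv` and `list_archive` are hypothetical helper names. The attached File Format.txt should be treated as the authoritative description.

```python
import csv
import tarfile
from itertools import islice
from pathlib import Path

# File names as listed on the dataset page; adjust to your download location.
CSV_PATH = Path("proteins samples.csv")
ARCHIVE_PATH = Path("proteins data.tar.gz")


def peek_csv(path, n_rows=5):
    """Return (header, first n_rows data rows) without assuming column names."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        rows = list(islice(reader, n_rows))
    return header, rows


def list_archive(path, limit=10):
    """Return the names of the first `limit` members of a gzipped tar archive."""
    with tarfile.open(path, mode="r:gz") as archive:
        return [member.name for member in islice(archive, limit)]


# Only touch the files if they are actually present.
if CSV_PATH.exists():
    header, rows = peek_csv(CSV_PATH)
    print("columns:", header)
    for row in rows:
        print(row)

if ARCHIVE_PATH.exists():
    for name in list_archive(ARCHIVE_PATH):
        print(name)
```

Listing members before extracting avoids unpacking the roughly 695 MB archive just to learn its structure.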
Documentation
Attachment | Size |
---|---|
File Format.txt | 735 bytes |
Comments
For test purposes.