Data for Error-correcting output codes and multi-view learning in the tissue of origin classification

Citation Author(s):
Mira Han
Submitted by:
Mira Han
Last updated:
Mon, 11/11/2024 - 17:22
DOI:
10.21227/d3qw-v541
Data Format:
License:

Abstract 

As various modalities of genomic data accumulate, methods for integrating multi-omics datasets are becoming important. Error-correcting output codes (ECOC) is an ensemble learning strategy that solves a multiclass problem through a decoding process that aggregates the predictions of multiple classifiers. It therefore lends itself naturally to aggregating predictions across multiple views as well. We applied ECOC to multi-view learning to see whether this strategy can improve classifier performance compared to traditional techniques. We designed experiments to predict tissue types for hundreds of samples using measures of the transcriptome and methylome. We tested our ECOC design for multi-view learning, in which the feature sets for RNA-Seq and DNA methylation were encoded separately and decoded together, to see whether it could outperform traditional uses of the feature sets. Our analyses revealed that multi-view ensemble ECOC achieved higher classifier performance in certain experimental designs. The novel multi-view ensemble ECOC method merits consideration by other researchers as a way to potentially attain superior classification results.
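
The idea of encoding each view separately and decoding them together can be illustrated with a small sketch. This is only one possible reading of the abstract, not the authors' implementation: the codebooks, the tissue names, the predicted bits, and the helper names `hamming` and `decode_multiview` are all hypothetical, and plain Hamming-distance decoding is assumed. Each view has its own codeword per class; the bits predicted by that view's binary classifiers are compared with the codewords, and the class with the smallest total distance across views is chosen.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def decode_multiview(codebooks, predicted_bits):
    """Pick the class whose codewords are closest to the predictions, summed over views.

    codebooks: one dict per view, mapping class label -> codeword (tuple of 0/1).
    predicted_bits: one tuple per view, the outputs of that view's binary classifiers.
    """
    classes = list(codebooks[0])

    def total_distance(label):
        return sum(hamming(book[label], bits)
                   for book, bits in zip(codebooks, predicted_bits))

    return min(classes, key=total_distance)


# Hypothetical codebooks: each view is encoded with its own code matrix.
rna_code = {"liver": (1, 1, 1, 0, 0), "lung": (0, 0, 1, 1, 1), "kidney": (1, 0, 0, 1, 0)}
meth_code = {"liver": (0, 1, 0, 1, 1), "lung": (1, 0, 0, 0, 1), "kidney": (1, 1, 1, 1, 0)}

# Hypothetical classifier outputs for one sample; each view has one wrong bit
# relative to the "lung" codeword.
rna_bits = (1, 0, 1, 1, 1)
meth_bits = (1, 0, 0, 1, 1)

predicted = decode_multiview([rna_code, meth_code], [rna_bits, meth_bits])
print(predicted)
```

Summing distances over views is equivalent to concatenating the per-view codewords and decoding once, so a bit error in one view can be outvoted by the remaining bits from both views.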

Instructions: 

- Rows are sample IDs.
- Columns are features, except for the first column, which holds the ground-truth tissue type.
- There are six files in total: three for TCGA data and three for GEO data.
- For each source (TCGA or GEO), the three files contain RNA-Seq features, DNA methylation features, and a concatenation of both.
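
The layout described above can be read with a short loader. This is a minimal sketch under several assumptions that this page does not confirm: the files are taken to be comma-delimited text with a header row, the sample ID is assumed to sit in a leading index column followed by the tissue-type column, and all feature values are assumed to be numeric. The function name `load_view` and the demo column names are hypothetical.

```python
import csv
import io


def load_view(handle, delimiter=","):
    """Split one table into sample IDs, tissue labels, feature names, and feature rows.

    Assumed layout (not confirmed by the dataset page):
    sample ID, tissue type, then one column per feature, with a header row.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    feature_names = header[2:]
    sample_ids, labels, features = [], [], []
    for row in reader:
        if not row:  # skip blank lines
            continue
        sample_ids.append(row[0])
        labels.append(row[1])
        features.append([float(value) for value in row[2:]])
    return sample_ids, labels, feature_names, features


# Tiny in-memory stand-in for one of the six files (made-up values).
demo = io.StringIO(
    "sample_id,tissue,gene_a,gene_b\n"
    "S1,liver,0.5,1.25\n"
    "S2,lung,2.0,0.0\n"
)
ids, tissues, names, X = load_view(demo)
print(ids, tissues, names, X)
```

For a real file, an opened file object can be passed in place of `demo`, for example `open(path, newline="")`, adjusting `delimiter` if the files turn out to be tab-separated.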