SynGen6: Synthetic Genomic Dataset with Diverse Ancestry

Citation Author(s):
Xinyue Wang, Renmin University of China
Sitao Min, Rutgers University
Jaideep Vaidya, Rutgers University
Submitted by:
Jaideep Vaidya
Last updated:
Fri, 10/11/2024 - 17:52
DOI:
10.21227/j3s2-xe98

Abstract 

SynGen6 is a synthetic genomic dataset that encompasses six distinct populations. We used Principal Component Analysis (PCA) and ϵ-local differential privacy (LDP) to generate synthetic samples. We then simulated phenotype vectors associated with significant SNPs, mirroring real-world gene-disease associations. We also generated synthetic SNPs to watermark the dataset, enabling verification of outsourced computations. Lastly, we created synthetic relatives to support research on kinship inference and family-based genomic analyses. The actual SynGen6 data can be created by running our scripts in the All of Us Research Hub WorkBench. Here, we provide a toy example based on the public 1000 Genomes dataset.

Instructions: 

Sample SNP Data (CSV): This file contains the SNP data for all individuals in the dataset. Each row corresponds to a unique individual.
– Column 1: Sample ID. A unique identifier for each individual.
– Column 2: Ancestry. The ancestry group (e.g., African, European, etc.).
– Columns 3 onward: Each column represents a specific SNP, with values reflecting the genotype (e.g., 0, 1, or 2).
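
The column layout above can be read with the Python standard library alone. The following is a minimal sketch, not the authors' loading code: it assumes the file has a header row, and the column names, sample IDs, and SNP names in the toy data are hypothetical.

```python
import csv
import io


def load_snp_table(handle):
    """Read a Sample SNP CSV: sample ID, ancestry, then one column per SNP.

    Assumes the first row is a header (an assumption, not stated in the
    dataset description). Returns the SNP names and a dict that maps each
    sample ID to (ancestry, list of genotypes).
    """
    reader = csv.reader(handle)
    header = next(reader)
    snp_ids = header[2:]
    samples = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        genotypes = [int(g) for g in row[2:]]
        if any(g not in (0, 1, 2) for g in genotypes):
            raise ValueError(f"unexpected genotype value for sample {row[0]}")
        samples[row[0]] = (row[1], genotypes)
    return snp_ids, samples


# Hypothetical toy rows; real IDs, ancestry labels, and SNP names will differ.
toy_snps = io.StringIO(
    "sample_id,ancestry,rs1,rs2,rs3\n"
    "S1,African,0,1,2\n"
    "S2,European,1,1,0\n"
)
snp_ids, samples = load_snp_table(toy_snps)
```

With a real file, the `io.StringIO` object would be replaced by `open(path, newline="")`, where `path` points to the downloaded CSV.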

Phenotype Condition Data (CSV): This file contains phenotype information for each individual.
– Column 1: Sample ID. Unique identifier for each individual.
– Column 2: Phenotype Condition. A binary variable representing the presence (1) or absence (0) of the simulated condition.
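
Because the phenotype file is keyed by Sample ID, it can be joined to the SNP data to separate cases from controls. The sketch below assumes a header row and uses hypothetical column names and sample IDs; it illustrates the join only and does not reproduce any association analysis from the dataset.

```python
import csv
import io


def load_phenotypes(handle):
    """Read a Phenotype Condition CSV into {sample ID: 0 or 1}.

    Assumes the first row is a header (an assumption).
    """
    reader = csv.reader(handle)
    next(reader)  # skip the assumed header row
    return {row[0]: int(row[1]) for row in reader if row}


def split_cases_controls(sample_ids, phenotypes):
    """Partition sample IDs by phenotype, preserving input order.

    Samples without a phenotype entry are left out of both groups.
    """
    cases = [s for s in sample_ids if phenotypes.get(s) == 1]
    controls = [s for s in sample_ids if phenotypes.get(s) == 0]
    return cases, controls


# Hypothetical toy data; real sample IDs will differ.
toy_pheno = io.StringIO(
    "sample_id,phenotype\n"
    "S1,1\n"
    "S2,0\n"
    "S3,1\n"
)
phenotypes = load_phenotypes(toy_pheno)
cases, controls = split_cases_controls(["S1", "S2", "S3", "S4"], phenotypes)
```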

Watermark SNP Data (CSV): This file includes the synthetic watermark SNPs designed to ensure data integrity.
– Columns 1-20: Watermark SNP IDs. Synthetic SNPs used for validation purposes.
– Rows 1-30000: Sample ID. Each row contains the watermark SNP values for one sample.
– Row 30001: p-values. Each p-value indicates the statistical association between the corresponding watermark SNP and the phenotype condition.
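
Since the final row holds p-values rather than genotypes, it should be separated from the sample rows before any numeric processing. The sketch below shows one way to do that. It assumes a header row of SNP IDs and, by default, a leading label column holding the sample ID or a p-value label; both are assumptions about the file layout, and the `label_column` flag and toy names are hypothetical.

```python
import csv
import io


def split_watermark_table(handle, label_column=True):
    """Split a Watermark SNP CSV into genotype rows and the p-value row.

    Treats the last non-empty row as p-values and every row between the
    header and that row as per-sample genotypes. Set label_column=False
    if the file has no leading row-label column (layout is an assumption).
    """
    rows = [r for r in csv.reader(handle) if r]
    header, body = rows[0], rows[1:]
    start = 1 if label_column else 0
    snp_ids = header[start:]
    genotype_rows = [[int(v) for v in r[start:]] for r in body[:-1]]
    p_values = dict(zip(snp_ids, (float(v) for v in body[-1][start:])))
    return snp_ids, genotype_rows, p_values


# Hypothetical toy data: two watermark SNPs, two samples, one p-value row.
toy_watermark = io.StringIO(
    ",wm1,wm2\n"
    "S1,0,2\n"
    "S2,1,1\n"
    "p_value,0.5,0.25\n"
)
wm_ids, wm_genotypes, wm_p_values = split_watermark_table(toy_watermark)
```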

Kinship-Relatedness Data (CSV): This file provides information on the synthetic relatives in the dataset.
– Column 1: Sample ID. The ID of the synthetic individual related to a sample in the Sample SNP Data file.
– Column 2: Ancestor ID. The Sample ID from the Sample SNP Data file to which the synthetic individual is related.
– Column 3: Relatedness. The kinship relationship between the synthetic individual and its ancestor.
– Column 4: Kinship Coefficient. The kinship coefficient calculated between the SNP data of the synthetic individual and that of its ancestor.

Synthetic Relatives SNP Data: This file provides SNP information on the synthetic relatives in the dataset.
– Column 1: Sample ID. A unique identifier for each individual.
– Columns 2 onward: Each column represents a specific SNP, with values reflecting the genotype (e.g., 0, 1, or 2).
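
The kinship table links each synthetic relative to a sample in the Sample SNP Data file, so the two genotype files can be compared pair by pair. The sketch below joins them and computes a simple identity-by-state (IBS) similarity as a sanity check. This IBS measure is a stand-in chosen for illustration and is not the kinship estimator used to produce the Kinship Coefficient column. The sketch also assumes header rows and that SNP columns appear in the same order in both genotype files; all IDs and relationship labels in the toy data are hypothetical.

```python
import csv
import io


def load_genotypes(handle, first_snp_column):
    """Read {sample ID: genotypes}, taking SNPs from the given column index."""
    reader = csv.reader(handle)
    next(reader)  # skip the assumed header row
    return {r[0]: [int(g) for g in r[first_snp_column:]] for r in reader if r}


def load_kinship(handle):
    """Read rows of (relative ID, ancestor ID, relatedness, kinship)."""
    reader = csv.reader(handle)
    next(reader)  # skip the assumed header row
    return [(r[0], r[1], r[2], float(r[3])) for r in reader if r]


def ibs_similarity(a, b):
    """Identity-by-state similarity in [0, 1]: 1 - mean(|a_i - b_i|) / 2."""
    if len(a) != len(b) or not a:
        raise ValueError("genotype vectors must be non-empty and equal length")
    return 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / (2.0 * len(a))


# Hypothetical toy data; real IDs, labels, and values will differ.
sample_file = io.StringIO("sample_id,ancestry,rs1,rs2,rs3,rs4\nS1,African,0,1,1,1\n")
relative_file = io.StringIO("sample_id,rs1,rs2,rs3,rs4\nR1,0,1,2,1\n")
kinship_file = io.StringIO(
    "sample_id,ancestor_id,relatedness,kinship\nR1,S1,parent-child,0.25\n"
)

ancestors = load_genotypes(sample_file, first_snp_column=2)
relatives = load_genotypes(relative_file, first_snp_column=1)
pairs = [
    (rel, anc, label, kin, ibs_similarity(relatives[rel], ancestors[anc]))
    for rel, anc, label, kin in load_kinship(kinship_file)
]
```

Each tuple in `pairs` holds the reported kinship coefficient next to the recomputed IBS similarity, which makes it easy to flag pairs whose genotypes look inconsistent with the stated relationship.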

Funding Agency: 
National Institutes of Health; National Science Foundation
Grant Number: 
R35GM134927; R01LM014520; CNS-2333225