This data resource is an outcome of the NSF RAPID project titled "Democratizing Genome Sequence Analysis for COVID-19 Using CloudLab" awarded to University of Missouri-Columbia.
The resource contains the output of variant analysis (along with CADD scores) on human genome sequences obtained from the COVID-19 Data Portal. The variants include single nucleotide polymorphisms (SNPs) and short insert and deletes (indels).
1. Download a .zip file.
2. Unzip the file and extract it into a folder.
3. There will be two folders, namely, VCF and CADD_Scores. These folders contain the compressed .vcf and .tsv files. The .vcf files are filtered VCF files produced by the GATK best practice workflow for RNA-seq data. The reference genome GRCh38 was used. There is also a .xlsx file containing the run accession IDs (e.g., SRR12095153) and URLs (e.g., https://www.ebi.ac.uk/ena/browser/view/SRR12095153) from where the paired end sequences were downloaded. Complete description of the sequences can be found via these URLs.
4. Check for new .zip files.
In February 2016, LIGO announced the first observation of gravitational waves from a binary black hole merger, known as GW150914. To establish the confidence of this detection, large-scale scientific workflows were used to measure the event's statistical significance. These workflows used code written by the LIGO Scientific Collaboration and were executed on the LIGO Data Grid.
This data set contains a copy of the PyCBC v1.3.2 PyInstaller bundled executables used by the analysis in "Observation of Gravitational Waves from a Binary Black Hole Merger" B. P. Abbott et al. (LIGO Scientific Collaboration and Virgo Collaboration) Phys. Rev. Lett. 116, 061102 (2016). The executables are used by the generate_workflow.sh script to create and execute the PyCBC search workflow.
The data set also contains the HDF5 file that was produced by running the workflow on a mixure of Open Science Grid and USC/ISI compute resources to reproduce the LIGO result. This HDF5 file is used as input to the make_pycbc_hist.sh script to create the right-hand histogram shown in Figure 4 of Abbott et al.
For more information on the files, see the file README.md
The following data set is modelled after the implementers’ test data in 3GPP TS 33.501 “Security architecture and procedures for 5G System” with the same terminology. The data set corresponds to SUCI (Subscription Concealed Identifier) computation in the 5G UE (User Equipment) for IMSI (International Mobile Subscriber Identity) based SUPI (Subscription Permanent Identifier) and ECIES Profile A.
The following data set is modelled after the implementers’ test data in 3GPP TS 33.501 “Security architecture and procedures for 5G System” with the same terminology. The data set corresponds to SUCI (Subscription Concealed Identifier) computation in the 5G UE (User Equipment) for IMSI (International Mobile Subscriber Identity) based SUPI (Subscription Permanent Identifier) and ECIES Profile A, the IMSI consists of MCC|MNC: '274012'.
In the 5G system, the globally unique 5G subscription permanent identifier is called SUPI as defined in 3GPP TS 23.501. For privacy reasons, the SUPI from the 5G devices should not be transferred in clear text, and is instead concealed inside the privacy preserving SUCI. Consequently, the SUPI is privacy protected over-the-air of the 5G radio network by using the SUCI. For SUCIs containing IMSI based SUPI, the UE in essence conceals the MSIN (Mobile Subscriber Identification Number) part of the IMSI. On the 5G operator-side, the SIDF (Subscription Identifier De-concealing Function) of the UDM (Unified Data Management) is responsible for de-concealment of the SUCI and resolves the SUPI from the SUCI based on the protection scheme used to generate the SUCI.
The SUCI protection scheme used in this data set is ECIES Profile A. The size of the scheme-output is a total of 256-bit public key, 64-bit MAC & 40-bit encrypted MSIN. The SUCI scheme-input MSIN is coded as hexadecimal digits using packed BCD coding where the order of digits within an octet is same as the order of MSIN. As the MSINs are odd number of digits, bits 5 to 8 of final octet is coded as ‘1111’.
# Example Python code to load data into Spark DataFrame
df = spark.read.format("csv").option("inferSchema","true").option("header","true").option("sep",",").load(“5g_suci_using_ecies_profile_a_100k.gz”)