Dataset Information

Non-synonymous variations in cancer and their effects on the human proteome: workflow for NGS data biocuration and proteome-wide analysis of TCGA data.

ABSTRACT: BACKGROUND: Next-generation sequencing (NGS) technologies have resulted in petabytes of scattered data, decentralized in archives, databases and sometimes in isolated hard disks that are inaccessible for browsing and analysis. It is expected that curated secondary databases will help organize some of this Big Data, thereby allowing users to better navigate, search and compute on it.

RESULTS: To address the above challenge, we have implemented an NGS biocuration workflow and are analyzing short read sequences and associated metadata from cancer patients to better understand the human variome. Curation of variation and other related information from control (normal tissue) and case (tumor) samples will provide comprehensive background information that can be used in genomic medicine research and application studies. Our approach includes a CloudBioLinux Virtual Machine, which is used upstream of the High-performance Integrated Virtual Environment (HIVE), which in turn encapsulates the Curated Short Read archive (CSR) and a proteome-wide variation effect analysis tool (SNVDis). As a proof of concept, we have curated and analyzed control and case breast cancer datasets from the NCI cancer genomics program, The Cancer Genome Atlas (TCGA). Our efforts include reviewing and recording in CSR the available clinical information on patients, mapping the reads to the reference, identifying non-synonymous Single Nucleotide Variations (nsSNVs), and integrating the data with tools that allow analysis of the effect of nsSNVs on the human proteome. Furthermore, we have developed a novel phylogenetic analysis algorithm that uses SNV positions and can be used to classify the patient population. The workflow described here lays the foundation for analysis of short read sequence data to identify rare and novel SNVs that are not present in dbSNP, and therefore provides a more comprehensive understanding of the human variome. Variation results for single genes as well as the entire study are available from the CSR website (http://hive.biochemistry.gwu.edu/dna.cgi?cmd=csr).

CONCLUSIONS: The availability of thousands of sequenced samples from patients provides a rich repository of sequence information that can be used to identify individual-level SNVs and their effect on the human proteome beyond what the dbSNP database provides.
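To illustrate what "non-synonymous" means in the abstract above, the following minimal Python sketch classifies a single-nucleotide change in a coding sequence by comparing the reference and altered codons under the standard genetic code. It is not the SNVDis or HIVE implementation; the function name `classify_snv`, its parameters, and the toy sequence are hypothetical and chosen only for illustration. It assumes an in-frame, forward-strand coding sequence and a 0-based position.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snv(cds, pos, alt):
    """Classify a single-nucleotide change in a coding sequence.

    cds: in-frame coding DNA sequence (hypothetical input, forward strand)
    pos: 0-based position of the changed base within cds
    alt: the alternate base at that position
    Returns (kind, ref_aa, alt_aa), where kind is "synonymous" or
    "non-synonymous".
    """
    codon_start = (pos // 3) * 3
    offset = pos % 3
    ref_codon = cds[codon_start:codon_start + 3].upper()
    alt_codon = ref_codon[:offset] + alt.upper() + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return kind, ref_aa, alt_aa


# Toy coding sequence: ATG GAA GTT encodes Met-Glu-Val.
toy_cds = "ATGGAAGTT"
print(classify_snv(toy_cds, 4, "T"))  # GAA -> GTA changes the amino acid
print(classify_snv(toy_cds, 5, "G"))  # GAA -> GAG keeps the amino acid
```

A real pipeline would also need to handle strand, transcript annotation, and reads mapped to a genome reference, none of which this sketch covers.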

SUBMITTER: Cole C

PROVIDER: S-EPMC3916084 | biostudies-literature | 2014 Jan

REPOSITORIES: biostudies-literature


Publications

Non-synonymous variations in cancer and their effects on the human proteome: workflow for NGS data biocuration and proteome-wide analysis of TCGA data.

Charles Cole, Konstantinos Krampis, Konstantinos Karagiannis, Jonas S. Almeida, William J. Faison, Mona Motwani, Quan Wan, Anton Golikov, Yang Pan, Vahan Simonyan, Raja Mazumder

BMC Bioinformatics, 2014 Jan 27


PMID: 24467687

Similar Datasets

Project description: The rapid global spread and dissemination of SARS-CoV-2 has provided the virus with numerous opportunities to develop several variants. Thus, it is critical to determine the degree of the variations and in which part of the virus those variations occurred. Therefore, in this study, methods that could be used to vectorize the sequence data, perform clustering analysis, and visualize the results were proposed using machine learning methods. To conduct this study, a total of 224,073 cases of SARS-CoV-2 sequence data were collected through NCBI and GISAID, and the data were visualized using dimensionality reduction and clustering analysis models such as t-SNE and DBSCAN. The SARS-CoV-2 virus, which was first detected, was distinguished from different variations, including Omicron and Delta, in the cluster results. Furthermore, it was possible to examine which codon changes in the spike protein caused the variants to be distinguished using feature importance extraction models such as Random Forest or Shapley values. The proposed method has the advantage of being able to analyse and visualize a large amount of data at once compared to the existing tree-based sequence data analysis. The proposed method was able to identify and visualize significant changes between the SARS-CoV-2 virus, which was first detected in Wuhan, China, in December 2019, and the newly formed mutant virus group. As a result of clustering analysis using sequence data, it was possible to confirm the formation of clusters among various variants in a two-dimensional graph, and by extracting the importance of variables, it was possible to confirm which codon changes played a major role in distinguishing variants. Furthermore, since the proposed method can handle a variety of data sequences, it can be used for all kinds of diseases, including influenza and SARS-CoV-2. Therefore, the proposed method has the potential to become widely used for the effective analysis of disease variations.

