Dataset Information

Embedding of Genes Using Cancer Gene Expression Data: Biological Relevance and Potential Application on Biomarker Discovery.

ABSTRACT: Artificial neural networks (ANNs) have been utilized for classification and prediction task with remarkable accuracy. However, its implications for unsupervised data mining using molecular data is under-explored. We found that embedding can extract biologically relevant information from The Cancer Genome Atlas (TCGA) gene expression dataset by learning a vector representation through gene co-occurrence. Ground truth relationship, such as cancer types of the input sample and semantic meaning of genes, were showed to retain in the resulting entity matrices. We also demonstrated the interpretability and usage of these matrices in shortlisting candidates from a long gene list as in the case of immunotherapy response. 73 related genes are singled out while the relatedness of 55 genes with immune checkpoint proteins (PD-1, PD-L1, and CTLA-4) are supported by literature. 16 novel genes (ACAP1, C11orf45, CD79B, CFP, CLIC2, CMPK2, CXCR2P1, CYTIP, FER, MCTO1, MMP25, RASGEF1B, SLFN12, TBC1D10C, TRAF3IP3, TTC39B) related to immune checkpoint proteins were identified. Thus, this method is feasible to mine big volume of biological data, and embedding would be a valuable tool to discover novel knowledge from omics data. The resulting embedding matrices mined from TCGA gene expression data are interactively explorable online (http://bit.ly/tcga-embedding-cancer) and could serve as an informative reference for gene relatedness in the context of cancer and is readily applicable to biomarker discovery of any molecular targeted therapy.

SUBMITTER: Choy CT

PROVIDER: S-EPMC6329279 | biostudies-literature | 2018

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

Embedding of Genes Using Cancer Gene Expression Data: Biological Relevance and Potential Application on Biomarker Discovery.

Choy Chi Tung CT Wong Chi Hang CH Chan Stephen Lam SL

Frontiers in genetics 20190104

Artificial neural networks (ANNs) have been utilized for classification and prediction task with remarkable accuracy. However, its implications for unsupervised data mining using molecular data is under-explored. We found that embedding can extract biologically relevant information from The Cancer Genome Atlas (TCGA) gene expression dataset by learning a vector representation through gene co-occurrence. Ground truth relationship, such as cancer types of the input sample and semantic meaning of g ...[more]

PMID: 30662451

Similar Datasets

Project description:Motivation: Monitoring, assessment and prediction of environmental risks that chemicals pose demand rapid and accurate diagnostic assays. A variety of toxicological effects have been associated with explosive compounds TNT and RDX. One important goal of microarray experiments is to discover novel biomarkers for toxicity evaluation. We have developed an earthworm microarray containing 15,208 unique oligo probes. Our objective was to identify biomarker genes that can separate earthworm samples into three groups: control, TNT-treated, and RDX-treated. Results: We developed a discriminant analysis and cluster (DAC) pipeline to analyze a 248-array dataset. First, a total of 869 significantly changed genes in response to TNT or RDX exposure were inferred by class comparison statistical algorithms. Then, nine decision tree-based algorithms were applied to generate classification rules and a set of 286 classifier genes. These classifier genes were ranked by their overall weight of significance, and were used to build support vector machines (SVMs). An SVM containing all 286 classifier genes had the highest classification accuracy (91.5%). An unsupervised clustering method was used to cluster the worm samples and results show that the use of the top 100 classifier genes can assign the largest number of worm samples into the three reference clusters obtained by using all the 14,188 filtered genes, suggesting that these top-ranked genes may be potential candidates for biomarkers. This study demonstrates that the DAC pipeline can be used to identify a small set of biomarker genes from high dimensional datasets and generate a reliable SVM classification model for multiple classes. Adult earthworms (E. fetida) were exposed in soil spiked with TNT (0, 6, 12, 24, 48, or 96 mg/kg) or RDX (8, 16, 32, 64, or 128 mg/kg) for 0, 4 or 14 days. The 4-day treatment was repeated with RDX concentration being 2, 4, 8, 16 or 32 mg/kg soil. Each treatment originally had 10 replicate worms with 8-10 survivors at the end of exposure. Total RNA was isolated from the surviving worms. A total of 248 worm RNA samples were hybridized to a custom-designed oligo array using Agilentâs one-color Low RNA Input Linear Amplification Kit. The array contains 15,208 non-redundant 60-mer probes, each targeting a unique E. fetida transcript (Gong et al. 2009). After hybridization and scanning, gene expression data were acquired using Agilentâs Feature Extraction Software (v.9.1.3). The 248-array dataset consists of three worm groups: 32 untreated controls, 96 TNT-treated, and 120 RDX-treated.

Dataset Information

Embedding of Genes Using Cancer Gene Expression Data: Biological Relevance and Potential Application on Biomarker Discovery.

Publications

Embedding of Genes Using Cancer Gene Expression Data: Biological Relevance and Potential Application on Biomarker Discovery.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets