Dataset Information

Evolutionarily informed deep learning methods for predicting relative transcript abundance from DNA sequence.

ABSTRACT: Deep learning methodologies have revolutionized prediction in many fields and show potential to do the same in molecular biology and genetics. However, applying these methods in their current forms ignores evolutionary dependencies within biological systems and can result in false positives and spurious conclusions. We developed two approaches that account for evolutionary relatedness in machine learning models: (i) gene-family-guided splitting and (ii) ortholog contrasts. The first approach accounts for evolution by constraining model training and testing sets to include different gene families. The second approach uses evolutionarily informed comparisons between orthologous genes to both control for and leverage evolutionary divergence during the training process. The two approaches were explored and validated within the context of mRNA expression level prediction and achieved area under the ROC curve (auROC) values ranging from 0.75 to 0.94. Model weight inspections showed biologically interpretable patterns, resulting in the hypothesis that the 3' UTR is more important for fine-tuning mRNA abundance levels while the 5' UTR is more important for large-scale changes.
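
The gene-family-guided splitting idea in the abstract (training and testing sets that contain different gene families) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `family_guided_split`, the `gene_to_family` mapping, the greedy assignment rule, and the toy gene and family IDs are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def family_guided_split(gene_to_family, test_fraction=0.2, seed=0):
    """Split genes into train/test lists so that no gene family spans both.

    Hypothetical helper: whole families are assigned to one side, so
    related (homologous) genes never appear in both training and testing.
    """
    # Group genes by their family label.
    families = defaultdict(list)
    for gene, family in gene_to_family.items():
        families[family].append(gene)

    # Shuffle family IDs reproducibly, then fill the test set family by
    # family until it holds roughly test_fraction of all genes.
    family_ids = sorted(families)
    random.Random(seed).shuffle(family_ids)
    target = test_fraction * len(gene_to_family)

    train, test = [], []
    for fam in family_ids:
        if len(test) < target:
            test.extend(families[fam])
        else:
            train.extend(families[fam])
    return train, test


# Toy data (made-up IDs): six genes in four families.
genes = {"g1": "famA", "g2": "famA", "g3": "famB",
         "g4": "famC", "g5": "famC", "g6": "famD"}
train, test = family_guided_split(genes, test_fraction=0.3)
```

Because assignment happens at the family level, the realized test fraction only approximates `test_fraction`; a plain random split over genes would hit the fraction exactly but could place close relatives on both sides.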

SUBMITTER: Washburn JD

PROVIDER: S-EPMC6431157 | biostudies-literature | 2019 Mar

REPOSITORIES: biostudies-literature

Publications

Evolutionarily informed deep learning methods for predicting relative transcript abundance from DNA sequence.

Jacob D Washburn, Maria Katherine Mejia-Guerra, Guillaume Ramstein, Karl A Kremling, Ravi Valluru, Edward S Buckler, Hai Wang

Proceedings of the National Academy of Sciences of the United States of America, 2019-03-06, issue 12

PMID: 30842277

Similar Datasets

Project description:

Background: Massively parallel DNA sequencing technologies have enabled the sequencing of several individual human genomes. These technologies are also being used in novel ways for mRNA expression profiling, genome-wide discovery of transcription-factor binding sites, small RNA discovery, etc. The multitude of sequencing platforms, each with its unique characteristics, poses a number of design challenges regarding the technology to be used and the depth of sequencing required for a particular sequencing application. Here we describe a number of analytical and empirical results to address design questions for two applications: detection of structural variations from paired-end sequencing and estimating mRNA transcript abundance.

Results: For structural variation, our results provide explicit trade-offs between the detection and resolution of rearrangement breakpoints, and the optimal mix of paired-read insert lengths. Specifically, we prove that optimal detection and resolution of breakpoints is achieved using a mix of exactly two insert library lengths. Furthermore, we derive explicit formulae to determine these insert length combinations, enabling a 15% improvement in breakpoint detection at the same experimental cost. On empirical short read data, these predictions show good concordance with Illumina 200 bp and 2 Kbp insert length libraries. For transcriptome sequencing, we determine the sequencing depth needed to detect rare transcripts from a small pilot study. With only 1 million reads, we derive corrections that enable almost perfect prediction of the underlying expression probability distribution, and use this to predict the sequencing depth required to detect low expressed genes with greater than 95% probability.

Conclusions: Together, our results form a generic framework for many design considerations related to high-throughput sequencing. We provide software tools (http://bix.ucsd.edu/projects/NGS-DesignTools) to derive platform-independent guidelines for designing sequencing experiments (amount of sequencing, choice of insert length, mix of libraries) for novel applications of next generation sequencing.
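
The question of how deep to sequence in order to detect a low expressed gene with greater than 95% probability can be made concrete with a deliberately simplified calculation. The sketch below assumes each read is drawn independently and lands on the transcript with a known probability `p`; it does not reproduce the pilot-study corrections described above, and the function name `reads_needed` is hypothetical.

```python
import math


def reads_needed(p, detect_prob=0.95):
    """Smallest N such that P(at least one read) = 1 - (1 - p)**N >= detect_prob.

    Simplifying assumption (not from the source): reads are independent
    draws, each hitting the transcript with probability p.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if not 0.0 < detect_prob < 1.0:
        raise ValueError("detect_prob must be strictly between 0 and 1")
    # Solve (1 - p)**N <= 1 - detect_prob for N; log1p keeps precision for tiny p.
    return math.ceil(math.log(1.0 - detect_prob) / math.log1p(-p))


# A transcript making up one read in a million of the library.
n = reads_needed(1e-6)
```

Under this model, a transcript at one part per million needs on the order of three million reads for a 95% chance of being seen at least once, which shows why depth requirements grow quickly for rare transcripts.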
