Dataset Information

DeepBound: accurate identification of transcript boundaries via deep convolutional neural fields.

ABSTRACT: Motivation: Reconstructing the full-length expressed transcripts (a.k.a. the transcript assembly problem) from the short sequencing reads produced by the RNA-seq protocol plays a central role in identifying novel genes and transcripts, as well as in studying gene expression and gene function. A crucial step in transcript assembly is to accurately determine the splicing junctions and boundaries of the expressed transcripts from the read alignments. In contrast to splicing junctions, which can be efficiently detected from spliced reads, the problem of identifying boundaries remains open and challenging because the signal related to boundaries is noisy and weak. Results: We present DeepBound, an effective approach to identify boundaries of expressed transcripts from RNA-seq read alignments. At its core, DeepBound employs deep convolutional neural fields to learn the hidden distributions and patterns of boundaries. To accurately model the transition probabilities and to solve the label-imbalance problem, we incorporate the AUC (area under the curve) score into the objective function being optimized. Because deep probabilistic graphical models require a large number of labeled training samples, we propose to use simulated RNA-seq datasets to train our model. Through extensive experimental studies on both simulated datasets of two species and biological datasets, we show that DeepBound consistently and significantly outperforms the two existing methods. Availability and implementation: DeepBound is freely available at https://github.com/realbigws/DeepBound. Contact: mingfu.shao@cs.cmu.edu or realbigws@gmail.com.
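The abstract's point about label imbalance can be illustrated with a small sketch. It is not DeepBound's objective function (which this record does not specify); it only shows, under assumed toy data, why AUC is a more informative target than accuracy when boundary positions are rare. The function names `auc_score` and `accuracy` and the example scores are hypothetical.

```python
def auc_score(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(scores, labels, threshold=0.5):
    """Fraction of positions whose thresholded score matches the label."""
    hits = sum((s >= threshold) == (y == 1) for s, y in zip(scores, labels))
    return hits / len(labels)


# Assumed toy data: 1 boundary position among 100 positions.
labels = [1] + [0] * 99

# A trivial model that never predicts a boundary.
trivial = [0.0] * 100
# A model that ranks the true boundary above every other position.
ranked = [0.9] + [0.1] * 99

print(accuracy(trivial, labels), auc_score(trivial, labels))
print(accuracy(ranked, labels), auc_score(ranked, labels))
```

The trivial model is correct at 99 of 100 positions yet ranks no positive above any negative, so its AUC is no better than chance, while the second model separates the classes perfectly. A trainable objective would need a differentiable approximation of this pairwise count, which is outside the scope of this sketch.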

SUBMITTER: Shao M

PROVIDER: S-EPMC5870651 | biostudies-literature | 2017 Jul

REPOSITORIES: biostudies-literature


Publications

DeepBound: accurate identification of transcript boundaries via deep convolutional neural fields.

Mingfu Shao, Jianzhu Ma, Sheng Wang

Bioinformatics (Oxford, England), 2017 Jul 1, issue 14


PMID: 28881999

Similar Datasets

Project description: Motivation: A majority of known genetic variants associated with human-inherited diseases lie in non-coding regions that lack adequate interpretation, making it indispensable to systematically discover functional sites at the whole-genome level and to decipher their implications precisely and comprehensively. Although computational approaches have been complementing high-throughput biological experiments in annotating the human genome, it remains a big challenge to accurately annotate regulatory elements in the context of a specific cell type by automatically learning the DNA sequence code from large-scale sequencing data. Indeed, the development of an accurate and interpretable model that learns the DNA sequence signature and enables the identification of causative genetic variants has become essential in both genomic and genetic studies. Results: We proposed Deopen, a hybrid framework mainly based on a deep convolutional neural network, to automatically learn the regulatory code of DNA sequences and predict chromatin accessibility. In a series of comparisons with existing methods, we show the superior performance of our model not only in classifying accessible regions against background sequences sampled at random, but also in the regression of DNase-seq signals. We further visualize the convolutional kernels and show that the identified sequence signatures match known motifs. Finally, we demonstrate the sensitivity of our model in finding causative non-coding variants in the analysis of a breast cancer dataset. We expect to see wide applications of Deopen with either public or in-house chromatin accessibility data in the annotation of the human genome and the identification of non-coding variants associated with diseases. Availability and implementation: Deopen is freely available at https://github.com/kimmo1019/Deopen. Contact: ruijiang@tsinghua.edu.cn.
Supplementary information: Supplementary data are available at Bioinformatics online.
