Dataset Information

GeneScissors: a comprehensive approach to detecting and correcting spurious transcriptome inference owing to RNA-seq reads misalignment.

ABSTRACT:

Motivation

RNA-seq techniques provide an unparalleled means for exploring a transcriptome with deep coverage and base pair level resolution. Various analysis tools have been developed to align and assemble RNA-seq data, such as the widely used TopHat/Cufflinks pipeline. A common observation is that a sizable fraction of the fragments/reads align to multiple locations of the genome. These multiple alignments pose substantial challenges to existing RNA-seq analysis tools. Inappropriate treatment may result in reporting spurious expressed genes (false positives) and missing truly expressed genes (false negatives). Such errors affect subsequent analyses, such as differential expression analysis. In our study, we observe that ~3.5% of transcripts reported by the TopHat/Cufflinks pipeline correspond to annotated nonfunctional pseudogenes. Moreover, ~10.0% of reported transcripts are not annotated in the Ensembl database. These unannotated transcripts could be either novel expressed genes or false discoveries.
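The notion of a read that "aligns to multiple locations" can be made concrete with a small sketch. The record layout below (read ID, chromosome, position) and the function name are assumptions made for illustration only; they are not the format or code used by TopHat, Cufflinks, or GeneScissors.

```python
def multi_alignment_fraction(alignments):
    """Return the fraction of distinct reads that align to more than one location.

    `alignments` is a hypothetical iterable of (read_id, chrom, pos) tuples.
    """
    # Collect the set of distinct genomic locations seen for each read.
    locations = {}
    for read_id, chrom, pos in alignments:
        locations.setdefault(read_id, set()).add((chrom, pos))

    if not locations:
        return 0.0

    # A read counts as multiply aligned if it has two or more distinct locations.
    multi = sum(1 for locs in locations.values() if len(locs) > 1)
    return multi / len(locations)


# Toy data: read2 and read4 each align to two locations.
example = [
    ("read1", "chr1", 100),
    ("read2", "chr1", 250),
    ("read2", "chr7", 9000),
    ("read3", "chr2", 40),
    ("read4", "chr3", 10),
    ("read4", "chr3", 5000),
]

if __name__ == "__main__":
    print(f"{multi_alignment_fraction(example):.2f}")
```

In practice the alignments would come from an aligner's output file rather than an in-memory list; the counting logic stays the same.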

Results

We examine the underlying genomic features that lead to multiple alignments and investigate how they generate systematic errors in RNA-seq analysis. We develop a general tool, GeneScissors, which exploits machine learning techniques guided by biological knowledge to detect and correct spurious transcriptome inference by existing RNA-seq analysis methods. In our simulation study, GeneScissors can predict spurious transcriptome calls owing to misalignment with an accuracy close to 90%. It provides substantial improvement over the widely used TopHat/Cufflinks or MapSplice/Cufflinks pipelines in both precision and F-measure. On real data, GeneScissors reports 53.6% fewer pseudogenes and 0.97% more expressed and annotated transcripts, when compared with the TopHat/Cufflinks pipeline. In addition, among the 10.0% unannotated transcripts reported by TopHat/Cufflinks, GeneScissors finds that >16.3% of them are false positives.
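As a reminder of how the reported metrics relate to false positives and false negatives, the following is a minimal sketch of precision, recall, and F-measure over sets of transcript calls. It assumes the balanced F1 form of the F-measure, which the abstract does not specify; the gene names and the function name are hypothetical, and the numbers are toy values, not results from the study.

```python
def precision_recall_f(predicted, truth):
    """Compute precision, recall, and balanced F-measure (F1) for set-valued calls."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)

    # Precision: share of reported transcripts that are truly expressed.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of truly expressed transcripts that were reported.
    recall = true_positives / len(truth) if truth else 0.0

    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_measure


# Toy example: one spurious call ("pseudoX") and one missed gene ("geneD").
truth = {"geneA", "geneB", "geneC", "geneD"}
raw_calls = {"geneA", "geneB", "geneC", "pseudoX"}
# Hypothetical result after a filtering step removes the spurious call.
filtered_calls = raw_calls - {"pseudoX"}

if __name__ == "__main__":
    print(precision_recall_f(raw_calls, truth))
    print(precision_recall_f(filtered_calls, truth))
```

Removing a spurious call raises precision without changing recall, so the F-measure rises as well; this is the kind of improvement the abstract describes, shown here only on made-up data.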

Availability

The software can be downloaded at http://csbio.unc.edu/genescissors/.

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Zhang Z

PROVIDER: S-EPMC3694649 | biostudies-literature | 2013 Jul

REPOSITORIES: biostudies-literature

Publications

GeneScissors: a comprehensive approach to detecting and correcting spurious transcriptome inference owing to RNA-seq reads misalignment.

Zhaojun Zhang, Shunping Huang, Jack Wang, Xiang Zhang, Fernando Pardo Manuel de Villena, Leonard McMillan, Wei Wang

Bioinformatics (Oxford, England), 2013 Jul 1, issue 13

PMID: 23812996

Similar Datasets

Detecting, Categorizing, and Correcting Coverage Anomalies of RNA-Seq Quantification.
| S-EPMC6938679 | biostudies-literature

Correcting signal biases and detecting regulatory elements in STARR-seq data.
| S-EPMC8092017 | biostudies-literature

Spurious inference when comparing networks.
| S-EPMC6708361 | biostudies-literature

Optimal schedules of light exposure for rapidly correcting circadian misalignment.
| S-EPMC3983044 | biostudies-literature

StringTie enables improved reconstruction of a transcriptome from RNA-seq reads.
| S-EPMC4643835 | biostudies-literature

CleanUpRNAseq: An R/Bioconductor Package for Detecting and Correcting DNA Contamination in RNA-Seq Data.
| S-EPMC11348166 | biostudies-literature

Transcriptome assembly and isoform expression level estimation from biased RNA-Seq reads.
| S-EPMC3496342 | biostudies-literature

Rnnotator: an automated de novo transcriptome assembly pipeline from stranded RNA-Seq reads.
| S-EPMC3152782 | biostudies-literature

Correcting palindromes in long reads after whole-genome amplification.
| S-EPMC6218980 | biostudies-literature

VeChat: correcting errors in long reads using variation graphs.
| S-EPMC9636371 | biostudies-literature