Dataset Information

Updating RNA-Seq analyses after re-annotation.

ABSTRACT:

Unlabelled

The estimation of isoform abundances from RNA-Seq data requires a time-intensive step of mapping reads to either an assembled or previously annotated transcriptome, followed by an optimization procedure for deconvolution of multi-mapping reads. These procedures are essential for downstream analysis such as differential expression. In cases where it is desirable to adjust the underlying annotation, for example, on the discovery of novel isoforms or errors in existing annotations, current pipelines must be rerun from scratch. This makes it difficult to update abundance estimates after re-annotation, or to explore the effect of changes in the transcriptome on analyses. We present a novel efficient algorithm for updating abundance estimates from RNA-Seq experiments on re-annotation that does not require re-analysis of the entire dataset. Our approach is based on a fast partitioning algorithm for identifying transcripts whose abundances may depend on the added or deleted isoforms, and on a fast follow-up approach to re-estimating abundances for all transcripts. We demonstrate the effectiveness of our methods by showing how to synchronize RNA-Seq abundance estimates with the daily RefSeq incremental updates. Thus, we provide a practical approach to maintaining relevant databases of RNA-Seq derived abundance estimates even as annotations are being constantly revised.

Availability and implementation

Our methods are implemented in software called ReXpress and are freely available, together with source code, at http://bio.math.berkeley.edu/ReXpress/.

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Roberts A

PROVIDER: S-EPMC3694665 | biostudies-literature | 2013 Jul

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

Updating RNA-Seq analyses after re-annotation.

Roberts Adam A Schaeffer Lorian L Pachter Lior L

Bioinformatics (Oxford, England) 20130514 13

<h4>Unlabelled</h4>The estimation of isoform abundances from RNA-Seq data requires a time-intensive step of mapping reads to either an assembled or previously annotated transcriptome, followed by an optimization procedure for deconvolution of multi-mapping reads. These procedures are essential for downstream analysis such as differential expression. In cases where it is desirable to adjust the underlying annotation, for example, on the discovery of novel isoforms or errors in existing annotation ...[more]

PMID: 23677943

Similar Datasets

Project description:Pyrococcus furiosus DSM 3638 is a model organism for hyperthermophilic archaea with an optimal growth temperature near 100°C. The genome was sequenced about 18 years ago. However, some publications suggest that in contrast to other Pyrococcus species, the genome of P. furiosus DSM 3638 is prone to genomic rearrangements. Therefore, we re-sequenced the genome using third generation sequencing techniques. The new de novo assembled genome is 1,889,914 bp in size and exhibits high sequence identity to the published sequence. However, two major deviations were detected: (1) The genome is 18,342 bp smaller than the NCBI reference genome due to a recently described deletion. (2) The region between PF0349 and PF0388 is inverted most likely due an assembly problem for the original sequence. In addition, numerous minor variations, ranging from single nucleotide exchanges, deletions or insertions were identified. The total number of insertion sequence (IS) elements is also reduced from 30 to 24 in the new sequence. Re-sequencing of a 2-year-old "lab culture" using Nanopore sequencing confirmed the overall stability of the P. furiosus DSM 3638 genome even under normal lab conditions without taking any special care. To improve genome annotation, the updated DNA sequence was combined with an RNA sequencing approach. Here, RNAs from eight different growth conditions were pooled to increase the number of detected transcripts. Furthermore, a differential RNA-Seq approach was employed for the identification of transcription start sites (TSSs). In total, 2515 TSSs were detected and classified into 834 primary (pTSS), 797 antisense (aTSS), 739 internal and 145 secondary TSSs. Our analysis of the upstream regions revealed a well conserved archaeal promoter structure. Interrogation of the distances between pTSSs and aTSSs revealed a significant number of antisense transcripts, which are a result of bidirectional transcription from the same TATA box. This mechanism of antisense transcript production could be further confirmed by in vitro transcription experiments. We assume that bidirectional transcription gives rise to non-functional antisense RNAs and that this is a widespread phenomenon in archaea due to the architecture of the TATA element and the symmetric structure of the TATA-binding protein.

Dataset Information

Updating RNA-Seq analyses after re-annotation.

Unlabelled

Availability and implementation

Supplementary information

Publications

Updating RNA-Seq analyses after re-annotation.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets