Dataset Information

Playing hide and seek with repeats in local and global de novo transcriptome assembly of short RNA-seq reads.

ABSTRACT: BACKGROUND: The main challenge in de novo genome assembly of DNA-seq data is certainly to deal with repeats that are longer than the reads. In de novo transcriptome assembly of RNA-seq reads, on the other hand, this problem has been underestimated so far. Even though we have fewer and shorter repeated sequences in transcriptomics, they do create ambiguities and confuse assemblers if not addressed properly. Most transcriptome assemblers of short reads are based on de Bruijn graphs (DBG) and have no clear and explicit model for repeats in RNA-seq data, relying instead on heuristics to deal with them.

RESULTS: The results of this work are threefold. First, we introduce a formal model for representing high copy-number and low-divergence repeats in RNA-seq data and exploit its properties to infer a combinatorial characteristic of repeat-associated subgraphs. We show that the problem of identifying such subgraphs in a DBG is NP-complete. Second, we show that in the specific case of local assembly of alternative splicing (AS) events, we can implicitly avoid such subgraphs, and we present an efficient algorithm to enumerate AS events that are not included in repeats. Using simulated data, we show that this strategy is significantly more sensitive and precise than the previous version of KisSplice (Sacomoto et al. in WABI, pp 99-111, 1), Trinity (Grabherr et al. in Nat Biotechnol 29(7):644-652, 2), and Oases (Schulz et al. in Bioinformatics 28(8):1086-1092, 3), for the specific task of calling AS events. Third, we turn our focus to full-length transcriptome assembly, and we show that exploring the topology of DBGs can improve de novo transcriptome evaluation methods. Based on the observation that repeats create complicated regions in a DBG, and when assemblers try to traverse these regions, they can infer erroneous transcripts, we propose a measure to flag transcripts traversing such troublesome regions, thereby giving a confidence level for each transcript. The originality of our work when compared to other transcriptome evaluation methods is that we use only the topology of the DBG, and not read nor coverage information. We show that our simple method gives better results than Rsem-Eval (Li et al. in Genome Biol 15(12):553, 4) and TransRate (Smith-Unna et al. in Genome Res 26(8):1134-1144, 5) on both real and simulated datasets for detecting chimeras, and therefore is able to capture assembly errors missed by these methods.
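As a rough illustration of why a repeat shared between transcripts complicates a de Bruijn graph, the sketch below builds a small k-mer graph and lists its branching nodes. It is a toy under stated assumptions, not the model, algorithm, or confidence measure of the paper: the function names `build_dbg` and `branching_nodes`, the value k = 4, and the two example sequences are all hypothetical, and reverse complements, sequencing errors, and coverage are ignored.

```python
from collections import defaultdict


def build_dbg(sequences, k):
    """Build a node-centric de Bruijn graph: nodes are k-mers, and an edge
    joins two k-mers that appear consecutively in some input sequence."""
    succ = defaultdict(set)  # k-mer -> set of successor k-mers
    pred = defaultdict(set)  # k-mer -> set of predecessor k-mers
    for seq in sequences:
        for i in range(len(seq) - k):
            a, b = seq[i:i + k], seq[i + 1:i + k + 1]
            succ[a].add(b)
            pred[b].add(a)
    return succ, pred


def branching_nodes(succ, pred):
    """Return k-mers with more than one successor or more than one predecessor."""
    nodes = set(succ) | set(pred)
    return {
        n for n in nodes
        if len(succ.get(n, ())) > 1 or len(pred.get(n, ())) > 1
    }


# Two hypothetical transcripts that share the repeat "GATTACA".
t1 = "AACC" + "GATTACA" + "GGTT"
t2 = "TGCA" + "GATTACA" + "TCGA"

succ, pred = build_dbg([t1, t2], k=4)
print(sorted(branching_nodes(succ, pred)))  # ['GATT', 'TACA']
```

The two branching k-mers sit at the entry and exit of the shared repeat: a traversal entering from one transcript can leave into the other, which is how a chimeric path can arise. Without the shared repeat, the same flanking sequences give two disjoint linear paths and no branching nodes.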

SUBMITTER: Lima L

PROVIDER: S-EPMC5322684 | biostudies-literature | 2017

REPOSITORIES: biostudies-literature

Publications

Playing hide and seek with repeats in local and global de novo transcriptome assembly of short RNA-seq reads.

Lima L, Sinaimeri B, Sacomoto G, Lopez-Maestre H, Marchet C, Miele V, Sagot MF, Lacroix V

Algorithms for Molecular Biology : AMB, 2017-02-22

PMID: 28250805

Similar Datasets

Local de novo assembly of RAD paired-end contigs using short sequencing reads.
| S-EPMC3076424 | biostudies-literature

Unsupervised discovery of behaviorally relevant brain states in rats playing hide-and-seek.
| S-EPMC9245901 | biostudies-literature

Meraculous: de novo genome assembly with short paired-end reads.
| S-EPMC3158087 | biostudies-literature

REPdenovo: Inferring De Novo Repeat Motifs from Short Sequence Reads.
| S-EPMC4792456 | biostudies-literature

Hybrid de novo tandem repeat detection using short and long reads.
| S-EPMC4582210 | biostudies-literature

FastUniq: a fast de novo duplicates removal tool for paired short reads.
| S-EPMC3527383 | biostudies-literature

Rapid, robust plasmid verification by de novo assembly of short sequencing reads.
| S-EPMC7544192 | biostudies-literature

Rnnotator: an automated de novo transcriptome assembly pipeline from stranded RNA-Seq reads.
| S-EPMC3152782 | biostudies-literature

Immunogenomics: molecular hide and seek.
| S-EPMC3525153 | biostudies-literature

Playing hide and seek: how glycosylation of the influenza virus hemagglutinin can modulate the immune response to infection.
| S-EPMC3970151 | biostudies-literature