Dataset Information

SuRankCo: supervised ranking of contigs in de novo assemblies.

ABSTRACT:

Background

Evaluating the quality and reliability of a de novo assembly and of single contigs in particular is challenging since commonly a ground truth is not readily available and numerous factors may influence results. Currently available procedures provide assembly scores but lack a comparative quality ranking of contigs within an assembly.

Results

We present SuRankCo, which relies on a machine learning approach to predict quality scores for contigs and to enable the ranking of contigs within an assembly. The result is a sorted contig set which allows selective contig usage in downstream analysis. Benchmarking on datasets with known ground truth shows promising sensitivity and specificity and favorable comparison to existing methodology.

Conclusions

SuRankCo analyzes the reliability of de novo assemblies on the contig level and thereby allows quality control and ranking prior to further downstream and validation experiments.

SUBMITTER: Kuhring M

PROVIDER: S-EPMC4520199 | biostudies-literature | 2015 Jul

REPOSITORIES: biostudies-literature

Publications

SuRankCo: supervised ranking of contigs in de novo assemblies.

Kuhring M, Dabrowski PW, Piro VC, Nitsche A, Renard BY

BMC Bioinformatics, 2015 Jul 30

PMID: 26224355

Similar Datasets

Project description:

Background

For plant species with unsequenced genomes, cDNA contigs created by de novo assembly of RNA-Seq reads are used as reference sequences for comparative analysis of RNA-Seq datasets and the detection of differentially expressed genes (DEGs). Redundancies in such contigs are evident in previous RNA-Seq studies, and such redundancies can lead to difficulties in subsequent analysis. Nevertheless, the effects of removing redundancy from contig assemblies on comparative RNA-Seq analysis have not been evaluated.

Results

Here we describe a method for removing redundancy from raw contigs that were primarily created by de novo assembly of Arabidopsis thaliana RNA-Seq reads. Specifically, the contigs with the highest bit scores were selected from raw contigs by a homology search against the gene dataset in the TAIR10 database. Two existing methods for removing redundancy, based on contig length or clustering analysis, were also used to eliminate redundancies from raw contigs. Contig number was reduced most effectively with the method based on homology search. In a comparative analysis of RNA-Seq datasets, DEGs detected in contigs that underwent redundancy removal via the homology search method showed the highest identity to the DEGs detected when the TAIR10 gene dataset was used as an exact reference. Redundancy in raw contigs could also be removed by a homology search against integrated protein datasets from several plant species other than A. thaliana. DEGs detected using contigs that underwent such redundancy removal also showed high homology to DEGs detected using the TAIR10 gene dataset.

Conclusion

Here we describe a method for removing redundant contigs within raw contigs; this method involves a homology search against a gene or protein database. In principle, this method can be used with unsequenced plant genomes that lack a well-developed gene database. Redundant contigs were not removed adequately via either of two existing methods, but our method allowed for removal of all redundant contigs. To our knowledge, this is the first reported improvement in accurate detection of DEGs via comparative RNA-Seq analysis that involved preparation of a non-redundant reference sequence. This method could be used to rapidly and cost-effectively detect useful genes in unsequenced plants.
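The selection step described above (keeping, for each reference gene, the contig with the highest bit score) can be illustrated with a minimal sketch. The description does not name the search tool or its output format, so this example assumes tab-separated hits in the standard 12-column BLAST tabular layout (query ID in column 1, subject ID in column 2, bit score in column 12). The function name `best_contig_per_gene` and the sample rows are hypothetical and are not taken from the study.

```python
from typing import Dict, Iterable, Tuple


def best_contig_per_gene(hit_lines: Iterable[str]) -> Dict[str, str]:
    """Return, for each reference gene, the contig with the highest bit score.

    Assumption: each line is a tab-separated hit in 12-column BLAST tabular
    layout (contig ID, gene ID, ..., bit score as the 12th field).
    """
    best: Dict[str, Tuple[str, float]] = {}
    for line in hit_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        contig, gene, bitscore = fields[0], fields[1], float(fields[11])
        # Keep this contig only if it beats the best score seen for the gene.
        if gene not in best or bitscore > best[gene][1]:
            best[gene] = (contig, bitscore)
    return {gene: contig for gene, (contig, _score) in best.items()}


# Illustrative, made-up hit rows (not real search output).
example_hits = [
    "contigA\tgene1\t98.5\t500\t5\t0\t1\t500\t1\t500\t1e-100\t900",
    "contigB\tgene1\t99.0\t800\t3\t0\t1\t800\t1\t800\t0.0\t1450",
    "contigC\tgene2\t97.0\t300\t9\t0\t1\t300\t1\t300\t1e-80\t520",
]
non_redundant = best_contig_per_gene(example_hits)
```

Contigs that never appear as the best hit for any gene are dropped by this sketch; whether the original method treats contigs without hits the same way is not stated in the description.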
