Dataset Information

SPA: a short peptide assembler for metagenomic data.

ABSTRACT: The metagenomic paradigm allows for an understanding of the metabolic and functional potential of microbes in a community via a study of their proteins. The substrate for protein identification is either the set of individual nucleotide reads generated from metagenomic samples or the set of contig sequences produced by assembling these reads. However, a read-based strategy using reads generated by next-generation sequencing (NGS) technologies results in an overwhelming majority of partial-length protein predictions. A nucleotide assembly-based strategy does not fare much better, as metagenomic assemblies are typically fragmented and also leave a large fraction of reads unassembled. Here, we present a method for reconstructing complete protein sequences directly from NGS metagenomic data. Our framework is based on a novel short peptide assembler (SPA) that assembles protein sequences from their constituent peptide fragments identified on short reads. The SPA algorithm is based on informed traversals of a de Bruijn graph, defined on an amino acid alphabet, to identify probable paths that correspond to proteins. Using large simulated and real metagenomic data sets, we show that our method outperforms the alternative approach of identifying genes on nucleotide sequence assemblies and generates longer protein sequences that can be more effectively analysed.
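
ILLUSTRATION: The idea of a de Bruijn graph over an amino acid alphabet can be sketched in a few lines of Python. This is not the SPA algorithm: the abstract does not specify how its "informed traversals" choose paths, so the sketch below substitutes a simple greedy rule (follow the most frequent unvisited edge). The function names build_graph and greedy_path, the value of k, and the example peptides are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_graph(peptides, k):
    """Build a de Bruijn graph on amino-acid k-mers.

    Nodes are (k-1)-mers; each k-mer seen in a peptide fragment adds one
    to the count of the edge from its prefix node to its suffix node.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for pep in peptides:
        for i in range(len(pep) - k + 1):
            kmer = pep[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


def greedy_path(graph, start):
    """Walk from `start`, always taking the highest-count edge to an
    unvisited node, and return the amino-acid sequence spelled by the path.

    Assumption: this greedy rule stands in for SPA's informed traversal,
    which is not described in the abstract.
    """
    seq = start
    node = start
    visited = {start}
    while True:
        candidates = {n: c for n, c in graph.get(node, {}).items()
                      if n not in visited}
        if not candidates:
            break
        node = max(candidates, key=candidates.get)
        visited.add(node)
        seq += node[-1]
    return seq


# Three overlapping peptide fragments of a made-up protein.
fragments = ["MKTAY", "TAYIA", "YIAKQ"]
assembled = greedy_path(build_graph(fragments, k=4), "MKT")
print(assembled)  # MKTAYIAKQ
```

In this toy case the three fragments overlap enough that a single unbranched path recovers the full sequence. Real metagenomic data would produce branching graphs, which is where the choice of traversal strategy matters.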

SUBMITTER: Yang Y

PROVIDER: S-EPMC3632116 | biostudies-literature | 2013 Apr

REPOSITORIES: biostudies-literature

Publications

SPA: a short peptide assembler for metagenomic data.

Youngik Yang, Shibu Yooseph

Nucleic Acids Research, 2013-02-23, issue 8

PMID: 23435317

Similar Datasets

Project description: Sequencing of the spa gene of methicillin-resistant Staphylococcus aureus (MRSA) is used for assigning spa types, e.g., to detect transmission and control outbreaks. Traditionally, spa typing is performed by Sanger sequencing, but in recent years it has been replaced by whole-genome sequencing (WGS) in some laboratories. Spa typing by WGS involves de novo assembly of millions of short sequencing reads into larger contiguous sequences, from which the spa type is then determined. The choice of assembly program therefore potentially impacts the spa typing result. In this study, WGS of 1,754 MRSA isolates was followed by de novo assembly using the assembly programs SPAdes (with two different sets of parameters) and SKESA. The spa types were assigned and compared to the spa types obtained by Sanger sequencing, regarding the latter as the correct spa types. SPAdes with the two different settings resulted in assembly of the correct spa type for 84.8% and 97.6% of the isolates, respectively, while SKESA assembled the correct spa type in 98.6% of cases. The misassembled spa types were generally two spa repeats shorter than the correct spa type and mainly included spa types with repetition of the same repeats. WGS-based spa typing is thus very accurate compared to Sanger sequencing, when the best assembly program for this purpose is used. IMPORTANCE: spa typing of methicillin-resistant Staphylococcus aureus (MRSA) is widely used by clinicians, infection control workers, and researchers both in local outbreak investigations and as an easy way to communicate and compare MRSA types between laboratories and countries. Traditionally, spa types are determined by Sanger sequencing, but in recent years a whole-genome sequencing (WGS)-based approach has become increasingly used. In this study, we compared spa typing by WGS using different methods for assembling the genome from short sequencing reads, and compared the results to Sanger sequencing as the gold standard. We find substantial differences in correct assembly of spa types between the assembly methods. Our findings are therefore important for the quality of WGS-based spa typing data being exchanged by clinical microbiology laboratories.

Project description: With the exponential growth of sequence information stored over the last decade, including that of de novo assembled contigs from RNA-Seq experiments, quantification of chimeric sequences has become essential when assembling read data. In transcriptomics, de novo assembled chimeras can closely resemble underlying transcripts, but patterns such as those seen between co-evolving sites, or mapped read counts, become obscured. We have created a de Bruijn-based de novo assembler for RNA-Seq data that utilizes a classification system to describe the complexity of underlying graphs from which contigs are created. Each contig is labelled with one of three levels, indicating whether or not ambiguous paths exist. A by-product of this is information on the range of complexity of the underlying gene families present. As a demonstration of CStone's ability to assemble high-quality contigs, and to label them in this manner, both simulated and real data were used. For simulated data, ten million read pairs were generated from cDNA libraries representing four species: Drosophila melanogaster, Panthera pardus, Rattus norvegicus and Serinus canaria. These were assembled using CStone, Trinity and rnaSPAdes, the latter two being high-quality, well-established de novo assemblers. For real data, two RNA-Seq datasets, each consisting of ≈30 million read pairs, representing two adult D. melanogaster whole-body samples were used. The contigs that CStone produced were comparable in quality to those of Trinity and rnaSPAdes in terms of length, sequence identity of aligned regions and the range of cDNA transcripts represented, whilst providing additional information on chimerism. Here we describe the details of CStone's assembly and classification process, and propose that similar classification systems can be incorporated into other de novo assembly tools. Within a related side study, we explore the effects that chimeras within reference sets have on the identification of differentially expressed genes. CStone is available at: https://sourceforge.net/projects/cstone/.
