Dataset Information

Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species.

ABSTRACT: BACKGROUND: The process of generating raw genome sequence data continues to become cheaper, faster, and more accurate. However, assembly of such data into high-quality, finished genome sequences remains challenging. Many genome assembly tools are available, but they differ greatly in terms of their performance (speed, scalability, hardware requirements, acceptance of newer read technologies) and in their final output (composition of assembled sequence). More importantly, it remains largely unclear how to best assess the quality of assembled genome sequences. The Assemblathon competitions are intended to assess current state-of-the-art methods in genome assembly. RESULTS: In Assemblathon 2, we provided a variety of sequence data to be assembled for three vertebrate species (a bird, a fish, and a snake). This resulted in a total of 43 submitted assemblies from 21 participating teams. We evaluated these assemblies using a combination of optical map data, Fosmid sequences, and several statistical methods. From over 100 different metrics, we chose ten key measures by which to assess the overall quality of the assemblies. CONCLUSIONS: Many current genome assemblers produced useful assemblies, containing a significant representation of their genes and overall genome structure. However, the high degree of variability between the entries suggests that there is still much room for improvement in the field of genome assembly and that approaches which work well in assembling the genome of one species may not necessarily work well for another.

SUBMITTER: Bradnam KR

PROVIDER: S-EPMC3844414 | biostudies-literature | 2013 Jul

REPOSITORIES: biostudies-literature


Publications

Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species.

Bradnam KR, Fass JN, Alexandrov A, Baranay P, Bechner M, Birol I, Boisvert S, Chapman JA, Chapuis G, Chikhi R, Chitsaz H, Chou WC, Corbeil J, Del Fabbro C, Docking TR, Durbin R, Earl D, Emrich S, Fedotov P, Fonseca NA, Ganapathy G, Gibbs RA, Gnerre S, Godzaridis E, Goldstein S, Haimel M, Hall G, Haussler D, Hiatt JB, Ho IY, Howard J, Hunt M, Jackman SD, Jaffe DB, Jarvis ED, Jiang H, Kazakov S, Kersey PJ, Kitzman JO, Knight JR, Koren S, Lam TW, Lavenier D, Laviolette F, Li Y, Li Z, Liu B, Liu Y, Luo R, Maccallum I, Macmanes MD, Maillet N, Melnikov S, Naquin D, Ning Z, Otto TD, Paten B, Paulo OS, Phillippy AM, Pina-Martins F, Place M, Przybylski D, Qin X, Qu C, Ribeiro FJ, Richards S, Rokhsar DS, Ruby JG, Scalabrin S, Schatz MC, Schwartz DC, Sergushichev A, Sharpe T, Shaw TI, Shendure J, Shi Y, Simpson JT, Song H, Tsarev F, Vezzi F, Vicedomini R, Vieira BM, Wang J, Wang J, Worley KC, Yin S, Yiu SM, Yuan J, Zhang G, Zhang H, Zhou S, Korf IF

GigaScience, 2013-07-22, 1


PMID: 23870653

Similar Datasets

Project description: Venom-gland transcriptomics is a key tool in the study of the evolution, ecology, function, and pharmacology of animal venoms. In particular, gene-expression variation and coding sequences gained through transcriptomics provide key information for explaining functional venom variation over both ecological and evolutionary timescales. The accuracy and usefulness of inferences made through transcriptomics, however, are limited by the accuracy of the transcriptome assembly, which is a bioinformatic problem with several possible solutions. Several methods have been employed to assemble venom-gland transcriptomes, with the Trinity assembler being the most commonly applied among them. Although previous evidence of variation in performance among assembly software exists, particularly regarding recovery of difficult-to-assemble multigene families such as snake venom metalloproteinases, much work to date still employs a single assembly method. We evaluated the performance of several commonly used de novo assembly methods for the recovery of both nontoxin transcripts and complete, high-quality venom-gene transcripts across eleven snake and four scorpion transcriptomes. We varied k-mer sizes used by some assemblers to evaluate the impact of k-mer length on transcript recovery. We showed that the recovery of nontoxin transcripts and toxin transcripts is best accomplished through different assembly software, with SDT at smaller k-mer lengths and Trinity being best for nontoxin recovery and a combination of SeqMan NGen and a seed-and-extend approach implemented in Extender being the best means of recovering a complete set of toxin transcripts. In particular, Extender was the only means tested capable of assembling multiple isoforms of the diverse snake venom metalloproteinase family, while traditional approaches such as Trinity recovered at most one metalloproteinase transcript. Our work demonstrated that traditional metrics of assembly performance are not predictive of performance in the recovery of complete and high-quality toxin genes. Instead, effective venom-gland transcriptomic studies should combine and quality-filter the results of several assemblers with varying algorithmic strategies.

Project description: The whole-genome sequence assembly (WGSA) problem is among the most studied problems in computational biology. Despite the availability of a plethora of tools (i.e., assemblers), all claiming to have solved the WGSA problem, little has been done to systematically compare their accuracy and power. Traditional methods rely on standard metrics and read simulation. On the one hand, metrics like N50 and number of contigs focus only on size without proportionately emphasizing the correctness of the assembly; on the other hand, comparisons performed on simulated datasets can be highly biased by the non-realistic assumptions in the underlying read generator. Recently the Feature Response Curve (FRC) method was proposed to assess the overall assembly quality and correctness: FRC transparently captures the trade-offs between contigs' quality and their sizes. Nevertheless, the relationship among the different features and their relative importance remains unknown. In particular, FRC cannot account for the correlation among the different features. We analyzed the correlation among different features in order to better describe their relationships and their importance in gauging assembly quality and correctness. In particular, using multivariate techniques like principal and independent component analysis we were able to estimate the "excess-dimensionality" of the feature space. Moreover, principal component analysis allowed us to show how poorly the acclaimed N50 metric describes the assembly quality. Applying independent component analysis we identified a subset of features that better describe the assemblers' performance. We demonstrated that by focusing on a reduced set of highly informative features we can use the FRC to better describe and compare the performance of different assemblers. Moreover, as a by-product of our analysis, we discovered how often evaluation based on simulated data, obtained with state-of-the-art simulators, leads to not-so-realistic results.
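The remark that N50 reflects contig size but not correctness can be illustrated with a minimal sketch. The contig lengths below are made-up toy values, and the helper name `n50` is an assumption for illustration, not part of any tool described here. The sketch computes N50 as the length of the contig at which the running total, taken from the longest contig downward, first reaches half of the assembly length.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical assembly: four contigs, total length 100.
correct = [40, 30, 20, 10]
# Same assembly after wrongly joining the two longest contigs (a misjoin).
misjoined = [70, 20, 10]

print(n50(correct))    # 30
print(n50(misjoined))  # 70
```

The misjoined assembly scores a higher N50 even though it contains an error, which is why size-only metrics need to be paired with correctness measures.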

Project description: Motivation: There are very few methods for de novo genome assembly based on the overlap graph approach. This approach is considered to give more exact results than the so-called de Bruijn graph approach, but at the cost of much greater running time and memory usage. It is not uncommon for assembly methods based on the overlap graph model to fail on larger datasets, mainly because of the memory limits of a computer. This is why mainly de Bruijn-based assembly methods, fast and fairly accurate, have been developed in recent decades. However, the latter methods can fail for longer or more repetitive genomes, as they decompose reads into shorter fragments and lose part of the information. An efficient assembler for processing big datasets using the overlap graph model is still sought. Results: We propose a new genome-scale de novo assembler based on the overlap graph approach, designed for short-read sequencing data. The method, ALGA, incorporates several new ideas resulting in more exact contigs produced in a short time. Among these ideas are the creation of a sparse but quite informative graph, a reduction of the graph including a procedure related to the minimum spanning tree problem on a local subgraph, and a graph traversal combined with simultaneous analysis of the contigs stored so far. Unusually for genome assembly, the algorithm is almost parameter-free, with only one optional parameter to be set by a user. ALGA was compared with nine state-of-the-art assemblers in tests on genome-scale sequencing data obtained from real experiments on six organisms, differing in size, coverage, GC content and repetition rate. ALGA produced the best results in the sense of overall quality of genome reconstruction, understood as a good balance between genome coverage, accuracy and length of resulting sequences. The algorithm is one of the tools used to process data in the ongoing national project Genomic Map of Poland. Availability and implementation: ALGA is available at http://alga.put.poznan.pl. Supplementary information: Supplementary data are available at Bioinformatics online.
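The information loss that comes from decomposing reads into shorter fragments can be shown with a small sketch. It is not ALGA's algorithm or any specific assembler's code; the two sequences and the helper `kmer_counts` are hypothetical toy examples. Both sequences contain the repeat "CG" three times with the segments between the repeats swapped, so they yield exactly the same 3-mers and cannot be told apart from k-mer content alone, whereas a read spanning the repeats would distinguish them.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every length-k substring (k-mer) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Two different toy sequences sharing the repeat "CG" (hypothetical data).
seq_a = "ACGTCGACGG"
seq_b = "ACGACGTCGG"

same_kmers = kmer_counts(seq_a, 3) == kmer_counts(seq_b, 3)
print(seq_a != seq_b, same_kmers)  # True True
```

Increasing k beyond the repeat length separates the two sequences again, which is one reason k-mer size matters for repetitive genomes.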

