Dataset Information

Comparison of De Novo Transcriptome Assemblers and k-mer Strategies Using the Killifish, Fundulus heteroclitus.

ABSTRACT: BACKGROUND: De novo assembly of non-model organisms' transcriptomes has been on the rise in concert with the number of de novo transcriptome assembly software programs. There is a knowledge gap as to which assembler or k-mer strategy is best for constructing an optimal de novo assembly. Additionally, there is a lack of consensus on which evaluation metrics should be used to assess the quality of de novo transcriptome assemblies. RESULT: Six assembly strategies were evaluated using four assemblers. Trinity was run with its default single k-mer value of 25, while Bridger, Oases, and SOAPdenovo-Trans were run with multiple k-mer strategies. Bridger, Oases, and SOAPdenovo-Trans used a small multiple k-mer (SMK) strategy consisting of k-mer lengths of 21, 25, 27, 29, 31, and 33. Additionally, Oases and SOAPdenovo-Trans were run using a large multiple k-mer (LMK) strategy consisting of k-mer lengths of 25, 35, 45, 55, 65, 75, and 85. Eleven metrics were used to evaluate each assembly strategy: three genome-related evaluation metrics (contig number, N50 length, and contigs >1 kb) and eight transcriptome evaluation metrics (reads mapped back to transcripts (RMBT), number of full-length transcripts, number of open reading frames, DETONATE RSEM-EVAL score, and percent alignment to the southern platyfish, Amazon molly, BUSCO, and CEGMA databases). The best-performing strategy, judged by how often it ranked within the top three for each evaluation metric, was the Bridger assembly (10 of 11), followed by the Oases SMK assembly (8 of 11), the Oases LMK assembly (6 of 11), the Trinity assembly (4 of 11), the SOAP LMK assembly (4 of 11), and the SOAP SMK assembly (3 of 11). CONCLUSION: This in-depth investigation of multiple k-mer strategies concluded that the assembler itself had a greater impact than k-mer size, regardless of the strategy employed. Additionally, the comprehensive set of transcriptome evaluation metrics used in this study highlighted the need to choose metrics centered on user-defined research goals. Based on the evaluation metrics performed, the Bridger assembler constructed the best assembly of the testis transcriptome in Fundulus heteroclitus.
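
The three genome-related metrics named in the abstract (contig number, N50 length, contigs >1 kb) can be illustrated with a minimal Python sketch. It uses the common definition of N50 (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembled bases); whether the study computed N50 in exactly this way is an assumption. The function name and the toy contig lengths are hypothetical and are not taken from the dataset.

```python
def assembly_length_metrics(contig_lengths, min_long=1000):
    """Return contig count, N50, and the number of contigs longer than min_long bases."""
    # Sort from longest to shortest so the cumulative sum starts with the largest contigs.
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)

    n50 = 0
    running = 0
    for length in lengths:
        running += length
        # N50 is the contig length at which the cumulative sum reaches half the total.
        if running * 2 >= total:
            n50 = length
            break

    return {
        "contig_number": len(lengths),
        "n50": n50,
        "contigs_over_1kb": sum(1 for x in lengths if x > min_long),
    }


# Toy contig lengths in bases (made up for illustration, not results from the study).
toy_lengths = [4000, 3000, 2000, 1000, 500, 500]
metrics = assembly_length_metrics(toy_lengths)
print(metrics)
```

In practice the lengths would be read from an assembler's FASTA output. These length-based metrics describe contiguity only, which is why the abstract pairs them with transcriptome-specific metrics such as RMBT and BUSCO alignment.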

SUBMITTER: Rana SB

PROVIDER: S-EPMC4824410 | biostudies-literature | 2016

REPOSITORIES: biostudies-literature

Publications

Comparison of De Novo Transcriptome Assemblers and k-mer Strategies Using the Killifish, Fundulus heteroclitus.

Rana SB, Zadlock FJ, Zhang Z, Murphy WR, Bentivegna CS

PLoS One, 7 April 2016, issue 4

PMID: 27054874

Similar Datasets

Project description: Roche 454 pyrosequencing has become a method of choice for generating transcriptome data from non-model organisms. Once the tens to hundreds of thousands of short (250-450 base) reads have been produced, it is important to assemble them correctly to estimate the sequence of all the transcripts. Most transcriptome assembly projects use only one program for assembling 454 pyrosequencing reads, but there is no evidence that the programs used to date are optimal. We have carried out a systematic comparison of five assemblers (CAP3, MIRA, Newbler, SeqMan and CLC) to establish best practices for transcriptome assemblies, using a new dataset from the parasitic nematode Litomosoides sigmodontis. Although no single assembler performed best on all our criteria, Newbler 2.5 gave longer contigs and better alignments to some reference sequences, and was fast and easy to use. SeqMan assemblies performed best on the criterion of recapitulating known transcripts, and had more novel sequence than the other assemblers, but generated an excess of small, redundant contigs. The remaining assemblers all performed almost as well, with the exception of Newbler 2.3 (the version currently used by most assembly projects), which generated assemblies that had significantly lower total length. As different assemblers use different underlying algorithms to generate contigs, we also explored merging of assemblies and found that the merged datasets not only aligned better to reference sequences than individual assemblies, but were also more consistent in the number and size of contigs. Transcriptome assemblies are smaller than genome assemblies and thus should be more computationally tractable, but are often harder because individual contigs can have highly variable read coverage. Comparing single assemblers, Newbler 2.5 performed best on our trial data set, but other assemblers were closely comparable. However, combining differently optimal assemblies from different programs gave a more credible final product, and this strategy is recommended.

Project description: Arsenic is a contaminant found worldwide in drinking water and food. Epidemiological studies have correlated arsenic exposure with reduced weight gain and improper muscular development, while in vitro studies show that arsenic exposure impairs myogenic differentiation. The purpose of this study was to use Fundulus heteroclitus, or killifish, as a model organism to determine whether embryonic-only arsenic exposure permanently reduces the number or function of muscle satellite cells. Killifish embryos were exposed to 0, 50, 200, or 800 ppb arsenite (AsIII) until hatching, and then juvenile fish were raised in clean water. At 28, 40, and 52 weeks after hatching, skeletal muscle injuries were induced by injecting cardiotoxin into the trunk of the fish just posterior to the dorsal fin. Muscle sections were collected at 0, 3, and 10 days post-injury. Collagen levels were used to assess muscle tissue damage and recovery, while levels of proliferating cell nuclear antigen (PCNA) and myogenin were quantified to compare proliferating cells and newly formed myoblasts. At 28 weeks of age, baseline collagen levels were 105% and 112% greater in the 200 and 800 ppb groups, respectively, and at 52 weeks of age, were 58% higher than controls in the 200 ppb fish. After cardiotoxin injury, collagen levels tended to increase to a greater extent and take longer to resolve in the arsenic-exposed fish. The number of baseline PCNA(+) cells was 48-216% greater in 800 ppb exposed fish compared to controls, depending on the week examined. However, following cardiotoxin injury, PCNA was reduced at 28 weeks in 200 and 800 ppb fish at day 3 of the recovery period. By 52 weeks, there were significant reductions in PCNA in all exposure groups at day 3 of the recovery period. Based on these results, embryonic arsenic exposure increases baseline collagen levels and PCNA(+) cells in skeletal muscle. However, when these fish are challenged with a muscle injury, the proliferation and differentiation of satellite cells into myogenic precursors is impaired and the fish instead appear to favor a fibrotic resolution to the injury.
