Dataset Information

SEED: efficient clustering of next-generation sequences.

ABSTRACT: MOTIVATION: Similarity clustering of next-generation sequences (NGS) is an important computational problem to study the population sizes of DNA/RNA molecules and to reduce the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads. RESULTS: Here, we introduce SEED, an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on the hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in <4 h with linear time and memory performance. When SEED was used as a preprocessing tool on genome/transcriptome assembly data, it reduced the time and memory requirements of the Velvet/Oasis assembler for the datasets used in this study by 60-85% and 21-41%, respectively. In addition, the assemblies contained longer contigs than those from non-preprocessed data, as indicated by 12-27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS data similar to true cluster results, with a 2- to 10-fold better time performance. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as a stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms. AVAILABILITY: The SEED software can be downloaded for free from this site: http://manuals.bioinformatics.ucr.edu/home/seed. CONTACT: thomas.girke@ucr.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
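The hash-table clustering idea in the abstract can be illustrated with a small sketch. This is not SEED's implementation and does not reproduce its block spaced seeds: it assumes equal-length reads, ignores overhanging residues, and swaps in a simpler pigeonhole scheme (split each read into four blocks, so two reads with at most three mismatches must share at least one identical block). The names `blocks`, `hamming` and `cluster_reads` are hypothetical, and the first unassigned read stands in for a "virtual center".

```python
from collections import defaultdict

MAX_MISMATCHES = 3                # mismatch limit quoted in the abstract
NUM_BLOCKS = MAX_MISMATCHES + 1   # pigeonhole: at least one block matches exactly


def blocks(read):
    """Yield (block index, block sequence) for NUM_BLOCKS contiguous blocks."""
    size = len(read) // NUM_BLOCKS
    for i in range(NUM_BLOCKS):
        end = (i + 1) * size if i < NUM_BLOCKS - 1 else len(read)
        yield (i, read[i * size:end])


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def cluster_reads(reads):
    """Greedily cluster equal-length reads (assumption: no overhangs).

    Returns a dict mapping each center's index to its member indices.
    """
    if len({len(r) for r in reads}) > 1:
        raise ValueError("this sketch only handles equal-length reads")

    # Hash table: (block index, block sequence) -> indices of reads with it.
    index = defaultdict(list)
    for idx, read in enumerate(reads):
        for key in blocks(read):
            index[key].append(idx)

    assigned = [False] * len(reads)
    clusters = {}
    for center, read in enumerate(reads):
        if assigned[center]:
            continue
        assigned[center] = True
        members = [center]
        # Only reads sharing an exact block with the center can qualify.
        candidates = set()
        for key in blocks(read):
            candidates.update(index[key])
        for cand in sorted(candidates):
            if not assigned[cand] and hamming(read, reads[cand]) <= MAX_MISMATCHES:
                assigned[cand] = True
                members.append(cand)
        clusters[center] = members
    return clusters


reads = [
    "AAAAAAAAAAAA",  # becomes the first center
    "AAAAAAAAAAAT",  # 1 mismatch from read 0
    "TAAATAAATAAA",  # 3 mismatches from read 0
    "CCCCCCCCCCCC",  # unrelated read
    "TAAATAAATAAT",  # 4 mismatches from read 0
]
clusters = cluster_reads(reads)
```

Because candidates are looked up through the hash table rather than by all-against-all comparison, each center is only compared with reads that share an exact block with it. Greedy assignment makes the result depend on input order, which is one of the ways this sketch differs from a method that selects centers deliberately.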

SUBMITTER: Bao E

PROVIDER: S-EPMC3167058 | biostudies-literature | 2011 Sep

REPOSITORIES: biostudies-literature


Publications

SEED: efficient clustering of next-generation sequences.

Bao E, Jiang T, Kaloshian I, Girke T

Bioinformatics (Oxford, England), 2011-08-02, issue 18


PMID: 21810899

Similar Datasets

Project description: Background: Potato seed tubers are colonized and inhabited by soil-borne microbes that can affect the performance of the emerging daughter plant in the next season. In this study, we investigated the intergenerational inheritance of microbiota from seed tubers to next-season daughter plants under field conditions by amplicon sequencing of bacterial and fungal microbiota associated with tubers and roots, and tracked the microbial transmission from different seed tuber compartments to sprouts. Results: We observed that field of production and potato genotype significantly (P < 0.01) affected the composition of the seed tuber microbiome and that these differences persisted during winter storage of the seed tubers. Remarkably, when seed tubers from different production fields were planted in a single trial field, the microbiomes of daughter tubers and roots of the emerging plants could still be distinguished (P < 0.01) according to the production field of the seed tuber. Surprisingly, we found little vertical inheritance of field-unique microbes from the seed tuber to the daughter tubers and roots, constituting less than 0.2% of their respective microbial communities. However, under controlled conditions, around 98% of the sprout microbiome was found to originate from the seed tuber and had retained its field-specific patterns. Conclusions: The field of production shapes the microbiome of seed tubers, emerging potato plants and even the microbiome of newly formed daughter tubers. Different compartments of seed tubers harbor distinct microbiomes. Both bacteria and fungi on seed tubers have the potential of being vertically transmitted to the sprouts, and the sprout subsequently promotes proliferation of a select number of microbes from the seed tuber. Recognizing the role of plant microbiomes in plant health, the initial microbiome of seed tubers specifically or planting materials in general is an overlooked trait.
Elucidating the relative importance of the initial microbiome and the mechanisms by which the origin of planting materials affect microbiome assembly will pave the way for the development of microbiome-based predictive models that may predict the quality of seed tuber lots, ultimately facilitating microbiome-improved potato cultivation.

Project description: BACKGROUND: When compared to Sanger sequencing technology, next-generation sequencing (NGS) technologies are hindered by shorter sequence read length, higher base-call error rate, non-uniform coverage, and platform-specific sequencing artifacts. These characteristics lower the quality of their downstream analyses, e.g. de novo and reference-based assembly, by introducing sequencing artifacts and errors that may contribute to incorrect interpretation of data. Although many tools have been developed for quality control and pre-processing of NGS data, none of them provide flexible and comprehensive trimming options in conjunction with parallel processing to expedite pre-processing of large NGS datasets. METHODS: We developed ngsShoRT (next-generation sequencing Short Reads Trimmer), a flexible and comprehensive open-source software package written in Perl that provides a set of algorithms commonly used for pre-processing NGS short read sequences. We compared the features and performance of ngsShoRT with existing tools: CutAdapt, NGS QC Toolkit and Trimmomatic. We also compared the effects of using pre-processed short read sequences generated by different algorithms on de novo and reference-based assembly for three different genomes: Caenorhabditis elegans, Saccharomyces cerevisiae S288c, and Escherichia coli O157 H7. RESULTS: Several combinations of ngsShoRT algorithms were tested on publicly available Illumina GA II, HiSeq 2000, and MiSeq eukaryotic and bacterial genomic short read sequences with the focus on removing sequencing artifacts and low-quality reads and/or bases. Our results show that across three organisms and three sequencing platforms, trimming improved the mean quality scores of trimmed sequences. Using trimmed sequences for de novo and reference-based assembly improved assembly quality as well as assembler performance.
In general, ngsShoRT outperformed comparable trimming tools in terms of trimming speed and improvement of de novo and reference-based assembly as measured by assembly contiguity and correctness. CONCLUSIONS: Trimming of short read sequences can improve the quality of de novo and reference-based assembly and assembler performance. The parallel processing capability of ngsShoRT reduces trimming time and improves the memory efficiency when dealing with large datasets. We recommend combining sequencing artifact removal with quality-score-based read filtering and base trimming as the most consistent method for improving sequence quality and downstream assemblies. ngsShoRT source code, user guide and tutorial are available at http://research.bioinformatics.udel.edu/genomics/ngsShoRT/. ngsShoRT can be incorporated as a pre-processing step in genome and transcriptome assembly projects.

