Dataset Information

Simultaneous identification of long similar substrings in large sets of sequences.

ABSTRACT: BACKGROUND: Sequence comparison faces new challenges today, with many complete genomes and large libraries of transcripts known. Gene annotation pipelines match these sequences in order to identify genes and their alternative splice forms. However, the software currently available cannot simultaneously compare sets of sequences as large as necessary, especially if errors must be considered. RESULTS: We therefore present a new algorithm for the identification of almost perfectly matching substrings in very large sets of sequences. Its implementation, called ClustDB, is considerably faster and can handle 16 times more data than VMATCH, the most memory-efficient exact program known today. ClustDB simultaneously generates large sets of exactly matching substrings of a given minimum length as seeds for a novel method of match extension with errors. It generates alignments of maximum length with a specified maximum number of errors within each overlapping window of a given size. Such alignments are not optimal in the usual sense but are faster to calculate and often more appropriate than traditional alignments for genomic sequence comparisons, EST and full-length cDNA matching, and genomic sequence assembly. The method is used to check the overlaps and to reveal possible assembly errors for 1377 Medicago truncatula BAC-size sequences published at http://www.medicago.org/genome/assembly_table.php?chr=1. CONCLUSION: The program ClustDB shows that window alignment is an efficient way to find long sequence sections of homogeneous alignment quality, as expected in the case of random errors, and to detect systematic errors resulting from sequence contaminations. Such inserts are systematically overlooked in long alignments controlled only by tuning penalties for mismatches and gaps. ClustDB is freely available for academic use.
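
The two steps named in the abstract, exact seeds of a minimum length followed by extension under a per-window error limit, can be illustrated with a minimal Python sketch. This is not ClustDB's implementation: the function names `exact_seeds` and `extend_right` are hypothetical, the seed index is a plain in-memory dictionary, and the extension counts mismatches only (no gaps), which is a simplifying assumption.

```python
from collections import defaultdict, deque


def exact_seeds(seqs, k):
    """Map each k-mer occurring more than once to its (sequence id, offset) hits.

    Illustrative only: a dictionary index stands in for whatever
    data structure a real tool would use on very large inputs.
    """
    index = defaultdict(list)
    for sid, s in enumerate(seqs):
        for i in range(len(s) - k + 1):
            index[s[i:i + k]].append((sid, i))
    return {kmer: hits for kmer, hits in index.items() if len(hits) > 1}


def extend_right(a, b, i, j, window, max_err):
    """Extend an ungapped match rightwards from a[i] / b[j].

    Extension continues while every run of `window` consecutive
    columns contains at most `max_err` mismatches. Returns the number
    of aligned columns, with trailing mismatches trimmed.
    Assumption: mismatches only, no insertions or deletions.
    """
    recent = deque()  # mismatch flags (0/1) of the last `window` columns
    errs = 0          # number of mismatches currently inside the window
    n = 0
    while i + n < len(a) and j + n < len(b):
        mis = int(a[i + n] != b[j + n])
        if len(recent) == window:
            errs -= recent.popleft()  # oldest column leaves the window
        if errs + mis > max_err:
            break  # adding this column would violate the window limit
        recent.append(mis)
        errs += mis
        n += 1
    # Do not let the reported match end on a mismatch.
    while n > 0 and a[i + n - 1] != b[j + n - 1]:
        n -= 1
    return n
```

A seed found by `exact_seeds` would supply the starting offsets `i` and `j` for `extend_right`; a symmetric leftward extension is omitted for brevity. Because the error limit applies to each window rather than to the alignment as a whole, a long run of mismatches (such as a contaminating insert) stops the extension instead of being absorbed by a long, otherwise good alignment.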

SUBMITTER: Kleffe J

PROVIDER: S-EPMC1892095 | biostudies-literature | 2007

REPOSITORIES: biostudies-literature

Publications

Simultaneous identification of long similar substrings in large sets of sequences.

Kleffe J, Möller F, Wittig B

BMC Bioinformatics, 2007-05-24

PMID: 17570866

Similar Datasets

HPC-CLUST: distributed hierarchical clustering for large sets of nucleotide sequences.
| S-EPMC3892691 | biostudies-literature

SCANMOT: searching for similar sequences using a simultaneous scan of multiple sequence motifs.
| S-EPMC1160253 | biostudies-literature

Individual sequences in large sets of gene sequences may be distinguished efficiently by combinations of shared sub-sequences.
| S-EPMC1090557 | biostudies-literature

APoc: large-scale identification of similar protein pockets.
| S-EPMC3582269 | biostudies-literature

PIQMEE: Bayesian Phylodynamic Method for Analysis of Large Data Sets with Duplicate Sequences.
| S-EPMC7530608 | biostudies-literature

ganon: precise metagenomics classification against large and up-to-date sets of reference sequences.
| S-EPMC7355301 | biostudies-literature

Identification of RNA Virus-Derived RdRp Sequences in Publicly Available Transcriptomic Data Sets.
| S-EPMC10101049 | biostudies-literature

HAlign 3: Fast Multiple Alignment of Ultra-Large Numbers of Similar DNA/RNA Sequences.
| S-EPMC9372455 | biostudies-literature

Fast selection of miRNA candidates based on large-scale pre-computed MFE sets of randomized sequences.
| S-EPMC3895842 | biostudies-literature

A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences.
| S-EPMC3022630 | biostudies-literature