Dataset Information

MS4--Multi-Scale Selector of Sequence Signatures: an alignment-free method for classification of biological sequences.

ABSTRACT:

Background

While multiple alignment is the first step of usual classification schemes for biological sequences, alignment-free methods are increasingly used as alternatives when multiple alignments fail. Subword-based combinatorial methods are popular for their low algorithmic complexity (e.g., suffix trees) or their exhaustivity (motif search), but they generally impose a fixed word length and/or a fixed number of mismatches. We previously developed a method to detect local similarities (the N-local decoding) based on the occurrences of repeated subwords of fixed length N, which does not impose a fixed number of mismatches. For some "good" values of N, the resulting similarities are sufficiently relevant to form the basis of a reliable alignment-free classification. The aim of this paper is to develop a method that uses the similarities detected by N-local decoding without imposing a fixed value of N. We present a procedure that selects an adaptive value of N for every position in the sequences, and we implement it as the MS4 classification tool.
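To make the idea of similarities built from repeated fixed-length subwords more concrete, here is a minimal Python sketch. It is not the MS4 or N-local decoding implementation; it assumes a simplified reading in which two positions are related when they sit at the same offset in two occurrences of the same length-N subword, and classes are the transitive closure of that relation. The function name `n_local_classes` and the `(sequence index, position)` encoding are hypothetical choices made for illustration.

```python
from collections import defaultdict


def n_local_classes(sequences, n):
    """Group positions sharing an offset in identical length-n subwords.

    Illustrative assumption only: this is a simplified reading of the
    idea described above, not the published algorithm.
    """
    parent = {}  # union-find forest over (sequence index, position)

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Record every occurrence (sequence index, start) of each subword.
    occurrences = defaultdict(list)
    for s, seq in enumerate(sequences):
        for start in range(len(seq) - n + 1):
            occurrences[seq[start:start + n]].append((s, start))

    # Merge positions at the same offset in two occurrences of a subword.
    for occ in occurrences.values():
        s0, p0 = occ[0]
        for s1, p1 in occ[1:]:
            for k in range(n):
                union((s0, p0 + k), (s1, p1 + k))

    # Collect the classes; unmatched positions remain singletons.
    classes = defaultdict(set)
    for s, seq in enumerate(sequences):
        for p in range(len(seq)):
            classes[find((s, p))].add((s, p))
    return list(classes.values())


if __name__ == "__main__":
    for cls in n_local_classes(["ACGTA", "TACGT"], 3):
        print(sorted(cls))
```

In this sketch, positions that are never covered by a repeated subword stay in singleton classes, so the choice of N directly controls how many positions end up grouped. That dependence is the reason a fixed N is hard to choose in advance.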

Results

Among the equivalence classes produced by the N-local decodings for all N, we select a (relatively) small number of "relevant" classes corresponding to variable length subwords that carry enough information to perform the classification. The parameter N, for which correct values are data-dependent and thus hard to guess, is here replaced by the average repetitivity kappa of the sequences. We show that our approach yields classifications of several sets of HIV/SIV sequences that agree with the accepted taxonomy, even on usually discarded repetitive regions (like the non-coding part of LTR).

Conclusions

MS4 satisfactorily classifies a set of sequences that are notoriously hard to align. This suggests that our approach forms the basis of a reliable alignment-free classification tool. The only parameter of MS4, kappa, seems to give reasonable results even at its default value, which can be a great advantage for sequence sets about which little information is available.

SUBMITTER: Corel E

PROVIDER: S-EPMC2923138 | biostudies-literature | 2010 Jul

REPOSITORIES: biostudies-literature

Publications

MS4--Multi-Scale Selector of Sequence Signatures: an alignment-free method for classification of biological sequences.

Corel E, Pitschi F, Laprevotte I, Grasseau G, Didier G, Devauchelle C

BMC Bioinformatics, 2010 Jul 30

PMID: 20673356

Similar Datasets

Integrating alignment-based and alignment-free sequence similarity measures for biological sequence classification.
| S-EPMC4410667 | biostudies-literature

Classification of Protein Sequences by a Novel Alignment-Free Method on Bacterial and Virus Families.
| S-EPMC9602327 | biostudies-literature

Alignment-free method for DNA sequence clustering using Fuzzy integral similarity.
| S-EPMC6403383 | biostudies-literature

Parallel algorithms for large-scale biological sequence alignment on Xeon-Phi based clusters.
| S-EPMC4959381 | biostudies-literature

An alignment-free method to find and visualise rearrangements between pairs of DNA sequences.
| S-EPMC4434998 | biostudies-literature

Mismatch-tolerant, alignment-free sequence classification using multiple spaced seeds and multiindex Bloom filters.
| S-EPMC7382288 | biostudies-literature

Multiple alignment-free sequence comparison.
| S-EPMC3799466 | biostudies-literature

CAFE: aCcelerated Alignment-FrEe sequence analysis.
| S-EPMC5793812 | biostudies-literature

INSIDER: alignment-free detection of foreign DNA sequences.
| S-EPMC8273350 | biostudies-literature

MISHIMA--a new method for high speed multiple alignment of nucleotide sequences of bacterial genome scale data.
| S-EPMC2848238 | biostudies-literature