
Dataset Information


MS4--Multi-Scale Selector of Sequence Signatures: an alignment-free method for classification of biological sequences.


ABSTRACT:

BACKGROUND: While multiple alignment is the first step of most classification schemes for biological sequences, alignment-free methods are increasingly used as alternatives when multiple alignments fail. Subword-based combinatorial methods are popular for their low algorithmic complexity (e.g., suffix trees) or their exhaustiveness (motif search), and generally use a fixed word length and/or a fixed number of mismatches. We previously developed a method to detect local similarities (the N-local decoding) based on the occurrences of repeated subwords of fixed length, which does not impose a fixed number of mismatches. For some "good" values of N, the resulting similarities are sufficiently relevant to form the basis of a reliable alignment-free classification. The aim of this paper is to develop a method that uses the similarities detected by N-local decoding without imposing a fixed value of N. We present a procedure that selects an adaptive value of N for every position in the sequences, and we implement it as the MS4 classification tool.

RESULTS: Among the equivalence classes produced by the N-local decodings for all N, we select a (relatively) small number of "relevant" classes corresponding to variable-length subwords that carry enough information to perform the classification. The parameter N, for which correct values are data-dependent and thus hard to guess, is replaced here by the average repetitivity kappa of the sequences. We show that our approach yields classifications of several sets of HIV/SIV sequences that agree with the accepted taxonomy, even on repetitive regions that are usually discarded (such as the non-coding part of the LTR).

CONCLUSIONS: The MS4 method satisfactorily classifies a set of sequences that are notoriously hard to align. This suggests that our approach forms the basis of a reliable alignment-free classification tool. The single parameter of MS4, kappa, seems to give reasonable results even at its default value, which can be a great advantage for sequence sets for which little information is available.
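
As a rough illustration of the raw ingredient the abstract mentions (occurrences of repeated subwords of fixed length N), the minimal Python sketch below collects every length-N subword that occurs at least twice across a set of sequences. It is not the N-local decoding and not MS4: the function name `repeated_subwords` and the toy sequences are hypothetical, and the sketch only finds exact repeats, whereas the published method builds equivalence classes of positions from them.

```python
from collections import defaultdict


def repeated_subwords(sequences, n):
    """Return {subword: [(sequence_index, offset), ...]} for every
    length-n subword that occurs at least twice across all sequences.

    Hypothetical helper for illustration only; exact matches, no mismatches.
    """
    occurrences = defaultdict(list)
    for seq_index, seq in enumerate(sequences):
        # Slide a window of width n over the sequence.
        for offset in range(len(seq) - n + 1):
            occurrences[seq[offset:offset + n]].append((seq_index, offset))
    # Keep only subwords seen more than once (i.e. repeated).
    return {word: locs for word, locs in occurrences.items() if len(locs) > 1}


# Toy input (made up, not taken from the HIV/SIV data sets).
toy = ["ACGTACG", "TTACGTA"]
repeats = repeated_subwords(toy, 4)
for word in sorted(repeats):
    print(word, repeats[word])
```

Rerunning the sketch with different values of `n` on the same input changes which repeats are found, which is the sensitivity to N that the abstract says MS4 avoids by selecting N adaptively per position.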

SUBMITTER: Corel E 

PROVIDER: S-EPMC2923138 | biostudies-literature | 2010

REPOSITORIES: biostudies-literature


Publications

MS4--Multi-Scale Selector of Sequence Signatures: an alignment-free method for classification of biological sequences.

Eduardo Corel, Florian Pitschi, Ivan Laprevotte, Gilles Grasseau, Gilles Didier, Claudine Devauchelle

BMC Bioinformatics, 2010-07-30



Similar Datasets

| S-EPMC9602327 | biostudies-literature
| S-EPMC4410667 | biostudies-literature
| S-EPMC4434998 | biostudies-literature
| S-EPMC8323718 | biostudies-literature
| S-EPMC6377666 | biostudies-literature
| S-EPMC8273350 | biostudies-literature
| S-EPMC2848238 | biostudies-literature
| S-EPMC8073112 | biostudies-literature
| S-EPMC3849074 | biostudies-literature
| S-EPMC7859483 | biostudies-literature