ABSTRACT:
Motivation
Recently, a range of new statistics have become available for the alignment-free comparison of two sequences based on k-tuple word content. Here, we extend these statistics to the simultaneous comparison of more than two sequences. Our suite of statistics contains, first, C(*)1 and C(S)1, extensions of statistics for pairwise comparison of the joint k-tuple content of all the sequences, and second, C(*)2, C(S)2 and C(geo)2, averages of sums of pairwise comparison statistics. The two tasks we consider are, first, to identify sequences that are similar to a set of target sequences, and, second, to measure the similarity within a set of sequences.
Results
Our investigation uses both simulated data and cis-regulatory module data, where the task is to identify cis-regulatory modules with similar transcription factor binding sites. We find that although for real data, all of our statistics show a similar performance, on simulated data the Shepp-type statistics are in some instances outperformed by star-type statistics. The multiple alignment-free statistics are more sensitive to contamination in the data than the pairwise average statistics.
Availability
Our implementation of the five statistics is available as an R package named 'multiAlignFree' at http://www-rcf.usc.edu/?fsun/Programs/multiAlignFree/multiAlignFreemain.html.
Contact
reinert@stats.ox.ac.uk
Supplementary information
Supplementary data are available at Bioinformatics online.
SUBMITTER: Ren J
PROVIDER: S-EPMC3799466 | biostudies-literature | 2013 Nov
REPOSITORIES: biostudies-literature
Ren J, Song K, Sun F, Deng M, Reinert G
Bioinformatics (Oxford, England), 2013-08-29, issue 21