
Dataset Information


Similarity evaluation of DNA sequences based on frequent patterns and entropy.


ABSTRACT: BACKGROUND: DNA sequence analysis is an important research topic in bioinformatics. Evaluating the similarity between sequences, which is crucial for sequence analysis, has attracted much research effort in the last two decades, and about a dozen algorithms and tools have been developed. These methods are based on alignment, word frequency, or geometric representation, each of which has its own advantages and disadvantages. RESULTS: In this paper, to compute the similarity between DNA sequences effectively, we introduce a novel method based on frequent patterns and entropy to construct representative vectors of DNA sequences. Experiments are conducted to evaluate the proposed method, which is compared with two recently developed alignment-free methods and the BLASTN tool. When tested on the β-globin genes of 11 species, using the results from MEGA as the baseline, our method achieves higher correlation coefficients than the two alignment-free methods and the BLASTN tool. CONCLUSIONS: Our method is not only able to capture fine-granularity information (location and ordering) of DNA sequences via sequence blocking, but is also insensitive to noise and sequence rearrangement because it considers only the maximal frequent patterns. It outperforms major existing methods and tools.
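
The abstract describes the approach only at a high level, so the following is a minimal illustrative sketch of the general idea (block the sequence, summarise each block by an entropy value, compare the resulting vectors), not the authors' algorithm. It assumes plain k-mer counts in place of the maximal frequent patterns used in the paper, and the function names, the block count `n_blocks`, the word length `k`, and the Euclidean distance are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def split_blocks(seq, n_blocks):
    """Split seq into n_blocks contiguous blocks of near-equal length."""
    size, rem = divmod(len(seq), n_blocks)
    blocks, start = [], 0
    for i in range(n_blocks):
        # The first `rem` blocks get one extra character.
        end = start + size + (1 if i < rem else 0)
        blocks.append(seq[start:end])
        start = end
    return blocks


def kmer_entropy(block, k):
    """Shannon entropy (in bits) of the k-mer distribution within one block.

    Assumption: plain k-mers stand in for the paper's maximal frequent patterns.
    """
    counts = Counter(block[i:i + k] for i in range(len(block) - k + 1))
    total = sum(counts.values())
    if total == 0:  # block shorter than k
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_vector(seq, n_blocks=4, k=2):
    """Represent a sequence by one entropy value per block (keeps block order)."""
    return [kmer_entropy(b, k) for b in split_blocks(seq.upper(), n_blocks)]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    a = entropy_vector("ACGTACGTTTGACCAGTACGATCG")
    b = entropy_vector("ACGTACGATTGACCAGTACGTTCG")
    print(euclidean(a, b))
```

Because each vector component belongs to a fixed block, the representation retains coarse positional information, which is the role the abstract attributes to sequence blocking. A faithful implementation would need the pattern-mining step from the full paper.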

SUBMITTER: Xie X 

PROVIDER: S-EPMC4331808 | biostudies-literature | 2015

REPOSITORIES: biostudies-literature


Publications

Similarity evaluation of DNA sequences based on frequent patterns and entropy.

Xie X, Guan J, Zhou S

BMC Genomics, 2015-01-29



Similar Datasets

| S-EPMC6557737 | biostudies-literature
| S-EPMC4880953 | biostudies-literature
| S-EPMC2831315 | biostudies-literature
| S-EPMC2639693 | biostudies-literature
| S-EPMC2082047 | biostudies-literature
| S-EPMC3922877 | biostudies-literature
| S-EPMC4713410 | biostudies-literature
| S-EPMC6554434 | biostudies-literature
| S-EPMC5896879 | biostudies-literature
| S-EPMC1976428 | biostudies-literature