Dataset Information

Large-scale motif discovery using DNA Gray code and equiprobable oligomers.


ABSTRACT:

Motivation

Finding motifs in genome-scale sets of functional sequences, such as all the promoters in a genome, is a challenging problem. Word-based methods count the occurrences of oligomers to detect overrepresented ones. This approach is known to be fast and accurate compared with other methods. However, two problems have hampered the application of such methods to large-scale data. One is the computational cost of clustering similar oligomers; the other is the bias in the frequency of fixed-length oligomers, which complicates the detection of significant words.
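As an illustration of the counting step that word-based methods start from, the following is a minimal sketch, not the Hegma implementation. The function name `count_oligomers` and the choice to skip words containing non-ACGT characters are assumptions made for this example; the significance test that decides which words are overrepresented is omitted.

```python
from collections import Counter

DNA_BASES = set("ACGT")

def count_oligomers(sequences, k):
    """Count every length-k word (oligomer) across a collection of sequences.

    Hypothetical helper for illustration only; it is not taken from Hegma.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width k along the sequence.
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            # Assumption: skip windows containing ambiguous bases such as N.
            if set(word) <= DNA_BASES:
                counts[word] += 1
    return counts

counts = count_oligomers(["ACGTACGT", "ttacgn"], 4)
```

A word-based method would then compare these counts with the counts expected under a background model in order to flag overrepresented oligomers.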

Results

We introduce a method that uses a DNA Gray code and equiprobable oligomers, which address the clustering problem and the oligomer bias, respectively. Our method can analyze 18 000 sequences of ~1 kbp each in 30 s. We also show that the accuracy of our method is superior to that of a leading method, especially for large-scale data and small fractions of motif-containing sequences.
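To give a feel for what a Gray code over the DNA alphabet looks like, here is a minimal sketch of a standard reflected quaternary Gray code, in which consecutive oligomers differ at exactly one position. This is an assumption for illustration: the abstract does not define the DNA Gray code used by Hegma, and the function name `dna_gray_code` and the base order A, C, G, T are hypothetical choices.

```python
ALPHABET = "ACGT"  # assumed base order; the paper's ordering may differ

def dna_gray_code(k):
    """Return all 4**k oligomers so that neighbours differ at one position.

    Standard reflected 4-ary Gray code, not necessarily the paper's definition.
    """
    if k == 0:
        return [""]
    shorter = dna_gray_code(k - 1)
    words = []
    for i, base in enumerate(ALPHABET):
        # Reverse the list of suffixes for every other leading base
        # (the "reflection" step), so block boundaries share a suffix.
        block = shorter if i % 2 == 0 else shorter[::-1]
        words.extend(base + w for w in block)
    return words

# For k = 2 the order begins AA, AC, AG, AT, CT, CG, CC, CA, GA, ...
dimers = dna_gray_code(2)
```

Because similar words sit next to each other in such an ordering, oligomers that differ by a single substitution can be found by scanning neighbours rather than comparing all pairs, which is one plausible reading of how a Gray code eases clustering.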

Availability

The online and stand-alone versions of the application, named Hegma, are available at our website: http://www.genome.ist.i.kyoto-u.ac.jp/~ichinose/hegma/

Contact

ichinose@i.kyoto-u.ac.jp; o.gotoh@i.kyoto-u.ac.jp

SUBMITTER: Ichinose N 

PROVIDER: S-EPMC3244767 | biostudies-literature | 2012 Jan

REPOSITORIES: biostudies-literature

Publications

Large-scale motif discovery using DNA Gray code and equiprobable oligomers.

Natsuhiro Ichinose, Tetsushi Yada, Osamu Gotoh

Bioinformatics (Oxford, England), 2011-11-03, issue 1


Similar Datasets

| S-EPMC10433467 | biostudies-literature
| S-EPMC6680723 | biostudies-literature
| S-EPMC6209220 | biostudies-literature
| S-EPMC149860 | biostudies-literature
| S-EPMC2562012 | biostudies-literature
| S-EPMC2194741 | biostudies-literature
| S-EPMC10280557 | biostudies-literature
| S-EPMC6884855 | biostudies-literature
| S-EPMC5362644 | biostudies-literature
| S-EPMC2646190 | biostudies-literature