Dataset Information

Towards a theoretical understanding of false positives in DNA motif finding.

ABSTRACT: BACKGROUND: Detection of false-positive motifs is one of the main causes of low performance in de novo DNA motif-finding methods. Despite substantial algorithm development effort in this area, recent comprehensive benchmark studies revealed that the performance of DNA motif-finders leaves room for improvement in realistic scenarios. RESULTS: Using large-deviations theory, we derive a remarkably simple relationship that describes the dependence of false positives on dataset size for the one-occurrence-per-sequence motif-finding problem. As expected, we predict that false positives can be reduced by decreasing the sequence length or by adding more sequences to the dataset. Interestingly, we find that the false-positive strength depends more strongly on the number of sequences in the dataset than on the sequence length, but that this dependence diminishes as the dataset grows, to the point where adding more sequences no longer reduces the false-positive rate significantly. We test our theoretical predictions by applying four popular motif-finding algorithms that solve the one-occurrence-per-sequence problem (MEME, the Gibbs Sampler, Weeder, and GIMSAN) to simulated data that contain no motifs. We find that the dependence of the false positives detected by these programs on the motif-finding parameters is similar to that predicted by our formula. CONCLUSIONS: We quantify the relationship between the sequence search space and motif-finding false positives. Based on the simple formula we derive, we provide a number of intuitive rules of thumb that may be used to enhance motif-finding results in practice. Our results provide a theoretical advance in an important problem in computational biology.
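
The null experiment described in the abstract (running a motif search on simulated sequences that contain no motif) can be sketched in a few lines. The code below is only an illustration under stated assumptions: it is not the paper's formula, and it does not reproduce MEME, the Gibbs Sampler, Weeder, or GIMSAN. The greedy seed-and-match search, the information-content score against a uniform background, and all function names are hypothetical choices made for this sketch.

```python
import math
import random


def random_sequences(n, length, rng):
    """Generate n uniform random DNA sequences with no planted motif."""
    return ["".join(rng.choice("ACGT") for _ in range(length)) for _ in range(n)]


def information_content(sites):
    """Total information content (bits) of aligned sites, uniform background."""
    n = len(sites)
    total = 0.0
    for column in zip(*sites):
        for base in "ACGT":
            freq = column.count(base) / n
            if freq > 0:
                total += freq * math.log2(freq / 0.25)
    return total


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_spurious_motif(seqs, width):
    """Greedy one-occurrence-per-sequence search (an assumed heuristic).

    Each window of the first sequence is used as a seed; the closest window
    (by Hamming distance) is taken from every other sequence, and the
    highest-scoring alignment found is reported.
    """
    best = 0.0
    first = seqs[0]
    for i in range(len(first) - width + 1):
        seed = first[i:i + width]
        sites = [seed]
        for s in seqs[1:]:
            windows = [s[j:j + width] for j in range(len(s) - width + 1)]
            sites.append(min(windows, key=lambda w: hamming(seed, w)))
        best = max(best, information_content(sites))
    return best


if __name__ == "__main__":
    rng = random.Random(0)
    # Vary the number of sequences and the sequence length independently.
    for n, length in [(5, 50), (20, 50), (5, 200)]:
        seqs = random_sequences(n, length, rng)
        print(n, length, round(best_spurious_motif(seqs, 8), 2))
```

Because the input contains no real motif, any high-scoring alignment returned is a false positive by construction. Varying `n` and `length` separately gives a rough way to explore how the strength of such spurious motifs changes with each parameter.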

SUBMITTER: Zia A

PROVIDER: S-EPMC3436861 | biostudies-other | 2012

REPOSITORIES: biostudies-other

Publications

Towards a theoretical understanding of false positives in DNA motif finding.

Zia A, Moses AM

BMC Bioinformatics, 2012-06-27

PMID: 22738169

Similar Datasets

A survey of DNA motif finding algorithms.
| S-EPMC2099490 | biostudies-other

DECOD: fast and accurate discriminative DNA motif finding.
| S-EPMC3157928 | biostudies-literature

Using RNA secondary structures to guide sequence motif finding towards single-stranded regions.
| S-EPMC1903381 | biostudies-literature

The DNA Recognition Motif of GapR Has an Intrinsic DNA Binding Preference towards AT-rich DNA.
| S-EPMC8510090 | biostudies-literature

Towards epigenetic understanding and therapy of insulin resistance by intranuclear insulin [DNA methylation]
2014-05-23 | E-GEOD-57894 | biostudies-arrayexpress

Towards understanding breast cancer mechanisms to metastasize
2013-11-01 | E-GEOD-47389 | biostudies-arrayexpress

Incidental Finding of Left Ventricular False Chamber: Diagnostic and Therapeutic Implications.
| S-EPMC6057300 | biostudies-literature

Discriminative motif finding for predicting protein subcellular localization.
| S-EPMC3050600 | biostudies-literature

Towards epigenetic understanding and therapy of insulin resistance by intranuclear insulin [DNA methylation]
2014-05-23 | GSE57894 | GEO

Towards understanding breast cancer mechanisms to metastasize
2013-11-01 | GSE47389 | GEO