Dataset Information

iTriplet, a rule-based nucleic acid sequence motif finder.

ABSTRACT:

Background

With the advent of high-throughput sequencing techniques, large amounts of sequencing data are readily available for analysis. Natural biological signals are intrinsically highly variable, making their complete identification a computationally challenging problem. Many statistical and combinatorial approaches have been applied with great success in the past. However, identifying highly degenerate and long (>20 nucleotides) motifs remains an unmet challenge, because high degeneracy diminishes the statistical significance of biological signals and increasing motif size causes combinatorial explosion. In this report, we present a novel rule-based method that focuses on finding degenerate and long motifs. Our proposed method, named iTriplet, avoids the costly enumeration present in existing combinatorial methods and is amenable to parallel processing.
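To make the combinatorial explosion concrete, the following minimal sketch counts how many DNA strings lie within a given Hamming distance of a single motif instance, which is the search space a naive enumerative method would have to visit. The function name and the example (length, mismatch) pairs are illustrative assumptions, not values or code taken from the iTriplet paper.

```python
from math import comb

def neighborhood_size(length: int, max_mismatches: int) -> int:
    """Count DNA strings within Hamming distance max_mismatches of one fixed string.

    Choose i positions to change (comb(length, i)), each to one of the
    3 other nucleotides (3**i), and sum over i = 0..max_mismatches.
    """
    return sum(comb(length, i) * 3**i for i in range(max_mismatches + 1))

# Hypothetical (motif length, allowed mismatches) pairs, chosen only to
# show how quickly the numbers grow as motifs get longer and more degenerate.
for length, mismatches in [(8, 1), (15, 4), (24, 8)]:
    total = 4**length  # all possible strings of this length
    near = neighborhood_size(length, mismatches)
    print(f"l={length:2d} d={mismatches}: {total} possible strings, {near} within d mismatches")
```

Both counts grow rapidly with motif length and the number of mismatches allowed, which is why exhaustive enumeration becomes impractical for long, degenerate motifs.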

Results

We have conducted a comprehensive assessment of the performance and sensitivity-specificity of iTriplet in analyzing artificial and real biological sequences from various genomic regions. The results show that iTriplet is able to solve challenging cases. Furthermore, we have confirmed the utility of iTriplet by showing, with a dual Luciferase reporter assay, that it accurately predicts polyA-site-related motifs.

Conclusion

iTriplet is a novel rule-based combinatorial or enumerative motif finding method that is able to process highly degenerate and long motifs that have resisted analysis by other methods. In addition, iTriplet is distinguished from other methods of the same family by its parallelizability, which allows it to leverage the power of today's readily available high-performance computing systems.

SUBMITTER: Ho ES

PROVIDER: S-EPMC2784457 | biostudies-literature | 2009 Oct

REPOSITORIES: biostudies-literature

Publications

iTriplet, a rule-based nucleic acid sequence motif finder.

Eric S Ho, Christopher D Jakubowski, Samuel I Gunderson

Algorithms for Molecular Biology: AMB, 2009 Oct 29

PMID: 19874606

Similar Datasets

Project description: Motivation: Protein phosphorylation, driven by specific recognition of substrates by kinases and phosphatases, plays central roles in a variety of important cellular processes such as signaling and enzyme activation. Mass spectrometry enables the determination of phosphorylated peptides (and thereby proteins) in scenarios ranging from targeted in vitro studies to in vivo cell lysates under particular conditions. The characterization of commonalities among identified phosphopeptides provides insights into the specificities of the kinases involved in a study. Several algorithms have been developed to uncover linear motifs representing position-specific amino acid patterns in sets of phosphopeptides. To more fully capture the available information, reduce sensitivity to both parameter choices and natural experimental variation, and develop more precise characterizations of kinase specificities, it is necessary to determine all statistically significant motifs represented in a dataset. Results: We have developed MMFPh (Maximal Motif Finder for Phosphoproteomics datasets), which extends the approach of the popular phosphorylation motif software Motif-X (Schwartz and Gygi, 2005) to identify all statistically significant motifs and return the maximal ones (those not subsumed by motifs with more fixed amino acids). In tests with both synthetic and experimental data, we show that MMFPh finds important motifs missed by the greedy approach of Motif-X, while also finding more motifs that are more characteristic of the dataset relative to the background proteome. Thus MMFPh is in some sense both more sensitive and more specific in characterizing the involved kinases. We also show that MMFPh compares favorably to other recent methods for finding phosphorylation motifs. Furthermore, MMFPh is less dependent on parameter choices.
We support this powerful new approach with a web interface so that it may become a useful tool for studies of kinase specificity and phosphorylation site prediction. Availability: A web server is at www.cs.dartmouth.edu/~cbk/.
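The notion of a maximal motif described above (one not subsumed by a motif with more fixed amino acids) can be illustrated with a small filter. This is only a sketch under stated assumptions: motifs are written as equal-length strings with `.` marking an unconstrained position, the input list is assumed to contain only motifs already judged statistically significant, and the function names and example motifs are hypothetical rather than taken from MMFPh.

```python
def subsumes(specific: str, general: str) -> bool:
    """Return True if `specific` fixes every residue that `general` fixes, plus at least one more.

    Assumed notation: '.' is a wildcard, any other character is a fixed residue.
    """
    if len(specific) != len(general):
        return False
    # Wherever `general` is fixed, `specific` must carry the same residue.
    agrees = all(g == "." or g == s for s, g in zip(specific, general))
    # If they agree but differ, `specific` must fix a position that `general` leaves open.
    return agrees and specific != general

def maximal_motifs(motifs):
    """Keep only motifs that no other motif in the list subsumes."""
    unique = list(dict.fromkeys(motifs))  # drop duplicates, preserve order
    return [m for m in unique if not any(subsumes(other, m) for other in unique)]

# Hypothetical significant motifs around a phosphorylated serine.
candidates = ["R..S", "R.PS", "..PS", "K..S"]
print(maximal_motifs(candidates))
```

Here `R..S` and `..PS` are dropped because `R.PS` fixes the same residues and one more, while `K..S` is kept because no other motif in the list extends it.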

Project description: A central problem in the bioinformatics of gene regulation is to find the binding sites for regulatory proteins. One of the most promising approaches toward identifying these short and fuzzy sequence patterns is the comparative analysis of orthologous intergenic regions of related species. This analysis is complicated by various factors. First, one needs to take the phylogenetic relationship between the species into account in order to distinguish conservation that is due to the occurrence of functional sites from spurious conservation that is due to evolutionary proximity. Second, one has to deal with the complexities of multiple alignments of orthologous intergenic regions, and one has to consider the possibility that functional sites may occur outside of conserved segments. Here we present a new motif sampling algorithm, PhyloGibbs, that runs on arbitrary collections of multiple local sequence alignments of orthologous sequences. The algorithm searches over all ways in which an arbitrary number of binding sites for an arbitrary number of transcription factors (TFs) can be assigned to the multiple sequence alignments. These binding site configurations are scored by a Bayesian probabilistic model that treats aligned sequences by a model for the evolution of binding sites and "background" intergenic DNA. This model takes the phylogenetic relationship between the species in the alignment explicitly into account. The algorithm uses simulated annealing and Monte Carlo Markov-chain sampling to rigorously assign posterior probabilities to all the binding sites that it reports. In tests on synthetic data and real data from five Saccharomyces species our algorithm performs significantly better than four other motif-finding algorithms, including algorithms that also take phylogeny into account. Our results also show that, in contrast to the other algorithms, PhyloGibbs can make realistic estimates of the reliability of its predictions.
Our tests suggest that, running on the five-species multiple alignment of a single gene's upstream region, PhyloGibbs on average recovers over 50% of all binding sites in S. cerevisiae at a specificity of about 50%, and 33% of all binding sites at a specificity of about 85%. We also tested PhyloGibbs on collections of multiple alignments of intergenic regions that were recently annotated, based on ChIP-on-chip data, to contain binding sites for the same TF. We compared PhyloGibbs's results with the previous analysis of these data using six other motif-finding algorithms. For 16 of 21 TFs for which all other motif-finding methods failed to find a significant motif, PhyloGibbs did recover a motif that matches the literature consensus. In 11 cases where there was disagreement in the results, we compiled lists of known target genes from the literature and found that running PhyloGibbs on their regulatory regions yielded a binding motif matching the literature consensus in all but one of the cases. Interestingly, these literature gene lists had little overlap with the targets annotated based on the ChIP-on-chip data. The PhyloGibbs code can be downloaded from http://www.biozentrum.unibas.ch/~nimwegen/cgi-bin/phylogibbs.cgi or http://www.imsc.res.in/~rsidd/phylogibbs. The full set of predicted sites from our tests on yeast is available at http://www.swissregulon.unibas.ch.