Dataset Information

The jigsaw puzzle of sequence phenotype inference: Piecing together Shannon entropy, importance sampling, and Empirical Bayes.

ABSTRACT: A nucleotide sequence 35 base pairs long can take 1,180,591,620,717,411,303,424 possible values. Protein binding microarrays, one example of a systems biology dataset, contain activity data from about 40,000 such sequences. The discrepancy between the number of possible configurations and the number of measured activities is enormous. Thus, although systems biology datasets are large in absolute terms, they often require methods developed for rare events, because the number of possible configurations of biological systems grows combinatorially. Many techniques have been developed for handling large datasets, such as Empirical Bayes, or rare events, such as importance sampling, but these cannot always be used together. Here we introduce a principled approach to Empirical Bayes based on importance sampling, information theory, and theoretical physics in the general context of sequence phenotype model induction. We present the analytical calculations that underlie our approach. We demonstrate its computational efficiency on concrete examples, and its efficacy by applying the theory to publicly available protein binding microarray transcription factor datasets and to data on synthetic cAMP-regulated enhancer sequences. As further demonstrations, we find transcription factor binding motifs, predict the activity of new sequences, and extract the locations of transcription factor binding sites. In summary, we present a novel method that is efficient (requiring minimal computational time and reasonable amounts of memory), has predictive power comparable with that of models with hundreds of parameters, and has a limited number of optimized parameters, proportional to the sequence length.
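The first figure in the abstract follows from having four possible bases at each of 35 positions, which gives 4^35 sequences. The short Python sketch below reproduces that count and the fraction of the sequence space that roughly 40,000 measured sequences would cover. The constant and function names are illustrative assumptions, not taken from the paper, and the 40,000 figure is the approximate value quoted in the abstract.

```python
# Illustrative arithmetic only; names here are assumptions, not the authors' code.
ALPHABET_SIZE = 4      # nucleotides: A, C, G, T
SEQ_LENGTH = 35        # base pairs, as stated in the abstract
N_MEASURED = 40_000    # approximate number of measured sequences (from the abstract)


def sequence_space_size(length, alphabet_size=ALPHABET_SIZE):
    """Number of distinct sequences of the given length over the alphabet."""
    return alphabet_size ** length


total = sequence_space_size(SEQ_LENGTH)
coverage = N_MEASURED / total  # fraction of all possible sequences that is measured

print(f"{total:,}")       # 1,180,591,620,717,411,303,424
print(f"{coverage:.2e}")  # on the order of 1e-17
```

Python integers have arbitrary precision, so the count is exact rather than a floating-point approximation.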

SUBMITTER: Shreif Z

PROVIDER: S-EPMC4522360 | biostudies-literature | 2015 Sep

REPOSITORIES: biostudies-literature

Publications

The jigsaw puzzle of sequence phenotype inference: Piecing together Shannon entropy, importance sampling, and Empirical Bayes.

Shreif Z, Striegel DA, Periwal V

Journal of Theoretical Biology, 2015-06-17

PMID: 26092377

Similar Datasets

Piecing the puzzle together: a revisit to transcript reconstruction problem in RNA-seq.
| S-EPMC4168703 | biostudies-literature

A decade of research on the 17q12-21 asthma locus: Piecing together the puzzle.
| S-EPMC6172038 | biostudies-literature

Pattern recognition receptors and DNA repair: starting to put a jigsaw puzzle together.
| S-EPMC4107940 | biostudies-literature

Scalable Empirical Bayes Inference and Bayesian Sensitivity Analysis.
| S-EPMC11654829 | biostudies-literature

Piecing together the Inuit food security policy puzzle in Nunatsiavut, Labrador (Canada): protocol for a scoping review.
| S-EPMC6924784 | biostudies-literature

β-empirical Bayes inference and model diagnosis of microarray data.
| S-EPMC3464654 | biostudies-literature

Piecing Together How Peroxiredoxins Maintain Genomic Stability.
| S-EPMC6316004 | biostudies-literature

Allograft for Myeloma: Examining Pieces of the Jigsaw Puzzle.
| S-EPMC5732220 | biostudies-literature

Assembling bacterial puzzles: piecing together functions into microbial pathways.
| S-EPMC11344244 | biostudies-literature

Empirical Bayes Conditional Independence Graphs for Dense Regulatory Network Recovery
2024-06-05 | GSE32030 | GEO