Project description:A nucleotide sequence 35 base pairs long can take 4^35 = 1,180,591,620,717,411,303,424 possible values. Protein binding microarrays, one example of a systems biology dataset, contain activity data from about 40,000 such sequences. The discrepancy between the number of possible configurations and the number of measured activities is enormous. Thus, although systems biology datasets are large in absolute terms, they often require methods developed for rare events because of the combinatorial increase in the number of possible configurations of biological systems. A plethora of techniques for handling large datasets, such as Empirical Bayes, or rare events, such as importance sampling, have been developed in the literature, but these cannot always be used simultaneously. Here we introduce a principled approach to Empirical Bayes based on importance sampling, information theory, and theoretical physics in the general context of sequence phenotype model induction. We present the analytical calculations that underlie our approach. We demonstrate the computational efficiency of the approach on concrete examples, and demonstrate its efficacy by applying the theory to publicly available protein binding microarray transcription factor datasets and to data on synthetic cAMP-regulated enhancer sequences. As further demonstrations, we find transcription factor binding motifs, predict the activity of new sequences and extract the locations of transcription factor binding sites. In summary, we present a novel method that is efficient (requiring minimal computational time and reasonable amounts of memory), has predictive power comparable with that of models with hundreds of parameters, and has a limited number of optimized parameters, proportional to the sequence length.
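The size of the gap described above can be checked with a few lines of arithmetic. The sketch below assumes four possible bases per position and takes the "about 40,000" figure from the abstract as exact; the variable names are illustrative only.

```python
# Count the distinct DNA sequences of a given length (4 bases per position)
# and compare with the approximate number of sequences measured on a
# protein binding microarray (figure taken from the abstract).
SEQ_LENGTH = 35
N_MEASURED = 40_000  # "about 40,000" in the abstract, treated as exact here

n_possible = 4 ** SEQ_LENGTH
fraction_measured = N_MEASURED / n_possible

print(f"{n_possible:,}")           # 1,180,591,620,717,411,303,424
print(f"{fraction_measured:.2e}")  # roughly 3.4e-17 of sequence space
```

Even a dataset of tens of thousands of sequences therefore samples a vanishingly small fraction of the configuration space, which is why rare-event methods become relevant.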
Project description:Peroxiredoxins, a highly conserved family of thiol oxidoreductases, play a key role in oxidant detoxification by partnering with the thioredoxin system to protect against oxidative stress. In addition to their peroxidase activity, certain types of peroxiredoxins possess other biochemical activities, including assistance in preventing protein aggregation upon exposure to high levels of oxidants (molecular chaperone activity), and the transduction of redox signals to downstream proteins (redox switch activity). Mice lacking the peroxiredoxin Prdx1 exhibit an increased incidence of tumor formation, whereas baker's yeast (Saccharomyces cerevisiae) lacking the orthologous peroxiredoxin Tsa1 exhibit a mutator phenotype. Collectively, these findings suggest a potential link between peroxiredoxins, control of genomic stability, and cancer etiology. Here, we examine the potential mechanisms through which Tsa1 lowers mutation rates, taking into account its diverse biochemical roles in oxidant defense, protein homeostasis, and redox signaling as well as its interplay with thioredoxin and thioredoxin substrates, including ribonucleotide reductase. More work is needed to clarify the nuanced mechanism(s) through which this highly conserved peroxidase influences genome stability, and to determine if this mechanism is similar across a range of species.
Project description:Functional metagenomics enables the study of unexplored bacterial diversity, gene families, and pathways essential to microbial communities. However, discovering biological insights with these data is impeded by the scarcity of quality annotations. Here, we use a co-occurrence-based analysis of predicted microbial protein functions to uncover pathways in genomic and metagenomic biological systems. Our approach, based on phylogenetic profiles, improves the identification of functional relationships, or participation in the same biochemical pathway, between enzymes over a comparable homology-based approach. We optimized the design of our profiles to identify potential pathways using minimal data, clustered functionally related enzyme pairs into multi-enzymatic pathways, and evaluated our predictions against reference pathways in the KEGG database. We then demonstrated a novel extension of this approach to predict inter-bacterial protein interactions amongst members of a marine microbiome. Most significantly, we show our method predicts emergent biochemical pathways between known and unknown functions. Thus, our work establishes a basis for identifying the potential functional capacities of the entire metagenome, capturing previously unknown and abstract functions into discrete putative pathways.
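As a rough illustration of what a co-occurrence analysis over phylogenetic profiles can look like, the sketch below scores enzyme pairs by the Jaccard similarity of their presence/absence profiles. The enzyme names, genome labels, similarity measure, and threshold are all hypothetical choices made for this example; the abstract does not specify which measure or profile design the authors used.

```python
from itertools import combinations

# Hypothetical phylogenetic profiles: enzyme -> set of genomes in which
# the function is predicted to be present.
profiles = {
    "enzyme_A": {"g1", "g2", "g3", "g5"},
    "enzyme_B": {"g1", "g2", "g3"},
    "enzyme_C": {"g4", "g6"},
}

def jaccard(a, b):
    """Jaccard similarity of two presence sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cooccurring_pairs(profiles, threshold=0.7):
    """Return enzyme pairs whose profiles overlap at or above the threshold."""
    pairs = []
    for (name_a, prof_a), (name_b, prof_b) in combinations(sorted(profiles.items()), 2):
        score = jaccard(prof_a, prof_b)
        if score >= threshold:
            pairs.append((name_a, name_b, score))
    return pairs

pairs = cooccurring_pairs(profiles)
print(pairs)  # [('enzyme_A', 'enzyme_B', 0.75)]
```

Pairs that pass the threshold would then be candidates for clustering into multi-enzyme pathways, the step the abstract describes next.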
Project description:Protein interaction data exist in a number of repositories. Each repository has its own data format, molecule identifiers and supplementary information. Michigan Molecular Interactions (MiMI) assists scientists searching through this overwhelming amount of protein interaction data. MiMI gathers data from well-known protein interaction databases and deep-merges the information. Using an identity function, MiMI merges molecules that may have different identifiers but represent the same real-world object. Thus, MiMI allows users to retrieve information from many different databases at once, highlighting complementary and contradictory information. To help scientists judge the usefulness of a piece of data, MiMI tracks the provenance of all data. Finally, a simple yet powerful user interface aids users in their queries, and frees them from the onerous task of knowing the data format or learning a query language. MiMI allows scientists to query all data, whether corroborative or contradictory, and specify which sources to utilize. MiMI is part of the National Center for Integrative Biomedical Informatics (NCIBI) and is publicly available at: http://mimi.ncibi.org.
Project description:Integrase (IN) is one of only three enzymes encoded in the genomes of all retroviruses, and is the one least characterized in structural terms. IN catalyzes processing of the ends of a DNA copy of the retroviral genome and its concerted insertion into the chromosome of the host cell. The protein consists of three domains, the central catalytic core domain flanked by the N-terminal and C-terminal domains, the latter being involved in DNA binding. Although the Protein Data Bank contains a number of NMR structures of the N-terminal and C-terminal domains of HIV-1 and HIV-2, simian immunodeficiency virus and avian sarcoma virus IN, as well as X-ray structures of the core domain of HIV-1, avian sarcoma virus and foamy virus IN, plus several models of two-domain constructs, no structure of the complete molecule of retroviral IN has been solved to date. Although no experimental structures of IN complexed with the DNA substrates are at hand, the catalytic mechanism of IN is well understood by analogy with other nucleotidyl transferases, and a variety of models of the oligomeric integration complexes have been proposed. In this review, we present the current state of knowledge resulting from structural studies of IN from several retroviruses. We also attempt to reconcile the differences between the reported structures, and discuss the relationship between the structure and function of this enzyme, which is an important, although so far rather poorly exploited, target for designing drugs against HIV-1 infection.
Project description:The advancement of RNA sequencing (RNA-seq) has provided an unprecedented opportunity to assess both the diversity and quantity of transcript isoforms in an mRNA transcriptome. In this paper, we revisit the computational problem of transcript reconstruction and quantification. Unlike existing methods which focus on how to explain the exons and splice variants detected by the reads with a set of isoforms, we aim at reconstructing transcripts by piecing the reads into individual effective transcript copies. Simultaneously, the quantity of each isoform is explicitly measured by the number of assembled effective copies, instead of estimated solely based on the collective read count. We have developed a novel method named Astroid that solves the problem of effective copy reconstruction on the basis of a flow network. The RNA-seq reads are represented as vertices in the flow network and are connected by weighted edges that evaluate the likelihood of two reads originating from the same effective copy. A maximum likelihood set of transcript copies is then reconstructed by solving a minimum-cost flow problem on the flow network. Simulation studies on the human transcriptome have demonstrated the superior sensitivity and specificity of Astroid in transcript reconstruction as well as improved accuracy in transcript quantification over several existing approaches. The application of Astroid on two real RNA-seq datasets has further demonstrated its accuracy through high correlation between the estimated isoform abundance and the qRT-PCR validations.
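To make the flow-network idea concrete, here is a minimal sketch of a generic minimum-cost flow solver (successive shortest paths with Bellman-Ford) applied to a toy graph. It is not Astroid's actual network construction: the node layout, the unit capacities, and the integer edge costs (standing in for negative log-likelihoods that two reads belong to the same copy) are assumptions made purely for illustration.

```python
def min_cost_flow(n, edges, source, sink, flow_target):
    """Send up to flow_target units from source to sink at minimum total cost.

    edges: iterable of (u, v, capacity, cost). Returns (flow_sent, total_cost).
    """
    # Residual graph; each arc is [to, residual_capacity, cost, index_of_reverse_arc].
    graph = [[] for _ in range(n)]
    for u, v, cap, cost in edges:
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])

    total_flow, total_cost = 0, 0
    while total_flow < flow_target:
        # Bellman-Ford: cheapest path from source in the residual graph.
        dist = [float("inf")] * n
        dist[source] = 0
        prev = [None] * n  # (previous node, arc index) on the cheapest path
        for _ in range(n - 1):
            updated = False
            for u in range(n):
                if dist[u] == float("inf"):
                    continue
                for i, (v, cap, cost, _rev) in enumerate(graph[u]):
                    if cap > 0 and dist[u] + cost < dist[v]:
                        dist[v] = dist[u] + cost
                        prev[v] = (u, i)
                        updated = True
            if not updated:
                break
        if prev[sink] is None:
            break  # no augmenting path left

        # Find the bottleneck capacity along the path, then push flow.
        push = flow_target - total_flow
        v = sink
        while v != source:
            u, i = prev[v]
            push = min(push, graph[u][i][1])
            v = u
        v = sink
        while v != source:
            u, i = prev[v]
            graph[u][i][1] -= push
            graph[v][graph[u][i][3]][1] += push
            v = u
        total_flow += push
        total_cost += push * dist[sink]
    return total_flow, total_cost


# Toy network: node 0 = source, nodes 1-4 = reads, node 5 = sink.
# Costs are hypothetical stand-ins for negative log-likelihoods.
toy_edges = [
    (0, 1, 1, 0), (0, 2, 1, 0),
    (1, 3, 1, 1), (1, 4, 1, 5),
    (2, 3, 1, 2), (2, 4, 1, 1),
    (3, 5, 1, 0), (4, 5, 1, 0),
]
result = min_cost_flow(6, toy_edges, source=0, sink=5, flow_target=2)
print(result)  # (2, 2): reads 1->3 and 2->4 are paired into two copies
```

In this toy setting each unit of flow traces one chain of reads from source to sink, loosely mirroring the idea of an assembled "effective transcript copy"; the real method's edge weights and constraints are described only at a high level in the abstract.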
Project description:Chromosome 17q12-21 remains the most highly replicated and significant asthma locus. Genotypes in the core region defined by the first genome-wide association study correlate with expression of 2 genes, ORM1-like 3 (ORMDL3) and gasdermin B (GSDMB), making these prime candidate asthma genes, although recent studies have implicated gasdermin A (GSDMA), distal to the core region, and post-GPI attachment to proteins 3 (PGAP3), proximal to it, as independent loci. We review 10 years of studies on the 17q12-21 locus and suggest that genotype-specific risks for asthma at the proximal and distal loci are not specific to early-onset asthma and are mediated by PGAP3, ORMDL3, and/or GSDMA expression. We propose that the weak and inconsistent associations of 17q single nucleotide polymorphisms with asthma in African Americans are due to the high frequency of some 17q alleles, the breakdown of linkage disequilibrium on African-derived chromosomes, and possibly different early-life asthma endotypes in these children. Finally, the inconsistent association between asthma and gene expression levels in blood or lung cells from older children and adults suggests that genotype effects may mediate asthma risk or protection during critical developmental windows and/or in response to relevant exposures in early life. Thus, studies of young children and ethnically diverse populations are required to fully understand the relationship between genotype and asthma phenotype and the gene regulatory architecture at this locus.
Project description:This study demonstrates the value of legacy literature and historic collections as a source of data on environmental history. Chenopodium vulvaria L. has declined in northern Europe and is of conservation concern in several countries, whereas in countries outside Europe it has naturalised and is considered an alien weed. In its European range it is considered native in the south, but the northern boundary of its native range is unknown. It is hypothesised that much of its former distribution in northern Europe was the result of repeated introductions from southern Europe and that its decline in northern Europe is the result of habitat change and a reduction in the number of propagules imported to the north. A historical analysis of its ecology and distribution was conducted by mining legacy literature and historical botanical collections. Text analysis of habitat descriptions written on specimens and published in botanical literature covering a period of more than 200 years indicates that the habitat and introduction pathways of C. vulvaria have changed with time. Using the non-European naturalised range in a climate niche model, it is possible to project the range in Europe. Comparing this projected range with a similar model created from all observations reveals a large discrepancy between the realized and predicted distributions. This is discussed together with the social, technological and economic changes that have occurred in northern Europe, with respect to their influence on C. vulvaria.
Project description:The challenge of crystallizing single-pass plasma membrane receptors has remained an obstacle to understanding the structural mechanisms that connect extracellular ligand binding to cytosolic activation. For example, the complex interplay between receptor oligomerization and conformational dynamics has been, historically, only inferred from static structures of isolated receptor domains. A fundamental challenge in the field of membrane receptor biology, then, has been to integrate experimentally observable dynamics of full-length receptors (e.g. diffusion and conformational flexibility) into static structural models of the disparate domains. In certain receptor families, e.g. the ErbB receptors, structures have led somewhat linearly to a putative model of activation. In other families, e.g. the tumor necrosis factor (TNF) receptors, structures have produced divergent hypothetical mechanisms of activation and transduction. Here, we discuss in detail these and other related receptors, with the goal of illuminating the current challenges and opportunities in building comprehensive models of single-pass receptor activation. The deepening understanding of these receptors has recently been accelerated by new experimental and computational tools that offer orthogonal perspectives on both structure and dynamics. As such, this review aims to contextualize those technological developments as we highlight the elegant and complex conformational communication between receptor domains. This article is part of a Special Issue entitled: Interactions between membrane receptors in cellular membranes edited by Kalina Hristova.