Project description:The systematic functional analysis of combinatorial genetics has been limited by the throughput that can be achieved and the order of complexity that can be studied. To enable massively parallel characterization of genetic combinations in human cells, we developed a technology for rapid, scalable assembly of high-order barcoded combinatorial genetic libraries that can be quantified with high-throughput sequencing. We applied this technology, combinatorial genetics en masse (CombiGEM), to create high-coverage libraries of 1,521 two-wise and 51,770 three-wise barcoded combinations of 39 human microRNA (miRNA) precursors. We identified miRNA combinations that synergistically sensitize drug-resistant cancer cells to chemotherapy and/or inhibit cancer cell proliferation, providing insights into complex miRNA networks. More broadly, our method will enable high-throughput profiling of multifactorial genetic combinations that regulate phenotypes of relevance to biomedicine, biotechnology and basic science.
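The library sizes quoted above can be checked with simple counting. The sketch below assumes, which the abstract does not state, that a combination is an ordered tuple of precursors with repeats allowed, so that n precursors give n^k k-wise combinations. The function name is hypothetical.

```python
from itertools import product

N_PRECURSORS = 39  # number of miRNA precursors, taken from the abstract


def count_ordered_combinations(n_items: int, order: int) -> int:
    """Count ordered tuples of length `order`, repeats allowed (assumed model)."""
    return sum(1 for _ in product(range(n_items), repeat=order))


two_wise = count_ordered_combinations(N_PRECURSORS, 2)
three_wise = count_ordered_combinations(N_PRECURSORS, 3)
print(two_wise)    # 1521
print(three_wise)  # 59319
```

Under this assumption the two-wise count matches the 1,521 combinations reported. The three-wise count is 59,319, which is larger than the 51,770 reported; the abstract does not explain the difference, so the sketch should not be read as an account of it.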
Project description:Motivation: Cancer heterogeneity is observed at multiple biological levels. To improve our understanding of these differences and their relevance in medicine, approaches to link organ- and tissue-level information from diagnostic images and cellular-level information from genomics are needed. However, these 'radiogenomic' studies often use linear or shallow models, depend on feature selection, or consider one gene at a time to map images to genes. Moreover, no study has systematically attempted to understand the molecular basis of imaging traits based on the interpretation of what the neural network has learned. These studies are thus limited in their ability to understand the transcriptomic drivers of imaging traits, which could provide additional context for determining clinical outcomes. Results: We present a neural network-based approach that takes high-dimensional gene expression data as input and performs non-linear mapping to an imaging trait. To interpret the models, we propose gene masking and gene saliency to extract learned relationships from radiogenomic neural networks. In glioblastoma patients, our models outperformed comparable classifiers (>0.10 AUC) and our interpretation methods were validated using a similar model to identify known relationships between genes and molecular subtypes. We found that tumor imaging traits had specific transcription patterns, e.g. edema and genes related to cellular invasion, and 10 radiogenomic traits were significantly predictive of survival. We demonstrate that neural networks can model transcriptomic heterogeneity to reflect differences in imaging and can be used to derive radiogenomic traits with clinical value. Availability and implementation: https://github.com/novasmedley/deepRadiogenomics. Contact: whsu@mednet.ucla.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
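The two interpretation ideas named above, gene masking and gene saliency, can be illustrated on a toy model. This is a minimal sketch under stated assumptions, not the code from the linked repository: the model is a single logistic unit rather than a neural network, the gene names and weights are made up, masking is assumed to mean zeroing every gene outside a chosen set, and saliency is assumed to mean the gradient of the prediction with respect to each input.

```python
import math

# Hypothetical toy model: one logistic unit over four genes.
# Gene names, weights and bias are invented for illustration.
GENES = ["g1", "g2", "g3", "g4"]
WEIGHTS = [1.5, -2.0, 0.0, 0.5]
BIAS = -0.25


def predict(expression):
    """Probability of the imaging trait for one expression vector."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, expression))
    return 1.0 / (1.0 + math.exp(-z))


def mask_genes(expression, keep):
    """Gene masking (assumed form): keep genes in `keep`, zero the rest."""
    return [x if g in keep else 0.0 for g, x in zip(GENES, expression)]


def saliency(expression):
    """Gene saliency (assumed form): gradient of the prediction per input gene."""
    p = predict(expression)
    # For a logistic unit, d p / d x_i = p * (1 - p) * w_i.
    return {g: p * (1.0 - p) * w for g, w in zip(GENES, WEIGHTS)}


sample = [1.0, 0.5, 2.0, 1.0]
full = predict(sample)
masked = predict(mask_genes(sample, keep={"g1", "g4"}))
sal = saliency(sample)
top_gene = max(sal, key=lambda g: abs(sal[g]))
print(round(full, 3), round(masked, 3), top_gene)
```

Comparing `full` with `masked` shows how much of the prediction a gene set carries on its own, while `sal` ranks individual genes for one sample. In a real network the gradient would come from automatic differentiation rather than a closed form.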
Project description:MOTIVATION: Although several methods exist to relate high-dimensional gene expression data to various clinical phenotypes, finding combinations of features in such input remains a challenge, particularly when fitting complex statistical models such as those used for survival studies. RESULTS: Our proposed method builds on existing 'regularization path-following' techniques to produce regression models that can extract arbitrarily complex patterns of input features (such as gene combinations) from large-scale data that relate to a known clinical outcome. Through the use of the data's structure and itemset mining techniques, we are able to avoid combinatorial complexity issues typically encountered with such methods, and our algorithm runs in a similar order of time to single-variable versions. Applied to data from various clinical studies of cancer patient survival time, our method was able to produce a number of promising gene-interaction candidates whose tumour-related roles appear confirmed by literature.
Project description:Despite the need for inducible promoters in strain development efforts, the majority of engineering in Saccharomyces cerevisiae continues to rely on a few constitutively active or inducible promoters. Building on advances that use the modular nature of both transcription factors and promoter regions, we have built a library of hybrid promoters that are regulated by a synthetic transcription factor. The hybrid promoters consist of native S. cerevisiae promoters, in which the operator regions have been replaced with sequences that are recognized by the bacterial LexA DNA binding protein. Correspondingly, the synthetic transcription factor (TF) consists of the DNA binding domain of the LexA protein, fused with the human estrogen binding domain and the viral activator domain, VP16. The resulting system with a bacterial DNA binding domain avoids the transcription of native S. cerevisiae genes, and the hybrid promoters can be induced using estradiol, a compound with no detectable impact on S. cerevisiae physiology. Using combinations of one, two or three operator sequence repeats and a set of native S. cerevisiae promoters, we obtained a series of hybrid promoters that can be induced to different levels, using the same synthetic TF and a given estradiol concentration. This set of promoters, in combination with our synthetic TF, has the potential to regulate numerous genes or pathways simultaneously, to multiple desired levels, in a single strain.
Project description:Mutations in genes that confer a selective advantage to hematopoietic stem cells (HSCs) drive clonal hematopoiesis (CH). While some CH drivers have been identified, the compendium of all genes able to drive CH upon mutations in HSCs remains incomplete. Exploiting signals of positive selection in blood somatic mutations may be an effective way to identify CH driver genes, analogously to cancer. Using the tumor sample in blood/tumor pairs as reference, we identify blood somatic mutations across more than 12,000 donors from two large cancer genomics cohorts. The application of IntOGen, a driver discovery pipeline, to both cohorts, and more than 24,000 targeted sequenced samples yields a list of close to 70 genes with signals of positive selection in CH, available at http://www.intogen.org/ch . This approach recovers known CH genes, and discovers other candidates.
Project description:Motivation: Transcription factors bind regulatory DNA sequences in a combinatorial manner to modulate gene expression. Deep neural networks (DNNs) can learn the cis-regulatory grammars encoded in regulatory DNA sequences associated with transcription factor binding and chromatin accessibility. Several feature attribution methods have been developed for estimating the predictive importance of individual features (nucleotides or motifs) in any input DNA sequence to its associated output prediction from a DNN model. However, these methods do not reveal higher-order feature interactions encoded by the models. Results: We present a new method called Deep Feature Interaction Maps (DFIM) to efficiently estimate interactions between all pairs of features in any input DNA sequence. DFIM accurately identifies ground truth motif interactions embedded in simulated regulatory DNA sequences. DFIM identifies synergistic interactions between GATA1 and TAL1 motifs from in vivo TF binding models. DFIM reveals epistatic interactions involving nucleotides flanking the core motif of the Cbf1 TF in yeast from in vitro TF binding models. We also apply DFIM to regulatory sequence models of in vivo chromatin accessibility to reveal interactions between regulatory genetic variants and proximal motifs of target TFs as validated by TF binding quantitative trait loci. Our approach makes significant strides in improving the interpretability of deep learning models for genomics. Availability and implementation: Code is available at: https://github.com/kundajelab/dfim. Supplementary information: Supplementary data are available at Bioinformatics online.
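The idea of scoring interactions between all pairs of input features can be sketched on a toy function. This is not the DFIM implementation from the linked repository. It assumes one simple reading of a pairwise interaction: the change in a target feature's importance when a source feature is perturbed. Importance is measured here by occlusion (zeroing a feature), and `model` is a hypothetical stand-in for a trained network.

```python
def model(x):
    """Hypothetical stand-in for a trained network: features 0 and 1 interact."""
    return 2.0 * x[0] * x[1] + x[2]


def occlude(x, i):
    """Return a copy of x with feature i set to zero."""
    return [0.0 if k == i else v for k, v in enumerate(x)]


def importance(x, j):
    """Occlusion importance of feature j: drop in output when j is zeroed."""
    return model(x) - model(occlude(x, j))


def interaction(x, source, target):
    """Change in the target's importance when the source is occluded."""
    return importance(x, target) - importance(occlude(x, source), target)


def interaction_map(x):
    """Interaction score for every ordered pair of distinct features."""
    n = len(x)
    return {(i, j): interaction(x, i, j)
            for i in range(n) for j in range(n) if i != j}


scores = interaction_map([1.0, 1.0, 1.0])
print(scores[(0, 1)])  # 2.0
print(scores[(0, 2)])  # 0.0
```

Features 0 and 1 enter the toy model as a product, so occluding one removes the other's importance and the score is non-zero. Feature 2 is additive, so its importance is unchanged and the score is zero.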
Project description:With the completion of full genome sequences and advancement in high-throughput technologies, in silico methods have been successfully used to integrate diverse data sources toward unraveling the combinatorial nature of transcriptional regulation. So far, almost all of these studies are restricted to lower eukaryotes such as budding yeast. We describe here a computational search for functional transcription-factor (TF) combinations using phylogenetically conserved sequences and microarray-based expression data. Taking into account both orientational and positional constraints, we investigated the overrepresentation of binding sites in the vicinity of one another and whether these combinations result in more coherent expression profiles. Without any prior biological knowledge, the search led to the discovery of several experimentally established TF associations, as well as some novel ones. In particular, we identified a regulatory module controlling cell cycle-dependent transcription of G2-M genes and expanded its functional generality. We also detected many homotypic combinations, supporting the importance of binding-site density in transcriptional regulation of higher eukaryotes.
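One common way to test whether two binding sites occur together more often than chance would predict is a hypergeometric tail probability. The abstract does not name the statistic used, so the choice of test here is an assumption, and the sketch ignores the orientational and positional constraints mentioned above. The function name and the example counts are hypothetical.

```python
from math import comb


def cooccurrence_pvalue(n_promoters, n_with_a, n_with_b, n_with_both):
    """P(overlap >= n_with_both) if sites A and B were placed independently.

    Upper tail of the hypergeometric distribution: n_with_b promoters are
    drawn from n_promoters, of which n_with_a carry site A.
    """
    total = comb(n_promoters, n_with_b)
    upper = min(n_with_a, n_with_b)
    tail = sum(
        comb(n_with_a, k) * comb(n_promoters - n_with_a, n_with_b - k)
        for k in range(n_with_both, upper + 1)
    )
    return tail / total


# Made-up counts: 10 promoters, 4 with site A, 4 with site B, all 4 shared.
p_all_shared = cooccurrence_pvalue(10, 4, 4, 4)
p_any = cooccurrence_pvalue(10, 4, 4, 0)
print(p_all_shared, p_any)
```

A small p-value flags a candidate TF pair. Whether that pair also gives more coherent expression profiles, as the abstract describes, would be a separate check.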
Project description:We used combinatorial engineering to investigate the relationships between structure and linkage specificity of the dextransucrase DSR-S from Leuconostoc mesenteroides NRRL B-512F, and to generate variants with altered specificity. Sequence and structural analysis of glycoside-hydrolase family 70 enzymes led to eight amino acids (D306, F353, N404, W440, D460, H463, T464 and S512) being targeted, randomized by saturation mutagenesis and simultaneously recombined. Screening of two libraries totaling 3.6 × 10^4 clones allowed the isolation of a toolbox comprising 81 variants which synthesize high molecular weight α-glucans with different proportions of α(1→3) linkages ranging from 3 to 20%. Mutant sequence analysis, biochemical characterization and molecular modelling studies revealed the previously unknown role of peptide (460)DYVHT(464) in DSR-S linkage specificity. This peptide sequence, together with residue S512, contributes to defining +2 subsite topology, which may be critical for the enzyme regiospecificity.
Project description:Microarrays are commonly used in biology because of their ability to simultaneously measure thousands of genes under different conditions. Because these data sets typically contain a large number of variables but far fewer samples, scalable network analysis techniques are often employed. In particular, consensus approaches have been recently used that combine multiple microarray studies in order to find networks that are more robust. The purpose of this paper, however, is to combine multiple microarray studies to automatically identify subnetworks that are distinctive to specific experimental conditions rather than common to them all. To better understand key regulatory mechanisms and how they change under different conditions, we derive unique networks from multiple independent networks built using glasso, which goes beyond standard correlations. This involves calculating cluster prediction accuracies to detect the most predictive genes for a specific set of conditions. We differentiate between accuracies calculated using cross-validation within a selected cluster of studies (the intra prediction accuracy) and those calculated on a set of independent studies belonging to different study clusters (the inter prediction accuracy). Finally, we compare our method's results to related state-of-the-art techniques. We explore how the proposed pipeline performs on both synthetic data and real data (wheat and Fusarium). Our results show that subnetworks specific to subsets of studies can be identified reliably and that these networks reflect key mechanisms that are fundamental to the experimental conditions in each of those subsets.
Project description:Homophily is the seemingly ubiquitous tendency for people to connect and interact with other individuals who are similar to them. This is a well-documented principle and is fundamental for how society organizes. Although many social interactions occur in groups, homophily has traditionally been measured using a graph model, which only accounts for pairwise interactions involving two individuals. Here, we develop a framework using hypergraphs to quantify homophily from group interactions. This reveals natural patterns of group homophily that appear with gender in scientific collaboration and political affiliation in legislative bill cosponsorship and also reveals distinctive gender distributions in group photographs, all of which cannot be fully captured by pairwise measures. At the same time, we show that seemingly natural ways to define group homophily are combinatorially impossible. This reveals important pitfalls to avoid when defining and interpreting notions of group homophily, as higher-order homophily patterns are governed by combinatorial constraints that are independent of human behavior but are easily overlooked.
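A group-level homophily score can be sketched directly on a hypergraph stored as a list of node tuples. The score below is one plausible definition and is not claimed to be the framework's own: among all memberships of class-X nodes in groups of size k, it takes the share that fall in groups with exactly t members of X, and compares it with the share expected if groups were formed uniformly at random. Function names and the example data are hypothetical.

```python
from math import comb


def affinity(hyperedges, labels, target, k, t):
    """Share of `target`-class memberships in size-k groups that lie in
    groups containing exactly t members of the `target` class."""
    hits = 0
    total = 0
    for edge in hyperedges:
        if len(edge) != k:
            continue
        in_class = sum(1 for v in edge if labels[v] == target)
        total += in_class          # one membership per class member in the group
        if in_class == t:
            hits += in_class
    return hits / total if total else 0.0


def baseline(n_nodes, n_class, k, t):
    """Same share under uniformly random size-k groups (requires 1 <= t <= k)."""
    return (comb(n_class - 1, t - 1) * comb(n_nodes - n_class, k - t)
            / comb(n_nodes - 1, k - 1))


# Made-up example: six people, two classes, four groups of three.
labels = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
groups = [(0, 1, 2), (0, 1, 3), (3, 4, 5), (2, 4, 5)]

observed = [affinity(groups, labels, "A", 3, t) for t in (1, 2, 3)]
expected = [baseline(6, 3, 3, t) for t in (1, 2, 3)]
print(observed)
print(expected)
```

Both lists sum to one across t, so raising the score for one group composition must lower it for another. That is a small instance of the combinatorial constraints the abstract describes as independent of human behavior.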