Project description: There are more ways to synthesize a 100 amino acid protein (20^100) than there are atoms in the universe. Only a minuscule fraction of such a vast sequence space can ever be experimentally or computationally surveyed. Deep neural networks are increasingly being used to navigate high-dimensional sequence spaces. However, these models are extremely complicated and provide little insight into the fundamental genetic architecture of proteins. Here, by experimentally exploring sequence spaces >10^10, we show that the genetic architecture of at least some proteins is remarkably simple, allowing accurate genetic prediction in high-dimensional sequence spaces with fully interpretable biophysical models. These models capture the non-linear relationships between free energies and phenotypes but otherwise consist of additive free energy changes with a small contribution from pairwise energetic couplings. These energetic couplings are sparse and caused by structural contacts and backbone propagations. Our results suggest that artificial intelligence models may be vastly more complicated than the proteins that they are modeling and that protein genetics is actually both simple and intelligible.
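The model structure described above (additive free energy changes, sparse pairwise couplings, and a non-linear mapping from free energy to phenotype) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the two-state Boltzmann form, the temperature, the mutation names, and all energy values are assumptions chosen for the example.

```python
import math
from itertools import combinations

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def fraction_folded(delta_g, temperature=303.0):
    """Two-state Boltzmann model (assumed form): fraction of molecules
    folded for a folding free energy delta_g in kcal/mol."""
    return 1.0 / (1.0 + math.exp(delta_g / (R_KCAL * temperature)))


def genotype_delta_g(mutations, wt_delta_g, additive, couplings):
    """Free energy of a genotype: wild-type value, plus one additive term
    per mutation, plus any pairwise coupling listed for a pair of mutations.
    Coupling keys are tuples of mutation names in sorted order."""
    total = wt_delta_g + sum(additive[m] for m in mutations)
    for a, b in combinations(sorted(mutations), 2):
        total += couplings.get((a, b), 0.0)  # sparse: most pairs are absent
    return total


# Hypothetical parameters, for illustration only.
additive = {"A1G": 0.8, "L5P": 1.5}
couplings = {("A1G", "L5P"): 0.3}

dg = genotype_delta_g(["L5P", "A1G"], wt_delta_g=-2.0,
                      additive=additive, couplings=couplings)
print(f"double mutant: dG = {dg:.2f} kcal/mol, "
      f"fraction folded = {fraction_folded(dg):.3f}")

# Size of the sequence space quoted in the text, as a power of ten.
print(f"20^100 is about 10^{100 * math.log10(20):.1f}")
```

The only non-linearity in this sketch is the final sigmoid, which is why such a model stays interpretable: every parameter is an energy attached to one mutation or one pair of mutations.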
Project description: Natural variation in protein expression is common in all organisms and contributes to phenotypic differences among individuals. While variation in gene expression at the transcript level has been extensively investigated, our understanding of the genetic mechanisms underlying variation in protein expression has lagged considerably behind. Here we investigate the genetic architecture of protein expression by profiling a deep mouse brain proteome of two inbred strains, C57BL/6J (B6) and DBA/2J (D2), and their reciprocal F1 hybrids using two-dimensional liquid chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS). By comparing protein expression levels in the four mouse strains, we observed 329 statistically significant differentially expressed proteins between the two parental strains and identified four common inheritance patterns: dominant, additive, over-dominant and under-dominant expression. We further applied a proteogenomic approach to detect variant peptides and define protein allele-specific expression (pASE).
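The four inheritance patterns named above can be illustrated by comparing an F1 expression value with the two parental values. The sketch below is an assumption-laden toy: the function name, the log2 scale, and the fixed tolerance `tol` are hypothetical, and the study itself relied on statistical testing rather than a hard threshold.

```python
def classify_inheritance(p1, p2, f1, tol=0.1):
    """Classify F1 expression relative to two parental values (log2 scale).

    Assumes the parents differ by clearly more than `tol`; `tol` is an
    illustrative threshold, not a value taken from the study.
    """
    low, high = min(p1, p2), max(p1, p2)
    mid = (p1 + p2) / 2.0
    if f1 > high + tol:
        return "over-dominant"    # F1 above both parents
    if f1 < low - tol:
        return "under-dominant"   # F1 below both parents
    if abs(f1 - mid) <= tol:
        return "additive"         # F1 at the parental midpoint
    if abs(f1 - p1) <= tol or abs(f1 - p2) <= tol:
        return "dominant"         # F1 matches one parent
    return "intermediate"         # between parents, fits none of the above


# Hypothetical log2 abundances for B6, D2 and their F1.
for f1 in (11.0, 12.05, 13.0, 9.0):
    print(f1, classify_inheritance(10.0, 12.0, f1))
```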
Project description: Chromatin accessibility is an important functional genomics phenotype that influences transcription factor binding and gene expression. Genome-scale technologies allow chromatin accessibility to be mapped at high resolution, facilitating detailed analyses of the genetic architecture and evolution of chromatin structure within and between species. We performed Formaldehyde-Assisted Isolation of Regulatory Elements sequencing (FAIRE-seq) to map chromatin accessibility in two parental haploid yeast species, Saccharomyces cerevisiae and Saccharomyces paradoxus, and their diploid hybrid. We show that although broad-scale characteristics of the chromatin landscape are well conserved between these species, accessibility is significantly different for 947 regions upstream of genes, and these genes are enriched for GO terms such as intracellular transport and protein localization. We also develop new statistical methods to investigate the genetic architecture of variation in chromatin accessibility between species, and find that cis effects are more common and of greater magnitude than trans effects. Interestingly, we find that cis and trans effects at individual genes are often negatively correlated, suggesting widespread compensatory evolution to stabilize levels of chromatin accessibility. Finally, we demonstrate that the relationship between chromatin accessibility and gene expression levels is complex, and a significant proportion of differences in chromatin accessibility might be functionally benign. There are 20 samples in total. These consist of 10 FAIRE-seq samples: 6 haploid samples (S. cerevisiae strain UWOPS05_217_3 replicates 1 and 2, S. cerevisiae strain DBVPG1373 replicates 1 and 2, and S. paradoxus strain CBS432 replicates 1 and 2) and 4 diploid hybrid samples (the hybrid between S. cerevisiae strain UWOPS05_217_3 and S. paradoxus strain CBS432 replicates 1 and 2, and the hybrid between S. cerevisiae strain DBVPG1373 and S. paradoxus strain CBS432 replicates 1 and 2). There are also RNA-seq samples for each of these 10 samples.
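To make the cis/trans distinction concrete, the sketch below uses the classical parent-versus-hybrid decomposition: the between-species difference in the parents reflects cis plus trans effects, while the allelic difference inside the hybrid (shared trans environment) reflects cis alone. This is not the new statistical method the description refers to; the function name and the read counts are hypothetical.

```python
import math


def cis_trans(parent_a, parent_b, hybrid_a, hybrid_b):
    """Classical decomposition on log2 ratios (a sketch, not the study's method).

    parent_a, parent_b: normalized signal for a region in each parental species.
    hybrid_a, hybrid_b: allele-specific signal for the same region in the hybrid.
    """
    total = math.log2(parent_a / parent_b)  # cis + trans
    cis = math.log2(hybrid_a / hybrid_b)    # trans environment is shared
    trans = total - cis
    return cis, trans


# Hypothetical region: parents look identical, yet the hybrid alleles differ,
# so a cis effect is offset by an opposing trans effect (compensation).
cis, trans = cis_trans(parent_a=100, parent_b=100, hybrid_a=200, hybrid_b=100)
print(f"cis = {cis:+.1f}, trans = {trans:+.1f}")
```

A region like this one, with cis and trans effects of opposite sign, is the kind of case that would produce the negative correlation described in the text.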
Project description: This data set comprises population-level measurements (47 samples) of transcription factor DNA binding (PU.1 and RPB2) and histone modification (H3K27ac, H3K4me1 and H3K4me3) levels for a subset of the 1000 Genomes Project CEPH samples. This data was generated as part of the following study: - Population Variation and Genetic Control of Modular Chromatin Architecture in Humans. Cell. 2015 Aug 27;162(5):1039-50. doi: 10.1016/j.cell.2015.08.001. Epub 2015 Aug 20. An additional set of 111 samples from the 1000 Genomes Project (GBR and TSI populations) was also assayed for three histone modifications (H3K27ac, H3K4me1 and H3K4me3). This data was generated as part of the following study: - Chromatin 3D interactions mediate genetic effects on regulatory networks.
Project description: Genome-wide association studies (GWAS) have identified hundreds of susceptibility loci for chronic and inflammatory disease phenotypes in humans. There is increasing evidence that chronic inflammation, which may be genetically determined, is a crucial driver in the pathogenesis of cardiovascular diseases (CVD). To understand the genetic architecture underlying chronic inflammation and CVD, we performed a systematic analysis of (1) common risk alleles from published GWAS and (2) protein-protein interaction (PPI) networks informed by (3) gene expression data for a defined molecular target involved in the inflammatory processes promoting CVD, MRP8. (4) Through analysis of integrated haplotype scores (iHS) and FST values in HapMap phase 2 data, we investigated whether recent selection pressure acting upon inflammatory genes affected CVD susceptibility loci. Our findings provide significant evidence for a PPI network that connects inflammatory and cardiovascular susceptibility genes, and establish a genetic framework of inflammatory CVD. 41.59% of PPI genes are associated with immune functions. 28.3% of integrated genes can be linked to both an inflammatory and a cardiovascular disease phenotype. Interestingly, CDKN2B and CELSR2/PSRC1/MYBPHL/SORT1, unequivocally replicated CVD loci, are integrated within this network, as are several SNPs located in transcription factor recognition sequences, e.g. for NFKB1 and STAT3, which are key factors in inflammation. Finally, we observed a significant enrichment of inflammatory variants within CVD cluster loci that are targets of selection. Overall, 32 genes exhibit traces of selection, 16 of which are part of the PPI, further suggesting that recent selective sweeps may have affected the genomic architecture underlying CVD. 6 samples, no replicates.
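As background for the FST values mentioned above, the following sketch computes Wright's FST for one biallelic SNP in two equally weighted populations, using expected heterozygosities. The description does not say which FST estimator was applied to the HapMap data, so this textbook form and the allele frequencies are assumptions for illustration only.

```python
def fst_two_populations(p1, p2):
    """Wright's FST = (H_T - H_S) / H_T for a biallelic SNP.

    p1, p2: frequency of the same allele in each population.
    Populations are weighted equally (an assumption of this sketch).
    """
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2.0  # mean within-population
    p_bar = (p1 + p2) / 2.0
    h_t = 2 * p_bar * (1 - p_bar)                         # pooled population
    if h_t == 0.0:
        return 0.0  # monomorphic in both populations: no differentiation
    return (h_t - h_s) / h_t


# Hypothetical allele frequencies: identical, moderately diverged, fixed.
for p1, p2 in ((0.5, 0.5), (0.2, 0.8), (1.0, 0.0)):
    print(p1, p2, round(fst_two_populations(p1, p2), 3))
```

High FST at a locus indicates strong allele-frequency differentiation between populations, which is one of the signals used to flag candidate targets of recent selection.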