Project description:Single cell RNA sequencing (scRNA-seq) can be used to characterize variation in gene expression levels at high resolution. However, the sources of experimental noise in scRNA-seq are not yet well understood. We investigated the technical variation associated with sample processing using the single cell Fluidigm C1 platform. To do so, we processed three C1 replicates from three human induced pluripotent stem cell (iPSC) lines. We added unique molecular identifiers (UMIs) to all samples, to account for amplification bias. We found that the major source of variation in the gene expression data was driven by genotype, but we also observed substantial variation between the technical replicates. We observed that the conversion of reads to molecules using the UMIs was impacted by both biological and technical variation, indicating that UMI counts are not an unbiased estimator of gene expression levels. Based on our results, we suggest a framework for effective scRNA-seq studies.
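The conversion of reads to molecules mentioned above can be illustrated with a minimal sketch. It assumes each read has already been assigned a cell, a gene, and a UMI sequence, and it collapses reads by exact UMI match only (no correction for sequencing errors in the UMI). The function name `count_molecules` and the toy reads are hypothetical and are not taken from the study's pipeline.

```python
from collections import defaultdict


def count_molecules(reads):
    """Return (read_counts, molecule_counts) keyed by (cell, gene).

    Reads that share the same cell, gene and UMI are assumed to be
    amplification copies of one original molecule and are counted once.
    """
    umis_seen = defaultdict(set)
    read_counts = defaultdict(int)
    for cell, gene, umi in reads:
        read_counts[(cell, gene)] += 1
        umis_seen[(cell, gene)].add(umi)
    molecule_counts = {key: len(umis) for key, umis in umis_seen.items()}
    return dict(read_counts), molecule_counts


# Toy input (hypothetical): three reads for GENE_A in cell1, two share a UMI.
reads = [
    ("cell1", "GENE_A", "AACGT"),
    ("cell1", "GENE_A", "AACGT"),
    ("cell1", "GENE_A", "TTGCA"),
    ("cell1", "GENE_B", "GGATC"),
    ("cell2", "GENE_A", "AACGT"),
]
read_counts, molecule_counts = count_molecules(reads)
```

Comparing `read_counts` with `molecule_counts` per gene and cell gives the reads-to-molecules conversion rate whose variability the description discusses.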
Project description:Open chromatin is implicated in regulatory processes, and thus variation in chromatin structure may contribute to variation in gene expression and other molecular phenotypes. In this work, we performed targeted deep sequencing to identify somatic mutations and genetic polymorphisms underlying accessible chromatin in the genomes of 72 monozygotic twins. Open chromatin sequencing based on the FAIRE assay for 36 pairs of monozygotic twins.
Project description:Plasmodium species, the causative agents of malaria, have a complex life cycle involving two hosts. The sporozoite life stage is characterized by an extended phase in the mosquito salivary glands followed by free movement and rapid invasion of hepatocytes in the human host. This transmission stage has been the subject of many transcriptomics and proteomics studies and is also targeted by the most advanced malaria vaccine. We applied Bayesian data integration to determine which proteins are not only present in sporozoites but are also specific to that stage. Transcriptomic and proteomic Plasmodium data sets from 26 studies were weighted for how representative they are for sporozoites, based on a carefully assembled gold standard for Plasmodium falciparum (Pf) proteins known to be present or absent during the sporozoite life stage. Of 5418 Pf genes for which expression data were available at the RNA level or at the protein level, 975 were identified as enriched in sporozoites and 90 as specific to them. We show that Pf sporozoites are enriched for proteins involved in type II fatty acid synthesis in the apicoplast and GPI anchor synthesis, but otherwise appear relatively metabolically inactive in the salivary glands of mosquitoes. Newly annotated hypothetical sporozoite-specific and sporozoite-enriched proteins highlight sporozoite-specific functions. They include PF3D7_0104100, which we identified as homologous to the prominin family, which in humans has been related to a quiescent state of cancer cells. We document high levels of genetic variability for sporozoite proteins, specifically for sporozoite-specific proteins that elicit antibodies in the human host. Nevertheless, we can identify nine relatively well-conserved sporozoite proteins that elicit antibodies and that together can serve as markers for previous exposure. Our understanding of sporozoite biology benefits from identifying key pathways that are enriched during this life stage. This work can guide studies of molecular mechanisms underlying sporozoite biology and potential well-conserved targets for marker and drug development.
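One common way to weight data sets against a gold standard is a naive Bayes log-likelihood ratio; the sketch below illustrates that general idea and should not be read as the authors' exact procedure. It assumes each data set is reduced to the set of genes it detects, treats data sets as independent, and uses a pseudocount to avoid division by zero. All names (`dataset_log_lrs`, `score_gene`) and the toy gene identifiers are hypothetical.

```python
import math


def dataset_log_lrs(detected, gold_pos, gold_neg, pseudo=1.0):
    """Log-likelihood ratios for 'detected' and 'not detected' in one data set.

    gold_pos / gold_neg are gold-standard genes known to be present / absent
    in the stage of interest.
    """
    p_pos = (len(detected & gold_pos) + pseudo) / (len(gold_pos) + 2 * pseudo)
    p_neg = (len(detected & gold_neg) + pseudo) / (len(gold_neg) + 2 * pseudo)
    return math.log(p_pos / p_neg), math.log((1 - p_pos) / (1 - p_neg))


def score_gene(gene, datasets, gold_pos, gold_neg):
    """Sum the per-data-set evidence for a gene (naive Bayes assumption)."""
    score = 0.0
    for detected in datasets:
        lr_present, lr_absent = dataset_log_lrs(detected, gold_pos, gold_neg)
        score += lr_present if gene in detected else lr_absent
    return score


# Toy example (hypothetical genes): the first data set separates the gold
# standard well, the second does not and therefore contributes no evidence.
gold_pos = {"g1", "g2"}
gold_neg = {"g3", "g4"}
datasets = [{"g1", "g2", "g5"}, {"g1", "g3", "g5"}]
score_g5 = score_gene("g5", datasets, gold_pos, gold_neg)
score_g6 = score_gene("g6", datasets, gold_pos, gold_neg)
```

A data set that detects gold-standard positives and negatives equally often gets a log-likelihood ratio of zero, so it neither raises nor lowers a gene's score.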
Project description:We carried out a comparative genomic analysis of 48 avian species to identify avian-specific highly conserved elements (ASHCEs). We performed genome-wide chromatin immunoprecipitation sequencing (ChIP-seq) for three enhancer-associated histone modifications (H3K4me1, H3K27ac, H3K27me3), to investigate dynamic regulatory roles of ASHCEs in chicken development. We found that all three enhancer-associated histone marks are enriched in ASHCEs compared to the whole genome background.
Project description:Background: The highly dimensional data produced by functional genomic (FG) studies makes it difficult to visualize relationships between gene products and experimental conditions (i.e., assays). Although dimensionality reduction methods such as principal component analysis (PCA) have been very useful, their application to identify assay-specific signatures has been limited by the lack of appropriate methodologies. This article proposes a new and powerful PCA-based method for the identification of assay-specific gene signatures in FG studies. Results: The proposed method (PM) is unique for several reasons. First, it is the only one, to our knowledge, that uses gene contribution, a product of the loading and expression level, to obtain assay signatures. The PM develops and exploits two types of assay-specific contribution plots, which are new to the application of PCA in the FG area. The first type plots the assay-specific gene contribution against the given order of the genes and reveals variations in distribution between assay-specific gene signatures as well as outliers within assay groups indicating the degree of importance of the most dominant genes. The second type plots the contribution of each gene in ascending or descending order against a constantly increasing index. This type of plot reveals assay-specific gene signatures defined by the inflection points in the curve. In addition, sharp regions within the signature define the genes that contribute the most to the signature. We proposed and used the curvature as an appropriate metric to characterize these sharp regions, thus identifying the subset of genes contributing the most to the signature. Finally, the PM uses the full dataset to determine the final gene signature, thus eliminating the chance of gene exclusion by poor screening in earlier steps.
The strengths of the PM are demonstrated using a simulation study and two studies of real DNA microarray data: a study of classification of human tissue samples and a study of E. coli cultures with different medium formulations. Conclusion: We have developed a PCA-based method that effectively identifies assay-specific signatures in ranked groups of genes from the full data set with a more efficient and simpler procedure than current approaches. Although this work demonstrates the ability of the PM to identify assay-specific signatures in DNA microarray experiments, this approach could be useful in areas such as proteomics and metabolomics.
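The ranked contribution plot and its curvature can be sketched as follows. This is a minimal illustration under stated assumptions: the PCA loadings for one component are taken as given rather than computed, the contribution of a gene is the product of its loading and its expression level in one assay, and curvature is approximated with central finite differences on a unit-spaced index using the standard plane-curve formula |y''| / (1 + y'^2)^(3/2). The abstract does not specify the exact curvature definition, so that formula and the function names are assumptions.

```python
def gene_contributions(loadings, expression):
    """Contribution of each gene: loading times expression level."""
    return [l * x for l, x in zip(loadings, expression)]


def ranked_curvature(contributions):
    """Sort contributions in descending order and estimate curvature.

    Returns (ranked, curvature), where curvature[i] belongs to ranked[i + 1]
    because the two end points have no central difference.
    """
    ranked = sorted(contributions, reverse=True)
    curvature = []
    for i in range(1, len(ranked) - 1):
        d1 = (ranked[i + 1] - ranked[i - 1]) / 2.0              # first derivative
        d2 = ranked[i + 1] - 2.0 * ranked[i] + ranked[i - 1]    # second derivative
        curvature.append(abs(d2) / (1.0 + d1 * d1) ** 1.5)
    return ranked, curvature


# Hypothetical loadings and expression levels for four genes in one assay.
loadings = [0.5, 0.1, 0.8, 0.2]
expression = [2.0, 1.0, 5.0, 0.5]
ranked, curvature = ranked_curvature(gene_contributions(loadings, expression))

# A straight-line ranking has no sharp region, so its curvature is zero.
_, flat = ranked_curvature([3.0, 2.0, 1.0])
```

Positions with high curvature mark the sharp regions of the ranked curve, which the description associates with the genes that contribute most to a signature.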
Project description:The Global Pandemic Lineage (GPL) of the amphibian pathogen Batrachochytrium dendrobatidis (Bd) has been described as a main driver of amphibian extinctions on nearly every continent. Near-complete genomes of three Bd-GPL strains have enabled studies of the pathogen, but the genomic features that set Bd-GPL apart from other Bd lineages are not well understood due to a lack of high-quality genome assemblies and annotations from other lineages. We used long-read DNA sequencing to assemble high-quality genomes of three Bd-BRAZIL isolates and one non-pathogenic outgroup species, Polyrhizophydium stewartii (Ps) strain JEL0888, and compared these to genomes of previously sequenced Bd-GPL strains. The Bd-BRAZIL assemblies range in size between 22.0 and 26.1 Mb and encode 8495-8620 protein-coding genes for each strain. Our pan-genome analysis provided insight into shared and lineage-specific gene content. The core genome of Bd consists of 6278 conserved gene families, with 202 Bd-BRAZIL and 172 Bd-GPL specific gene families. We discovered gene copy number variation in pathogenicity gene families between Bd-BRAZIL and Bd-GPL strains, though none were consistently expanded in Bd-GPL or Bd-BRAZIL strains. Comparison within the Batrachochytrium genus and two closely related non-pathogenic saprophytic chytrids identified variation in sequence and protein domain counts. We further tested these new Bd-BRAZIL genomes to assess their utility as reference genomes for transcriptome alignment and analysis. Our analysis examines the genomic variation between strains in Bd-BRAZIL and Bd-GPL and offers insights into the application of these genomes as reference genomes for future studies.
Project description:Comparative genomics studies in primates are extremely restricted due to our limited access to samples from non-human apes. In order to gain better insight into the genetic processes that underlie variation in complex phenotypes in primates, we must have access to faithful model systems for a wide range of cell types. To facilitate this, we have generated a panel of 7 fully characterized chimpanzee induced pluripotent stem cell (iPSC) lines derived from healthy donors. To begin demonstrating the utility of comparative iPSC panels, we collected RNA-sequencing and DNA methylation data from the chimpanzee iPSCs and the corresponding fibroblast lines, as well as from 7 human iPSCs and their source lines, which encompass multiple populations and cell types. We observe much less within-species variation in iPSCs than in somatic cells, indicating that the reprogramming process erases many inter-individual differences. The low within-species regulatory variation in iPSCs allowed us to identify many novel inter-species regulatory differences of small magnitude. We used ChIP-seq to characterize the genome-wide distribution of two types of histone modifications (H3K27me3 and H3K27ac) in three of our chimpanzee iPSCs and compared them to histone modification data from three human iPSC lines from the Roadmap Epigenomics project.
Project description:The mechanisms by which DNA alleles contribute to disease risk, drug response, and other human phenotypes are highly context-specific, varying across cell types and under different conditions. Human induced pluripotent stem cells (hiPSCs) are uniquely suited to study these context-dependent effects, but doing so requires cell lines from hundreds or thousands of individuals. Village cultures, where multiple hiPSC lines are cultured and differentiated in a single dish, provide an elegant solution for scaling hiPSC experiments to the sample sizes required for population-scale studies. Here, we show the utility of village models, demonstrating how cells can be assigned back to a donor line using single-cell sequencing and addressing whether line-specific signalling alters the transcriptional profiles of companion lines in a village. We generated single-cell RNA sequence data from hiPSC lines cultured independently (uni-culture) and in villages at three independent sites. Using a mixed linear model framework, we estimate that transcriptional variation across cells is predominantly due to donor effects, with minimal evidence of variation due to culturing in a village system. We demonstrate that genetic, epigenetic or hiPSC line-specific effects, rather than village status, explain a large percentage of gene expression variation for many genes. This is reiterated by replication of previously identified genetic effects. Finally, we demonstrate consistency in the landscape of cell states between uni- and village-culture systems. We demonstrate that village methods can effectively detect hiPSC line-specific effects, including sensitive dynamics of cell states.
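The idea of attributing a proportion of expression variance to donor can be illustrated with a simplified stand-in for a mixed linear model: a one-way random-effects ANOVA estimated by the method of moments. This is not the framework used in the study; it assumes a single gene, a balanced design (the same number of cells per donor), and no other covariates such as site or culture type. The function name and the toy values are hypothetical.

```python
from statistics import mean


def donor_variance_fraction(groups):
    """Fraction of variance attributed to donor for one gene.

    groups maps donor -> list of expression values, with equal list lengths.
    Uses one-way random-effects ANOVA (method of moments), truncating a
    negative donor variance estimate at zero.
    """
    sizes = {len(values) for values in groups.values()}
    if len(groups) < 2 or len(sizes) != 1 or min(sizes) < 2:
        raise ValueError("need >= 2 donors with equal group sizes >= 2")
    k = len(groups)
    n = sizes.pop()
    grand = mean([v for values in groups.values() for v in values])
    ms_between = n * sum((mean(vals) - grand) ** 2 for vals in groups.values()) / (k - 1)
    ms_within = sum(
        (v - mean(vals)) ** 2 for vals in groups.values() for v in vals
    ) / (k * (n - 1))
    var_donor = max((ms_between - ms_within) / n, 0.0)
    total = var_donor + ms_within
    return var_donor / total if total > 0 else 0.0


# Toy data (hypothetical): donors differ strongly, cells within a donor barely.
strong = donor_variance_fraction({
    "donorA": [1.0, 1.2, 0.8],
    "donorB": [3.0, 3.1, 2.9],
    "donorC": [5.0, 4.9, 5.1],
})
# Identical distributions across donors: no variance attributed to donor.
none = donor_variance_fraction({"donorA": [1.0, 2.0, 3.0], "donorB": [1.0, 2.0, 3.0]})
```

A full analysis would fit donor, site, and village status jointly as random or fixed effects; this sketch only shows how a variance fraction is read off once the components are estimated.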