Project description:SUMMARY:Cellular lineage trees can be derived from single-cell RNA sequencing snapshots of differentiating cells. Currently, only datasets with simple topologies are available. To test and further develop tools for lineage tree reconstruction, we need test datasets with known complex topologies. PROSSTT can simulate scRNA-seq datasets for differentiation processes with lineage trees of any desired complexity, noise level, noise model and size. PROSSTT also provides scripts to quantify the quality of predicted lineage trees. AVAILABILITY AND IMPLEMENTATION:https://github.com/soedinglab/prosstt. SUPPLEMENTARY INFORMATION:Supplementary data are available at Bioinformatics online.
Project description:Choosing a sequence of observations (often with stochastic outcomes) which maximizes the information gain from a system of interacting variables is essential for a wide range of problems in science and technology, such as clinical diagnostic problems. Here, we use a probabilistic model of diseases and signs/symptoms to simulate the effects of medical decisions on the quality of diagnosis by maximizing an appropriate objective function of the medical observations. The study provides a systematic way of proposing new medical tests, considering the significance of diseases and cost of the suggested observations. The efficacy of methods and role of the objective functions as well as initial signs/symptoms are examined by numerical simulations of the diagnostic process by exhaustive or Monte Carlo sampling algorithms.
Project description:As part of the development of the database Bgee (a dataBase for Gene Expression Evolution), we annotate and analyse expression data from different types and different sources, notably Affymetrix data from GEO and ArrayExpress, and RNA-Seq data from SRA. During our quality control procedure, we have identified duplicated content in GEO and ArrayExpress, affecting ∼14% of our data: fully or partially duplicated experiments from independent data submissions, Affymetrix chips reused in several experiments, or reused within an experiment. We present here the procedure that we have established to filter such duplicates from Affymetrix data, and our procedure to identify future potential duplicates in RNA-Seq data.
Project description:MotivationA common problem in understanding a biochemical system is to infer its correct structure or topology. This topology consists of all relevant state variables-usually molecules and their interactions. Here we present a method called topological augmentation to infer this structure in a statistically rigorous and systematic way from prior knowledge and experimental data.ResultsTopological augmentation starts from a simple model that is unable to explain the experimental data and augments its topology by adding new terms that capture the experimental behavior. This process is guided by representing the uncertainty in the model topology through stochastic differential equations whose trajectories contain information about missing model parts. We first apply this semiautomatic procedure to a pharmacokinetic model. This example illustrates that a global sampling of the parameter space is critical for inferring a correct model structure. We also use our method to improve our understanding of glutamine transport in yeast. This analysis shows that transport dynamics is determined by glutamine permeases with two different kinds of kinetics. Topological augmentation can not only be applied to biochemical systems, but also to any system that can be described by ordinary differential equations.Availability and implementationMatlab code and examples are available at: http://www.csb.ethz.ch/tools/index
Project description:We introduce a novel method to represent time independent, scalar data sets as complex networks. We apply our method to investigate gene expression in the response to osmotic stress of Arabidopsis thaliana. In the proposed network representation, the most important genes for the plant response turn out to be the nodes with highest centrality in appropriately reconstructed networks. We also performed a target experiment, in which the predicted genes were artificially induced one by one, and the growth of the corresponding phenotypes compared to that of the wild-type. The joint application of the network reconstruction method and of the in vivo experiments allowed identifying 15 previously unknown key genes, and provided models of their mutual relationships. This novel representation extends the use of graph theory to data sets hitherto considered outside of the realm of its application, vastly simplifying the characterization of their underlying structure.
Project description:We present Epiclomal, a probabilistic clustering method arising from a hierarchical mixture model to simultaneously cluster sparse single-cell DNA methylation data and impute missing values. Using synthetic and published single-cell CpG datasets, we show that Epiclomal outperforms non-probabilistic methods and can handle the inherent missing data characteristic that dominates single-cell CpG genome sequences. Using newly generated single-cell 5mCpG sequencing data, we show that Epiclomal discovers sub-clonal methylation patterns in aneuploid tumour genomes, thus defining epiclones that can match or transcend copy number-determined clonal lineages and opening up an important form of clonal analysis in cancer. Epiclomal is written in R and Python and is available at https://github.com/shahcompbio/Epiclomal.
Project description:Sleep pressure and sleep depth are key regulators of wake and sleep. Current methods of measuring these parameters in Drosophila melanogaster have low temporal resolution and/or require disrupting sleep. Here we report analysis tools for high-resolution, noninvasive measurement of sleep pressure and depth from movement data. Probability of initiating activity, P(Wake), measures sleep depth while probability of ceasing activity, P(Doze), measures sleep pressure. In vivo and computational analyses show that P(Wake) and P(Doze) are largely independent and control the amount of total sleep. We also develop a Hidden Markov Model that allows visualization of distinct sleep/wake substates. These hidden states have a predictable relationship with P(Doze) and P(Wake), suggesting that the methods capture the same behaviors. Importantly, we demonstrate that both the Doze/Wake probabilities and the sleep/wake substates are tied to specific biological processes. These metrics provide greater mechanistic insight into behavior than measuring the amount of sleep alone.
Project description:The paired measurement of RNA and surface proteins in single cells with cellular indexing of transcriptomes and epitopes by sequencing (CITE-seq) is a promising approach to connect transcriptional variation with cell phenotypes and functions. However, combining these paired views into a unified representation of cell state is made challenging by the unique technical characteristics of each measurement. Here we present Total Variational Inference (totalVI; https://scvi-tools.org ), a framework for end-to-end joint analysis of CITE-seq data that probabilistically represents the data as a composite of biological and technical factors, including protein background and batch effects. To evaluate totalVI's performance, we profiled immune cells from murine spleen and lymph nodes with CITE-seq, measuring over 100 surface proteins. We demonstrate that totalVI provides a cohesive solution for common analysis tasks such as dimensionality reduction, the integration of datasets with different measured proteins, estimation of correlations between molecules and differential expression testing.
Project description:Single-cell sequencing technology enables the simultaneous capture of multiomic data from multiple cells. The captured data can be represented by tensors, i.e. the higher-rank matrices. However, the existing analysis tools often take the data as a collection of two-order matrices, renouncing the correspondences among the features. Consequently, we propose a probabilistic tensor decomposition framework, SCOIT, to extract embeddings from single-cell multiomic data. SCOIT incorporates various distributions, including Gaussian, Poisson, and negative binomial distributions, to deal with sparse, noisy, and heterogeneous single-cell data. Our framework can decompose a multiomic tensor into a cell embedding matrix, a gene embedding matrix, and an omic embedding matrix, allowing for various downstream analyses. We applied SCOIT to eight single-cell multiomic datasets from different sequencing protocols. With cell embeddings, SCOIT achieves superior performance for cell clustering compared to nine state-of-the-art tools under various metrics, demonstrating its ability to dissect cellular heterogeneity. With the gene embeddings, SCOIT enables cross-omics gene expression analysis and integrative gene regulatory network study. Furthermore, the embeddings allow cross-omics imputation simultaneously, outperforming current imputation methods with the Pearson correlation coefficient increased by 3.38-39.26%; moreover, SCOIT accommodates the scenario that subsets of the cells are with merely one omic profile available.
Project description:Despite the recent progress in sequencing technologies, genome-wide association studies (GWAS) remain limited by a statistical-power issue: many polymorphisms contribute little to common trait variation and therefore escape detection. The small contribution sometimes corresponds to incomplete penetrance, which may result from probabilistic effects on molecular regulations. In such cases, genetic mapping may benefit from the wealth of data produced by single-cell technologies. We present here the development of a novel genetic mapping method that allows to scan genomes for single-cell Probabilistic Trait Loci that modify the statistical properties of cellular-level quantitative traits. Phenotypic values are acquired on thousands of individual cells, and genetic association is obtained from a multivariate analysis of a matrix of Kantorovich distances. No prior assumption is required on the mode of action of the genetic loci involved and, by exploiting all single-cell values, the method can reveal non-deterministic effects. Using both simulations and yeast experimental datasets, we show that it can detect linkages that are missed by classical genetic mapping. A probabilistic effect of a single SNP on cell shape was detected and validated. The method also detected a novel locus associated with elevated gene expression noise of the yeast galactose regulon. Our results illustrate how single-cell technologies can be exploited to improve the genetic dissection of certain common traits. The method is available as an open source R package called ptlmapper.