Schema: metric learning enables interpretable synthesis of heterogeneous single-cell modalities.
ABSTRACT: A complete understanding of biological processes requires synthesizing information across heterogeneous modalities, such as age, disease status, or gene expression. Technological advances in single-cell profiling have enabled researchers to assay multiple modalities simultaneously. We present Schema, which uses a principled metric learning strategy that identifies informative features in a modality to synthesize disparate modalities into a single coherent interpretation. We use Schema to infer cell types by integrating gene expression and chromatin accessibility data; demonstrate informative data visualizations that synthesize multiple modalities; perform differential gene expression analysis in the context of spatial variability; and estimate evolutionary pressure on peptide sequences.
Project description:High-resolution imaging has revolutionized the study of single cells in their spatial context. However, summarizing the great diversity of complex cell shapes found in tissues and inferring associations with other single-cell data remains a challenge. Here, we present CAJAL, a general computational framework for the analysis and integration of single-cell morphological data. By building upon metric geometry, CAJAL infers cell morphology latent spaces where distances between points indicate the amount of physical deformation required to change the morphology of one cell into that of another. We show that cell morphology spaces facilitate the integration of single-cell morphological data across technologies and the inference of relations with other data, such as single-cell transcriptomic data. We demonstrate the utility of CAJAL with several morphological datasets of neurons and glia and identify genes associated with neuronal plasticity in C. elegans. Our approach provides an effective strategy for integrating cell morphology data into single-cell omics analyses.
Project description:The brain consists of organized ensembles of cells that exhibit distinct morphologies, cellular connectivity, and dynamic biochemistries that control the executive functions of an organism. However, the relationships between chemical heterogeneity, cell function, and phenotype are not always understood. Recent advancements in matrix-assisted laser desorption/ionization mass spectrometry have enabled the high-throughput, multiplexed chemical analysis of single cells, capable of resolving hundreds of molecules in each mass spectrum. We developed a machine learning workflow to classify single cells according to their mass spectra based on cell groups of interest (GOI), e.g., neurons vs astrocytes. Three data sets from various cell groups were acquired on three different mass spectrometer platforms representing thousands of individual cell spectra that were collected and used to validate the single cell classification workflow. The trained models achieved >80% classification accuracy and were subjected to the recently developed instance-based model interpretation framework, SHapley Additive exPlanations (SHAP), which locally assigns feature importance for each single-cell spectrum. SHAP values were used for both local and global interpretations of our data sets, preserving the chemical heterogeneity uncovered by the single-cell analysis while offering the ability to perform supervised analysis. The top contributing mass features to each of the GOI were ranked and selected using mean absolute SHAP values, highlighting the features that are specific to the defined GOI. Our approach provides insight into discriminating the chemical profiles of the single cells through interpretable machine learning, facilitating downstream analysis and validation.
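The ranking step described above (ordering mass features by mean absolute SHAP value) can be sketched as follows. This is a minimal illustration, not the authors' workflow: the SHAP matrix, the feature labels, and the function name `rank_features_by_mean_abs_shap` are all hypothetical, and the SHAP values are assumed to have been computed elsewhere.

```python
def rank_features_by_mean_abs_shap(shap_values, feature_names, top_k=3):
    """Rank features by the mean of |SHAP| across cells (one row per cell)."""
    n_cells = len(shap_values)
    scores = []
    for j, name in enumerate(feature_names):
        # Global importance of feature j: average magnitude of its local SHAP values.
        mean_abs = sum(abs(row[j]) for row in shap_values) / n_cells
        scores.append((name, mean_abs))
    # Largest mean |SHAP| first.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Hypothetical SHAP matrix: 3 cells x 3 mass features (labels are made up).
toy_shap = [
    [0.5, -0.25, 0.0],
    [-0.5, 0.25, 0.125],
    [0.25, -1.0, 0.0],
]
toy_features = ["mz_100", "mz_200", "mz_300"]

ranking = rank_features_by_mean_abs_shap(toy_shap, toy_features)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Taking the absolute value before averaging keeps features whose contributions change sign from cell to cell from cancelling out, which is why the per-cell (local) values and the ranked (global) list can be read side by side.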
Project description:A central challenge in biology is to use existing measurements to predict the outcomes of future experiments. For the rapidly evolving influenza virus, variants examined in one study will often have little to no overlap with other studies, making it difficult to discern patterns or unify datasets. We develop a computational framework that predicts how an antibody or serum would inhibit any variant from any other study. We validate this method using hemagglutination inhibition data from seven studies and predict 2,000,000 new values ± uncertainties. Our analysis quantifies the transferability between vaccination and infection studies in humans and ferrets, shows that serum potency is negatively correlated with breadth, and provides a tool for pandemic preparedness. In essence, this approach enables a shift in perspective when analyzing data from "what you see is what you get" into "what anyone sees is what everyone gets."
Project description:The advent of single-cell RNA sequencing (scRNA-seq) technologies has revolutionized transcriptomic studies. However, large-scale integrative analysis of scRNA-seq data remains a challenge, largely due to unwanted batch effects and the limited transferability, interpretability, and scalability of existing computational methods. We present the single-cell Embedded Topic Model (scETM). Our key contribution is to combine a transferable neural-network-based encoder with an interpretable linear decoder obtained via a matrix tri-factorization. In particular, scETM simultaneously learns an encoder network to infer cell type mixture and a set of highly interpretable gene embeddings, topic embeddings, and batch-effect linear intercepts from multiple scRNA-seq datasets. scETM is scalable to over 10^6 cells and confers remarkable cross-tissue and cross-species zero-shot transfer-learning performance. Using gene set enrichment analysis, we find that scETM-learned topics are enriched in biologically meaningful and disease-related pathways. Lastly, scETM enables the incorporation of known gene sets into the gene embeddings, thereby directly learning the associations between pathways and topics via the topic embeddings.
Project description:A primary challenge of high-throughput imaging flow cytometry (IFC) is to analyze the vast amount of imaging data, especially in applications where ground truth labels are unavailable or hard to obtain. We present an unsupervised deep embedding algorithm, the Deep Convolutional Autoencoder-based Clustering (DCAEC) model, to cluster label-free IFC images without any prior knowledge of input labels. The DCAEC model first encodes the input images into latent representations and then clusters the cells based on those representations. Using the DCAEC model, we achieve a balanced accuracy of 91.9% for human white blood cell (WBC) clustering and 97.9% for WBC/leukemia clustering using the 3D IFC images and 3D DCAEC model. Above all, although no human-recognizable features can separate the clusters of cells with protein localization, we demonstrate that the fused DCAEC model can achieve a cluster balanced accuracy of 85.3% from the label-free 2D transmission and 3D side scattering images. To reveal how the neural network recognizes features beyond human ability, we use the gradient-weighted class activation mapping method to discover the cluster-specific visual patterns automatically. Evaluation results show that the automatically identified salient image regions have strong cluster-specific visual patterns for different clusters, which we believe is a step toward interpretable neural networks for cell analysis with high-throughput IFC.
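Balanced accuracy, the metric quoted above, is the mean of the per-class recalls, so a large class cannot dominate the score. The sketch below shows one way it could be computed for a clustering. It assumes each cluster is assigned to its majority reference label, which is an illustrative choice rather than something the abstract states; the function names and toy data are hypothetical.

```python
from collections import Counter, defaultdict


def map_clusters_to_labels(cluster_ids, true_labels):
    """Assign each cluster the most common reference label among its members."""
    votes = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        votes[cluster][label] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}


def balanced_accuracy(true_labels, predicted_labels):
    """Mean of per-class recall over the classes present in true_labels."""
    hits = Counter()
    totals = Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    return sum(hits[label] / totals[label] for label in totals) / len(totals)


# Toy example: six cells, two clusters (all values are made up).
true_labels = ["wbc", "wbc", "wbc", "wbc", "leukemia", "leukemia"]
cluster_ids = [0, 0, 0, 1, 1, 1]

mapping = map_clusters_to_labels(cluster_ids, true_labels)
predicted = [mapping[c] for c in cluster_ids]
score = balanced_accuracy(true_labels, predicted)
print(mapping, score)
```

In this toy case the recall is 3/4 for one class and 2/2 for the other, giving a balanced accuracy of 0.875, whereas plain accuracy would be 5/6.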
Project description:Heterogeneous information networks (e.g. bibliographic networks and social media networks) that consist of multiple interconnected objects are ubiquitous. Clustering analysis is an effective method to understand the semantic information and interpretable structure of heterogeneous information networks, and it has attracted the attention of many researchers in recent years. However, most studies assume that heterogeneous information networks follow some simple schema, such as a bi-typed network or star network schema, and they can only cluster one type of object in the network at a time. In this paper, a novel clustering framework is proposed based on sparse tensor factorization for heterogeneous information networks, which can cluster multiple types of objects simultaneously in a single pass without any network schema information. The types of objects and the relations between them in the heterogeneous information networks are modeled as a sparse tensor. The clustering problem is modeled as an optimization problem, which is similar to the well-known Tucker decomposition. Then, an Alternating Least Squares (ALS) algorithm and a feasible initialization method are proposed to solve the optimization problem. Based on the tensor factorization, we simultaneously partition different types of objects into different clusters. The experimental results on both synthetic and real-world datasets demonstrate that our proposed clustering framework, STFClus, can model heterogeneous information networks efficiently and can outperform state-of-the-art clustering algorithms as a generally applicable, single-pass clustering method for heterogeneous networks that is agnostic to the network schema.
Project description:Motivation: Single-cell Chromatin ImmunoPrecipitation DNA-Sequencing (scChIP-seq) analysis is challenging due to data sparsity. A high degree of sparsity in biological high-throughput single-cell data is generally handled with imputation methods that complete the data, but specific methods for scChIP-seq are lacking. We present SIMPA, a scChIP-seq data imputation method leveraging predictive information within bulk data from the ENCODE project to impute missing protein-DNA interacting regions of target histone marks or transcription factors. Results: Imputations using machine learning models trained for each single cell, each ChIP protein target, and each genomic region accurately preserve cell type clustering and improve pathway-related gene identification on real human data. Results on bulk data simulating single cells show that the imputations are single-cell specific, as the imputed profiles are closer to the simulated cell than to other cells related to the same ChIP protein target and the same cell type. Simulations also show that 100 input genomic regions are already enough to train single-cell specific models for the imputation of thousands of undetected regions. Furthermore, SIMPA enables the interpretation of machine learning models by revealing interaction sites of a given single cell that are most important for the imputation model trained for a specific genomic region. The corresponding feature importance values derived from promoter-interaction profiles of H3K4me3, an activating histone mark, highly correlate with co-expression of genes that are present within the cell-type specific pathways in two real human and mouse datasets. SIMPA's interpretable imputation method allows users to gain a deep understanding of individual cells and, consequently, of sparse scChIP-seq datasets. Availability and implementation: Our interpretable imputation algorithm was implemented in Python and is available at https://github.com/salbrec/SIMPA.
Project description:Background: Deep learning has emerged as a versatile approach for predicting complex biological phenomena. However, its utility for biological discovery has so far been limited, given that generic deep neural networks provide little insight into the biological mechanisms that underlie a successful prediction. Here we demonstrate deep learning on biological networks, where every node has a molecular equivalent, such as a protein or gene, and every edge has a mechanistic interpretation, such as a regulatory interaction along a signaling pathway. Results: With knowledge-primed neural networks (KPNNs), we exploit the ability of deep learning algorithms to assign meaningful weights in multi-layered networks, resulting in a widely applicable approach for interpretable deep learning. We present a learning method that enhances the interpretability of trained KPNNs by stabilizing node weights in the presence of redundancy, enhancing the quantitative interpretability of node weights, and controlling for uneven connectivity in biological networks. We validate KPNNs on simulated data with known ground truth and demonstrate their practical use and utility in five biological applications with single-cell RNA-seq data for cancer and immune cells. Conclusions: We introduce KPNNs as a method that combines the predictive power of deep learning with the interpretability of biological networks. While demonstrated here on single-cell sequencing data, this method is broadly relevant to other research areas where prior domain knowledge can be represented as networks.
Project description:Experimental methods that capture the individual properties of single cells are revealing the key role of cell-to-cell variability in countless biological processes. These single-cell methods are becoming increasingly important across the life sciences in fields such as immunology, regenerative medicine and cancer biology. In addition to high-dimensional transcriptomic techniques such as single-cell RNA sequencing, there is a need for fast, simple and high-throughput assays to enumerate cell samples based on RNA biomarkers. In this work, we present single-cell nucleic acid profiling in droplets (SNAPD) to analyze sets of transcriptional markers in tens of thousands of single mammalian cells. Individual cells are encapsulated in aqueous droplets on a microfluidic chip and the RNA markers in each cell are amplified. Molecular logic circuits then integrate these amplicons to categorize cells based on the transcriptional markers and produce a detectable fluorescence output. SNAPD is capable of analyzing over 100,000 cells per hour and can be used to quantify distinct cell types within heterogeneous populations, detect rare cells at frequencies down to 0.1% and enrich specific cell types using microfluidic sorting. SNAPD provides a simple, rapid, low cost and scalable approach to study complex phenotypes in heterogeneous cell populations.
Project description:Single-cell RNA sequencing (scRNA-seq) maps gene expression heterogeneity within a tissue. However, identifying biological signals in this data is challenging due to confounding technical factors, sparsity, and high dimensionality. Data factorization methods address this by separating and identifying signals in the data, such as gene expression programs, but the resulting factors must be manually interpreted. We developed Single-Cell Interpretable Residual Decomposition (sciRED) to improve the interpretation of scRNA-seq factor analysis. sciRED removes known confounding effects, uses rotations to improve factor interpretability, maps factors to known covariates, identifies unexplained factors that may capture hidden biological phenomena and determines the genes and biological processes represented by the resulting factors. We apply sciRED to multiple scRNA-seq datasets and identify sex-specific variation in a kidney map, discern strong and weak immune stimulation signals in a PBMC dataset, reduce ambient RNA contamination in a rat liver atlas to help identify strain variation, and reveal rare cell type signatures and anatomical zonation gene programs in a healthy human liver map. These demonstrate that sciRED is useful in characterizing diverse biological signals within scRNA-seq data.