Dataset Information

Schema: metric learning enables interpretable synthesis of heterogeneous single-cell modalities.

ABSTRACT: A complete understanding of biological processes requires synthesizing information across heterogeneous modalities, such as age, disease status, or gene expression. Technological advances in single-cell profiling have enabled researchers to assay multiple modalities simultaneously. We present Schema, which uses a principled metric learning strategy that identifies informative features in a modality to synthesize disparate modalities into a single coherent interpretation. We use Schema to infer cell types by integrating gene expression and chromatin accessibility data; demonstrate informative data visualizations that synthesize multiple modalities; perform differential gene expression analysis in the context of spatial variability; and estimate evolutionary pressure on peptide sequences.

SUBMITTER: Singh R

PROVIDER: S-EPMC8091541 | biostudies-literature | 2021 May

REPOSITORIES: biostudies-literature

Publications

Rohit Singh, Brian L. Hie, Ashwin Narayan, Bonnie Berger

Genome Biology, 2021-05-03, issue 1

PMID: 33941239

Similar Datasets

CAJAL enables analysis and integration of single-cell morphological data using metric geometry.

Project description:High-resolution imaging has revolutionized the study of single cells in their spatial context. However, summarizing the great diversity of complex cell shapes found in tissues and inferring associations with other single-cell data remains a challenge. Here, we present CAJAL, a general computational framework for the analysis and integration of single-cell morphological data. By building upon metric geometry, CAJAL infers cell morphology latent spaces where distances between points indicate the amount of physical deformation required to change the morphology of one cell into that of another. We show that cell morphology spaces facilitate the integration of single-cell morphological data across technologies and the inference of relations with other data, such as single-cell transcriptomic data. We demonstrate the utility of CAJAL with several morphological datasets of neurons and glia and identify genes associated with neuronal plasticity in C. elegans. Our approach provides an effective strategy for integrating cell morphology data into single-cell omics analyses.

| S-EPMC10282047 | biostudies-literature

Single-Cell Classification Using Mass Spectrometry through Interpretable Machine Learning.

Project description:The brain consists of organized ensembles of cells that exhibit distinct morphologies, cellular connectivity, and dynamic biochemistries that control the executive functions of an organism. However, the relationships between chemical heterogeneity, cell function, and phenotype are not always understood. Recent advancements in matrix-assisted laser desorption/ionization mass spectrometry have enabled the high-throughput, multiplexed chemical analysis of single cells, capable of resolving hundreds of molecules in each mass spectrum. We developed a machine learning workflow to classify single cells according to their mass spectra based on cell groups of interest (GOI), e.g., neurons vs astrocytes. Three data sets from various cell groups were acquired on three different mass spectrometer platforms representing thousands of individual cell spectra that were collected and used to validate the single cell classification workflow. The trained models achieved >80% classification accuracy and were subjected to the recently developed instance-based model interpretation framework, SHapley Additive exPlanations (SHAP), which locally assigns feature importance for each single-cell spectrum. SHAP values were used for both local and global interpretations of our data sets, preserving the chemical heterogeneity uncovered by the single-cell analysis while offering the ability to perform supervised analysis. The top contributing mass features to each of the GOI were ranked and selected using mean absolute SHAP values, highlighting the features that are specific to the defined GOI. Our approach provides insight into discriminating the chemical profiles of the single cells through interpretable machine learning, facilitating downstream analysis and validation.

| S-EPMC7374983 | biostudies-literature

Batch alignment of single-cell transcriptomics data using deep metric learning.

Project description:scRNA-seq has uncovered previously unappreciated levels of heterogeneity. With the increasing scale of scRNA-seq studies, the major challenge is correcting batch effects, which are inevitable in human studies, and accurately detecting the number of cell types. The majority of scRNA-seq algorithms have been specifically designed to remove batch effects first and then conduct clustering, which may miss some rare cell types. Here we develop scDML, a deep metric learning model to remove batch effects in scRNA-seq data, guided by the initial clusters and the nearest-neighbor information within and between batches. Comprehensive evaluations spanning different species and tissues demonstrated that scDML can remove batch effects, improve clustering performance, accurately recover true cell types, and consistently outperform popular methods such as Seurat 3, scVI, Scanorama, BBKNN, and Harmony. Most importantly, scDML preserves subtle cell types in raw data and enables discovery of new cell subtypes that are hard to extract by analyzing each batch individually. We also show that scDML is scalable to large datasets with lower peak memory usage, and we believe that scDML offers a valuable tool to study complex cellular heterogeneity.

| S-EPMC9944958 | biostudies-literature

Using interpretable machine learning to extend heterogeneous antibody-virus datasets.

Project description:A central challenge in biology is to use existing measurements to predict the outcomes of future experiments. For the rapidly evolving influenza virus, variants examined in one study will often have little to no overlap with other studies, making it difficult to discern patterns or unify datasets. We develop a computational framework that predicts how an antibody or serum would inhibit any variant from any other study. We validate this method using hemagglutination inhibition data from seven studies and predict 2,000,000 new values ± uncertainties. Our analysis quantifies the transferability between vaccination and infection studies in humans and ferrets, shows that serum potency is negatively correlated with breadth, and provides a tool for pandemic preparedness. In essence, this approach enables a shift in perspective when analyzing data from "what you see is what you get" into "what anyone sees is what everyone gets."

| S-EPMC10475791 | biostudies-literature

Learning interpretable cellular and gene signature embeddings from single-cell transcriptomic data.

Project description:The advent of single-cell RNA sequencing (scRNA-seq) technologies has revolutionized transcriptomic studies. However, large-scale integrative analysis of scRNA-seq data remains a challenge largely due to unwanted batch effects and the limited transferability, interpretability, and scalability of the existing computational methods. We present single-cell Embedded Topic Model (scETM). Our key contribution is the utilization of a transferable neural-network-based encoder while having an interpretable linear decoder via a matrix tri-factorization. In particular, scETM simultaneously learns an encoder network to infer cell type mixture and a set of highly interpretable gene embeddings, topic embeddings, and batch-effect linear intercepts from multiple scRNA-seq datasets. scETM is scalable to over 10^6 cells and confers remarkable cross-tissue and cross-species zero-shot transfer-learning performance. Using gene set enrichment analysis, we find that scETM-learned topics are enriched in biologically meaningful and disease-related pathways. Lastly, scETM enables the incorporation of known gene sets into the gene embeddings, thereby directly learning the associations between pathways and topics via the topic embeddings.

| S-EPMC8421403 | biostudies-literature

scMoMtF: An interpretable multitask learning framework for single-cell multi-omics data analysis.

Project description:With the rapid development of biotechnology, it is now possible to obtain single-cell multi-omics data from the same cell. However, how to integrate and analyze these single-cell multi-omics data remains a great challenge. Herein, we introduce an interpretable multitask framework (scMoMtF) for comprehensively analyzing single-cell multi-omics data. scMoMtF can simultaneously solve multiple key tasks in single-cell multi-omics analysis, including dimension reduction, cell classification, and data simulation. The experimental results show that scMoMtF outperforms current state-of-the-art algorithms on these tasks. In addition, scMoMtF is interpretable, allowing researchers to gain a reliable understanding of potential biological features and mechanisms in single-cell multi-omics data.

| S-EPMC11654984 | biostudies-literature

Machine learning enables interpretable discovery of innovative polymers for gas separation membranes.

Project description:Polymer membranes perform innumerable separations with far-reaching environmental implications. Despite decades of research, design of new membrane materials remains a largely Edisonian process. To address this shortcoming, we demonstrate a generalizable, accurate machine learning (ML) implementation for the discovery of innovative polymers with ideal performance. Specifically, multitask ML models are trained on experimental data to link polymer chemistry to gas permeabilities of He, H2, O2, N2, CO2, and CH4. We interpret the ML models and extract valuable insights into the contributions of different chemical moieties to permeability and selectivity. We then screen over 9 million hypothetical polymers and identify thousands that lie well above current performance upper bounds, including hundreds of never-before-seen ultrapermeable polymer membranes with O2 and CO2 permeability greater than 10^4 and 10^5 Barrers, respectively. High-fidelity molecular dynamics simulations confirm the ML-predicted gas permeabilities of the promising candidates, which suggests that many can be translated to reality.

| S-EPMC9299556 | biostudies-literature

Interpretable unsupervised learning enables accurate clustering with high-throughput imaging flow cytometry.

Project description:A primary challenge of high-throughput imaging flow cytometry (IFC) is to analyze the vast amount of imaging data, especially in applications where ground truth labels are unavailable or hard to obtain. We present an unsupervised deep embedding algorithm, the Deep Convolutional Autoencoder-based Clustering (DCAEC) model, to cluster label-free IFC images without any prior knowledge of input labels. The DCAEC model first encodes the input images into the latent representations and then clusters based on the latent representations. Using the DCAEC model, we achieve a balanced accuracy of 91.9% for human white blood cell (WBC) clustering and 97.9% for WBC/leukemia clustering using the 3D IFC images and 3D DCAEC model. Above all, although no human recognizable features can separate the clusters of cells with protein localization, we demonstrate the fused DCAEC model can achieve a cluster balanced accuracy of 85.3% from the label-free 2D transmission and 3D side scattering images. To reveal how the neural network recognizes features beyond human ability, we use the gradient-weighted class activation mapping method to discover the cluster-specific visual patterns automatically. Evaluation results show that the automatically identified salient image regions have strong cluster-specific visual patterns for different clusters, which we believe is a stride for the interpretable neural network for cell analysis with high-throughput IFCs.

| S-EPMC10667244 | biostudies-literature

Generic, network schema agnostic sparse tensor factorization for single-pass clustering of heterogeneous information networks.

Project description:Heterogeneous information networks (e.g. bibliographic networks and social media networks) that consist of multiple interconnected objects are ubiquitous. Clustering analysis is an effective method to understand the semantic information and interpretable structure of heterogeneous information networks, and it has attracted the attention of many researchers in recent years. However, most studies assume that heterogeneous information networks follow some simple schema, such as a bi-typed network or star network schema, and they can only cluster one type of object in the network at a time. In this paper, a novel clustering framework is proposed based on sparse tensor factorization for heterogeneous information networks, which can cluster multiple types of objects simultaneously in a single pass without any network schema information. The types of objects and the relations between them in the heterogeneous information networks are modeled as a sparse tensor. The clustering issue is modeled as an optimization problem, which is similar to the well-known Tucker decomposition. Then, an Alternating Least Squares (ALS) algorithm and a feasible initialization method are proposed to solve the optimization problem. Based on the tensor factorization, we simultaneously partition different types of objects into different clusters. The experimental results on both synthetic and real-world datasets have demonstrated that our proposed clustering framework, STFClus, can model heterogeneous information networks efficiently and can outperform state-of-the-art clustering algorithms as a generally applicable, network-schema-agnostic, single-pass clustering method for heterogeneous networks.

| S-EPMC5330508 | biostudies-literature

Single-cell nucleic acid profiling in droplets (SNAPD) enables high-throughput analysis of heterogeneous cell populations.

Project description:Experimental methods that capture the individual properties of single cells are revealing the key role of cell-to-cell variability in countless biological processes. These single-cell methods are becoming increasingly important across the life sciences in fields such as immunology, regenerative medicine and cancer biology. In addition to high-dimensional transcriptomic techniques such as single-cell RNA sequencing, there is a need for fast, simple and high-throughput assays to enumerate cell samples based on RNA biomarkers. In this work, we present single-cell nucleic acid profiling in droplets (SNAPD) to analyze sets of transcriptional markers in tens of thousands of single mammalian cells. Individual cells are encapsulated in aqueous droplets on a microfluidic chip and the RNA markers in each cell are amplified. Molecular logic circuits then integrate these amplicons to categorize cells based on the transcriptional markers and produce a detectable fluorescence output. SNAPD is capable of analyzing over 100,000 cells per hour and can be used to quantify distinct cell types within heterogeneous populations, detect rare cells at frequencies down to 0.1% and enrich specific cell types using microfluidic sorting. SNAPD provides a simple, rapid, low cost and scalable approach to study complex phenotypes in heterogeneous cell populations.

| S-EPMC8501953 | biostudies-literature
