Project description: Single-cell data integration can provide a comprehensive molecular view of cells. However, how to integrate heterogeneous single-cell multi-omics as well as spatially resolved transcriptomic data remains a major challenge. Here we introduce uniPort, a unified single-cell data-integration framework that combines a coupled variational autoencoder (coupled-VAE) and minibatch unbalanced optimal transport (Minibatch-UOT). It leverages both highly variable common and dataset-specific genes to handle heterogeneity across datasets, and it is scalable to large datasets. uniPort jointly embeds heterogeneous single-cell multi-omics datasets into a shared latent space. It can further construct a reference atlas for gene imputation across datasets. Meanwhile, uniPort provides a flexible label-transfer framework that deconvolutes heterogeneous spatial transcriptomic data using an optimal transport plan rather than the embedded latent space. We demonstrate the capability of uniPort by applying it to integrate a variety of datasets, including single-cell transcriptomics, chromatin accessibility, and spatially resolved transcriptomic data.
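The optimal-transport-plan idea in the uniPort description above can be illustrated with a minimal sketch. This is not uniPort's actual Minibatch-UOT solver: it is plain balanced entropic OT solved by Sinkhorn iterations on a toy cost matrix, with all names and numbers invented for illustration.

```python
import math

def sinkhorn(cost, a, b, eps=0.5, n_iters=500):
    """Entropic-regularized balanced OT via Sinkhorn iterations.

    cost : list of lists, cost[i][j] between reference cell i and spot j
    a, b : source/target marginal weights (each summing to 1)
    Returns the transport plan as a list of lists.
    """
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy example: two reference cells, three spatial spots.
cost = [[0.0, 1.0, 2.0],
        [2.0, 1.0, 0.0]]
a = [0.5, 0.5]
b = [1 / 3, 1 / 3, 1 / 3]
plan = sinkhorn(cost, a, b)
```

For label transfer, each spot's column of the plan, normalized to sum to one, gives a weighting over reference cells whose labels can then be aggregated into label proportions for that spot.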
Project description: Batch effects in single-cell RNA-seq data pose a significant challenge for comparative analyses across samples, individuals, and conditions. Although batch effect correction methods are routinely applied, data integration often leads to overcorrection and can result in the loss of biological variability. In this work we present STACAS, a batch correction method for scRNA-seq that leverages prior knowledge on cell types to preserve biological variability upon integration. Through an open-source benchmark, we show that semi-supervised STACAS outperforms state-of-the-art unsupervised methods, as well as supervised methods such as scANVI and scGen. STACAS scales well to large datasets and is robust to incomplete and imprecise input cell type labels, which are commonly encountered in real-life integration tasks. We argue that the incorporation of prior cell type information should be a common practice in single-cell data integration, and we provide a flexible framework for semi-supervised batch effect correction.
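One way semi-supervised integration can use prior labels, sketched in miniature below. This is not STACAS's actual algorithm (which operates on integration anchors with scores and distances); it only illustrates the core idea of discarding candidate cell pairs whose known cell-type labels disagree, while tolerating unlabeled cells. All names and data are hypothetical.

```python
def filter_anchors(anchors, labels_a, labels_b):
    """Keep only anchor pairs whose cell-type labels agree.

    anchors  : list of (i, j) index pairs linking dataset A to dataset B
    labels_a : labels for dataset A cells (None = unlabeled, always kept)
    labels_b : labels for dataset B cells
    """
    kept = []
    for i, j in anchors:
        la, lb = labels_a[i], labels_b[j]
        if la is None or lb is None or la == lb:
            kept.append((i, j))
    return kept

anchors = [(0, 0), (0, 1), (1, 2)]
labels_a = ["T cell", "B cell"]
labels_b = ["T cell", "B cell", None]
# (0, 1) links a T cell to a B cell and is discarded;
# (1, 2) involves an unlabeled cell and is kept.
filtered = filter_anchors(anchors, labels_a, labels_b)
```

Tolerating `None` labels is what makes such a scheme robust to incomplete annotations, as the abstract emphasizes.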
Project description: Polyploidization drives regulatory and phenotypic innovation. How the merger of different genomes contributes to polyploid development is a fundamental question in evolutionary developmental biology and breeding research. Clarifying this question is challenging because of genome complexity and the difficulty of tracking stochastic subgenome divergence during development. Recent single-cell sequencing techniques have enabled probing subgenome-divergent regulation in the context of cellular differentiation. However, single-cell data analysis suffers from high error rates due to high dimensionality, noise, and sparsity, and these errors compound in polyploid analysis because comparisons between the subgenomes of each cell further increase dimensionality, hindering deeper mechanistic understanding. Here, we developed a quantitative computational framework, pseudo-genome divergence quantification (pgDQ), for quantifying and tracking subgenome divergence directly at the cellular level. Comparison with cellular differentiation trajectories derived from scRNA-seq data then allows an examination of the relationship between subgenome divergence and the progression of development. pgDQ produces robust results and is insensitive to data dropout and noise, avoiding the high error rates caused by multiple comparisons across genes, cells, and subgenomes. A statistical diagnostic approach is proposed to identify genes that are central to subgenome divergence during development, which facilitates the integration of different data modalities and enables the identification of factors and pathways that mediate subgenome-divergent activity during development. Case studies demonstrate that applying pgDQ to single-cell and bulk-tissue transcriptome data promotes a systematic and deeper understanding of how dynamic subgenome divergence contributes to developmental trajectories in polyploid evolution.
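The abstract does not specify pgDQ's formula, so the following is only a heavily hedged toy of the general idea: scoring subgenome divergence per cell from homoeologous gene pairs, while skipping dropout pairs so zeros in both copies do not inflate the score. The metric (mean absolute relative expression difference) and all data are assumptions for illustration, not pgDQ itself.

```python
def cell_divergence(expr_a, expr_b):
    """Toy per-cell subgenome divergence: mean absolute relative expression
    difference over homoeologous gene pairs; pairs with zero total expression
    (dropout in both copies) are skipped rather than counted as agreement.

    expr_a, expr_b : expression of the two subgenome copies, aligned by pair.
    """
    diffs = []
    for ea, eb in zip(expr_a, expr_b):
        total = ea + eb
        if total > 0:
            diffs.append(abs(ea - eb) / total)
    return sum(diffs) / len(diffs) if diffs else 0.0

# One cell, three homoeolog pairs: balanced, dropout, and fully biased.
score = cell_divergence([10.0, 0.0, 4.0], [10.0, 0.0, 0.0])
```

A per-cell score like this can then be plotted along a differentiation trajectory to relate divergence to developmental progression, as the abstract describes.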
Project description: Single-cell transcriptomics datasets from the same anatomical sites generated by different research labs are becoming increasingly common. However, fast and computationally inexpensive tools for standardizing cell-type annotation and integrating data are still needed in order to increase research inclusivity. To standardize cell-type annotation and integrate single-cell transcriptomics datasets, we have built a fast, model-free integration method named MASI (Marker-Assisted Standardization and Integration). MASI first identifies putative cell-type markers from reference data through an ensemble approach. It then converts the gene expression matrix into a cell-type score matrix with the identified putative cell-type markers for the purposes of cell-type annotation and data integration. Because it integrates through cell-type markers instead of model inference, MASI can annotate approximately one million cells on a personal laptop, providing an inexpensive computational alternative for the single-cell community. We benchmark MASI against other well-established methods and demonstrate that MASI outperforms them in speed, while its performance on both data integration and cell-type annotation is comparable or even superior. To harness knowledge from single-cell atlases, we present three case studies that cover integration across biological conditions, surveyed participants, and research groups, respectively.
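The expression-matrix-to-score-matrix conversion described above can be sketched very simply. This is a minimal stand-in, not MASI's implementation: it scores each cell for each cell type as the mean expression of that type's marker genes, with invented marker lists and toy data.

```python
def score_matrix(expr, genes, markers):
    """Convert a cells x genes expression matrix into a cells x cell-types
    score matrix by averaging each cell's expression of a type's markers.

    expr    : list of per-cell expression vectors
    genes   : gene names matching the columns of expr
    markers : dict mapping cell type -> list of marker gene names
    """
    idx = {g: k for k, g in enumerate(genes)}
    types = sorted(markers)
    scores = []
    for cell in expr:
        row = []
        for t in types:
            cols = [idx[g] for g in markers[t] if g in idx]
            row.append(sum(cell[k] for k in cols) / len(cols) if cols else 0.0)
        scores.append(row)
    return types, scores

genes = ["CD3D", "CD3E", "MS4A1", "CD79A"]
markers = {"T cell": ["CD3D", "CD3E"], "B cell": ["MS4A1", "CD79A"]}
expr = [[5.0, 4.0, 0.0, 0.0],   # expresses T-cell markers
        [0.0, 0.0, 6.0, 3.0]]   # expresses B-cell markers
types, scores = score_matrix(expr, genes, markers)
labels = [types[row.index(max(row))] for row in scores]
```

Because the score matrix has one column per cell type rather than one per gene, downstream annotation and integration operate in a small, shared feature space, which is why a marker-based scheme can scale to roughly a million cells on a laptop.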
Project description: Motivation: Single-cell RNA sequencing (scRNA-seq) technologies enable the study of transcriptional heterogeneity at the resolution of individual cells and have an increasing impact on biomedical research. However, it is known that these methods sometimes wrongly consider two or more cells as single cells, and that a number of so-called doublets are present in the output of such experiments. Treating doublets as single cells in downstream analyses can severely bias a study's conclusions, and therefore computational strategies for the identification of doublets are needed. Results: With scds, we propose two new approaches for in silico doublet identification: co-expression-based doublet scoring (cxds) and binary-classification-based doublet scoring (bcds). The co-expression-based approach, cxds, utilizes binarized (absence/presence) gene expression data and, employing a binomial model for the co-expression of pairs of genes, yields interpretable doublet annotations. bcds, on the other hand, uses a binary classification approach to discriminate artificial doublets from the original data. We apply our methods and existing computational doublet identification approaches to four datasets with experimental doublet annotations and find that our methods perform at least as well as the state of the art, at comparably little computational cost. We observe appreciable differences between methods and across datasets, and no approach dominates all others. In summary, scds presents a scalable, competitive approach that allows for doublet annotation of datasets with thousands of cells in a matter of seconds. Availability and implementation: scds is implemented as a Bioconductor R package (doi: 10.18129/B9.bioc.scds). Supplementary information: Supplementary data are available at Bioinformatics online.
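The cxds intuition (doublets co-express gene pairs that are otherwise mutually exclusive) can be sketched as follows. This is a simplified stand-in rather than the Bioconductor implementation: instead of cxds's binomial model it weights pairs by how far their observed co-occurrence falls below the independence expectation, on invented 0/1 data.

```python
from itertools import combinations

def cxds_like_scores(binary):
    """Toy co-expression doublet score: gene pairs that co-occur less often
    than expected under independence get positive weights; each cell's score
    sums the weights of such pairs it nevertheless co-expresses.

    binary : cells x genes matrix of 0/1 presence calls
    """
    n = len(binary)
    m = len(binary[0])
    freq = [sum(cell[g] for cell in binary) / n for g in range(m)]
    weights = {}
    for i, j in combinations(range(m), 2):
        observed = sum(c[i] * c[j] for c in binary) / n
        expected = freq[i] * freq[j]
        if observed < expected:          # anti-correlated ("exclusive") pair
            weights[(i, j)] = expected - observed
    return [sum(w for (i, j), w in weights.items() if c[i] and c[j])
            for c in binary]

# Genes 0 and 1 are mutually exclusive except in the last (doublet-like) cell.
binary = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1]]
scores = cxds_like_scores(binary)
```

Binarizing first, as cxds does, keeps the computation cheap and largely insensitive to depth differences between cells, which fits the "thousands of cells in seconds" claim.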
Project description: As the number of single-cell transcriptomics datasets grows, the natural next step is to integrate the accumulating data to achieve a common ontology of cell types and states. However, it is not straightforward to compare gene expression levels across datasets and to automatically assign cell type labels in a new dataset based on existing annotations. In this manuscript, we demonstrate that our previously developed method, scVI, provides an effective and fully probabilistic approach for joint representation and analysis of scRNA-seq data, while accounting for uncertainty caused by biological and measurement noise. We also introduce single-cell ANnotation using Variational Inference (scANVI), a semi-supervised variant of scVI designed to leverage existing cell state annotations. We demonstrate that scVI and scANVI compare favorably to state-of-the-art methods for data integration and cell state annotation in terms of accuracy, scalability, and adaptability to challenging settings. In contrast to existing methods, scVI and scANVI integrate multiple datasets with a single generative model that can be directly used for downstream tasks, such as differential expression. Both methods are easily accessible through scvi-tools.
Project description: Integrating single-cell omics and single-cell imaging allows for a more effective characterisation of the mechanisms that drive a phenotype at the tissue level, creating a comprehensive profile at the cellular level. Although the use of imaging data is well established in biomedical research, its primary application has been to observe phenotypes at the tissue or organ level, often using medical imaging techniques such as MRI, CT, and PET. These imaging technologies complement omics-based data in biomedical research because they help identify associations between genotype and phenotype, along with functional changes occurring at the tissue level. Single-cell imaging can act as an intermediary between these levels. Meanwhile, new technologies continue to arrive that can be used to interrogate the genome of single cells and its related omics datasets. As these two areas, single-cell imaging and single-cell omics, each advance independently through the development of novel techniques, the opportunity to integrate these data types becomes increasingly attractive. This review outlines some of the technologies and methods currently available for generating, processing, and analysing single-cell omics and imaging data, and how they could be integrated to further our understanding of complex biological phenomena such as ageing. We place an emphasis on machine learning algorithms because of their ability to identify complex patterns in large multidimensional data.
Project description: Background: Integrating data from different sources is a recurring question in computational biology. Much effort has been devoted to the integration of data sets of the same type, typically multiple numerical data tables. However, data types are generally heterogeneous: it is commonplace to gather data in the form of trees, networks, or factorial maps, as these representations all have an appealing visual interpretation that helps in studying grouping patterns and interactions between entities. The question we aim to answer in this paper is that of the integration of such representations. Results: To this end, we provide a simple procedure for comparing data of various types, in particular trees or networks, that relies essentially on two steps: the first step projects the representations into a common coordinate system; the second step then uses a multi-table integration approach to compare the projected data. We rely on efficient and well-known methodologies for each step: the projection step is achieved by retrieving a distance matrix for each representation and then applying multidimensional scaling to provide a new set of coordinates from all the pairwise distances. The integration step is then achieved by applying a multiple factor analysis to the multiple tables of new coordinates. This procedure provides tools to integrate and compare data available, for instance, as tree or network structures. Our approach is complementary to kernel methods, traditionally used to answer the same question. Conclusion: Our approach is evaluated on simulations and used to analyze two real-world data sets: first, we compare several clusterings of different cell types obtained from a single-cell transcriptomics data set in mouse embryos; second, we use our procedure to aggregate a multi-table data set from the TCGA breast cancer database, in order to compare several protein networks inferred for different breast cancer subtypes.
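The projection step above (distance matrix to coordinates via multidimensional scaling) can be sketched with classical (Torgerson) MDS, which the abstract's workflow permits though it does not name a specific MDS variant; NumPy is assumed for the eigendecomposition.

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (Torgerson) MDS: recover k-dimensional coordinates whose
    pairwise Euclidean distances approximate the input distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]       # keep the k largest
    L = np.clip(vals[order], 0, None)        # drop negative (non-Euclidean) parts
    return vecs[:, order] * np.sqrt(L)

# Distances between three points that lie on a line at 0, 3, and 5.
D = np.array([[0.0, 3.0, 5.0],
              [3.0, 0.0, 2.0],
              [5.0, 2.0, 0.0]])
X = classical_mds(D, k=1)
# The 1-D embedding reproduces D up to sign and translation.
```

Once each tree or network has been turned into such a coordinate table, the tables share a common row space (the entities) and can be handed to multiple factor analysis for the integration step.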
Project description: Background: Although electronic health records (EHRs) have been widely used in secondary assessments, clinical documents are relatively less utilized owing to the lack of standardized clinical text frameworks across different institutions. Objective: This study aimed to develop a framework for processing unstructured clinical documents of EHRs and integrating them with standardized structured data. Methods: We developed a framework known as Staged Optimization of Curation, Regularization, and Annotation of clinical text (SOCRATex). SOCRATex has the following four aspects: (1) extracting clinical notes for the target population and preprocessing the data, (2) defining the annotation schema with a hierarchical structure, (3) performing document-level hierarchical annotation using the annotation schema, and (4) indexing annotations for a search engine system. To test the usability of the proposed framework, proof-of-concept studies were performed on EHRs. We defined three distinctive patient groups and extracted their clinical documents (ie, pathology reports, radiology reports, and admission notes). The documents were annotated and integrated into the Observational Medical Outcomes Partnership (OMOP) common data model (CDM) database. The annotations were used for creating Cox proportional hazards models with different settings of clinical analyses to measure (1) all-cause mortality, (2) thyroid cancer recurrence, and (3) 30-day hospital readmission. Results: Overall, 1055 clinical documents of 953 patients were extracted and annotated using the defined annotation schemas. The generated annotations were indexed into an unstructured textual data repository. Using the annotations of pathology reports, we identified that node metastasis and lymphovascular tumor invasion were associated with all-cause mortality among colon and rectum cancer patients (both P=.02).
The other analyses, measuring thyroid cancer recurrence from radiology reports and 30-day hospital readmission from admission notes in patients with depressive disorder, also showed results consistent with previous findings. Conclusions: We propose a framework for hierarchical annotation of textual data and its integration into a standardized OMOP-CDM medical database. The proof-of-concept studies demonstrated that our framework can effectively process and integrate diverse clinical documents with standardized structured data for clinical research.
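Steps (3) and (4) of the SOCRATex workflow, hierarchical annotation and search indexing, can be sketched as a flatten-then-invert index. The annotation fields below (`tumor`, `node_metastasis`, `lvi`) are hypothetical examples, not the actual SOCRATex schema, and the index is a toy stand-in for a real search engine.

```python
def flatten(ann, prefix=""):
    """Flatten a hierarchical annotation into 'path=value' search terms."""
    terms = []
    for key, val in ann.items():
        path = f"{prefix}{key}"
        if isinstance(val, dict):
            terms.extend(flatten(val, path + "."))
        else:
            terms.append(f"{path}={val}")
    return terms

def build_index(docs):
    """Inverted index: annotation term -> sorted list of document IDs."""
    index = {}
    for doc_id, ann in docs.items():
        for term in flatten(ann):
            index.setdefault(term, []).append(doc_id)
    return {t: sorted(ids) for t, ids in index.items()}

# Hypothetical pathology-report annotations.
docs = {
    "doc1": {"tumor": {"node_metastasis": "present", "lvi": "present"}},
    "doc2": {"tumor": {"node_metastasis": "absent", "lvi": "present"}},
}
index = build_index(docs)
hits = index["tumor.node_metastasis=present"]
```

Document IDs returned by such queries can then be joined against structured OMOP-CDM tables, which is the integration pattern the proof-of-concept studies rely on.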