Project description:We propose a Bayesian method based on Gaussian process regression modeling to solve the problem. We use it to perform time annotation of old \textit{Clostridium botulinum} microarray experiments by utilising recently collected RNA-Seq time series. Additionally, we apply the method to time-annotate RNA-Seq experiments and test its robustness to noise on synthetically generated data.
Project description:Recent advances in single-cell sequencing (scRNA-seq) technology, such as Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq), allow researchers to quantify cell surface protein abundance and RNA expression simultaneously at single-cell resolution. Although CITE-seq and similar technologies have quickly gained enormous popularity, novel methods for analyzing this new type of single-cell multi-omics data are still urgently needed. The limited number of available tools use a purely data-driven approach, which may undermine the biological importance of surface protein data. In this study, we developed SECANT, a biology-guided SEmi-supervised method for Clustering, classification, and ANnoTation of single-cell multi-omics. SECANT can be used to analyze CITE-seq data, or to jointly analyze CITE-seq and scRNA-seq data. The novelties of SECANT include 1) using confident cell type labels identified from surface protein data as guidance for cell clustering, 2) providing general annotation of confident cell types for each cell cluster, 3) fully utilizing cells with uncertain or missing cell type labels to increase performance, and 4) accurately predicting, for scRNA-seq data, the confident cell types identified from surface protein data. In addition, as a model-based approach, SECANT can quantify the uncertainty of its results, and our framework can be easily extended to handle other types of multi-omics data. We demonstrated the validity and advantages of SECANT via simulation studies and analysis of public and in-house real datasets. We believe this new method will greatly help researchers characterize novel cell types and make new biological discoveries using single-cell multi-omics data.
Project description:Single-cell analysis of the transcriptome deepens our understanding of an individual cell's contribution to its microenvironment. Using single-cell analysis to study complex biological processes requires state-of-the-art computational tools. Assessing similarity is central to bioinformatics algorithms that look for correlations in biological data. However, similarity can appear by chance, particularly for lowly expressed entities. This is especially relevant in single-cell RNA-seq (scRNA-seq), because the read counts obtained are lower than in bulk RNA-sequencing, and classic bioinformatics tools are therefore insufficient to obtain reproducible results. Recently, a Bayesian correlation scheme that assigns low correlation values to correlations coming from lowly expressed genes has been proposed to assess similarity for bulk RNA-seq and miRNA. This Bayesian method combines a prior distribution with empirical evidence. Our goal was to extend the properties of this Bayesian correlation scheme to scRNA-seq data. We assessed three ways to compute similarity. First, we computed the similarity of each pair of genes over all cells. Second, we identified specific cell populations and computed the correlation in those specific cells. Third, we computed the similarity of each pair of genes over all clusters, by including the total mRNA expression in those cells. To study the effect of the number of cells on the method, we did not rely on simulated data; instead, we generated four scRNA-seq mouse liver cell libraries with varying numbers of input cells. Results: We show that Bayesian correlations are more reproducible than Pearson correlations in all the scenarios studied. Compared to Pearson correlations, Bayesian correlations have a smaller dependence on the number of input cells. We demonstrate that the Bayesian correlation algorithm assigns high similarity values to genes with a biological relevance in a specific population.
Significance: Our results demonstrate that Bayesian correlation is a robust similarity measure for scRNA-seq datasets. The Bayesian method allows researchers to study similarity between pairs of genes without discarding lowly expressed entities, while minimizing the bias introduced by spurious correlations. Taken together, our Bayesian correlation method significantly increases the reproducibility of scRNA-seq experiments.
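The shrinking effect described in this entry can be illustrated with a minimal sketch. It is a simplified reading of the idea, not the authors' implementation: each gene's fraction of reads in a cell is given a Beta prior (the `alpha0` and `beta0` defaults are assumptions), and the posterior variance is added to the across-cell variance of the posterior means before normalising the covariance. The function names are hypothetical.

```python
from math import sqrt


def posterior_moments(count, total, alpha0, beta0):
    """Mean and variance of the posterior Beta(alpha0 + count, beta0 + total - count)."""
    a = alpha0 + count
    b = beta0 + total - count
    s = a + b
    return a / s, (a * b) / (s * s * (s + 1.0))


def bayesian_correlation(counts_g, counts_h, totals, alpha0=1.0, beta0=1.0):
    """Correlation of two genes across cells, damped by posterior uncertainty.

    counts_g, counts_h: read counts of each gene per cell.
    totals: total read count per cell.
    alpha0, beta0: assumed Beta prior parameters (illustrative defaults).
    """
    n = len(totals)
    mom_g = [posterior_moments(r, t, alpha0, beta0) for r, t in zip(counts_g, totals)]
    mom_h = [posterior_moments(r, t, alpha0, beta0) for r, t in zip(counts_h, totals)]
    mean_g = sum(m for m, _ in mom_g) / n
    mean_h = sum(m for m, _ in mom_h) / n
    # Covariance of the posterior means across cells.
    cov = sum((mg - mean_g) * (mh - mean_h)
              for (mg, _), (mh, _) in zip(mom_g, mom_h)) / n
    # Total variance = variance of posterior means + mean posterior variance.
    var_g = sum((m - mean_g) ** 2 for m, _ in mom_g) / n + sum(v for _, v in mom_g) / n
    var_h = sum((m - mean_h) ** 2 for m, _ in mom_h) / n + sum(v for _, v in mom_h) / n
    return cov / sqrt(var_g * var_h)


def pearson(x, y):
    """Plain Pearson correlation, for comparison."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


totals = [10000, 10000, 10000, 10000]
low = [0, 1, 0, 1]           # two genes with identical, very low counts
high = [100, 200, 100, 200]  # two genes with identical, high counts
low_pearson = pearson(low, low)
low_bayes = bayesian_correlation(low, low, totals)
high_bayes = bayesian_correlation(high, high, totals)
```

Under these assumptions, two genes with identical but very low counts are perfectly correlated according to Pearson, whereas the Bayesian value is pulled toward zero because the posterior variance dominates. The same pattern at high counts stays close to one.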
Project description:We developed Solo, a semi-supervised deep learning framework for identifying doublets in scRNA-seq analysis. To validate our method, we used MULTI-seq, which relies on cholesterol-modified oligos (CMOs), to experimentally identify doublets in mouse kidney, a solid tissue with diverse cell types, and showed that Solo recapitulated the experimentally identified doublets.
Project description:Prokaryotic genome annotation is highly dependent on automated methods, as manual curation cannot keep up with the exponential growth of sequenced genomes. Current automated techniques depend heavily on sequence context and often underestimate the complexity of the proteome. We developed REPARATION (RibosomeE Profiling Assisted (Re-)AnnotaTION), a de novo algorithm that takes advantage of experimental evidence from ribosome profiling (Ribo-seq) to delineate translated open reading frames (ORFs) in bacteria, independent of genome annotation. Ribo-seq is a next-generation sequencing technique that provides a genome-wide snapshot of the positions of translating ribosomes along an mRNA at the time of the experiment. REPARATION evaluates all possible ORFs in the genome and estimates minimum thresholds to screen for spurious ORFs based on a growth curve model. We applied REPARATION to three annotated bacterial species to obtain a more comprehensive mapping of their translation landscape in support of experimental data. In all cases, we identified hundreds of novel ORFs, including variants of previously annotated ORFs and novel small ORFs (<71 codons). Our predictions were supported by matching mass spectrometry (MS) proteomics data and sequence conservation analysis. REPARATION is unique in that it makes use of experimental Ribo-seq data to perform de novo ORF delineation in bacterial genomes, and thus can identify putative coding ORFs irrespective of the sequence context of the reading frame.
Project description:The delineation of genes in bacteria has remained an important challenge because prokaryotic genomes are often tightly packed, frequently resulting in overlapping genes. We hereby present a de novo approach called REPARATION (RibosomeE Profiling Assisted (Re-)AnnotaTION) to delineate translated open reading frames (ORFs) in bacteria independent of (available) genome annotation. REPARATION takes advantage of the recently developed ribosome profiling (Ribo-seq) technique, which maps translating ribosomes across the entire genome by deep sequencing of ribosome-protected mRNA fragments (RPFs). REPARATION starts by traversing the entire genome to generate all possible ORFs and then collects their corresponding RPF signal information. Thresholds indicative of translation, namely minimum ORF read density and Ribo-seq RPF coverage, are estimated based on a growth curve model. Finally, our algorithm applies a random forest model to build a classifier that identifies putative protein-coding ORFs. We evaluated the performance of REPARATION on three annotated bacterial species using in-house generated Ribo-seq data and matching N-terminal and shotgun proteomics data, alongside publicly available Ribo-seq data. In all cases, about 80% of the ORFs predicted by REPARATION were previously annotated as protein coding, while 13-20% were variants of previously annotated ORFs and about 3-4% pointed to novel translated ORFs within intergenic or other regions previously annotated as non-coding. Without stringent length restrictions, REPARATION was able to identify several small ORFs (sORFs). Multiple lines of supporting evidence from matching MS data and sequence conservation analysis were obtained to validate the predicted ORFs.
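The first step described above, traversing the genome to generate all possible ORFs, can be sketched as follows. This is an illustrative sketch only, not REPARATION's code: the function names are hypothetical, and the start codon set (ATG, GTG, TTG) and the `min_codons` cut-off are assumptions chosen for the example. Every in-frame start codon upstream of a stop codon yields its own candidate, which is how overlapping variants of one ORF arise.

```python
# Assumed codon sets for illustration; the tool's actual settings may differ.
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def enumerate_orfs(genome, min_codons=1):
    """Yield (strand, start, end, sequence) for every start-to-stop ORF.

    Coordinates are 0-based and end-exclusive on the scanned strand
    (for '-' they index into the reverse complement). The stop codon is
    included in the sequence but not counted in min_codons.
    """
    genome = genome.upper()
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            open_starts = []  # start positions seen since the last stop codon
            for pos in range(frame, len(seq) - 2, 3):
                codon = seq[pos:pos + 3]
                if codon in START_CODONS:
                    open_starts.append(pos)
                elif codon in STOP_CODONS:
                    for start in open_starts:
                        if (pos - start) // 3 >= min_codons:
                            yield strand, start, pos + 3, seq[start:pos + 3]
                    open_starts = []


toy_genome = "CCATGAAAATGCCCTAAGG"
all_orfs = list(enumerate_orfs(toy_genome))
long_orfs = list(enumerate_orfs(toy_genome, min_codons=3))
```

In the toy sequence, the two in-frame ATG codons share one stop codon, so two nested candidates are produced on the forward strand. In a full pipeline, each candidate would then be paired with its RPF signal before thresholding and classification.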
Project description:In the current study, we present iSMNN, a supervised batch effect correction method for scRNA-seq data that uses multiple iterations of mutual nearest neighbor refinement. To validate the performance of iSMNN, we performed single-cell RNA-seq on adult murine heart using the 10X Chromium platform.
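For readers unfamiliar with the term, the sketch below shows how mutual nearest neighbor pairs between two batches can be identified. It covers only the basic pairing step; the supervision and iterative refinement that iSMNN adds are not shown. The function names, the use of Euclidean distance, and the default `k` are assumptions for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two expression vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_nearest(query, reference, k):
    """For each query cell, return the index set of its k nearest reference cells."""
    neighbours = []
    for q in query:
        order = sorted(range(len(reference)),
                       key=lambda j: squared_distance(q, reference[j]))
        neighbours.append(set(order[:k]))
    return neighbours


def mutual_nearest_neighbours(batch_a, batch_b, k=1):
    """Return (i, j) pairs where cell i of batch_a and cell j of batch_b
    each appear in the other's k nearest neighbours across batches."""
    a_to_b = k_nearest(batch_a, batch_b, k)
    b_to_a = k_nearest(batch_b, batch_a, k)
    return [(i, j)
            for i in range(len(batch_a))
            for j in sorted(a_to_b[i])
            if i in b_to_a[j]]


# Toy data: the third cell of batch_a has no counterpart in batch_b.
batch_a = [(0.0, 0.0), (5.0, 5.0), (20.0, 20.0)]
batch_b = [(0.5, 0.5), (5.5, 5.5)]
pairs = mutual_nearest_neighbours(batch_a, batch_b, k=1)
```

Requiring the neighbor relation to hold in both directions is what excludes the unmatched cell: its nearest neighbor in the other batch exists, but that neighbor is closer to a different cell.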