Project description:RNA-sequencing (RNA-seq) is a flexible technology for measuring genome-wide expression that is rapidly replacing microarrays as costs become comparable. Current differential expression analysis methods for RNA-seq data fall into two broad classes: (1) methods that quantify expression within the boundaries of genes previously published in databases and (2) methods that attempt to reconstruct full length RNA transcripts. The first class cannot discover differential expression outside of previously known genes. While the second approach does possess discovery capabilities, statistical analysis of differential expression is complicated by the ambiguity and variability incurred while assembling transcripts and estimating their abundances. Here, we propose a novel method that first identifies differentially expressed regions (DERs) of interest by assessing differential expression at each base of the genome. The method then segments the genome into regions comprised of bases showing similar differential expression signal, and then assigns a measure of statistical significance to each region. Optionally, DERs can be annotated using a reference database of genomic features. We compare our approach with leading competitors from both current classes of differential expression methods and highlight the strengths and weaknesses of each. A software implementation of our method is available on github (https://github.com/alyssafrazee/derfinder).
Project description:RNA-Seq is increasingly being used for gene expression profiling. In this approach, next-generation sequencing (NGS) platforms are used for sequencing. Due to highly parallel nature, millions of reads are generated in a short time and at low cost. Therefore analysis of the data is a major challenge and development of statistical and computational methods is essential for drawing meaningful conclusions from this huge data. In here, we assessed three different types of normalization (transcript parts per million, trimmed mean of M values, quantile normalization) and evaluated if normalized data reduces technical variability across replicates. In addition, we also proposed two novel methods for detecting differentially expressed genes between two biological conditions: (i) likelihood ratio method, and (ii) Bayesian method. Our proposed methods for finding differentially expressed genes were tested on three real datasets. Our methods performed at least as well as, and often better than, the existing methods for analysis of differential expression.
Project description:We consider an increasingly popular study design where single-cell RNA-seq data are collected from multiple individuals and the question of interest is to find genes that are differentially expressed between two groups of individuals. Towards this end, we propose a statistical method named IDEAS (individual level differential expression analysis for scRNA-seq). For each gene, IDEAS summarizes its expression in each individual by a distribution and then assesses whether these individual-specific distributions are different between two groups of individuals. We apply IDEAS to assess gene expression differences of autism patients versus controls and COVID-19 patients with mild versus severe symptoms.
Project description:BackgroundRNA-Seq technology measures the transcript abundance by generating sequence reads and counting their frequencies across different biological conditions. To identify differentially expressed genes between two conditions, it is important to consider the experimental design as well as the distributional property of the data. In many RNA-Seq studies, the expression data are obtained as multiple pairs, e.g., pre- vs. post-treatment samples from the same individual. We seek to incorporate paired structure into analysis.ResultsWe present a Bayesian hierarchical mixture model for RNA-Seq data to separately account for the variability within and between individuals from a paired data structure. The method assumes a Poisson distribution for the data mixed with a gamma distribution to account variability between pairs. The effect of differential expression is modeled by two-component mixture model. The performance of this approach is examined by simulated and real data.ConclusionsIn this setting, our proposed model provides higher sensitivity than existing methods to detect differential expression. Application to real RNA-Seq data demonstrates the usefulness of this method for detecting expression alteration for genes with low average expression levels or shorter transcript length.
Project description:We propose eight data transformations (r, r2, rv, rv2, l, l2, lv, and lv2) for RNA-seq data analysis aiming to make the transformed sample mean to be representative of the distribution center since it is not always possible to transform count data to satisfy the normality assumption. Simulation studies showed that for data sets with small (e.g., nCases = nControls = 3) or large sample size (e.g., nCases = nControls = 100) limma based on data from the l, l2, and r2 transformations performed better than limma based on data from the voom transformation in term of accuracy, FDR, and FNR. For datasets with moderate sample size (e.g., nCases = nControls = 30 or 50), limma with the rv and rv2 transformations performed similarly to limma with the voom transformation. Real data analysis results are consistent with simulation analysis results: limma with the r, l, r2, and l2 transformation performed better than limma with the voom transformation when sample sizes are small or large; limma with the rv and rv2 transformations performed similarly to limma with the voom transformation when sample sizes are moderate. We also observed from our data analyses that for datasets with large sample size, the gene-selection via the Wilcoxon rank sum test (a non-parametric two sample test method) based on the raw data outperformed limma based on the transformed data.
Project description:The performance of computational methods and software to identify differentially expressed features in single-cell RNA-sequencing (scRNA-seq) has been shown to be influenced by several factors, including the choice of the normalization method used and the choice of the experimental platform (or library preparation protocol) to profile gene expression in individual cells. Currently, it is up to the practitioner to choose the most appropriate differential expression (DE) method out of over 100 DE tools available to date, each relying on their own assumptions to model scRNA-seq expression features. To model the technological variability in cross-platform scRNA-seq data, here we propose to use Tweedie generalized linear models that can flexibly capture a large dynamic range of observed scRNA-seq expression profiles across experimental platforms induced by platform- and gene-specific statistical properties such as heavy tails, sparsity, and gene expression distributions. We also propose a zero-inflated Tweedie model that allows zero probability mass to exceed a traditional Tweedie distribution to model zero-inflated scRNA-seq data with excessive zero counts. Using both synthetic and published plate- and droplet-based scRNA-seq datasets, we perform a systematic benchmark evaluation of more than 10 representative DE methods and demonstrate that our method (Tweedieverse) outperforms the state-of-the-art DE approaches across experimental platforms in terms of statistical power and false discovery rate control. Our open-source software (R/Bioconductor package) is available at https://github.com/himelmallick/Tweedieverse.
Project description:Staphylococcus aureus is a gram-positive cocci and an important human commensal bacteria and pathogen. S. aureus infections are increasingly difficult to treat because of the emergence of highly resistant MRSA (methicillin-resistant S. aureus) strains. Here we present a method to study differential gene expression in S. aureus using high-throughput RNA-sequencing (RNA-seq). We used RNA-seq to examine gene expression in S. aureus RN4220 cells containing an exogenously expressed transcription factor and between two S. aureus strains (RN4220 and NCTC8325-4). We investigated the sequence and gene expression differences between RN4220 and NCTC8325-4 and used the RNA-seq data to identify S. aureus promoters suitable for in vitro analysis. We used RNA-seq to describe, on a genome wide scale, genes positively and negatively regulated by the phage encoded transcription factor gp67. RNA-seq offers the ability to study differential gene expression with single-nucleotide resolution, and is a considerable improvement over the predominant genome-wide transcriptome technologies used in S. aureus.
Project description:Single-cell RNA sequencing (scRNA-seq) is a powerful technology that is capable of generating gene expression data at the resolution of individual cell. The scRNA-seq data is characterized by the presence of dropout events, which severely bias the results if they remain unaddressed. There are limited Differential Expression (DE) approaches which consider the biological processes, which lead to dropout events, in the modeling process. So, we develop, SwarnSeq, an improved method for DE, and other downstream analysis that considers the molecular capture process in scRNA-seq data modeling. The performance of the proposed method is benchmarked with 11 existing methods on 10 different real scRNA-seq datasets under three comparison settings. We demonstrate that SwarnSeq method has improved performance over the 11 existing methods. This improvement is consistently observed across several public scRNA-seq datasets generated using different scRNA-seq protocols. The external spike-ins data can be used in the SwarnSeq method to enhance its performance. AVAILABILITY AND IMPLEMENTATION: The method is implemented as a publicly available R package available at https://github.com/sam-uofl/SwarnSeq.
Project description:BackgroundThe development of single-cell RNA sequencing has enabled profound discoveries in biology, ranging from the dissection of the composition of complex tissues to the identification of novel cell types and dynamics in some specialized cellular environments. However, the large-scale generation of single-cell RNA-seq (scRNA-seq) data collected at multiple time points remains a challenge to effective measurement gene expression patterns in transcriptome analysis.ResultsWe present an algorithm based on the Dynamic Time Warping score (DTWscore) combined with time-series data, that enables the detection of gene expression changes across scRNA-seq samples and recovery of potential cell types from complex mixtures of multiple cell types.ConclusionsThe DTWscore successfully classify cells of different types with the most highly variable genes from time-series scRNA-seq data. The study was confined to methods that are implemented and available within the R framework. Sample datasets and R packages are available at https://github.com/xiaoxiaoxier/DTWscore .
Project description:With the advent of single-cell RNA-sequencing (scRNA-seq), it is possible to measure the expression dynamics of genes at the single-cell level. Through scRNA-seq, a huge amount of expression data for several thousand(s) of genes over million(s) of cells are generated in a single experiment. Differential expression analysis is the primary downstream analysis of such data to identify gene markers for cell type detection and also provide inputs to other secondary analyses. Many statistical approaches for differential expression analysis have been reported in the literature. Therefore, we critically discuss the underlying statistical principles of the approaches and distinctly divide them into six major classes, i.e., generalized linear, generalized additive, Hurdle, mixture models, two-class parametric, and non-parametric approaches. We also succinctly discuss the limitations that are specific to each class of approaches, and how they are addressed by other subsequent classes of approach. A number of challenges are identified in this study that must be addressed to develop the next class of innovative approaches. Furthermore, we also emphasize the methodological challenges involved in differential expression analysis of scRNA-seq data that researchers must address to draw maximum benefit from this recent single-cell technology. This study will serve as a guide to genome researchers and experimental biologists to objectively select options for their analysis.