Dataset Information

Differential expression analysis for RNAseq using Poisson mixed models.

ABSTRACT: Identifying differentially expressed (DE) genes from RNA sequencing (RNAseq) studies is among the most common analyses in genomics. However, RNAseq DE analysis presents several statistical and computational challenges, including over-dispersed read counts and, in some settings, sample non-independence. Previous count-based methods rely on simple hierarchical Poisson models (e.g. negative binomial) to model independent over-dispersion, but do not account for sample non-independence due to relatedness, population structure and/or hidden confounders. Here, we present a Poisson mixed model with two random effects terms that account for both independent over-dispersion and sample non-independence. We also develop a scalable sampling-based inference algorithm using a latent variable representation of the Poisson distribution. With simulations, we show that our method properly controls type I error and is generally more powerful than other widely used approaches, except in small samples (n < 15) with other unfavorable properties (e.g. small effect sizes). We also apply our method to three real datasets that contain related individuals, population stratification or hidden confounders. Our results show that our method increases power in all three datasets compared to other approaches, though the power gain is smallest in the smallest sample (n = 6). Our method is implemented in MACAU, freely available at www.xzlab.org/software.html.
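The two random effects described in the abstract can be illustrated with a small simulation. This is a minimal sketch and not the MACAU implementation: the function name `simulate_pmm_counts`, the parameterization (a total random-effect variance `sigma2` split by a proportion `h2` between a structured term with covariance `K` and an independent term) and all numeric values are assumptions chosen for illustration only.

```python
import numpy as np

def simulate_pmm_counts(x, K, beta=0.0, intercept=np.log(50.0),
                        sigma2=0.5, h2=0.5, seed=0):
    """Draw counts for one gene from a Poisson model with two random effects.

    Assumed form (illustrative, not taken from the MACAU source code):
      log(rate_i) = intercept + beta * x_i + g_i + e_i
      g ~ N(0, sigma2 * h2 * K)        # sample non-independence
      e ~ N(0, sigma2 * (1 - h2) * I)  # independent over-dispersion
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    # Cholesky factor of K (small jitter for numerical stability).
    chol = np.linalg.cholesky(K + 1e-8 * np.eye(n))
    g = np.sqrt(sigma2 * h2) * (chol @ rng.standard_normal(n))
    e = np.sqrt(sigma2 * (1.0 - h2)) * rng.standard_normal(n)
    log_rate = intercept + beta * np.asarray(x, dtype=float) + g + e
    return rng.poisson(np.exp(log_rate))

# Hypothetical sample structure: 250 pairs of related individuals,
# with a relatedness of 0.5 within each pair and 0 between pairs.
n_pairs = 250
K = np.kron(np.eye(n_pairs), np.array([[1.0, 0.5], [0.5, 1.0]]))
x = np.tile([0.0, 1.0], n_pairs)  # made-up binary predictor

counts = simulate_pmm_counts(x, K, beta=0.0)
# A plain Poisson model would give variance / mean close to 1.
dispersion_index = counts.var(ddof=1) / counts.mean()
print("variance / mean:", dispersion_index)
```

Under these assumptions the variance-to-mean ratio is far above 1, which is the over-dispersion that a plain Poisson model cannot capture. The structured term `g` additionally makes counts from related samples correlated.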

SUBMITTER: Sun S

PROVIDER: S-EPMC5499851 | biostudies-literature | 2017 Jun

REPOSITORIES: biostudies-literature

Publications

Differential expression analysis for RNAseq using Poisson mixed models.

Shiquan Sun, Michelle Hood, Laura Scott, Qinke Peng, Sayan Mukherjee, Jenny Tung, Xiang Zhou

Nucleic Acids Research, 2017-06-01, issue 11

PMID: 28369632

Similar Datasets

Project description:

Background: Tsetse flies are the major vectors of human trypanosomiasis caused by Trypanosoma brucei rhodesiense and T. b. gambiense. They are widespread across sub-Saharan Africa and pose serious challenges to both human and animal health, which strains agricultural production and productivity in Africa. Delimiting the extent and magnitude of tsetse coverage has been a challenge for decades because of limited resources and unsatisfactory technology. To overcome these limitations, this study explored modelling approaches that can spatially estimate tsetse abundance in Uganda using limited tsetse data and a set of remotely sensed environmental variables.

Methodology: Entomological data for the period 2008-2018 used in the model were obtained from various sources and systematically assembled using a structured protocol. The data were harmonised for the purposes of responsiveness and matching. Tsetse were trapped with pyramidal traps in many instances and with biconical traps in others. Based on the spatially explicit assembled data, we ran two regression models, standard Poisson and Zero-Inflated Poisson (ZIP), to explore the associations between tsetse abundance in Uganda and several environmental and climatic covariates. The covariate data consisted largely of satellite sensor data in the form of meteorological and vegetation surrogates, together with elevation and land cover data. We finally used the ZIP regression model to predict tsetse abundance because it outperformed the standard Poisson model after model fitting and testing with the Vuong non-nested statistic.

Results: A total of 1,187 tsetse sampling points were identified and considered representative for the country. The model results indicated the significance and level of responsiveness of each covariate in influencing tsetse abundance across the study area. Woodland vegetation, elevation, temperature, rainfall, and dry season normalised difference vegetation index (NDVI) were important in determining tsetse abundance and spatial distribution at varied scales. The resultant prediction map shows scaled tsetse abundance with estimated fitted numbers ranging from 0 to 59 flies per trap per day (FTD). Tsetse abundance was largest at low elevations, in areas of high vegetative activity, and in game parks, forests and shrubs during the dry season. The selected predictors showed very limited responsiveness to tsetse abundance during the wet season, matching the known fact that tsetse disperse most significantly during the wet season.

Conclusions: A methodology was advanced to enable compilation of entomological data for 10 years, which supported the generation of tsetse abundance maps for Uganda through modelling. Our findings indicate the spatial distribution of G. f. fuscipes as low, 0-5 FTD (48%); medium, 5.1-35 FTD (18%); and high, 35.1-60 FTD (34%), based on seasonality. This approach, amidst entomological data shortages due to limited resources and absence of expertise, can be adopted to enable mapping of the vector to provide better decision support towards designing and implementing targeted tsetse and tsetse-transmitted African trypanosomiasis control strategies.
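The comparison between a standard Poisson model and a ZIP model with a Vuong-type statistic can be sketched on simulated data. This is an illustration only and not the study's analysis: it uses intercept-only models with no environmental covariates, fits the ZIP model with a simple EM loop, computes the uncorrected Vuong statistic, and all data and parameter values (40% structural zeros, a rate of 5) are made up.

```python
import math
import numpy as np

def poisson_logpmf(y, lam):
    """Pointwise log-probability of counts y under Poisson(lam)."""
    log_fact = np.array([math.lgamma(k + 1.0) for k in y])
    return y * np.log(lam) - lam - log_fact

def zip_logpmf(y, pi, lam):
    """Pointwise log-probability under a zero-inflated Poisson.

    pi is the probability of a structural zero; otherwise y ~ Poisson(lam).
    """
    nonzero = np.log1p(-pi) + poisson_logpmf(y, lam)
    zero = np.log(pi + (1.0 - pi) * np.exp(-lam))
    return np.where(y == 0, zero, nonzero)

def fit_zip_em(y, n_iter=200):
    """Intercept-only ZIP fit by expectation-maximisation."""
    pi, lam = 0.5, max(y.mean(), 1e-8)
    is_zero = (y == 0)
    for _ in range(n_iter):
        # E-step: probability that each observed zero is structural.
        z = np.where(is_zero, pi / (pi + (1.0 - pi) * np.exp(-lam)), 0.0)
        # M-step: update the mixing weight and the Poisson rate.
        pi = z.mean()
        lam = y.sum() / (1.0 - z).sum()
    return pi, lam

def vuong_statistic(loglik_1, loglik_2):
    """Uncorrected Vuong statistic; large positive values favour model 1."""
    m = loglik_1 - loglik_2
    return math.sqrt(len(m)) * m.mean() / m.std(ddof=1)

# Made-up trap counts: 40% structural zeros, otherwise Poisson(5).
rng = np.random.default_rng(42)
n = 1000
structural_zero = rng.random(n) < 0.4
y = np.where(structural_zero, 0, rng.poisson(5.0, size=n))

lam_pois = y.mean()              # Poisson maximum-likelihood estimate
pi_hat, lam_hat = fit_zip_em(y)  # ZIP estimates
v = vuong_statistic(zip_logpmf(y, pi_hat, lam_hat),
                    poisson_logpmf(y, lam_pois))
print("pi_hat:", pi_hat, "lam_hat:", lam_hat, "Vuong:", v)
```

A statistic well above about 1.96 would favour the ZIP model at the conventional 5% level. A regression version would replace the two constants with covariate-dependent terms, for example through a log link for the rate and a logit link for the zero-inflation probability.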

Project description:

Background: RNAseq is nowadays the method of choice for transcriptome analysis. In the last decades, a large number of statistical methods, and associated bioinformatics tools, for RNAseq analysis were developed. More recently, statistical studies carried out neutral comparisons using benchmark datasets, shedding light on the most appropriate approaches for RNAseq data analysis.

Results: DiCoExpress is a script-based tool implemented in R that includes methods chosen based on their performance in neutral comparison studies. DiCoExpress uses pre-existing R packages, including FactoMineR, edgeR and coseq, to perform quality control, differential, and co-expression analysis of RNAseq data. Users can perform the full analysis by providing a mapped read expression data file and a file containing the information on the experimental design. Following the quality control step, the user can move on to the differential expression analysis, which is performed using generalized linear models and is supported by an automated contrast writing function. A co-expression analysis is implemented using the coseq package. Lists of differentially expressed genes and identified co-expression clusters are automatically analyzed for enrichment of annotations provided by the user. We used DiCoExpress to analyze a publicly available RNAseq dataset on the transcriptional response of Brassica napus L. to silicon treatment in plant roots and mature leaves. This dataset, including two biological factors and three replicates for each condition, allowed us to demonstrate in a tutorial all the features of DiCoExpress.

Conclusions: DiCoExpress is an R script-based tool allowing users to perform a full RNAseq analysis, from quality control to co-expression analysis, through differential analysis based on contrasts inside generalized linear models. DiCoExpress focuses on the statistical modelling of gene expression according to the experimental design and facilitates the data analysis leading to the biological interpretation of the results.
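The idea of testing contrasts inside a linear model for a two-factor design can be sketched as follows. This is not DiCoExpress or edgeR code (both are R tools). It is a simplified Python illustration that uses ordinary least squares on made-up log-scale expression values instead of a negative binomial generalized linear model. The group names, expression values and contrast vectors are hypothetical.

```python
import numpy as np

# Hypothetical design: 2 treatments x 2 organs, 3 replicates per condition.
groups = ["control_root", "silicon_root", "control_leaf", "silicon_leaf"]
labels = np.repeat(np.arange(len(groups)), 3)  # group index of each sample
design = np.eye(len(groups))[labels]           # 12 x 4 cell-means design

# Made-up log2 expression of one gene: responds to silicon in roots only.
rng = np.random.default_rng(1)
true_means = np.array([5.0, 7.0, 4.0, 4.0])
y = design @ true_means + rng.normal(0.0, 0.1, size=len(labels))

# With a cell-means design, the coefficients are the group means.
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

# Contrast 1: silicon minus control, within roots.
root_contrast = np.array([-1.0, 1.0, 0.0, 0.0])
# Contrast 2: interaction, i.e. the root effect minus the leaf effect.
interaction_contrast = np.array([-1.0, 1.0, 1.0, -1.0])

effect_root = root_contrast @ coef
effect_interaction = interaction_contrast @ coef
print("root effect:", effect_root, "interaction:", effect_interaction)
```

Each contrast is a weighted sum of the fitted coefficients, so writing contrasts automatically amounts to generating such weight vectors from the factor levels of the design.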

OmicsDI is part of the ELIXIR infrastructure
