Project description:Motivation: Researchers usually conduct statistical analyses based on models built on raw data collected from individual participants (individual-level data). There is a growing interest in enhancing inference efficiency by incorporating aggregated summary information from other sources, such as summary statistics on genetic markers' marginal associations with a given trait generated from genome-wide association studies. However, combining high-dimensional summary data with individual-level data using existing integrative procedures can be challenging due to various numeric issues in optimizing an objective function over a large number of unknown parameters. Results: We develop a procedure to improve the fitting of a targeted statistical model by leveraging external summary data for more efficient statistical inference (both effect estimation and hypothesis testing). To make this procedure scalable to high-dimensional summary data, we propose a divide-and-conquer strategy by breaking the task into easier parallel jobs, each fitting the targeted model by integrating the individual-level data with a small proportion of summary data. We obtain the final estimates of model parameters by pooling results from multiple fitted models through the minimum distance estimation procedure. We improve the procedure for a general class of additive models commonly encountered in genetic studies. We further expand these two approaches to integrate individual-level and high-dimensional summary data from different study populations. We demonstrate the advantage of the proposed methods through simulations and an application to the study of the effect on pancreatic cancer risk by the polygenic risk score defined by BMI-associated genetic markers. Availability and implementation: R package is available at https://github.com/fushengstat/MetaGIM.
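The pooling step described in the abstract above can be illustrated with a generic minimum distance estimator. This is a minimal sketch, not the MetaGIM implementation: it assumes a single scalar parameter, K estimates of it from the parallel jobs, and a known K x K covariance matrix for those estimates (they are correlated because every job reuses the same individual-level data). The function names `solve` and `pool_min_distance` are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def pool_min_distance(estimates, cov):
    """Pool K estimates of one scalar parameter (assumed setup, see lead-in).

    Minimizes (t - theta * 1)' cov^{-1} (t - theta * 1) over theta, where t is
    the vector of estimates. Returns (pooled estimate, variance of the pooled estimate).
    """
    k = len(estimates)
    w = solve(cov, [1.0] * k)   # w = cov^{-1} 1
    total = sum(w)              # 1' cov^{-1} 1
    pooled = sum(wi * ti for wi, ti in zip(w, estimates)) / total
    return pooled, 1.0 / total
```

With a diagonal covariance matrix this reduces to inverse-variance weighting; off-diagonal entries let the pooled estimate account for correlation between jobs.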
Project description:This study presents a method for genomic prediction that uses individual-level data and summary statistics from multiple populations. Genome-wide markers are nowadays widely used to predict complex traits, and genomic prediction using multi-population data is an appealing approach to achieve higher prediction accuracies. However, sharing of individual-level data across populations is not always possible. We present a method that enables integration of summary statistics from separate analyses with the available individual-level data. The data can consist of individuals with either single or multiple (weighted) phenotype records per individual. We developed a method based on a hypothetical joint analysis model and absorption of population-specific information. We show that population-specific information is fully captured by estimated allele substitution effects and the accuracy of those estimates, i.e., the summary statistics. The method gives the same result as the joint analysis of all individual-level data when complete summary statistics are available. We provide a series of easy-to-use approximations that can be used when complete summary statistics are not available or are impractical to share. Simulations show that the approximations enable integration of different sources of information across a wide range of settings, yielding accurate predictions. The method can be readily extended to multiple traits. In summary, the developed method enables integration of genome-wide data, in the form of individual-level data or summary statistics, from multiple populations to obtain more accurate estimates of allele substitution effects and genomic predictions.
Project description:Most existing tools for constructing genetic prediction models begin with the assumption that all genetic variants contribute equally towards the phenotype. However, this represents a suboptimal model for how heritability is distributed across the genome. Therefore, we develop prediction tools that allow the user to specify the heritability model. We compare individual-level data prediction tools using 14 UK Biobank phenotypes; our new tool LDAK-Bolt-Predict outperforms the existing tools Lasso, BLUP, Bolt-LMM and BayesR for all 14 phenotypes. We compare summary statistic prediction tools using 225 UK Biobank phenotypes; our new tool LDAK-BayesR-SS outperforms the existing tools lassosum, sBLUP, LDpred and SBayesR for 223 of the 225 phenotypes. When we improve the heritability model, the proportion of phenotypic variance explained increases by 14% on average, which is equivalent to increasing the sample size by a quarter.
Project description:The accuracy of polygenic risk scores (PRSs) to predict complex diseases increases with the training sample size. PRSs are generally derived based on summary statistics from large meta-analyses of multiple genome-wide association studies (GWASs). However, it is now common for researchers to have access to large individual-level data as well, such as the UK Biobank data. To the best of our knowledge, it has not yet been explored how best to combine both types of data (summary statistics and individual-level data) to optimize polygenic prediction. The most widely used approach to combine data is the meta-analysis of GWAS summary statistics (meta-GWAS), but we show that it does not always provide the most accurate PRS. Through simulations and using 12 real case-control and quantitative traits from both iPSYCH and UK Biobank along with external GWAS summary statistics, we compare meta-GWAS with two alternative data-combining approaches, stacked clumping and thresholding (SCT) and meta-PRS. We find that, when large individual-level data are available, the linear combination of PRSs (meta-PRS) is both a simple alternative to meta-GWAS and often more accurate.
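The meta-PRS idea in the abstract above, a linear combination of PRSs learned where individual-level data are available, can be sketched as follows. This is an illustrative sketch only: it assumes exactly two scores and a quantitative trait, and fits the weights by ordinary least squares; the function names `fit_meta_prs` and `meta_prs` are hypothetical and the abstract does not specify the fitting procedure.

```python
def fit_meta_prs(prs_a, prs_b, phenotype):
    """Least-squares fit of phenotype ~ intercept + w_a * prs_a + w_b * prs_b.

    Assumed setup: two per-individual score lists and a quantitative trait,
    all of the same length. Returns (intercept, w_a, w_b).
    """
    n = len(phenotype)
    mean_a = sum(prs_a) / n
    mean_b = sum(prs_b) / n
    mean_y = sum(phenotype) / n
    # Center everything so the intercept drops out of the normal equations.
    a = [x - mean_a for x in prs_a]
    b = [x - mean_b for x in prs_b]
    y = [x - mean_y for x in phenotype]
    s_aa = sum(x * x for x in a)
    s_bb = sum(x * x for x in b)
    s_ab = sum(p * q for p, q in zip(a, b))
    s_ay = sum(p * q for p, q in zip(a, y))
    s_by = sum(p * q for p, q in zip(b, y))
    det = s_aa * s_bb - s_ab * s_ab
    if abs(det) < 1e-12:
        raise ValueError("the two scores are collinear")
    # Solve the 2 x 2 normal equations by Cramer's rule.
    w_a = (s_ay * s_bb - s_by * s_ab) / det
    w_b = (s_by * s_aa - s_ay * s_ab) / det
    intercept = mean_y - w_a * mean_a - w_b * mean_b
    return intercept, w_a, w_b


def meta_prs(prs_a, prs_b, weights):
    """Apply fitted weights to obtain the combined score for each individual."""
    intercept, w_a, w_b = weights
    return [intercept + w_a * xa + w_b * xb for xa, xb in zip(prs_a, prs_b)]
```

In practice the weights would be fitted on a training subset and the combined score evaluated on held-out individuals; a case-control trait would call for logistic rather than linear regression.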
Project description:In addition to applications in meta-analysis, funnel plots have emerged as an effective graphical tool for visualizing the detection of health care providers with unusual performance. Although a variety of approaches to producing funnel plots already exist in the provider profiling literature, limited attention has been paid to elucidating the critical relationship between funnel plots and hypothesis testing. Within the framework of generalized linear models, here we establish methodological guidelines for creating funnel plots specific to the statistical tests of interest. Moreover, we show that the test-specific funnel plots can be created using only summary statistics instead of individual-level information. This appealing feature helps prevent the leak of protected health information and reduces the cost of inter-institutional data transmission. Two data examples, one for surgical patients from Michigan hospitals and the other for Medicare-certified dialysis facilities, demonstrate the applicability to different types of providers and outcomes with either individual- or summary-level information.
Project description:Urbanization level is an important indicator of socioeconomic development, and projecting its dynamics is fundamental for studies related to global socioeconomic and climate change. This paper aims to update the projections of global urbanization from 2015 to 2100 under the Shared Socioeconomic Pathways by using the logistic fitting model and iteratively identifying reference countries. Based on the historical urbanization level database from the World Urbanization Prospects, projected urbanization levels and uncertainties are provided for 204 countries and areas every five years. The 2010-2100 year-by-year projected urbanization levels and uncertainties based on the annual historical data from the World Bank (WB) for 188 countries and areas are also provided. The projections based on the two datasets were compared, and the latter were validated using the historical values of the WB for the years 2010-2018. The updated dataset of urbanization level is relevant for understanding future socioeconomic development, its implications for climate change and policy planning.
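A logistic fit of the kind named in the abstract above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's procedure: it assumes urbanization levels are fractions strictly between 0 and 1 with an upper asymptote of 1, fits a straight line on the logit scale by least squares, and leaves out the reference-country selection and scenario handling. The names `fit_logistic` and `project` are hypothetical.

```python
import math


def fit_logistic(years, levels):
    """Fit level = 1 / (1 + exp(-(a + b * year))) by least squares on the logit scale.

    Assumes every level lies strictly between 0 and 1. Returns (a, b).
    """
    logits = [math.log(u / (1.0 - u)) for u in levels]
    n = len(years)
    mean_t = sum(years) / n
    mean_z = sum(logits) / n
    s_tt = sum((t - mean_t) ** 2 for t in years)
    s_tz = sum((t - mean_t) * (z - mean_z) for t, z in zip(years, logits))
    b = s_tz / s_tt
    a = mean_z - b * mean_t
    return a, b


def project(a, b, year):
    """Projected urbanization level (as a fraction) for a given year."""
    return 1.0 / (1.0 + math.exp(-(a + b * year)))
```

Fitting on the logit scale keeps the example dependency-free; a nonlinear least-squares fit on the original scale would weight observations differently and may give slightly different parameters.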
Project description:The estimation of isoform abundances from RNA-Seq data requires a time-intensive step of mapping reads to either an assembled or previously annotated transcriptome, followed by an optimization procedure for deconvolution of multi-mapping reads. These procedures are essential for downstream analysis such as differential expression. In cases where it is desirable to adjust the underlying annotation, for example, on the discovery of novel isoforms or errors in existing annotations, current pipelines must be rerun from scratch. This makes it difficult to update abundance estimates after re-annotation, or to explore the effect of changes in the transcriptome on analyses. We present a novel efficient algorithm for updating abundance estimates from RNA-Seq experiments on re-annotation that does not require re-analysis of the entire dataset. Our approach is based on a fast partitioning algorithm for identifying transcripts whose abundances may depend on the added or deleted isoforms, and on a fast follow-up approach to re-estimating abundances for all transcripts. We demonstrate the effectiveness of our methods by showing how to synchronize RNA-Seq abundance estimates with the daily RefSeq incremental updates. Thus, we provide a practical approach to maintaining relevant databases of RNA-Seq derived abundance estimates even as annotations are being constantly revised. Availability and implementation: Our methods are implemented in software called ReXpress and are freely available, together with source code, at http://bio.math.berkeley.edu/ReXpress/. Supplementary information: Supplementary data are available at Bioinformatics online.
Project description:Pancreatic ductal adenocarcinoma (PDAC) is a leading cause of cancer mortality worldwide. However, its predictive markers for long-term survival are not well known. It is therefore of interest to delineate individual-specific perturbed genes when comparing long-term (LT) and short-term (ST) PDAC survivors and to integrate individual- and group-based transcriptome profiling. Using a discovery cohort of 19 PDAC patients from CHU-Liège (Belgium), we first performed differential gene expression analysis comparing LT to ST survivors. Second, we adopted systems biology approaches to obtain clinically relevant gene modules. Third, we created individual-specific perturbation profiles. Furthermore, we used the Degree-Aware disease gene prioritizing (DADA) method to develop PDAC disease modules and Network-based Integration of Multi-omics Data (NetICS) to integrate group-based and individual-specific perturbed genes in relation to PDAC LT survival. We identified 173 differentially expressed genes (DEGs) between ST and LT survivors and five modules (including 38 DEGs) showing associations with clinical traits. Validation of DEGs in the molecular lab suggested a role of REG4 and TSPAN8 in PDAC survival. Via NetICS and DADA, we identified various known oncogenes such as CUL1 and TGFB1. Our proposed analytic workflow shows the advantages of combining clinical and omics data as well as individual- and group-level transcriptome profiling.
Project description:Background: Although lung cancer screening (LCS) for high-risk individuals reduces lung cancer mortality in clinical trial settings, many questions remain about how to implement high-quality LCS in real-world programs. With the increasing use of telemedicine in healthcare, studies examining this approach in the context of LCS are urgently needed. We aimed to identify sociodemographic and other factors associated with screening completion among individuals undergoing telemedicine Shared Decision Making (SDM) for LCS. Methods: This retrospective study examined patients who completed SDM via telemedicine between May 4, 2020 and March 18, 2021 in a centralized LCS program. Individuals were categorized into Complete Screening vs. Incomplete Screening subgroups based on the status of subsequent low-dose computed tomography (LDCT) completion. A multi-level, multivariate model was constructed to identify factors associated with incomplete screening. Results: Among individuals undergoing telemedicine SDM during the study period, 20.6% did not complete an LDCT scan. Bivariate analysis demonstrated that Black/African-American race, Medicaid insurance status, and new patient type were associated with greater odds of incomplete screening. On multi-level, multivariate analysis, individuals who were new patients undergoing baseline LDCT or resided in a census tract with a high level of socioeconomic deprivation had significantly higher odds of incomplete screening. Individuals with a greater level of education experienced lower odds of incomplete screening. Conclusions: Among high-risk individuals undergoing telemedicine SDM for LCS, predictors of incomplete screening included low education, high neighborhood-level deprivation, and new patient type. Future research should focus on testing implementation strategies to improve LDCT completion rates while leveraging telemedicine for high-quality LCS.