Project description:This study presents a method for genomic prediction that uses individual-level data and summary statistics from multiple populations. Genome-wide markers are now widely used to predict complex traits, and genomic prediction using multi-population data is an appealing approach to achieve higher prediction accuracies. However, sharing of individual-level data across populations is not always possible. We present a method that enables integration of summary statistics from separate analyses with the available individual-level data. The data can consist of individuals with either single or multiple (weighted) phenotype records per individual. We developed a method based on a hypothetical joint analysis model and absorption of population-specific information. We show that population-specific information is fully captured by estimated allele substitution effects and the accuracy of those estimates, i.e., the summary statistics. The method gives results identical to those of the joint analysis of all individual-level data when complete summary statistics are available. We provide a series of easy-to-use approximations that can be used when complete summary statistics are not available or are impractical to share. Simulations show that the approximations enable integration of different sources of information across a wide range of settings, yielding accurate predictions. The method can be readily extended to multiple traits. In summary, the developed method enables integration of genome-wide data in the form of individual-level data or summary statistics from multiple populations to obtain more accurate estimates of allele substitution effects and genomic predictions.
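The claim that estimated effects plus their accuracy can stand in for individual-level data can be illustrated with a toy ridge-type model. The sketch below is not the authors' implementation: it assumes only two markers, a shrinkage parameter LAM shared by both populations, a residual variance of 1, and made-up genotypes and phenotypes. All function names are hypothetical. Population 2 shares only its effect estimates and the precision matrix of those estimates, and population 1 absorbs them into its own equations.

```python
# Toy illustration with two markers, so 2x2 systems can be solved directly.
# LAM, the data, and every function name here are assumptions for illustration.
LAM = 1.0  # assumed shrinkage parameter, identical in both populations

def crossproducts(X, y):
    """Return X'X (2x2) and X'y (length 2) for a two-marker genotype matrix."""
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(2)] for i in range(2)]
    xty = [sum(row[i] * v for row, v in zip(X, y)) for i in range(2)]
    return xtx, xty

def add(a, b):
    """Element-wise sum of two 2x2 matrices."""
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]

def ridge(a):
    """Add LAM to the diagonal of a 2x2 matrix."""
    return [[a[i][j] + (LAM if i == j else 0.0) for j in range(2)] for i in range(2)]

def solve(a, b):
    """Solve the 2x2 linear system a @ x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(a[1][1] * b[0] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

def matvec(a, v):
    """Product of a 2x2 matrix and a length-2 vector."""
    return [a[i][0] * v[0] + a[i][1] * v[1] for i in range(2)]

# Made-up genotypes (coded 0/1/2) and phenotypes for two populations.
X1, y1 = [[0, 1], [1, 2], [2, 0], [1, 1]], [0.5, 2.1, 1.9, 1.4]
X2, y2 = [[2, 2], [0, 0], [1, 0], [2, 1], [0, 2]], [3.0, 0.1, 0.9, 2.4, 1.1]

xtx1, xty1 = crossproducts(X1, y1)
xtx2, xty2 = crossproducts(X2, y2)

# Reference: joint analysis with all individual-level data pooled.
joint = solve(ridge(add(xtx1, xtx2)), [xty1[k] + xty2[k] for k in range(2)])

# Population 2 shares only summary statistics: estimates and their precision.
prec2 = ridge(xtx2)        # precision of the estimates (residual variance = 1)
est2 = solve(prec2, xty2)  # estimated marker effects in population 2

# Population 1 absorbs the summary statistics; prec2 already contains LAM.
info2 = matvec(prec2, est2)  # recovers X2'y2 from the summary statistics
combined = solve(add(xtx1, prec2), [xty1[k] + info2[k] for k in range(2)])

print(joint, combined)
```

Under these assumptions the two printed vectors agree up to rounding error, which mirrors the statement that complete summary statistics reproduce the joint analysis.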
Project description:Most existing tools for constructing genetic prediction models begin with the assumption that all genetic variants contribute equally towards the phenotype. However, this represents a suboptimal model for how heritability is distributed across the genome. Therefore, we develop prediction tools that allow the user to specify the heritability model. We compare individual-level data prediction tools using 14 UK Biobank phenotypes; our new tool LDAK-Bolt-Predict outperforms the existing tools Lasso, BLUP, Bolt-LMM and BayesR for all 14 phenotypes. We compare summary statistic prediction tools using 225 UK Biobank phenotypes; our new tool LDAK-BayesR-SS outperforms the existing tools lassosum, sBLUP, LDpred and SBayesR for 223 of the 225 phenotypes. When we improve the heritability model, the proportion of phenotypic variance explained increases by 14% on average, which is equivalent to increasing the sample size by a quarter.
Project description:The accuracy of polygenic risk scores (PRSs) to predict complex diseases increases with the training sample size. PRSs are generally derived based on summary statistics from large meta-analyses of multiple genome-wide association studies (GWASs). However, it is now common for researchers to have access to large individual-level data as well, such as the UK Biobank data. To the best of our knowledge, it has not yet been explored how best to combine both types of data (summary statistics and individual-level data) to optimize polygenic prediction. The most widely used approach to combine data is the meta-analysis of GWAS summary statistics (meta-GWAS), but we show that it does not always provide the most accurate PRS. Through simulations and using 12 real case-control and quantitative traits from both iPSYCH and UK Biobank along with external GWAS summary statistics, we compare meta-GWAS with two alternative data-combining approaches, stacked clumping and thresholding (SCT) and meta-PRS. We find that, when large individual-level data are available, the linear combination of PRSs (meta-PRS) is both a simple alternative to meta-GWAS and often more accurate.
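A linear combination of PRSs can be sketched as an ordinary least-squares fit of the phenotype on two scores. This is only an illustration under stated assumptions: the abstract does not say how the combination weights are obtained, so least squares is an assumption here, as are the function names and the synthetic data.

```python
def center(values):
    """Subtract the mean so the intercept drops out of the regression."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def meta_prs_weights(prs_a, prs_b, phenotype):
    """Least-squares weights for phenotype ~ w_a * prs_a + w_b * prs_b.

    prs_a might come from external summary statistics and prs_b from
    individual-level data; both are per-individual scores (hypothetical setup).
    """
    a, b, y = center(prs_a), center(prs_b), center(phenotype)
    saa = sum(x * x for x in a)
    sbb = sum(x * x for x in b)
    sab = sum(x * z for x, z in zip(a, b))
    say = sum(x * z for x, z in zip(a, y))
    sby = sum(x * z for x, z in zip(b, y))
    det = saa * sbb - sab * sab  # zero only if the two scores are collinear
    w_a = (sbb * say - sab * sby) / det
    w_b = (saa * sby - sab * say) / det
    return w_a, w_b

def meta_prs(prs_a, prs_b, w_a, w_b):
    """Combined score for each individual."""
    return [w_a * a + w_b * b for a, b in zip(prs_a, prs_b)]

# Synthetic scores and a noise-free phenotype built from known weights.
prs_a = [0.1, -0.4, 0.8, 1.2, -0.9, 0.3]
prs_b = [0.5, 0.2, -0.3, 0.9, -1.1, 0.0]
phenotype = [1.0 + 2.0 * a + 0.5 * b for a, b in zip(prs_a, prs_b)]

w_a, w_b = meta_prs_weights(prs_a, prs_b, phenotype)
combined = meta_prs(prs_a, prs_b, w_a, w_b)
```

Because the synthetic phenotype contains no noise, the fit recovers the weights used to build it. With real data the weights would be estimated in a training set and evaluated in a separate one.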
Project description:In addition to applications in meta-analysis, funnel plots have emerged as an effective graphical tool for visualizing the detection of health care providers with unusual performance. Although a variety of approaches to producing funnel plots already exist in the literature on provider profiling, limited attention has been paid to elucidating the critical relationship between funnel plots and hypothesis testing. Within the framework of generalized linear models, here we establish methodological guidelines for creating funnel plots specific to the statistical tests of interest. Moreover, we show that the test-specific funnel plots can be created by leveraging only summary statistics instead of individual-level information. This appealing feature prevents the leakage of protected health information and reduces the cost of inter-institutional data transmission. Two data examples, one for surgical patients from Michigan hospitals and the other for Medicare-certified dialysis facilities, demonstrate the applicability to different types of providers and outcomes with either individual- or summary-level information.
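To make the link between funnel plots and hypothesis testing concrete, the sketch below computes control limits for a provider's event rate from summary-level inputs only (event count and number of patients). It uses a plain normal approximation to a two-sided test of a proportion against a target rate. That choice, the function names, and the example numbers are assumptions for illustration, not the test-specific constructions developed in the paper.

```python
from math import sqrt
from statistics import NormalDist

def funnel_limits(target_rate, n, alpha=0.05):
    """Control limits at sample size n for a two-sided level-alpha test.

    Normal approximation: target_rate +/- z * sqrt(p0 * (1 - p0) / n),
    clipped to the valid range [0, 1].
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sqrt(target_rate * (1 - target_rate) / n)
    return max(0.0, target_rate - half_width), min(1.0, target_rate + half_width)

def flag_provider(events, n, target_rate, alpha=0.05):
    """Classify a provider using only its summary statistics (events, n)."""
    lower, upper = funnel_limits(target_rate, n, alpha)
    rate = events / n
    if rate > upper:
        return "above"
    if rate < lower:
        return "below"
    return "in control"

# Hypothetical providers: (events, patients), judged against a 10% target rate.
providers = {"A": (30, 100), "B": (10, 100), "C": (1, 400)}
flags = {name: flag_provider(e, n, 0.10) for name, (e, n) in providers.items()}
```

Plotting the limits against n gives the funnel shape: the limits tighten as the number of patients grows, so a provider falls outside the funnel exactly when the corresponding test rejects.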
Project description:The estimation of isoform abundances from RNA-Seq data requires a time-intensive step of mapping reads to either an assembled or previously annotated transcriptome, followed by an optimization procedure for deconvolution of multi-mapping reads. These procedures are essential for downstream analysis such as differential expression. In cases where it is desirable to adjust the underlying annotation, for example, on the discovery of novel isoforms or errors in existing annotations, current pipelines must be rerun from scratch. This makes it difficult to update abundance estimates after re-annotation, or to explore the effect of changes in the transcriptome on analyses. We present a novel efficient algorithm for updating abundance estimates from RNA-Seq experiments on re-annotation that does not require re-analysis of the entire dataset. Our approach is based on a fast partitioning algorithm for identifying transcripts whose abundances may depend on the added or deleted isoforms, and on a fast follow-up approach to re-estimating abundances for all transcripts. We demonstrate the effectiveness of our methods by showing how to synchronize RNA-Seq abundance estimates with the daily RefSeq incremental updates. Thus, we provide a practical approach to maintaining relevant databases of RNA-Seq derived abundance estimates even as annotations are being constantly revised. AVAILABILITY AND IMPLEMENTATION: Our methods are implemented in software called ReXpress and are freely available, together with source code, at http://bio.math.berkeley.edu/ReXpress/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
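One way to picture the partitioning idea is as a graph search: transcripts that share multi-mapping reads are linked, and only transcripts connected to an added or deleted isoform can have their abundances affected. The sketch below is an assumption about how such a partition could be computed, not a description of the ReXpress algorithm. The input format and all names are hypothetical.

```python
from collections import defaultdict, deque

def affected_transcripts(read_alignments, changed):
    """Return transcripts reachable from the changed isoforms via shared reads.

    read_alignments: dict mapping a read id to the set of transcript ids
                     it aligns to (hypothetical input format).
    changed:         set of transcript ids that were added or deleted.
    """
    # Link every pair of transcripts that share at least one read.
    neighbors = defaultdict(set)
    for targets in read_alignments.values():
        for transcript in targets:
            neighbors[transcript] |= targets - {transcript}

    # Breadth-first search outward from the changed isoforms.
    seen = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for other in neighbors[current]:
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen

# Toy alignments: reads r1 and r2 chain A-B-C together; D and E form a
# separate component that a change to C cannot influence.
reads = {"r1": {"A", "B"}, "r2": {"B", "C"}, "r3": {"D"}, "r4": {"D", "E"}}
to_update = affected_transcripts(reads, {"C"})
```

In this toy case only A, B, and C would need re-estimation after a change to C, while D and E keep their previous estimates.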
Project description:Pancreatic ductal adenocarcinoma (PDAC) is categorized as one of the leading causes of cancer mortality worldwide. However, its predictive markers for long-term survival are not well known. It is therefore of interest to delineate individual-specific perturbed genes when comparing long-term (LT) and short-term (ST) PDAC survivors and to integrate individual- and group-based transcriptome profiling. Using a discovery cohort of 19 PDAC patients from CHU-Liège (Belgium), we first performed differential gene expression analysis comparing LT to ST survivors. Second, we adopted systems biology approaches to obtain clinically relevant gene modules. Third, we created individual-specific perturbation profiles. Furthermore, we used the Degree-Aware disease gene prioritizing (DADA) method to develop PDAC disease modules, and Network-based Integration of Multi-omics Data (NetICS) to integrate group-based and individual-specific perturbed genes in relation to PDAC LT survival. We identified 173 differentially expressed genes (DEGs) between ST and LT survivors and five modules (including 38 DEGs) showing associations with clinical traits. Validation of DEGs in the molecular lab suggested a role of REG4 and TSPAN8 in PDAC survival. Via NetICS and DADA, we identified various known oncogenes such as CUL1 and TGFB1. Our proposed analytic workflow shows the advantages of combining clinical and omics data as well as individual- and group-level transcriptome profiling.
Project description:Cardiovascular disease (CVD) is considered a primary driver of global mortality and is estimated to be responsible for approximately 17.9 million deaths annually. Consequently, a substantial body of research related to CVD has developed, with an emphasis on identifying strategies for the prevention and effective treatment of CVD. In this review, we critically examine the existing CVD literature, and specifically highlight the contribution of Mendelian randomization analyses in CVD research. Throughout this review, we assess the extent to which research findings agree across a range of studies of differing design within a triangulation framework. If differing study designs are subject to non-overlapping sources of bias, consistent findings limit the extent to which results are merely an artefact of study design. Consequently, broad agreement across differing studies can be viewed as providing more robust causal evidence in contrast to limiting the scope of the review to a single specific study design. Utilising the triangulation approach, we highlight emerging patterns in research findings, and explore the potential of identified risk factors as targets for precision medicine and novel interventions.
Project description:Background: Although lung cancer screening (LCS) for high-risk individuals reduces lung cancer mortality in clinical trial settings, many questions remain about how to implement high-quality LCS in real-world programs. With the increasing use of telemedicine in healthcare, studies examining this approach in the context of LCS are urgently needed. We aimed to identify sociodemographic and other factors associated with screening completion among individuals undergoing telemedicine Shared Decision Making (SDM) for LCS. Methods: This retrospective study examined patients who completed SDM via telemedicine between May 4, 2020 and March 18, 2021 in a centralized LCS program. Individuals were categorized into Complete Screening vs. Incomplete Screening subgroups based on the status of subsequent LDCT completion. A multi-level, multivariate model was constructed to identify factors associated with incomplete screening. Results: Among individuals undergoing telemedicine SDM during the study period, 20.6% did not complete an LDCT scan. Bivariate analysis demonstrated that Black/African-American race, Medicaid insurance status, and new patient type were associated with greater odds of incomplete screening. On multi-level, multivariate analysis, individuals who were new patients undergoing baseline LDCT or resided in a census tract with a high level of socioeconomic deprivation had significantly higher odds of incomplete screening. Individuals with a greater level of education experienced lower odds of incomplete screening. Conclusions: Among high-risk individuals undergoing telemedicine SDM for LCS, predictors of incomplete screening included low education, high neighborhood-level deprivation, and new patient type. Future research should focus on testing implementation strategies to improve LDCT completion rates while leveraging telemedicine for high-quality LCS.
Project description:Background: Individual patient data meta-analyses (IPDMAs) prevail as the gold standard in clinical evaluations. We investigated the distribution and epidemiological characteristics of published IPDMA articles. Methodology/principal findings: IPDMA articles were identified through comprehensive literature searches of PubMed, Embase, and the Cochrane Library. Two investigators independently conducted article identification, data classification and extraction. Data related to the article characteristics were collected and analyzed descriptively. A total of 829 IPDMA articles indexed until 9 August 2012 were identified. An average of 3.7 IPDMA articles was published per year. Malignant neoplasms (267 [32.2%]) and circulatory diseases (179 [21.6%]) were the most frequently occurring topics. Each IPDMA article included a median of 8 studies (interquartile range, IQR 5 to 15) involving 2,563 patients (IQR 927 to 8,349). Among 829 IPDMA articles, 229 (27.6%) did not perform a systematic search to identify related studies. In total, 207 (25.0%) sought and included individual patient data (IPD) from the "grey literature". Only 496 (59.8%) successfully obtained IPD from all identified studies. Conclusions/significance: The number of IPDMA articles exhibited an increasing trend over the past few years, and the articles mainly focused on cancer and circulatory diseases. Our data indicated that literature searches, inclusion of grey literature, and data availability were inconsistent among different IPDMA articles, so biases may arise. Thus, decision makers should not uncritically accept all IPDMAs.