Project description:Numerous observational studies have attempted to identify risk factors for infection with SARS-CoV-2 and COVID-19 disease outcomes. Studies have used datasets sampled from patients admitted to hospital, people tested for active infection, or people who volunteered to participate. Here, we highlight the challenge of interpreting observational evidence from such non-representative samples. Collider bias can induce associations between two or more variables which affect the likelihood of an individual being sampled, distorting associations between these variables in the sample. In UK Biobank data, the participants tested for COVID-19 were, compared to the wider cohort, highly selected for a range of genetic, behavioural, cardiovascular, demographic, and anthropometric traits. We discuss the mechanisms inducing these problems, and approaches that could help mitigate them. While collider bias should be explored in existing studies, the optimal way to mitigate the problem is to use appropriate sampling strategies at the study design stage.
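The mechanism described above can be illustrated with a small simulation. This is a minimal sketch under assumed conditions, not the analysis from the study: two traits are generated independently, and an individual is sampled when a noisy liability (the sum of both traits plus noise) exceeds a threshold. The function name `simulate_selection_bias`, the threshold model, and all parameter values are hypothetical.

```python
import numpy as np

def simulate_selection_bias(n=200_000, threshold=1.0, seed=0):
    """Compare the x-y correlation in the full population and in a selected sample.

    Assumed (hypothetical) selection model: an individual is sampled when
    x + y + noise exceeds `threshold`, so both traits raise the chance of sampling.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)  # trait 1, independent of trait 2 by construction
    y = rng.normal(size=n)  # trait 2
    liability = x + y + rng.normal(size=n)
    selected = liability > threshold

    corr_population = np.corrcoef(x, y)[0, 1]
    corr_sample = np.corrcoef(x[selected], y[selected])[0, 1]
    return corr_population, corr_sample, selected.mean()

if __name__ == "__main__":
    corr_pop, corr_sel, frac = simulate_selection_bias()
    print(f"fraction sampled:          {frac:.3f}")
    print(f"correlation in population: {corr_pop:+.3f}")
    print(f"correlation in sample:     {corr_sel:+.3f}")
```

In this setup the two traits are uncorrelated in the population but negatively correlated among the sampled individuals, purely because both influence sampling.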
Project description:In comparative effectiveness research (CER) for rare types of cancer, it is appealing to combine primary cohort data containing detailed tumor profiles together with aggregate information derived from cancer registry databases. Such integration of data may improve statistical efficiency in CER. A major challenge in combining information from different resources, however, is that the aggregate information from the cancer registry databases could be incomparable with the primary cohort data, which are often collected from a single cancer center or a clinical trial. We develop an adaptive estimation procedure, which uses the combined information to determine the degree of information borrowing from the aggregate data of the external resource. We establish the asymptotic properties of the estimators and evaluate the finite sample performance via simulation studies. The proposed method yields a substantial gain in statistical efficiency over the conventional method using the primary cohort only, and avoids undesirable biases when the given external information is incomparable to the primary cohort. We apply the proposed method to evaluate the long-term effect of trimodality treatment to inflammatory breast cancer (IBC) by tumor subtypes, while combining the IBC patient cohort at The University of Texas MD Anderson Cancer Center and the external aggregate information from the National Cancer Data Base.
Project description:Over the last decade the availability of SNP-trait associations from genome-wide association studies has led to an array of methods for performing Mendelian randomization studies using only summary statistics. A common feature of these methods, besides their intuitive simplicity, is the ability to combine data from several sources, incorporate multiple variants and account for biases due to weak instruments and pleiotropy. With the advent of large and accessible fully-genotyped cohorts such as UK Biobank, there is now increasing interest in understanding how best to apply these well developed summary data methods to individual level data, and to explore the use of more sophisticated causal methods allowing for non-linearity and effect modification. In this paper we describe a general procedure for optimally applying any two sample summary data method using one sample data. Our procedure first performs a meta-analysis of summary data estimates that are intentionally contaminated by collider bias between the genetic instruments and unmeasured confounders, due to conditioning on the observed exposure. These estimates are then used to correct the standard observational association between an exposure and outcome. Simulations are conducted to demonstrate the method's performance against naive applications of two sample summary data MR. We apply the approach to the UK Biobank cohort to investigate the causal role of sleep disturbance on HbA1c levels, an important determinant of diabetes. Our approach can be viewed as a generalization of Dudbridge et al. (Nat. Comm. 10: 1561), who developed a technique to adjust for index event bias when uncovering genetic predictors of disease progression based on case-only data. Our work serves to clarify that in any one sample MR analysis, it can be advantageous to estimate causal relationships by artificially inducing and then correcting for collider bias.
Project description:Large-scale cross-sectional and cohort studies have transformed our understanding of the genetic and environmental determinants of health outcomes. However, the representativeness of these samples may be limited, either through selection into studies or by attrition from studies over time. Here we explore the potential impact of this selection bias on results obtained from these studies, from the perspective that this amounts to conditioning on a collider (i.e. a form of collider bias). Whereas it is acknowledged that selection bias will have a strong effect on representativeness and prevalence estimates, it is often assumed that it should not have a strong impact on estimates of associations. We argue that because selection can induce collider bias (which occurs when two variables independently influence a third variable, and that third variable is conditioned upon), selection can lead to substantially biased estimates of associations. In particular, selection related to phenotypes can bias associations with genetic variants associated with those phenotypes. In simulations, we show that even modest influences on selection into, or attrition from, a study can generate biased and potentially misleading estimates of both phenotypic and genotypic associations. Our results highlight the value of knowing which population your study sample is representative of. If the factors influencing selection and attrition are known, they can be adjusted for. For example, having DNA available on most participants in a birth cohort study offers the possibility of investigating the extent to which polygenic scores predict subsequent participation, which in turn would enable sensitivity analyses of the extent to which bias might distort estimates.
Project description:Mendelian randomization (MR) uses genetic variants as instrumental variables to investigate the causal effect of a risk factor on an outcome. A collider is a variable influenced by two or more other variables. Naive calculation of MR estimates in strata of the population defined by a collider, such as a variable affected by the risk factor, can result in collider bias. We propose an approach that allows MR estimation in strata of the population while avoiding collider bias. This approach constructs a new variable, the residual collider, as the residual from regression of the collider on the genetic instrument, and then calculates causal estimates in strata defined by quantiles of the residual collider. Estimates stratified on the residual collider will typically have an equivalent interpretation to estimates stratified on the collider, but they are not subject to collider bias. We apply the approach in several simulation scenarios considering different characteristics of the collider variable and strengths of the instrument. We then apply the proposed approach to investigate the causal effect of smoking on bladder cancer in strata of the population defined by bodyweight. The new approach generated unbiased estimates in all the simulation settings. In the applied example, we observed a trend in the stratum-specific MR estimates at different bodyweight levels that suggested stronger effects of smoking on bladder cancer among individuals with lower bodyweight. The proposed approach can be used to perform MR studying heterogeneity among subgroups of the population while avoiding collider bias.
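The residual collider construction described above can be sketched in a few lines. This is an illustrative simulation under assumed conditions, not the authors' implementation: a single continuous instrument, linear effects, a collider caused by the exposure, and a simple ratio (Wald) estimate within each stratum. All function names and parameter values are hypothetical.

```python
import numpy as np

def ratio_estimate(g, x, y):
    """Ratio (Wald) estimate of the causal effect: cov(G, Y) / cov(G, X)."""
    return np.cov(g, y)[0, 1] / np.cov(g, x)[0, 1]

def stratified_mr(g, x, y, stratifier, n_strata=4):
    """Ratio estimates within quantile strata of `stratifier`."""
    edges = np.quantile(stratifier, np.linspace(0, 1, n_strata + 1))
    labels = np.digitize(stratifier, edges[1:-1])  # stratum labels 0 .. n_strata - 1
    return np.array([
        ratio_estimate(g[labels == k], x[labels == k], y[labels == k])
        for k in range(n_strata)
    ])

def residual_collider(g, c):
    """Residual from a linear regression of the collider on the instrument."""
    slope, intercept = np.polyfit(g, c, 1)
    return c - (intercept + slope * g)

def simulate(n=200_000, beta=0.3, seed=1):
    """Hypothetical data: G -> X -> Y, U confounds X and Y, and X -> C (collider)."""
    rng = np.random.default_rng(seed)
    g = rng.normal(size=n)                 # genetic instrument
    u = rng.normal(size=n)                 # unmeasured confounder
    x = 0.5 * g + u + rng.normal(size=n)   # exposure
    c = x + 0.5 * rng.normal(size=n)       # collider affected by the exposure
    y = beta * x + u + rng.normal(size=n)  # outcome; true causal effect is beta
    return g, x, y, c

if __name__ == "__main__":
    g, x, y, c = simulate()
    print("stratified on collider:         ", stratified_mr(g, x, y, c).round(2))
    print("stratified on residual collider:", stratified_mr(g, x, y, residual_collider(g, c)).round(2))
```

In this simulation, stratifying directly on the collider makes the instrument correlated with the confounder within strata, so the naive stratum estimates fall far from the true effect, while stratifying on the residual collider keeps the instrument independent of the confounder within strata.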
Project description:Background: Healthcare-associated infections (HAIs) represent a major public health issue. Hospital-based prevalence studies are a common tool of HAI surveillance, but data quality problems and non-representativeness can undermine their reliability. Methods: This study proposes three algorithms that, given a convenience sample and variables relevant for the outcome of the study, select a subsample with specific distributional characteristics, boosting either representativeness (Probability and Distance procedures) or risk factors' balance (Uniformity procedure). A "Quality Score" (QS) was also developed to grade sampled units according to data completeness and reliability. The methodologies were evaluated through bootstrapping on a convenience sample of 135 hospitals collected during the 2016 Italian Point Prevalence Survey (PPS) on HAIs. Results: The QS highlighted wide variations in data quality among hospitals (median QS 52.9 points, range 7.98-628, lower meaning better quality), with most problems ascribable to ward and hospital-related data reporting. Both Distance and Probability procedures produced subsamples with lower distributional bias (Log-likelihood score increased from 7.3 to 29 points). The Uniformity procedure increased the homogeneity of the sample characteristics (e.g., -58.4% in geographical variability). The procedures selected hospitals with higher data quality, especially the Probability procedure (lower QS in 100% of bootstrap simulations). The Distance procedure produced lower HAI prevalence estimates (6.98% compared to 7.44% in the convenience sample), more in line with the European median. Conclusions: The QS and the subsampling procedures proposed in this study could represent effective tools to improve the quality of prevalence studies, decreasing the biases that can arise due to non-probabilistic sample collection.
Project description:The application of polygenic scores has transformed our ability to investigate whether and how genetic and environmental factors jointly contribute to the variation of complex traits. Modelling the complex interplay between genes and environment, however, raises serious methodological challenges. Here we illustrate the largely unrecognised impact of gene-environment dependencies on the identification of the effects of genes and their variation across environments. We show that controlling for heritable covariates in regression models that include polygenic scores as independent variables introduces endogenous selection bias when one or more of these covariates depends on unmeasured factors that also affect the outcome. This results in the problem of conditioning on a collider, which in turn leads to spurious associations and effect sizes. Using graphical and simulation methods we demonstrate that the degree of bias depends on the strength of the gene-covariate correlation and of hidden heterogeneity linking covariates with outcomes, regardless of whether the main analytic focus is mediation, confounding, or gene × covariate (commonly gene × environment) interactions. We offer potential solutions, highlighting the importance of causal inference. We also urge further caution when fitting and interpreting models with polygenic scores and non-exogenous environments or phenotypes and demonstrate how spurious associations are likely to arise, advancing our understanding of such results.
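The endogenous selection bias described above can be reproduced in a toy simulation. This is a minimal sketch under assumed conditions, not the simulation design of the study: the polygenic score has no effect on the outcome, the covariate is heritable (it depends on the score) and shares an unmeasured cause with the outcome, and all effects are linear. The names `ols_coefs` and `pgs_coefficients` and the parameters `gene_cov` and `hidden` are hypothetical.

```python
import numpy as np

def ols_coefs(y, *columns):
    """OLS slopes of y on the given columns (an intercept is fitted and dropped)."""
    design = np.column_stack([np.ones_like(y), *columns])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs[1:]

def pgs_coefficients(gene_cov=0.4, hidden=1.0, n=200_000, seed=2):
    """Return the PGS coefficient without and with adjustment for a heritable covariate.

    Assumed (hypothetical) model:
      covariate = gene_cov * pgs + hidden * u + noise
      outcome   =                  hidden * u + noise   (no true PGS effect)
    where u is unmeasured.
    """
    rng = np.random.default_rng(seed)
    pgs = rng.normal(size=n)
    u = rng.normal(size=n)  # unmeasured factor affecting covariate and outcome
    covariate = gene_cov * pgs + hidden * u + rng.normal(size=n)
    outcome = hidden * u + rng.normal(size=n)

    unadjusted = ols_coefs(outcome, pgs)[0]
    adjusted = ols_coefs(outcome, pgs, covariate)[0]
    return unadjusted, adjusted

if __name__ == "__main__":
    for gene_cov in (0.0, 0.2, 0.4):
        unadj, adj = pgs_coefficients(gene_cov=gene_cov)
        print(f"gene_cov={gene_cov:.1f}  unadjusted={unadj:+.3f}  adjusted={adj:+.3f}")
```

Under this model the adjusted coefficient tends to -gene_cov * hidden^2 / (hidden^2 + 1), so the spurious association grows with both the gene-covariate dependence and the strength of the hidden factor, and vanishes when the covariate is not heritable.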
Project description:Circulating microRNAs (miRNAs) have been shown to be excellent disease diagnostic or prognostic biomarkers in a wide range of chronic and acute inflammatory and infectious diseases including viral respiratory infection. Crucially, circulating miRNA levels are thought to reflect the state of the diseased tissue. Despite their proven value as mechanism-based clinical stratification indicators, miRNAs have only started being explored in the context of COVID-19. Here, we aimed to explore whether integrating miRNA with other clinical and biological measurements would reveal more accurate correlates of COVID-19 severity and outcome, and to identify severity-specific correlations of miRNAs with COVID-19-associated inflammatory mediators, clinical parameters, and outcome.
Project description:Collider bias, or stratifying data by a covariate consequence rather than cause (confounder) of treatment and outcome, plagues randomised and observational trauma research. Of the seven trials of prehospital hypertonic saline in dextran (HSD) that have been evaluated in systematic reviews, none found an overall between-group difference in survival, but four reported significant subgroup effects. We hypothesised that an avoidable type of collider bias often introduced inadvertently into trauma comparative effectiveness research could explain the incongruous findings. The two most recent HSD trials, a single-site pilot and a multi-site pivotal study, provided data for a secondary analysis to more closely examine the potential for collider bias. The two trials had followed the a priori statistical analysis plan to subgroup patients by a post-randomisation covariate and well-established surrogate for bleeding severity, massive transfusion (MT), ≥ 10 units of red blood cells within 24h of admission. Despite favourable HSD effects in the MT subgroup, opposite effects in the non-transfused subgroup halted the pivotal trial early. In addition to analyzing the data from the two trials, we constructed causal diagrams and performed a meta-analysis of the results from all seven trials to assess the extent to which collider bias could explain null overall effects with subgroup heterogeneity. As in previous trials, HSD induced significantly greater increases in systolic blood pressure (SBP) from prehospital to admission than control crystalloid (p=0.003). Proportionately more HSD than control decedents accrued in the non-transfused subgroup, but with paradoxically longer survival. 
Despite different study populations and a span of over 20 years across the seven trials, the reported mortality effects were consistently null, summary RR=0.99 (p=0.864, homogeneity p=0.709). HSD delayed blood transfusion by modifying standard triggers like SBP with no detectable effect on survival. The reported heterogeneous HSD effects in subgroups can be explained by collider bias that trauma researchers can avoid by improved covariate selection and data capture strategies.
Project description:Estimated genetic associations with prognosis, or conditional on a phenotype (e.g. disease incidence), may be affected by collider bias, whereby conditioning on the phenotype induces associations between causes of the phenotype and prognosis. We propose a method, 'Slope-Hunter', that uses model-based clustering to identify and utilise the class of variants only affecting the phenotype to estimate the adjustment factor, assuming this class explains more variation in the phenotype than any other variant classes. Simulation studies show that our approach eliminates the bias and outperforms alternatives even in the presence of genetic correlation. In a study of fasting blood insulin levels (FI) conditional on body mass index, we eliminate paradoxical associations of the underweight loci: COBLLI; PPARG with increased FI, and reveal an association for the locus rs1421085 (FTO). In an analysis of a case-only study for breast cancer mortality, a single region remains associated with more pronounced results.