Project description:Mechanistic network models specify the mechanisms by which networks grow and change, allowing researchers to investigate complex systems using both simulation and analytical techniques. Unfortunately, it is difficult to write likelihoods for instances of graphs generated with mechanistic models, and thus it is nearly impossible to estimate the parameters using maximum likelihood estimation. In this paper, we propose treating the node sequence in a growing network model as an additional parameter, or as a missing random variable, and maximizing over the resulting likelihood. We develop this framework in the context of a simple mechanistic network model, used to study gene duplication and divergence, and test a variety of algorithms for maximizing the likelihood in simulated graphs. We also run the best-performing algorithm on one human protein-protein interaction network and four nonhuman protein-protein interaction networks. Although we focus on a specific mechanistic network model, the proposed framework is more generally applicable to reversible models.
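To make the role of the node sequence concrete, the following is a minimal sketch of a generic duplication-divergence growth process. It is not the specific model from the paper: the two-node seed graph, the retention probability `p_retain`, and the function name `grow_duplication_divergence` are assumptions made for illustration. The sketch returns the arrival order alongside the graph; in observed data only the graph is available, which is why the order can be treated as a missing variable.

```python
import random


def grow_duplication_divergence(n_nodes, p_retain, seed=0):
    """Grow an undirected graph by repeated duplication and divergence.

    Returns (adjacency, order): adjacency maps node -> set of neighbours,
    order lists the nodes in the sequence in which they were added.
    """
    rng = random.Random(seed)
    # Seed graph: two connected nodes (an assumption of this sketch).
    adjacency = {0: {1}, 1: {0}}
    order = [0, 1]
    for new in range(2, n_nodes):
        anchor = rng.choice(order)  # existing node to duplicate
        # Divergence: each copied edge is kept independently with p_retain.
        kept = {v for v in sorted(adjacency[anchor]) if rng.random() < p_retain}
        adjacency[new] = set(kept)
        for v in kept:
            adjacency[v].add(new)
        order.append(new)
    return adjacency, order


adjacency, order = grow_duplication_divergence(n_nodes=50, p_retain=0.5, seed=1)
```

Discarding `order` and keeping only `adjacency` mimics what is observed in a real protein-protein interaction network.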
Project description:Mediation analysis is commonly used to identify mechanisms and intermediate factors between causes and outcomes. Studies drawing on polygenic scores (PGSs) can readily employ traditional regression-based procedures to assess whether trait M mediates the relationship between the genetic component of outcome Y and outcome Y itself. However, this approach suffers from attenuation bias, as PGSs capture only a (small) part of the genetic variance of a given trait. To overcome this limitation, we developed MA-GREML: a method for Mediation Analysis using Genome-based Restricted Maximum Likelihood (GREML) estimation. Using MA-GREML to assess mediation between genetic factors and traits comes with two main advantages. First, we circumvent the limited predictive accuracy of PGSs that regression-based mediation approaches suffer from. Second, compared to methods employing summary statistics from genome-wide association studies, the individual-level data approach of GREML allows direct control for confounders of the association between M and Y. In addition to typical GREML parameters (e.g., the genetic correlation), MA-GREML estimates (i) the effect of M on Y, (ii) the direct effect (i.e., the genetic variance of Y that is not mediated by M), and (iii) the indirect effect (i.e., the genetic variance of Y that is mediated by M). MA-GREML also provides standard errors of these estimates and assesses the significance of the indirect effect. We use analytical derivations and simulations to show the validity of our approach under two main assumptions, viz., that M precedes Y and that environmental confounders of the association between M and Y are controlled for. We conclude that MA-GREML is an appropriate tool to assess the mediating role of trait M in the relationship between the genetic component of Y and outcome Y.
Using data from the US Health and Retirement Study, we provide evidence that genetic effects on Body Mass Index (BMI), cognitive functioning and self-reported health in later life run partially through educational attainment. For mental health, we do not find significant evidence for an indirect effect through educational attainment. Further analyses show that the additive genetic factors of these four outcomes run partially (cognition and mental health) or fully (BMI and self-reported health) through an earlier realization of these traits.
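The attenuation bias that motivates MA-GREML can be illustrated with a small simulation. This is a sketch of the general errors-in-variables effect, not of MA-GREML itself: the sample size, the true slope, and the `reliability` value (the assumed share of genetic variance captured by the score) are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
true_slope = 1.0
reliability = 0.25  # assumed share of genetic variance captured by the PGS

g = rng.normal(size=n)                   # true genetic component, variance 1
y = true_slope * g + rng.normal(size=n)  # outcome driven by g plus noise

# Noisy score: g plus independent error, scaled so var(g) / var(pgs) == reliability.
noise_sd = np.sqrt((1 - reliability) / reliability)
pgs = g + rng.normal(scale=noise_sd, size=n)


def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    x_c = x - x.mean()
    return float(x_c @ (y - y.mean()) / (x_c @ x_c))


slope_true = ols_slope(g, y)   # regression on the true genetic component
slope_pgs = ols_slope(pgs, y)  # regression on the noisy score
```

Under these assumptions, `slope_true` should be close to `true_slope`, while `slope_pgs` should shrink toward `true_slope * reliability`. Any mediation estimate built on the noisy score inherits this shrinkage.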
Project description:Summary: Genetic correlations are the genome-wide aggregate effects of causal variants affecting multiple traits. Traditionally, genetic correlations between complex traits are estimated from pedigree studies, but such estimates can be confounded by shared environmental factors. Moreover, for diseases, low prevalence rates imply that even if the true genetic correlation between disorders was high, co-aggregation of disorders in families might not occur or could not be distinguished from chance. We have developed and implemented statistical methods based on linear mixed models to obtain unbiased estimates of the genetic correlation between pairs of quantitative traits or pairs of binary traits of complex diseases using population-based case-control studies with genome-wide single-nucleotide polymorphism data. The method is validated in a simulation study and applied to estimate genetic correlation between various diseases from Wellcome Trust Case Control Consortium data in a series of bivariate analyses. We estimate a significant positive genetic correlation between risk of Type 2 diabetes and hypertension of ~0.31 (SE 0.14, P = 0.024). Availability: Our methods, appropriate for both quantitative and binary traits, are implemented in the freely available software GCTA (http://www.complextraitgenomics.com/software/gcta/reml_bivar.html). Contact: hong.lee@uq.edu.au. Supplementary information: Supplementary data are available at Bioinformatics online.
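For reference, the genetic correlation between two traits is conventionally defined as the additive genetic covariance scaled by the two additive genetic standard deviations. The notation below is an assumption of this note and does not appear in the abstract: $\sigma_{g_{12}}$ is the genetic covariance between traits 1 and 2, and $\sigma^2_{g_1}$, $\sigma^2_{g_2}$ are their genetic variances.

```latex
r_g = \frac{\sigma_{g_{12}}}{\sqrt{\sigma^2_{g_1}\,\sigma^2_{g_2}}}
```

In a bivariate linear mixed model, all three components are estimated jointly, and $r_g$ is then computed from the estimates.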
Project description:Markov models of codon substitution naturally incorporate the structure of the genetic code and the selection intensity at the protein level, providing a more realistic representation of protein-coding sequences compared with nucleotide or amino acid models. Thus, for protein-coding genes, phylogenetic inference is expected to be more accurate under codon models. So far, phylogeny reconstruction under codon models has been elusive due to the computational difficulties of dealing with high-dimensional matrices. Here, we present a fast maximum likelihood (ML) package for phylogenetic inference, CodonPhyML, offering hundreds of different codon models, the largest variety to date, for phylogeny inference by ML. CodonPhyML is tested on simulated and real data and is shown to offer excellent speed and convergence properties. In addition, CodonPhyML includes the most recent fast methods for estimating phylogenetic branch supports and provides an integral framework for model selection, including amino acid and DNA models.
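The size of the matrices involved can be illustrated by enumerating the state space. The sketch below assumes the standard genetic code (three stop codons); it does not use or describe CodonPhyML's own code, and the variable names are illustrative.

```python
from itertools import product

BASES = "TCAG"
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code

# Sense codons form the state space of a codon substitution model.
sense_codons = [
    "".join(triplet)
    for triplet in product(BASES, repeat=3)
    if "".join(triplet) not in STOP_CODONS
]
n_states = len(sense_codons)

# Number of entries in the substitution rate matrix for each model type.
matrix_entries = {
    "nucleotide": 4 ** 2,
    "amino acid": 20 ** 2,
    "codon": n_states ** 2,
}
```

With 61 sense codons, the codon rate matrix has 3721 entries, compared with 16 for a nucleotide model and 400 for an amino acid model. Likelihood calculations on a tree repeatedly exponentiate and multiply matrices of this size.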
Project description:Genetic correlation is a key population parameter that describes the shared genetic architecture of complex traits and diseases. It can be estimated by current state-of-the-art methods, i.e., linkage disequilibrium score regression (LDSC) and genomic restricted maximum likelihood (GREML). The massively reduced computing burden of LDSC compared to GREML makes it an attractive tool, although the accuracy (i.e., magnitude of standard errors) of LDSC estimates has not been thoroughly studied. In simulation, we show that the accuracy of GREML is generally higher than that of LDSC. When there is genetic heterogeneity between the actual sample and reference data from which LD scores are estimated, the accuracy of LDSC decreases further. In real data analyses estimating the genetic correlation between schizophrenia (SCZ) and body mass index, we show that GREML estimates based on ∼150,000 individuals give a higher accuracy than LDSC estimates based on ∼400,000 individuals (from combined meta-data). A GREML genomic partitioning analysis reveals that the genetic correlation between SCZ and height is significantly negative for regulatory regions, which whole-genome or LDSC approaches have less power to detect. We conclude that LDSC estimates should be carefully interpreted as there can be uncertainty about homogeneity among combined meta-datasets. We suggest that any interesting findings from massive LDSC analysis for a large number of complex traits should be followed up, where possible, with more detailed analyses with GREML methods, even if sample sizes are smaller.
Project description:Panel count data, in which the observation for each study subject consists of the number of recurrent events between successive examinations, are commonly encountered in industrial reliability testing, medical research, and various other scientific investigations. We formulate the effects of potentially time-dependent covariates on one or more types of recurrent events through non-homogeneous Poisson processes with random effects. We adopt nonparametric maximum likelihood estimation under arbitrary examination schemes and develop a simple and stable EM algorithm. We show that the resulting estimators of the regression parameters are consistent and asymptotically normal, with a covariance matrix that achieves the semiparametric efficiency bound and can be estimated through profile likelihood. We evaluate the performance of the proposed methods through extensive simulation studies and present an application to a skin cancer clinical trial.
Project description:Many protein sequences have distinct domains that evolve at different rates, under different selective pressures, or with different codon bias. Instead of modeling these differences with increasingly complex models of molecular evolution, we present a multipartition approach that allows maximum-likelihood phylogeny inference using different codon models at predefined partitions in the data. Partition models can, but do not have to, share free parameters in the estimation process. We test this approach with simulated data as well as in a phylogenetic study of the origin of the leucine-rich repeat regions in the type III effector proteins of the phytopathogenic bacterium Ralstonia solanacearum. Our study not only shows that a simple two-partition model resolves the phylogeny better than a one-partition model but also provides more evidence supporting the hypothesis of lateral gene transfer events between the bacterial pathogens and their eukaryotic hosts.
Project description:Interval censoring arises frequently in clinical, epidemiological, financial and sociological studies, where the event or failure of interest is known only to occur within an interval induced by periodic monitoring. We formulate the effects of potentially time-dependent covariates on the interval-censored failure time through a broad class of semiparametric transformation models that encompasses proportional hazards and proportional odds models. We consider nonparametric maximum likelihood estimation for this class of models with an arbitrary number of monitoring times for each subject. We devise an EM-type algorithm that converges stably, even in the presence of time-dependent covariates, and show that the estimators for the regression parameters are consistent, asymptotically normal, and asymptotically efficient with an easily estimated covariance matrix. Finally, we demonstrate the performance of our procedures through simulation studies and application to an HIV/AIDS study conducted in Thailand.
Project description:HIV dynamics studies, based on differential equations, have significantly improved knowledge of HIV infection. While the first studies used simplified short-term dynamic models, recent works have considered more complex long-term models combined with a global analysis of whole patient data based on nonlinear mixed models, increasing the accuracy of the HIV dynamic analysis. However, statistical issues remain, given the complexity of the problem. We proposed to use the SAEM (stochastic approximation expectation-maximization) algorithm, a powerful maximum likelihood estimation algorithm, to analyze simultaneously the HIV viral load decrease and the CD4 increase in patients using a long-term HIV dynamic system. We applied the proposed methodology to the prospective COPHAR2-ANRS 111 trial. Very satisfactory results were obtained with a model with latent CD4 cells defined with five differential equations. One parameter was fixed; the 10 remaining parameters (eight with between-patient variability) of this model were well estimated. We showed that the efficacy of nelfinavir was reduced compared to indinavir and lopinavir.
Project description:Interval-censored multivariate failure time data arise when there are multiple types of failure or there is clustering of study subjects and each failure time is known only to lie in a certain interval. We investigate the effects of possibly time-dependent covariates on multivariate failure times by considering a broad class of semiparametric transformation models with random effects, and we study nonparametric maximum likelihood estimation under general interval-censoring schemes. We show that the proposed estimators for the finite-dimensional parameters are consistent and asymptotically normal, with a limiting covariance matrix that attains the semiparametric efficiency bound and can be consistently estimated through profile likelihood. In addition, we develop an EM algorithm that converges stably for arbitrary datasets. Finally, we assess the performance of the proposed methods in extensive simulation studies and illustrate their application using data derived from the Atherosclerosis Risk in Communities Study.