Dataset Information

Scalable Empirical Mixture Models That Account for Across-Site Compositional Heterogeneity.

ABSTRACT: Biochemical demands constrain the range of amino acids acceptable at specific sites, resulting in across-site compositional heterogeneity of the amino acid replacement process. Phylogenetic models that disregard this heterogeneity are prone to systematic errors, which can lead to severe long-branch attraction artifacts. State-of-the-art models accounting for across-site compositional heterogeneity include the CAT model, which is computationally expensive, and empirical distribution mixture models estimated via maximum likelihood (C10-C60 models). Here, we present a new, scalable method, EDCluster, for finding empirical distribution mixture models involving a simple cluster analysis. The cluster analysis utilizes specific coordinate transformations which allow the detection of specialized amino acid distributions either from curated databases or from the alignment at hand. We apply EDCluster to the HOGENOM and HSSP databases in order to provide universal distribution mixture (UDM) models comprising up to 4,096 components. Detailed analyses of the UDM models demonstrate the removal of various long-branch attraction artifacts and improved performance compared with the C10-C60 models. Ready-to-use implementations of the UDM models are provided for three established software packages (IQ-TREE, PhyloBayes, and RevBayes).
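
The abstract describes finding mixture components by clustering site-wise amino acid distributions after a coordinate transformation. The following is a minimal sketch of that general idea, not the EDCluster implementation: the centered log-ratio transform, the pseudocount, the plain k-means with a deterministic start, the cluster-size weights, and the toy 4-state data are all assumptions made for illustration, and every function name is hypothetical.

```python
import math

def clr(freqs, pseudo=1e-3):
    """Centered log-ratio transform (an assumed choice of coordinate transformation)."""
    smoothed = [f + pseudo for f in freqs]          # avoid log(0)
    total = sum(smoothed)
    logs = [math.log(s / total) for s in smoothed]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=50):
    """Lloyd's k-means; centers start at the first k points (deterministic)."""
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, new_labels) if l == c]
            if members:                              # keep old center if cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:                     # assignments stable: converged
            break
        labels = new_labels
    return labels

def mixture_components(site_freqs, k):
    """Cluster transformed site distributions; return labels and (weight, mean) pairs."""
    labels = kmeans([clr(f) for f in site_freqs], k)
    comps = []
    for c in range(k):
        members = [f for f, l in zip(site_freqs, labels) if l == c]
        if not members:
            continue
        weight = len(members) / len(site_freqs)      # assumed: weight = cluster share
        mean = [sum(col) / len(members) for col in zip(*members)]
        comps.append((weight, mean))
    return labels, comps

# Toy example with a hypothetical 4-state alphabet (real data would have 20 states).
sites = [
    [0.90, 0.05, 0.03, 0.02],
    [0.02, 0.03, 0.05, 0.90],
    [0.85, 0.10, 0.03, 0.02],
    [0.03, 0.02, 0.10, 0.85],
]
labels, comps = mixture_components(sites, 2)
```

Here each component is summarized as the mean of the untransformed distributions in its cluster. Clustering in transformed coordinates rather than on raw frequencies is meant to make distributions concentrated on a few states easier to separate, which is the motivation the abstract gives for the transformations.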

SUBMITTER: Schrempf D

PROVIDER: S-EPMC7743758 | biostudies-literature | 2020 Dec

REPOSITORIES: biostudies-literature


Publications

Scalable Empirical Mixture Models That Account for Across-Site Compositional Heterogeneity.

Dominik Schrempf, Nicolas Lartillot, Gergely Szöllősi

Molecular Biology and Evolution, 2020 Dec 1, issue 12


PMID: 32877529

Similar Datasets

Project description: Landscape structure affects animal movement. Differences between landscapes may induce heterogeneity in home range size and movement rates among individuals within a population. These types of heterogeneity can cause bias when estimating population size or density and are seldom considered during analyses. Individual heterogeneity, attributable to unknown or unobserved covariates, is often modelled using latent mixture distributions, but these are demanding of data, and abundance estimates are sensitive to the parameters of the mixture distribution. A recent extension of spatially explicit capture-recapture models allows landscape structure to be modelled explicitly by incorporating landscape connectivity using non-Euclidean least-cost paths, improving inference, especially in highly structured (riparian and mountainous) landscapes. Our objective was to investigate whether these novel models could improve inference about black bear (Ursus americanus) density. We fit spatially explicit capture-recapture models with standard and complex structures to black bear data from 51 separate study areas. We found that non-Euclidean models were supported in over half of our study areas. Associated density estimates were higher and less precise than those from simple models and only slightly more precise than those from finite mixture models. Estimates were sensitive to the scale (pixel resolution) at which least-cost paths were calculated, but there was no consistent pattern across covariates or resolutions. Our results indicate that negative bias associated with ignoring heterogeneity is potentially severe. However, the most popular method for dealing with this heterogeneity (finite mixtures) yielded potentially unreliable point estimates of abundance that may not be comparable across surveys, even in data sets with 136-350 total detections, 3-5 detections per individual, 97-283 recaptures, and 80-254 spatial recaptures. In these same study areas with high sample sizes, we expected that landscape features would not severely constrain animal movements and modelling non-Euclidean distance would not consistently improve inference. Our results suggest caution in applying non-Euclidean SCR models when there is no clear landscape covariate that is known to strongly influence the movement of the focal species, and in applying finite mixture models except when abundant data are available.
