Dataset Information

Relative Evolutionary Rates in Proteins Are Largely Insensitive to the Substitution Model.

ABSTRACT: The relative evolutionary rates at individual sites in proteins are informative measures of conservation or adaptation. Often used as evolutionarily aware conservation scores, relative rates reveal key functional or strongly selected residues. Estimating rates in a phylogenetic context requires specifying a protein substitution model, which is typically a phenomenological model trained on a large empirical data set. A strong emphasis has traditionally been placed on selecting the "best-fit" model, with the implicit understanding that suboptimal or otherwise ill-fitting models might bias inferences. However, the pervasiveness and degree of such bias have not been systematically examined. We investigated how model choice impacts site-wise relative rates in a large set of empirical protein alignments. We compared models designed for use on any general protein, models designed for specific domains of life, and the simple equal-rates Jukes-Cantor-style model (JC). As expected, information-theoretic measures showed overwhelming evidence that some models fit the data decidedly better than others. By contrast, estimates of site-specific evolutionary rates were impressively insensitive to the substitution model used, revealing an unexpected degree of robustness to potential model misspecification. A deeper examination of the fewer than 5% of sites for which model inferences differed in a meaningful way showed that the JC model could uniquely identify rapidly evolving sites that models with empirically derived exchangeabilities failed to detect. We conclude that relative protein rates appear robust to the applied substitution model, and any sensible model of protein evolution, regardless of its fit to the data, should produce broadly consistent evolutionary rates.
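
The kind of comparison the abstract describes, checking whether two substitution models yield consistent site-wise relative rates, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the mean-1 scaling convention, and the choice of Spearman rank correlation as the agreement measure are all assumptions made for illustration, and the rate values in the usage note are made up.

```python
from statistics import mean


def relative_rates(rates):
    """Scale per-site rates so that their mean is 1 (assumed convention)."""
    m = mean(rates)
    if m == 0:
        raise ValueError("mean rate is zero; cannot normalize")
    return [r / m for r in rates]


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

For example, given hypothetical rate vectors inferred for the same alignment under two models, `spearman(relative_rates(rates_model_a), relative_rates(rates_model_b))` returns a value near 1 when both models rank the sites in nearly the same order, which is the sense of "insensitive to the substitution model" sketched here.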

SUBMITTER: Spielman SJ

PROVIDER: S-EPMC6107055 | biostudies-literature | 2018 Sep

REPOSITORIES: biostudies-literature

Publications

Relative Evolutionary Rates in Proteins Are Largely Insensitive to the Substitution Model.

Spielman SJ, Kosakovsky Pond SL

Molecular Biology and Evolution, 2018-09-01, issue 9

PMID: 29924340

Similar Datasets

Project description: Motivation: In a nucleotide or amino acid sequence, not all sites evolve at the same rate, due to differing selective constraints at each site. Currently in computational molecular evolution, models incorporating rate heterogeneity always share two assumptions. First, the rate of evolution at each site is assumed to be independent of every other site. Second, the values of these rates are assumed to be drawn from a known prior distribution. Although often assumed to be small, the actual effect of these assumptions has not been previously quantified in the literature. Results: Herein we describe an algorithm to simultaneously infer the set of n-1 relative rates that parameterize the likelihood of an n-site alignment. Unlike previous work, (a) these relative rates are completely identifiable and distinct from the branch-length parameters, and (b) a far more general class of rate priors can be used, and their effects quantified. Although described in a Bayesian framework, we discuss a future maximum likelihood extension. Conclusions: Using both synthetic data and alignments from the Myc, Max and p53 protein families, we find that inferring relative rather than absolute rates has several advantages. First, both empirical likelihoods and Bayes factors show strong preference for the relative-rate model, with a mean Delta ln P = -0.458 per alignment site. Second, the computed likelihoods and Bayes factors were essentially independent of the relative-rate prior, indicating that good estimates of the posterior rate distribution are not required a priori. Third, a novel finding is that rates can be accurately inferred even when up to approximately 4 substitutions per site have occurred. Thus biologically relevant putative hypervariable sites can be identified as easily as conserved sites. Lastly, our model treats rates and tree branch-lengths as completely identifiable, allowing for the first time coherent simultaneous inference of branch-lengths and site-specific evolutionary rates. Availability: Source code for the utility described is available under a BSD-style license at http://www.fernandes.org/txp/article/9/site-specific-relative-evolutionary-rates.

Project description: Background: An accurate timescale of evolutionary history is essential to testing hypotheses about the influence of historical events and processes, and the timescale for evolution is increasingly derived from analysis of DNA sequences. But variation in the rate of molecular evolution complicates the inference of time from DNA. Evidence is growing for numerous factors, such as life history and habitat, that are linked both to the molecular processes of mutation and fixation and to rates of macroevolutionary diversification. However, the most widely used methods rely on idealised models of rate variation, such as the uncorrelated and autocorrelated clocks, and molecular dating methods are rarely tested against complex models of rate change. One relationship that is not accounted for in molecular dating is the potential for interaction between molecular substitution rates and speciation, a relationship that has been supported by empirical studies in a growing number of taxa. If these relationships are as widespread as current evidence suggests, they may have a significant influence on molecular dates. Results: We simulate phylogenies and molecular sequences under three different realistic rate variation models: one in which speciation rates and substitution rates both vary but are unlinked, one in which they covary continuously, and one punctuated model in which molecular change is concentrated in speciation events, using empirical case studies to parameterise realistic simulations. We test three commonly used "relaxed clock" molecular dating methods against these realistic simulations to explore the degree of error in molecular dates under each model. We find average divergence time inference errors ranging from 12% of node age for the unlinked model when reconstructed under an uncorrelated rate prior using BEAST 2, to up to 91% when sequences evolved under the punctuated model are reconstructed under an autocorrelated prior using PAML. Conclusions: We demonstrate the potential for substantial errors in molecular dates when both speciation rates and substitution rates vary between lineages. This study highlights the need for tests of molecular dating methods against realistic models of rate variation generated from empirical parameters and known relationships.

Project description: Background: Multiple studies have demonstrated that partitioning of molecular datasets is important in model-based phylogenetic analyses. Commonly, partitioning is done a priori based on some known properties of sequence evolution, e.g. differences in rate of evolution among codon positions of a protein-coding gene. Here we propose a new method for data partitioning based on relative evolutionary rates of the sites in the alignment of the dataset being analysed. The rates are inferred using the previously published Tree Independent Generation of Evolutionary Rates (TIGER), and the partitioning is conducted using our novel Python script RatePartitions. We conducted simulations to assess the performance of our new method, and we applied it to eight published multi-locus phylogenetic datasets, representing different taxonomic ranks within the insect order Lepidoptera (butterflies and moths) and one phylogenomic dataset, which included ultra-conserved elements as well as introns. Methods: We used TIGER-rates to generate relative evolutionary rates for all sites in the alignments. Then, using RatePartitions, we partitioned the data into partitions based on their relative evolutionary rate. RatePartitions applies a simple formula that ensures a distribution of sites into partitions following the distribution of rates of the characters from the full dataset. This ensures that the invariable sites are placed in a partition with slowly evolving sites, avoiding the pitfalls of previously used methods, such as k-means. Different partitioning strategies were evaluated using BIC scores as calculated by PartitionFinder. Results: Simulations did not highlight any misbehaviour of our partitioning approach, even under difficult parameter conditions or missing data. In all eight phylogenetic datasets, partitioning using TIGER-rates and RatePartitions was significantly better as measured by the BIC scores than other partitioning strategies, such as the commonly used partitioning by gene and codon position. We compared the resulting topologies and node support for these eight datasets as well as for the phylogenomic dataset. Discussion: We developed a new method of partitioning phylogenetic datasets without using any prior knowledge (e.g. DNA sequence evolution). This method is entirely based on the properties of the data being analysed and can be applied to DNA sequences (protein-coding, introns, ultra-conserved elements), protein sequences, as well as morphological characters. A likely explanation for why our method performs better than other tested partitioning strategies is that it accounts for the heterogeneity in the data to a much greater extent than when data are simply subdivided based on prior knowledge.

Project description:Molecular evolutionary theory predicts that the ratio of autosomal to X-linked adaptive substitution (K(A)/K(x)) is primarily determined by the average dominance coefficient of beneficial mutations. Although this theory has profoundly influenced analysis and interpretation of comparative genomic data, its predictions are based upon two unverified assumptions about the genetic basis of adaptation. The theory assumes that 1) the rate of adaptively driven molecular evolution is limited by the availability of beneficial mutations, and 2) the scaling of evolutionary parameters between the X and the autosomes (e.g., the beneficial mutation rate, and the fitness effect distribution of beneficial alleles, per X-linked versus autosomal locus) is constant across molecular evolutionary timescales. Here, we show that the genetic architecture underlying bouts of adaptive substitution can influence both assumptions, and consequently, the theoretical relationship between K(A)/K(x) and mean dominance. Quantitative predictions of prior theory apply when 1) many genomically dispersed genes potentially contribute beneficial substitutions during individual steps of adaptive walks, and 2) the population beneficial mutation rate, summed across the set of potentially contributing genes, is sufficiently small to ensure that adaptive substitutions are drawn from new mutations rather than standing genetic variation. Current research into the genetic basis of adaptation suggests that both assumptions are plausibly violated. We find that the qualitative positive relationship between mean dominance and K(A)/K(x) is relatively robust to the specific conditions underlying adaptive substitution, yet the quantitative relationship between dominance and K(A)/K(x) is quite flexible and context dependent. This flexibility may partially account for the puzzlingly variable X versus autosome substitution patterns reported in the empirical evolutionary genomics literature. 
The new theory unites the previously separate analysis of adaptation using new mutations versus standing genetic variation and makes several useful predictions about the interaction between genetic architecture, evolutionary genetic constraints, and effective population size in determining the ratio of adaptive substitution between autosomal and X-linked genes.
