Dataset Information

Performance of criteria for selecting evolutionary models in phylogenetics: a comprehensive study based on simulated datasets.

ABSTRACT: BACKGROUND: Explicit evolutionary models are required in maximum-likelihood and Bayesian inference, the two methods that are overwhelmingly used in phylogenetic studies of DNA sequence data. Appropriate selection of nucleotide substitution models is important because the use of incorrect models can mislead phylogenetic inference. To better understand the performance of different model-selection criteria, we used 33,600 simulated data sets to analyse the accuracy, precision, dissimilarity, and biases of the hierarchical likelihood-ratio test, Akaike information criterion, Bayesian information criterion, and decision theory. RESULTS: We demonstrate that the Bayesian information criterion and decision theory are the most appropriate model-selection criteria because of their high accuracy and precision. Our results also indicate that in some situations different models are selected by different criteria for the same dataset. Such dissimilarity was the highest between the hierarchical likelihood-ratio test and Akaike information criterion, and lowest between the Bayesian information criterion and decision theory. The hierarchical likelihood-ratio test performed poorly when the true model included a proportion of invariable sites, while the Bayesian information criterion and decision theory generally exhibited similar performance to each other. CONCLUSIONS: Our results indicate that the Bayesian information criterion and decision theory should be preferred for model selection. Together with model-adequacy tests, accurate model selection will serve to improve the reliability of phylogenetic inference and related analyses.
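
As an illustration of how two of the criteria named in the abstract can disagree on the same data, the following minimal sketch scores three candidate substitution models with the Akaike and Bayesian information criteria. The log-likelihoods are hypothetical values invented for the example, not results from the study. The sketch assumes that only substitution-model parameters are counted (branch lengths are ignored) and that alignment length serves as the sample size. Both are common simplifications and not necessarily the settings the authors used.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian information criterion: k ln(n) - 2 ln L (lower is better)."""
    return k * math.log(n) - 2 * log_likelihood


# Hypothetical fits: model name -> (log-likelihood, free substitution parameters).
# The values are illustrative only and are not taken from the study.
candidates = {
    "JC69": (-5230.0, 0),
    "HKY85": (-5105.0, 4),
    "GTR+I+G": (-5096.0, 10),
}
n_sites = 1000  # assumed sample size: number of alignment columns


def select(models, score):
    """Return the name of the model with the lowest score."""
    return min(models, key=lambda name: score(*models[name]))


best_aic = select(candidates, aic)
best_bic = select(candidates, lambda lnl, k: bic(lnl, k, n_sites))

print("AIC selects:", best_aic)
print("BIC selects:", best_bic)
```

With these invented numbers the two criteria pick different models, because the BIC penalty per parameter, ln(n), exceeds the AIC penalty of 2 whenever n is greater than about 7.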

SUBMITTER: Luo A

PROVIDER: S-EPMC2925852 | biostudies-other | 2010

REPOSITORIES: biostudies-other


Publications

Performance of criteria for selecting evolutionary models in phylogenetics: a comprehensive study based on simulated datasets.

Luo Arong, Qiao Huijie, Zhang Yanzhou, Shi Weifeng, Ho Simon YW, Xu Weijun, Zhang Aibing, Zhu Chaodong

BMC Evolutionary Biology, 2010-08-09


PMID: 20696057

Similar Datasets

Project description: Background: Ariids or sea catfishes are one of the two otophysan fish families (out of about 67 families in four orders) that inhabit mainly marine and brackish waters (although some species occur strictly in fresh waters). The group includes over 150 species placed in approximately 29 genera and two subfamilies (Galeichthyinae and Ariinae). Despite their global distribution, ariids are largely restricted to the continental shelves due in part to their specialized reproductive behavior (i.e., oral incubation). Thus, among marine fishes, ariids offer an excellent opportunity for inferring historical biogeographic scenarios. Phylogenetic hypotheses available for ariids have focused on restricted geographic areas and comprehensive phylogenies are still missing. This study inferred phylogenetic hypotheses for 123 ariid species in 28 genera from different biogeographic provinces using both mitochondrial and nuclear sequences (up to approximately 4 kb). Results: While the topologies obtained support the monophyly of basal groups, up to ten genera validated in previous morphological studies were incongruent with the molecular topologies. New World ariines were recovered as paraphyletic and Old World ariines were grouped into a well-supported clade that was further divided into subclades mainly restricted to major Gondwanan landmasses. A general area cladogram derived from the area cladograms of ariines and three other fish groups was largely congruent with the geological area cladogram of Gondwana. Nonetheless, molecular clock estimations provided variable results on the timing of ariine diversification (approximately 105-41 mya). Conclusion: This study provides the most comprehensive phylogeny of sea catfishes to date and highlights the need for re-assessment of their classification. While from a topological standpoint the evolutionary history of ariines is mostly congruent with vicariance associated with the sequence of events during Gondwanan fragmentation, ambiguous divergence time estimations hinder assessing the vicariant hypothesis on a temporal framework. Further examination of ariid fossils might provide the basis for more accurate inferences on the timing of ariine diversification.

Project description: Calculating amino-acid substitution models that are specific for individual protein data sets is often difficult due to the computational burden of estimating large numbers of rate parameters. In this study, we tested the computational efficiency and accuracy of five methods used to estimate substitution models, namely Codeml, FastMG, IQ-TREE, P4 (maximum likelihood), and P4 (Bayesian inference). Data-specific substitution models were estimated from simulated alignments (with different lengths) that were generated from a known simulation model and simulation tree. Each of the resulting data-specific substitution models was used to calculate the maximum likelihood score of the simulation tree and simulated data that was used to calculate the model, and compared with the maximum likelihood scores of the known simulation model and simulation tree on the same simulated data. Additionally, the commonly-used empirical models, cpREV and WAG, were assessed similarly. Data-specific models performed better than the empirical models, which under-fitted the simulated alignments, had the highest difference to the simulation model maximum-likelihood score, clustered further from the simulation model in principal component analysis ordination, and inferred less accurate trees. Data-specific models and the simulation model shared statistically indistinguishable maximum-likelihood scores, indicating that the five methods were reasonably accurate at estimating substitution models by this measure. Nevertheless, tree statistics showed differences between optimal maximum likelihood trees. Unlike other model estimating methods, trees inferred using data-specific models generated with IQ-TREE and P4 (maximum likelihood) were not significantly different from the trees derived from the simulation model in each analysis, indicating that these two methods alone were the most accurate at estimating data-specific models.
To show the benefits of using data-specific protein models, several published data sets were reanalysed using IQ-TREE-estimated models. These newly estimated models were a better fit to the data than the empirical models that were used by the original authors, often inferred longer trees, and resulted in different tree topologies in more than half of the re-analysed data sets. The results of this study show that software availability and high computation burden are not limitations to generating better-fitting data-specific amino-acid substitution models for phylogenetic analyses.

Project description: Background: Genomic biomarkers play an increasing role in both preclinical and clinical application. Development of genomic biomarkers with microarrays is an area of intensive investigation. However, despite sustained and continuing effort, developing microarray-based predictive models (i.e., genomics biomarkers) capable of reliable prediction for an observed or measured outcome (i.e., endpoint) of unknown samples in preclinical and clinical practice remains a considerable challenge. No straightforward guidelines exist for selecting a single model that will perform best when presented with unknown samples. In the second phase of the MicroArray Quality Control (MAQC-II) project, 36 analysis teams produced a large number of models for 13 preclinical and clinical endpoints. Before external validation was performed, each team nominated one model per endpoint (referred to here as 'nominated models') from which MAQC-II experts selected 13 'candidate models' to represent the best model for each endpoint. Both the nominated and candidate models from MAQC-II provide benchmarks to assess other methodologies for developing microarray-based predictive models. Methods: We developed a simple ensemble method by taking a number of the top performing models from cross-validation and developing an ensemble model for each of the MAQC-II endpoints. We compared the ensemble models with both nominated and candidate models from MAQC-II using blinded external validation. Results: For 10 of the 13 MAQC-II endpoints originally analyzed by the MAQC-II data analysis team from the National Center for Toxicological Research (NCTR), the ensemble models achieved equal or better predictive performance than the NCTR nominated models. Additionally, the ensemble models had performance comparable to the MAQC-II candidate models. Most ensemble models also had better performance than the nominated models generated by five other MAQC-II data analysis teams that analyzed all 13 endpoints. Conclusions: Our findings suggest that an ensemble method can often attain a higher average predictive performance in an external validation set than a corresponding "optimized" model method. Using an ensemble method to determine a final model is a potentially important supplement to the good modeling practices recommended by the MAQC-II project for developing microarray-based genomic biomarkers.

