Dataset Information

Employing phylogenetic tree shape statistics to resolve the underlying host population structure.

ABSTRACT:

Background

Host population structure is a key determinant of pathogen and infectious disease transmission patterns. Pathogen phylogenetic trees are useful tools to reveal the population structure underlying an epidemic. Determining whether a population is structured or not is useful in informing the type of phylogenetic methods to be used in a given study. We employ tree statistics derived from phylogenetic trees and machine learning classification techniques to reveal an underlying population structure.

Results

In this paper, we simulate phylogenetic trees from both structured and non-structured host populations. We compute eight statistics for the simulated trees: the number of cherries; the Sackin, Colless and total cophenetic indices; ladder length; maximum depth; maximum width; and width-to-depth ratio. Based on the estimated tree statistics, we classify the simulated trees as coming from either a non-structured or a structured population using decision tree (DT), K-nearest neighbor (KNN) and support vector machine (SVM) classifiers. We incorporate the basic reproductive number ([Formula: see text]) in our tree simulation procedure. Sensitivity analysis is done to investigate whether the classifiers are robust to different choices of model parameters and to tree size. Cross-validated results for the area under the curve (AUC) of receiver operating characteristic (ROC) curves yield mean values of over 0.9 for most of the classification models.
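As a rough illustration of how a few of the statistics named above can be computed, the following minimal Python sketch handles a rooted binary tree. The nested-tuple tree encoding, the function name `tree_shape_stats` and the convention of measuring depth in edges from the root are assumptions made for this example; they are not taken from the paper, whose software and exact definitions may differ.

```python
def tree_shape_stats(tree):
    """Compute simple shape statistics for a rooted binary tree.

    Assumed encoding (hypothetical, for illustration only): a leaf is any
    non-tuple value, and an internal node is a 2-tuple (left, right).
    """
    stats = {"cherries": 0, "sackin": 0, "colless": 0, "max_depth": 0}

    def visit(node, depth):
        # Returns the number of leaves in the subtree rooted at `node`.
        if not isinstance(node, tuple):
            # Sackin index: sum of the depths of all leaves.
            stats["sackin"] += depth
            stats["max_depth"] = max(stats["max_depth"], depth)
            return 1
        left, right = node
        n_left = visit(left, depth + 1)
        n_right = visit(right, depth + 1)
        # A cherry is an internal node whose two children are both leaves.
        if n_left == 1 and n_right == 1:
            stats["cherries"] += 1
        # Colless index: sum over internal nodes of |leaves left - leaves right|.
        stats["colless"] += abs(n_left - n_right)
        return n_left + n_right

    stats["leaves"] = visit(tree, 0)
    return stats


# A fully balanced tree and a fully unbalanced ("caterpillar") tree, 4 leaves each.
balanced = (("a", "b"), ("c", "d"))
caterpillar = ((("a", "b"), "c"), "d")
print(tree_shape_stats(balanced))
print(tree_shape_stats(caterpillar))
```

The recursion is fine for small examples; for very large or very unbalanced trees an iterative traversal would avoid Python's recursion limit.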

Conclusions

Our classification procedure distinguishes well between trees from structured and non-structured populations, as shown by the classifiers, the two-sample Kolmogorov-Smirnov, Cucconi and Podgor-Gastwirth tests, and the box plots. SVM models were more robust to changes in model parameters and tree size than the KNN and DT classifiers. Our classification procedure was applied to real-world data, and the structured population was revealed with a high accuracy of [Formula: see text] using the SVM-polynomial classifier.
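To make the two-sample Kolmogorov-Smirnov comparison concrete, the sketch below computes only the test statistic, the largest gap between the two empirical distribution functions, for one tree statistic measured on two groups of trees. The function name and the sample values are hypothetical and chosen purely for illustration; they are not results from the paper, and no p-value is computed here.

```python
from bisect import bisect_right


def ks_two_sample_statistic(x, y):
    """Return D = max |F_x(v) - F_y(v)| over all observed values v."""
    xs, ys = sorted(x), sorted(y)
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # Both empirical CDFs are right-continuous step functions, so the
    # largest gap is attained at one of the observed data points.
    for v in xs + ys:
        f_x = bisect_right(xs, v) / len(xs)
        f_y = bisect_right(ys, v) / len(ys)
        d = max(d, abs(f_x - f_y))
    return d


# Hypothetical Sackin-index values for trees from two kinds of population.
sackin_structured = [52, 57, 61, 64, 70]
sackin_non_structured = [40, 43, 47, 55, 58]
print(ks_two_sample_statistic(sackin_structured, sackin_non_structured))
```

A value of D close to 1 indicates that the two samples barely overlap, while a value close to 0 indicates nearly identical distributions.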

SUBMITTER: Kayondo HW

PROVIDER: S-EPMC8579572 | biostudies-literature | 2021 Nov

REPOSITORIES: biostudies-literature


Publications

Employing phylogenetic tree shape statistics to resolve the underlying host population structure.

Kayondo HW, Ssekagiri A, Nabakooza G, Bbosa N, Ssemwanga D, Kaleebu P, Mwalili S, Mango JM, Leigh Brown AJ, Saenz RA, Galiwango R, Kitayimbwa JM

BMC Bioinformatics, 2021-11-10, issue 1


PMID: 34758743

Similar Datasets

Project description: The shape of phylogenetic trees can be used to gain evolutionary insights. A tree's shape specifies the connectivity of a tree, while its branch lengths reflect either the time or genetic distance between branching events; well-known measures of tree shape include the Colless and Sackin imbalance, which describe the asymmetry of a tree. In other contexts, network science has become an important paradigm for describing structural features of networks and using them to understand complex systems, ranging from protein interactions to social systems. Network science is thus a potential source of many novel ways to characterize tree shape, as trees are also networks. Here, we tailor tools from network science, including diameter, average path length, and betweenness, closeness, and eigenvector centrality, to summarize phylogenetic tree shapes. We thereby propose tree shape summaries that are complementary to both asymmetry and the frequencies of small configurations. These new statistics can be computed in linear time and scale well to describe the shapes of large trees. We apply these statistics, alongside some conventional tree statistics, to phylogenetic trees from three very different viruses (HIV, dengue fever and measles), from the same virus in different epidemiological scenarios (influenza A and HIV) and from simulation models known to produce trees with different shapes. Using mutual information and supervised learning algorithms, we find that the statistics adapted from network science perform as well as or better than conventional statistics. We describe their distributions and prove some basic results about their extreme values in a tree. We conclude that network science-based tree shape summaries are a promising addition to the toolkit of tree shape features.
All our shape summaries, as well as functions to select the most discriminating ones for two sets of trees, are freely available as an R package at http://github.com/Leonardini/treeCentrality.
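Two of the network-science summaries mentioned in this description, diameter and average path length, can be illustrated on a small tree treated as an undirected graph. The sketch below is a simple quadratic-time illustration using repeated breadth-first search, not the linear-time method the description refers to and not the treeCentrality package's code; the nested-tuple encoding and all function names are assumptions made for this example.

```python
from collections import deque
from itertools import count


def tree_to_adjacency(tree):
    """Turn a nested-tuple tree into an undirected adjacency list.

    Assumed encoding (hypothetical): a leaf is any non-tuple value and an
    internal node is a tuple of its children. Nodes get integer ids.
    """
    adjacency = {}
    ids = count()

    def visit(node):
        node_id = next(ids)
        adjacency[node_id] = []
        if isinstance(node, tuple):
            for child in node:
                child_id = visit(child)
                adjacency[node_id].append(child_id)
                adjacency[child_id].append(node_id)
        return node_id

    visit(tree)
    return adjacency


def bfs_distances(adjacency, source):
    """Edge-count distances from `source` to every node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter_and_mean_path_length(adjacency):
    """Return (longest shortest path, mean shortest path over node pairs)."""
    n = len(adjacency)
    if n < 2:
        return 0, 0.0
    total, diameter = 0, 0
    for source in adjacency:
        dist = bfs_distances(adjacency, source)
        total += sum(dist.values())  # counts every ordered pair once
        diameter = max(diameter, max(dist.values()))
    return diameter, total / (n * (n - 1))


adjacency = tree_to_adjacency((("a", "b"), ("c", "d")))
print(diameter_and_mean_path_length(adjacency))
```

Here all nodes, internal and leaf, are included in the averages and every branch counts as length 1; a branch-length-weighted version would need edge weights in the adjacency structure.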

Project description: To what extent are generalist species cohesive evolutionary units rather than a compilation of recently diverged lineages? We examine this question in the context of host specificity and geographic structure in the insect pathogen and nematode mutualist Xenorhabdus bovienii. This bacterial species partners with multiple nematode species across two clades in the genus Steinernema. We sequenced the genomes of 42 X. bovienii strains isolated from four different nematode species and three field sites within a 240-km² region and compared them to globally available reference genomes. We hypothesized that X. bovienii would comprise several host-specific lineages, such that bacterial and nematode phylogenies would be largely congruent. Alternatively, we hypothesized that spatial proximity might be a dominant signal, as increasing geographic distance might lower shared selective pressures and opportunities for gene flow. We found partial support for both hypotheses. Isolates clustered largely by nematode host species but did not strictly match the nematode phylogeny, indicating that shifts in symbiont associations across nematode species and clades have occurred. Furthermore, both genetic similarity and gene flow decreased with geographic distance across nematode species, suggesting differentiation and constraints on gene flow across both factors, although no absolute barriers to gene flow were observed across the regional isolates. Several genes associated with biotic interactions were found to be undergoing selective sweeps within this regional population. The interactions included several insect toxins and genes implicated in microbial competition. Thus, gene flow maintains cohesiveness across host associations in this symbiont and may facilitate adaptive responses to a multipartite selective environment. IMPORTANCE Microbial populations and species are notoriously hard to delineate.
We used a population genomics approach to examine the population structure and the spatial scale of gene flow in Xenorhabdus bovienii, an intriguing species that is both a specialized mutualistic symbiont of nematodes and a broadly virulent insect pathogen. We found a strong signature of nematode host association, as well as evidence for gene flow connecting isolates associated with different nematode host species and collected from distinct study sites. Furthermore, we saw signatures of selective sweeps for genes involved with nematode host associations, insect pathogenicity, and microbial competition. Thus, X. bovienii exemplifies the growing consensus that recombination not only maintains cohesion but can also allow the spread of niche-beneficial alleles.
