Dataset Information

ParGenes: a tool for massively parallel model selection and phylogenetic tree inference on thousands of genes.

ABSTRACT:

Motivation

Coalescent- and reconciliation-based methods are now widely used to infer species phylogenies from genomic data. They typically use per-gene phylogenies as input, which requires conducting multiple individual tree inferences on a large set of multiple sequence alignments (MSAs). At present, no easy-to-use parallel tool for this task exists. Ad hoc scripts for this purpose not only induce additional implementation overhead but can also lead to poor resource utilization and long times-to-solution. We present ParGenes, a tool for simultaneously determining the best-fit model and inferring maximum likelihood (ML) phylogenies on thousands of independent MSAs using supercomputers.

Results

ParGenes executes common phylogenetic pipeline steps such as model-testing, ML inference(s), bootstrapping and computation of branch support values via a single parallel program invocation. We evaluated ParGenes by inferring > 20 000 phylogenetic gene trees with bootstrap support values from Ensembl Compara and VectorBase alignments in 28 h on a cluster with 1024 nodes.
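
The resource-utilization concern raised in the Motivation can be illustrated with a small scheduling sketch. This is not ParGenes' actual scheduler, which this record does not describe; it is a generic comparison, under assumed per-alignment costs, between a naive round-robin assignment (as an ad hoc script might do) and a greedy longest-job-first assignment. The function names and the cost values are hypothetical.

```python
import heapq


def makespan_round_robin(costs, workers):
    """Assign jobs to workers in input order, ignoring job cost.

    Returns the load of the busiest worker (the time-to-solution).
    """
    loads = [0.0] * workers
    for i, cost in enumerate(costs):
        loads[i % workers] += cost
    return max(loads)


def makespan_lpt(costs, workers):
    """Greedy longest-processing-time-first assignment.

    Jobs are sorted by decreasing cost, and each job goes to the
    currently least-loaded worker (tracked with a min-heap).
    """
    loads = [0.0] * workers
    heapq.heapify(loads)
    for cost in sorted(costs, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + cost)
    return max(loads)


# Hypothetical per-alignment inference costs in arbitrary time units:
# two expensive gene families among many cheap ones.
costs = [8, 1, 1, 1, 8, 1, 1, 1]
workers = 4

print(makespan_round_robin(costs, workers))  # 16.0: both expensive jobs land on one worker
print(makespan_lpt(costs, workers))          # 8.0: bounded by the single longest job
```

With these assumed costs, the naive assignment takes twice as long as the cost-aware one, even though both use the same four workers. Gene-tree inference costs vary widely with alignment size, so imbalances of this kind are plausible when jobs are distributed without regard to cost.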

Availability and implementation

GNU GPL at https://github.com/BenoitMorel/ParGenes.

Supplementary information

Supplementary material is available at Bioinformatics online.

SUBMITTER: Morel B

PROVIDER: S-EPMC6513153 | biostudies-literature | 2019 May

REPOSITORIES: biostudies-literature


Publications

ParGenes: a tool for massively parallel model selection and phylogenetic tree inference on thousands of genes.

Benoit Morel, Alexey M. Kozlov, Alexandros Stamatakis

Bioinformatics (Oxford, England), 2019-05-01, issue 10


PMID: 30321303

Similar Datasets

Project description: Recent advances in next-generation sequencing (NGS) technologies spur progress in determining the microbial diversity in various ecosystems by highlighting, for example, the rare biosphere. Currently, high-throughput pyrotag sequencing of PCR-amplified SSU rRNA gene regions is mainly used to characterize bacterial and archaeal communities, and rarely to characterize protist communities. In addition, although taxonomic assessment through phylogeny is considered the most robust approach, similarity and probabilistic approaches remain the most commonly used for taxonomic affiliation. In the first part of this work, a tree-based method was compared with other approaches to taxonomic affiliation (BLAST and RDP) of 18S rRNA gene sequences and was shown to be the most accurate for near full-length sequences and for 400 bp amplicons, with the exception of amplicons covering the V5-V6 region. Secondly, the applicability of this method was tested by running a full-scale test using an original pyrosequencing dataset of 18S rRNA genes of small lacustrine protists (0.2-5 µm) from eight freshwater ecosystems. Our results revealed that i) fewer than 5% of the operational taxonomic units (OTUs) identified through clustering and phylogenetic affiliation had been previously detected in lakes, based on comparison to sequences in public databases; ii) the sequencing depth provided by NGS, coupled with a phylogenetic approach, made it possible to shed light on clades of freshwater protists rarely or never detected with classical molecular ecology approaches; and iii) phylogenetic methods are more robust in describing the structuring of under-studied or highly divergent populations. More precisely, new putative clades belonging to Mamiellophyceae, Foraminifera, Dictyochophyceae and Euglenida were detected.
Beyond the study of protists, these results illustrate that the tree-based approach for NGS-based diversity characterization allows an in-depth description of microbial communities, including taxonomic profiling, community structuring and the description of clades of any microorganisms (protists, Bacteria and Archaea).
