Dataset Information

EPA-ng: Massively Parallel Evolutionary Placement of Genetic Sequences.

ABSTRACT: Next generation sequencing (NGS) technologies have led to a ubiquity of molecular sequence data. This data avalanche is particularly challenging in metagenetics, which focuses on taxonomic identification of sequences obtained from diverse microbial environments. Phylogenetic placement methods determine how these sequences fit into an evolutionary context. Previous implementations of phylogenetic placement algorithms, such as the evolutionary placement algorithm (EPA) included in RAxML, or PPLACER, are being increasingly used for this purpose. However, due to the steady progress in NGS technologies, the current implementations face substantial scalability limitations. Herein, we present EPA-NG, a complete reimplementation of the EPA that is substantially faster, offers a distributed memory parallelization, and integrates concepts from both RAxML-EPA and PPLACER. EPA-NG can be executed on standard shared memory, as well as on distributed memory systems (e.g., computing clusters). To demonstrate the scalability of EPA-NG, we placed 1 billion metagenetic reads from the Tara Oceans Project onto a reference tree with 3748 taxa in just under 7 h, using 2048 cores. Our performance assessment shows that EPA-NG outperforms RAxML-EPA and PPLACER by up to a factor of 30 in sequential execution mode, while attaining comparable parallel efficiency on shared memory systems. We further show that the distributed memory parallelization of EPA-NG scales well up to 2048 cores. EPA-NG is available under the AGPLv3 license: https://github.com/Pbdas/epa-ng.

SUBMITTER: Barbera P

PROVIDER: S-EPMC6368480 | biostudies-literature | 2019 Mar

REPOSITORIES: biostudies-literature

Publications

EPA-ng: Massively Parallel Evolutionary Placement of Genetic Sequences.

Pierre Barbera, Alexey M Kozlov, Lucas Czech, Benoit Morel, Diego Darriba, Tomáš Flouri, Alexandros Stamatakis

Systematic Biology, 2019-03-01, issue 2

PMID: 30165689

Similar Datasets

Project description: There have been two major eras in the history of gene discovery. The first was the era of linkage analysis, with approximately 1,300 disease-related genes identified by positional cloning by the turn of the millennium. The second era has been powered by two major breakthroughs: the publication of the human genome and the development of massively parallel sequencing (MPS). MPS has greatly accelerated disease gene identification, such that disease genes that would have taken years to map previously can now be determined in a matter of weeks. Additionally, the number of affected families needed to map a causative gene and the size of such families have fallen: de novo mutations, previously intractable by linkage analysis, can be identified through sequencing of the parent-child trio, and genes for recessive disease can be identified through MPS even of a single affected individual. MPS technologies include whole exome sequencing (WES), whole genome sequencing (WGS), and panel sequencing, each with their strengths. While WES has been responsible for most gene discoveries through MPS, WGS is superior in detecting copy number variants, chromosomal rearrangements, and repeat-rich regions. Panels are commonly used for diagnostic purposes as they are extremely cost-effective and generate manageable quantities of data, with no risk of unexpected findings. However, in instances of diagnostic uncertainty, it can be challenging to choose the right panel, and in these circumstances WES has a higher diagnostic yield. MPS has ethical, social, and legal implications, many of which are common to genetic testing generally but amplified due to the magnitude of data (e.g., relationship misattribution, identification of variants of uncertain significance, and genetic discrimination); others are unique to WES and WGS technologies (e.g., incidental or secondary findings). Nonetheless, MPS is rapidly translating into clinical practice as an extremely useful part of the clinical armamentarium.
