Dataset Information

Large-scale machine learning for metagenomics sequence classification.

ABSTRACT:

Motivation

Metagenomics characterizes the taxonomic diversity of microbial communities by sequencing DNA directly from an environmental sample. One of the main challenges in metagenomics data analysis is the binning step, where each sequenced read is assigned to a taxonomic clade. Because of the large volume of metagenomics datasets, binning methods need fast and accurate algorithms that can operate with reasonable computing requirements. While standard alignment-based methods provide state-of-the-art performance, compositional approaches that assign a taxonomic class to a DNA read based on the k-mers it contains have the potential to provide faster solutions.
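As a rough illustration of what "based on the k-mers it contains" means, the sketch below turns a read into a bag of overlapping k-mer counts. It is a minimal example, not the authors' implementation: the function name kmer_profile and the choice to skip k-mers containing ambiguous bases such as N are assumptions made here for illustration.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def kmer_profile(read: str, k: int) -> Counter:
    """Count the overlapping k-mers of a read.

    Hypothetical helper: k-mers containing characters other than
    A, C, G, T (for example N) are skipped, which is an assumption
    of this sketch rather than something stated in the abstract.
    """
    read = read.upper()
    counts = Counter()
    for start in range(len(read) - k + 1):
        kmer = read[start:start + k]
        if set(kmer) <= VALID_BASES:
            counts[kmer] += 1
    return counts


# Example: a 10-base read with one ambiguous base, profiled with k = 4.
profile = kmer_profile("ACGTACGTNA", 4)
```

A compositional classifier would then take such a count vector, usually stored sparsely, as the input features for predicting a taxonomic class.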

Results

We propose a new rank-flexible machine learning-based compositional approach for taxonomic assignment of metagenomics reads and show that it benefits from increasing the number of fragments sampled from reference genomes to tune its parameters, up to a coverage of about 10, and from increasing the k-mer size to about 12. Tuning the method involves training machine learning models on about 10^8 samples in 10^7 dimensions, which is out of reach of standard software but can be done efficiently with modern implementations for large-scale machine learning. The resulting method is competitive in terms of accuracy with well-established alignment and composition-based tools for problems involving a small to moderate number of candidate species and for reasonable amounts of sequencing errors. We show, however, that machine learning-based compositional approaches are still limited in their ability to deal with problems involving a greater number of species and are more sensitive to sequencing errors. We finally show that the new method outperforms the state-of-the-art in its ability to classify reads from species of lineage absent from the reference database and confirm that compositional approaches achieve faster prediction times, with a gain of 2-17 times with respect to the BWA-MEM short read mapper, depending on the number of candidate species and the level of sequencing noise.
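The scale quoted above can be checked with a little arithmetic: with four nucleotides there are 4^k possible k-mers, so k = 12 gives about 1.7 x 10^7 dimensions. The sketch below also shows one way to draw training fragments from a reference genome at a target coverage. It assumes coverage means (number of fragments x fragment length) / genome length, and the fragment length, function names, and uniform sampling of start positions are hypothetical choices for illustration, not details taken from the abstract.

```python
import math
import random


def feature_dimension(k: int) -> int:
    """Number of distinct k-mers over the alphabet {A, C, G, T}."""
    return 4 ** k


def num_fragments(genome_length: int, fragment_length: int, coverage: float) -> int:
    """Fragments needed so that fragments * length / genome_length >= coverage.

    Assumed definition of coverage; the abstract does not spell it out.
    """
    return math.ceil(coverage * genome_length / fragment_length)


def sample_fragments(genome: str, fragment_length: int, coverage: float, seed: int = 0) -> list:
    """Draw fragments at uniformly random start positions (illustrative only)."""
    if fragment_length > len(genome):
        raise ValueError("fragment_length must not exceed the genome length")
    rng = random.Random(seed)
    count = num_fragments(len(genome), fragment_length, coverage)
    last_start = len(genome) - fragment_length
    starts = (rng.randint(0, last_start) for _ in range(count))
    return [genome[s:s + fragment_length] for s in starts]


# Toy genome of 1000 bases, hypothetical fragment length of 100, coverage 10.
toy_genome = "ACGT" * 250
fragments = sample_fragments(toy_genome, fragment_length=100, coverage=10)
dimension = feature_dimension(12)
```

Under these assumptions, the number of training fragments grows linearly with both the coverage and the total length of the reference genomes, which is how the training set can reach the order of 10^8 samples.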

Availability and implementation

Data and code are available at http://cbio.ensmp.fr/largescalemetagenomics

Contact

pierre.mahe@biomerieux.com

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Vervier K

PROVIDER: S-EPMC4896366 | biostudies-literature | 2016 Apr

REPOSITORIES: biostudies-literature

Publications

Large-scale machine learning for metagenomics sequence classification.

Kévin Vervier, Pierre Mahé, Maud Tournoud, Jean-Baptiste Veyrieras, Jean-Philippe Vert

Bioinformatics (Oxford, England), 2015-11-20, issue 7

PMID: 26589281

Similar Datasets

Machine learning approaches for large scale classification of produce. | S-EPMC5869718 | biostudies-literature

Supervised machine learning for diagnostic classification from large-scale neuroimaging datasets. | S-EPMC7198352 | biostudies-literature

Nested Machine Learning Facilitates Increased Sequence Content for Large-Scale Automated High Resolution Melt Genotyping. | S-EPMC4726007 | biostudies-literature

Trainable high resolution melt curve machine learning classifier for large-scale reliable genotyping of sequence variants. | S-EPMC4183555 | biostudies-literature

Learning supervised embeddings for large scale sequence comparisons. | S-EPMC7069636 | biostudies-literature

Recovering large-scale battery aging dataset with machine learning. | S-EPMC8369168 | biostudies-literature

Sequence-Based Prediction of Plant Allergenic Proteins: Machine Learning Classification Approach. | S-EPMC9893444 | biostudies-literature

Code4ML: a large-scale dataset of annotated Machine Learning code. | S-EPMC10280557 | biostudies-literature

Machine Learning-Enabled Pipeline for Large-Scale Virtual Drug Screening. | S-EPMC8478848 | biostudies-literature

Alzheimer's disease risk assessment using large-scale machine learning methods. | S-EPMC3826736 | biostudies-literature