Dataset Information

Accuracy of taxonomy prediction for 16S rRNA and fungal ITS sequences.


ABSTRACT: Prediction of taxonomy for marker gene sequences such as 16S ribosomal RNA (rRNA) is a fundamental task in microbiology. Most experimentally observed sequences are diverged from reference sequences of authoritatively named organisms, creating a challenge for prediction methods. I assessed the accuracy of several algorithms using cross-validation by identity, a new benchmark strategy which explicitly models the variation in distances between query sequences and the closest entry in a reference database. When the accuracy of genus predictions was averaged over a representative range of identities with the reference database (100%, 99%, 97%, 95% and 90%), all tested methods had ≤50% accuracy on the currently popular V4 region of 16S rRNA. Accuracy was found to fall rapidly with identity; for example, better methods were found to have V4 genus prediction accuracy of ~100% at 100% identity but ~50% at 97% identity. The relationship between identity and taxonomy was quantified as the probability that a rank is the lowest shared by a pair of sequences with a given pair-wise identity. With the V4 region, 95% identity was found to be a twilight zone where taxonomy is highly ambiguous because the probabilities that the lowest shared rank between pairs of sequences is genus, family, order or class are approximately equal.
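The "lowest shared rank" statistic described in the abstract can be illustrated with a minimal sketch. This is not the author's implementation. The rank list, the dictionary representation of a lineage, and the function names `lowest_shared_rank` and `lcr_probabilities` are all assumptions made for illustration, and the pair-wise identities are assumed to have been computed and binned elsewhere.

```python
from collections import Counter, defaultdict

# Assumed rank order, highest to lowest (hypothetical; the paper may use a different set).
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def lowest_shared_rank(tax_a, tax_b):
    """Return the lowest rank at which two lineages carry the same name.

    Each lineage is a dict mapping rank -> name. Returns None when the
    lineages already differ (or are unannotated) at the highest rank.
    """
    shared = None
    for rank in RANKS:
        name_a, name_b = tax_a.get(rank), tax_b.get(rank)
        if name_a is None or name_a != name_b:
            break
        shared = rank
    return shared


def lcr_probabilities(pairs):
    """Estimate P(lowest shared rank | identity) from sequence pairs.

    `pairs` is an iterable of (identity_percent, tax_a, tax_b) tuples,
    with identity already binned (e.g. 95 for "about 95%").
    Returns {identity: {rank: fraction of pairs at that identity}}.
    """
    counts = defaultdict(Counter)
    for identity, tax_a, tax_b in pairs:
        counts[identity][lowest_shared_rank(tax_a, tax_b)] += 1
    return {
        identity: {rank: n / sum(c.values()) for rank, n in c.items()}
        for identity, c in counts.items()
    }
```

Under this reading, the "twilight zone" at 95% identity corresponds to the entry for 95 in the returned dictionary having roughly equal values for genus, family, order and class.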

SUBMITTER: Edgar RC 

PROVIDER: S-EPMC5910792 | biostudies-literature | 2018

REPOSITORIES: biostudies-literature

Publications

Accuracy of taxonomy prediction for 16S rRNA and fungal ITS sequences.

Edgar RC

PeerJ, 2018-04-18


Similar Datasets

| S-EPMC4297541 | biostudies-literature
| S-EPMC6615214 | biostudies-literature
| S-EPMC10269663 | biostudies-literature
| S-EPMC6003391 | biostudies-literature
| PRJNA958098 | ENA
| PRJEB78691 | ENA
2021-09-09 | PXD017844 | Pride
| S-EPMC8925046 | biostudies-literature
| S-EPMC3333176 | biostudies-literature
| S-EPMC8067651 | biostudies-literature