Dataset Information

Classifying short genomic fragments from novel lineages using composition and homology.


ABSTRACT:

Background

The assignment of taxonomic attributions to DNA fragments recovered directly from the environment is a vital step in metagenomic data analysis. Assignments can be made using rank-specific classifiers, which assign reads to taxonomic labels from a predetermined level such as named species or strain, or rank-flexible classifiers, which choose an appropriate taxonomic rank for each sequence in a data set. The choice of rank typically depends on the optimal model for a given sequence and on the breadth of taxonomic groups seen in a set of close-to-optimal models. Homology-based (e.g., LCA) and composition-based (e.g., PhyloPythia, TACOA) rank-flexible classifiers have been proposed, but there is at present no hybrid approach that utilizes both homology and composition.
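As a rough illustration of the rank-flexible idea described above, the following Python sketch keeps every reference whose score is close to the best one and reports the deepest taxon they all share (a lowest-common-ancestor rule). The function names, the `score_window` parameter, and the tuple representation of a lineage are assumptions made for this example; they are not taken from LCA, PhyloPythia, TACOA, or any other published tool.

```python
from typing import List, Sequence, Tuple

Lineage = Tuple[str, ...]  # taxa ordered from domain down to species (assumed layout)


def lowest_common_ancestor(lineages: Sequence[Lineage]) -> Lineage:
    """Return the longest prefix shared by every lineage."""
    if not lineages:
        return ()
    shared: List[str] = []
    # zip(*lineages) walks the ranks in parallel, stopping at the shortest lineage.
    for taxa_at_rank in zip(*lineages):
        if len(set(taxa_at_rank)) != 1:
            break
        shared.append(taxa_at_rank[0])
    return tuple(shared)


def rank_flexible_assignment(
    hits: Sequence[Tuple[float, Lineage]], score_window: float
) -> Lineage:
    """Assign a fragment to the LCA of all close-to-optimal hits.

    hits: (score, lineage) pairs, higher score meaning a better match.
    score_window: hypothetical tuning knob; hits scoring within this
    distance of the best hit are treated as close-to-optimal.
    """
    if not hits:
        return ()
    best = max(score for score, _ in hits)
    near_optimal = [lineage for score, lineage in hits if score >= best - score_window]
    return lowest_common_ancestor(near_optimal)
```

With a window of zero only the best hit is kept, so the sketch behaves like a rank-specific classifier; widening the window pulls in more distant references and pushes the assignment toward higher, more conservative ranks.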

Results

We first develop a hybrid, rank-specific classifier based on BLAST and Naïve Bayes (NB) that has accuracy comparable to, and a faster running time than, the current best approach, PhymmBL. By substituting LCA for BLAST or allowing the inclusion of suboptimal NB models, we obtain a rank-flexible classifier. This hybrid classifier outperforms established rank-flexible approaches on simulated metagenomic fragments of length 200 bp to 1000 bp and is able to assign taxonomic attributions to a subset of sequences with few misclassifications. We then demonstrate the performance of different classifiers on an enhanced biological phosphorus removal metagenome, illustrating the advantages of rank-flexible classifiers when representative genomes are absent from the set of reference genomes. Application to a glacier ice metagenome demonstrates that similar taxonomic profiles are obtained across a set of classifiers that are increasingly conservative in their classification.
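To make the composition-based half of such a classifier concrete, here is a minimal Python sketch of a Naïve Bayes model over k-mer frequencies: each reference genome yields a table of smoothed k-mer log-probabilities, and a fragment is assigned to the genome under which its k-mers are most likely. The choice of k, the add-one pseudocount, and all function names are assumptions for illustration; the abstract does not specify these details, and the sketch does not include the BLAST component.

```python
import math
from collections import Counter
from itertools import product
from typing import Dict


def kmer_counts(seq: str, k: int) -> Counter:
    """Count overlapping k-mers in a sequence (case-insensitive)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def train_model(genome: str, k: int, pseudocount: float = 1.0) -> Dict[str, float]:
    """Build a table of smoothed log-probabilities for every A/C/G/T k-mer."""
    counts = kmer_counts(genome, k)
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in all_kmers) + pseudocount * len(all_kmers)
    return {m: math.log((counts[m] + pseudocount) / total) for m in all_kmers}


def log_likelihood(fragment: str, model: Dict[str, float], k: int) -> float:
    """Sum k-mer log-probabilities, treating k-mers as independent (the naive assumption)."""
    # k-mers containing ambiguous bases such as N are not in the model and are skipped.
    return sum(n * model[m] for m, n in kmer_counts(fragment, k).items() if m in model)


def classify(fragment: str, models: Dict[str, Dict[str, float]], k: int) -> str:
    """Return the name of the reference model with the highest log-likelihood."""
    return max(models, key=lambda name: log_likelihood(fragment, models[name], k))
```

A rank-flexible variant could retain every model whose log-likelihood lies near the maximum rather than only the top one, which is the general idea the abstract describes as including suboptimal NB models.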

Conclusions

Our NB-based classification scheme is faster than the current best composition-based algorithm, Phymm, while providing equally accurate predictions. The rank-flexible variant of NB, which we term ε-NB, is complementary to LCA and can be combined with it to yield conservative prediction sets of very high confidence. The simple parameterization of LCA and ε-NB allows for tuning of the balance between more predictions and increased precision, allowing the user to account for the sensitivity of downstream analyses to misclassified or unclassified sequences.

SUBMITTER: Parks DH 

PROVIDER: S-EPMC3173459 | biostudies-literature | 2011 Aug

REPOSITORIES: biostudies-literature

Publications

Classifying short genomic fragments from novel lineages using composition and homology.

Parks DH, MacDonald NJ, Beiko RG

BMC Bioinformatics, 2011-08-09


Similar Datasets

| S-EPMC6855806 | biostudies-literature
| S-EPMC3772700 | biostudies-literature
| S-EPMC2839357 | biostudies-literature
| S-EPMC2933786 | biostudies-literature
| S-EPMC3384351 | biostudies-literature
| S-EPMC516078 | biostudies-literature
| S-EPMC5467940 | biostudies-literature
| S-EPMC2653487 | biostudies-literature
| S-EPMC6215429 | biostudies-literature
| S-EPMC2924854 | biostudies-literature