Dataset Information

mlplasmids: a user-friendly tool to predict plasmid- and chromosome-derived sequences for single species.


ABSTRACT: Assembly of bacterial short-read whole-genome sequencing data frequently results in hundreds of contigs for which the origin, plasmid or chromosome, is unclear. Complete genomes resolved by long-read sequencing can be used to generate and label short-read contigs. These labelled contigs were used to train several popular machine learning methods to classify the origin of contigs from Enterococcus faecium, Klebsiella pneumoniae and Escherichia coli using pentamer frequencies. We selected support-vector machine (SVM) models as the best classifier for all three bacterial species (F1-score E. faecium=0.92, F1-score K. pneumoniae=0.90, F1-score E. coli=0.76), which outperformed other existing plasmid prediction tools using a benchmarking set of isolates. We demonstrated the scalability of our models by accurately predicting the plasmidome of a large collection of 1644 E. faecium isolates and illustrated their applicability by predicting the location of antibiotic-resistance genes in all three species. The SVM classifiers are publicly available as an R package and graphical user interface called 'mlplasmids'. We anticipate that this tool may significantly facilitate research on the dissemination of plasmids encoding antibiotic resistance and/or contributing to host adaptation.
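The abstract names pentamer frequencies as the features used to classify contigs. As a rough illustration of what such a feature vector looks like, here is a minimal Python sketch that counts overlapping 5-mers in one contig and normalises them to relative frequencies. The function name `pentamer_frequencies`, the decision to skip windows containing non-ACGT characters, and the lexicographic ordering of the 1024 features are assumptions made for this example; the source does not describe how mlplasmids itself computes or orders these features, and the SVM training step is not shown.

```python
from itertools import product

K = 5  # pentamers, as named in the abstract

# All 4^5 = 1024 possible pentamers in lexicographic order (ordering is an assumption).
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def pentamer_frequencies(seq):
    """Return a 1024-element list of relative pentamer frequencies for one contig.

    Windows containing characters other than A, C, G or T (for example N)
    are skipped; this handling is an assumption of the sketch.
    """
    seq = seq.upper()
    counts = [0] * len(KMERS)
    total = 0
    for i in range(len(seq) - K + 1):
        idx = INDEX.get(seq[i:i + K])
        if idx is not None:
            counts[idx] += 1
            total += 1
    if total == 0:
        # Contig shorter than 5 bp or fully ambiguous: return an all-zero vector.
        return [0.0] * len(KMERS)
    return [c / total for c in counts]


if __name__ == "__main__":
    freqs = pentamer_frequencies("ACGTACGTAC")
    print(len(freqs), freqs[INDEX["ACGTA"]])
```

A vector like this, computed per contig, is the kind of input a classifier such as an SVM could be trained on, with the plasmid or chromosome label taken from the complete genomes.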

SUBMITTER: Arredondo-Alonso S 

PROVIDER: S-EPMC6321875 | biostudies-literature | 2018 Nov

REPOSITORIES: biostudies-literature

Publications

mlplasmids: a user-friendly tool to predict plasmid- and chromosome-derived sequences for single species.

Arredondo-Alonso S, Rogers MRC, Braat JC, Verschuuren TD, Top J, Corander J, Willems RJL, Schürch AC

Microbial Genomics, 2018-11-01, issue 11


Similar Datasets

| S-EPMC10603768 | biostudies-literature
| S-EPMC1456992 | biostudies-literature
| S-EPMC4355611 | biostudies-literature
| S-EPMC5310375 | biostudies-literature
| S-EPMC9757591 | biostudies-literature
| S-EPMC10538763 | biostudies-literature
| S-EPMC6289135 | biostudies-literature
| S-EPMC4138379 | biostudies-literature
| S-EPMC5615795 | biostudies-literature
| S-EPMC7406044 | biostudies-literature