Dataset Information

GGRaSP: a R-package for selecting representative genomes using Gaussian mixture models.


ABSTRACT: Motivation: The vast number of available sequenced bacterial genomes occasionally exceeds the facilities of comparative genomic methods or is dominated by a single outbreak strain, and thus a diverse and representative subset is required. Generation of the reduced subset currently requires a priori supervised clustering and sequence-only selection of medoid genomic sequences, independent of any additional genome metrics or strain attributes. Results: The Gaussian Genome Representative Selector with Prioritization (GGRaSP) R-package described below generates a reduced subset of genomes that prioritizes maintaining genomes of interest to the user as well as minimizing the loss of genetic variation. The package also allows for unsupervised clustering by modeling the genomic relationships using a Gaussian mixture model to select an appropriate cluster threshold. We demonstrate the capabilities of GGRaSP by generating a reduced list of 315 genomes from a genomic dataset of 4600 Escherichia coli genomes, prioritizing selection by type strain and by genome completeness. Availability and implementation: GGRaSP is available at https://github.com/JCVenterInstitute/ggrasp/. Supplementary information: Supplementary data are available at Bioinformatics online.
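
The idea of choosing a cluster threshold from a Gaussian mixture model can be illustrated with a minimal sketch. This is not GGRaSP's code or API: it is a standard-library Python illustration under stated assumptions. It assumes pairwise genome distances fall into two groups (within-cluster and between-cluster), fits a two-component one-dimensional mixture by plain expectation-maximization, and takes the threshold as the point between the two component means where the weighted densities cross. The function names (`fit_two_gaussians`, `threshold_between`) and the simulated distances are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def fit_two_gaussians(values, iterations=200):
    """Fit a two-component 1D Gaussian mixture with plain EM (illustrative only)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    # Initialise the means from the lower and upper halves of the sorted data.
    means = [sum(ordered[:half]) / half,
             sum(ordered[half:]) / (len(ordered) - half)]
    spread = (ordered[-1] - ordered[0]) / 4 or 1e-3
    sds = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: posterior probability of each component for each value.
        resp = []
        for x in values:
            p = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in range(2)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, mean and standard deviation per component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            sds[k] = max(math.sqrt(var), 1e-6)
    return weights, means, sds


def threshold_between(weights, means, sds, steps=1000):
    """Scan from the lower mean to the upper mean and return the first point
    where the upper component's weighted density is at least the lower one's."""
    lo_k, hi_k = (0, 1) if means[0] <= means[1] else (1, 0)
    lo, hi = means[lo_k], means[hi_k]
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        upper = weights[hi_k] * normal_pdf(x, means[hi_k], sds[hi_k])
        lower = weights[lo_k] * normal_pdf(x, means[lo_k], sds[lo_k])
        if upper >= lower:
            return x
    return hi


# Simulated distances (an assumption, not data from the paper):
# small within-cluster distances and larger between-cluster distances.
rng = random.Random(0)
distances = [abs(rng.gauss(0.02, 0.005)) for _ in range(200)]
distances += [abs(rng.gauss(0.30, 0.03)) for _ in range(200)]

weights, means, sds = fit_two_gaussians(distances)
threshold = threshold_between(weights, means, sds)
print("estimated component means:", sorted(means))
print("suggested cluster threshold:", threshold)
```

In this sketch, genome pairs closer than `threshold` would be treated as belonging to the same cluster. Choosing one representative per cluster, with priority given to genomes of interest as the abstract describes, would be a separate step that is not shown here.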

SUBMITTER: Clarke TH 

PROVIDER: S-EPMC6129299 | biostudies-literature | 2018 Sep

REPOSITORIES: biostudies-literature

Publications

GGRaSP: a R-package for selecting representative genomes using Gaussian mixture models.

Clarke TH, Brinkac LM, Sutton G, Fouts DE

Bioinformatics (Oxford, England), 2018-09-01, issue 17


Similar Datasets

| S-EPMC6403234 | biostudies-literature
| S-EPMC5860603 | biostudies-literature
| S-EPMC4905523 | biostudies-other
| S-EPMC5561081 | biostudies-other
2017-10-08 | GSE104714 | GEO
| S-EPMC7021245 | biostudies-literature
| S-EPMC4311641 | biostudies-literature
| S-EPMC8577282 | biostudies-literature
| S-EPMC6935449 | biostudies-literature
| S-EPMC2717951 | biostudies-literature