Dataset Information

hcapca: Automated Hierarchical Clustering and Principal Component Analysis of Large Metabolomic Datasets in R.

ABSTRACT: Microbial natural product discovery programs face two main challenges today: rapidly prioritizing strains for discovering new molecules and avoiding the rediscovery of already known molecules. Typically, these problems have been tackled using biological assays to identify promising strains and techniques that model variance in a dataset, such as PCA, to highlight novel chemistry. While these tools have shown successful outcomes in the past, datasets are becoming much larger and require a new approach. Since PCA models are dependent on the members of the group being modeled, large datasets with many members make it difficult to accurately model the variance in the data. Our tool, hcapca, first groups strains based on the similarity of their chemical composition, and then applies PCA to the smaller sub-groups, yielding more robust PCA models. This allows for scalable chemical comparisons among hundreds of strains with thousands of molecular features. As a proof of concept, we applied our open-source tool to a dataset with 1046 LCMS profiles of marine invertebrate-associated bacteria and discovered three new analogs of an established anticancer agent from one promising strain.
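
The two-step idea in the abstract (group similar samples first, then fit a separate PCA to each smaller group) can be illustrated with a minimal sketch. It is written in Python with NumPy and SciPy rather than R, and it is not the hcapca implementation. The function names (pca_scores, cluster_then_pca), the average-linkage Euclidean clustering, the single flat cut into n_groups, and the synthetic feature table are all assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def pca_scores(X, n_components=2):
    """PCA via SVD on mean-centered data.

    Returns the sample scores and the fraction of variance
    explained by each retained component.
    """
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    k = min(n_components, len(S))
    scores = U[:, :k] * S[:k]
    total = np.sum(S ** 2)
    explained = (S[:k] ** 2) / total if total > 0 else np.zeros(k)
    return scores, explained


def cluster_then_pca(X, n_groups=2, n_components=2):
    """Hypothetical helper: cluster the rows of X, then run PCA per cluster.

    X is a samples-by-features table (for example, strains by
    molecular features). The linkage method and the flat cut are
    illustrative choices, not taken from the source.
    """
    Z = linkage(X, method="average", metric="euclidean")
    labels = fcluster(Z, t=n_groups, criterion="maxclust")
    results = {}
    for g in np.unique(labels):
        members = np.where(labels == g)[0]
        if len(members) < 2:
            continue  # PCA on a single sample carries no variance
        scores, explained = pca_scores(X[members], n_components)
        results[int(g)] = {
            "members": members,
            "scores": scores,
            "explained": explained,
        }
    return labels, results


# Synthetic example: two well-separated groups of six samples each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, (6, 5)),
    rng.normal(10.0, 1.0, (6, 5)),
])
labels, results = cluster_then_pca(X, n_groups=2)
for g, r in results.items():
    print(g, r["members"], np.round(r["explained"], 2))
```

Each per-group PCA is centered on its own members, so its components describe variation within that group rather than the large differences between groups.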

SUBMITTER: Chanana S

PROVIDER: S-EPMC7407629 | biostudies-literature | 2020 Jul

REPOSITORIES: biostudies-literature

Publications

hcapca: Automated Hierarchical Clustering and Principal Component Analysis of Large Metabolomic Datasets in R.

Shaurya Chanana, Chris S. Thomas, Fan Zhang, Scott R. Rajski, Tim S. Bugni

Metabolites, 2020-07-21, issue 7

PMID: 32708222
