Dataset Information

Choosing non-redundant representative subsets of protein sequence data sets using submodular optimization.

ABSTRACT: Selecting a non-redundant representative subset of sequences is a common step in many bioinformatics workflows, such as the creation of non-redundant training sets for sequence and structural models or selection of "operational taxonomic units" from metagenomics data. Previous methods for this task, such as CD-HIT, PISCES, and UCLUST, apply a heuristic threshold-based algorithm that has no theoretical guarantees. We propose a new approach based on submodular optimization. Submodular optimization, a discrete analogue to continuous convex optimization, has been used with great success for other representative set selection problems. We demonstrate that the submodular optimization approach results in representative protein sequence subsets with greater structural diversity than sets chosen by existing methods, using as a gold standard the SCOPe library of protein domain structures. In this setting, submodular optimization consistently yields protein sequence subsets that include more SCOPe domain families than sets of the same size selected by competing approaches. We also show how the optimization framework allows us to design a mixture objective function that performs well for both large and small representative sets. The framework we describe is the best possible in polynomial time (under some assumptions), and it is flexible and intuitive because it applies a suite of generic methods to optimize one of a variety of objective functions.
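As an illustration of the approach the abstract describes, the sketch below greedily selects a representative subset under a facility-location objective. This is a minimal sketch under stated assumptions, not the authors' implementation. The abstract does not name the objective or the optimizer, so facility location and the greedy rule are stand-ins, and `facility_location`, `greedy_select`, and the toy similarity matrix are hypothetical. For a monotone submodular objective with a size limit, the standard greedy rule achieves at least a (1 - 1/e) fraction of the optimal value, which is plausibly what "best possible in polynomial time (under some assumptions)" refers to.

```python
from typing import List, Sequence


def facility_location(sim: Sequence[Sequence[float]], subset: Sequence[int]) -> float:
    """f(S) = sum over all items i of max_{j in S} sim[i][j]; f(empty set) = 0.

    Assumes non-negative similarities (an assumption of this sketch).
    """
    if not subset:
        return 0.0
    return sum(max(row[j] for j in subset) for row in sim)


def greedy_select(sim: Sequence[Sequence[float]], k: int) -> List[int]:
    """Pick up to k representatives by repeatedly adding the item with the
    largest marginal gain in the facility-location objective."""
    n = len(sim)
    selected: List[int] = []
    # best_cover[i] = similarity of item i to its closest chosen representative
    best_cover = [0.0] * n
    for _ in range(min(k, n)):
        best_gain, best_j = -1.0, -1
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding j: how much each item's coverage improves
            gain = sum(max(sim[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best_cover = [max(best_cover[i], sim[i][best_j]) for i in range(n)]
    return selected


# Hypothetical toy data: items 0-2 are mutually similar, as are items 3-4.
toy_sim = [
    [1.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 1.0, 0.8, 0.1, 0.0],
    [0.8, 0.8, 1.0, 0.0, 0.1],
    [0.1, 0.1, 0.0, 1.0, 0.7],
    [0.0, 0.0, 0.1, 0.7, 1.0],
]
representatives = greedy_select(toy_sim, 2)
```

On this toy matrix the two chosen representatives come from different similarity groups. In practice the similarities would come from pairwise sequence comparison, which is outside the scope of this sketch.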

SUBMITTER: Libbrecht MW

PROVIDER: S-EPMC5835207 | biostudies-literature | 2018 Apr

REPOSITORIES: biostudies-literature

Publications

Choosing non-redundant representative subsets of protein sequence data sets using submodular optimization.

Maxwell W Libbrecht, Jeffrey A Bilmes, William Stafford Noble

Proteins, 2018-02-01, issue 4

PMID: 29345009

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets

Similar Datasets

Choosing panels of genomics assays using submodular optimization. | S-EPMC5111315 | biostudies-literature

Homology-driven assembly of NOn-redundant protEin sequence sets (NOmESS) for mass spectrometry. | S-EPMC4848398 | biostudies-literature

Selection of representative protein data sets. | S-EPMC2142204 | biostudies-other

Choosing your (Friedel) mates wisely: grouping data sets to improve anomalous signal. | S-EPMC6400255 | biostudies-literature

The annotation-enriched non-redundant patent sequence databases. | S-EPMC3568390 | biostudies-literature

Detecting representative data and generating synthetic samples to improve learning accuracy with imbalanced data sets. | S-EPMC5542532 | biostudies-other

BioSeqZip: a collapser of NGS redundant reads for the optimization of sequence analysis. | S-EPMC7203750 | biostudies-literature

OWL--a non-redundant composite protein sequence database. | S-EPMC308323 | biostudies-other

Data Sets Representative of the Structures and Experimental Properties of FDA-Approved Drugs. | S-EPMC5846051 | biostudies-literature

Partitioning clustering algorithms for protein sequence data sets. | S-EPMC2678123 | biostudies-literature