Dataset Information

Efficient use of unlabeled data for protein sequence classification: a comparative study.


ABSTRACT: BACKGROUND: Recent studies in computational primary protein sequence analysis have leveraged the power of unlabeled data. For example, predictive models based on string kernels trained on sequences known to belong to particular folds or superfamilies, the so-called labeled data set, can attain significantly improved accuracy if this data is supplemented with protein sequences that lack any class tags (the unlabeled data). In this study, we present a principled and biologically motivated computational framework that exploits the unlabeled data more effectively by using only the sequence regions that are more likely to be biologically relevant for better prediction accuracy. As over-represented sequences in large uncurated databases may bias the estimation of computational models that rely on unlabeled data, we also propose a method to remove this bias and improve the performance of the resulting classifiers. RESULTS: Combined with state-of-the-art string kernels, our proposed computational framework achieves very accurate semi-supervised protein remote fold and homology detection on three large unlabeled databases. It outperforms current state-of-the-art methods and exhibits a significant reduction in running time. CONCLUSION: The unlabeled sequences used under the semi-supervised setting resemble unpolished gemstones: used as-is, they may carry unnecessary features and hence compromise classification accuracy, but once cut and polished, they improve the accuracy of the classifiers considerably.

SUBMITTER: Kuksa P 

PROVIDER: S-EPMC2681072 | biostudies-literature | 2009

REPOSITORIES: biostudies-literature

Publications

Efficient use of unlabeled data for protein sequence classification: a comparative study.

Pavel Kuksa, Pai-Hsi Huang, Vladimir Pavlovic

BMC Bioinformatics, 2009-04-29

Similar Datasets

| S-EPMC7994853 | biostudies-literature
| S-EPMC7856229 | biostudies-literature
| S-EPMC2515873 | biostudies-literature
| S-EPMC6101392 | biostudies-literature
| S-EPMC6196534 | biostudies-literature
| S-EPMC2275242 | biostudies-literature
| S-EPMC6629670 | biostudies-literature
| S-EPMC3250929 | biostudies-literature
2023-08-08 | GSE229791 | GEO
| S-EPMC2475627 | biostudies-literature