Dataset Information

Word correlation matrices for protein sequence analysis and remote homology detection.

ABSTRACT:

Background

Classification of protein sequences is a central problem in computational biology. Among current computational methods, discriminative kernel-based approaches provide the most accurate results. However, kernel-based methods often lack an interpretable model for analyzing discriminative sequence features, and predictions on new sequences are usually computationally expensive.

Results

In this work, we present a novel kernel for protein sequences based on the average word similarity between two sequences. We show that this kernel gives rise to a feature space that allows analysis of discriminative features and fast classification of new sequences. We demonstrate the performance of our approach on a widely used benchmark setup for protein remote homology detection.
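The idea of a kernel defined as an average word similarity, and of an explicit feature space that makes classification fast, can be illustrated with a minimal sketch. The similarity function here is an assumption made for illustration only: two words count as similar (1.0) when they are identical and dissimilar (0.0) otherwise, which is not necessarily the similarity measure used in the paper. The function names, the word length `k=3`, and the example sequences are likewise hypothetical. Under this assumption, the average over all word pairs equals the dot product of the two sequences' word-frequency vectors, so a sequence can be mapped to its feature vector once instead of being compared pairwise.

```python
from collections import Counter
from itertools import product


def words(seq, k):
    """Return all overlapping words (k-mers) of a sequence with len(seq) >= k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def avg_word_similarity(a, b, k=3):
    """Average similarity over all word pairs of two sequences.

    Assumption for illustration: similarity is 1.0 for identical
    words and 0.0 otherwise.
    """
    wa, wb = words(a, k), words(b, k)
    total = sum(1.0 for u, v in product(wa, wb) if u == v)
    return total / (len(wa) * len(wb))


def word_frequencies(seq, k=3):
    """Explicit feature map: relative frequency of each word in the sequence."""
    ws = words(seq, k)
    return {w: c / len(ws) for w, c in Counter(ws).items()}


def dot(fa, fb):
    """Dot product of two sparse feature vectors stored as dicts."""
    return sum(v * fb.get(w, 0.0) for w, v in fa.items())


# Hypothetical example sequences.
a, b = "MKVLAAGIV", "MKVLSAGIV"
k_pairwise = avg_word_similarity(a, b)
k_explicit = dot(word_frequencies(a), word_frequencies(b))
```

In this sketch `k_pairwise` and `k_explicit` agree up to floating-point rounding. The pairwise form costs time proportional to the product of the two sequence lengths, whereas the explicit form needs one pass per sequence and, with a linear classifier, one dot product with a weight vector whose entries can be inspected word by word.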

Conclusion

Our word correlation approach provides performance that is highly competitive with state-of-the-art methods for protein remote homology detection. The learned model is interpretable in terms of biologically meaningful features. In particular, analysis of discriminative words allows the identification of characteristic regions in biological sequences. Because of its high computational efficiency, our method can be applied to ranking potential homologs in large databases.

SUBMITTER: Lingner T

PROVIDER: S-EPMC2438326 | biostudies-literature | 2008 Jun

REPOSITORIES: biostudies-literature

Publications

Word correlation matrices for protein sequence analysis and remote homology detection.

Thomas Lingner, Peter Meinicke

BMC Bioinformatics, 2008-06-03

PMID: 18522726

Similar Datasets

Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection. | S-EPMC7537947 | biostudies-literature

Protein Remote Homology Detection Based on an Ensemble Learning Approach. | S-EPMC4875977 | biostudies-literature

Protein remote homology detection based on bidirectional long short-term memory. | S-EPMC5634958 | biostudies-literature

Using amino acid physicochemical distance transformation for fast protein remote homology detection. | S-EPMC3460876 | biostudies-literature

PLMSearch: Protein language model powers accurate and fast sequence search for remote homology. | S-EPMC10981738 | biostudies-literature

Graph Theory-Based Sequence Descriptors as Remote Homology Predictors. | S-EPMC7022958 | biostudies-literature

dRHP-PseRA: detecting remote homology proteins using profile-based pseudo protein sequence and rank aggregation. | S-EPMC5007510 | biostudies-literature

CPHmodels-3.0--remote homology modeling using structure-guided sequence profiles. | S-EPMC2896139 | biostudies-literature

A discriminative method for family-based protein remote homology detection that combines inductive logic programming and propositional models. | S-EPMC3078102 | biostudies-literature

Motif kernel generated by genetic programming improves remote homology and fold detection. | S-EPMC1794419 | biostudies-literature