Dataset Information

A Protein Classification Benchmark collection for machine learning.

ABSTRACT: Protein classification by machine learning algorithms is now widely used in structural and functional annotation of proteins. The Protein Classification Benchmark collection (http://hydra.icgeb.trieste.it/benchmark) was created in order to provide standard datasets on which the performance of machine learning methods can be compared. It is primarily meant for method developers and users interested in comparing methods under standardized conditions. The collection contains datasets of sequences and structures, and each set is subdivided into positive/negative, training/test sets in several ways. There is a total of 6405 classification tasks, 3297 on protein sequences, 3095 on protein structures and 10 on protein coding regions in DNA. Typical tasks include the classification of structural domains in the SCOP and CATH databases based on their sequences or structures, as well as various functional and taxonomic classification problems. In the case of hierarchical classification schemes, the classification tasks can be defined at various levels of the hierarchy (such as classes, folds, superfamilies, etc.). For each dataset there are distance matrices available that contain all vs. all comparison of the data, based on various sequence or structure comparison methods, as well as a set of classification performance measures computed with various classifier algorithms.
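
As a rough illustration of how one such task could be used, the sketch below scores test items with a nearest-neighbour rule on an all vs. all distance matrix and summarises the ranking with ROC AUC. Everything in it is an assumption made for illustration: the toy 6x6 matrix, the +1/-1 label encoding, the function names, and the choice of nearest-neighbour scoring and AUC. It does not reproduce the collection's actual file formats or the specific classifiers and measures used by its authors.

```python
from typing import Dict, List

# Hypothetical toy data: a symmetric all-vs-all distance matrix for six items.
# Items 0-2 belong to the positive group, items 3-5 to the negative group.
dist: List[List[float]] = [
    [0, 1, 2, 8, 9, 7],
    [1, 0, 1, 9, 8, 8],
    [2, 1, 0, 7, 8, 9],
    [8, 9, 7, 0, 1, 2],
    [9, 8, 8, 1, 0, 1],
    [7, 8, 9, 2, 1, 0],
]

# One classification task: a positive/negative labelling split into train and test.
train_labels: Dict[int, int] = {0: 1, 1: 1, 3: -1, 4: -1}
test_labels: Dict[int, int] = {2: 1, 5: -1}


def nearest_neighbor_scores(
    dist: List[List[float]], train_labels: Dict[int, int], test_ids: List[int]
) -> Dict[int, float]:
    """Score = distance to nearest negative minus distance to nearest positive.

    Higher scores mean the item looks more like the positive training set.
    """
    pos = [i for i, y in train_labels.items() if y == 1]
    neg = [i for i, y in train_labels.items() if y == -1]
    scores = {}
    for t in test_ids:
        d_pos = min(dist[t][i] for i in pos)
        d_neg = min(dist[t][i] for i in neg)
        scores[t] = d_neg - d_pos
    return scores


def roc_auc(scores: Dict[int, float], labels: Dict[int, int]) -> float:
    """Fraction of (positive, negative) test pairs ranked correctly; ties count 0.5."""
    pos = [scores[i] for i, y in labels.items() if y == 1]
    neg = [scores[i] for i, y in labels.items() if y == -1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


scores = nearest_neighbor_scores(dist, train_labels, list(test_labels))
auc = roc_auc(scores, test_labels)
print(scores, auc)
```

Because the scoring only reads precomputed distances, the same two functions could be applied to any sequence or structure comparison method that yields such a matrix, which is the kind of like-for-like comparison the abstract describes.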

SUBMITTER: Sonego P

PROVIDER: S-EPMC1669728 | biostudies-literature | 2007 Jan

REPOSITORIES: biostudies-literature

Publications

A Protein Classification Benchmark collection for machine learning.

Paolo Sonego, Mircea Pacurar, Somdutta Dhir, Attila Kertész-Farkas, András Kocsor, Zoltán Gáspári, Jack A M Leunissen, Sándor Pongor

Nucleic Acids Research, 2006-11-16, Database issue

PMID: 17142240

Similar Datasets

Benchmark on a large cohort for sleep-wake classification with machine learning techniques. | S-EPMC6555808 | biostudies-literature
A systematic benchmark of machine learning methods for protein-RNA interaction prediction. | S-EPMC10516373 | biostudies-literature
Multi-site benchmark classification of major depressive disorder using machine learning on cortical and subcortical measures. | S-EPMC10784593 | biostudies-literature
MoleculeNet: a benchmark for molecular machine learning. | S-EPMC5868307 | biostudies-literature
TECRR: a benchmark dataset of radiological reports for BI-RADS classification with machine learning, deep learning, and large language model baselines. | S-EPMC11515610 | biostudies-literature
A benchmark dataset for machine learning in ecotoxicology. | S-EPMC10584858 | biostudies-literature
Photosynthetic protein classification using genome neighborhood-based machine learning feature. | S-EPMC7189237 | biostudies-literature
Ion-pumping microbial rhodopsin protein classification by machine learning approach. | S-EPMC9881276 | biostudies-literature
Modern Machine Learning as a Benchmark for Fitting Neural Responses. | S-EPMC6060269 | biostudies-literature
Drone and ground-truth data collection, image annotation and machine learning: A protocol for coastal habitat mapping and classification. | S-EPMC11409010 | biostudies-literature