Dataset Information

Dataset size and composition impact the reliability of performance benchmarks for peptide-MHC binding predictions.

ABSTRACT: It is important to accurately determine the performance of peptide:MHC binding predictions, as this enables users to compare and choose between different prediction methods and provides estimates of the expected error rate. Two common approaches to determine prediction performance are cross-validation, in which all available data are iteratively split into training and testing data, and the use of blind sets generated separately from the data used to construct the predictive method. In the present study, we have compared cross-validated prediction performances generated on our last benchmark dataset from 2009 with prediction performances generated on data subsequently added to the Immune Epitope Database (IEDB), which served as a blind set. We found that cross-validated performances systematically overestimated performance on the blind set. This was found not to be due to the presence of similar peptides in the cross-validation dataset. Rather, we found that small size and low sequence/affinity diversity of either training or blind datasets were associated with large differences in cross-validated vs. blind prediction performances. We use these findings to derive quantitative rules of how large and diverse datasets need to be to provide generalizable performance estimates. It has long been known that cross-validated prediction performance estimates often overestimate performance on independently generated blind set data. We here identify and quantify the specific factors contributing to this effect for MHC-I binding predictions. An increasing number of peptides for which MHC binding affinities are measured experimentally have been selected based on binding predictions and thus are less diverse than historic datasets sampling the entire sequence and affinity space, making them more difficult benchmark data sets. This has to be taken into account when comparing performance metrics between different benchmarks, and when deriving error estimates for predictions based on benchmark performance.
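
The two evaluation approaches named in the abstract can be contrasted with a minimal Python sketch. Everything in it is an assumption made for illustration and is not the study's method: the peptides and affinity values are arbitrary placeholders rather than measurements, the predictor is a toy 1-nearest-neighbour lookup by Hamming distance, and the error metric (mean absolute error) and all function names are hypothetical.

```python
from statistics import mean


def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) so each item is tested exactly once."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n_items) if i not in held_out]
        yield train, test


def hamming(a, b):
    """Count mismatched positions between two equal-length peptides."""
    return sum(x != y for x, y in zip(a, b))


def predict(train, peptide):
    """Toy predictor (assumption): affinity of the closest training peptide."""
    nearest = min(train, key=lambda item: hamming(item[0], peptide))
    return nearest[1]


def mean_abs_error(train, test):
    """Mean absolute difference between predicted and recorded values."""
    return mean(abs(predict(train, pep) - aff) for pep, aff in test)


def cross_validated_error(data, k):
    """Average held-out error over k splits of a single dataset."""
    errors = []
    for train_idx, test_idx in k_fold_splits(len(data), k):
        train = [data[i] for i in train_idx]
        test = [data[i] for i in test_idx]
        errors.append(mean_abs_error(train, test))
    return mean(errors)


def blind_error(data, blind):
    """Train on the full dataset, evaluate on separately generated data."""
    return mean_abs_error(data, blind)


# Placeholder (peptide, score) pairs; the values are invented, not measured.
benchmark = [
    ("AAAAAAAAL", 0.80), ("AAAAAAAAV", 0.75), ("KKKKKKKKL", 0.30),
    ("KKKKKKKKV", 0.25), ("GGGGLGGGL", 0.60), ("GGGGLGGGV", 0.55),
]
blind = [("AAKAAAAAL", 0.50), ("GGKGLGGGV", 0.40)]

cv_err = cross_validated_error(benchmark, k=3)
blind_err = blind_error(benchmark, blind)
print(f"cross-validated MAE: {cv_err:.3f}")
print(f"blind-set MAE:       {blind_err:.3f}")
```

Comparing the two printed numbers for a real predictor and real data is the kind of comparison the abstract describes; with a dataset this small, the gap between them says nothing beyond illustrating the mechanics.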

SUBMITTER: Kim Y

PROVIDER: S-EPMC4111843 | biostudies-other | 2014 Jul

REPOSITORIES: biostudies-other

Publications

Dataset size and composition impact the reliability of performance benchmarks for peptide-MHC binding predictions.

Kim Y, Sidney J, Buus S, Sette A, Nielsen M, Peters B

BMC Bioinformatics, 2014 Jul 14

PMID: 25017736

Similar Datasets

Project description: Introduction: In brain-computer interfaces (BCI) research, recording data is time-consuming and expensive, which limits access to big datasets. This may influence the BCI system performance as machine learning methods depend strongly on the training dataset size. Important questions arise: taking into account neuronal signal characteristics (e.g., non-stationarity), can we achieve higher decoding performance with more data to train decoders? What is the perspective for further improvement with time in the case of long-term BCI studies? In this study, we investigated the impact of long-term recordings on motor imagery decoding from two main perspectives: model requirements regarding dataset size and potential for patient adaptation. Methods: We evaluated the multilinear model and two deep learning (DL) models on a long-term BCI & Tetraplegia (ClinicalTrials.gov identifier: NCT02550522) clinical trial dataset containing 43 sessions of ECoG recordings performed with a tetraplegic patient. In the experiment, a participant executed 3D virtual hand translation using motor imagery patterns. We designed multiple computational experiments in which training datasets were increased or translated to investigate the relationship between models' performance and different factors influencing recordings. Results: Our results showed that DL decoders showed similar requirements regarding the dataset size compared to the multilinear model while demonstrating higher decoding performance. Moreover, high decoding performance was obtained with relatively small datasets recorded later in the experiment, suggesting motor imagery patterns improvement and patient adaptation during the long-term experiment. Finally, we proposed UMAP embeddings and local intrinsic dimensionality as a way to visualize the data and potentially evaluate data quality. Discussion: DL-based decoding is a prospective approach in BCI which may be efficiently applied with real-life dataset size. Patient-decoder co-adaptation is an important factor to consider in long-term clinical BCI.
