Dataset Information

Training data composition affects performance of protein structure analysis algorithms.

ABSTRACT: The three-dimensional structures of proteins are crucial for understanding their molecular mechanisms and interactions. Machine learning algorithms that are able to learn accurate representations of protein structures are therefore poised to play a key role in protein engineering and drug development. The accuracy of such models in deployment is directly influenced by training data quality. The use of different experimental methods for protein structure determination may introduce bias into the training data. In this work, we evaluate the magnitude of this effect across three distinct tasks: estimation of model accuracy, protein sequence design, and catalytic residue prediction. Most protein structures are derived from X-ray crystallography, nuclear magnetic resonance (NMR), or cryo-electron microscopy (cryo-EM); we trained each model on datasets consisting of either all three structure types or of only X-ray data. We find that across these tasks, models consistently perform worse on test sets derived from NMR and cryo-EM than they do on test sets of structures derived from X-ray crystallography, but that the difference can be mitigated when NMR and cryo-EM structures are included in the training set. Importantly, we show that including all three types of structures in the training set does not degrade test performance on X-ray structures, and in some cases even increases it. Finally, we examine the relationship between model performance and the biophysical properties of each method, and recommend that the biochemistry of the task of interest should be considered when composing training sets.
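The experimental design described in the abstract (training on either X-ray-only or mixed-method data, then scoring test structures separately per method) can be illustrated with a minimal sketch. This is not the authors' code: the entry layout, the method labels, the helper names `compose_training_set` and `mean_score_by_method`, and the toy scores are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical method labels; not taken from the paper's code.
XRAY, NMR, CRYO_EM = "xray", "nmr", "cryo-em"


def compose_training_set(entries, allowed_methods):
    """Keep only entries whose experimental method is in allowed_methods."""
    allowed = set(allowed_methods)
    return [e for e in entries if e["method"] in allowed]


def mean_score_by_method(scored_entries):
    """Average a per-structure score separately for each experimental method."""
    buckets = defaultdict(list)
    for e in scored_entries:
        buckets[e["method"]].append(e["score"])
    return {method: mean(scores) for method, scores in buckets.items()}


# Toy placeholder entries (illustrative values only, not real results).
entries = [
    {"id": "a", "method": XRAY, "score": 0.8},
    {"id": "b", "method": XRAY, "score": 0.6},
    {"id": "c", "method": NMR, "score": 0.5},
    {"id": "d", "method": CRYO_EM, "score": 0.4},
]

# The two training-set compositions compared in the abstract.
xray_only = compose_training_set(entries, [XRAY])
all_methods = compose_training_set(entries, [XRAY, NMR, CRYO_EM])

# Report test performance per method rather than as one pooled number.
per_method = mean_score_by_method(entries)
```

Reporting the score per method, rather than pooled over the whole test set, is what makes a gap between X-ray and NMR or cryo-EM structures visible in the first place.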

SUBMITTER: Derry A

PROVIDER: S-EPMC8669736 | biostudies-literature

REPOSITORIES: biostudies-literature


Similar Datasets

Project description: OBJECTIVE: To identify non-EEG-based signals and algorithms for detection of motor and non-motor seizures in people lying in bed during video-EEG (VEEG) monitoring and to test whether these algorithms work in freely moving people during mobile EEG recordings. METHODS: Data of three groups of adult people with epilepsy (PwE) were analyzed. Group 1 underwent VEEG with additional devices (accelerometry, ECG, electrodermal activity); group 2 underwent VEEG; and group 3 underwent mobile EEG recordings, both including one-lead ECG. All seizure types were analyzed. Feature extraction and machine-learning techniques were applied to develop seizure detection algorithms. Performance was expressed as sensitivity, precision, F1 score, and false positives per 24 hours. RESULTS: The algorithms were developed in group 1 (35 PwE, 33 seizures) and achieved best results (F1 score 56%, sensitivity 67%, precision 45%, false positives 0.7/24 hours) when ECG features alone were used, with no improvement by including accelerometry and electrodermal activity. In group 2 (97 PwE, 255 seizures), this ECG-based algorithm largely achieved the same performance (F1 score 51%, sensitivity 39%, precision 73%, false positives 0.4/24 hours). In group 3 (30 PwE, 51 seizures), the same ECG-based algorithm failed to match the performance in groups 1 and 2 (F1 score 27%, sensitivity 31%, precision 23%, false positives 1.2/24 hours). ECG-based algorithms were also separately trained on data of groups 2 and 3 and tested on the data of the other groups, yielding maximal F1 scores between 8% and 26%. SIGNIFICANCE: Our results suggest that algorithms based on ECG features alone can provide clinically meaningful performance for automatic detection of all seizure types. Our study also underscores that the circumstances under which such algorithms were developed, and the selection of the training and test data sets, need to be considered and limit the application of such systems to unseen patient groups behaving in different conditions.

Project description: BACKGROUND: Recent advancements in high-throughput sequencing technologies have generated an unprecedented amount of genomic data that must be stored, processed, and transmitted over the network for sharing. Lossy genomic data compression, especially of the base quality values of sequencing data, is emerging as an efficient way to handle this challenge due to its superior compression performance compared to lossless compression methods. Many lossy compression algorithms have been developed for and evaluated using DNA sequencing data. However, whether these algorithms can be used on RNA sequencing (RNA-seq) data remains unclear. RESULTS: In this study, we evaluated the impacts of lossy quality value compression on common RNA-seq data analysis pipelines including expression quantification, transcriptome assembly, and short variant detection using RNA-seq data from different species and sequencing platforms. Our study shows that lossy quality value compression could effectively improve RNA-seq data compression. In some cases, lossy algorithms achieved up to 1.2-3 times further reduction on the overall RNA-seq data size compared to existing lossless algorithms. However, lossy quality value compression could affect the results of some RNA-seq data processing pipelines, and hence its impact on RNA-seq studies cannot be ignored in some cases. Pipelines using HISAT2 for alignment were most significantly affected by lossy quality value compression, while the effects of lossy compression on pipelines that do not depend on quality values, e.g., STAR-based expression quantification and transcriptome assembly pipelines, were not observed. Moreover, regardless of using either STAR or HISAT2 as the aligner, variant detection results were affected by lossy quality value compression, albeit to a lesser extent when a STAR-based pipeline was used. Our results also show that the impacts of lossy quality value compression depend on the compression algorithms being used and the compression levels if the algorithm supports setting of multiple compression levels. CONCLUSIONS: Lossy quality value compression can be incorporated into existing RNA-seq analysis pipelines to alleviate the data storage and transmission burdens. However, care should be taken in the selection of compression tools and levels based on the requirements of the downstream analysis pipelines to avoid introducing undesirable adverse effects on the analysis results.

