Dataset Information

SpanSeq: similarity-based sequence data splitting method for improved development and assessment of deep learning projects.

ABSTRACT: The use of deep learning models in computational biology has increased massively in recent years, and it is expected to continue with current advances in fields such as Natural Language Processing. These models, although able to draw complex relations between input and target, are also inclined to learn noisy deviations from the pool of data used during their development. In order to assess their performance on unseen data (their capacity to generalize), it is common to split the available data randomly into development (train/validation) and test sets. This procedure, although standard, has been shown to produce dubious assessments of generalization due to the existing similarity between samples in the databases used. In this work, we present SpanSeq, a database partition method for machine learning that can scale to most biological sequences (genes, proteins and genomes) in order to avoid data leakage between sets. We also explore the effect of not restraining similarity between sets by reproducing the development of two state-of-the-art models in bioinformatics, not only confirming the consequences of randomly splitting databases on the model assessment, but also extending those repercussions to the model development. SpanSeq is available at https://github.com/genomicepidemiology/SpanSeq.
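
The idea of restraining similarity between sets can be illustrated with a minimal sketch. This is not SpanSeq's implementation or interface: the k-mer Jaccard similarity, the single-linkage grouping, the greedy balancing and every function name below are assumptions chosen only to show the general principle that similar sequences should land in the same partition.

```python
from itertools import combinations


def kmers(seq, k=3):
    """Return the set of k-mers in a sequence (hypothetical similarity proxy)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between two k-mer sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_split(seqs, n_parts=2, threshold=0.5, k=3):
    """Assign each sequence a partition index so that any pair whose
    similarity is >= threshold ends up in the same partition."""
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Single-linkage grouping: link every pair at or above the threshold.
    profiles = [kmers(s, k) for s in seqs]
    for i, j in combinations(range(len(seqs)), 2):
        if jaccard(profiles[i], profiles[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)

    # Greedy balancing: largest cluster first, into the emptiest partition.
    sizes = [0] * n_parts
    assignment = [None] * len(seqs)
    for members in sorted(clusters.values(), key=len, reverse=True):
        target = sizes.index(min(sizes))
        for i in members:
            assignment[i] = target
        sizes[target] += len(members)
    return assignment


# Toy data: two pairs of near-duplicates and one unrelated sequence.
seqs = ["ACGTACGTAC", "ACGTACGTAA", "TTGGCCAATT", "TTGGCCAATG", "GGGGGGGGGG"]
parts = similarity_split(seqs, n_parts=2, threshold=0.5)
```

Because whole clusters are assigned together, no pair at or above the threshold is divided between partitions, which is the property a random split cannot guarantee. The all-against-all comparison here is quadratic and would not scale to large databases; it is used only to keep the sketch short.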

SUBMITTER: Ferrer Florensa A

PROVIDER: S-EPMC11327874 | biostudies-literature | 2024 Sep

REPOSITORIES: biostudies-literature

Publications

SpanSeq: similarity-based sequence data splitting method for improved development and assessment of deep learning projects.

Alfred Ferrer Florensa, Jose Juan Almagro Armenteros, Henrik Nielsen, Frank Møller Aarestrup, Philip Thomas Lanken Conradsen Clausen

NAR Genomics and Bioinformatics, 2024-08-16, issue 3

PMID: 39157582

Similar Datasets

Project description: Spectral similarity calculation is widely used in protein identification tools and mass spectra clustering algorithms to compare theoretical or experimental spectra. The performance of the spectral similarity calculation plays an important role in these tools and algorithms, especially in the analysis of large-scale datasets. Recently, deep learning methods have been proposed to improve the performance of clustering algorithms and protein identification by training the algorithms with existing data and by using multiple spectra and identified peptide features. While the efficiency of these algorithms is still under study in comparison with traditional approaches, their application in proteomics data analysis is becoming more common. Here, we propose the use of deep learning to improve spectral similarity comparison. We assessed the performance of deep learning for spectral similarity with GLEAMS and a newly trained embedder model (DLEAMSE), which uses high-quality spectra from PRIDE Cluster. We also developed a new bioinformatics tool (mslookup - https://github.com/bigbio/DLEAMSE/) that allows users to quickly search for spectra among previously identified mass spectra published in public repositories and spectral libraries. Finally, we released a human database to enable bioinformaticians and biologists to search for identified spectra on their own machines. SIGNIFICANCE STATEMENT: Spectral similarity calculation plays an important role in proteomics data analysis. With deep learning's ability to learn implicit and effective features from large-scale training datasets, deep learning-based MS/MS spectra embedding models have emerged as a solution to improve similarity calculation in mass spectral clustering algorithms. We compare multiple similarity scoring and deep learning methods in terms of accuracy (computing the similarity for a pair of mass spectra) and computing-time performance. The benchmark results showed no major differences in accuracy between DLEAMSE and the normalized dot product (NDP) for spectrum similarity calculations. The DLEAMSE GPU implementation is faster than NDP in preprocessing on the GPU server, and the DLEAMSE similarity calculation (Euclidean distance on 32-D vectors) takes about 1/3 of the time of dot product calculations. The DLEAMSE encoding and embedding steps need to run only once for each spectrum, and the embedded 32-D points can be persisted in the repository, which speeds up later comparisons on large-scale data. Based on these results, we propose a new tool, mslookup, that enables researchers to find spectra previously identified in public data. The tool can also be used to generate in-house databases of previously identified spectra to share with other laboratories and consortiums.

Project description: Recent climate change (CC) scenarios from the Coupled Model Intercomparison Project Phase 6 (CMIP6) have recently been released at coarse resolution. Deep learning (DL) based statistical downscaling has recently been used, but more research is needed, particularly in arid regions, because little is known about its suitability for extrapolating future CC scenarios. Here we analyzed this issue by downscaling maximum and minimum temperature over the Egyptian domain, based on one General Circulation Model (GCM), CanESM5, and two shared socioeconomic pathways (SSPs), SSP4.5 and SSP8.5, from CMIP6, using a Convolutional Neural Network (CNN), hereinafter called CNNSD. The maximum and minimum temperatures downscaled with CNNSD reproduced the observed climate over historical and future periods at a finer resolution (0.1°), reducing the biases exhibited by the original scenario. To the best of our knowledge, this is the first time a CNN has been used to downscale CMIP6 scenarios, particularly in arid regions. The downscaled analysis showed that maximum and minimum temperatures are expected to rise by 4.8 °C and 4.0 °C, respectively, in the future (2015-2100), compared to the historical period, under the moderate scenario (SSP4.5). Meanwhile, under the Fossil-fueled Development scenario (SSP8.5), these values will rise by 6.3 °C and 4.2 °C, respectively, as analyzed by the CNNSD. The developed approach could be used not only in Egypt but also in other developing countries, which are especially vulnerable to climate change and have a scarcity of related research. The output of the established downscaling approach can be used to provide climate services, as a driver for impact studies and adaptation decisions, and as information for policy development. More research is needed, however, to include multiple GCMs to quantify the uncertainties between GCMs and SSPs, improving the outputs for use in climate change impacts and adaptations for food and nutrition security.
