Dataset Information

SpanSeq: similarity-based sequence data splitting method for improved development and assessment of deep learning projects.


ABSTRACT: The use of deep learning models in computational biology has increased massively in recent years, and it is expected to continue with current advances in fields such as Natural Language Processing. These models, although able to draw complex relations between input and target, are also inclined to learn noisy deviations from the pool of data used during their development. In order to assess their performance on unseen data (their capacity to generalize), it is common to split the available data randomly into development (train/validation) and test sets. This procedure, although standard, has been shown to produce dubious assessments of generalization because of the similarity between samples in the databases used. In this work, we present SpanSeq, a database partition method for machine learning that can scale to most biological sequences (genes, proteins and genomes) in order to avoid data leakage between sets. We also explore the effect of not restraining similarity between sets by reproducing the development of two state-of-the-art models in bioinformatics, not only confirming the consequences of randomly splitting databases for model assessment, but also extending those repercussions to model development. SpanSeq is available at https://github.com/genomicepidemiology/SpanSeq.
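The idea of restraining similarity between sets can be illustrated with a minimal sketch. This is not SpanSeq's algorithm or interface; it is a toy stand-in that assumes k-mer Jaccard similarity as the similarity measure, single-linkage clustering, and a greedy assignment of whole clusters to partitions. The function names, the value of `k`, and the threshold are all hypothetical choices made for illustration.

```python
from itertools import combinations


def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between two k-mer sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_similarity(seqs, k=3, threshold=0.5):
    """Single-linkage clustering: link every pair at or above the threshold."""
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    profiles = [kmers(s, k) for s in seqs]
    for i, j in combinations(range(len(seqs)), 2):
        if jaccard(profiles[i], profiles[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def split_clusters(clusters, n_parts=2):
    """Place clusters, largest first, into the currently smallest partition."""
    parts = [[] for _ in range(n_parts)]
    for cluster in sorted(clusters, key=len, reverse=True):
        min(parts, key=len).extend(cluster)
    return parts


# Toy sequences (made up for this example, not taken from any database).
seqs = ["ACGTACGTAC", "ACGTACGTAA", "TTGGCCTTGG", "TTGGCCTTGA", "GGGGGGGGGG"]
clusters = cluster_by_similarity(seqs)
parts = split_clusters(clusters, n_parts=2)
```

Because whole clusters are kept together, no pair of sequences at or above the threshold ends up in different partitions, which is the property a random split does not guarantee. The all-pairs comparison here is quadratic and would not scale to real databases.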

SUBMITTER: Ferrer Florensa A 

PROVIDER: S-EPMC11327874 | biostudies-literature | 2024 Sep

REPOSITORIES: biostudies-literature

Publications

SpanSeq: similarity-based sequence data splitting method for improved development and assessment of deep learning projects.

Alfred Ferrer Florensa, Jose Juan Almagro Armenteros, Henrik Nielsen, Frank Møller Aarestrup, Philip Thomas Lanken Conradsen Clausen

NAR Genomics and Bioinformatics, 2024-08-16, issue 3


Similar Datasets

S-EPMC9898993 | biostudies-literature
S-EPMC7613299 | biostudies-literature
S-EPMC8432852 | biostudies-literature
S-EPMC1949826 | biostudies-literature
S-EPMC10138783 | biostudies-literature
S-EPMC10393634 | biostudies-literature
GSE218466 | GEO | 2022-12-22
S-EPMC9477370 | biostudies-literature
S-EPMC7148117 | biostudies-literature
S-EPMC9580936 | biostudies-literature