Dataset Information

CRISPR sequences are sometimes erroneously translated and can contaminate public databases with spurious proteins containing spaced repeats.


ABSTRACT: The genomics era is resulting in the generation of a plethora of biological sequences that are usually stored in public databases. There are many computational tools that facilitate the annotation of these sequences, but sometimes they produce mistakes that enter the databases and can be propagated when erroneous data are used for secondary analyses, such as gene prediction or homology searching. While developing a computational gene finder based on protein-coding sequences, we discovered that the reference UniProtKB protein database is contaminated with some spurious sequences translated from DNA containing clustered regularly interspaced short palindromic repeats. We therefore encourage developers of prokaryotic computational gene finders and protein database curators to consider this source of error.

SUBMITTER: Rubio A 

PROVIDER: S-EPMC7673337 | biostudies-literature | 2020 Jan

REPOSITORIES: biostudies-literature

Publications

CRISPR sequences are sometimes erroneously translated and can contaminate public databases with spurious proteins containing spaced repeats.

Rubio A, Mier P, Andrade-Navarro MA, Garzón A, Jiménez J, Pérez-Pulido AJ

Database: the journal of biological databases and curation, 2020-01-01


Similar Datasets

2009-11-30 | GSE18619 | GEO
2010-05-19 | E-GEOD-18619 | biostudies-arrayexpress
| S-EPMC152625 | biostudies-literature
| S-EPMC7168238 | biostudies-literature
| S-EPMC1790904 | biostudies-literature
| S-EPMC5105158 | biostudies-literature
| S-EPMC1896005 | biostudies-literature
| S-EPMC7821992 | biostudies-literature
| S-EPMC3257605 | biostudies-literature
| S-EPMC6604393 | biostudies-literature