Dataset Information

Detecting and correcting misclassified sequences in the large-scale public databases.

ABSTRACT:

Motivation

As the cost of sequencing decreases, the amount of data being deposited into public repositories is increasing rapidly. Public databases rely on submitters to provide the metadata for each submission, and this metadata is prone to user error. Unfortunately, most public databases, such as the non-redundant (NR) database, have no methods for identifying errors in the provided metadata, which creates the potential for error propagation. Previous research analyzed misclassification in a small subset of the NR database based on sequence similarity. To the best of our knowledge, the amount of misclassification in the entire database has not been quantified. We propose a heuristic method to detect potentially misclassified taxonomic assignments in the NR database. We applied a curation technique and quality control to find the most probable taxonomic assignment. Our method incorporates the provenance and frequency of each annotation from manually and computationally created databases, together with clustering information at 95% similarity.
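The abstract names the ingredients of the heuristic (annotation provenance, annotation frequency, and clusters at 95% similarity) but not the exact scoring rule. The sketch below is only one plausible reading, not the authors' implementation: within a single cluster it picks the taxon with the highest provenance-weighted frequency and flags members that disagree. The weights, the function names `consensus_taxon` and `flag_misclassified`, and the exact-label comparison are all assumptions made for illustration. A real pipeline would more likely compare taxonomic lineages than raw labels.

```python
from collections import defaultdict

# Hypothetical weights: a manually curated annotation counts more than a
# computationally generated one. The values are illustrative assumptions.
PROVENANCE_WEIGHT = {"manual": 2.0, "computational": 1.0}


def consensus_taxon(annotations):
    """Return the taxon with the highest provenance-weighted frequency.

    `annotations` is a list of (taxon, provenance) pairs gathered from all
    members of one sequence cluster.
    """
    scores = defaultdict(float)
    for taxon, provenance in annotations:
        scores[taxon] += PROVENANCE_WEIGHT[provenance]
    # Iterating in sorted order means ties go to the alphabetically first taxon.
    return max(sorted(scores), key=lambda t: scores[t])


def flag_misclassified(cluster):
    """Return (consensus, ids) where ids disagree with the cluster consensus.

    `cluster` maps a sequence ID to its (taxon, provenance) annotation.
    """
    consensus = consensus_taxon(list(cluster.values()))
    flagged = [sid for sid, (taxon, _) in cluster.items() if taxon != consensus]
    return consensus, flagged


# Toy cluster (made-up IDs and annotations, not data from the study).
toy_cluster = {
    "seq1": ("Escherichia coli", "manual"),
    "seq2": ("Escherichia coli", "computational"),
    "seq3": ("Homo sapiens", "computational"),
}
print(flag_misclassified(toy_cluster))  # ('Escherichia coli', ['seq3'])
```

In this toy cluster the consensus scores 3.0 against 1.0, so only `seq3` is flagged. Flagged entries are candidates for review rather than confirmed errors, in keeping with the "potentially misclassified" wording of the abstract.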

Results

We found more than two million potentially taxonomically misclassified proteins in the NR database. Using simulated data, we show a high precision of 97% and a recall of 87% for detecting taxonomically misclassified proteins. The proposed approach and findings could also be applied to other databases.
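For readers less familiar with the two metrics quoted above, the following minimal sketch shows how precision and recall are conventionally computed for a set of flagged sequences against a known ground truth, such as a simulated dataset. The function name `precision_recall` and the toy IDs are hypothetical and do not reproduce the evaluation reported in the paper.

```python
def precision_recall(flagged, truly_misclassified):
    """Compute precision and recall for a set of flagged sequence IDs.

    precision = true positives / all flagged
    recall    = true positives / all truly misclassified
    """
    flagged = set(flagged)
    truth = set(truly_misclassified)
    true_positives = len(flagged & truth)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Toy example: 3 sequences flagged, 4 truly misclassified, 2 in common.
p, r = precision_recall({"a", "b", "c"}, {"a", "b", "d", "e"})
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.50
```

Read this way, a precision of 97% means that almost every flagged protein in the simulation was truly misclassified, while a recall of 87% means that about one in eight misclassified proteins went undetected.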

Availability and implementation

Source code, dataset, documentation, Jupyter notebooks and Docker container are available at https://github.com/boalang/nr.

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Bagheri H

PROVIDER: S-EPMC7821992 | biostudies-literature | 2020 Sep

REPOSITORIES: biostudies-literature

Publications

Detecting and correcting misclassified sequences in the large-scale public databases.

Hamid Bagheri, Andrew J. Severin, Hridesh Rajan

Bioinformatics (Oxford, England), 2020-09-01, issue 18

PMID: 32579213

Similar Datasets

Project description:

Background

TEM beta-lactamases are the main cause of resistance against beta-lactam antibiotics. Sequence information about TEM beta-lactamases is mainly found in the NCBI peptide database and the TEM mutation table at http://www.lahey.org/Studies/temtable.asp. While the TEM mutation table is manually curated by experts in the lactamase field, who guarantee reliable and consistent information, the rapidly growing sequence and annotation information from the NCBI peptide database is sometimes inconsistent. Therefore, the Lactamase Engineering Database (LacED) has been developed to collect the TEM beta-lactamase sequences from the NCBI peptide database and the TEM mutation table, systematically compare sequence information and naming, identify inconsistencies, and thus provide a versatile tool for reconciling the data and for investigating the sequence-function relationship.

Description

The LacED currently provides 2399 sequence entries and 37 structure entries. Sequence information on 150 different TEM beta-lactamases was derived from the TEM mutation table, which assigns a unique number to each protein classified as a TEM beta-lactamase. 293 TEM-like proteins were found in the NCBI protein database, but only 113 TEM beta-lactamases were common to both data sets. The 180 TEM beta-lactamases from the NCBI protein database that have not yet been assigned a TEM number fall into three classes: (1) 89 proteins from microbial organisms and 35 proteins from cloning or expression vectors had a new mutation profile; (2) 55 proteins had inconsistent annotation in terms of TEM assignment or reported mutation profile; (3) 39 proteins are fragments. The LacED is web accessible at http://www.LacED.uni-stuttgart.de and contains multisequence alignments, structure information and reconciled annotation of TEM beta-lactamases. The LacED is updated weekly and supplies all data for download.

Conclusion

The Lactamase Engineering Database enables a systematic analysis of TEM beta-lactamase sequence and annotation data from different data sources, and thus provides a valuable tool to identify inconsistencies in sequences from the NCBI peptide database, to detect TEM beta-lactamases with a novel mutation profile, and to identify new amino acid positions at which mutations can occur.
