Dataset Information

Literature consistency of bioinformatics sequence databases is effective for assessing record quality.

ABSTRACT: Bioinformatics sequence databases such as GenBank or UniProt contain hundreds of millions of records of genomic data. These records are derived from direct submissions from individual laboratories, as well as from bulk submissions from large-scale sequencing centres; their diversity and scale mean that they suffer from a range of data quality issues including errors, discrepancies, redundancies, ambiguities, incompleteness and inconsistencies with the published literature. In this work, we investigate and analyze the data quality of sequence databases from the perspective of a curator, who must detect anomalous and suspicious records. Specifically, we emphasize the detection of records that are inconsistent with the literature. Focusing on GenBank, we propose a set of 24 quality indicators, which are based on treating a record as a query into the published literature and then applying query quality predictors. We then carry out an analysis showing that the proposed quality indicators and the quality of the records have a mutual relationship, in which one depends on the other. We propose to represent record-literature consistency as a vector of these quality indicators. By reducing the dimensionality of this representation for visualization purposes using principal component analysis, we show that records which have been reported as inconsistent with the literature fall roughly in the same area, and therefore share similar characteristics. By manually analyzing records not previously known to be erroneous that fall in the same area as records known to be inconsistent, we show that one record out of four is inconsistent with respect to the literature. This high density of inconsistent records opens the way towards the development of automatic methods for the detection of faulty records. We conclude that literature inconsistency is a meaningful strategy for identifying suspicious records. Database URL: https://github.com/rbouadjenek/DQBioinformatics.
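The representation described in the abstract (one vector of 24 quality indicators per record, reduced to two dimensions with principal component analysis) can be illustrated with a minimal sketch. This is not the authors' code: the indicator values below are synthetic random numbers, the group sizes and the shift applied to the "flagged" group are arbitrary assumptions, and the names `pca_2d`, `top_eigenpair` and `mean_pc1` are hypothetical helpers. PCA is computed here with power iteration and deflation, using only the Python standard library.

```python
import math
import random

N_INDICATORS = 24  # number of quality indicators named in the abstract


def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix (d x d) of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def mat_vec(matrix, v):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in matrix]


def top_eigenpair(matrix, iterations=500):
    """Power iteration: dominant eigenvector and eigenvalue of a symmetric matrix."""
    rng = random.Random(0)
    v = [rng.random() + 0.1 for _ in matrix]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = mat_vec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(matrix, v)))
    return v, eigenvalue


def pca_2d(rows):
    """Project rows onto their first two principal components."""
    centered = center(rows)
    cov = covariance(centered)
    components = []
    for _ in range(2):
        v, lam = top_eigenpair(cov)
        components.append(v)
        # Deflation: remove the component just found before searching for the next.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(len(v))]
               for i in range(len(v))]
    projected = [[sum(a * b for a, b in zip(r, c)) for c in components]
                 for r in centered]
    return projected, components


# Synthetic stand-in data (assumption): 40 ordinary records and 10 records
# whose indicators are shifted, mimicking records flagged as inconsistent.
rng = random.Random(42)
records = [[rng.gauss(0.0, 1.0) for _ in range(N_INDICATORS)] for _ in range(40)]
records += [[rng.gauss(3.0, 1.0) for _ in range(N_INDICATORS)] for _ in range(10)]
labels = [False] * 40 + [True] * 10

projected, components = pca_2d(records)


def mean_pc1(flag):
    """Mean first-component coordinate of the records with the given label."""
    values = [p[0] for p, is_flagged in zip(projected, labels) if is_flagged == flag]
    return sum(values) / len(values)


print("mean PC1, flagged:", round(mean_pc1(True), 2))
print("mean PC1, others: ", round(mean_pc1(False), 2))
```

In this toy setting the flagged group separates from the rest along the first component only because the synthetic data were built that way; whether real GenBank records behave similarly is what the paper itself evaluates. The sign of each component is arbitrary, so only the distance between the groups is meaningful.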

SUBMITTER: Bouadjenek MR

PROVIDER: S-EPMC5467556 | biostudies-literature | 2017 Jan

REPOSITORIES: biostudies-literature


Publications

Literature consistency of bioinformatics sequence databases is effective for assessing record quality.

Bouadjenek Mohamed Reda, Verspoor Karin, Zobel Justin

Database: the journal of biological databases and curation, 2017-01-01, 1


PMID: 28365737

Similar Datasets

Project description:

Background: The increased use of electronic medical records (EMRs) in Canadian primary health care practice has resulted in an expansion of the availability of EMR data. Potential users of these data need to understand their quality in relation to the uses to which they are applied. Herein, we propose a basic model for assessing primary health care EMR data quality, comprising a set of data quality measures within four domains. We describe the process of developing and testing this set of measures, share the results of applying these measures in three EMR-derived datasets, and discuss what this reveals about the measures and EMR data quality. The model is offered as a starting point from which data users can refine their own approach, based on their own needs.

Methods: Using an iterative process, measures of EMR data quality were created within four domains: comparability; completeness; correctness; and currency. We used a series of process steps to develop the measures. The measures were then operationalized, and tested within three datasets created from different EMR software products.

Results: A set of eleven final measures were created. We were not able to calculate results for several measures in one dataset because of the way the data were collected in that specific EMR. Overall, we found variability in the results of testing the measures (e.g. sensitivity values were highest for diabetes, and lowest for obesity), among datasets (e.g. recording of height), and by patient age and sex (e.g. recording of blood pressure, height and weight).

Conclusions: This paper proposes a basic model for assessing primary health care EMR data quality. We developed and tested multiple measures of data quality, within four domains, in three different EMR-derived primary health care datasets. The results of testing these measures indicated that not all measures could be utilized in all datasets, and illustrated variability in data quality. This is one step forward in creating a standard set of measures of data quality. Nonetheless, each project has unique challenges, and therefore requires its own data quality assessment before proceeding.

