Dataset Information

ALE: automated label extraction from GEO metadata.

ABSTRACT:

BACKGROUND: NCBI's Gene Expression Omnibus (GEO) is a rich community resource containing millions of gene expression experiments from human, mouse, rat, and other model organisms. However, information about each experiment (metadata) is in the format of an open-ended, non-standardized textual description provided by the depositor. Thus, classification of experiments for meta-analysis by factors such as gender, age of the sample donor, and tissue of origin is not feasible without assigning labels to the experiments. Automated approaches are preferable for this, primarily because of the size and volume of the data to be processed, but also because they ensure standardization and consistency. While some of these labels can be extracted directly from the textual metadata, many of the available data sets do not contain explicit text informing the researcher about the age and gender of the subjects in the study. To bridge this gap, machine-learning methods can be trained to use the gene expression patterns associated with the text-derived labels to refine label-prediction confidence.

RESULTS: Our analysis shows that only 26% of metadata text contains information about gender and 21% about age. To ameliorate the lack of available labels for these data sets, we first extract labels from the textual metadata for each GEO RNA dataset and evaluate the performance against a gold standard of manually curated labels. We then use machine-learning methods to predict labels based on the gene expression of the samples, and compare this to the text-based method.

CONCLUSION: Here we present an automated method to extract labels for age, gender, and tissue from textual metadata and GEO data using both a heuristic approach and machine learning. We show that the two methods together improve the accuracy of label assignment to GEO samples.
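As a rough illustration of the text-based step described in the abstract, the sketch below pulls gender and age labels out of free-text sample metadata with regular expressions. It is not the authors' ALE implementation: the patterns, the `key: value` layout of the input, the unit conversion table, and the function names `extract_sex` and `extract_age_years` are all assumptions made for this example.

```python
import re
from typing import Optional

# Hypothetical patterns: the actual ALE rules are not given in this record.
# Both assume metadata written as "key: value" or "key = value".
_SEX_RE = re.compile(
    r"\b(?:sex|gender)\s*[:=]\s*(male|female|m|f)\b", re.IGNORECASE
)
_AGE_RE = re.compile(
    r"\bage\s*[:=]\s*(\d+(?:\.\d+)?)\s*"
    r"(years?|yrs?|y|months?|mo|weeks?|wks?|w|days?|d)?\b",
    re.IGNORECASE,
)

# Approximate conversion factors, keyed by the first letter of the unit.
_UNIT_TO_YEARS = {"y": 1.0, "m": 1 / 12, "w": 1 / 52, "d": 1 / 365}


def extract_sex(text: str) -> Optional[str]:
    """Return 'male' or 'female' if the text states it explicitly, else None."""
    match = _SEX_RE.search(text)
    if match is None:
        return None
    return "male" if match.group(1).lower().startswith("m") else "female"


def extract_age_years(text: str) -> Optional[float]:
    """Return the stated age in years, else None. Missing units mean years."""
    match = _AGE_RE.search(text)
    if match is None:
        return None
    value = float(match.group(1))
    unit = (match.group(2) or "years").lower()
    return value * _UNIT_TO_YEARS[unit[0]]


if __name__ == "__main__":
    sample = "gender: Female; age: 34 years; tissue: liver"
    print(extract_sex(sample), extract_age_years(sample))
```

Samples for which both functions return `None` are the ones the abstract says would need an expression-based prediction instead.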

SUBMITTER: Giles CB

PROVIDER: S-EPMC5751806 | biostudies-literature | 2017 Dec

REPOSITORIES: biostudies-literature

Publications

ALE: automated label extraction from GEO metadata.

Giles CB, Brown CA, Ripperger M, Dennis Z, Roopnarinesingh X, Porter H, Perz A, Wren JD

BMC Bioinformatics, 2017 Dec 28, Suppl 14

PMID: 29297276

Similar Datasets

Restructured GEO: restructuring Gene Expression Omnibus metadata for genome dynamics analysis. | S-EPMC6333964 | biostudies-literature

Predicting biomedical metadata in CEDAR: A study of Gene Expression Omnibus (GEO). | S-EPMC5643580 | biostudies-literature

Radtools: R utilities for convenient extraction of medical image metadata. | S-EPMC6518432 | biostudies-literature

Microbench: automated metadata management for systems biology benchmarking and reproducibility in Python. | S-EPMC9563693 | biostudies-literature

Building an annotated corpus for automatic metadata extraction from multilingual journal article references. | S-EPMC9858828 | biostudies-literature

A high-precision rule-based extraction system for expanding geospatial metadata in GenBank records. | S-EPMC4997033 | biostudies-literature

A Metadata Extraction Approach for Clinical Case Reports to Enable Advanced Understanding of Biomedical Concepts. | S-EPMC6235242 | biostudies-literature

CytoSeg 2.0: automated extraction of actin filaments. | S-EPMC7203740 | biostudies-literature

Olfactory Receptor Database: a metadata-driven automated population from sources of gene and protein sequences. | S-EPMC99065 | biostudies-literature