Dataset Information

Maximizing the reusability of gene expression data by predicting missing metadata.

ABSTRACT: Reusability is one of the FAIR data principles, which aim to make data Findable, Accessible, Interoperable, and Reusable. One current effort to increase the reusability of public genomics data focuses on including quality metadata with the data. When necessary metadata are missing, most researchers will consider the data useless. In this study, we developed a framework to predict the missing metadata of gene expression datasets to maximize their reusability. We found that when using predicted data to conduct other analyses, it is not optimal to use all the predicted data. Instead, one should use only the subset of data that can be predicted accurately. We proposed a new metric called Proportion of Cases Accurately Predicted (PCAP), which is optimized in our specifically designed machine learning pipeline. The new approach performed better than pipelines using common metrics such as F1-score in terms of maximizing the reusability of data with missing values. We also found that different variables might need to be predicted using different machine learning methods and/or different data processing protocols. Using differential gene expression analysis as an example, we showed that when missing variables are accurately predicted, the corresponding gene expression data can be reliably used in downstream analyses.
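The abstract names PCAP but does not give its formula, so the following is only a minimal sketch of one plausible reading: rank predictions by model confidence, then report the largest fraction of cases whose accuracy, taken from the most confident prediction downward, still meets a target. The function name `pcap`, the `target_accuracy` parameter, its 0.95 default, and the example values are all assumptions for illustration, not the authors' implementation.

```python
from typing import Sequence


def pcap(
    confidences: Sequence[float],
    correct: Sequence[bool],
    target_accuracy: float = 0.95,  # assumed threshold, not taken from the source
) -> float:
    """Return the largest fraction of cases, taken in order of decreasing
    confidence, whose accuracy is at least target_accuracy.

    This is an illustrative interpretation of "Proportion of Cases
    Accurately Predicted", not the published definition.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Indices of cases sorted from most to least confident.
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)

    best_k = 0  # size of the largest top-k subset meeting the target
    hits = 0    # number of correct predictions among the top k
    for k, i in enumerate(order, start=1):
        hits += bool(correct[i])
        if hits / k >= target_accuracy:
            best_k = k
    return best_k / n


# Hypothetical validation data: the three most confident predictions are right.
example_conf = [0.99, 0.95, 0.90, 0.60, 0.55]
example_correct = [True, True, True, False, True]
print(pcap(example_conf, example_correct))  # prints 0.6
```

Under this reading, only the top 60% of cases would be carried into downstream analysis, which matches the abstract's point that using every predicted value is not optimal.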

SUBMITTER: Lung PY

PROVIDER: S-EPMC7673503 | biostudies-literature | 2020 Nov

REPOSITORIES: biostudies-literature


Publications

Maximizing the reusability of gene expression data by predicting missing metadata.

Lung PY, Zhong D, Pang X, Li Y, Zhang J

PLoS Computational Biology, 2020 Nov 6, issue 11


PMID: 33156882

Similar Datasets

lesSDRF is more: maximizing the value of proteomics data through streamlined metadata annotation. | S-EPMC10598006 | biostudies-literature

Predicting biomedical metadata in CEDAR: A study of Gene Expression Omnibus (GEO). | S-EPMC5643580 | biostudies-literature

Predicting structured metadata from unstructured metadata. | S-EPMC4892825 | biostudies-literature

Multitask knowledge-primed neural network for predicting missing metadata and host phenotype based on human microbiome. | S-EPMC11676323 | biostudies-literature

SigCom LINCS: data and metadata search engine for a million gene expression signatures. | S-EPMC9252724 | biostudies-literature

Future-proofing and maximizing the utility of metadata: The PHA4GE SARS-CoV-2 contextual data specification package. | S-EPMC8847733 | biostudies-literature

Predicting gene knockout effects from expression data. | S-EPMC9938619 | biostudies-literature

Predicting proteome dynamics using gene expression data. | S-EPMC6138643 | biostudies-literature

Discovering missing reactions of metabolic networks by using gene co-expression data. | S-EPMC5288723 | biostudies-literature

Missing value imputation for microarray gene expression data using histone acetylation information. | S-EPMC2432074 | biostudies-literature