Dataset Information

A merged lung cancer transcriptome dataset for clinical predictive modeling.


ABSTRACT: The Gene Expression Omnibus (GEO) database is an excellent public source of whole transcriptomic profiles of multiple cancers. The main challenge is the limited accessibility of such large-scale genomic data to people without a background in bioinformatics or computer science, which makes data analysis, sharing and visualization difficult. Here, we present an integrated bioinformatics pipeline and a normalized dataset that has been preprocessed using a robust statistical methodology, allowing others to perform large-scale meta-analysis without having to conduct time-consuming data mining and statistical correction. Comprising 1,118 patient-derived samples, the normalized dataset includes primary non-small cell lung cancer (NSCLC) tumors and paired normal lung tissues from ten independent GEO datasets, facilitating differential expression analysis. The data have been merged, normalized, batch effect-corrected and filtered for genes with low variance via multiple open source R packages integrated into our workflow. Overall, this dataset (with associated clinical metadata) better represents the diseased population and serves as a powerful tool for early predictive biomarker discovery.
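The preprocessing order described in the abstract (batch effect correction followed by low-variance gene filtering) can be illustrated with a minimal sketch. This is not the authors' pipeline, which uses R packages that this record does not name. Per-batch mean centering stands in here for a real batch correction method, and the gene names, batch labels, expression values, and variance threshold are all hypothetical.

```python
from statistics import mean, pvariance


def center_by_batch(expr, batches):
    """Subtract each gene's per-batch mean from its values.

    A crude stand-in for batch effect correction (assumption: the
    original workflow uses a dedicated R package instead).
    """
    corrected = {}
    for gene, values in expr.items():
        batch_means = {
            b: mean(v for v, vb in zip(values, batches) if vb == b)
            for b in set(batches)
        }
        corrected[gene] = [v - batch_means[b] for v, b in zip(values, batches)]
    return corrected


def filter_low_variance(expr, min_variance):
    """Keep only genes whose population variance reaches the threshold."""
    return {g: v for g, v in expr.items() if pvariance(v) >= min_variance}


# Hypothetical toy data: four samples drawn from two source datasets.
batches = ["A", "A", "B", "B"]
expr = {
    "GENE1": [1.0, 3.0, 11.0, 13.0],  # varies within each batch
    "GENE2": [5.0, 5.0, 9.0, 9.0],    # differs only between batches
}

corrected = center_by_batch(expr, batches)
kept = filter_low_variance(corrected, min_variance=0.5)
print(sorted(kept))
```

In this toy case the second gene differs only between batches, so centering flattens it and the variance filter drops it. That is why the filtering step is applied after correction rather than before.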

SUBMITTER: Lim SB 

PROVIDER: S-EPMC6057440 | biostudies-literature | 2018 Jul

REPOSITORIES: biostudies-literature

Publications

A merged lung cancer transcriptome dataset for clinical predictive modeling.

Lim Su Bin, Tan Swee Jin, Lim Wan-Teck, Lim Chwee Teck

Scientific Data, 2018-07-24


Similar Datasets

| S-EPMC5282551 | biostudies-literature
2021-02-18 | E-MTAB-10089 | biostudies-arrayexpress
| S-EPMC3221887 | biostudies-literature
2005-11-07 | GSE3141 | GEO
| S-EPMC8358057 | biostudies-literature
| S-EPMC7089634 | biostudies-literature
| S-EPMC6420731 | biostudies-literature
| S-EPMC7849382 | biostudies-literature
| S-EPMC8723929 | biostudies-literature
| S-EPMC3046483 | biostudies-literature