Dataset Information

A flexible, interpretable, and accurate approach for imputing the expression of unmeasured genes.


ABSTRACT: While there are >2 million publicly-available human microarray gene-expression profiles, these profiles were measured using a variety of platforms that each cover a pre-defined, limited set of genes. Therefore, key to reanalyzing and integrating this massive data collection are methods that can computationally reconstitute the complete transcriptome in partially-measured microarray samples by imputing the expression of unmeasured genes. Current state-of-the-art imputation methods are tailored to samples from a specific platform and rely on gene-gene relationships regardless of the biological context of the target sample. We show that sparse regression models that capture sample-sample relationships (termed SampleLASSO), built on-the-fly for each new target sample to be imputed, outperform models based on fixed gene relationships. Extensive evaluation involving three machine learning algorithms (LASSO, k-nearest-neighbors, and deep-neural-networks), two gene subsets (GPL96-570 and LINCS), and multiple imputation tasks (within and across microarray/RNA-seq datasets) establishes that SampleLASSO is the most accurate model. Additionally, we demonstrate the biological interpretability of this method by showing that, for imputing a target sample from a certain tissue, SampleLASSO automatically leverages training samples from the same tissue. Thus, SampleLASSO is a simple, yet powerful and flexible approach for harmonizing large-scale gene-expression data.
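The sample-sample regression idea described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes a training matrix laid out as genes by samples with every gene measured, treats each training sample as a feature and each measured gene as an observation, and fits a LASSO by plain coordinate descent. The function names (`fit_lasso`, `sample_lasso_impute`), the toy data, and the penalty `alpha=0.04` are all hypothetical choices made for this example.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the LASSO proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso(rows, y, alpha, n_iter=200):
    """Coordinate descent for (1/2n)*||y - b0 - X b||^2 + alpha*||b||_1."""
    n, p = len(rows), len(rows[0])
    # Center columns and response so the intercept can be recovered afterwards.
    col_mean = [sum(r[j] for r in rows) / n for j in range(p)]
    y_mean = sum(y) / n
    X = [[r[j] - col_mean[j] for j in range(p)] for r in rows]
    residual = [v - y_mean for v in y]
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    coef = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # constant column carries no information
            # Covariance of column j with the residual that excludes column j.
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n
            rho += col_sq[j] * coef[j]
            new = soft_threshold(rho, alpha) / col_sq[j]
            delta = new - coef[j]
            if delta:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
                coef[j] = new
    intercept = y_mean - sum(m * c for m, c in zip(col_mean, coef))
    return intercept, coef


def sample_lasso_impute(train, measured, unmeasured, target_values, alpha=0.04):
    """Fit target ~ training samples on measured genes, predict the rest.

    train: genes x samples matrix (list of rows), all genes measured.
    measured / unmeasured: gene row indices for the target sample.
    target_values: target expression at the measured genes.
    """
    rows = [train[g] for g in measured]
    intercept, coef = fit_lasso(rows, target_values, alpha)
    imputed = [
        intercept + sum(w * x for w, x in zip(coef, train[g]))
        for g in unmeasured
    ]
    return imputed, coef


# Toy data (hypothetical): 6 genes x 3 training samples.
train = [
    [6.0, 9.0, 4.0],
    [6.0, 5.0, 2.0],
    [4.0, 9.0, 2.0],
    [4.0, 5.0, 4.0],
    [1.0, 10.0, 3.0],  # unmeasured in the target
    [8.0, 2.0, 6.0],   # unmeasured in the target
]
measured, unmeasured = [0, 1, 2, 3], [4, 5]
# The target matches training sample 1 on the measured genes.
target_values = [9.0, 5.0, 9.0, 5.0]

imputed, weights = sample_lasso_impute(train, measured, unmeasured, target_values)
print("sample weights:", weights)
print("imputed values:", imputed)
```

In this toy case the model is refit for the single target sample, and the sparse weights fall entirely on the one training sample that resembles it. That is a small-scale analogue of the abstract's observation that training samples from the same tissue are selected automatically. In practice a library solver and a tuned penalty would replace the hand-written loop.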

SUBMITTER: Mancuso CA 

PROVIDER: S-EPMC7708069 | biostudies-literature | 2020 Dec

REPOSITORIES: biostudies-literature

Publications

A flexible, interpretable, and accurate approach for imputing the expression of unmeasured genes.

Christopher A Mancuso, Jacob L Canfield, Deepak Singla, Arjun Krishnan

Nucleic Acids Research, 2020 Dec 1, issue 21


Similar Datasets

S-EPMC5084780 | biostudies-literature
S-EPMC4678722 | biostudies-literature
S-EPMC7377334 | biostudies-literature
S-EPMC7062144 | biostudies-literature
S-EPMC5408923 | biostudies-literature
S-EPMC3645958 | biostudies-literature
S-EPMC4300824 | biostudies-literature
S-EPMC8122155 | biostudies-literature
S-EPMC3505039 | biostudies-literature
S-EPMC7494000 | biostudies-literature