Dataset Information

Simple integrative preprocessing preserves what is shared in data sources.


ABSTRACT:

Background

The bioinformatics data analysis toolbox needs general-purpose, fast, and easily interpretable preprocessing tools that perform data integration during exploratory data analysis. Our focus is on vector-valued data sources, each consisting of measurements of the same entity but on different variables, and on tasks where source-specific variation is considered noise or otherwise uninteresting. Principal component analysis (PCA) of all sources concatenated together is an obvious choice if it is not important to distinguish between source-specific and shared variation. Canonical correlation analysis (CCA) focuses on mutual dependencies and discards source-specific "noise", but it produces a separate set of components for each source.

Results

It turns out that the components given by CCA can be combined easily to produce a linear, and hence fast and easily interpretable, feature extraction method. The method fuses several sources such that the properties they share are preserved, while source-specific variation is discarded as uninteresting. We give the details and implement them in a software tool. The method is demonstrated on gene expression measurements in three case studies: classification of cell cycle regulated genes in yeast, identification of differentially expressed genes in leukemia, and defining stress response in yeast. The software package is available at http://www.cis.hut.fi/projects/mi/software/drCCA/.
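The idea of fusing sources so that only shared variation survives can be illustrated with a minimal NumPy sketch. It assumes one common way of combining CCA-style components: whiten each source separately, concatenate the whitened sources, and run PCA on the result. The function names `whiten` and `fuse_sources` and the synthetic data are hypothetical illustrations, not the interface of the drCCA package, and the abstract does not confirm that the package works this way.

```python
import numpy as np

def whiten(X, eps=1e-8):
    """Center X (samples x variables) and rescale it to identity covariance."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    keep = s > eps * s.max()  # drop numerically degenerate directions
    # Columns of U are orthonormal, so this scaling gives unit variance.
    return U[:, keep] * np.sqrt(X.shape[0] - 1)

def fuse_sources(sources, n_components):
    """Assumed fusion scheme: PCA on the concatenation of whitened sources."""
    Z = np.hstack([whiten(X) for X in sources])
    Zc = Z - Z.mean(axis=0)
    U, s, _ = np.linalg.svd(Zc, full_matrices=False)
    # Leading components capture variation that the sources share.
    return U[:, :n_components] * s[:n_components]

# Synthetic example: one shared signal plus strong source-specific noise.
rng = np.random.default_rng(0)
n = 500
z = rng.standard_normal((n, 1))  # shared latent signal
X1 = np.hstack([z + 0.1 * rng.standard_normal((n, 1)),
                5.0 * rng.standard_normal((n, 3))])
X2 = np.hstack([z + 0.1 * rng.standard_normal((n, 1)),
                5.0 * rng.standard_normal((n, 3))])

fused = fuse_sources([X1, X2], n_components=2)
corr = abs(np.corrcoef(fused[:, 0], z[:, 0])[0, 1])
print(f"|correlation| of first fused component with shared signal: {corr:.3f}")
```

Because each source is whitened first, high-variance but source-specific directions no longer dominate. Plain PCA on the raw concatenation would instead favor the large noise columns in this example.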

Conclusion

We introduced a data fusion method for exploratory data analysis, intended for settings where statistical dependencies between the sources, rather than within a source, are of interest. The method uses canonical correlation analysis in a new way for dimensionality reduction and inherits its good properties: it is simple, fast, and easily interpretable as a linear projection.

SUBMITTER: Tripathi A 

PROVIDER: S-EPMC2278131 | biostudies-literature | 2008 Feb

REPOSITORIES: biostudies-literature

Publications

Simple integrative preprocessing preserves what is shared in data sources.

Abhishek Tripathi, Arto Klami, Samuel Kaski

BMC Bioinformatics, 2008-02-21


Similar Datasets

| S-EPMC8596781 | biostudies-literature
| S-EPMC7738550 | biostudies-literature
| S-EPMC4652620 | biostudies-literature
| S-EPMC3909228 | biostudies-literature
| S-EPMC6500068 | biostudies-literature
| S-EPMC6352382 | biostudies-literature
| S-EPMC3769145 | biostudies-literature
| S-EPMC7409520 | biostudies-literature