Dataset Information

Non-specific filtering of beta-distributed data.

ABSTRACT:

Background

Non-specific feature selection is a dimension reduction procedure performed prior to cluster analysis of high dimensional molecular data. Not all measured features are expected to show biological variation, so only the most variable are selected for analysis. In DNA methylation studies, DNA methylation is measured as a proportion, bounded between 0 and 1, with variance a function of the mean. Filtering on standard deviation therefore biases the selection of probes toward those with mean values near 0.5. We explore the effect this has on clustering, and develop alternative filter methods that use a variance-stabilizing transformation for Beta distributed data and do not share this bias.
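
The bias described in the background can be illustrated with a small simulation. This is a minimal sketch, not the authors' implementation: the abstract does not name the variance-stabilizing transformation, so the arcsine square-root transform is used here as an assumed stand-in, and the function names, the precision value of 50, and the sample sizes are all hypothetical choices for illustration.

```python
import math
import random
import statistics


def simulate_feature(mean, precision, n_samples, rng):
    """Draw proportions from Beta(mean * precision, (1 - mean) * precision)."""
    a = mean * precision
    b = (1.0 - mean) * precision
    return [rng.betavariate(a, b) for _ in range(n_samples)]


def sd_stat(values):
    """Filter statistic: standard deviation of the raw proportions."""
    return statistics.stdev(values)


def vst_sd_stat(values):
    """Filter statistic: standard deviation after arcsin(sqrt(x)).

    The arcsine square-root transform is an ASSUMED choice of
    variance-stabilizing transformation, not taken from the abstract.
    """
    return statistics.stdev([math.asin(math.sqrt(v)) for v in values])


def top_k(stats, k):
    """Return indices of the k features with the largest statistic."""
    order = sorted(range(len(stats)), key=lambda i: stats[i], reverse=True)
    return order[:k]


if __name__ == "__main__":
    rng = random.Random(0)
    # Same precision for every feature, so only the mean differs.
    means = [0.05, 0.2, 0.5, 0.8, 0.95]
    features = [simulate_feature(m, 50.0, 2000, rng) for m in means]
    for m, f in zip(means, features):
        print(f"mean={m:.2f}  sd={sd_stat(f):.3f}  vst_sd={vst_sd_stat(f):.3f}")
```

With equal precision across features, the raw standard deviation should be largest for the feature with mean 0.5, while the transformed statistic should be roughly constant across means, so ranking by the raw statistic with `top_k` would favour mid-range features even though none of them carries extra biological variation.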

Results

We compared 11 non-specific filters on eight Infinium HumanMethylation data sets, selected to span a variety of biological conditions. Some data sets have a small fraction of samples showing abnormal methylation of a subset of normally unmethylated CpGs, a characteristic of the CpG island methylator phenotype in cancer. For these data sets, a novel filter statistic that uses a variance-stabilizing transformation for Beta distributed data detected the cancer subtype in a cluster analysis better than the common filter based on the standard deviation of the DNA methylation proportion or of its log-transformed M-value. However, the standard deviation filter always performed among the best for distinguishing subgroups of normal tissue. The novel filter and the standard deviation filter tended to favour features in different genome contexts: for the same data set, the novel filter always selected more features from CpG island promoters, and the standard deviation filter always selected more features from non-CpG island intergenic regions. Interestingly, despite selecting largely non-overlapping sets of features, the two filters did find overlapping sample subsets for some real data sets.

Conclusions

We found two filter statistics that tended to prioritize features with different characteristics. Each performed well for identifying clusters of cancer and non-cancer tissue, and for identifying a cancer CpG island hypermethylation phenotype. Since cluster analysis is for discovery, we suggest trying both filters on any new data set and evaluating the overlap of the features selected and the clusters discovered.
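
The comparison suggested in the conclusions can be sketched as follows. The abstract does not say how overlap was measured, so the Jaccard index for feature sets and the Rand index for cluster assignments are assumptions made here for illustration, and the function names are hypothetical.

```python
from itertools import combinations


def jaccard(features_a, features_b):
    """Share of selected features common to both filters (0 to 1)."""
    a, b = set(features_a), set(features_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which two clusterings agree.

    A pair agrees when both clusterings put the two samples together,
    or both put them apart. Label values themselves do not matter.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("clusterings must cover the same samples")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)
```

For example, `jaccard(selected_by_sd, selected_by_vst)` on two lists of probe identifiers gives the feature overlap, and `rand_index(clusters_sd, clusters_vst)` compares the sample groupings found after each filter. A low Jaccard value combined with a high Rand index would correspond to the situation the abstract reports, where largely different features lead to overlapping sample subsets.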

SUBMITTER: Wang X

PROVIDER: S-EPMC4230495 | biostudies-literature | 2014 Jun

REPOSITORIES: biostudies-literature

Publications

Non-specific filtering of beta-distributed data.

Wang X, Laird PW, Hinoue T, Groshen S, Siegmund KD

BMC Bioinformatics, 2014 Jun 19

PMID: 24943962

Similar Datasets

Filtering and inference for stochastic oscillators with distributed delays. | S-EPMC6477979 | biostudies-literature

VariantBam: filtering and profiling of next-generational sequencing data using region-specific rules. | S-EPMC4920121 | biostudies-literature

α- and β-myosin II can be non-uniformly distributed within the cardiac sarcomere. | S-EPMC11994913 | biostudies-literature

On the Precise Tuning of Optical Filtering Features in Nanoporous Anodic Alumina Distributed Bragg Reflectors. | S-EPMC5854698 | biostudies-literature

AlleleHMM: a data-driven method to identify allele specific differences in distributed functional genomic marks. | S-EPMC6582321 | biostudies-literature

Filtering duplicate reads from 454 pyrosequencing data. | S-EPMC3605598 | biostudies-literature

Output-Sensitive Filtering of Streaming Volume Data. | S-EPMC5349295 | biostudies-literature

PERFect: PERmutation Filtering test for microbiome data. | S-EPMC6797060 | biostudies-literature

Filtering procedures for untargeted LC-MS metabolomics data. | S-EPMC6570933 | biostudies-literature

Filtering for increased power for microarray data analysis. | S-EPMC2661050 | biostudies-literature