Dataset Information

Penalized unsupervised learning with outliers.

ABSTRACT: We consider the problem of performing unsupervised learning in the presence of outliers - that is, observations that do not come from the same distribution as the rest of the data. It is known that in this setting, standard approaches for unsupervised learning can yield unsatisfactory results. For instance, in the presence of severe outliers, K-means clustering will often assign each outlier to its own cluster, or alternatively may yield distorted clusters in order to accommodate the outliers. In this paper, we take a new approach to extending existing unsupervised learning techniques to accommodate outliers. Our approach is an extension of a recent proposal for outlier detection in the regression setting. We allow each observation to take on an "error" term, and we penalize the errors using a group lasso penalty in order to encourage most of the observations' errors to exactly equal zero. We show that this approach can be used in order to develop extensions of K-means clustering and principal components analysis that result in accurate outlier detection, as well as improved performance in the presence of outliers. These methods are illustrated in a simulation study and on two gene expression data sets, and connections with M-estimation are explored.
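The K-means extension described in the abstract can be illustrated with a minimal sketch. This is not the paper's code. It assumes the objective (1/2) * sum_i ||x_i - e_i - mu_c(i)||^2 + lam * sum_i ||e_i||_2, where the scaling, the function name `robust_kmeans`, the parameter `lam`, and the toy data are all choices made here for illustration. Under that assumed objective, the update for each error term e_i is a group soft-thresholding of the residual, so most errors come out exactly zero and the nonzero ones flag candidate outliers.

```python
import numpy as np

def robust_kmeans(X, init_centers, lam, n_iter=50):
    """Alternate K-means updates with a group-lasso update of per-point errors.

    Assumed objective (a sketch, not taken from the paper):
        0.5 * sum_i ||x_i - e_i - mu_{c(i)}||^2 + lam * sum_i ||e_i||_2
    """
    X = np.asarray(X, dtype=float)
    centers = np.array(init_centers, dtype=float)
    E = np.zeros_like(X)
    for _ in range(n_iter):
        Z = X - E  # data with the current error estimates removed
        # Assignment step: nearest center for each corrected point.
        dist2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist2.argmin(axis=1)
        # Center step: mean of the corrected points in each cluster.
        for j in range(centers.shape[0]):
            members = labels == j
            if members.any():
                centers[j] = Z[members].mean(axis=0)
        # Error step: group soft-thresholding of the residuals, which sets
        # e_i exactly to zero whenever ||x_i - mu_{c(i)}||_2 <= lam.
        R = X - centers[labels]
        norms = np.linalg.norm(R, axis=1)
        shrink = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
        E = R * shrink[:, None]
    return labels, centers, E

# Toy data (hypothetical): two tight clusters plus one severe outlier.
X = np.array([
    [0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [-0.2, 0.0], [0.0, -0.2],
    [5.0, 5.0], [5.2, 5.0], [5.0, 5.2], [4.8, 5.0], [5.0, 4.8],
    [0.0, 30.0],
])
labels, centers, E = robust_kmeans(X, init_centers=[[0.0, 0.0], [5.0, 5.0]], lam=2.0)
flagged = np.linalg.norm(E, axis=1) > 0  # nonzero error term => candidate outlier
print(flagged)
```

Larger values of `lam` flag fewer observations, and a large enough value sets every error term to zero, which reduces the procedure to ordinary K-means. The result also depends on the initial centers, as with standard K-means.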

SUBMITTER: Witten DM

PROVIDER: S-EPMC3716393 | biostudies-literature | 2013

REPOSITORIES: biostudies-literature

Publications

Penalized unsupervised learning with outliers.

Witten, Daniela M.

Statistics and Its Interface, 2013-01-01, issue 2

PMID: 23875057

Similar Datasets

Project description: Diabetes mellitus is a group of metabolic diseases in which blood sugar levels are too high. About 8.8% of the world was diabetic in 2017, and this is projected to reach nearly 10% by 2045. The major challenge is that applying machine learning-based classifiers to such data sets for risk stratification leads to lower performance. Thus, our objective is to develop an optimized and robust machine learning (ML) system under the assumption that replacing missing values or outliers with a median configuration will yield higher risk stratification accuracy. This ML-based risk stratification system is designed, optimized, and evaluated as follows: features are extracted and optimized using six feature selection techniques (random forest, logistic regression, mutual information, principal component analysis, analysis of variance, and Fisher discriminant ratio) and combined with ten different types of classifiers (linear discriminant analysis, quadratic discriminant analysis, naïve Bayes, Gaussian process classification, support vector machine, artificial neural network, Adaboost, logistic regression, decision tree, and random forest), under the hypothesis that replacing both missing values and outliers with computed medians will improve the risk stratification accuracy. The Pima Indian diabetic dataset (768 patients: 268 diabetic and 500 controls) was used. Our results demonstrate that replacing the missing values and outliers by group median and median values, respectively, and then combining random forest feature selection with random forest classification yields an accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and area under the curve of 92.26%, 95.96%, 79.72%, 91.14%, 91.20%, and 0.93, respectively. This is an improvement of 10% over previously developed techniques published in the literature. The system was validated for its stability and reliability.
The RF-based model showed the best performance when outliers were replaced by median values.
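The median-replacement step in this description can be sketched as follows. This is a rough illustration, not the study's pipeline: the description does not say how outliers were identified, so the 1.5 x IQR rule used here is an assumption, as are the reading of "group median" as the per-class median, the function name `replace_missing_and_outliers`, and the toy data.

```python
import numpy as np

def replace_missing_and_outliers(X, y):
    """Replace NaNs by the class-wise median and outliers by the column median.

    Outliers are defined by the 1.5 * IQR rule, which is an assumption made
    for this sketch; the source description does not specify a rule.
    """
    X = np.array(X, dtype=float)  # work on a copy
    y = np.asarray(y)
    for j in range(X.shape[1]):
        col = X[:, j]  # view into X, so assignments below modify X
        # Missing values: median of the observed values in the same class.
        # (A class with no observed values in this column would stay NaN.)
        for g in np.unique(y):
            in_group = y == g
            missing = in_group & np.isnan(col)
            if missing.any():
                col[missing] = np.nanmedian(col[in_group])
        # Outliers: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
        q1, q3 = np.percentile(col, [25, 75])
        iqr = q3 - q1
        is_outlier = (col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)
        col[is_outlier] = np.median(col)
    return X

# Toy example (hypothetical values): one missing entry and one extreme value.
X_raw = [[1.0], [2.0], [np.nan], [3.0], [100.0], [2.0], [1.0], [3.0]]
y_raw = [0, 0, 0, 0, 1, 1, 1, 1]
X_clean = replace_missing_and_outliers(X_raw, y_raw)
print(X_clean.ravel())
```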

Project description: Two strikingly distinct types of activity have been observed in various brain structures during delay periods of delayed response tasks: persistent activity (PA), in which a sub-population of neurons maintains an elevated firing rate throughout an entire delay period, and sequential activity (SA), in which sub-populations of neurons are activated sequentially in time. It has been hypothesized that both types of dynamics can be "learned" by the relevant networks from the statistics of their inputs, thanks to mechanisms of synaptic plasticity. However, the necessary conditions for a synaptic plasticity rule and input statistics to learn these two types of dynamics in a stable fashion are still unclear. In particular, it is unclear whether a single learning rule is able to learn both types of activity patterns, depending on the statistics of the inputs driving the network. Here, we first characterize the complete bifurcation diagram of a firing rate model of multiple excitatory populations with an inhibitory mechanism, as a function of the parameters characterizing its connectivity. We then investigate how an unsupervised temporally asymmetric Hebbian plasticity rule shapes the dynamics of the network. Consistent with previous studies, we find that for stable learning of PA and SA, an additional stabilization mechanism is necessary. We show that a generalized version of the standard multiplicative homeostatic plasticity (Renart et al., 2003; Toyoizumi et al., 2014) stabilizes learning by effectively masking excitatory connections during stimulation and unmasking those connections during retrieval. Using the bifurcation diagram derived for fixed connectivity, we study analytically the temporal evolution and the steady state of the learned recurrent architecture as a function of parameters characterizing the external inputs. Slowly changing stimuli lead to PA, while fast-changing stimuli lead to SA.
Our network model shows how a network with plastic synapses can stably and flexibly learn PA and SA in an unsupervised manner.

Project description: Purpose: Image quality of positron emission tomography (PET) is limited by various physical degradation factors. Our study aims to perform PET image denoising by utilizing prior information from the same patient. The proposed method is based on unsupervised deep learning, where no training pairs are needed. Methods: In this method, the prior high-quality image from the patient was employed as the network input and the noisy PET image itself was treated as the training label. Constrained by the network structure and the prior image input, the network was trained to learn the intrinsic structure information from the noisy image and output a restored PET image. To validate the performance of the proposed method, a computer simulation study based on the BrainWeb phantom was first performed. A 68Ga-PRGD2 PET/CT dataset containing 10 patients and a 18F-FDG PET/MR dataset containing 30 patients were later on used for clinical data evaluation. The Gaussian, non-local mean (NLM) using CT/MR image as priors, BM4D, and Deep Decoder methods were included as reference methods. The contrast-to-noise ratio (CNR) improvements were used to rank different methods based on Wilcoxon signed-rank test. Results: For the simulation study, contrast recovery coefficient (CRC) vs. standard deviation (STD) curves showed that the proposed method achieved the best performance regarding the bias-variance tradeoff. For the clinical PET/CT dataset, the proposed method achieved the highest CNR improvement ratio (53.35% ± 21.78%), compared with the Gaussian (12.64% ± 6.15%, P = 0.002), NLM guided by CT (24.35% ± 16.30%, P = 0.002), BM4D (38.31% ± 20.26%, P = 0.002), and Deep Decoder (41.67% ± 22.28%, P = 0.002) methods.
For the clinical PET/MR dataset, the CNR improvement ratio of the proposed method achieved 46.80% ± 25.23%, higher than the Gaussian (18.16% ± 10.02%, P < 0.0001), NLM guided by MR (25.36% ± 19.48%, P < 0.0001), BM4D (37.02% ± 21.38%, P < 0.0001), and Deep Decoder (30.03% ± 20.64%, P < 0.0001) methods. Restored images for all the datasets demonstrate that the proposed method can effectively smooth out the noise while recovering image details. Conclusion: The proposed unsupervised deep learning framework provides excellent image restoration effects, outperforming the Gaussian, NLM methods, BM4D, and Deep Decoder methods.
