Dataset Information

Penalized unsupervised learning with outliers.


ABSTRACT: We consider the problem of performing unsupervised learning in the presence of outliers - that is, observations that do not come from the same distribution as the rest of the data. It is known that in this setting, standard approaches for unsupervised learning can yield unsatisfactory results. For instance, in the presence of severe outliers, K-means clustering will often assign each outlier to its own cluster, or alternatively may yield distorted clusters in order to accommodate the outliers. In this paper, we take a new approach to extending existing unsupervised learning techniques to accommodate outliers. Our approach is an extension of a recent proposal for outlier detection in the regression setting. We allow each observation to take on an "error" term, and we penalize the errors using a group lasso penalty in order to encourage most of the observations' errors to exactly equal zero. We show that this approach can be used in order to develop extensions of K-means clustering and principal components analysis that result in accurate outlier detection, as well as improved performance in the presence of outliers. These methods are illustrated in a simulation study and on two gene expression data sets, and connections with M-estimation are explored.
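The abstract describes giving each observation its own "error" term and penalizing those terms with a group lasso penalty so that most of them are exactly zero. The following is a minimal sketch of how that idea could look for K-means, not the paper's implementation. It assumes the objective is (1/2) * sum_i ||x_i - e_i - mu_c(i)||^2 + lam * sum_i ||e_i||_2, minimized by alternating between standard K-means updates on the corrected data X - E and a row-wise group soft-thresholding update for E. The function names, the `lam` scaling, and the optional `init_centers` argument are illustrative assumptions.

```python
import numpy as np


def group_soft_threshold(R, lam):
    """Shrink each row's L2 norm by lam; rows with norm <= lam become exactly zero."""
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    return R * scale


def kmeans_with_outliers(X, k, lam, n_iter=50, seed=0, init_centers=None):
    """Sketch of K-means with per-observation error terms (assumed formulation).

    Returns cluster labels, centers, the error matrix E, and a boolean
    mask marking observations whose error term is nonzero.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if init_centers is None:
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(n, size=k, replace=False)].copy()
    else:
        centers = np.array(init_centers, dtype=float)
    E = np.zeros_like(X)
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        Z = X - E  # data with the current error estimates removed
        # Assignment step: nearest center for each corrected observation.
        d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Center step: mean of the corrected observations in each cluster.
        for j in range(k):
            members = labels == j
            if members.any():
                centers[j] = Z[members].mean(axis=0)
        # Error step: group soft-threshold the residuals from assigned centers.
        E = group_soft_threshold(X - centers[labels], lam)
    outliers = np.linalg.norm(E, axis=1) > 0
    return labels, centers, E, outliers
```

In this sketch, an observation is flagged as an outlier when its row of `E` is nonzero. Larger values of `lam` flag fewer observations, and as `lam` grows every error term is zero and the procedure reduces to ordinary K-means. Like K-means itself, the alternating updates can stop at a local optimum that depends on initialization.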

SUBMITTER: Witten DM 

PROVIDER: S-EPMC3716393 | biostudies-literature | 2013

REPOSITORIES: biostudies-literature

Publications

Penalized unsupervised learning with outliers.

Witten, Daniela M.

Statistics and Its Interface, 2013-01-01, issue 2


Similar Datasets

| S-EPMC4851172 | biostudies-other
| S-EPMC1187953 | biostudies-literature
| EMPIAR-10069 | biostudies-other
| S-EPMC7906460 | biostudies-literature
| S-EPMC8657702 | biostudies-literature
| S-EPMC8865843 | biostudies-literature
| S-EPMC4429262 | biostudies-other
| S-EPMC5893681 | biostudies-other
| S-EPMC6978734 | biostudies-literature
| S-EPMC7814987 | biostudies-literature