# On the Evaluation of Unsupervised Outlier Detection:

Measures, Datasets, and an Empirical Study

[Supplementary Material]

This webpage presents the supplementary material for the paper

*On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical Study*

by G. O. Campos, A. Zimek, J. Sander, R. J. G. B. Campello, B. Micenková, E. Schubert, I. Assent and M. E. Houle

Data Mining and Knowledge Discovery 30(4): 891-927, 2016, DOI: 10.1007/s10618-015-0444-8

We provide all datasets together with their descriptions here as well as all results visualized in graphs.

## Repository structure

This repository contains
1) outlier detection datasets and the results of basic *k*NN-based outlier detection methods on these datasets and
2) aggregated results over the methods and datasets.

We follow the categorization from the paper and distinguish between datasets used in the *literature* and *semantically meaningful* datasets. Each dataset is provided in several variants according to the applied pre-processing. The four main categories (though not all of them are necessarily present for every dataset) are: normalized without duplicates, normalized with duplicates, unnormalized without duplicates, and unnormalized with duplicates. Further variants may have been generated within each category due to different downsampling and different handling of categorical attributes.
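As a rough illustration of these pre-processing variants, the following sketch shows min-max normalization and duplicate removal on a toy table. This is only an illustration of the general idea, not the exact procedure used to build the archives; the data values are made up.

```python
def minmax_normalize(rows):
    """Scale each attribute linearly to [0, 1] (constant columns become 0)."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        tuple((v - l) / (h - l) if h > l else 0.0
              for v, l, h in zip(row, lo, hi))
        for row in rows
    ]

def drop_duplicates(rows):
    """Remove exact duplicate instances, keeping the first occurrence."""
    seen, unique = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

rows = [(2.0, 10.0), (4.0, 30.0), (4.0, 30.0), (6.0, 20.0)]
normalized = minmax_normalize(drop_duplicates(rows))
# -> [(0.0, 0.0), (0.5, 1.0), (1.0, 0.5)]
```

Whether duplicates are removed before or after normalization does not matter for exact duplicates, but it can matter once values are rounded.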

## Results

All results were generated with ELKI [1]. The results of the comparison of 12 basic outlier detection methods are provided for each variant of each dataset and they can be displayed in plots by clicking on the dataset name in the tables below.
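To give an idea of what the basic *k*NN-based methods compute, here is a minimal sketch of the classic *k*NN outlier score, i.e., the distance of a point to its *k*-th nearest neighbor. This is an illustration only, not the ELKI implementation; the data and the choice *k* = 2 are made up.

```python
import math

def knn_outlier_score(data, i, k):
    """Score points[i] by the distance to its k-th nearest neighbor
    (larger distance = more outlying). The query point itself is
    excluded from its own neighbor list here -- one of the convention
    choices discussed below under reproducibility."""
    dists = sorted(
        math.dist(data[i], p) for j, p in enumerate(data) if j != i
    )
    return dists[k - 1]

# Toy 2-d data: a small cluster plus one far-away point.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = [knn_outlier_score(data, i, k=2) for i in range(len(data))]
# The isolated point (5, 5) receives by far the largest score.
```

Density-based variants such as LOF build on the same neighbor lists, which is why the neighbor conventions matter for all of these methods.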

## Datasets used in the literature

Download all datasets of this type (44 MB)

Name | Instances | Outliers | Attributes |
---|---|---|---|
ALOI | 50,000 | 1,508 | 27 |
Glass | 214 | 9 | 7 |
Ionosphere | 351 | 126 | 32 |
KDDCup99 | 60,632 | 246 | 38+3 |
Lymphography | 148 | 6 | 3+16 |
PenDigits | 9,868 | 20 | 16 |
Shuttle | 1,013 | 13 | 9 |
Waveform | 3,443 | 100 | 21 |
WBC | 454 | 10 | 9 |
WDBC | 367 | 10 | 30 |
WPBC | 198 | 47 | 33 |

## Semantically meaningful datasets

Download all datasets of this type (83 MB)

Name | Instances | Outliers | Attributes |
---|---|---|---|
Annthyroid | 7,200 | 534 | 21 |
Arrhythmia | 450 | 206 | 259 |
Cardiotocography | 2,126 | 471 | 21 |
HeartDisease | 270 | 120 | 13 |
Hepatitis | 80 | 13 | 19 |
InternetAds | 3,264 | 454 | 1,555 |
PageBlocks | 5,473 | 560 | 10 |
Parkinson | 195 | 147 | 22 |
Pima | 768 | 268 | 8 |
SpamBase | 4,601 | 1,813 | 57 |
Stamps | 340 | 31 | 9 |
Wilt | 4,839 | 261 | 5 |

## Aggregated results

We provide an overview of aggregated results for each dataset over all methods and for each method over all datasets, again divided into four categories:

### Per method, aggregated over data sets:

- Aggregated over all data sets
- Aggregated over data sets with up to 5% outliers
- Aggregated over data sets with up to 10% outliers
- Aggregated over data sets with up to 20% outliers

### Per data set, aggregated over methods:

- Aggregated over all data sets
- Aggregated over data sets with up to 5% outliers
- Aggregated over data sets with up to 10% outliers
- Aggregated over data sets with up to 20% outliers

Suitability of datasets for outlier evaluation (visualization of data set difficulties).

## Reproducibility

Download package to reproduce the experiments (6 MB, not including the data sets).
(See the included `README` for documentation.
To completely run these experiments, you need about 500 CPU hours and 50 GB of disk space.)

We are committed to reproducible research and do our best to provide a download package to reproduce our results.

However, complex systems like this are hard to make 100% reproducible.
We have tried to eliminate all sources of randomness,
but we still observe slightly different results on different hardware and
software versions, for reasons we cannot pinpoint exactly.

At first glance, it seems that computation results should be exactly reproducible.
But it turns out that, e.g., query results may be ordered by
memory location when the exact same distance occurs twice.
And while mathematically the sum of a list of doubles does not depend on their order,
with floating-point arithmetic it does. Due to an effect known as
catastrophic cancellation,
we may observe a more or less significant loss of precision even on tasks
as simple as summing a few numbers.

For example, `1 + 1e16 - 1e16` will usually yield `0` with double-precision floating point,
while obviously `1` would be the correct result. Also, in most languages, summing
ten copies of the number `0.1` yields a total just slightly short of the exact `1.0`.
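These effects are easy to verify. The following sketch uses plain Python, but the same behavior holds for IEEE 754 doubles in any language; it shows both examples above, plus the order dependence of summation:

```python
import math

# Catastrophic cancellation: the 1 is rounded away when added to 1e16,
# because adjacent doubles near 1e16 are 2 apart.
print(1 + 1e16 - 1e16)   # 0.0, although 1.0 would be exact

# Repeated rounding: ten copies of 0.1 do not sum to exactly 1.0,
# since 0.1 itself is not exactly representable in binary.
print(sum([0.1] * 10))   # 0.9999999999999999

# The result of a floating-point sum depends on the summation order ...
a = [1.0, 1e16, -1e16]
b = [1e16, -1e16, 1.0]
print(sum(a), sum(b))    # 0.0 1.0

# ... while a compensated summation recovers the exact value.
print(math.fsum(a))      # 1.0
```

So two implementations that merely iterate over neighbors in a different order can already produce slightly different scores.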

For such reasons, we have observed slightly different results for some data sets and some methods; in particular across different ELKI versions, without any of these versions being incorrect! Depending on when you attempt to reproduce these results, you may thus also observe slightly different values. The overall, aggregated results and conclusions should not change due to this.

Thus, unfortunately, you must expect different implementations of the algorithms to yield
*subtly* different results. Differences in parameter conventions can cause such changes, too.

For example, do the *k* nearest neighbors include the query point or not?
Do the *k* nearest neighbors include all neighbors tied at the *k*th
rank? What about duplicate points?
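These convention questions already change results on a tiny example. The sketch below (illustrative conventions, not necessarily ELKI's) contrasts the *k*-nearest-neighbor distance with and without the query point in its own neighbor list, on data containing a duplicate point; the data and *k* = 2 are made up:

```python
import math

def kdist(points, i, k, include_self):
    """Distance from points[i] to its k-th nearest neighbor, either
    counting the point as its own neighbor or excluding it."""
    dists = sorted(
        math.dist(points[i], p)
        for j, p in enumerate(points)
        if include_self or j != i
    )
    return dists[k - 1]

points = [(0.0,), (1.0,), (1.0,), (2.0,)]  # contains a duplicate at 1.0
print(kdist(points, 1, k=2, include_self=True))   # 0.0 (the duplicate)
print(kdist(points, 1, k=2, include_self=False))  # 1.0
```

With duplicates in the data, even the "exclude the query point" convention leaves open whether to exclude by identity or by value, which is exactly why the datasets are provided in with- and without-duplicates variants.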

For these reasons, there often is no single "correct" result at all.

[1] E. Schubert, A. Koos, T. Emrich, A. Züfle, K. A. Schmid, A. Zimek: A Framework for Clustering Uncertain Data. PVLDB 8(12): 1976-1979 (2015)