
On the Evaluation of Unsupervised Outlier Detection:
Measures, Datasets, and an Empirical Study
[Supplementary Material]

This webpage presents the supplementary material for the paper
On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical Study
by G. O. Campos, A. Zimek, J. Sander, R. J. G. B. Campello, B. Micenková, E. Schubert, I. Assent and M. E. Houle
Data Mining and Knowledge Discovery 30(4): 891-927, 2016, DOI: 10.1007/s10618-015-0444-8
We provide all datasets together with their descriptions, as well as all results visualized in graphs.

Repository structure

This repository contains (1) outlier detection datasets, together with the results of basic kNN-based outlier detection methods on these datasets, and (2) results aggregated over the methods and datasets.

We keep the categorization from the paper and distinguish between datasets provided in the literature and semantically meaningful datasets. Each dataset is given in several variants according to the applied pre-processing. The four main categories (though not all of them are necessarily present for every dataset) are: normalized without duplicates, normalized with duplicates, not normalized without duplicates, and not normalized with duplicates. Further variants may have been generated within each category due to different downsampling and handling of categorical attributes; a sketch of such pre-processing follows below.
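For illustration, here is a minimal sketch of how such variants can be derived with pandas. The file name and column layout are hypothetical, and this is not the exact pipeline used to produce the published variants:

    import pandas as pd

    # Hypothetical input layout: numeric attribute columns plus a final
    # binary "outlier" label column (names are illustrative only).
    df = pd.read_csv("dataset.csv")
    attrs = df.columns.drop("outlier")

    # "not normalized, with duplicates": the raw data as read.
    raw = df

    # "without duplicates": drop rows that are exact duplicates.
    dedup = df.drop_duplicates()

    # "normalized": linearly rescale every attribute to [0, 1];
    # constant columns are mapped to 0 to avoid division by zero.
    def normalize(frame):
        out = frame.copy()
        span = (frame[attrs].max() - frame[attrs].min()).replace(0, 1)
        out[attrs] = (frame[attrs] - frame[attrs].min()) / span
        return out

    normalized = normalize(df)
    normalized_dedup = normalize(dedup)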

Results

All results were generated with ELKI [1]. The results of the comparison of 12 basic outlier detection methods are provided for each variant of each dataset; they can be displayed as plots by clicking on the dataset name in the tables below.
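The evaluation measures studied in the paper include the ROC AUC. As a point of reference, here is a minimal sketch of computing the ROC AUC from outlier scores via its pairwise-ranking interpretation; this is illustrative only, not the ELKI implementation used for the published results:

    import numpy as np

    def roc_auc(scores, labels):
        """ROC AUC via its ranking interpretation: the fraction of
        (outlier, inlier) pairs in which the outlier receives the
        higher score, counting ties as one half."""
        scores = np.asarray(scores, dtype=float)
        labels = np.asarray(labels, dtype=bool)
        out = scores[labels][:, None]   # outlier scores as a column
        inl = scores[~labels][None, :]  # inlier scores as a row
        correct = (out > inl).sum() + 0.5 * (out == inl).sum()
        return correct / (labels.sum() * (~labels).sum())

    # Toy usage: the single outlier gets the top score, so the AUC is 1.0.
    print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 0, 0]))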

Datasets used in the literature

Download all datasets of this type (44 MB)

Name            Instances  Outliers  Attributes
ALOI               50,000     1,508          27
Glass                 214         9           7
Ionosphere            351       126          32
KDDCup99           60,632       246        38+3
Lymphography          148         6        3+16
PenDigits           9,868        20          16
Shuttle             1,013        13           9
Waveform            3,443       100          21
WBC                   454        10           9
WDBC                  367        10          30
WPBC                  198        47          33

Semantically meaningful datasets

Download all datasets of this type (83 MB)

Name              Instances  Outliers  Attributes
Annthyroid            7,200       534          21
Arrhythmia              450       206         259
Cardiotocography      2,126       471          21
HeartDisease            270       120          13
Hepatitis                80        13          19
InternetAds           3,264       454       1,555
PageBlocks            5,473       560          10
Parkinson               195       147          22
Pima                    768       268           8
SpamBase              4,601     1,813          57
Stamps                  340        31           9
Wilt                  4,839       261           5

Aggregated results

We provide an overview of aggregated results for each dataset over all methods, and for each method over all datasets, again divided into the four categories described above.

Suitability of datasets for outlier evaluation (visualization of dataset difficulties).

Reproducibility

Download the package to reproduce the experiments (6 MB, not including the datasets). See the included README for documentation; a complete run of these experiments requires about 500 CPU hours and 50 GB of disk space.

We are committed to reproducible research and do our best to provide a download package with which our results can be repeated.

However, complex systems like this are hard to make 100% reproducible. We have tried to eliminate all sources of randomness, but we still observe slightly different results on different hardware and software versions, for reasons we cannot pinpoint exactly.
At first glance it may seem that computation results can be guaranteed to be exactly identical. But it turns out that, for example, query results may be ordered depending on memory locations when the exact same distance occurs twice. And while mathematically the sum of a list of doubles does not depend on the order of summation, with floating-point arithmetic it does. Due to effects such as catastrophic cancellation, we may observe a more or less significant loss of precision even on tasks as simple as computing the sum of a few numbers.
For example, 1 + 1e16 - 1e16 will usually yield 0 with double-precision floating point, while obviously 1 would be the correct result: the 1 added to 1e16 is lost to rounding. Similarly, in most languages, summing the number 0.1 ten times will yield a total just slightly short of the exact result 1.0, because 0.1 has no exact binary representation.
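Both effects are easy to reproduce, e.g. in Python, whose float is an IEEE 754 double:

    # Absorption: the 1 added to 1e16 is lost to rounding, so the
    # subtraction yields 0 instead of the mathematically correct 1.
    print(1e16 + 1.0 - 1e16)   # 0.0
    print(1e16 - 1e16 + 1.0)   # 1.0 -- same numbers, different order

    # Accumulated rounding: 0.1 has no exact binary representation,
    # so ten additions fall just short of 1.0.
    total = 0.0
    for _ in range(10):
        total += 0.1
    print(total)               # 0.9999999999999999
    print(total == 1.0)        # False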

For such reasons, we have observed slightly different results for some datasets and some methods, in particular across different ELKI versions, without any of these versions being incorrect! Depending on when you attempt to reproduce these results, you may thus also observe slightly different values. The overall aggregated results and conclusions should not change due to this.

Thus, unfortunately, you must expect different implementations of the algorithms to yield subtly different results. Differences in parameter conventions can cause such changes as well.
For example: do the k nearest neighbors include the query point or not? Do the k nearest neighbors include all neighbors tied at the k-th rank? What about duplicate points? (A small sketch of the first of these choices follows below.)
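As an illustration of the first of these choices, here is a small sketch comparing the distance to the k-th nearest neighbor with and without the query point itself; the data is hypothetical, and both conventions appear in practice:

    import numpy as np

    def kth_nn_distance(X, k, include_query):
        # Pairwise Euclidean distances (brute force).
        diff = X[:, None, :] - X[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        if not include_query:
            np.fill_diagonal(dist, np.inf)  # drop the self-distance 0
        # Distance to the k-th nearest remaining neighbor per point.
        return np.sort(dist, axis=1)[:, k - 1]

    X = np.random.RandomState(0).rand(5, 2)
    # Counting the query point shifts every rank by one, so the same k
    # yields systematically smaller distances -- and different scores
    # for any outlier detection method built on the kNN distances.
    print(kth_nn_distance(X, k=3, include_query=True))
    print(kth_nn_distance(X, k=3, include_query=False))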

For these reasons, there usually is no single "correct" result.


[1] E. Schubert, A. Koos, T. Emrich, A. Züfle, K. A. Schmid, A. Zimek.
A Framework for Clustering Uncertain Data. PVLDB 8(12): 1976-1979 (2015)