On the Evaluation of Unsupervised Outlier Detection:
Measures, Datasets, and an Empirical Study
[Supplementary Material]
This webpage presents the supplementary material for the paper
On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical Study
by G. O. Campos, A. Zimek, J. Sander, R. J. G. B. Campello, B. Micenková, E. Schubert, I. Assent and M. E. Houle
Data Mining and Knowledge Discovery 30(4): 891-927, 2016, DOI: 10.1007/s10618-015-0444-8
Here we provide all datasets, together with their descriptions, as well as all results visualized in graphs.
Repository structure
This repository contains 1) outlier detection datasets and the results of basic kNN-based outlier detection methods on these datasets and 2) aggregated results over the methods and datasets.
We stick to the categorization from the paper and distinguish between datasets used in the literature and semantically meaningful datasets. Each dataset is given in several variants according to the applied pre-processing. The four main categories (though not all of them are necessarily present for every dataset) are: normalized without duplicates, normalized with duplicates, not normalized without duplicates, and not normalized with duplicates. Further variants may have been generated within each category due to different downsampling and handling of categorical attributes.
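To illustrate these variants, the two main pre-processing steps can be sketched as follows. This is not the exact pipeline used to generate the repository files; it is a minimal sketch assuming plain numeric data, min-max scaling of each attribute to [0, 1], and exact-duplicate removal, with illustrative function names.

```python
# Illustrative sketch (not the repository's actual script) of the two
# pre-processing steps behind the dataset variants: removing exact
# duplicate instances and min-max normalizing each attribute to [0, 1].

def preprocess(rows, normalize=True, drop_duplicates=True):
    """rows: list of equal-length numeric attribute vectors."""
    if drop_duplicates:
        seen, unique = set(), []
        for row in rows:
            key = tuple(row)
            if key not in seen:        # keep only the first occurrence
                seen.add(key)
                unique.append(list(row))
        rows = unique
    if normalize:
        cols = list(zip(*rows))
        lows = [min(c) for c in cols]
        # constant attributes get span 1.0 to avoid division by zero
        spans = [max(c) - lo if max(c) > lo else 1.0
                 for c, lo in zip(cols, lows)]
        rows = [[(v - lo) / s for v, lo, s in zip(row, lows, spans)]
                for row in rows]
    return rows

data = [[0.0, 2.0], [1.0, 4.0], [1.0, 4.0]]
print(preprocess(data))  # [[0.0, 0.0], [1.0, 1.0]]
```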
Results
All results were generated with ELKI [1]. Results for the comparison of 12 basic outlier detection methods are provided for each variant of each dataset; they can be displayed as plots by clicking on a dataset name in the tables below.
Datasets used in the literature
Download (44MB) all datasets of this type
Name | Instances | Outliers | Attributes |
---|---|---|---|
ALOI | 50,000 | 1508 | 27 |
Glass | 214 | 9 | 7 |
Ionosphere | 351 | 126 | 32 |
KDDCup99 | 60,632 | 246 | 38+3 |
Lymphography | 148 | 6 | 3+16 |
PenDigits | 9,868 | 20 | 16 |
Shuttle | 1,013 | 13 | 9 |
Waveform | 3,443 | 100 | 21 |
WBC | 454 | 10 | 9 |
WDBC | 367 | 10 | 30 |
WPBC | 198 | 47 | 33 |
Semantically meaningful datasets
Download (83MB) all datasets of this type
Name | Instances | Outliers | Attributes |
---|---|---|---|
Annthyroid | 7,200 | 534 | 21 |
Arrhythmia | 450 | 206 | 259 |
Cardiotocography | 2,126 | 471 | 21 |
HeartDisease | 270 | 120 | 13 |
Hepatitis | 80 | 13 | 19 |
InternetAds | 3,264 | 454 | 1,555 |
PageBlocks | 5,473 | 560 | 10 |
Parkinson | 195 | 147 | 22 |
Pima | 768 | 268 | 8 |
SpamBase | 4,601 | 1,813 | 57 |
Stamps | 340 | 31 | 9 |
Wilt | 4,839 | 261 | 5 |
Aggregated results
We provide an overview of aggregated results for each dataset over all methods and for each method over all datasets, again divided into four categories:
Per method, aggregated over data sets:
- Aggregated over all data sets
- Aggregated over data sets with up to 5% outliers
- Aggregated over data sets with up to 10% outliers
- Aggregated over data sets with up to 20% outliers
Per data set, aggregated over methods:
- Aggregated over all data sets
- Aggregated over data sets with up to 5% outliers
- Aggregated over data sets with up to 10% outliers
- Aggregated over data sets with up to 20% outliers
Suitability of datasets for outlier evaluation (visualization of data set difficulties).
Reproducibility
Download package to reproduce the experiments (6 MB, not including the data sets). (See the included README for documentation. To completely run these experiments, you need about 500 CPU hours and 50 GB of disk space.)
We are committed to reproducible research and do our best to provide a download package to repeat our results.
However, complex systems like this are hard to make 100% reproducible.
We have tried to eliminate all sources of randomness,
but we still observe slightly different results on different hardware and
software versions that we cannot pinpoint exactly.
At first glance it may seem that it must be possible to make the computation
results exactly reproducible. But it turns out that, for example, query results
may be ordered by memory location when the exact same distance occurs twice.
And while mathematically the sum of a list of doubles does not depend on the order of summation,
with floating-point math it does. Due to an effect known as
catastrophic cancellation
we may observe a more or less significant loss of precision even on simple tasks
such as computing the sum of a few numbers.
For example, 1 + 1e16 - 1e16 will usually yield 0 with double-precision floating point,
while obviously 1 would be the correct result. Also, in most languages, summing
the number 0.1 ten times yields a total just slightly short of the exact result 1.0.
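These effects are easy to reproduce in any language with IEEE 754 doubles; for example, in Python:

```python
# The floating-point effects described above, reproduced with
# ordinary Python floats (IEEE 754 double precision).

# Adding 1 to 1e16 is lost entirely, so the mathematically
# correct result 1 becomes 0.
print(1.0 + 1e16 - 1e16)  # 0.0

# 0.1 has no exact binary representation, so summing it ten
# times falls just short of 1.0.
total = sum([0.1] * 10)
print(total)              # 0.9999999999999999

# Summation order matters: the same three numbers summed in a
# different order produce a different result.
print(sum([1.0, 1e16, -1e16]))  # 0.0
print(sum([1e16, -1e16, 1.0]))  # 1.0
```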
For such reasons, we have observed slightly different results for some data sets and some methods, in particular across different ELKI versions, without any of these versions being incorrect! Depending on when you attempt to reproduce these results, you may thus also observe slightly different values. The overall aggregated results and conclusions should not change due to this.
Thus, you must unfortunately expect different implementations of the algorithms
to yield subtly different results. Differences in parameter conventions can cause
such changes as well. For example: do the k nearest neighbors include the query
point or not? Do the k nearest neighbors include all neighbors tied at the k-th
rank? What about duplicate points?
For these reasons, there usually is no single "correct" result.
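A small sketch of how these kNN conventions change the resulting neighbor distances. This is not ELKI code; the function and parameter names are made up, and a 1-D toy dataset with absolute difference as the distance stands in for a real metric space.

```python
# Illustrative sketch (not ELKI code) of how different kNN conventions
# change the neighbor distances that methods such as kNN outlier scoring
# are computed from. All names are made up for illustration.

def knn_dists(points, qi, k, include_query=False, include_ties=False):
    """k-nearest-neighbor distances of points[qi] in a 1-D toy dataset."""
    q = points[qi]
    dists = sorted(abs(p - q) for i, p in enumerate(points)
                   if include_query or i != qi)
    if include_ties:
        # keep every neighbor tied at the k-th distance
        kth = dists[k - 1]
        return [d for d in dists if d <= kth]
    return dists[:k]

data = [0.0, 1.0, 2.0, 2.0, 5.0]
# Excluding the query point, the 2-NN distances of the point 1.0:
print(knn_dists(data, 1, k=2))                      # [1.0, 1.0]
# Counting the query point as its own neighbor shifts every rank:
print(knn_dists(data, 1, k=2, include_query=True))  # [0.0, 1.0]
# Keeping ties at the k-th rank adds the duplicate at distance 1.0:
print(knn_dists(data, 1, k=2, include_ties=True))   # [1.0, 1.0, 1.0]
```

All three conventions are defensible, yet they give a different k-distance for the same point, which is exactly why two implementations can disagree without either being wrong.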
[1] E. Schubert, A. Koos, T. Emrich, A. Züfle, K. A. Schmid, A. Zimek.
A Framework for Clustering Uncertain Data. PVLDB 8(12): 1976-1979 (2015)