

Outlier Detection in High-Dimensional Data

Tutorial

This tutorial was presented at:

Abstract

High-dimensional data in Euclidean space pose special challenges to data mining algorithms. These challenges are often indiscriminately subsumed under the term "curse of dimensionality"; more concrete aspects are the so-called distance concentration effect, the presence of irrelevant attributes concealing relevant information, and plain efficiency issues. In the last few years, the task of unsupervised outlier detection has found new specialized solutions for tackling high-dimensional data in Euclidean space. These approaches fall mainly into two categories, depending on whether or not they consider subspaces (subsets of attributes) in the definition of outliers. The former specifically address the presence of irrelevant attributes; the latter consider irrelevant attributes implicitly at best and are more concerned with general issues of efficiency and effectiveness. Nevertheless, both types of specialized outlier detection algorithms tackle challenges specific to high-dimensional data. In this tutorial, we discuss in detail those aspects of the curse of dimensionality that are most important for outlier detection, and we survey specialized algorithms for outlier detection from both categories.
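The distance concentration effect mentioned above can be made concrete with a small experiment. The following is a minimal sketch and is not taken from the tutorial material: it assumes points drawn uniformly from the unit hypercube, uses the Euclidean distance, and the function name `relative_contrast` and its parameters are hypothetical choices for illustration. It measures how much the farthest point differs from the nearest point, relative to the nearest distance, as the dimensionality grows.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for the Euclidean distances from one
    random query point to n_points points, all drawn uniformly from the
    unit hypercube [0, 1]^dim (an assumption made for this illustration)."""
    rng = random.Random(seed)  # fixed seed so repeated calls give the same value
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


if __name__ == "__main__":
    # The printed contrast should shrink as the dimensionality increases.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

Under these assumptions the contrast is large in two dimensions and small in a thousand dimensions, which is the sense in which nearest and farthest neighbors become hard to distinguish. Real data sets with relevant and irrelevant attributes can behave differently, which is one of the points the tutorial discusses.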

Material

Survey

This tutorial is based on the survey article

A. Zimek, E. Schubert, H.-P. Kriegel: A Survey on Unsupervised Outlier Detection in High-Dimensional Numerical Data. Statistical Analysis and Data Mining, 5(5): 363–387, 2012.

Selected References

2012
30 T. de Vries, S. Chawla, M. E. Houle
Density-preserving projections for large-scale local anomaly detection
Knowledge and Information Systems (KAIS), 32(1): 25–52, 2012.
29 C. C. Aggarwal
Outlier Ensembles
ACM SIGKDD Explorations, 14(2): 49–58, 2012.
28 H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek
Outlier Detection in Arbitrarily Oriented Subspaces
In Proceedings of the 12th IEEE International Conference on Data Mining (ICDM), Brussels, Belgium: 379–388, 2012.
27 E. Müller, I. Assent, P. Iglesias, Y. Mülle, K. Böhm
Outlier Ranking via Subspace Analysis in Multiple Views of the Data
In Proceedings of the 12th IEEE International Conference on Data Mining (ICDM), Brussels, Belgium: 529–538, 2012.
26 E. Schubert, R. Wojdanowski, A. Zimek, H.-P. Kriegel
On Evaluation of Outlier Rankings and Outlier Scores
In Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA: 1047–1058, 2012.
25 N. Pham, R. Pagh
A Near-linear Time Approximation Algorithm for Angle-based Outlier Detection in High-dimensional Data
In Proceedings of the 18th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Beijing, China: 877–885, 2012.
24 E. Achtert, S. Goldhofer, H.-P. Kriegel, E. Schubert, A. Zimek
Evaluation of Clusterings – Metrics and Visual Support
In Proceedings of the 28th International Conference on Data Engineering (ICDE), Washington, DC: 1285–1288, 2012.
23 F. Keller, E. Müller, K. Böhm
HiCS: High Contrast Subspaces for Density-Based Outlier Ranking
In Proceedings of the 28th International Conference on Data Engineering (ICDE), Washington, DC: 1037–1048, 2012.
2011
22 H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek
Interpreting and Unifying Outlier Scores
In Proceedings of the 11th SIAM International Conference on Data Mining (SDM), Mesa, AZ: 13–24, 2011.
21 E. Achtert, A. Hettab, H.-P. Kriegel, E. Schubert, A. Zimek
Spatial Outlier Detection: Data, Algorithms, Visualizations
In Proceedings of the 12th International Symposium on Spatial and Temporal Databases (SSTD), Minneapolis, MN: 512–516, 2011.
20 H. V. Nguyen, V. Gopalkrishnan, I. Assent
An Unbiased Distance-based Outlier Detection Approach for High-dimensional Data
In Proceedings of the 16th International Conference on Database Systems for Advanced Applications (DASFAA), Hong Kong, China: 138–152, 2011.
19 Y. Wang, S. Parthasarathy, S. Tatikonda
Locality Sensitive Outlier Detection: A ranking driven approach
In Proceedings of the 27th International Conference on Data Engineering (ICDE), Hannover, Germany: 410–421, 2011.
18 E. Müller, M. Schiffer, T. Seidl
Statistical Selection of Relevant Subspace Projections for Outlier Ranking
In Proceedings of the 27th International Conference on Data Engineering (ICDE), Hannover, Germany: 434–445, 2011.
2010
17 M. Radovanović, A. Nanopoulos, M. Ivanović
Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
Journal of Machine Learning Research, 11: 2487–2531, 2010.
16 T. de Vries, S. Chawla, M. E. Houle
Finding Local Anomalies in Very High Dimensional Space
In Proceedings of the 10th IEEE International Conference on Data Mining (ICDM), Sydney, Australia: 128–137, 2010.
15 H. V. Nguyen, H. H. Ang, V. Gopalkrishnan
Mining Outliers with Ensemble of Heterogeneous Detectors on Random Subspaces
In Proceedings of the 15th International Conference on Database Systems for Advanced Applications (DASFAA), Tsukuba, Japan: 368–383, 2010.
14 E. Müller, M. Schiffer, T. Seidl
Adaptive Outlierness for Subspace Outlier Ranking
In Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM), Toronto, ON, Canada: 1629–1632, 2010.
13 M. E. Houle, H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek
Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?
In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM), Heidelberg, Germany: 482–500, 2010.
2009
12 H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek
Outlier Detection in Axis-Parallel Subspaces of High Dimensional Data
In Proceedings of the 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Bangkok, Thailand: 831–838, 2009.
2008
11 A. Ghoting, S. Parthasarathy, M. E. Otey
Fast mining of distance-based outliers in high-dimensional datasets
Data Mining and Knowledge Discovery, 16(3): 349–364, 2008.
10 H.-P. Kriegel, M. Schubert, A. Zimek
Angle-Based Outlier Detection in High-dimensional Data
In Proceedings of the 14th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Las Vegas, NV: 444–452, 2008.
9 E. Müller, I. Assent, U. Steinhausen, T. Seidl
OutRank: ranking outliers in high dimensional data
In Proceedings of the 24th International Conference on Data Engineering (ICDE) Workshop on Ranking in Databases (DBRank), Cancun, Mexico: 600–603, 2008.
2005
8 F. Angiulli, C. Pizzuti
Outlier mining in large high-dimensional data sets
IEEE Transactions on Knowledge and Data Engineering, 17(2): 203–215, 2005.
7 A. Lazarevic, V. Kumar
Feature Bagging for Outlier Detection
In Proceedings of the 11th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Chicago, IL: 157–166, 2005.
2004
6 J. Zhang, M. Lou, T. W. Ling, H. Wang
HOS-Miner: A System for Detecting Outlying Subspaces of High-dimensional Data
In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), Toronto, Canada: 1265–1268, 2004.
2002
5 F. Angiulli, C. Pizzuti
Fast Outlier Detection in High Dimensional Spaces
In Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), Helsinki, Finland: 15–26, 2002.
2001
4 C. C. Aggarwal, P. S. Yu
Outlier Detection for High Dimensional Data
In Proceedings of the ACM International Conference on Management of Data (SIGMOD), Santa Barbara, CA: 37–46, 2001.
2000
3 S. Ramaswamy, R. Rastogi, K. Shim
Efficient algorithms for mining outliers from large data sets
In Proceedings of the ACM International Conference on Management of Data (SIGMOD), Dallas, TX: 427–438, 2000.
1999
2 K. P. Bennett, U. Fayyad, D. Geiger
Density-Based Indexing for Approximate Nearest-Neighbor Queries
In Proceedings of the 5th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), San Diego, CA: 233–243, 1999.
1 K. Beyer, J. Goldstein, R. Ramakrishnan, U. Shaft
When Is "Nearest Neighbor" Meaningful?
In Proceedings of the 7th International Conference on Database Theory (ICDT), Jerusalem, Israel: 217–235, 1999.