# Outlier Detection in High-Dimensional Data

## Tutorial

This tutorial was presented at:

### Abstract

High-dimensional data in Euclidean space pose special challenges to data mining algorithms. These challenges are often indiscriminately subsumed under the term *curse of dimensionality*; more concrete aspects are the so-called *distance concentration effect*, the presence of irrelevant attributes that conceal relevant information, and plain efficiency issues. In the last few years, specialized solutions have emerged for unsupervised outlier detection in high-dimensional Euclidean data. These approaches fall into two main categories, depending on whether or not they consider subspaces (subsets of attributes) when defining outliers. The former specifically address the presence of irrelevant attributes; the latter account for irrelevant attributes implicitly at best and are more concerned with general issues of efficiency and effectiveness. Nevertheless, both types of specialized outlier detection algorithms tackle challenges specific to high-dimensional data. In this tutorial, we discuss in detail those aspects of the *curse of dimensionality* that are most important for outlier detection, and survey specialized algorithms from both categories.
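As a rough illustration of the *distance concentration effect* mentioned above, the following minimal sketch measures the relative contrast `(d_max - d_min) / d_min` of Euclidean distances from a query point to a random sample. The uniform data distribution, the sample size, and the function name `relative_contrast` are assumptions chosen for illustration; they are not taken from the tutorial material.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for Euclidean distances from a random
    query point to n_points points drawn uniformly from the unit hypercube.

    Illustrative helper (hypothetical name); uniform data is an assumption.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


# Compare the contrast between nearest and farthest neighbor across dimensions.
contrasts = {dim: relative_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, contrast in contrasts.items():
    print(f"dim={dim:5d}  relative contrast={contrast:.3f}")
```

On such independent, identically distributed attributes, the contrast typically shrinks as the dimensionality grows, so the nearest and the farthest neighbor become harder to tell apart by distance alone. Data with real structure, or with a mix of relevant and irrelevant attributes, can behave differently.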

## Survey

This tutorial is based on the survey article

A. Zimek, E. Schubert, H.-P. Kriegel:
*A Survey on Unsupervised Outlier Detection in High-Dimensional Numerical Data*.
Statistical Analysis and Data Mining, 5(5): 363–387, 2012.

## Selected References

2012 | |

30 | T. de Vries, S. Chawla, M. E. Houle: *Density-preserving projections for large-scale local anomaly detection*. Knowledge and Information Systems (KAIS), 32(1): 25–52, 2012. |

29 | C. C. Aggarwal: *Outlier Ensembles*. ACM SIGKDD Explorations, 14(2): 49–58, 2012. |

28 | H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek: *Outlier Detection in Arbitrarily Oriented Subspaces*. In Proceedings of the 12th IEEE International Conference on Data Mining (ICDM), Brussels, Belgium: 379–388, 2012. |

27 | E. Müller, I. Assent, P. Iglesias, Y. Mülle, K. Böhm: *Outlier Ranking via Subspace Analysis in Multiple Views of the Data*. In Proceedings of the 12th IEEE International Conference on Data Mining (ICDM), Brussels, Belgium: 529–538, 2012. |

26 | E. Schubert, R. Wojdanowski, A. Zimek, H.-P. Kriegel: *On Evaluation of Outlier Rankings and Outlier Scores*. In Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA: 1047–1058, 2012. |

25 | N. Pham, R. Pagh: *A Near-linear Time Approximation Algorithm for Angle-based Outlier Detection in High-dimensional Data*. In Proceedings of the 18th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Beijing, China: 877–885, 2012. |

24 | E. Achtert, S. Goldhofer, H.-P. Kriegel, E. Schubert, A. Zimek: *Evaluation of Clusterings – Metrics and Visual Support*. In Proceedings of the 28th International Conference on Data Engineering (ICDE), Washington, DC: 1285–1288, 2012. |

23 | F. Keller, E. Müller, K. Böhm: *HiCS: High Contrast Subspaces for Density-Based Outlier Ranking*. In Proceedings of the 28th International Conference on Data Engineering (ICDE), Washington, DC: 1037–1048, 2012. |

2011 | |

22 | H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek: *Interpreting and Unifying Outlier Scores*. In Proceedings of the 11th SIAM International Conference on Data Mining (SDM), Mesa, AZ: 13–24, 2011. |

21 | E. Achtert, A. Hettab, H.-P. Kriegel, E. Schubert, A. Zimek: *Spatial Outlier Detection: Data, Algorithms, Visualizations*. In Proceedings of the 12th International Symposium on Spatial and Temporal Databases (SSTD), Minneapolis, MN: 512–516, 2011. |

20 | H. V. Nguyen, V. Gopalkrishnan, I. Assent: *An Unbiased Distance-based Outlier Detection Approach for High-dimensional Data*. In Proceedings of the 16th International Conference on Database Systems for Advanced Applications (DASFAA), Hong Kong, China: 138–152, 2011. |

19 | Y. Wang, S. Parthasarathy, S. Tatikonda: *Locality Sensitive Outlier Detection: A ranking driven approach*. In Proceedings of the 27th International Conference on Data Engineering (ICDE), Hannover, Germany: 410–421, 2011. |

18 | E. Müller, M. Schiffer, T. Seidl: *Statistical Selection of Relevant Subspace Projections for Outlier Ranking*. In Proceedings of the 27th International Conference on Data Engineering (ICDE), Hannover, Germany: 434–445, 2011. |

2010 | |

17 | M. Radovanović, A. Nanopoulos, M. Ivanović: *Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data*. Journal of Machine Learning Research, 11: 2487–2531, 2010. |

16 | T. de Vries, S. Chawla, M. E. Houle: *Finding Local Anomalies in Very High Dimensional Space*. In Proceedings of the 10th IEEE International Conference on Data Mining (ICDM), Sydney, Australia: 128–137, 2010. |

15 | H. V. Nguyen, H. H. Ang, V. Gopalkrishnan: *Mining Outliers with Ensemble of Heterogeneous Detectors on Random Subspaces*. In Proceedings of the 15th International Conference on Database Systems for Advanced Applications (DASFAA), Tsukuba, Japan: 368–383, 2010. |

14 | E. Müller, M. Schiffer, T. Seidl: *Adaptive Outlierness for Subspace Outlier Ranking*. In Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM), Toronto, ON, Canada: 1629–1632, 2010. |

13 | M. E. Houle, H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek: *Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?* In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM), Heidelberg, Germany: 482–500, 2010. |

2009 | |

12 | H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek: *Outlier Detection in Axis-Parallel Subspaces of High Dimensional Data*. In Proceedings of the 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Bangkok, Thailand: 831–838, 2009. |

2008 | |

11 | A. Ghoting, S. Parthasarathy, M. E. Otey: *Fast mining of distance-based outliers in high-dimensional datasets*. Data Mining and Knowledge Discovery, 16(3): 349–364, 2008. |

10 | H.-P. Kriegel, M. Schubert, A. Zimek: *Angle-Based Outlier Detection in High-dimensional Data*. In Proceedings of the 14th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Las Vegas, NV: 444–452, 2008. |

9 | E. Müller, I. Assent, U. Steinhausen, T. Seidl: *OutRank: ranking outliers in high dimensional data*. In Proceedings of the 24th International Conference on Data Engineering (ICDE) Workshop on Ranking in Databases (DBRank), Cancun, Mexico: 600–603, 2008. |

2005 | |

8 | F. Angiulli, C. Pizzuti: *Outlier mining in large high-dimensional data sets*. IEEE Transactions on Knowledge and Data Engineering, 17(2): 203–215, 2005. |

7 | A. Lazarevic, V. Kumar: *Feature Bagging for Outlier Detection*. In Proceedings of the 11th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Chicago, IL: 157–166, 2005. |

2004 | |

6 | J. Zhang, M. Lou, T. W. Ling, H. Wang: *HOS-Miner: A System for Detecting Outlying Subspaces of High-dimensional Data*. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), Toronto, Canada: 1265–1268, 2004. |

2002 | |

5 | F. Angiulli, C. Pizzuti: *Fast Outlier Detection in High Dimensional Spaces*. In Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), Helsinki, Finland: 15–26, 2002. |

2001 | |

4 | C. C. Aggarwal, P. S. Yu: *Outlier Detection for High Dimensional Data*. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), Santa Barbara, CA: 37–46, 2001. |

2000 | |

3 | S. Ramaswamy, R. Rastogi, K. Shim: *Efficient algorithms for mining outliers from large data sets*. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), Dallas, TX: 427–438, 2000. |

1999 | |

2 | K. P. Bennett, U. Fayyad, D. Geiger: *Density-Based Indexing for Approximate Nearest-Neighbor Queries*. In Proceedings of the 5th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), San Diego, CA: 233–243, 1999. |

1 | K. Beyer, J. Goldstein, R. Ramakrishnan, U. Shaft: *When Is "Nearest Neighbor" Meaningful?* In Proceedings of the 7th International Conference on Database Theory (ICDT), Jerusalem, Israel: 217–235, 1999. |