Search:
Lehrstuhl  |  Institut  |  Fakultät  |  LMU
print

Subspace Similarity Search: Efficient k-NN Queries in Arbitrary Subspaces

Published at 22nd International Conference on Scientific and Statistical Database Management (SSDBM)

Conference Date: June 30th to July 2nd of 2010
Conference Location: Heidelberg, Germany
Conference Title: 22nd International Conference on Scientific and Statistical Database Management (SSDBM), Heidelberg, Germany, 2010
Conference Chairs: Andreas Reuter, Michael Gertz Conference Co-Chairs: Tony Hey, Bertram Ludäscher

Abstract:

There are abundant scenarios for applications of similarity search in databases where the similarity of objects is defined for a subset of attributes, i.e., in a subspace, only. While much research has been done in efficient support of single column similarity queries or of similarity queries in the full space, scarcely any support of similarity search in subspaces has been provided so far. The three existing approaches are variations of the sequential scan. Here, we propose the first index-based solution to subspace similarity search in arbitrary subspaces.

Copyright Notes:

Bernecker, T., Emrich, T., Graf, F., Kriegel, H.-P., Kröger, P., Renz, M., Schubert, E., and Zimek, A.: Subspace Similarity Search Using the Ideas of Ranking and Top-k Retrieval. In Proceedings of the 26th International Conference on Data Engineering (ICDE) Workshop on Ranking in Databases (DBRank), Long Beach, CA, 2010.

DOI: 10.1109/ICDEW.2010.5452771.

Bernecker, T., Emrich, T., Graf, F., Kriegel, H.-P., Kröger, P., Renz, M., Schubert, E., and Zimek, A.: Subspace Similarity Search: Efficient k-NN Queries in Arbitrary Subspaces. In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM), Heidelberg, Germany, 2010.

M. Gertz and B. Ludäscher (Eds.): SSDBM 2010, LNCS 6187, pp. 555–564, 2010. (c) Springer-Verlag Berlin Heidelberg 2010

DOI: 10.1007/978-3-642-13818-8_38.

Documents/Downloads:

This is the authors' version of the work. It is posted here by permission of Springer for your personal use. Not for redistribution.
Paper pdf.gif

BibTex:

@inproceedings{Bernecker2010,
  title = {Subspace Similarity Search: Efficient k-NN Queries in Arbitrary Subspaces},
  author = {Thomas Bernecker and Tobias Emrich and Franz Graf and Hans-Peter Kriegel and Peer Kröger and Matthias Renz and Erich Schubert and Arthur Zimek},
  booktitle = {Scientific and Statistical Database Management},
  publisher = {Springer Berlin / Heidelberg},
  issn = {0302-9743 (Print) 1611-3349 (Online)},
  isbn = {978-3-642-13817-1},
  url = {http://www.springerlink.com/content/1232010q607v0113/},
  volume = {6187},
  series = {Lecture Notes in Computer Science},
  year = {2010},
  pages = {555--564},
  subject_collection = {Computer Science},
  doi = {10.1007/978-3-642-13818-8_38}
}

Code:

The top-down approach can be tested with ELKI, release 0.5.5 - here is an example call:

java -cp elki-0.5.5.jar de.lmu.ifi.dbs.elki.application.KDDCLIApplication
-dbc.in inputdatafile
-db.index tree.spatial.rstarvariants.rstar.RStarTreeFactory
-treeindex.pagesize 8192
-spatial.bulkstrategy SortTileRecursiveBulkSplit
-algorithm benchmark.KNNBenchmarkAlgorithm
-knnbench.k 20
-algorithm.distancefunction subspace.SubspaceEuclideanDistanceFunction
-distance.dims 1,2,7,11
-resulthandler DiscardResultHandler
-evaluator NoAutomaticEvaluation
-time

The runtime will be reported as follows:

de.lmu.ifi.dbs.elki.algorithm.benchmark.KNNBenchmarkAlgorithm runtime
: 15928 milliseconds.

To get the linear scan runtime, remove the "-db.index" parameters to run without an index.

Results will depend a lot on the data set, dimensionality, query dimensionality and on the page size! As a rule of thumb, the runtime of an experiment should be on the magnitude of minutes. You should also take the average over runs with different subspaces: "-distance.dims".

Please, consider the hints for benchmarking with ELKI.

blank