Environment for
DeveLoping
KDD-Applications
Supported by Index-Structures

de.lmu.ifi.dbs.elki.algorithm.clustering.correlation
Class CASH

java.lang.Object
  extended by de.lmu.ifi.dbs.elki.logging.AbstractLoggable
      extended by de.lmu.ifi.dbs.elki.utilities.optionhandling.AbstractParameterizable
          extended by de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm<ParameterizationFunction,Clustering<Model>>
              extended by de.lmu.ifi.dbs.elki.algorithm.clustering.correlation.CASH
All Implemented Interfaces:
Algorithm<ParameterizationFunction,Clustering<Model>>, ClusteringAlgorithm<Clustering<Model>,ParameterizationFunction>, Parameterizable

public class CASH
extends AbstractAlgorithm<ParameterizationFunction,Clustering<Model>>
implements ClusteringAlgorithm<Clustering<Model>,ParameterizationFunction>

Provides the CASH algorithm, an subspace clustering algorithm based on the hough transform.

Reference: E. Achtert, C. Böhm, J. David, P. Kröger, A. Zimek: Robust clustering in arbitrarily oriented subspaces.
In Proc. 8th SIAM Int. Conf. on Data Mining (SDM'08), Atlanta, GA, 2008

Author:
Elke Achtert

Field Summary
private  boolean adjust
          Holds the value of ADJUST_FLAG.
private  Flag ADJUST_FLAG
          Flag to indicate that an adjustment of the applied heuristic for choosing an interval is performed after an interval is selected.
static OptionID ADJUST_ID
          OptionID for ADJUST_FLAG
private  Database<ParameterizationFunction> database
          The database holding the objects.
private  double jitter
          Holds the value of JITTER_PARAM.
static OptionID JITTER_ID
          OptionID for JITTER_PARAM
private  DoubleParameter JITTER_PARAM
          Parameter to specify the maximum jitter for distance values, must be a double greater than 0.
private  int maxLevel
          Holds the value of MAXLEVEL_PARAM.
static OptionID MAXLEVEL_ID
          OptionID for MAXLEVEL_PARAM.
private  IntParameter MAXLEVEL_PARAM
          Parameter to specify the maximum level for splitting the hypercube, must be an integer greater than 0.
private  int minDim
          Holds the value of MINDIM_PARAM.
static OptionID MINDIM_ID
          OptionID for MINDIM_PARAM
private  IntParameter MINDIM_PARAM
          Parameter to specify the minimum dimensionality of the subspaces to be found, must be an integer greater than 0.
private  int minPts
          Holds the value of MINPTS_PARAM.
static OptionID MINPTS_ID
          OptionID for MINPTS_PARAM
private  IntParameter MINPTS_PARAM
          Parameter to specify the threshold for minimum number of points in a cluster, must be an integer greater than 0.
private  int noiseDim
          Holds the dimensionality for noise.
private  Set<Integer> processedIDs
          Holds a set of processed ids.
private  Clustering<Model> result
          The result.
 
Fields inherited from class de.lmu.ifi.dbs.elki.utilities.optionhandling.AbstractParameterizable
optionHandler
 
Fields inherited from class de.lmu.ifi.dbs.elki.logging.AbstractLoggable
debug, logger
 
Constructor Summary
CASH()
          Provides a new CASH algorithm, adding parameters MINPTS_PARAM, MAXLEVEL_PARAM, MINDIM_PARAM, JITTER_PARAM, and flag ADJUST_FLAG to the option handler additionally to parameters of super class.
 
Method Summary
private  Database<ParameterizationFunction> buildDB(int dim, Matrix basis, Set<Integer> ids, Database<ParameterizationFunction> database)
          Builds a dim-1 dimensional database where the objects are projected into the specified subspace.
private  Database<DoubleVector> buildDerivatorDB(Database<ParameterizationFunction> database, CASHInterval interval)
          Builds a database for the derivator consisting of the ids in the specified interval.
private  Database<DoubleVector> buildDerivatorDB(Database<ParameterizationFunction> database, Set<Integer> ids)
          Builds a database for the derivator consisting of the ids in the specified interval.
private  Matrix determineBasis(double[] alpha)
          Determines a basis defining a subspace described by the specified alpha values.
private  double[] determineMinMaxDistance(Database<ParameterizationFunction> database, int dimensionality)
          Determines the minimum and maximum function value of all parameterization functions stored in the specified database.
private  CASHInterval determineNextIntervalAtMaxLevel(DefaultHeap<Integer,CASHInterval> heap)
          Determines the next ''best'' interval at maximum level, i.e. the next interval containing the most unprocessed objects.
private  CASHInterval doDetermineNextIntervalAtMaxLevel(DefaultHeap<Integer,CASHInterval> heap)
          Recursive helper method to determine the next ''best'' interval at maximum level, i.e. the next interval containing the most unprocessed objects
private  Clustering<Model> doRun(Database<ParameterizationFunction> database, FiniteProgress progress)
          Runs the CASH algorithm on the specified database, this method is recursively called until only noise is left.
private  Set<Integer> getDatabaseIDs(Database<ParameterizationFunction> database)
          Returns the set of ids belonging to the specified database.
 Description getDescription()
          Returns a description of the algorithm.
 Clustering<Model> getResult()
          Returns the result of the algorithm.
private  void initHeap(DefaultHeap<Integer,CASHInterval> heap, Database<ParameterizationFunction> database, int dim, Set<Integer> ids)
          Initializes the heap with the root intervals.
private  ParameterizationFunction project(Matrix basis, ParameterizationFunction f)
          Projects the specified parameterization function into the subspace described by the given basis.
private  Matrix runDerivator(Database<ParameterizationFunction> database, int dim, CASHInterval interval, Set<Integer> ids)
          Runs the derivator on the specified interval and assigns all points having a distance less then the standard deviation of the derivator model to the model to this model.
private  LinearEquationSystem runDerivator(Database<ParameterizationFunction> database, int dimensionality, Set<Integer> ids)
          Runs the derivator on the specified interval and assigns all points having a distance less then the standard deviation of the derivator model to the model to this model.
protected  Clustering<Model> runInTime(Database<ParameterizationFunction> database)
          Performs the CASH algorithm on the given database.
 List<String> setParameters(List<String> args)
          Calls the super method and sets additionally the values of the parameters MINPTS_PARAM, MAXLEVEL_PARAM, MINDIM_PARAM, JITTER_PARAM, and the flag ADJUST_FLAG.
private  double sinusProduct(int start, int end, double[] alpha)
          Computes the product of all sinus values of the specified angles from start to end index.
 
Methods inherited from class de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm
isTime, isVerbose, run, setTime, setVerbose
 
Methods inherited from class de.lmu.ifi.dbs.elki.utilities.optionhandling.AbstractParameterizable
addOption, addParameterizable, addParameterizable, checkGlobalParameterConstraints, collectOptions, getAttributeSettings, getParameters, rememberParametersExcept, removeOption, removeParameterizable, shortDescription
 
Methods inherited from class de.lmu.ifi.dbs.elki.logging.AbstractLoggable
debugFine, debugFiner, debugFinest, exception, progress, verbose, warning
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface de.lmu.ifi.dbs.elki.algorithm.clustering.ClusteringAlgorithm
run
 
Methods inherited from interface de.lmu.ifi.dbs.elki.algorithm.Algorithm
setTime, setVerbose
 
Methods inherited from interface de.lmu.ifi.dbs.elki.utilities.optionhandling.Parameterizable
checkGlobalParameterConstraints, collectOptions, getParameters, shortDescription
 

Field Detail

MINPTS_ID

public static final OptionID MINPTS_ID
OptionID for MINPTS_PARAM


MINPTS_PARAM

private final IntParameter MINPTS_PARAM
Parameter to specify the threshold for minimum number of points in a cluster, must be an integer greater than 0.

Key: -cash.minpts


minPts

private int minPts
Holds the value of MINPTS_PARAM.


MAXLEVEL_ID

public static final OptionID MAXLEVEL_ID
OptionID for MAXLEVEL_PARAM.


MAXLEVEL_PARAM

private final IntParameter MAXLEVEL_PARAM
Parameter to specify the maximum level for splitting the hypercube, must be an integer greater than 0.

Key: -cash.maxlevel


maxLevel

private int maxLevel
Holds the value of MAXLEVEL_PARAM.


MINDIM_ID

public static final OptionID MINDIM_ID
OptionID for MINDIM_PARAM


MINDIM_PARAM

private final IntParameter MINDIM_PARAM
Parameter to specify the minimum dimensionality of the subspaces to be found, must be an integer greater than 0.

Default value: 1

Key: -cash.mindim


minDim

private int minDim
Holds the value of MINDIM_PARAM.


JITTER_ID

public static final OptionID JITTER_ID
OptionID for JITTER_PARAM


JITTER_PARAM

private final DoubleParameter JITTER_PARAM
Parameter to specify the maximum jitter for distance values, must be a double greater than 0.

Key: -cash.jitter


jitter

private double jitter
Holds the value of JITTER_PARAM.


ADJUST_ID

public static final OptionID ADJUST_ID
OptionID for ADJUST_FLAG


ADJUST_FLAG

private final Flag ADJUST_FLAG
Flag to indicate that an adjustment of the applied heuristic for choosing an interval is performed after an interval is selected.

Key: -cash.adjust


adjust

private boolean adjust
Holds the value of ADJUST_FLAG.


result

private Clustering<Model> result
The result.


noiseDim

private int noiseDim
Holds the dimensionality for noise.


processedIDs

private Set<Integer> processedIDs
Holds a set of processed ids.


database

private Database<ParameterizationFunction> database
The database holding the objects.

Constructor Detail

CASH

public CASH()
Provides a new CASH algorithm, adding parameters MINPTS_PARAM, MAXLEVEL_PARAM, MINDIM_PARAM, JITTER_PARAM, and flag ADJUST_FLAG to the option handler additionally to parameters of super class.

Method Detail

runInTime

protected Clustering<Model> runInTime(Database<ParameterizationFunction> database)
                               throws IllegalStateException
Performs the CASH algorithm on the given database.

Specified by:
runInTime in class AbstractAlgorithm<ParameterizationFunction,Clustering<Model>>
Parameters:
database - the database to run the algorithm on
Returns:
the Result computed by this algorithm
Throws:
IllegalStateException - if the algorithm has not been initialized properly (e.g. the setParameters(String[]) method has been failed to be called).

getResult

public Clustering<Model> getResult()
Returns the result of the algorithm.

Specified by:
getResult in interface Algorithm<ParameterizationFunction,Clustering<Model>>
Specified by:
getResult in interface ClusteringAlgorithm<Clustering<Model>,ParameterizationFunction>
Returns:
the result of the algorithm

getDescription

public Description getDescription()
Returns a description of the algorithm.

Specified by:
getDescription in interface Algorithm<ParameterizationFunction,Clustering<Model>>
Returns:
a description of the algorithm

setParameters

public List<String> setParameters(List<String> args)
                           throws ParameterException
Calls the super method and sets additionally the values of the parameters MINPTS_PARAM, MAXLEVEL_PARAM, MINDIM_PARAM, JITTER_PARAM, and the flag ADJUST_FLAG.

Specified by:
setParameters in interface Parameterizable
Overrides:
setParameters in class AbstractAlgorithm<ParameterizationFunction,Clustering<Model>>
Parameters:
args - parameters to set the attributes accordingly to
Returns:
a list containing the unused parameters
Throws:
ParameterException - in case of wrong parameter-setting

doRun

private Clustering<Model> doRun(Database<ParameterizationFunction> database,
                                FiniteProgress progress)
                         throws UnableToComplyException,
                                ParameterException,
                                NonNumericFeaturesException
Runs the CASH algorithm on the specified database, this method is recursively called until only noise is left.

Parameters:
database - the current database to run the CASH algorithm on
progress - the progress object for verbose messages
Returns:
a mapping of subspace dimensionalities to clusters
Throws:
UnableToComplyException - if an error according to the database occurs
ParameterException - if the parameter setting is wrong
NonNumericFeaturesException - if non numeric feature vectors are used

initHeap

private void initHeap(DefaultHeap<Integer,CASHInterval> heap,
                      Database<ParameterizationFunction> database,
                      int dim,
                      Set<Integer> ids)
Initializes the heap with the root intervals.

Parameters:
heap - the heap to be initialized
database - the database storing the paramterization functions
dim - the dimensionality of the database
ids - the ids of the database

buildDB

private Database<ParameterizationFunction> buildDB(int dim,
                                                   Matrix basis,
                                                   Set<Integer> ids,
                                                   Database<ParameterizationFunction> database)
                                            throws UnableToComplyException
Builds a dim-1 dimensional database where the objects are projected into the specified subspace.

Parameters:
dim - the dimensionality of the database
basis - the basis defining the subspace
ids - the ids for the new database
database - the database storing the parameterization functions
Returns:
a dim-1 dimensional database where the objects are projected into the specified subspace
Throws:
UnableToComplyException - if an error according to the database occurs

project

private ParameterizationFunction project(Matrix basis,
                                         ParameterizationFunction f)
Projects the specified parameterization function into the subspace described by the given basis.

Parameters:
basis - the basis defining he subspace
f - the parameterization function to be projected
Returns:
the projected parameterization function

determineBasis

private Matrix determineBasis(double[] alpha)
Determines a basis defining a subspace described by the specified alpha values.

Parameters:
alpha - the alpha values
Returns:
a basis defining a subspace described by the specified alpha values

sinusProduct

private double sinusProduct(int start,
                            int end,
                            double[] alpha)
Computes the product of all sinus values of the specified angles from start to end index.

Parameters:
start - the index to start
end - the index to end
alpha - the array of angles
Returns:
the product of all sinus values of the specified angles from start to end index

determineNextIntervalAtMaxLevel

private CASHInterval determineNextIntervalAtMaxLevel(DefaultHeap<Integer,CASHInterval> heap)
Determines the next ''best'' interval at maximum level, i.e. the next interval containing the most unprocessed objects.

Parameters:
heap - the heap storing the intervals
Returns:
the next ''best'' interval at maximum level

doDetermineNextIntervalAtMaxLevel

private CASHInterval doDetermineNextIntervalAtMaxLevel(DefaultHeap<Integer,CASHInterval> heap)
Recursive helper method to determine the next ''best'' interval at maximum level, i.e. the next interval containing the most unprocessed objects

Parameters:
heap - the heap storing the intervals
Returns:
the next ''best'' interval at maximum level

getDatabaseIDs

private Set<Integer> getDatabaseIDs(Database<ParameterizationFunction> database)
Returns the set of ids belonging to the specified database.

Parameters:
database - the database containing the parameterization functions.
Returns:
the set of ids belonging to the specified database

determineMinMaxDistance

private double[] determineMinMaxDistance(Database<ParameterizationFunction> database,
                                         int dimensionality)
Determines the minimum and maximum function value of all parameterization functions stored in the specified database.

Parameters:
database - the database containing the parameterization functions.
dimensionality - the dimensionality of the database
Returns:
an array containing the minimum and maximum function value of all parameterization functions stored in the specified database

runDerivator

private Matrix runDerivator(Database<ParameterizationFunction> database,
                            int dim,
                            CASHInterval interval,
                            Set<Integer> ids)
                     throws UnableToComplyException,
                            ParameterException
Runs the derivator on the specified interval and assigns all points having a distance less then the standard deviation of the derivator model to the model to this model.

Parameters:
database - the database containing the parameterization functions
interval - the interval to build the model
dim - the dimensionality of the database
ids - an empty set to assign the ids
Returns:
a basis of the found subspace
Throws:
UnableToComplyException - if an error according to the database occurs
ParameterException - if the parameter setting is wrong

buildDerivatorDB

private Database<DoubleVector> buildDerivatorDB(Database<ParameterizationFunction> database,
                                                CASHInterval interval)
                                         throws UnableToComplyException
Builds a database for the derivator consisting of the ids in the specified interval.

Parameters:
database - the database storing the parameterization functions
interval - the interval to build the database from
Returns:
a database for the derivator consisting of the ids in the specified interval
Throws:
UnableToComplyException - if an error according to the database occurs

runDerivator

private LinearEquationSystem runDerivator(Database<ParameterizationFunction> database,
                                          int dimensionality,
                                          Set<Integer> ids)
Runs the derivator on the specified interval and assigns all points having a distance less then the standard deviation of the derivator model to the model to this model.

Parameters:
database - the database containing the parameterization functions
ids - the ids to build the model
dimensionality - the dimensionality of the subspace
Returns:
a basis of the found subspace

buildDerivatorDB

private Database<DoubleVector> buildDerivatorDB(Database<ParameterizationFunction> database,
                                                Set<Integer> ids)
                                         throws UnableToComplyException
Builds a database for the derivator consisting of the ids in the specified interval.

Parameters:
database - the database storing the parameterization functions
ids - the ids to build the database from
Returns:
a database for the derivator consisting of the ids in the specified interval
Throws:
UnableToComplyException - if initialization of the database is not possible

Release 0.2 (2009-07-06_1820)