de.lmu.ifi.dbs.elki.algorithm.clustering.correlation
Class CASH

java.lang.Object
  extended by de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm<Clustering<Model>>
      extended by de.lmu.ifi.dbs.elki.algorithm.clustering.correlation.CASH
All Implemented Interfaces:
Algorithm, ClusteringAlgorithm<Clustering<Model>>, InspectionUtilFrequentlyScanned, Parameterizable

@Title(value="CASH: Robust clustering in arbitrarily oriented subspaces")
@Description(value="Subspace clustering algorithm based on the hough transform.")
@Reference(authors="E. Achtert, C. B\u00f6hm, J. David, P. Kr\u00f6ger, A. Zimek",
           title="Robust clustering in arbitraily oriented subspaces",
           booktitle="Proc. 8th SIAM Int. Conf. on Data Mining (SDM\'08), Atlanta, GA, 2008",
           url="http://www.siam.org/proceedings/datamining/2008/dm08_69_AchtertBoehmDavidKroegerZimek.pdf")
public class CASH
extends AbstractAlgorithm<Clustering<Model>>
implements ClusteringAlgorithm<Clustering<Model>>

Provides the CASH algorithm, an subspace clustering algorithm based on the hough transform.

Reference: E. Achtert, C. Böhm, J. David, P. Kröger, A. Zimek: Robust clustering in arbitrarily oriented subspaces.
In Proc. 8th SIAM Int. Conf. on Data Mining (SDM'08), Atlanta, GA, 2008


Nested Class Summary
static class CASH.Parameterizer
          Parameterization class.
 
Field Summary
private  boolean adjust
          Holds the value of ADJUST_ID.
static OptionID ADJUST_ID
          Flag to indicate that an adjustment of the applied heuristic for choosing an interval is performed after an interval is selected.
private  Relation<ParameterizationFunction> fulldatabase
          The entire database
private  double jitter
          Holds the value of JITTER_ID.
static OptionID JITTER_ID
          Parameter to specify the maximum jitter for distance values, must be a double greater than 0.
private static Logging logger
          The logger for this class.
private  int maxLevel
          Holds the value of MAXLEVEL_ID.
static OptionID MAXLEVEL_ID
          Parameter to specify the maximum level for splitting the hypercube, must be an integer greater than 0.
private  int minDim
          Holds the value of MINDIM_ID.
static OptionID MINDIM_ID
          Parameter to specify the minimum dimensionality of the subspaces to be found, must be an integer greater than 0.
private  int minPts
          Holds the value of MINPTS_ID.
static OptionID MINPTS_ID
          Parameter to specify the threshold for minimum number of points in a cluster, must be an integer greater than 0.
private  int noiseDim
          Holds the dimensionality for noise.
private  ModifiableDBIDs processedIDs
          Holds a set of processed ids.
 
Constructor Summary
CASH(int minPts, int maxLevel, int minDim, double jitter, boolean adjust)
          Constructor.
 
Method Summary
private  MaterializedRelation<ParameterizationFunction> buildDB(int dim, Matrix basis, DBIDs ids, Relation<ParameterizationFunction> relation)
          Builds a dim-1 dimensional database where the objects are projected into the specified subspace.
private  Database buildDerivatorDB(Relation<ParameterizationFunction> relation, CASHInterval interval)
          Builds a database for the derivator consisting of the ids in the specified interval.
private  Database buildDerivatorDB(Relation<ParameterizationFunction> relation, DBIDs ids)
          Builds a database for the derivator consisting of the ids in the specified interval.
private  Matrix determineBasis(double[] alpha)
          Determines a basis defining a subspace described by the specified alpha values.
private  double[] determineMinMaxDistance(Relation<ParameterizationFunction> relation, int dimensionality)
          Determines the minimum and maximum function value of all parameterization functions stored in the specified database.
private  CASHInterval determineNextIntervalAtMaxLevel(Heap<IntegerPriorityObject<CASHInterval>> heap)
          Determines the next ''best'' interval at maximum level, i.e. the next interval containing the most unprocessed objects.
private  CASHInterval doDetermineNextIntervalAtMaxLevel(Heap<IntegerPriorityObject<CASHInterval>> heap)
          Recursive helper method to determine the next ''best'' interval at maximum level, i.e. the next interval containing the most unprocessed objects
private  Clustering<Model> doRun(Relation<ParameterizationFunction> relation, FiniteProgress progress)
          Runs the CASH algorithm on the specified database, this method is recursively called until only noise is left.
 TypeInformation[] getInputTypeRestriction()
          Get the input type restriction used for negotiating the data query.
protected  Logging getLogger()
          Get the (STATIC) logger for this class.
private  void initHeap(Heap<IntegerPriorityObject<CASHInterval>> heap, Relation<ParameterizationFunction> relation, int dim, DBIDs ids)
          Initializes the heap with the root intervals.
private  ParameterizationFunction project(Matrix basis, ParameterizationFunction f)
          Projects the specified parameterization function into the subspace described by the given basis.
 Clustering<Model> run(Database database, Relation<ParameterizationFunction> relation)
          Run CASH on the relation.
private  Matrix runDerivator(Relation<ParameterizationFunction> relation, int dim, CASHInterval interval, ModifiableDBIDs ids)
          Runs the derivator on the specified interval and assigns all points having a distance less then the standard deviation of the derivator model to the model to this model.
private  LinearEquationSystem runDerivator(Relation<ParameterizationFunction> relation, int dimensionality, DBIDs ids)
          Runs the derivator on the specified interval and assigns all points having a distance less then the standard deviation of the derivator model to the model to this model.
private  double sinusProduct(int start, int end, double[] alpha)
          Computes the product of all sinus values of the specified angles from start to end index.
 
Methods inherited from class de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm
makeParameterDistanceFunction, run
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface de.lmu.ifi.dbs.elki.algorithm.clustering.ClusteringAlgorithm
run
 

Field Detail

logger

private static final Logging logger
The logger for this class.


MINPTS_ID

public static final OptionID MINPTS_ID
Parameter to specify the threshold for minimum number of points in a cluster, must be an integer greater than 0.

Key: -cash.minpts


MAXLEVEL_ID

public static final OptionID MAXLEVEL_ID
Parameter to specify the maximum level for splitting the hypercube, must be an integer greater than 0.

Key: -cash.maxlevel


MINDIM_ID

public static final OptionID MINDIM_ID
Parameter to specify the minimum dimensionality of the subspaces to be found, must be an integer greater than 0.

Default value: 1

Key: -cash.mindim


JITTER_ID

public static final OptionID JITTER_ID
Parameter to specify the maximum jitter for distance values, must be a double greater than 0.

Key: -cash.jitter


ADJUST_ID

public static final OptionID ADJUST_ID
Flag to indicate that an adjustment of the applied heuristic for choosing an interval is performed after an interval is selected.

Key: -cash.adjust


minPts

private int minPts
Holds the value of MINPTS_ID.


maxLevel

private int maxLevel
Holds the value of MAXLEVEL_ID.


minDim

private int minDim
Holds the value of MINDIM_ID.


jitter

private double jitter
Holds the value of JITTER_ID.


adjust

private boolean adjust
Holds the value of ADJUST_ID.


noiseDim

private int noiseDim
Holds the dimensionality for noise.


processedIDs

private ModifiableDBIDs processedIDs
Holds a set of processed ids.


fulldatabase

private Relation<ParameterizationFunction> fulldatabase
The entire database

Constructor Detail

CASH

public CASH(int minPts,
            int maxLevel,
            int minDim,
            double jitter,
            boolean adjust)
Constructor.

Parameters:
minPts - MinPts parameter
maxLevel - Maximum level
minDim - Minimum dimensionality
jitter - Jitter
adjust - Adjust
Method Detail

run

public Clustering<Model> run(Database database,
                             Relation<ParameterizationFunction> relation)
Run CASH on the relation.

Parameters:
database - Database
relation - Relation
Returns:
Clustering result

doRun

private Clustering<Model> doRun(Relation<ParameterizationFunction> relation,
                                FiniteProgress progress)
                         throws UnableToComplyException,
                                ParameterException,
                                NonNumericFeaturesException
Runs the CASH algorithm on the specified database, this method is recursively called until only noise is left.

Parameters:
relation - the Relation to run the CASH algorithm on
progress - the progress object for verbose messages
Returns:
a mapping of subspace dimensionalities to clusters
Throws:
UnableToComplyException - if an error according to the database occurs
ParameterException - if the parameter setting is wrong
NonNumericFeaturesException - if non numeric feature vectors are used

initHeap

private void initHeap(Heap<IntegerPriorityObject<CASHInterval>> heap,
                      Relation<ParameterizationFunction> relation,
                      int dim,
                      DBIDs ids)
Initializes the heap with the root intervals.

Parameters:
heap - the heap to be initialized
relation - the database storing the parameterization functions
dim - the dimensionality of the database
ids - the ids of the database

buildDB

private MaterializedRelation<ParameterizationFunction> buildDB(int dim,
                                                               Matrix basis,
                                                               DBIDs ids,
                                                               Relation<ParameterizationFunction> relation)
                                                        throws UnableToComplyException
Builds a dim-1 dimensional database where the objects are projected into the specified subspace.

Parameters:
dim - the dimensionality of the database
basis - the basis defining the subspace
ids - the ids for the new database
relation - the database storing the parameterization functions
Returns:
a dim-1 dimensional database where the objects are projected into the specified subspace
Throws:
UnableToComplyException - if an error according to the database occurs

project

private ParameterizationFunction project(Matrix basis,
                                         ParameterizationFunction f)
Projects the specified parameterization function into the subspace described by the given basis.

Parameters:
basis - the basis defining he subspace
f - the parameterization function to be projected
Returns:
the projected parameterization function

determineBasis

private Matrix determineBasis(double[] alpha)
Determines a basis defining a subspace described by the specified alpha values.

Parameters:
alpha - the alpha values
Returns:
a basis defining a subspace described by the specified alpha values

sinusProduct

private double sinusProduct(int start,
                            int end,
                            double[] alpha)
Computes the product of all sinus values of the specified angles from start to end index.

Parameters:
start - the index to start
end - the index to end
alpha - the array of angles
Returns:
the product of all sinus values of the specified angles from start to end index

determineNextIntervalAtMaxLevel

private CASHInterval determineNextIntervalAtMaxLevel(Heap<IntegerPriorityObject<CASHInterval>> heap)
Determines the next ''best'' interval at maximum level, i.e. the next interval containing the most unprocessed objects.

Parameters:
heap - the heap storing the intervals
Returns:
the next ''best'' interval at maximum level

doDetermineNextIntervalAtMaxLevel

private CASHInterval doDetermineNextIntervalAtMaxLevel(Heap<IntegerPriorityObject<CASHInterval>> heap)
Recursive helper method to determine the next ''best'' interval at maximum level, i.e. the next interval containing the most unprocessed objects

Parameters:
heap - the heap storing the intervals
Returns:
the next ''best'' interval at maximum level

determineMinMaxDistance

private double[] determineMinMaxDistance(Relation<ParameterizationFunction> relation,
                                         int dimensionality)
Determines the minimum and maximum function value of all parameterization functions stored in the specified database.

Parameters:
relation - the database containing the parameterization functions.
dimensionality - the dimensionality of the database
Returns:
an array containing the minimum and maximum function value of all parameterization functions stored in the specified database

runDerivator

private Matrix runDerivator(Relation<ParameterizationFunction> relation,
                            int dim,
                            CASHInterval interval,
                            ModifiableDBIDs ids)
                     throws UnableToComplyException,
                            ParameterException
Runs the derivator on the specified interval and assigns all points having a distance less then the standard deviation of the derivator model to the model to this model.

Parameters:
relation - the database containing the parameterization functions
interval - the interval to build the model
dim - the dimensionality of the database
ids - an empty set to assign the ids
Returns:
a basis of the found subspace
Throws:
UnableToComplyException - if an error according to the database occurs
ParameterException - if the parameter setting is wrong

buildDerivatorDB

private Database buildDerivatorDB(Relation<ParameterizationFunction> relation,
                                  CASHInterval interval)
                           throws UnableToComplyException
Builds a database for the derivator consisting of the ids in the specified interval.

Parameters:
relation - the database storing the parameterization functions
interval - the interval to build the database from
Returns:
a database for the derivator consisting of the ids in the specified interval
Throws:
UnableToComplyException - if an error according to the database occurs

runDerivator

private LinearEquationSystem runDerivator(Relation<ParameterizationFunction> relation,
                                          int dimensionality,
                                          DBIDs ids)
Runs the derivator on the specified interval and assigns all points having a distance less then the standard deviation of the derivator model to the model to this model.

Parameters:
relation - the database containing the parameterization functions
ids - the ids to build the model
dimensionality - the dimensionality of the subspace
Returns:
a basis of the found subspace

buildDerivatorDB

private Database buildDerivatorDB(Relation<ParameterizationFunction> relation,
                                  DBIDs ids)
                           throws UnableToComplyException
Builds a database for the derivator consisting of the ids in the specified interval.

Parameters:
relation - the database storing the parameterization functions
ids - the ids to build the database from
Returns:
a database for the derivator consisting of the ids in the specified interval
Throws:
UnableToComplyException - if initialization of the database is not possible

getInputTypeRestriction

public TypeInformation[] getInputTypeRestriction()
Description copied from class: AbstractAlgorithm
Get the input type restriction used for negotiating the data query.

Specified by:
getInputTypeRestriction in interface Algorithm
Specified by:
getInputTypeRestriction in class AbstractAlgorithm<Clustering<Model>>
Returns:
Type restriction

getLogger

protected Logging getLogger()
Description copied from class: AbstractAlgorithm
Get the (STATIC) logger for this class.

Specified by:
getLogger in class AbstractAlgorithm<Clustering<Model>>
Returns:
the static logger

Release 0.4.0 (2011-09-20_1324)