Environment for
DeveLoping
KDD-Applications
Supported by Index-Structures

de.lmu.ifi.dbs.elki.algorithm.clustering
Class EM<V extends NumberVector<V,?>>

java.lang.Object
  extended by de.lmu.ifi.dbs.elki.logging.AbstractLoggable
      extended by de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm<V,Clustering<EMModel<V>>>
          extended by de.lmu.ifi.dbs.elki.algorithm.clustering.EM<V>
Type Parameters:
V - a type of NumberVector as a suitable datatype for this algorithm
All Implemented Interfaces:
Algorithm<V,Clustering<EMModel<V>>>, ClusteringAlgorithm<Clustering<EMModel<V>>,V>, Parameterizable

@Title(value="EM-Clustering: Clustering by Expectation Maximization")
@Description(value="Provides k Gaussian mixtures maximizing the probability of the given data")
@Reference(authors="A. P. Dempster, N. M. Laird, D. B. Rubin",
           title="Maximum Likelihood from Incomplete Data via the EM algorithm",
           booktitle="Journal of the Royal Statistical Society, Series B, 39(1), 1977, pp. 1-31",
           url="http://www.jstor.org/stable/2984875")
public class EM<V extends NumberVector<V,?>>
extends AbstractAlgorithm<V,Clustering<EMModel<V>>>
implements ClusteringAlgorithm<Clustering<EMModel<V>>,V>

Provides the EM algorithm (clustering by expectation maximization).

Initialization is implemented as random initialization of means (uniformly distributed within the attribute ranges of the given database) and initial zero-covariance and variance=1 in covariance matrices.

Reference: A. P. Dempster, N. M. Laird, D. B. Rubin: Maximum Likelihood from Incomplete Data via the EM algorithm.
In Journal of the Royal Statistical Society, Series B, 39(1), 1977, pp. 1-31

Author:
Arthur Zimek

Field Summary
private  double delta
          Holds the value of DELTA_PARAM.
static OptionID DELTA_ID
          OptionID for DELTA_PARAM
private  DoubleParameter DELTA_PARAM
          Parameter to specify the termination criterion for maximization of E(M): E(M) - E(M') < em.delta, must be a double equal to or greater than 0.
private  int k
          Holds the value of K_PARAM.
static OptionID K_ID
          OptionID for K_PARAM
private  IntParameter K_PARAM
          Parameter to specify the number of clusters to find, must be an integer greater than 0.
private static double MIN_LOGLIKELIHOOD
           
private  HashMap<Integer,double[]> probClusterIGivenX
          Store the individual probabilities, for use by EMOutlierDetection etc.
private static double SINGULARITY_CHEAT
          Small value to increment diagonally of a matrix in order to avoid singularity before building the inverse.
 
Fields inherited from class de.lmu.ifi.dbs.elki.logging.AbstractLoggable
debug, logger
 
Constructor Summary
EM(Parameterization config)
          Constructor, adhering to Parameterizable
 
Method Summary
protected  double assignProbabilitiesToInstances(Database<V> database, List<Double> normDistrFactor, List<V> means, List<Matrix> invCovMatr, List<Double> clusterWeights, HashMap<Integer,double[]> probClusterIGivenX)
          Assigns the current probability values to the instances in the database and compute the expectation value of the current mixture of distributions.
 double[] getProbClusterIGivenX(Integer index)
          Get the probabilities for a given point.
protected  List<V> initialMeans(Database<V> database)
          Creates k random points distributed uniformly within the attribute ranges of the given database.
protected  Clustering<EMModel<V>> runInTime(Database<V> database)
          Performs the EM clustering algorithm on the given database.
 
Methods inherited from class de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm
isTime, isVerbose, run, setTime, setVerbose
 
Methods inherited from class de.lmu.ifi.dbs.elki.logging.AbstractLoggable
debugFine, debugFiner, debugFinest, exception, progress, verbose, warning
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface de.lmu.ifi.dbs.elki.algorithm.clustering.ClusteringAlgorithm
run
 
Methods inherited from interface de.lmu.ifi.dbs.elki.algorithm.Algorithm
setTime, setVerbose
 

Field Detail

SINGULARITY_CHEAT

private static final double SINGULARITY_CHEAT
Small value to increment diagonally of a matrix in order to avoid singularity before building the inverse.

See Also:
Constant Field Values

K_ID

public static final OptionID K_ID
OptionID for K_PARAM


K_PARAM

private final IntParameter K_PARAM
Parameter to specify the number of clusters to find, must be an integer greater than 0.

Key: -em.k


k

private int k
Holds the value of K_PARAM.


DELTA_ID

public static final OptionID DELTA_ID
OptionID for DELTA_PARAM


MIN_LOGLIKELIHOOD

private static final double MIN_LOGLIKELIHOOD
See Also:
Constant Field Values

DELTA_PARAM

private final DoubleParameter DELTA_PARAM
Parameter to specify the termination criterion for maximization of E(M): E(M) - E(M') < em.delta, must be a double equal to or greater than 0.

Default value: 0.0

Key: -em.delta


delta

private double delta
Holds the value of DELTA_PARAM.


probClusterIGivenX

private HashMap<Integer,double[]> probClusterIGivenX
Store the individual probabilities, for use by EMOutlierDetection etc.

Constructor Detail

EM

public EM(Parameterization config)
Constructor, adhering to Parameterizable

Parameters:
config - Parameterization
Method Detail

runInTime

protected Clustering<EMModel<V>> runInTime(Database<V> database)
                                                              throws IllegalStateException
Performs the EM clustering algorithm on the given database.

Finally a hard clustering is provided where each clusters gets assigned the points exhibiting the highest probability to belong to this cluster. But still, the database objects hold associated the complete probability-vector for all models.

Specified by:
runInTime in class AbstractAlgorithm<V extends NumberVector<V,?>,Clustering<EMModel<V extends NumberVector<V,?>>>>
Parameters:
database - the database to run the algorithm on
Returns:
the Result computed by this algorithm
Throws:
IllegalStateException - if the algorithm has not been initialized properly (e.g. the setParameters(String[]) method has been failed to be called).

assignProbabilitiesToInstances

protected double assignProbabilitiesToInstances(Database<V> database,
                                                List<Double> normDistrFactor,
                                                List<V> means,
                                                List<Matrix> invCovMatr,
                                                List<Double> clusterWeights,
                                                HashMap<Integer,double[]> probClusterIGivenX)
Assigns the current probability values to the instances in the database and compute the expectation value of the current mixture of distributions. Computed as the sum of the logarithms of the prior probability of each instance.

Parameters:
database - the database used for assignment to instances
normDistrFactor - normalization factor for density function, based on current covariance matrix
means - the current means
invCovMatr - the inverse covariance matrices
clusterWeights - the weights of the current clusters
Returns:
the expectation value of the current mixture of distributions

initialMeans

protected List<V> initialMeans(Database<V> database)
Creates k random points distributed uniformly within the attribute ranges of the given database.

Parameters:
database - the database must contain enough points in order to ascertain the range of attribute values. Less than two points would make no sense. The content of the database is not touched otherwise.
Returns:
a list of k random points distributed uniformly within the attribute ranges of the given database

getProbClusterIGivenX

public double[] getProbClusterIGivenX(Integer index)
Get the probabilities for a given point.

Parameters:
index - Point ID
Returns:
Probabilities of given point

Release 0.3 (2010-03-31_1612)