weka.datagenerators
Class BIRCHCluster

java.lang.Object
  extended by weka.datagenerators.ClusterGenerator
      extended by weka.datagenerators.BIRCHCluster
All Implemented Interfaces:
OptionHandler, java.io.Serializable

public class BIRCHCluster
extends ClusterGenerator
implements OptionHandler, java.io.Serializable

Cluster data generator designed for the BIRCH system. The dataset is generated with instances in K clusters. Instances are 2-d data points. Each cluster is characterized by the number of data points in it, its radius and its center. The location of the cluster centers is determined by the pattern parameter. Three patterns are currently supported: grid, sine and random. (Taken from: BIRCH: An Efficient Data Clustering Method for Very Large Databases; T. Zhang, R. Ramakrishnan, M. Livny; 1996 ACM.)

Class to generate data randomly by producing a decision list. The decision list consists of rules. Instances are generated randomly one by one. If the decision list fails to classify the current instance, a new rule according to this current instance is generated and added to the decision list.
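
The cluster model described above (a center, a radius and a number of 2-d data points) can be illustrated with a small standalone sketch. The class ClusterSketch and its method sample are hypothetical names and are not part of weka.datagenerators. The mapping from radius to a per-dimension standard deviation of radius / sqrt(2) is an assumption made for this example, not a statement about how BIRCHCluster computes it.

```java
import java.util.Random;

/** Hypothetical sketch of one cluster; not the Weka implementation. */
final class ClusterSketch {
    final double[] center;   // 2-d cluster center
    final double radius;     // cluster radius
    final int numPoints;     // number of data points in the cluster

    ClusterSketch(double[] center, double radius, int numPoints) {
        this.center = center.clone();
        this.radius = radius;
        this.numPoints = numPoints;
    }

    /** Draws numPoints 2-d points scattered around the center. */
    double[][] sample(Random random) {
        // Assumption: each dimension is Gaussian with stdDev = radius / sqrt(2).
        double stdDev = radius / Math.sqrt(2.0);
        double[][] points = new double[numPoints][2];
        for (int i = 0; i < numPoints; i++) {
            for (int d = 0; d < 2; d++) {
                points[i][d] = center[d] + random.nextGaussian() * stdDev;
            }
        }
        return points;
    }
}
```

Calling new ClusterSketch(new double[] {0.0, 0.0}, 1.0, 50).sample(new Random(1)) would return 50 points around the origin. The same seed gives the same points.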

The option -V switches on voting, which means that at the end of the generation all instances are reclassified to the class value that is supported by the most rules.

This data generator can generate 'boolean' attributes (= nominal with the values {true, false}) and numeric attributes. The rules can be 'A' or 'NOT A' for boolean values and 'B < random_value' or 'B >= random_value' for numeric values.

Valid options are:

-G
The pattern for instance generation is grid.
This flag cannot be used at the same time as flag I. The pattern is random, if neither flag G nor flag I is set.

-I
The pattern for instance generation is sine.
This flag cannot be used at the same time as flag G. The pattern is random, if neither flag G nor flag I is set.

-N num .. num
The range of the number of instances in each cluster (default 1..50).
Lower number must be between 0 and 2500, upper number must be between 50 and 2500.

-R num .. num
The range of the radius of the clusters (default 0.1 .. SQRT(2)).
Lower number must be between 0 and SQRT(2), upper number must be between
SQRT(2) and SQRT(32).

-M num
Distance multiplier, only used if pattern is grid (default 4).

-C num
Number of cycles, only used if pattern is sine (default 4).

-O
Sets the input order to ordered. If the flag is not set, the input order is randomized.

-P num
Noise rate in percent. Can be between 0% and 30% (default 0%).
(Remark: The original algorithm only allows noise up to 10%.)

-S seed
Random number seed for random function used (default 1).
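
The constraints stated in the option list (flags -G and -I exclude each other, the pattern is random if neither is set, and the noise rate lies between 0% and 30%) can be sketched as follows. PatternFlagsSketch, resolvePattern and checkNoiseRate are hypothetical names for this illustration. This is not the option parsing performed by setOptions.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the documented option constraints. */
final class PatternFlagsSketch {
    enum Pattern { GRID, SINE, RANDOM }

    /** Resolves the pattern from the flags -G and -I. */
    static Pattern resolvePattern(String[] args) {
        List<String> list = Arrays.asList(args);
        boolean grid = list.contains("-G");
        boolean sine = list.contains("-I");
        if (grid && sine) {
            throw new IllegalArgumentException(
                "Flags -G and -I cannot be set at the same time.");
        }
        if (grid) {
            return Pattern.GRID;
        }
        if (sine) {
            return Pattern.SINE;
        }
        return Pattern.RANDOM; // neither flag set
    }

    /** Accepts a noise rate in percent only if it lies in [0, 30]. */
    static double checkNoiseRate(double percent) {
        if (!(percent >= 0.0 && percent <= 30.0)) {
            throw new IllegalArgumentException(
                "Noise rate must be between 0 and 30 percent.");
        }
        return percent;
    }
}
```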

Version:
$Revision: 1.2 $
Author:
Gabi Schmidberger (gabi@cs.waikato.ac.nz)
See Also:
Serialized Form

Nested Class Summary
private  class BIRCHCluster.Cluster
          class to represent cluster
private  class BIRCHCluster.GridVector
          class to represent Vector for placement of the center in space
 
Field Summary
static int GRID
           
private  FastVector m_ClusterList
           
private  Instances m_DatasetFormat
           
private  int m_Debug
           
private  double m_DistMult
           
private  int m_GridSize
           
private  double m_GridWidth
           
private  int m_InputOrder
           
private  int m_MaxInstNum
           
private  double m_MaxRadius
           
private  int m_MinInstNum
           
private  double m_MinRadius
           
private  double m_NoiseRate
           
private  int m_NumCycles
           
private  int m_Pattern
           
private  java.util.Random m_Random
           
private  int m_Seed
           
static int ORDERED
           
static int RANDOM
           
static int RANDOMIZED
           
static int SINE
           
 
Fields inherited from class weka.datagenerators.ClusterGenerator
m_NumAttributes, m_NumClusters
 
Constructor Summary
BIRCHCluster()
           
 
Method Summary
private  FastVector defineClusters(java.util.Random random)
          Defines the clusters
private  FastVector defineClustersGRID(java.util.Random random)
          Defines the clusters if pattern is GRID
private  FastVector defineClustersRANDOM(java.util.Random random)
          Defines the clusters if pattern is RANDOM
 Instances defineDataFormat()
          Initializes the format for the dataset produced.
 Instance generateExample()
          Generate an example of the dataset.
 Instances generateExamples()
          Generate all examples of the dataset.
 Instances generateExamples(java.util.Random random, Instances format)
          Generate all examples of the dataset.
 java.lang.String generateFinished()
          Compiles documentation about the data generation after the generation process
private  Instance generateInstance(Instances format, java.util.Random randomG, double stdDev, double[] center, java.lang.String cName)
          Generate an example of the dataset.
 java.lang.String generateStart()
          Compiles documentation about the data generation before the generation process
 Instances getDatasetFormat()
          Gets the dataset format.
 double getDistMult()
          Gets the distance multiplier.
 boolean getGridFlag()
          Gets the grid flag (option G).
 int getInputOrder()
          Gets the input order.
 java.lang.String getInstNums()
          Gets the upper and lower boundary for instances per cluster.
 int getMaxInstNum()
          Gets the upper boundary for instances per cluster.
 double getMaxRadius()
          Gets the upper boundary for the radiuses of the clusters.
 int getMinInstNum()
          Gets the lower boundary for instances per cluster.
 double getMinRadius()
          Gets the lower boundary for the radiuses of the clusters.
 double getNoiseRate()
          Gets the percentage of noise set.
 int getNumCycles()
          Gets the number of cycles.
 java.lang.String[] getOptions()
          Gets the current settings of the datagenerator BIRCHCluster.
 boolean getOrderedFlag()
          Gets the ordered flag (option O).
 int getPattern()
          Gets the pattern type.
 java.lang.String getRadiuses()
          Gets the upper and lower boundary for the radius of the clusters.
 java.util.Random getRandom()
          Gets the random generator.
 int getSeed()
          Gets the random number seed.
 boolean getSineFlag()
          Gets the sine flag (option I).
 boolean getSingleModeFlag()
          Gets the single mode flag.
 java.lang.String globalInfo()
          Returns a string describing this data generator.
 java.util.Enumeration listOptions()
          Returns an enumeration describing the available options.
static void main(java.lang.String[] argv)
          Main method for testing this class.
 void setDatasetFormat(Instances newDatasetFormat)
          Sets the dataset format.
 void setDefaultOptions()
          Sets all options to their default values.
 void setDistMult(double newDistMult)
          Sets the distance multiplier.
 void setInputOrder(int newInputOrder)
          Sets the input order.
 void setInstNums(java.lang.String fromTo)
          Sets the upper and lower boundary for instances per cluster.
 void setMaxInstNum(int newMaxInstNum)
          Sets the upper boundary for instances per cluster.
 void setMaxRadius(double newMaxRadius)
          Sets the upper boundary for the radiuses of the clusters.
 void setMinInstNum(int newMinInstNum)
          Sets the lower boundary for instances per cluster.
 void setMinRadius(double newMinRadius)
          Sets the lower boundary for the radiuses of the clusters.
 void setNoiseRate(double newNoiseRate)
          Sets the percentage of noise set.
 void setNumCycles(int newNumCycles)
          Sets the number of cycles.
 void setOptions(java.lang.String[] options)
          Parses a list of options for this object.
 void setPattern(int newPattern)
          Sets the pattern type.
 void setRadiuses(java.lang.String fromTo)
          Sets the upper and lower boundary for the radius of the clusters.
 void setRandom(java.util.Random newRandom)
          Sets the random generator.
 void setSeed(int newSeed)
          Sets the random number seed.
 
Methods inherited from class weka.datagenerators.ClusterGenerator
getClassFlag, getDebug, getFormat, getNumAttributes, getNumClusters, getNumExamplesAct, getOutput, getRelationName, makeData, setClassFlag, setDebug, setFormat, setNumAttributes, setNumClusters, setNumExamplesAct, setOutput, setRelationName, toStringFormat
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

m_MinInstNum

private int m_MinInstNum

m_MaxInstNum

private int m_MaxInstNum

m_MinRadius

private double m_MinRadius

m_MaxRadius

private double m_MaxRadius

GRID

public static final int GRID
See Also:
Constant Field Values

SINE

public static final int SINE
See Also:
Constant Field Values

RANDOM

public static final int RANDOM
See Also:
Constant Field Values

m_Pattern

private int m_Pattern

m_DistMult

private double m_DistMult

m_NumCycles

private int m_NumCycles

ORDERED

public static final int ORDERED
See Also:
Constant Field Values

RANDOMIZED

public static final int RANDOMIZED
See Also:
Constant Field Values

m_InputOrder

private int m_InputOrder

m_NoiseRate

private double m_NoiseRate

m_Seed

private int m_Seed

m_DatasetFormat

private Instances m_DatasetFormat

m_Random

private java.util.Random m_Random

m_Debug

private int m_Debug

m_ClusterList

private FastVector m_ClusterList

m_GridSize

private int m_GridSize

m_GridWidth

private double m_GridWidth
Constructor Detail

BIRCHCluster

public BIRCHCluster()
Method Detail

globalInfo

public java.lang.String globalInfo()
Returns a string describing this data generator.

Returns:
a description of the data generator suitable for displaying in the explorer/experimenter gui

setInstNums

public void setInstNums(java.lang.String fromTo)
Sets the upper and lower boundary for instances per cluster.


getInstNums

public java.lang.String getInstNums()
Gets the upper and lower boundary for instances per cluster.

Returns:
the string containing the upper and lower boundary for instances per cluster separated by ..
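
As an illustration of the "lower..upper" string form used by setInstNums and getInstNums, the sketch below parses and formats such a range. RangeSketch is a hypothetical helper written for this example. It does not reproduce the parsing code of BIRCHCluster, and the tolerance for spaces around the separator is an assumption.

```java
/** Hypothetical helper for "lower..upper" range strings. */
final class RangeSketch {
    /** Parses "1..50" or "1 .. 50" into {lower, upper}. */
    static int[] parse(String fromTo) {
        int sep = fromTo.indexOf("..");
        if (sep < 0) {
            throw new IllegalArgumentException("Missing '..' in: " + fromTo);
        }
        int lower = Integer.parseInt(fromTo.substring(0, sep).trim());
        int upper = Integer.parseInt(fromTo.substring(sep + 2).trim());
        if (lower > upper) {
            throw new IllegalArgumentException("Lower bound exceeds upper bound.");
        }
        return new int[] {lower, upper};
    }

    /** Formats the two bounds separated by "..". */
    static String format(int lower, int upper) {
        return lower + ".." + upper;
    }
}
```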

getMinInstNum

public int getMinInstNum()
Gets the lower boundary for instances per cluster.

Returns:
the lower boundary for instances per cluster

setMinInstNum

public void setMinInstNum(int newMinInstNum)
Sets the lower boundary for instances per cluster.

Parameters:
newMinInstNum - new lower boundary for instances per cluster

getMaxInstNum

public int getMaxInstNum()
Gets the upper boundary for instances per cluster.

Returns:
the upper boundary for instances per cluster

setMaxInstNum

public void setMaxInstNum(int newMaxInstNum)
Sets the upper boundary for instances per cluster.

Parameters:
newMaxInstNum - new upper boundary for instances per cluster

setRadiuses

public void setRadiuses(java.lang.String fromTo)
Sets the upper and lower boundary for the radius of the clusters.


getRadiuses

public java.lang.String getRadiuses()
Gets the upper and lower boundary for the radius of the clusters.

Returns:
the string containing the upper and lower boundary for the radius of the clusters, separated by ..

getMinRadius

public double getMinRadius()
Gets the lower boundary for the radiuses of the clusters.

Returns:
the lower boundary for the radiuses of the clusters

setMinRadius

public void setMinRadius(double newMinRadius)
Sets the lower boundary for the radiuses of the clusters.

Parameters:
newMinRadius - new lower boundary for the radiuses of the clusters

getMaxRadius

public double getMaxRadius()
Gets the upper boundary for the radiuses of the clusters.

Returns:
the upper boundary for the radiuses of the clusters

setMaxRadius

public void setMaxRadius(double newMaxRadius)
Sets the upper boundary for the radiuses of the clusters.

Parameters:
newMaxRadius - new upper boundary for the radiuses of the clusters

getGridFlag

public boolean getGridFlag()
Gets the grid flag (option G).

Returns:
true if grid flag is set

getSineFlag

public boolean getSineFlag()
Gets the sine flag (option I).

Returns:
true if sine flag is set

getPattern

public int getPattern()
Gets the pattern type.

Returns:
the current pattern type

setPattern

public void setPattern(int newPattern)
Sets the pattern type.

Parameters:
newPattern - new pattern type

getDistMult

public double getDistMult()
Gets the distance multiplier.

Returns:
the distance multiplier

setDistMult

public void setDistMult(double newDistMult)
Sets the distance multiplier.

Parameters:
newDistMult - new distance multiplier
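
One way a distance multiplier could act in the grid pattern is sketched below: centers are placed on a square grid whose spacing is the multiplier times the mean of the minimum and maximum radius. This follows the grid description in the cited BIRCH paper as far as it is recalled here. It is an assumption and may differ from what defineClustersGRID does. GridCentersSketch and its parameters are hypothetical names.

```java
/** Hypothetical sketch of grid-pattern center placement. */
final class GridCentersSketch {
    static double[][] centers(int numClusters, double distMult,
                              double minRadius, double maxRadius) {
        // Smallest square grid that holds all clusters.
        int side = (int) Math.ceil(Math.sqrt(numClusters));
        // Assumption: spacing = multiplier * mean radius.
        double spacing = distMult * (minRadius + maxRadius) / 2.0;
        double[][] result = new double[numClusters][2];
        for (int i = 0; i < numClusters; i++) {
            result[i][0] = (i % side) * spacing; // column position
            result[i][1] = (i / side) * spacing; // row position
        }
        return result;
    }
}
```

With a multiplier well above 2, neighbouring clusters of mean radius tend to stay separated. Smaller values let them overlap.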

getNumCycles

public int getNumCycles()
Gets the number of cycles.

Returns:
the number of cycles

setNumCycles

public void setNumCycles(int newNumCycles)
Sets the number of cycles.

Parameters:
newNumCycles - new number of cycles
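
To show what a number of cycles can mean for the sine pattern, the sketch below places K centers along a sine curve that completes the given number of cycles. The formulas (x = 2*pi*i, y = (K/cycles) * sin(2*pi*i / (K/cycles))) follow the cited BIRCH paper as recalled here and are an assumption. The actual placement in BIRCHCluster may differ. SineCentersSketch is a hypothetical name.

```java
/** Hypothetical sketch of sine-pattern center placement. */
final class SineCentersSketch {
    static double[][] centers(int numClusters, int numCycles) {
        if (numClusters <= 0 || numCycles <= 0) {
            throw new IllegalArgumentException("Counts must be positive.");
        }
        // Number of clusters that share one full sine cycle.
        double perCycle = (double) numClusters / numCycles;
        double[][] result = new double[numClusters][2];
        for (int i = 0; i < numClusters; i++) {
            result[i][0] = 2.0 * Math.PI * i;
            result[i][1] = perCycle * Math.sin(2.0 * Math.PI * i / perCycle);
        }
        return result;
    }
}
```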

getInputOrder

public int getInputOrder()
Gets the input order.

Returns:
the current input order

setInputOrder

public void setInputOrder(int newInputOrder)
Sets the input order.

Parameters:
newInputOrder - new input order

getOrderedFlag

public boolean getOrderedFlag()
Gets the ordered flag (option O).

Returns:
true if ordered flag is set
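
The difference between an ordered and a randomized input order can be sketched with a seeded shuffle. InputOrderSketch and its constant values are hypothetical. The real values of ORDERED and RANDOMIZED are listed under Constant Field Values and are not assumed here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of ordered versus randomized output. */
final class InputOrderSketch {
    static final int ORDERED = 0;    // illustrative value only
    static final int RANDOMIZED = 1; // illustrative value only

    /** Returns a copy of the cluster-by-cluster list in the requested order. */
    static <T> List<T> arrange(List<T> clusterOrdered, int inputOrder, long seed) {
        List<T> copy = new ArrayList<>(clusterOrdered);
        if (inputOrder == RANDOMIZED) {
            // Seeded shuffle, so the same seed gives the same order.
            Collections.shuffle(copy, new Random(seed));
        }
        return copy;
    }
}
```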

getNoiseRate

public double getNoiseRate()
Gets the percentage of noise set.

Returns:
the percentage of noise set

setNoiseRate

public void setNoiseRate(double newNoiseRate)
Sets the percentage of noise set.

Parameters:
newNoiseRate - new percentage of noise

getRandom

public java.util.Random getRandom()
Gets the random generator.

Returns:
the random generator

setRandom

public void setRandom(java.util.Random newRandom)
Sets the random generator.

Parameters:
newRandom - the new random generator.

getSeed

public int getSeed()
Gets the random number seed.

Returns:
the random number seed.

setSeed

public void setSeed(int newSeed)
Sets the random number seed.

Parameters:
newSeed - the new random number seed.
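
The practical effect of a fixed seed is reproducibility: two java.util.Random objects created with the same seed yield the same sequence. The sketch below shows this with a hypothetical helper, SeedSketch. How BIRCHCluster passes its seed to its random generator is not shown here.

```java
import java.util.Random;

/** Hypothetical sketch: the same seed yields the same random sequence. */
final class SeedSketch {
    static double[] draw(int seed, int count) {
        Random random = new Random(seed);
        double[] values = new double[count];
        for (int i = 0; i < count; i++) {
            values[i] = random.nextDouble();
        }
        return values;
    }
}
```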

getDatasetFormat

public Instances getDatasetFormat()
Gets the dataset format.

Returns:
the dataset format.

setDatasetFormat

public void setDatasetFormat(Instances newDatasetFormat)
Sets the dataset format.

Parameters:
newDatasetFormat - the new dataset format.

getSingleModeFlag

public boolean getSingleModeFlag()
Gets the single mode flag.

Specified by:
getSingleModeFlag in class ClusterGenerator
Returns:
true if method generateExample can be used.

listOptions

public java.util.Enumeration listOptions()
Returns an enumeration describing the available options.

Specified by:
listOptions in interface OptionHandler
Returns:
an enumeration of all the available options

setDefaultOptions

public void setDefaultOptions()
Sets all options to their default values.


setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses a list of options for this object.

For list of valid options see class description.

Specified by:
setOptions in interface OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported

getOptions

public java.lang.String[] getOptions()
Gets the current settings of the datagenerator BIRCHCluster.

Specified by:
getOptions in interface OptionHandler
Returns:
an array of strings suitable for passing to setOptions

defineDataFormat

public Instances defineDataFormat()
                           throws java.lang.Exception
Initializes the format for the dataset produced.

Specified by:
defineDataFormat in class ClusterGenerator
Returns:
the output data format
Throws:
java.lang.Exception - data format could not be defined

generateExample

public Instance generateExample()
                         throws java.lang.Exception
Generate an example of the dataset.

Specified by:
generateExample in class ClusterGenerator
Returns:
the instance generated
Throws:
java.lang.Exception - if format not defined or generating
examples one by one is not possible, because voting is chosen

generateExamples

public Instances generateExamples()
                           throws java.lang.Exception
Generate all examples of the dataset.

Specified by:
generateExamples in class ClusterGenerator
Returns:
the instances generated
Throws:
java.lang.Exception - if format not defined

generateExamples

public Instances generateExamples(java.util.Random random,
                                  Instances format)
                           throws java.lang.Exception
Generate all examples of the dataset.

Returns:
the instances generated
Throws:
java.lang.Exception - if format not defined

generateInstance

private Instance generateInstance(Instances format,
                                  java.util.Random randomG,
                                  double stdDev,
                                  double[] center,
                                  java.lang.String cName)
Generate an example of the dataset.

Returns:
the instance generated
Throws:
java.lang.Exception - if format not defined or generating
examples one by one is not possible, because voting is chosen

defineClusters

private FastVector defineClusters(java.util.Random random)
                           throws java.lang.Exception
Defines the clusters

Parameters:
random - random number generator
Throws:
java.lang.Exception

defineClustersGRID

private FastVector defineClustersGRID(java.util.Random random)
                               throws java.lang.Exception
Defines the clusters if pattern is GRID

Parameters:
random - random number generator
Throws:
java.lang.Exception

defineClustersRANDOM

private FastVector defineClustersRANDOM(java.util.Random random)
                                 throws java.lang.Exception
Defines the clusters if pattern is RANDOM

Parameters:
random - random number generator
Throws:
java.lang.Exception

generateFinished

public java.lang.String generateFinished()
                                  throws java.lang.Exception
Compiles documentation about the data generation after the generation process

Specified by:
generateFinished in class ClusterGenerator
Returns:
string with additional information about generated dataset
Throws:
java.lang.Exception - no input structure has been defined

generateStart

public java.lang.String generateStart()
Compiles documentation about the data generation before the generation process

Specified by:
generateStart in class ClusterGenerator
Returns:
string with additional information

main

public static void main(java.lang.String[] argv)
Main method for testing this class.

Parameters:
argv - should contain arguments for the data producer: