weka.filters.unsupervised.attribute
Class Discretize

java.lang.Object
  extended by weka.filters.Filter
      extended by weka.filters.unsupervised.attribute.PotentialClassIgnorer
          extended by weka.filters.unsupervised.attribute.Discretize
All Implemented Interfaces:
OptionHandler, java.io.Serializable, UnsupervisedFilter, WeightedInstancesHandler
Direct Known Subclasses:
PKIDiscretize

public class Discretize
extends PotentialClassIgnorer
implements UnsupervisedFilter, OptionHandler, WeightedInstancesHandler

An instance filter that discretizes a range of numeric attributes in the dataset into nominal attributes. Discretization is by simple binning. Skips the class attribute if set.

Valid filter-specific options are:

-B num
Specifies the (maximum) number of bins to divide numeric attributes into. Default = 10.

-M num
Specifies the desired weight of instances per bin for equal-frequency binning. If this is set to a positive number then the -B option will be ignored. Default = -1.

-F
Use equal-frequency instead of equal-width discretization if class-based discretization is turned off.

-O
Optimize the number of bins using a leave-one-out estimate of the entropy (for equal-width binning). If this is set then the -B option will be ignored.

-R col1,col2-col4,...
Specifies the list of columns to discretize. "first" and "last" are valid indexes. (default: first-last)

-V
Invert matching sense.

-D
Make binary nominal attributes.

Version:
$Revision: 1.6 $
Author:
Len Trigg (trigg@cs.waikato.ac.nz), Eibe Frank (eibe@cs.waikato.ac.nz)
See Also:
Serialized Form

Field Summary
protected  double[][] m_CutPoints
          Store the current cutpoints
protected  java.lang.String m_DefaultCols
          The default columns to discretize
protected  double m_DesiredWeightOfInstancesPerInterval
          The desired weight of instances per bin
protected  Range m_DiscretizeCols
          Stores which columns to Discretize
protected  boolean m_FindNumBins
          Find the number of bins using cross-validated entropy.
protected  boolean m_MakeBinary
          Output binary attributes for discretized attributes.
protected  int m_NumBins
          The number of bins to divide the attribute into
protected  boolean m_UseEqualFrequency
          Use equal-frequency binning if unsupervised discretization is turned on
 
Fields inherited from class weka.filters.unsupervised.attribute.PotentialClassIgnorer
m_ClassIndex, m_IgnoreClass
 
Fields inherited from class weka.filters.Filter
m_NewBatch
 
Constructor Summary
Discretize()
          Constructor - initialises the filter
Discretize(java.lang.String cols)
          Constructor - initialises the filter with the given list of columns to discretize
 
Method Summary
 java.lang.String attributeIndicesTipText()
          Returns the tip text for this property
 boolean batchFinished()
          Signifies that this batch of input to the filter is finished.
 java.lang.String binsTipText()
          Returns the tip text for this property
protected  void calculateCutPoints()
          Generate the cutpoints for each attribute
protected  void calculateCutPointsByEqualFrequencyBinning(int index)
          Set cutpoints for a single attribute using equal-frequency binning.
protected  void calculateCutPointsByEqualWidthBinning(int index)
          Set cutpoints for a single attribute using equal-width binning.
protected  void convertInstance(Instance instance)
          Convert a single instance over.
 java.lang.String desiredWeightOfInstancesPerIntervalTipText()
          Returns the tip text for this property
protected  void findNumBins(int index)
          Optimizes the number of bins using leave-one-out cross-validation.
 java.lang.String findNumBinsTipText()
          Returns the tip text for this property
 java.lang.String getAttributeIndices()
          Gets the current range selection
 int getBins()
          Gets the number of bins numeric attributes will be divided into
 double[] getCutPoints(int attributeIndex)
          Gets the cut points for an attribute
 double getDesiredWeightOfInstancesPerInterval()
          Get the DesiredWeightOfInstancesPerInterval value.
 boolean getFindNumBins()
          Get the value of FindNumBins.
 boolean getInvertSelection()
          Gets whether the column selection is inverted
 boolean getMakeBinary()
          Gets whether binary attributes should be made for discretized ones.
 java.lang.String[] getOptions()
          Gets the current settings of the filter.
 boolean getUseEqualFrequency()
          Get the value of UseEqualFrequency.
 java.lang.String globalInfo()
          Returns a string describing this filter
 boolean input(Instance instance)
          Input an instance for filtering.
 java.lang.String invertSelectionTipText()
          Returns the tip text for this property
 java.util.Enumeration listOptions()
          Gets an enumeration describing the available options.
static void main(java.lang.String[] argv)
          Main method for testing this class.
 java.lang.String makeBinaryTipText()
          Returns the tip text for this property
 void setAttributeIndices(java.lang.String rangeList)
          Sets which attributes are to be Discretized (only numeric attributes among the selection will be Discretized).
 void setAttributeIndicesArray(int[] attributes)
          Sets which attributes are to be Discretized (only numeric attributes among the selection will be Discretized).
 void setBins(int numBins)
          Sets the number of bins to divide each selected numeric attribute into
 void setDesiredWeightOfInstancesPerInterval(double newDesiredNumber)
          Set the DesiredWeightOfInstancesPerInterval value.
 void setFindNumBins(boolean newFindNumBins)
          Set the value of FindNumBins.
 boolean setInputFormat(Instances instanceInfo)
          Sets the format of the input instances.
 void setInvertSelection(boolean invert)
          Sets whether the column selection is inverted.
 void setMakeBinary(boolean makeBinary)
          Sets whether binary attributes should be made for discretized ones.
 void setOptions(java.lang.String[] options)
          Parses the options for this object.
protected  void setOutputFormat()
          Set the output format.
 void setUseEqualFrequency(boolean newUseEqualFrequency)
          Set the value of UseEqualFrequency.
 java.lang.String useEqualFrequencyTipText()
          Returns the tip text for this property
 
Methods inherited from class weka.filters.unsupervised.attribute.PotentialClassIgnorer
getOutputFormat, setIgnoreClass
 
Methods inherited from class weka.filters.Filter
batchFilterFile, bufferInput, copyStringValues, copyStringValues, filterFile, flushInput, getInputFormat, getInputStringIndex, getOutputStringIndex, getStringIndices, inputFormat, inputFormatPeek, isOutputFormatDefined, numPendingOutput, output, outputFormat, outputFormatPeek, outputPeek, push, resetQueue, setOutputFormat, useFilter
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

m_DiscretizeCols

protected Range m_DiscretizeCols
Stores which columns to Discretize


m_NumBins

protected int m_NumBins
The number of bins to divide the attribute into


m_DesiredWeightOfInstancesPerInterval

protected double m_DesiredWeightOfInstancesPerInterval
The desired weight of instances per bin


m_CutPoints

protected double[][] m_CutPoints
Store the current cutpoints


m_MakeBinary

protected boolean m_MakeBinary
Output binary attributes for discretized attributes.


m_FindNumBins

protected boolean m_FindNumBins
Find the number of bins using cross-validated entropy.


m_UseEqualFrequency

protected boolean m_UseEqualFrequency
Use equal-frequency binning if unsupervised discretization is turned on


m_DefaultCols

protected java.lang.String m_DefaultCols
The default columns to discretize

Constructor Detail

Discretize

public Discretize()
Constructor - initialises the filter


Discretize

public Discretize(java.lang.String cols)
Constructor - initialises the filter with the given list of columns to discretize

Method Detail

listOptions

public java.util.Enumeration listOptions()
Gets an enumeration describing the available options.

Specified by:
listOptions in interface OptionHandler
Returns:
an enumeration of all the available options.

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses the options for this object. Valid options are:

-B num
Specifies the (maximum) number of bins to divide numeric attributes into. Default = 10.

-M num
Specifies the desired weight of instances per bin for equal-frequency binning. If this is set to a positive number then the -B option will be ignored. Default = -1.

-F
Use equal-frequency instead of equal-width discretization if class-based discretization is turned off.

-O
Optimize the number of bins using a leave-one-out estimate of the entropy (for equal-width binning). If this is set then the -B option will be ignored.

-R col1,col2-col4,...
Specifies the list of columns to discretize. "first" and "last" are valid indexes. (default: first-last)

-V
Invert matching sense.

-D
Make binary nominal attributes.

Specified by:
setOptions in interface OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported
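As an illustration of how the option array above is laid out, the following minimal sketch parses the documented flags into a plain settings object. The class `DiscretizeOptionsSketch` and its members are hypothetical names introduced here; this is not Weka's own parsing code, only the documented flags and defaults restated.

```java
/** Hypothetical holder for the documented options; not a Weka class. */
class DiscretizeOptionsSketch {
    int bins = 10;                   // -B, documented default
    double weightPerBin = -1;        // -M, documented default
    boolean equalFrequency = false;  // -F
    boolean findNumBins = false;     // -O
    String columns = "first-last";   // -R, documented default
    boolean invert = false;          // -V
    boolean makeBinary = false;      // -D

    /** Parses an option array such as {"-B", "5", "-F"}. */
    static DiscretizeOptionsSketch parse(String[] options) {
        DiscretizeOptionsSketch s = new DiscretizeOptionsSketch();
        for (int i = 0; i < options.length; i++) {
            String opt = options[i];
            switch (opt) {
                case "-B": s.bins = Integer.parseInt(value(options, ++i, opt)); break;
                case "-M": s.weightPerBin = Double.parseDouble(value(options, ++i, opt)); break;
                case "-R": s.columns = value(options, ++i, opt); break;
                case "-F": s.equalFrequency = true; break;
                case "-O": s.findNumBins = true; break;
                case "-V": s.invert = true; break;
                case "-D": s.makeBinary = true; break;
                default: throw new IllegalArgumentException("Unsupported option: " + opt);
            }
        }
        return s;
    }

    /** Returns the argument that follows a flag, or fails if it is missing. */
    private static String value(String[] options, int i, String flag) {
        if (i >= options.length) {
            throw new IllegalArgumentException("Missing value for " + flag);
        }
        return options[i];
    }

    /** True when -B has no effect, following the option descriptions above. */
    boolean binsIgnored() {
        return weightPerBin > 0 || findNumBins;
    }
}
```

For example, `parse(new String[] {"-B", "5", "-F", "-R", "first-3"})` yields five bins, equal-frequency binning and the column list `first-3`, with every other setting left at its default.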

getOptions

public java.lang.String[] getOptions()
Gets the current settings of the filter.

Specified by:
getOptions in interface OptionHandler
Returns:
an array of strings suitable for passing to setOptions

setInputFormat

public boolean setInputFormat(Instances instanceInfo)
                       throws java.lang.Exception
Sets the format of the input instances.

Overrides:
setInputFormat in class PotentialClassIgnorer
Parameters:
instanceInfo - an Instances object containing the input instance structure (any instances contained in the object are ignored - only the structure is required).
Returns:
true if the outputFormat may be collected immediately
Throws:
java.lang.Exception - if the input format can't be set successfully

input

public boolean input(Instance instance)
Input an instance for filtering. Ordinarily the instance is processed and made available for output immediately. Some filters require all instances be read before producing output.

Overrides:
input in class Filter
Parameters:
instance - the input instance
Returns:
true if the filtered instance may now be collected with output().
Throws:
java.lang.IllegalStateException - if no input format has been defined.

batchFinished

public boolean batchFinished()
Signifies that this batch of input to the filter is finished. If the filter requires all instances prior to filtering, output() may now be called to retrieve the filtered instances.

Overrides:
batchFinished in class Filter
Returns:
true if there are instances pending output
Throws:
java.lang.IllegalStateException - if no input structure has been defined
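The interplay of input(), batchFinished() and output() described above can be sketched with a simplified, single-attribute stand-in. `BatchProtocolSketch` is a hypothetical class that works on plain doubles rather than Instance objects, and it assumes equal-width binning; it only illustrates why the first batch has to be buffered before any output is available.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical single-attribute filter showing the buffer-then-convert protocol. */
class BatchProtocolSketch {
    private final int numBins;
    private final List<Double> buffer = new ArrayList<>();
    private final Queue<Integer> pending = new ArrayDeque<>();
    private double[] cutPoints;  // null until the first batch is finished

    BatchProtocolSketch(int numBins) {
        if (numBins < 1) throw new IllegalArgumentException("numBins must be >= 1");
        this.numBins = numBins;
    }

    /** Returns true if a converted value can be collected immediately. */
    boolean input(double value) {
        if (cutPoints == null) {   // first batch: cut points unknown, so buffer
            buffer.add(value);
            return false;
        }
        pending.add(binOf(value)); // later batches: convert straight away
        return true;
    }

    /** Ends the batch; returns true if converted values are pending. */
    boolean batchFinished() {
        if (cutPoints == null) {
            cutPoints = equalWidthCutPoints();
            for (double v : buffer) pending.add(binOf(v));
            buffer.clear();
        }
        return !pending.isEmpty();
    }

    /** Returns the next converted value, or null if none is pending. */
    Integer output() {
        return pending.poll();
    }

    private double[] equalWidthCutPoints() {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : buffer) { min = Math.min(min, v); max = Math.max(max, v); }
        if (buffer.isEmpty() || max <= min) return new double[0];
        double width = (max - min) / numBins;
        double[] cuts = new double[numBins - 1];
        for (int i = 0; i < cuts.length; i++) cuts[i] = min + width * (i + 1);
        return cuts;
    }

    private int binOf(double v) {
        int bin = 0;
        while (bin < cutPoints.length && v > cutPoints[bin]) bin++;
        return bin;
    }
}
```

In this sketch input() returns false for every value of the first batch, batchFinished() computes the cut points and releases the buffered values, and later calls to input() return true because the cut points are already known.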

globalInfo

public java.lang.String globalInfo()
Returns a string describing this filter

Returns:
a description of the filter suitable for displaying in the explorer/experimenter gui

findNumBinsTipText

public java.lang.String findNumBinsTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getFindNumBins

public boolean getFindNumBins()
Get the value of FindNumBins.

Returns:
Value of FindNumBins.

setFindNumBins

public void setFindNumBins(boolean newFindNumBins)
Set the value of FindNumBins.

Parameters:
newFindNumBins - Value to assign to FindNumBins.

makeBinaryTipText

public java.lang.String makeBinaryTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getMakeBinary

public boolean getMakeBinary()
Gets whether binary attributes should be made for discretized ones.

Returns:
true if attributes will be binarized

setMakeBinary

public void setMakeBinary(boolean makeBinary)
Sets whether binary attributes should be made for discretized ones.

Parameters:
makeBinary - if binary attributes are to be made
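To make the effect of binary output concrete, here is a small sketch of one plausible encoding: a numeric value with k cut points becomes k two-valued attributes, one per cut point, each recording whether the value lies above that cut point. `MakeBinarySketch` and its method are hypothetical, and the convention that a value equal to a cut point falls on the lower side is an assumption.

```java
/** Hypothetical helper; shows one binary indicator per cut point. */
class MakeBinarySketch {
    /**
     * Returns one 0/1 indicator per cut point:
     * 1 if the value lies above that cut point, 0 otherwise.
     * Assumes cutPoints is sorted in ascending order.
     */
    static int[] binaryIndicators(double value, double[] cutPoints) {
        int[] out = new int[cutPoints.length];
        for (int j = 0; j < cutPoints.length; j++) {
            out[j] = value > cutPoints[j] ? 1 : 0;
        }
        return out;
    }
}
```

With cut points {2, 5, 8}, the value 4 is encoded as {1, 0, 0}. The sum of the indicators equals the index of the single multi-valued bin the value would otherwise fall into.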

desiredWeightOfInstancesPerIntervalTipText

public java.lang.String desiredWeightOfInstancesPerIntervalTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getDesiredWeightOfInstancesPerInterval

public double getDesiredWeightOfInstancesPerInterval()
Get the DesiredWeightOfInstancesPerInterval value.

Returns:
the DesiredWeightOfInstancesPerInterval value.

setDesiredWeightOfInstancesPerInterval

public void setDesiredWeightOfInstancesPerInterval(double newDesiredNumber)
Set the DesiredWeightOfInstancesPerInterval value.

Parameters:
newDesiredNumber - the new desired weight of instances per interval.
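The relationship between this setting and the number of bins can be sketched as follows. `BinWeightSketch` is a hypothetical helper, and the fallback of dividing the total weight evenly among the bins is an assumption about how an equal-frequency target would be derived when no desired weight is given.

```java
/** Hypothetical helper; derives the per-bin weight target for equal-frequency binning. */
class BinWeightSketch {
    /**
     * A positive desired weight takes precedence over the number of bins;
     * otherwise the total weight is split evenly among numBins bins.
     */
    static double targetWeightPerBin(double sumOfWeights, int numBins, double desiredWeight) {
        if (desiredWeight > 0) {
            return desiredWeight;
        }
        return sumOfWeights / numBins;
    }
}
```

With a total weight of 100 and 10 bins, the target is 10 per bin under the default of -1, and 25 per bin when the desired weight is set to 25.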

useEqualFrequencyTipText

public java.lang.String useEqualFrequencyTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getUseEqualFrequency

public boolean getUseEqualFrequency()
Get the value of UseEqualFrequency.

Returns:
Value of UseEqualFrequency.

setUseEqualFrequency

public void setUseEqualFrequency(boolean newUseEqualFrequency)
Set the value of UseEqualFrequency.

Parameters:
newUseEqualFrequency - Value to assign to UseEqualFrequency.

binsTipText

public java.lang.String binsTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getBins

public int getBins()
Gets the number of bins numeric attributes will be divided into

Returns:
the number of bins.

setBins

public void setBins(int numBins)
Sets the number of bins to divide each selected numeric attribute into

Parameters:
numBins - the number of bins

invertSelectionTipText

public java.lang.String invertSelectionTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getInvertSelection

public boolean getInvertSelection()
Gets whether the column selection is inverted.

Returns:
true if the selection is inverted, so that the columns not listed are discretized

setInvertSelection

public void setInvertSelection(boolean invert)
Sets whether the column selection is inverted. If true, the numeric columns outside the selection are discretized and the selected columns are left unchanged. If false, the selected numeric columns are discretized.

Parameters:
invert - the new invert setting

attributeIndicesTipText

public java.lang.String attributeIndicesTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getAttributeIndices

public java.lang.String getAttributeIndices()
Gets the current range selection

Returns:
a string containing a comma separated list of ranges

setAttributeIndices

public void setAttributeIndices(java.lang.String rangeList)
Sets which attributes are to be Discretized (only numeric attributes among the selection will be Discretized).

Parameters:
rangeList - a string representing the list of attributes. Since the string will typically come from a user, attributes are indexed from 1.
eg: first-3,5,6-last
Throws:
java.lang.IllegalArgumentException - if an invalid range list is supplied
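The range-list syntax above (1-based indexes, with "first" and "last" as aliases) can be illustrated with a small parser. `RangeListSketch` is a hypothetical class written for this example, not the Range class the filter uses; it also shows how an inverted selection would flip the result.

```java
/** Hypothetical parser for 1-based range lists such as "first-3,5,6-last". */
class RangeListSketch {
    /** Returns a 0-based selection mask over numAttributes attributes. */
    static boolean[] parse(String rangeList, int numAttributes, boolean invert) {
        boolean[] selected = new boolean[numAttributes];
        for (String part : rangeList.split(",")) {
            part = part.trim();
            if (part.isEmpty()) {
                throw new IllegalArgumentException("Empty range in: " + rangeList);
            }
            int dash = part.indexOf('-');
            int lo, hi;
            if (dash < 0) {
                lo = hi = index(part, numAttributes);
            } else {
                lo = index(part.substring(0, dash), numAttributes);
                hi = index(part.substring(dash + 1), numAttributes);
            }
            if (lo > hi) {
                throw new IllegalArgumentException("Invalid range: " + part);
            }
            for (int i = lo; i <= hi; i++) {
                selected[i - 1] = true;  // convert the 1-based index to 0-based
            }
        }
        if (invert) {
            for (int i = 0; i < selected.length; i++) selected[i] = !selected[i];
        }
        return selected;
    }

    /** Resolves "first", "last" or a number to a checked 1-based index. */
    private static int index(String token, int numAttributes) {
        token = token.trim();
        int value;
        if (token.equals("first")) {
            value = 1;
        } else if (token.equals("last")) {
            value = numAttributes;
        } else {
            try {
                value = Integer.parseInt(token);
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException("Invalid index: " + token);
            }
        }
        if (value < 1 || value > numAttributes) {
            throw new IllegalArgumentException("Index out of bounds: " + token);
        }
        return value;
    }
}
```

For a dataset with seven attributes, "first-3,5,6-last" selects every attribute except the fourth; with the selection inverted, only the fourth is selected.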

setAttributeIndicesArray

public void setAttributeIndicesArray(int[] attributes)
Sets which attributes are to be Discretized (only numeric attributes among the selection will be Discretized).

Parameters:
attributes - an array containing indexes of attributes to Discretize. Since the array will typically come from a program, attributes are indexed from 0.
Throws:
java.lang.IllegalArgumentException - if an invalid set of ranges is supplied
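Because this method takes 0-based indexes while setAttributeIndices takes a 1-based string, the two forms differ by one. The conversion can be sketched as below; `IndicesToRangeListSketch` is a hypothetical helper and is not claimed to be how the filter performs the conversion internally.

```java
/** Hypothetical helper; converts 0-based indexes to a 1-based range list. */
class IndicesToRangeListSketch {
    static String toRangeList(int[] zeroBased) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < zeroBased.length; i++) {
            if (zeroBased[i] < 0) {
                throw new IllegalArgumentException("Negative index: " + zeroBased[i]);
            }
            if (i > 0) sb.append(',');
            sb.append(zeroBased[i] + 1);  // shift to the 1-based user-facing form
        }
        return sb.toString();
    }
}
```

Under this sketch, the array {0, 2, 3} corresponds to the range list "1,3,4".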

getCutPoints

public double[] getCutPoints(int attributeIndex)
Gets the cut points for an attribute

Parameters:
attributeIndex - the index of the attribute to get the cut points of
Returns:
an array containing the cutpoints (or null if the attribute requested has been discretized into only one interval.)
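A returned cut-point array can be used to look up the bin of a new value. The following sketch assumes the array is sorted in ascending order and that a value equal to a cut point belongs to the lower bin; `BinLookupSketch` is a hypothetical helper, and the null case mirrors the single-interval case described above.

```java
/** Hypothetical helper; maps a value to a bin index given sorted cut points. */
class BinLookupSketch {
    /** Returns the 0-based bin index; a null array means a single interval. */
    static int binIndex(double value, double[] cutPoints) {
        if (cutPoints == null) {
            return 0;
        }
        int bin = 0;
        // Advance past every cut point that the value strictly exceeds.
        while (bin < cutPoints.length && value > cutPoints[bin]) {
            bin++;
        }
        return bin;
    }
}
```

With cut points {2, 5} there are three bins: values up to 2 fall in bin 0, values above 2 and up to 5 in bin 1, and larger values in bin 2.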

calculateCutPoints

protected void calculateCutPoints()
Generate the cutpoints for each attribute


calculateCutPointsByEqualWidthBinning

protected void calculateCutPointsByEqualWidthBinning(int index)
Set cutpoints for a single attribute using equal-width binning.

Parameters:
index - the index of the attribute to set cutpoints for
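Equal-width binning splits the observed range of an attribute into intervals of identical width. A minimal sketch, assuming a plain array of values with no missing entries; `EqualWidthSketch` is a hypothetical class and not the method's actual body.

```java
/** Hypothetical helper; computes equal-width cut points for one attribute. */
class EqualWidthSketch {
    /** Returns numBins - 1 cut points, or null if there is only one interval. */
    static double[] cutPoints(double[] values, int numBins) {
        if (numBins < 1) {
            throw new IllegalArgumentException("numBins must be >= 1");
        }
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        // No spread in the data (or a single bin) leaves one interval.
        if (!(max > min) || numBins == 1) {
            return null;
        }
        double width = (max - min) / numBins;
        double[] cuts = new double[numBins - 1];
        for (int i = 1; i < numBins; i++) {
            cuts[i - 1] = min + width * i;
        }
        return cuts;
    }
}
```

For values ranging from 0 to 10 and five bins, the bin width is 2 and the cut points are 2, 4, 6 and 8.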

calculateCutPointsByEqualFrequencyBinning

protected void calculateCutPointsByEqualFrequencyBinning(int index)
Set cutpoints for a single attribute using equal-frequency binning.

Parameters:
index - the index of the attribute to set cutpoints for
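Equal-frequency binning places cut points so that each bin holds roughly the same number of instances. The sketch below is a simplified greedy version that assumes unit instance weights and no missing values; the filter itself works with instance weights, so `EqualFrequencySketch` is a hypothetical illustration of the idea rather than the actual algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper; greedy equal-frequency cut points with unit weights. */
class EqualFrequencySketch {
    /** Returns at most numBins - 1 cut points, or null if there is only one interval. */
    static double[] cutPoints(double[] values, int numBins) {
        if (numBins < 1) {
            throw new IllegalArgumentException("numBins must be >= 1");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double target = (double) sorted.length / numBins;  // instances per bin
        List<Double> cuts = new ArrayList<>();
        int inBin = 0;
        for (int i = 0; i < sorted.length - 1 && cuts.size() < numBins - 1; i++) {
            inBin++;
            // Cut only between distinct values so equal values share a bin.
            if (inBin >= target && sorted[i] < sorted[i + 1]) {
                cuts.add((sorted[i] + sorted[i + 1]) / 2);
                inBin = 0;
            }
        }
        if (cuts.isEmpty()) {
            return null;
        }
        double[] out = new double[cuts.size()];
        for (int i = 0; i < out.length; i++) out[i] = cuts.get(i);
        return out;
    }
}
```

For the values 1 to 6 and three bins, the sketch yields the cut points 2.5 and 4.5, giving two instances per bin. When many values are tied, fewer cut points than requested may result.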

findNumBins

protected void findNumBins(int index)
Optimizes the number of bins using leave-one-out cross-validation.

Parameters:
index - the attribute index
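One way to picture a leave-one-out entropy estimate for equal-width bins: for each candidate number of bins, score the histogram by the negative log-density each instance would receive if it were removed from its own bin, and keep the candidate with the lowest score. The sketch below follows that idea under stated assumptions (unit weights, the constant normalising term dropped, bins holding a single instance treated as disqualifying). `FindNumBinsSketch` is hypothetical and may differ in detail from the method documented here.

```java
/** Hypothetical helper; picks a bin count by a leave-one-out histogram score. */
class FindNumBinsSketch {
    /** Returns the bin count in 1..maxBins with the lowest leave-one-out score. */
    static int bestNumBins(double[] values, int maxBins) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        if (!(max > min)) {
            return 1;  // no spread: a single bin is the only sensible choice
        }
        int best = 1;
        double bestScore = Double.POSITIVE_INFINITY;
        for (int bins = 1; bins <= maxBins; bins++) {
            double score = looScore(values, min, max, bins);
            if (score < bestScore) {
                bestScore = score;
                best = bins;
            }
        }
        return best;
    }

    /** Negative leave-one-out log-likelihood, up to an additive constant. */
    static double looScore(double[] values, double min, double max, int bins) {
        double width = (max - min) / bins;
        int[] counts = new int[bins];
        for (double v : values) {
            int b = (int) ((v - min) / width);
            counts[Math.min(b, bins - 1)]++;  // the maximum goes in the last bin
        }
        double score = 0;
        for (int c : counts) {
            if (c == 0) continue;  // empty bins contribute nothing
            if (c == 1) return Double.POSITIVE_INFINITY;  // held-out density is 0
            score -= c * Math.log((c - 1) / width);
        }
        return score;
    }
}
```

Evenly spread data such as 0, 1, ..., 7 is best described by a single bin under this score, while data concentrated on a few distinct values favours narrower bins.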

setOutputFormat

protected void setOutputFormat()
Set the output format. Takes the currently defined cutpoints and m_InputFormat and calls setOutputFormat(Instances) appropriately.


convertInstance

protected void convertInstance(Instance instance)
Convert a single instance over. The converted instance is added to the end of the output queue.

Parameters:
instance - the instance to convert

main

public static void main(java.lang.String[] argv)
Main method for testing this class.

Parameters:
argv - should contain arguments to the filter: use -h for help