weka.classifiers.trees.lmt
Class LogisticBase

java.lang.Object
  extended by weka.classifiers.Classifier
      extended by weka.classifiers.trees.lmt.LogisticBase
All Implemented Interfaces:
java.lang.Cloneable, OptionHandler, java.io.Serializable, WeightedInstancesHandler
Direct Known Subclasses:
LMTNode

public class LogisticBase
extends Classifier
implements WeightedInstancesHandler

Base/helper class for building logistic regression models with the LogitBoost algorithm. Used for building logistic model trees (weka.classifiers.trees.lmt.LMT) and standalone logistic regression (weka.classifiers.functions.SimpleLogistic).

Version:
$Revision: 1.2 $
Author:
Niels Landwehr
See Also:
Serialized Form

Field Summary
protected  boolean m_errorOnProbabilities
          Use error on probabilities for stopping criterion of LogitBoost?
protected  int m_fixedNumIterations
          Use fixed number of iterations for LogitBoost?
protected  int m_heuristicStop
          Use heuristic to stop performing LogitBoost iterations earlier?
protected  int m_maxIterations
          The maximum number of LogitBoost iterations
protected  int m_numClasses
          The number of different classes
protected  Instances m_numericData
          Numeric version of the training data.
protected  Instances m_numericDataHeader
          Header-only version of the numeric version of the training data
protected static int m_numFoldsBoosting
          Number of folds for cross-validating number of LogitBoost iterations
protected  int m_numRegressions
          The number of LogitBoost iterations performed.
protected  SimpleLinearRegression[][] m_regressions
          Array holding the simple regression functions fit by LogitBoost
protected  Instances m_train
          Training data
protected  boolean m_useCrossValidation
          Use cross-validation to determine the best number of LogitBoost iterations?
protected static double Z_MAX
          Threshold on the Z-value for LogitBoost
 
Fields inherited from class weka.classifiers.Classifier
m_Debug
 
Constructor Summary
LogisticBase()
          Constructor that creates LogisticBase object with standard options.
LogisticBase(int numBoostingIterations, boolean useCrossValidation, boolean errorOnProbabilities)
          Constructor to create LogisticBase object.
 
Method Summary
 void buildClassifier(Instances data)
          Builds the logistic regression model using LogitBoost.
 void cleanup()
          Cleanup in order to save memory.
 double[] distributionForInstance(Instance instance)
          Returns class probabilities for an instance.
protected  int getBestIteration(double[] errors, int maxIteration)
          Helper function to find the minimum in an array of error values.
protected  double[][] getCoefficients()
          Returns an array holding the coefficients of the logistic model.
protected  double getErrorRate(Instances data)
          Returns the misclassification error of the current model on a set of instances.
protected  double[] getFs(Instance instance)
          Computes the F-values for a single instance.
protected  double[][] getFs(Instances data)
          Computes the F-values for a set of instances.
 int getMaxIterations()
          Returns the maxIterations parameter.
protected  double getMeanAbsoluteError(Instances data)
          Returns the error of the probability estimates for the current model on a set of instances.
protected  Instances getNumericData(Instances data)
          Converts training data to numeric version.
 int getNumRegressions()
          The number of LogitBoost iterations performed (= the number of simple regression functions fit).
protected  double[][] getProbs(double[][] dataFs)
          Computes the p-values (probabilities for the different classes) from the F-values for a set of instances.
 int[][] getUsedAttributes()
          Returns an array of the indices of the attributes used in the logistic model.
protected  double[][] getWs(double[][] probs, double[][] dataYs)
          Computes the LogitBoost weights from an array of y/p values (actual/estimated class probabilities).
protected  double[][] getYs(Instances data)
          Computes the Y-values (actual class probabilities) for a set of instances.
protected  double getZ(double actual, double p)
          Computes the LogitBoost response variable from y/p values (actual/estimated class probabilities).
protected  double[][] getZs(double[][] probs, double[][] dataYs)
          Computes the LogitBoost response for an array of y/p values (actual/estimated class probabilities).
protected  SimpleLinearRegression[][] initRegressions()
          Helper function to initialize m_regressions.
protected  double logLikelihood(double[][] dataYs, double[][] probs)
          Returns the log-likelihood of the Y-values (actual class probabilities) given the p-values (current probability estimates).
 double percentAttributesUsed()
          Returns the fraction of all attributes in the data that are used in the logistic model (in percent).
protected  void performBoosting()
          Runs LogitBoost using the stopping criterion on the training set.
protected  int performBoosting(Instances train, Instances test, double[] error, int maxIterations)
          Runs LogitBoost on a training set and monitors the error on a test set.
protected  void performBoosting(int numIterations)
          Runs LogitBoost with a fixed number of iterations.
protected  void performBoostingCV()
          Runs LogitBoost, determining the best number of iterations by cross-validation.
protected  boolean performIteration(int iteration, double[][] trainYs, double[][] trainFs, double[][] probs, Instances trainNumeric)
          Performs a single iteration of LogitBoost, and updates the model accordingly.
protected  double[] probs(double[] Fs)
          Computes the p-values (probabilities for the classes) from the F-values of the logistic model.
protected  SimpleLinearRegression[][] selectRegressions(SimpleLinearRegression[][] classifiers)
          Helper function for cutting back m_regressions to the set of classifiers (corresponding to the number of LogitBoost iterations) that gave the smallest error.
 void setHeuristicStop(int heuristicStop)
          Sets the option "heuristicStop".
 void setMaxIterations(int maxIterations)
          Sets the parameter "maxIterations".
 java.lang.String toString()
          Returns a description of the logistic model (i.e., attributes and coefficients).
 
Methods inherited from class weka.classifiers.Classifier
classifyInstance, debugTipText, forName, getDebug, getOptions, listOptions, makeCopies, setDebug, setOptions
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

m_numericDataHeader

protected Instances m_numericDataHeader
Header-only version of the numeric version of the training data


m_numericData

protected Instances m_numericData
Numeric version of the training data. Original class is replaced by a numeric pseudo-class.


m_train

protected Instances m_train
Training data


m_useCrossValidation

protected boolean m_useCrossValidation
Use cross-validation to determine the best number of LogitBoost iterations?


m_errorOnProbabilities

protected boolean m_errorOnProbabilities
Use error on probabilities for stopping criterion of LogitBoost?


m_fixedNumIterations

protected int m_fixedNumIterations
Use fixed number of iterations for LogitBoost? (if negative, cross-validate number of iterations)


m_heuristicStop

protected int m_heuristicStop
Use heuristic to stop performing LogitBoost iterations earlier? If enabled, LogitBoost is stopped if the current (local) minimum of the error on a test set as a function of the number of iterations has not changed for m_heuristicStop iterations.
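
The stopping rule described above can be sketched as follows. This is a minimal illustration, not the Weka implementation: the class HeuristicStopSketch and the method iterationsBeforeStop are hypothetical names, and treating a non-positive heuristicStop as "disabled" is an assumption.

```java
class HeuristicStopSketch {
    /**
     * Returns how many iterations are run before the heuristic fires, given the
     * test-set error observed after each iteration (hypothetical helper).
     */
    static int iterationsBeforeStop(double[] testErrors, int heuristicStop) {
        int bestIteration = 0;
        double bestError = Double.MAX_VALUE;
        for (int i = 0; i < testErrors.length; i++) {
            if (testErrors[i] < bestError) {
                bestError = testErrors[i];
                bestIteration = i;
            }
            // Assumption: heuristicStop <= 0 means the heuristic is disabled.
            if (heuristicStop > 0 && i - bestIteration >= heuristicStop) {
                return i + 1; // the minimum has not changed for heuristicStop iterations
            }
        }
        return testErrors.length;
    }

    public static void main(String[] args) {
        double[] errors = {0.5, 0.4, 0.3, 0.35, 0.36, 0.37, 0.2};
        System.out.println(iterationsBeforeStop(errors, 3));
    }
}
```

Note that an early stop can miss a later, lower minimum (the final 0.2 in the example data), which is the trade-off the heuristic accepts in exchange for fewer iterations.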


m_numRegressions

protected int m_numRegressions
The number of LogitBoost iterations performed.


m_maxIterations

protected int m_maxIterations
The maximum number of LogitBoost iterations


m_numClasses

protected int m_numClasses
The number of different classes


m_regressions

protected SimpleLinearRegression[][] m_regressions
Array holding the simple regression functions fit by LogitBoost


m_numFoldsBoosting

protected static int m_numFoldsBoosting
Number of folds for cross-validating number of LogitBoost iterations


Z_MAX

protected static final double Z_MAX
Threshold on the Z-value for LogitBoost

See Also:
Constant Field Values
Constructor Detail

LogisticBase

public LogisticBase()
Constructor that creates LogisticBase object with standard options.


LogisticBase

public LogisticBase(int numBoostingIterations,
                    boolean useCrossValidation,
                    boolean errorOnProbabilities)
Constructor to create LogisticBase object.

Parameters:
numBoostingIterations - fixed number of iterations for LogitBoost (if negative, use cross-validation or stopping criterion on the training data).
useCrossValidation - cross-validate number of LogitBoost iterations (if false, use stopping criterion on the training data).
errorOnProbabilities - if true, use error on probabilities instead of misclassification for stopping criterion of LogitBoost
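
The parameter descriptions above imply a three-way choice of how the number of LogitBoost iterations is determined. A minimal sketch of that decision follows; the enum BoostingMode and the method chooseMode are hypothetical and not part of the Weka API, and treating zero as a fixed iteration count is an assumption (the text only specifies the negative case).

```java
class BoostingModeSketch {
    // Hypothetical names for the three strategies described in the parameter list.
    enum BoostingMode { FIXED_ITERATIONS, CROSS_VALIDATION, TRAINING_SET_STOPPING }

    static BoostingMode chooseMode(int numBoostingIterations, boolean useCrossValidation) {
        if (numBoostingIterations >= 0) { // assumption: zero counts as a fixed number
            return BoostingMode.FIXED_ITERATIONS;
        }
        return useCrossValidation
                ? BoostingMode.CROSS_VALIDATION
                : BoostingMode.TRAINING_SET_STOPPING;
    }

    public static void main(String[] args) {
        System.out.println(chooseMode(10, true));
        System.out.println(chooseMode(-1, true));
        System.out.println(chooseMode(-1, false));
    }
}
```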
Method Detail

buildClassifier

public void buildClassifier(Instances data)
                     throws java.lang.Exception
Builds the logistic regression model using LogitBoost.

Specified by:
buildClassifier in class Classifier
Parameters:
data - the training data
Throws:
java.lang.Exception - if the classifier has not been generated successfully

performBoostingCV

protected void performBoostingCV()
                          throws java.lang.Exception
Runs LogitBoost, determining the best number of iterations by cross-validation.

Throws:
java.lang.Exception

performBoosting

protected int performBoosting(Instances train,
                              Instances test,
                              double[] error,
                              int maxIterations)
                       throws java.lang.Exception
Runs LogitBoost on a training set and monitors the error on a test set. Used for running one fold when cross-validating the number of LogitBoost iterations.

Parameters:
train - the training set
test - the test set
error - array to hold the logged error values
maxIterations - the maximum number of LogitBoost iterations to run
Returns:
the number of completed LogitBoost iterations (can be smaller than maxIterations if the heuristic for early stopping is active or there is a problem while fitting the regressions in LogitBoost).
Throws:
java.lang.Exception

performBoosting

protected void performBoosting(int numIterations)
                        throws java.lang.Exception
Runs LogitBoost with a fixed number of iterations.

Parameters:
numIterations - the number of iterations to run
Throws:
java.lang.Exception

performBoosting

protected void performBoosting()
                        throws java.lang.Exception
Runs LogitBoost using the stopping criterion on the training set. The number of iterations used is the one that gives the lowest error on the training set, measured either as misclassification error or as error on probabilities (depending on the errorOnProbabilities option).

Throws:
java.lang.Exception

getErrorRate

protected double getErrorRate(Instances data)
                       throws java.lang.Exception
Returns the misclassification error of the current model on a set of instances.

Parameters:
data - the set of instances
Returns:
the error rate
Throws:
java.lang.Exception
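
A misclassification error of this kind can be sketched as the fraction of instances whose most probable class differs from the actual class. The sketch below works on plain arrays instead of Instances; the names ErrorRateSketch and errorRate are hypothetical, and instance weights are ignored for simplicity.

```java
class ErrorRateSketch {
    /** probs[i][j] is the estimated probability of class j for instance i. */
    static double errorRate(double[][] probs, int[] actualClass) {
        int errors = 0;
        for (int i = 0; i < probs.length; i++) {
            int predicted = 0;
            for (int j = 1; j < probs[i].length; j++) {
                if (probs[i][j] > probs[i][predicted]) {
                    predicted = j; // keep the index of the largest probability
                }
            }
            if (predicted != actualClass[i]) {
                errors++;
            }
        }
        return (double) errors / probs.length;
    }

    public static void main(String[] args) {
        double[][] probs = {{0.9, 0.1}, {0.4, 0.6}, {0.7, 0.3}, {0.2, 0.8}};
        System.out.println(errorRate(probs, new int[] {0, 1, 1, 1}));
    }
}
```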

getMeanAbsoluteError

protected double getMeanAbsoluteError(Instances data)
                               throws java.lang.Exception
Returns the error of the probability estimates for the current model on a set of instances.

Parameters:
data - the set of instances
Returns:
the error
Throws:
java.lang.Exception
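
One plausible reading of "error of the probability estimates" is the mean absolute difference between actual and estimated class probabilities. The sketch below assumes that reading and averages over all instances and classes; the normalization, the absence of instance weights, and the names MeanAbsoluteErrorSketch and meanAbsoluteError are assumptions.

```java
class MeanAbsoluteErrorSketch {
    /** ys[i][j] is the actual (0/1) and probs[i][j] the estimated probability of class j. */
    static double meanAbsoluteError(double[][] ys, double[][] probs) {
        double sum = 0.0;
        int count = 0;
        for (int i = 0; i < ys.length; i++) {
            for (int j = 0; j < ys[i].length; j++) {
                sum += Math.abs(ys[i][j] - probs[i][j]);
                count++;
            }
        }
        return sum / count;
    }

    public static void main(String[] args) {
        double[][] ys = {{1, 0}, {0, 1}};
        double[][] probs = {{0.5, 0.5}, {0.0, 1.0}};
        System.out.println(meanAbsoluteError(ys, probs));
    }
}
```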

getBestIteration

protected int getBestIteration(double[] errors,
                               int maxIteration)
Helper function to find the minimum in an array of error values.
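
A minimal sketch of such a minimum search is shown below. It assumes that maxIteration is an inclusive upper index into errors and that ties go to the earliest iteration; neither detail is stated here, and the names BestIterationSketch and bestIteration are hypothetical.

```java
class BestIterationSketch {
    /** Returns the index of the smallest error among errors[0..maxIteration]. */
    static int bestIteration(double[] errors, int maxIteration) {
        int best = 0;
        for (int i = 1; i <= maxIteration && i < errors.length; i++) {
            if (errors[i] < errors[best]) { // strict comparison keeps the earliest minimum
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] errors = {0.4, 0.2, 0.3, 0.2, 0.1};
        System.out.println(bestIteration(errors, 3));
    }
}
```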


performIteration

protected boolean performIteration(int iteration,
                                   double[][] trainYs,
                                   double[][] trainFs,
                                   double[][] probs,
                                   Instances trainNumeric)
                            throws java.lang.Exception
Performs a single iteration of LogitBoost, and updates the model accordingly. A simple regression function is fit to the response and added to the m_regressions array.

Parameters:
iteration - the current iteration
trainYs - the y-values (see description of LogitBoost) for the model trained so far
trainFs - the F-values (see description of LogitBoost) for the model trained so far
probs - the p-values (see description of LogitBoost) for the model trained so far
trainNumeric - numeric version of the training data
Returns:
true if the iteration was performed successfully, false if no simple regression function could be fitted.
Throws:
java.lang.Exception

initRegressions

protected SimpleLinearRegression[][] initRegressions()
Helper function to initialize m_regressions.


getNumericData

protected Instances getNumericData(Instances data)
                            throws java.lang.Exception
Converts training data to numeric version. The class variable is replaced by a pseudo-class used by LogitBoost.

Throws:
java.lang.Exception

selectRegressions

protected SimpleLinearRegression[][] selectRegressions(SimpleLinearRegression[][] classifiers)
Helper function for cutting back m_regressions to the set of classifiers (corresponding to the number of LogitBoost iterations) that gave the smallest error.


getZ

protected double getZ(double actual,
                      double p)
Computes the LogitBoost response variable from y/p values (actual/estimated class probabilities).
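
The response is presumably the standard LogitBoost working response z = (y - p) / (p (1 - p)), limited in magnitude by Z_MAX. A minimal sketch under that assumption follows; the class name, the method name and the zMax parameter are hypothetical, and the actual value of Z_MAX is not stated in this section.

```java
class ResponseSketch {
    /**
     * Working response for one class of one instance.
     * For y = 1 the textbook formula reduces to 1/p, for y = 0 to -1/(1-p).
     */
    static double response(double actual, double p, double zMax) {
        if (actual == 1.0) {
            return Math.min(1.0 / p, zMax);     // clamp large positive responses
        }
        return Math.max(-1.0 / (1.0 - p), -zMax); // clamp large negative responses
    }

    public static void main(String[] args) {
        System.out.println(response(1.0, 0.5, 4.0));
        System.out.println(response(0.0, 0.5, 4.0));
    }
}
```

The clamp keeps the response finite when an estimated probability approaches 0 or 1, where the unclamped value would grow without bound.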


getZs

protected double[][] getZs(double[][] probs,
                           double[][] dataYs)
Computes the LogitBoost response for an array of y/p values (actual/estimated class probabilities).


getWs

protected double[][] getWs(double[][] probs,
                           double[][] dataYs)
Computes the LogitBoost weights from an array of y/p values (actual/estimated class probabilities).
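
In the textbook formulation of LogitBoost the weight for class j of instance i is w = p (1 - p). The sketch below assumes that form, in which the y-values are not needed; whether this class computes the weights in exactly this way is not stated here, and the names WeightSketch and weights are hypothetical.

```java
class WeightSketch {
    /** probs[i][j] is the current probability estimate for class j of instance i. */
    static double[][] weights(double[][] probs) {
        double[][] w = new double[probs.length][];
        for (int i = 0; i < probs.length; i++) {
            w[i] = new double[probs[i].length];
            for (int j = 0; j < probs[i].length; j++) {
                w[i][j] = probs[i][j] * (1.0 - probs[i][j]);
            }
        }
        return w;
    }

    public static void main(String[] args) {
        double[][] w = weights(new double[][] {{0.5, 0.5}, {0.2, 0.8}});
        System.out.println(java.util.Arrays.deepToString(w));
    }
}
```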


probs

protected double[] probs(double[] Fs)
Computes the p-values (probabilities for the classes) from the F-values of the logistic model.
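
In LogitBoost the class probabilities follow from the F-values as p_j = exp(F_j) / sum_k exp(F_k). A minimal sketch of that mapping follows; subtracting the maximum F-value before exponentiating is a numerical-stability choice made for this illustration, and the class name SoftmaxSketch is hypothetical.

```java
class SoftmaxSketch {
    /** Converts F-values to probabilities that sum to one. */
    static double[] probs(double[] fs) {
        double max = Double.NEGATIVE_INFINITY;
        for (double f : fs) {
            max = Math.max(max, f);
        }
        double[] p = new double[fs.length];
        double sum = 0.0;
        for (int j = 0; j < fs.length; j++) {
            p[j] = Math.exp(fs[j] - max); // shifting by the max avoids overflow
            sum += p[j];
        }
        for (int j = 0; j < fs.length; j++) {
            p[j] /= sum;
        }
        return p;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(probs(new double[] {1.0, -1.0})));
    }
}
```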


getYs

protected double[][] getYs(Instances data)
Computes the Y-values (actual class probabilities) for a set of instances.
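
The "actual class probabilities" are presumably 0/1 indicators: 1 for the observed class of an instance and 0 for all others. A minimal sketch on plain arrays, with hypothetical names TargetSketch and ys:

```java
class TargetSketch {
    /** classIndex[i] is the observed class of instance i. */
    static double[][] ys(int[] classIndex, int numClasses) {
        double[][] y = new double[classIndex.length][numClasses]; // initialized to 0.0
        for (int i = 0; i < classIndex.length; i++) {
            y[i][classIndex[i]] = 1.0;
        }
        return y;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.deepToString(ys(new int[] {2, 0}, 3)));
    }
}
```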


getFs

protected double[] getFs(Instance instance)
                  throws java.lang.Exception
Computes the F-values for a single instance.

Throws:
java.lang.Exception
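
In the textbook multi-class LogitBoost update, each iteration adds the centered regression outputs to the F-values: F_j += ((J - 1) / J) * (f_j - mean_k f_k), where J is the number of classes. The sketch below assumes that update and takes the per-iteration regression outputs as a plain array; the names FValueSketch and fValues are hypothetical.

```java
class FValueSketch {
    /** predictions[m][j] is the output of the regression for class j in iteration m. */
    static double[] fValues(double[][] predictions, int numClasses) {
        double[] fs = new double[numClasses];
        for (double[] pred : predictions) {
            double mean = 0.0;
            for (int j = 0; j < numClasses; j++) {
                mean += pred[j];
            }
            mean /= numClasses;
            for (int j = 0; j < numClasses; j++) {
                // centered and scaled contribution of this iteration
                fs[j] += (pred[j] - mean) * (numClasses - 1) / numClasses;
            }
        }
        return fs;
    }

    public static void main(String[] args) {
        double[][] predictions = {{1.0, -1.0}, {2.0, 0.0}};
        System.out.println(java.util.Arrays.toString(fValues(predictions, 2)));
    }
}
```

Because of the centering, the F-values of an instance sum to zero across classes.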

getFs

protected double[][] getFs(Instances data)
                    throws java.lang.Exception
Computes the F-values for a set of instances.

Throws:
java.lang.Exception

getProbs

protected double[][] getProbs(double[][] dataFs)
Computes the p-values (probabilities for the different classes) from the F-values for a set of instances.


logLikelihood

protected double logLikelihood(double[][] dataYs,
                               double[][] probs)
Returns the log-likelihood of the Y-values (actual class probabilities) given the p-values (current probability estimates).
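
The method name suggests a log-likelihood. A minimal sketch of the multinomial log-likelihood sum_i sum_j y_ij * log(p_ij) follows; the sign and scaling convention used by this class are not stated here, so both are assumptions, as are the names LogLikelihoodSketch and logLikelihood.

```java
class LogLikelihoodSketch {
    /** ys[i][j] is the actual (0/1) and probs[i][j] the estimated probability of class j. */
    static double logLikelihood(double[][] ys, double[][] probs) {
        double ll = 0.0;
        for (int i = 0; i < ys.length; i++) {
            for (int j = 0; j < ys[i].length; j++) {
                if (ys[i][j] > 0.0) { // skip zero terms so that 0 * log(0) never occurs
                    ll += ys[i][j] * Math.log(probs[i][j]);
                }
            }
        }
        return ll;
    }

    public static void main(String[] args) {
        double[][] ys = {{1, 0}, {0, 1}};
        double[][] probs = {{0.5, 0.5}, {0.25, 0.75}};
        System.out.println(logLikelihood(ys, probs));
    }
}
```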


getUsedAttributes

public int[][] getUsedAttributes()
Returns an array of the indices of the attributes used in the logistic model. The first dimension is the class, the second dimension holds a list of attribute indices. Attribute indices start at zero.

Returns:
the array of attribute indices
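
The returned layout can be illustrated by deriving it from a coefficient array in which position zero holds the constant term. The sketch assumes an attribute counts as used when its coefficient is non-zero; that criterion and the names UsedAttributesSketch and usedAttributes are assumptions.

```java
class UsedAttributesSketch {
    /** coefficients[c][0] is the constant term; coefficients[c][k + 1] belongs to attribute k. */
    static int[][] usedAttributes(double[][] coefficients) {
        int[][] used = new int[coefficients.length][];
        for (int c = 0; c < coefficients.length; c++) {
            int count = 0;
            for (int k = 1; k < coefficients[c].length; k++) {
                if (coefficients[c][k] != 0.0) {
                    count++;
                }
            }
            used[c] = new int[count];
            int next = 0;
            for (int k = 1; k < coefficients[c].length; k++) {
                if (coefficients[c][k] != 0.0) {
                    used[c][next++] = k - 1; // attribute indices start at zero
                }
            }
        }
        return used;
    }

    public static void main(String[] args) {
        double[][] coefficients = {{0.5, 3.0, 0.0, 0.5}, {1.0, 0.0, 0.0, 0.0}};
        System.out.println(java.util.Arrays.deepToString(usedAttributes(coefficients)));
    }
}
```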

getNumRegressions

public int getNumRegressions()
The number of LogitBoost iterations performed (= the number of simple regression functions fit).


setMaxIterations

public void setMaxIterations(int maxIterations)
Sets the parameter "maxIterations".


setHeuristicStop

public void setHeuristicStop(int heuristicStop)
Sets the option "heuristicStop".


getMaxIterations

public int getMaxIterations()
Returns the maxIterations parameter.


getCoefficients

protected double[][] getCoefficients()
Returns an array holding the coefficients of the logistic model. The first dimension is the class; the second holds the list of coefficients. Position zero stores the constant term of the model, followed by the coefficients for the attributes in ascending order.

Returns:
the array of coefficients
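
Since the model is a sum of simple regression functions, each on a single attribute, the coefficient vector for one class can be sketched by accumulating intercepts and slopes. SimpleFit below is a hypothetical stand-in for a fitted simple regression, not the Weka SimpleLinearRegression class, and any scaling or centering applied by LogitBoost is ignored.

```java
class CoefficientSketch {
    /** Hypothetical fitted function: y = intercept + slope * x[attribute]. */
    static final class SimpleFit {
        final int attribute;
        final double slope;
        final double intercept;

        SimpleFit(int attribute, double slope, double intercept) {
            this.attribute = attribute;
            this.slope = slope;
            this.intercept = intercept;
        }
    }

    /** Position 0 holds the constant term, position k + 1 the coefficient of attribute k. */
    static double[] coefficients(SimpleFit[] fits, int numAttributes) {
        double[] coef = new double[numAttributes + 1];
        for (SimpleFit fit : fits) {
            coef[0] += fit.intercept;
            coef[fit.attribute + 1] += fit.slope;
        }
        return coef;
    }

    public static void main(String[] args) {
        SimpleFit[] fits = {
            new SimpleFit(0, 2.0, 1.0), new SimpleFit(2, 0.5, -0.5), new SimpleFit(0, 1.0, 0.0)
        };
        System.out.println(java.util.Arrays.toString(coefficients(fits, 3)));
    }
}
```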

percentAttributesUsed

public double percentAttributesUsed()
Returns the fraction of all attributes in the data that are used in the logistic model (in percent). An attribute is used in the model if it is used in any of the models for the different classes.
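
The percentage can be sketched as the size of the union of the per-class attribute lists divided by the number of attributes. Whether the class attribute is counted in the denominator is not stated here; the sketch assumes numAttributes counts only the predictor attributes, and the names PercentUsedSketch and percentUsed are hypothetical.

```java
class PercentUsedSketch {
    /** usedPerClass[c] lists the zero-based attribute indices used in the model for class c. */
    static double percentUsed(int[][] usedPerClass, int numAttributes) {
        boolean[] used = new boolean[numAttributes];
        for (int[] indices : usedPerClass) {
            for (int index : indices) {
                used[index] = true; // used in any class model counts as used
            }
        }
        int count = 0;
        for (boolean u : used) {
            if (u) {
                count++;
            }
        }
        return 100.0 * count / numAttributes;
    }

    public static void main(String[] args) {
        System.out.println(percentUsed(new int[][] {{0, 2}, {2}}, 4));
    }
}
```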


toString

public java.lang.String toString()
Returns a description of the logistic model (i.e., attributes and coefficients).


distributionForInstance

public double[] distributionForInstance(Instance instance)
                                 throws java.lang.Exception
Returns class probabilities for an instance.

Overrides:
distributionForInstance in class Classifier
Parameters:
instance - the instance to be classified
Returns:
an array containing the estimated membership probabilities of the test instance in each class or the numeric prediction
Throws:
java.lang.Exception - if distribution can't be computed successfully

cleanup

public void cleanup()
Cleanup in order to save memory.