java.lang.Object
  weka.classifiers.rules.RuleStats
This class implements the statistics functions used in the propositional rule learner. They range from simple ones, such as counting true/false positives/negatives and filtering data based on the ruleset, to more sophisticated ones, such as the MDL calculation and the generation of rule variants for each rule in the ruleset.
These functions need a specific dataset and a specific ruleset, both of which are supplied when an object of this class is created.
Field Summary | |
private Instances |
m_Data
The data on which the stats calculation is based |
private FastVector |
m_Distributions
The class distributions predicted by each rule |
private FastVector |
m_Filtered
The set of instances filtered by the ruleset |
private FastVector |
m_Ruleset
The specific ruleset in question |
private FastVector |
m_SimpleStats
The simple stats of each rule |
private double |
m_Total
The total number of possible conditions that could appear in a rule |
private double |
MDL_THEORY_WEIGHT
The theory weight in the MDL calculation |
private static double |
REDUNDANCY_FACTOR
The redundancy factor in theory description length |
Constructor Summary | |
RuleStats()
Default constructor |
|
RuleStats(Instances data,
FastVector rules)
Constructor that provides ruleset and data |
Method Summary | |
void |
addAndUpdate(Rule lastRule)
Add a rule to the ruleset and update the stats |
double |
combinedDL(double expFPRate,
double predicted)
Compute the combined DL of the ruleset in this class, i.e. theory DL and data DL. |
private Instances[] |
computeSimpleStats(int index,
Instances insts,
double[] stats,
double[] dist)
Find all the instances in the dataset covered/not covered by the rule at the given index. The corresponding simple statistics and predicted class distributions are stored in the given double arrays, and can be obtained by getSimpleStats() and getDistributions(). |
void |
countData()
Filter the data according to the ruleset and compute the basic stats: coverage/uncoverage, true/false positive/negatives of each rule |
void |
countData(int index,
Instances uncovered,
double[][] prevRuleStats)
Count data from the position index in the ruleset assuming that given data are not covered by the rules in position 0... |
static double |
dataDL(double expFPOverErr,
double cover,
double uncover,
double fp,
double fn)
The description length of data given the parameters of the data based on the ruleset. |
Instances |
getData()
Get the data of the stats |
double[] |
getDistributions(int index)
Get the class distribution predicted by the rule in given position |
Instances[] |
getFiltered(int index)
Get the data after filtering the given rule |
FastVector |
getRuleset()
Get the ruleset of the stats |
int |
getRulesetSize()
Get the size of the ruleset in the stats |
double[] |
getSimpleStats(int index)
Get the simple stats of one rule, including 6 parameters: 0: coverage; 1: uncoverage; 2: true positives; 3: true negatives; 4: false positives; 5: false negatives |
double |
minDataDLIfDeleted(int index,
double expFPRate,
boolean checkErr)
Compute the minimal data description length of the ruleset if the rule in the given position is deleted. |
double |
minDataDLIfExists(int index,
double expFPRate,
boolean checkErr)
Compute the minimal data description length of the ruleset if the rule in the given position is NOT deleted. |
static double |
numAllConditions(Instances data)
Compute the number of all possible conditions that could appear in a rule of a given data. |
static Instances[] |
partition(Instances data,
int numFolds)
Partition the data into 2 parts, the first of which has (numFolds-1)/numFolds of the data and the second 1/numFolds of the data |
double |
potential(int index,
double expFPOverErr,
double[] rulesetStat,
double[] ruleStat,
boolean checkErr)
Calculate the potential to decrease the DL of the ruleset, i.e. the possible DL that could be decreased by deleting the rule whose index and simple statistics are given. |
void |
reduceDL(double expFPRate,
boolean checkErr)
Try to reduce the DL of the ruleset by tentatively removing the rules one by one in reverse order, and update all the stats |
double |
relativeDL(int index,
double expFPRate,
boolean checkErr)
The description length (DL) of the ruleset relative to the case where the rule in the given position is deleted, obtained as: MDL if the rule exists - MDL if the rule does not exist. Note that the minimal possible DL of the ruleset is calculated (i.e. some other rules may also be deleted) instead of the DL of the current ruleset. |
void |
removeLast()
Remove the last rule in the ruleset as well as its stats. |
static Instances |
rmCoveredBySuccessives(Instances data,
FastVector rules,
int index)
Static utility function to count the data covered by the rules after the given index in the given rules, and then remove them. |
void |
setData(Instances data)
Set the data of the stats, overwriting the old one if any |
void |
setMDLTheoryWeight(double weight)
Set the weight of theory in MDL calculation |
void |
setNumAllConds(double total)
Set the number of all conditions that could appear in a rule in this RuleStats object. If the number set is smaller than 0 (typically -1), it is calculated from the stored data |
void |
setRuleset(FastVector rules)
Set the ruleset of the stats, overwriting the old one if any |
static Instances |
stratify(Instances data,
int folds,
java.util.Random rand)
Stratify the given data into the given number of bags based on the class values. |
static double |
subsetDL(double t,
double k,
double p)
Subset description length: S(t,k,p) = -k*log2(p) - (t-k)*log2(1-p). For details, see Quinlan: "MDL and categorical theories (Continued)", ML95 |
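The subset description length can be sketched in plain Java as follows. This is a minimal stand-alone illustration, not the Weka source: the class name SubsetDLSketch and the log2 helper are hypothetical. It reads the formula with t as the size of the known set (so the second term is (t-k)*log2(1-p)), which is an assumption based on the parameter list, and it skips any term whose count is zero so that 0 * log2(0) does not produce NaN.

```java
// Hypothetical stand-alone sketch; not the Weka implementation.
final class SubsetDLSketch {

    /** Base-2 logarithm. */
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /**
     * S(t,k,p) = -k*log2(p) - (t-k)*log2(1-p)
     *
     * @param t number of elements in the known set
     * @param k number of elements in the subset
     * @param p expected proportion of the subset known by the recipient
     */
    static double subsetDL(double t, double k, double p) {
        double bits = 0.0;
        // Assumption of this sketch: a term with a zero count contributes nothing.
        if (k > 0.0) {
            bits -= k * log2(p);
        }
        if (t - k > 0.0) {
            bits -= (t - k) * log2(1.0 - p);
        }
        return bits;
    }

    public static void main(String[] args) {
        // With p = 0.5 every element costs one bit, so 10 elements cost about 10 bits.
        System.out.println(subsetDL(10, 5, 0.5));
    }
}
```

The cost is lowest when p is close to the true proportion k/t, which is why the expected proportion matters to the encoding.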
double |
theoryDL(int index)
The description length of the theory for a given rule. |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
private Instances m_Data
private FastVector m_Ruleset
private FastVector m_SimpleStats
private FastVector m_Filtered
private double m_Total
private static double REDUNDANCY_FACTOR
private double MDL_THEORY_WEIGHT
private FastVector m_Distributions
Constructor Detail |
public RuleStats()
public RuleStats(Instances data, FastVector rules)
data - the data
rules - the ruleset
Method Detail |
public void setNumAllConds(double total)
total - the set number
public void setData(Instances data)
data - the data to be set
public Instances getData()
public void setRuleset(FastVector rules)
rules - the set of rules to be set
public FastVector getRuleset()
public int getRulesetSize()
public double[] getSimpleStats(int index)
index - the index of the rule
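The six-element layout returned by getSimpleStats (0: coverage, 1: uncoverage, 2: true positives, 3: true negatives, 4: false positives, 5: false negatives) can be illustrated with a small stand-alone sketch. Everything below is hypothetical and independent of Weka: the Example type, the Predicate standing in for a rule, and the assumption that every instance has unit weight.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-alone sketch of the six "simple stats" of one rule.
final class SimpleStatsSketch {

    // Index layout as documented for getSimpleStats(int).
    static final int COVER = 0, UNCOVER = 1, TP = 2, TN = 3, FP = 4, FN = 5;

    /** Hypothetical labelled example: one numeric feature plus a class label. */
    static final class Example {
        final double x;
        final String label;

        Example(double x, String label) {
            this.x = x;
            this.label = label;
        }
    }

    /**
     * Counts the six stats for a rule that predicts the class "predicted"
     * for every example it covers. Unit instance weights are assumed.
     */
    static double[] simpleStats(List<Example> data, Predicate<Example> covers, String predicted) {
        double[] stats = new double[6];
        for (Example e : data) {
            boolean isTarget = e.label.equals(predicted);
            if (covers.test(e)) {
                stats[COVER]++;
                if (isTarget) stats[TP]++; else stats[FP]++;
            } else {
                stats[UNCOVER]++;
                if (isTarget) stats[FN]++; else stats[TN]++;
            }
        }
        return stats;
    }
}
```

By construction, coverage equals true positives plus false positives, and uncoverage equals true negatives plus false negatives.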
public Instances[] getFiltered(int index)
index - the index of the rule
public double[] getDistributions(int index)
index - the position index of the rule
public void setMDLTheoryWeight(double weight)
weight - the weight to be set
public static double numAllConditions(Instances data)
data - the given data
public void countData()
public void countData(int index, Instances uncovered, double[][] prevRuleStats)
index - the given position
uncovered - the data not covered by rules before index
prevRuleStats - the provided stats of previous rules
private Instances[] computeSimpleStats(int index, Instances insts, double[] stats, double[] dist)
index - the given index, assuming correct
insts - the dataset to be covered by the rule
stats - the given double array to hold stats, side-effected
dist - the given array to hold class distributions, side-effected; if null, the distribution is not necessary
public void addAndUpdate(Rule lastRule)
public static double subsetDL(double t, double k, double p)
t - the number of elements in a known set
k - the number of elements in a subset
p - the expected proportion of subset known by recipient
public double theoryDL(int index)
For details, see Quinlan: "MDL and categorical theories (Continued)", ML95
index - the index of the given rule (assuming correct)
Throws: if the index is out of range or the object is not initialized yet
public static double dataDL(double expFPOverErr, double cover, double uncover, double fp, double fn)
For details, see Quinlan: "MDL and categorical theories (Continued)", ML95
expFPOverErr - expected FP/(FP+FN)
cover - coverage
uncover - uncoverage
fp - False Positive
fn - False Negative
public double potential(int index, double expFPOverErr, double[] rulesetStat, double[] ruleStat, boolean checkErr)
The approach is copied from the original RIPPER implementation and is quite bizarre: when testing each rule, it no longer updates the stats of the following rules recursively. In other words, it assumes that after the deletion no data is covered by the following rules (or it regards the deleted rule as the last rule). Reasonable assumption?
index - the index of the rule in m_Ruleset to be deleted
expFPOverErr - expected FP/(FP+FN)
rulesetStat - the simple statistics of the ruleset, updated if the rule should be deleted
ruleStat - the simple statistics of the rule to be deleted
checkErr - whether to check if the error rate is >= 0.5
public double minDataDLIfDeleted(int index, double expFPRate, boolean checkErr)
index - the index of the rule in question
expFPRate - expected FP/(FP+FN), used in dataDL calculation
checkErr - whether to check if the error rate is >= 0.5
public double minDataDLIfExists(int index, double expFPRate, boolean checkErr)
index - the index of the rule in question
expFPRate - expected FP/(FP+FN), used in dataDL calculation
checkErr - whether to check if the error rate is >= 0.5
public double relativeDL(int index, double expFPRate, boolean checkErr)
index - the given position of the rule in question (assuming correct)
expFPRate - expected FP/(FP+FN), used in dataDL calculation
checkErr - whether to check if the error rate is >= 0.5
public void reduceDL(double expFPRate, boolean checkErr)
expFPRate - expected FP/(FP+FN), used in dataDL calculation
checkErr - whether to check if the error rate is >= 0.5
public void removeLast()
public static Instances rmCoveredBySuccessives(Instances data, FastVector rules, int index)
data - the data to be processed
rules - the ruleset
index - the given index
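The idea behind rmCoveredBySuccessives, removing the data covered by the rules after the given index, can be sketched without Weka types. The code below is an illustrative assumption rather than the library's implementation: rules are modelled as java.util.function.Predicate objects, the data as a generic List, and "after the given index" is read as positions index+1 onward.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-alone sketch; rules are modelled as predicates.
final class RemoveCoveredSketch {

    /**
     * Returns the items of data that are NOT covered by any rule located
     * after position index in rules. The input list is left unchanged.
     */
    static <T> List<T> rmCoveredBySuccessives(List<T> data, List<Predicate<T>> rules, int index) {
        List<T> remaining = new ArrayList<>();
        for (T item : data) {
            boolean covered = false;
            // Only the successors of the rule at "index" are consulted.
            for (int i = index + 1; i < rules.size() && !covered; i++) {
                covered = rules.get(i).test(item);
            }
            if (!covered) {
                remaining.add(item);
            }
        }
        return remaining;
    }
}
```

Passing the index of the last rule leaves the data untouched, because no successor rules exist to cover anything.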
public static final Instances stratify(Instances data, int folds, java.util.Random rand)
Stratify the given data into the given number of bags based on the class values. It differs from Instances.stratify(int fold) in that, before stratification, it sorts the instances according to the class order in the header file. It assumes no missing values in the class.
data - the given data
folds - the given number of folds
rand - the random object used to randomize the instances
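One way to picture class-based stratification is the sketch below. It is an assumption about the general technique, not the Weka code: instances are represented only by their class indices (a hypothetical int array), they are grouped in class order, shuffled within each class, and then dealt out so that fold k receives every folds-th item starting at offset k. The result is a single reordered list of instance positions in which consecutive chunks form the folds.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical stand-alone sketch; instances are represented by class indices only.
final class StratifySketch {

    /**
     * @param classOf    classOf[i] is the class index (0..numClasses-1) of instance i
     * @param numClasses number of distinct classes, in header order
     * @param folds      number of folds (bags)
     * @param rand       source of randomness for shuffling within each class
     * @return instance positions reordered so that consecutive chunks are stratified folds
     */
    static List<Integer> stratify(int[] classOf, int numClasses, int folds, Random rand) {
        // 1. Bucket the instance positions by class, in class order.
        List<List<Integer>> byClass = new ArrayList<>();
        for (int c = 0; c < numClasses; c++) {
            byClass.add(new ArrayList<>());
        }
        for (int i = 0; i < classOf.length; i++) {
            byClass.get(classOf[i]).add(i);
        }
        // 2. Shuffle within each class and concatenate the buckets.
        List<Integer> sorted = new ArrayList<>();
        for (List<Integer> bag : byClass) {
            Collections.shuffle(bag, rand);
            sorted.addAll(bag);
        }
        // 3. Deal: fold k takes every folds-th item, starting at offset k.
        List<Integer> result = new ArrayList<>();
        for (int k = 0; k < folds; k++) {
            for (int i = k; i < sorted.size(); i += folds) {
                result.add(sorted.get(i));
            }
        }
        return result;
    }
}
```

Because the items are sorted by class before dealing, each fold receives roughly the same share of every class.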
public double combinedDL(double expFPRate, double predicted)
expFPRate - expected FP/(FP+FN), used in dataDL calculation
predicted - the default classification if ruleset covers null
public static final Instances[] partition(Instances data, int numFolds)
data - the given data
numFolds - the given number of folds
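The two-way split performed by partition can be sketched as follows, using a generic List in place of Instances. The class name and the use of integer division to locate the split point are assumptions of this sketch, not details confirmed by the documentation above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch of a (numFolds-1)/numFolds vs 1/numFolds split.
final class PartitionSketch {

    /**
     * Splits data into two parts: the first holds about (numFolds-1)/numFolds
     * of the items, the second holds the remaining items (about 1/numFolds).
     */
    static <T> List<List<T>> partition(List<T> data, int numFolds) {
        // Integer division rounds down, so any remainder goes to the second part.
        int split = data.size() * (numFolds - 1) / numFolds;
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(data.subList(0, split)));
        parts.add(new ArrayList<>(data.subList(split, data.size())));
        return parts;
    }
}
```

With 10 items and numFolds = 3, for example, the split point is 10 * 2 / 3 = 6, so the parts hold 6 and 4 items. The split preserves the order of the input, so stratifying the data first keeps the class proportions similar in both parts.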