weka.filters.unsupervised.attribute
Class StringToWordVector

java.lang.Object
  extended by weka.filters.Filter
      extended by weka.filters.unsupervised.attribute.StringToWordVector
All Implemented Interfaces:
OptionHandler, java.io.Serializable, UnsupervisedFilter

public class StringToWordVector
extends Filter
implements UnsupervisedFilter, OptionHandler

Converts String attributes into a set of attributes representing word occurrence information from the text contained in the strings. The set of words (attributes) is determined by the first batch filtered (typically training data).
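As a rough illustration of the first-batch rule, the sketch below builds a word dictionary from one batch and then converts a later document against it. It is plain Java and does not use Weka; the class name FirstBatchDictionarySketch, its methods, and the whitespace-only tokenization are assumptions made for the example, not the filter's implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Illustrative stand-in only; this is not the Weka implementation.
final class FirstBatchDictionarySketch {

    // Collects every whitespace-separated word of the first batch and
    // assigns column indexes in sorted word order.
    static TreeMap<String, Integer> buildDictionary(List<String> firstBatch) {
        TreeMap<String, Integer> dictionary = new TreeMap<>();
        for (String doc : firstBatch) {
            StringTokenizer tokens = new StringTokenizer(doc);
            while (tokens.hasMoreTokens()) {
                dictionary.put(tokens.nextToken(), 0);
            }
        }
        int index = 0;
        for (Map.Entry<String, Integer> entry : dictionary.entrySet()) {
            entry.setValue(index++);
        }
        return dictionary;
    }

    // Converts one document; words missing from the dictionary have no column.
    static double[] toVector(String doc, TreeMap<String, Integer> dictionary,
                             boolean outputCounts) {
        double[] vector = new double[dictionary.size()];
        StringTokenizer tokens = new StringTokenizer(doc);
        while (tokens.hasMoreTokens()) {
            Integer column = dictionary.get(tokens.nextToken());
            if (column == null) {
                continue; // word was not seen in the first batch
            }
            vector[column] = outputCounts ? vector[column] + 1 : 1;
        }
        return vector;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> dictionary =
            buildDictionary(Arrays.asList("the cat sat", "the dog"));
        System.out.println(Arrays.toString(toVector("the the bird cat", dictionary, true)));
    }
}
```

In this sketch a dictionary built from "the cat sat" and "the dog" has four columns (cat, dog, sat, the), so "the the bird cat" becomes [1, 0, 0, 2] with counts and [1, 0, 0, 1] with presence flags; "bird" is dropped because it did not occur in the first batch.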

Version:
$Revision: 1.7 $
Author:
Len Trigg (len@reeltwo.com), Stuart Inglis (stuart@reeltwo.com)
See Also:
Serialized Form

Nested Class Summary
private  class StringToWordVector.AlphabeticStringTokenizer
           
private  class StringToWordVector.Count
          Used to store word counts for dictionary selection based on a threshold.
 
Field Summary
private  double avgDocLength
          Contains the average length of documents (among the first batch of instances aka training data).
private  java.lang.String delimiters
          Delimiters used in tokenization
private  int[] docsCounts
          Contains the number of documents (instances) a particular word appears in.
private  java.util.TreeMap m_Dictionary
          Contains a mapping of valid words to attribute indexes
private  boolean m_FirstBatchDone
          True if the first batch has been done
private  boolean m_IDFTransform
          True if word frequencies should be transformed into fij*log(numOfDocs/numOfDocsWithWordi)
private  boolean m_lowerCaseTokens
          True if all tokens should be downcased
private  boolean m_normalizeDocLength
          True if document's (instance's) word frequencies are to be normalized.
private  boolean m_onlyAlphabeticTokens
          True if tokens are to be formed only from alphabetic sequences of characters.
private  boolean m_OutputCounts
          True if output instances should contain word frequency rather than boolean 0 or 1.
private  java.lang.String m_Prefix
          A String prefix for the attribute names
protected  Range m_SelectedRange
          Range of columns to convert to word vectors
private  boolean m_TFTransform
          True if word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j
private  boolean m_useStoplist
          True if tokens that are on a stoplist are to be ignored.
private  int m_WordsToKeep
          The default number of words (per class if there is a class attribute assigned) to attempt to keep.
private  int numInstances
          Contains the number of documents (instances) in the input format from which the dictionary is created.
 
Fields inherited from class weka.filters.Filter
m_NewBatch
 
Constructor Summary
StringToWordVector()
          Default constructor.
StringToWordVector(int wordsToKeep)
          Constructor that allows specification of the target number of words in the output.
 
Method Summary
 java.lang.String attributeNamePrefixTipText()
          Returns the tip text for this property
 boolean batchFinished()
          Signify that this batch of input to the filter is finished.
private  void convertInstance(Instance instance)
           
private  int convertInstancewoDocNorm(Instance instance, FastVector v)
           
 java.lang.String delimitersTipText()
          Returns the tip text for this property
private  void determineDictionary()
           
private  void determineSelectedRange()
           
 java.lang.String getAttributeNamePrefix()
          Get the attribute name prefix.
 java.lang.String getDelimiters()
          Get the value of delimiters.
 boolean getIDFTransform()
          Gets whether the word frequencies in a document should be transformed into:
fij*log(num of Docs/num of Docs with word i)
where fij is the frequency of word i in document (instance) j.
 boolean getLowerCaseTokens()
          Gets whether the tokens are to be converted to lower case.
 boolean getNormalizeDocLength()
          Gets whether the word frequencies for a document (instance) should be normalized.
 boolean getOnlyAlphabeticTokens()
          Gets whether the tokens are to be formed only from contiguous alphabetic sequences.
 java.lang.String[] getOptions()
          Gets the current settings of the filter.
 boolean getOutputWordCounts()
          Gets whether output instances contain 0 or 1 indicating word presence, or word counts.
 Range getSelectedRange()
          Get the value of m_SelectedRange.
 boolean getTFTransform()
          Gets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
 boolean getUseStoplist()
          Gets whether the words on the stoplist are to be ignored (the stoplist is in weka.core.StopWords).
 int getWordsToKeep()
          Gets the number of words (per class if there is a class attribute assigned) to attempt to keep.
 java.lang.String globalInfo()
          Returns a string describing this filter
 java.lang.String IDFTransformTipText()
          Returns the tip text for this property
 boolean input(Instance instance)
          Input an instance for filtering.
 java.util.Enumeration listOptions()
          Returns an enumeration describing the available options
 java.lang.String lowerCaseTokensTipText()
          Returns the tip text for this property.
static void main(java.lang.String[] argv)
          Main method for testing this class.
 java.lang.String normalizeDocLengthTipText()
          Returns the tip text for this property
 java.lang.String onlyAlphabeticTokensTipText()
          Returns the tip text for this property.
 java.lang.String outputWordCountsTipText()
          Returns the tip text for this property
 void setAttributeNamePrefix(java.lang.String newPrefix)
          Set the attribute name prefix.
 void setDelimiters(java.lang.String newDelimiters)
          Set the value of delimiters.
 void setIDFTransform(boolean IDFTransform)
          Sets whether the word frequencies in a document should be transformed into:
fij*log(num of Docs/num of Docs with word i)
where fij is the frequency of word i in document (instance) j.
 boolean setInputFormat(Instances instanceInfo)
          Sets the format of the input instances.
 void setLowerCaseTokens(boolean downCaseTokens)
          Sets whether the tokens are to be converted to lower case.
 void setNormalizeDocLength(boolean normalizeDocLength)
          Sets whether the word frequencies for a document (instance) should be normalized.
 void setOnlyAlphabeticTokens(boolean tokenizeOnlyAlphabeticSequences)
          Sets whether tokens are to be formed only from contiguous alphabetic character sequences.
 void setOptions(java.lang.String[] options)
          Parses a given list of options controlling the behaviour of this object.
 void setOutputWordCounts(boolean outputWordCounts)
          Sets whether output instances contain 0 or 1 indicating word presence, or word counts.
 void setSelectedRange(java.lang.String newSelectedRange)
          Set the value of m_SelectedRange.
 void setTFTransform(boolean TFTransform)
          Sets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.
 void setUseStoplist(boolean useStoplist)
          Sets whether the words that are on the stoplist are to be ignored (the stoplist is in weka.core.StopWords).
 void setWordsToKeep(int newWordsToKeep)
          Sets the number of words (per class if there is a class attribute assigned) to attempt to keep.
private static void sortArray(int[] array)
           
 java.lang.String TFTransformTipText()
          Returns the tip text for this property
 java.lang.String useStoplistTipText()
          Returns the tip text for this property.
 java.lang.String wordsToKeepTipText()
          Returns the tip text for this property
 
Methods inherited from class weka.filters.Filter
batchFilterFile, bufferInput, copyStringValues, copyStringValues, filterFile, flushInput, getInputFormat, getInputStringIndex, getOutputFormat, getOutputStringIndex, getStringIndices, inputFormat, inputFormatPeek, isOutputFormatDefined, numPendingOutput, output, outputFormat, outputFormatPeek, outputPeek, push, resetQueue, setOutputFormat, useFilter
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

delimiters

private java.lang.String delimiters
Delimiters used in tokenization
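To make the role of the delimiters string concrete, here is a minimal sketch of delimiter-based tokenization using java.util.StringTokenizer, where every character of the string acts as a separator. The class name DelimiterTokenizerSketch and the DELIMITERS constant are assumptions chosen for illustration; the constant is not claimed to match the filter's default.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

final class DelimiterTokenizerSketch {

    // Assumed delimiter set for illustration only.
    static final String DELIMITERS = " \n\t.,:'\"()?!";

    // Splits the text at any character contained in the delimiters string.
    static List<String> tokenize(String text, String delimiters) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text, delimiters);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world! (it's fine)", DELIMITERS));
    }
}
```

With this delimiter set the apostrophe splits "it's" into "it" and "s", so the example text yields Hello, world, it, s, fine.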


m_SelectedRange

protected Range m_SelectedRange
Range of columns to convert to word vectors


m_Dictionary

private java.util.TreeMap m_Dictionary
Contains a mapping of valid words to attribute indexes


m_FirstBatchDone

private boolean m_FirstBatchDone
True if the first batch has been done


m_OutputCounts

private boolean m_OutputCounts
True if output instances should contain word frequency rather than boolean 0 or 1.


m_Prefix

private java.lang.String m_Prefix
A String prefix for the attribute names


docsCounts

private int[] docsCounts
Contains the number of documents (instances) a particular word appears in. The counts are stored with the same indexing as given by m_Dictionary.


numInstances

private int numInstances
Contains the number of documents (instances) in the input format from which the dictionary is created. It is used in IDF transform.


avgDocLength

private double avgDocLength
Contains the average length of documents (among the first batch of instances aka training data). This is used in length normalization of documents which will be normalized to average document length.


m_WordsToKeep

private int m_WordsToKeep
The default number of words (per class if there is a class attribute assigned) to attempt to keep.
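One way to read "attempt to keep" is that words are selected by a count threshold rather than by an exact cut-off, so ties can push the result above the target. The sketch below shows that idea under this assumption; the class WordsToKeepSketch and its prune method are hypothetical and are not taken from the filter's source.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

final class WordsToKeepSketch {

    // Keeps every word whose count reaches the count of the
    // wordsToKeep-th most frequent word (assumed selection rule).
    static TreeMap<String, Integer> prune(Map<String, Integer> counts, int wordsToKeep) {
        int[] sorted = new int[counts.size()];
        int i = 0;
        for (int count : counts.values()) {
            sorted[i++] = count;
        }
        Arrays.sort(sorted); // ascending order

        int threshold = 1; // keep everything when there are few words
        if (wordsToKeep > 0 && sorted.length > wordsToKeep) {
            threshold = sorted[sorted.length - wordsToKeep];
        }

        TreeMap<String, Integer> kept = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= threshold) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new TreeMap<>();
        counts.put("a", 5);
        counts.put("b", 3);
        counts.put("c", 3);
        counts.put("d", 1);
        System.out.println(prune(counts, 2).keySet());
    }
}
```

For counts a=5, b=3, c=3, d=1 and a target of 2, the threshold is 3, so a, b, and c are all kept: one more than the target because b and c tie.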


m_TFTransform

private boolean m_TFTransform
True if word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j
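The TF transform can be sketched in a few lines. The example assumes the natural logarithm, which the description does not state explicitly; the class name TfTransformSketch is hypothetical.

```java
import java.util.Arrays;

final class TfTransformSketch {

    // Replaces each raw frequency f with log(1 + f), natural log assumed.
    static double[] apply(double[] frequencies) {
        double[] transformed = new double[frequencies.length];
        for (int i = 0; i < frequencies.length; i++) {
            transformed[i] = Math.log(1.0 + frequencies[i]);
        }
        return transformed;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(apply(new double[] {0, 1, 9})));
    }
}
```

A frequency of 0 stays 0, while larger frequencies are dampened, so a word that occurs 9 times contributes log(10) rather than 9.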


m_normalizeDocLength

private boolean m_normalizeDocLength
True if document's (instance's) word frequencies are to be normalized. They are normalized to the average length of the documents specified in the input format.
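The following sketch shows one possible form of normalization to average document length. It assumes that a document's length is the Euclidean norm of its word-frequency vector, which this page does not specify; the class DocLengthNormalizationSketch and its method names are hypothetical.

```java
import java.util.Arrays;

final class DocLengthNormalizationSketch {

    // Assumed length measure: Euclidean norm of the frequency vector.
    static double euclideanLength(double[] vector) {
        double sumOfSquares = 0.0;
        for (double value : vector) {
            sumOfSquares += value * value;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Average length over the first batch of documents.
    static double averageLength(double[][] documents) {
        double total = 0.0;
        for (double[] document : documents) {
            total += euclideanLength(document);
        }
        return documents.length == 0 ? 0.0 : total / documents.length;
    }

    // Rescales a vector so that its length equals the given average length.
    static double[] normalize(double[] vector, double avgDocLength) {
        double length = euclideanLength(vector);
        double[] normalized = vector.clone();
        if (length == 0.0) {
            return normalized; // an all-zero document is left unchanged
        }
        for (int i = 0; i < normalized.length; i++) {
            normalized[i] = vector[i] * avgDocLength / length;
        }
        return normalized;
    }

    public static void main(String[] args) {
        double[][] firstBatch = {{3, 4}, {6, 8}};
        double avg = averageLength(firstBatch);
        System.out.println(Arrays.toString(normalize(firstBatch[0], avg)));
    }
}
```

With documents [3, 4] and [6, 8] the lengths are 5 and 10, the average is 7.5, and [3, 4] is rescaled to [4.5, 6.0].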


m_IDFTransform

private boolean m_IDFTransform
True if word frequencies should be transformed into fij*log(numOfDocs/numOfDocsWithWordi)
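The IDF formula above can be sketched as follows. The natural logarithm and the guard for words that appear in no document are assumptions of this example; the class name IdfTransformSketch is hypothetical, while the parameter names docsCounts and numInstances mirror the fields described on this page.

```java
import java.util.Arrays;

final class IdfTransformSketch {

    // Computes fij * log(numInstances / docsCounts[i]) for each word i.
    static double[] apply(double[] frequencies, int[] docsCounts, int numInstances) {
        double[] transformed = new double[frequencies.length];
        for (int i = 0; i < frequencies.length; i++) {
            if (docsCounts[i] > 0) {
                transformed[i] =
                    frequencies[i] * Math.log((double) numInstances / docsCounts[i]);
            } else {
                transformed[i] = 0.0; // assumed guard against division by zero
            }
        }
        return transformed;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
            apply(new double[] {2, 1}, new int[] {4, 1}, 4)));
    }
}
```

A word that occurs in all 4 documents gets weight log(4/4) = 0 regardless of its frequency, while a word found in only one document is scaled by log(4).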


m_onlyAlphabeticTokens

private boolean m_onlyAlphabeticTokens
True if tokens are to be formed only from alphabetic sequences of characters. (The delimiters string property is ignored if this is true).
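A minimal sketch of alphabetic-only tokenization is shown below. It assumes that "alphabetic" means the ASCII letters a-z and A-Z; the class AlphabeticTokenizerSketch is a hypothetical stand-in and not the private AlphabeticStringTokenizer listed above.

```java
import java.util.ArrayList;
import java.util.List;

final class AlphabeticTokenizerSketch {

    // Emits each maximal run of ASCII letters as one token;
    // every other character acts as a separator.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean isAsciiLetter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            if (isAsciiLetter) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("R2-D2 met C3PO_today"));
    }
}
```

Digits and punctuation never appear in tokens under this scheme, so "R2-D2 met C3PO_today" yields R, D, met, C, PO, today.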


m_lowerCaseTokens

private boolean m_lowerCaseTokens
True if all tokens should be downcased


m_useStoplist

private boolean m_useStoplist
True if tokens that are on a stoplist are to be ignored.
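The interplay of lower-casing and stoplist filtering might look like the sketch below. The four-word STOP_WORDS set is a hypothetical placeholder for the list in weka.core.StopWords, and the case-insensitive stoplist lookup is an assumption of this example; the class name StoplistSketch is likewise invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StoplistSketch {

    // Hypothetical stop words for illustration only.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "a", "of", "and"));

    // Optionally lower-cases tokens and drops those found on the stoplist.
    static List<String> filter(List<String> tokens, boolean lowerCase, boolean useStoplist) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String lowered = token.toLowerCase(Locale.ROOT);
            if (useStoplist && STOP_WORDS.contains(lowered)) {
                continue; // assumed case-insensitive stoplist lookup
            }
            kept.add(lowerCase ? lowered : token);
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(filter(Arrays.asList("The", "Cat", "of", "Rome"), true, true));
    }
}
```

With both flags enabled, the tokens The, Cat, of, Rome reduce to cat and rome; with both disabled the list is returned unchanged.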

Constructor Detail

StringToWordVector

public StringToWordVector()
Default constructor. Targets 1000 words in the output.


StringToWordVector

public StringToWordVector(int wordsToKeep)
Constructor that allows specification of the target number of words in the output.

Parameters:
wordsToKeep - the number of words in the output vector (per class if assigned).
Method Detail

listOptions

public java.util.Enumeration listOptions()
Returns an enumeration describing the available options

Specified by:
listOptions in interface OptionHandler
Returns:
an enumeration of all the available options

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses a given list of options controlling the behaviour of this object. Valid options are:

-C
Output word counts rather than boolean word presence.

-D delimiter_characters
Specify set of delimiter characters (default: " \n\t.,:'\\\"()?!")

-R index1,index2-index4,...
Specify list of string attributes to convert to words. (default: all string attributes)

-P attribute_name_prefix
Specify a prefix for the created attribute names. (default: "")

-W number_of_words_to_keep
Specify number of word fields to create. Other, less useful words will be discarded. (default: 1000)

-A
Only tokenize contiguous alphabetic sequences.

-L
Convert all tokens to lower case before adding to the dictionary.

-S
Do not add words to the dictionary which are on the stop list.

-T
Transform word frequencies to log(1+fij) where fij is frequency of word i in document j.

-I
Transform word frequencies to fij*log(numOfDocs/numOfDocsWithWordi) where fij is frequency of word i in document j.

-N
Normalize word frequencies for each document (instance). The frequencies are normalized to the average length of the documents specified in the input format.

Specified by:
setOptions in interface OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported

getOptions

public java.lang.String[] getOptions()
Gets the current settings of the filter.

Specified by:
getOptions in interface OptionHandler
Returns:
an array of strings suitable for passing to setOptions

setInputFormat

public boolean setInputFormat(Instances instanceInfo)
                       throws java.lang.Exception
Sets the format of the input instances.

Overrides:
setInputFormat in class Filter
Parameters:
instanceInfo - an Instances object containing the input instance structure (any instances contained in the object are ignored - only the structure is required).
Returns:
true if the outputFormat may be collected immediately
Throws:
java.lang.Exception - if the input format can't be set successfully

input

public boolean input(Instance instance)
              throws java.lang.Exception
Input an instance for filtering. The filter requires all training instances to be read before producing output.

Overrides:
input in class Filter
Parameters:
instance - the input instance.
Returns:
true if the filtered instance may now be collected with output().
Throws:
java.lang.IllegalStateException - if no input structure has been defined.
java.lang.Exception - if the input instance was not of the correct format or if there was a problem with the filtering.

batchFinished

public boolean batchFinished()
                      throws java.lang.Exception
Signify that this batch of input to the filter is finished. If the filter requires all instances prior to filtering, output() may now be called to retrieve the filtered instances.

Overrides:
batchFinished in class Filter
Returns:
true if there are instances pending output.
Throws:
java.lang.IllegalStateException - if no input structure has been defined.
java.lang.Exception - if there was a problem finishing the batch.
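The calling sequence of input(), batchFinished(), and output() can be illustrated with a toy class that mimics the batch protocol on plain strings. BatchProtocolSketch is a hypothetical stand-in written for this example; it shares only the method names with the filter and performs a much simpler conversion (dropping words that are not in the dictionary).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.TreeSet;

final class BatchProtocolSketch {
    private final List<String> buffered = new ArrayList<>();
    private final Deque<String> outputQueue = new ArrayDeque<>();
    private final TreeSet<String> dictionary = new TreeSet<>();
    private boolean firstBatchDone = false;

    // First batch: buffer only. Later batches: convert immediately.
    boolean input(String document) {
        if (!firstBatchDone) {
            buffered.add(document);
            return false; // nothing can be collected yet
        }
        outputQueue.add(convert(document));
        return true;
    }

    // Ends a batch; on the first call the dictionary is fixed
    // and the buffered documents are converted.
    boolean batchFinished() {
        if (!firstBatchDone) {
            for (String document : buffered) {
                dictionary.addAll(Arrays.asList(document.split("\\s+")));
            }
            for (String document : buffered) {
                outputQueue.add(convert(document));
            }
            buffered.clear();
            firstBatchDone = true;
        }
        return !outputQueue.isEmpty();
    }

    // Returns the next converted document, or null if none is pending.
    String output() {
        return outputQueue.poll();
    }

    // Toy conversion: keep only words that are in the dictionary.
    private String convert(String document) {
        List<String> kept = new ArrayList<>();
        for (String word : document.split("\\s+")) {
            if (dictionary.contains(word)) {
                kept.add(word);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        BatchProtocolSketch filter = new BatchProtocolSketch();
        filter.input("a b");
        filter.input("b c");
        filter.batchFinished();
        System.out.println(filter.output());
    }
}
```

In this sketch input() returns false for every document of the first batch, batchFinished() then makes the converted documents available, and a later input("c d a") returns true with "c a" as its output because "d" was never part of the first batch.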

globalInfo

public java.lang.String globalInfo()
Returns a string describing this filter

Returns:
a description of the filter suitable for displaying in the explorer/experimenter gui

getOutputWordCounts

public boolean getOutputWordCounts()
Gets whether output instances contain 0 or 1 indicating word presence, or word counts.

Returns:
true if word counts should be output.

setOutputWordCounts

public void setOutputWordCounts(boolean outputWordCounts)
Sets whether output instances contain 0 or 1 indicating word presence, or word counts.

Parameters:
outputWordCounts - true if word counts should be output.

outputWordCountsTipText

public java.lang.String outputWordCountsTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getDelimiters

public java.lang.String getDelimiters()
Get the value of delimiters.

Returns:
Value of delimiters.

setDelimiters

public void setDelimiters(java.lang.String newDelimiters)
Set the value of delimiters.


delimitersTipText

public java.lang.String delimitersTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getSelectedRange

public Range getSelectedRange()
Get the value of m_SelectedRange.

Returns:
Value of m_SelectedRange.

setSelectedRange

public void setSelectedRange(java.lang.String newSelectedRange)
Set the value of m_SelectedRange.

Parameters:
newSelectedRange - Value to assign to m_SelectedRange.

getAttributeNamePrefix

public java.lang.String getAttributeNamePrefix()
Get the attribute name prefix.

Returns:
The current attribute name prefix.

setAttributeNamePrefix

public void setAttributeNamePrefix(java.lang.String newPrefix)
Set the attribute name prefix.

Parameters:
newPrefix - String to use as the attribute name prefix.

attributeNamePrefixTipText

public java.lang.String attributeNamePrefixTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getWordsToKeep

public int getWordsToKeep()
Gets the number of words (per class if there is a class attribute assigned) to attempt to keep.

Returns:
the target number of words in the output vector (per class if assigned).

setWordsToKeep

public void setWordsToKeep(int newWordsToKeep)
Sets the number of words (per class if there is a class attribute assigned) to attempt to keep.

Parameters:
newWordsToKeep - the target number of words in the output vector (per class if assigned).

wordsToKeepTipText

public java.lang.String wordsToKeepTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getTFTransform

public boolean getTFTransform()
Gets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.

Returns:
true if word frequencies are to be transformed.

setTFTransform

public void setTFTransform(boolean TFTransform)
Sets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j.


TFTransformTipText

public java.lang.String TFTransformTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getIDFTransform

public boolean getIDFTransform()
Gets whether the word frequencies in a document should be transformed into:
fij*log(num of Docs/num of Docs with word i)
where fij is the frequency of word i in document (instance) j.

Returns:
true if the word frequencies are to be transformed.

setIDFTransform

public void setIDFTransform(boolean IDFTransform)
Sets whether the word frequencies in a document should be transformed into:
fij*log(num of Docs/num of Docs with word i)
where fij is the frequency of word i in document (instance) j.


IDFTransformTipText

public java.lang.String IDFTransformTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getNormalizeDocLength

public boolean getNormalizeDocLength()
Gets whether the word frequencies for a document (instance) should be normalized.

Returns:
true if word frequencies are to be normalized.

setNormalizeDocLength

public void setNormalizeDocLength(boolean normalizeDocLength)
Sets whether the word frequencies for a document (instance) should be normalized.


normalizeDocLengthTipText

public java.lang.String normalizeDocLengthTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getOnlyAlphabeticTokens

public boolean getOnlyAlphabeticTokens()
Gets whether the tokens are to be formed only from contiguous alphabetic sequences. The delimiter string is ignored if this is true.

Returns:
true if tokens are to be formed from contiguous alphabetic characters.

setOnlyAlphabeticTokens

public void setOnlyAlphabeticTokens(boolean tokenizeOnlyAlphabeticSequences)
Sets whether tokens are to be formed only from contiguous alphabetic character sequences. The delimiter string is ignored if this option is set to true.


onlyAlphabeticTokensTipText

public java.lang.String onlyAlphabeticTokensTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getLowerCaseTokens

public boolean getLowerCaseTokens()
Gets whether the tokens are to be converted to lower case.

Returns:
true if the tokens are to be downcased.

setLowerCaseTokens

public void setLowerCaseTokens(boolean downCaseTokens)
Sets whether the tokens are to be converted to lower case. (Doesn't affect non-alphabetic characters in tokens.)

Parameters:
downCaseTokens - should be true if only lower case tokens are to be formed.

lowerCaseTokensTipText

public java.lang.String lowerCaseTokensTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getUseStoplist

public boolean getUseStoplist()
Gets whether the words on the stoplist are to be ignored (the stoplist is in weka.core.StopWords).

Returns:
true if the words on the stoplist are to be ignored.

setUseStoplist

public void setUseStoplist(boolean useStoplist)
Sets whether the words that are on the stoplist are to be ignored (the stoplist is in weka.core.StopWords).

Parameters:
useStoplist - true if the tokens that are on a stoplist are to be ignored.

useStoplistTipText

public java.lang.String useStoplistTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

sortArray

private static void sortArray(int[] array)

determineSelectedRange

private void determineSelectedRange()

determineDictionary

private void determineDictionary()

convertInstance

private void convertInstance(Instance instance)
                      throws java.lang.Exception
Throws:
java.lang.Exception

convertInstancewoDocNorm

private int convertInstancewoDocNorm(Instance instance,
                                     FastVector v)

main

public static void main(java.lang.String[] argv)
Main method for testing this class.

Parameters:
argv - should contain arguments to the filter: use -h for help