StringToWordVector (Documentation for extended WEKA including Ensembles of Hierarchically Nested Dichotomies)

weka.filters.unsupervised.attribute
Class StringToWordVector

java.lang.Object
  weka.filters.Filter
      weka.filters.unsupervised.attribute.StringToWordVector

All Implemented Interfaces:: OptionHandler, java.io.Serializable, UnsupervisedFilter

public class StringToWordVector
extends Filter
implements UnsupervisedFilter, OptionHandler

Converts String attributes into a set of attributes representing word occurrence information from the text contained in the strings. The set of words (attributes) is determined by the first batch filtered (typically training data).
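The first-batch rule can be illustrated with a small standalone sketch. It does not use WEKA; the class `FirstBatchDictionarySketch`, its method names, and the whitespace-only tokenization are assumptions made for illustration. The point it shows is that the word-to-index mapping is fixed once the first batch is finished, so words seen only in later batches produce no attribute.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of WEKA.
class FirstBatchDictionarySketch {
    // Sorted mapping of words to attribute indexes, fixed after the first batch.
    private final TreeMap<String, Integer> dictionary = new TreeMap<>();
    private boolean firstBatchDone = false;

    // Builds the dictionary from the first batch; later calls are ignored.
    void finishFirstBatch(List<String> documents) {
        if (firstBatchDone) {
            return;
        }
        for (String doc : documents) {
            for (String word : doc.split("\\s+")) { // simplification: whitespace only
                if (!word.isEmpty()) {
                    dictionary.put(word, 0);
                }
            }
        }
        int index = 0;
        for (Map.Entry<String, Integer> e : dictionary.entrySet()) {
            e.setValue(index++); // assign indexes in sorted word order
        }
        firstBatchDone = true;
    }

    // Converts a document to a 0/1 vector; words outside the dictionary are dropped.
    int[] toPresenceVector(String doc) {
        int[] vector = new int[dictionary.size()];
        for (String word : doc.split("\\s+")) {
            Integer idx = dictionary.get(word);
            if (idx != null) {
                vector[idx] = 1;
            }
        }
        return vector;
    }
}
```

With a first batch of "the cat sat" and "the dog", a later document "the bird sat" sets only the positions for "sat" and "the"; "bird" has no attribute.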

Version:: $Revision: 1.7 $
Author:: Len Trigg (len@reeltwo.com), Stuart Inglis (stuart@reeltwo.com)
See Also:: Serialized Form

Nested Class Summary
`private class`	`StringToWordVector.AlphabeticStringTokenizer`
`private class`	`StringToWordVector.Count` Used to store word counts for dictionary selection based on a threshold.

Field Summary
`private double`	`avgDocLength` Contains the average length of documents (among the first batch of instances aka training data).
`private java.lang.String`	`delimiters` Delimiters used in tokenization
`private int[]`	`docsCounts` Contains the number of documents (instances) a particular word appears in.
`private java.util.TreeMap`	`m_Dictionary` Contains a mapping of valid words to attribute indexes
`private boolean`	`m_FirstBatchDone` True if the first batch has been done
`private boolean`	`m_IDFTransform` True if word frequencies should be transformed into fij*log(numOfDocs/numOfDocsWithWordi)
`private boolean`	`m_lowerCaseTokens` True if all tokens should be downcased
`private boolean`	`m_normalizeDocLength` True if document's (instance's) word frequencies are to be normalized.
`private boolean`	`m_onlyAlphabeticTokens` True if tokens are to be formed only from alphabetic sequences of characters.
`private boolean`	`m_OutputCounts` True if output instances should contain word frequency rather than boolean 0 or 1.
`private java.lang.String`	`m_Prefix` A String prefix for the attribute names
`protected Range`	`m_SelectedRange` Range of columns to convert to word vectors
`private boolean`	`m_TFTransform` True if word frequencies should be transformed into log(1+fij) where fij is the frequency of word i in document (instance) j
`private boolean`	`m_useStoplist` True if tokens that are on a stoplist are to be ignored.
`private int`	`m_WordsToKeep` The default number of words (per class if there is a class attribute assigned) to attempt to keep.
`private int`	`numInstances` Contains the number of documents (instances) in the input format from which the dictionary is created.

Fields inherited from class weka.filters.Filter

m_NewBatch

Constructor Summary
`StringToWordVector()` Default constructor.
`StringToWordVector(int wordsToKeep)` Constructor that allows specification of the target number of words in the output.

Method Summary
`java.lang.String`	`attributeNamePrefixTipText()` Returns the tip text for this property
`boolean`	`batchFinished()` Signify that this batch of input to the filter is finished.
`private void`	`convertInstance(Instance instance)`
`private int`	`convertInstancewoDocNorm(Instance instance, FastVector v)`
`java.lang.String`	`delimitersTipText()` Returns the tip text for this property
`private void`	`determineDictionary()`
`private void`	`determineSelectedRange()`
`java.lang.String`	`getAttributeNamePrefix()` Get the attribute name prefix.
`java.lang.String`	`getDelimiters()` Get the value of delimiters.
`boolean`	`getIDFTransform()` Gets whether the word frequencies in a document should be transformed into: fij*log(num of Docs/num of Docs with word i) where fij is the frequency of word i in document (instance) j.
`boolean`	`getLowerCaseTokens()` Gets whether the tokens are to be downcased.
`boolean`	`getNormalizeDocLength()` Gets whether the word frequencies for a document (instance) should be normalized.
`boolean`	`getOnlyAlphabeticTokens()` Gets whether the tokens are to be formed only from contiguous alphabetic sequences.
`java.lang.String[]`	`getOptions()` Gets the current settings of the filter.
`boolean`	`getOutputWordCounts()` Gets whether output instances contain 0 or 1 indicating word presence, or word counts.
`Range`	`getSelectedRange()` Get the value of m_SelectedRange.
`boolean`	`getTFTransform()` Gets whether the word frequencies should be transformed into log(1+fij) where fij is the frequency of word i in document (instance) j.
`boolean`	`getUseStoplist()` Gets whether the words on the stoplist are to be ignored (the stoplist is in weka.core.StopWords).
`int`	`getWordsToKeep()` Gets the number of words (per class if there is a class attribute assigned) to attempt to keep.
`java.lang.String`	`globalInfo()` Returns a string describing this filter
`java.lang.String`	`IDFTransformTipText()` Returns the tip text for this property
`boolean`	`input(Instance instance)` Input an instance for filtering.
`java.util.Enumeration`	`listOptions()` Returns an enumeration describing the available options
`java.lang.String`	`lowerCaseTokensTipText()` Returns the tip text for this property.
`static void`	`main(java.lang.String[] argv)` Main method for testing this class.
`java.lang.String`	`normalizeDocLengthTipText()` Returns the tip text for this property
`java.lang.String`	`onlyAlphabeticTokensTipText()` Returns the tip text for this property.
`java.lang.String`	`outputWordCountsTipText()` Returns the tip text for this property
`void`	`setAttributeNamePrefix(java.lang.String newPrefix)` Set the attribute name prefix.
`void`	`setDelimiters(java.lang.String newDelimiters)` Set the value of delimiters.
`void`	`setIDFTransform(boolean IDFTransform)` Sets whether the word frequencies in a document should be transformed into: fij*log(num of Docs/num of Docs with word i) where fij is the frequency of word i in document (instance) j.
`boolean`	`setInputFormat(Instances instanceInfo)` Sets the format of the input instances.
`void`	`setLowerCaseTokens(boolean downCaseTokens)` Sets whether the tokens are to be downcased.
`void`	`setNormalizeDocLength(boolean normalizeDocLength)` Sets whether the word frequencies for a document (instance) should be normalized.
`void`	`setOnlyAlphabeticTokens(boolean tokenizeOnlyAlphabeticSequences)` Sets whether tokens are to be formed only from contiguous alphabetic character sequences.
`void`	`setOptions(java.lang.String[] options)` Parses a given list of options controlling the behaviour of this object.
`void`	`setOutputWordCounts(boolean outputWordCounts)` Sets whether output instances contain 0 or 1 indicating word presence, or word counts.
`void`	`setSelectedRange(java.lang.String newSelectedRange)` Set the value of m_SelectedRange.
`void`	`setTFTransform(boolean TFTransform)` Sets whether the word frequencies should be transformed into log(1+fij) where fij is the frequency of word i in document (instance) j.
`void`	`setUseStoplist(boolean useStoplist)` Sets whether the words that are on a stoplist are to be ignored (the stoplist is in weka.core.StopWords).
`void`	`setWordsToKeep(int newWordsToKeep)` Sets the number of words (per class if there is a class attribute assigned) to attempt to keep.
`private static void`	`sortArray(int[] array)`
`java.lang.String`	`TFTransformTipText()` Returns the tip text for this property
`java.lang.String`	`useStoplistTipText()` Returns the tip text for this property.
`java.lang.String`	`wordsToKeepTipText()` Returns the tip text for this property

Methods inherited from class weka.filters.Filter

batchFilterFile, bufferInput, copyStringValues, copyStringValues, filterFile, flushInput, getInputFormat, getInputStringIndex, getOutputFormat, getOutputStringIndex, getStringIndices, inputFormat, inputFormatPeek, isOutputFormatDefined, numPendingOutput, output, outputFormat, outputFormatPeek, outputPeek, push, resetQueue, setOutputFormat, useFilter

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

delimiters

private java.lang.String delimiters

Delimiters used in tokenization
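As a rough illustration of delimiter-based tokenization, the sketch below splits text on a set of delimiter characters using `java.util.StringTokenizer`. Whether the filter itself uses `StringTokenizer` is an assumption, as are the class name `DelimiterTokenizerSketch` and the exact characters in `DEFAULT_DELIMITERS`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical illustration class; not part of WEKA.
class DelimiterTokenizerSketch {
    // Assumed delimiter set: whitespace plus common punctuation.
    static final String DEFAULT_DELIMITERS = " \n\t.,:'\"()?!";

    // Every character in 'delimiters' separates tokens; empty tokens are skipped.
    static List<String> tokenize(String text, String delimiters) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text, delimiters);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }
}
```

Because the apostrophe is in the delimiter set, a contraction such as "it's" is split into two tokens, "it" and "s".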

m_SelectedRange

protected Range m_SelectedRange

Range of columns to convert to word vectors

m_Dictionary

private java.util.TreeMap m_Dictionary

Contains a mapping of valid words to attribute indexes

m_FirstBatchDone

private boolean m_FirstBatchDone

True if the first batch has been done

m_OutputCounts

private boolean m_OutputCounts

True if output instances should contain word frequency rather than boolean 0 or 1.
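The difference between the two output modes can be sketched without WEKA. The class `WordCountSketch` and its method are hypothetical; words absent from the returned map correspond to a value of 0.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of WEKA.
class WordCountSketch {
    // Returns raw counts when outputCounts is true, otherwise 1 for each word present.
    static Map<String, Integer> vectorize(List<String> tokens, boolean outputCounts) {
        Map<String, Integer> values = new TreeMap<>();
        for (String token : tokens) {
            if (outputCounts) {
                values.merge(token, 1, Integer::sum); // accumulate occurrences
            } else {
                values.put(token, 1); // presence only
            }
        }
        return values;
    }
}
```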

m_Prefix

private java.lang.String m_Prefix

A String prefix for the attribute names

docsCounts

private int[] docsCounts

Contains the number of documents (instances) a particular word appears in. The counts are stored with the same indexing as given by m_Dictionary.
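A document count of this kind can be computed as in the following sketch, which counts each word at most once per document. The class `DocumentFrequencySketch` is hypothetical and keys the counts by word rather than by attribute index, purely for readability.

```java
import java.util.HashSet;
import java.util.List;
import java.util.TreeMap;

// Hypothetical illustration class; not part of WEKA.
class DocumentFrequencySketch {
    // For each word, counts the number of documents that contain it at least once.
    static TreeMap<String, Integer> documentFrequencies(List<List<String>> documents) {
        TreeMap<String, Integer> docCounts = new TreeMap<>();
        for (List<String> doc : documents) {
            // De-duplicate so repeated words within one document count once.
            for (String word : new HashSet<>(doc)) {
                docCounts.merge(word, 1, Integer::sum);
            }
        }
        return docCounts;
    }
}
```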

numInstances

private int numInstances

Contains the number of documents (instances) in the input format from which the dictionary is created. It is used in IDF transform.

avgDocLength

private double avgDocLength

Contains the average length of documents (among the first batch of instances aka training data). This is used in length normalization of documents which will be normalized to average document length.

m_WordsToKeep

private int m_WordsToKeep

The default number of words (per class if there is a class attribute assigned) to attempt to keep.
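One plausible reading of "attempt to keep" is a count threshold: sort the word counts, take the count of the N-th most frequent word as the cutoff, and keep every word at or above it. The sketch below follows that reading. It is an assumption about the selection rule, not a copy of the filter's code, and it ignores the per-class handling. The class `WordsToKeepSketch` is hypothetical.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of WEKA.
class WordsToKeepSketch {
    // Keeps every word whose count reaches the count of the wordsToKeep-th most frequent word.
    static TreeMap<String, Integer> prune(Map<String, Integer> counts, int wordsToKeep) {
        int[] sorted = new int[counts.size()];
        int i = 0;
        for (int count : counts.values()) {
            sorted[i++] = count;
        }
        Arrays.sort(sorted); // ascending order

        // With fewer words than the target (or a non-positive target), keep everything.
        int threshold = 1;
        if (wordsToKeep > 0 && sorted.length > wordsToKeep) {
            threshold = Math.max(1, sorted[sorted.length - wordsToKeep]);
        }

        TreeMap<String, Integer> kept = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= threshold) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }
}
```

Under this rule ties at the threshold are all kept, so the result can contain more words than the target, which is consistent with the number being a target rather than a hard limit.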

m_TFTransform

private boolean m_TFTransform

True if word frequencies should be transformed into log(1+fij) where fij is the frequency of word i in document (instance) j

m_normalizeDocLength

private boolean m_normalizeDocLength

True if document's (instance's) word frequencies are to be normalized. They are normalized to the average length of the documents specified as input format.
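The text does not say how document length is measured. The sketch below assumes the Euclidean norm of the word-frequency vector and rescales each document so that its length equals the average length from the first batch. The class `DocLengthNormalizationSketch` and the choice of norm are assumptions.

```java
// Hypothetical illustration class; not part of WEKA.
class DocLengthNormalizationSketch {
    // Euclidean length of a frequency vector (assumed definition of document length).
    static double euclideanLength(double[] values) {
        double sumOfSquares = 0.0;
        for (double v : values) {
            sumOfSquares += v * v;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Rescales the vector so that its length equals avgDocLength.
    static double[] normalize(double[] values, double avgDocLength) {
        double length = euclideanLength(values);
        double[] result = values.clone();
        if (length == 0.0) {
            return result; // empty document: nothing to scale
        }
        for (int i = 0; i < result.length; i++) {
            result[i] = values[i] * avgDocLength / length;
        }
        return result;
    }
}
```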

m_IDFTransform

private boolean m_IDFTransform

True if word frequencies should be transformed into fij*log(numOfDocs/numOfDocsWithWordi)
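The two frequency transforms documented for m_TFTransform and m_IDFTransform are simple enough to write out directly. The sketch below assumes the natural logarithm, which the text does not specify. The class `TfIdfSketch` and its method names are hypothetical.

```java
// Hypothetical illustration class; not part of WEKA.
class TfIdfSketch {
    // TF transform: log(1 + fij), which dampens large raw frequencies.
    static double tfTransform(double fij) {
        return Math.log(1.0 + fij);
    }

    // IDF transform: fij * log(numDocs / numDocsWithWord).
    // Words that occur in every document get a value of 0.
    static double idfTransform(double fij, int numDocs, int numDocsWithWord) {
        return fij * Math.log((double) numDocs / numDocsWithWord);
    }
}
```

When both transforms are enabled, one natural composition is `idfTransform(tfTransform(f), n, df)`, but the order in which the filter applies them is not stated here.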

m_onlyAlphabeticTokens

private boolean m_onlyAlphabeticTokens

True if tokens are to be formed only from alphabetic sequences of characters. (The delimiters string property is ignored if this is true).
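Tokenizing on contiguous alphabetic sequences might look like the sketch below, where any non-letter character ends the current token. Treating only ASCII letters as alphabetic is an assumption of this sketch, and the class `AlphabeticTokenizerSketch` is hypothetical. It is not the private `AlphabeticStringTokenizer` class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of WEKA.
class AlphabeticTokenizerSketch {
    // Splits text into maximal runs of ASCII letters; everything else is a separator.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean isLetter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            if (isLetter) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString()); // flush the final token
        }
        return tokens;
    }
}
```

Digits and underscores act as separators here, so "abc123def_ghi" yields three tokens. No delimiter string is consulted, which matches the note that the delimiters property is ignored in this mode.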

m_lowerCaseTokens

private boolean m_lowerCaseTokens

True if all tokens should be downcased

m_useStoplist

private boolean m_useStoplist

True if tokens that are on a stoplist are to be ignored.
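The lower-casing and stoplist switches can be pictured as a filter over the token stream. In the sketch below the stoplist is a caller-supplied set standing in for weka.core.StopWords, and checking the stoplist after lower-casing is an assumption. The class `TokenFilterSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; not part of WEKA.
class TokenFilterSketch {
    // Optionally lower-cases each token, then drops tokens found on the stoplist.
    // Pass null as the stoplist to disable stop-word removal.
    static List<String> filter(List<String> tokens, boolean lowerCase, Set<String> stoplist) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            String word = lowerCase ? token.toLowerCase(Locale.ROOT) : token;
            if (stoplist != null && stoplist.contains(word)) {
                continue; // ignored: on the stoplist
            }
            result.add(word);
        }
        return result;
    }
}
```

With a lower-case stoplist, enabling lower-casing also lets capitalized stop words such as "The" be matched and removed.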

Constructor Detail

StringToWordVector

public StringToWordVector()

Default constructor. Targets 1000 words in the output.

StringToWordVector

public StringToWordVector(int wordsToKeep)

Constructor that allows specification of the target number of words in the output.
Parameters:: wordsToKeep - the number of words in the output vector (per class if assigned).

Method Detail

listOptions

public java.util.Enumeration listOptions()

Returns an enumeration describing the available options

Specified by:: listOptions in interface OptionHandler

Returns:: an enumeration of all the available options

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception

Parses a given list of options controlling the behaviour of this object. Valid options are:

-C
Output word counts rather than boolean word presence.

-D delimiter_characters
Specify set of delimiter characters (default: " \n\t.,:'\"()?!")

-R index1,index2-index4,...
Specify list of string attributes to convert to words. (default: all string attributes)

-P attribute_name_prefix
Specify a prefix for the created attribute names. (default: "")

-W number_of_words_to_keep
Specify number of word fields to create. Other, less useful words will be discarded. (default: 1000)

-A
Only tokenize contiguous alphabetic sequences.

-L
Convert all tokens to lower case before adding to the dictionary.

-S
Do not add words to the dictionary which are on the stop list.

-T
Transform word frequencies to log(1+fij) where fij is frequency of word i in document j.

-I
Transform word frequencies to fij*log(numOfDocs/numOfDocsWithWordi) where fij is frequency of word i in document j.

-N
Normalize word frequencies for each document (instance). The frequencies are normalized to the average length of the documents specified in the input format.

Specified by:: setOptions in interface OptionHandler

Parameters:: options - the list of options as an array of strings
Throws:: java.lang.Exception - if an option is not supported
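An options array for this method is an ordinary `String[]` in which flags that take a value are followed by that value as a separate element. The helper below assembles such an array from a few of the flags listed above (-C, -W, -L, -I). The class `OptionsSketch` and its method are hypothetical conveniences, not part of the filter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of WEKA.
class OptionsSketch {
    // Builds an options array using the documented -C, -W, -L and -I flags.
    static String[] buildOptions(boolean outputCounts, int wordsToKeep,
                                 boolean lowerCase, boolean idfTransform) {
        List<String> options = new ArrayList<>();
        if (outputCounts) {
            options.add("-C");
        }
        options.add("-W");
        options.add(Integer.toString(wordsToKeep)); // value is a separate element
        if (lowerCase) {
            options.add("-L");
        }
        if (idfTransform) {
            options.add("-I");
        }
        return options.toArray(new String[0]);
    }
}
```

The resulting array, for example `{"-C", "-W", "500", "-L", "-I"}`, is the kind of value that would be passed to `setOptions(String[])`.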

getOptions

public java.lang.String[] getOptions()

Gets the current settings of the filter.

Specified by:: getOptions in interface OptionHandler

Returns:: an array of strings suitable for passing to setOptions

setInputFormat

public boolean setInputFormat(Instances instanceInfo)
                       throws java.lang.Exception

Sets the format of the input instances.

Overrides:: setInputFormat in class Filter

Parameters:: instanceInfo - an Instances object containing the input instance structure (any instances contained in the object are ignored - only the structure is required).
Returns:: true if the outputFormat may be collected immediately
Throws:: java.lang.Exception - if the input format can't be set successfully

input

public boolean input(Instance instance)
              throws java.lang.Exception

Input an instance for filtering. The filter requires all training instances to be read before producing output.

Overrides:: input in class Filter

Parameters:: instance - the input instance.
Returns:: true if the filtered instance may now be collected with output().
Throws:: java.lang.IllegalStateException - if no input structure has been defined.; java.lang.Exception - if the input instance was not of the correct format or if there was a problem with the filtering.

batchFinished

public boolean batchFinished()
                      throws java.lang.Exception

Signify that this batch of input to the filter is finished. If the filter requires all instances prior to filtering, output() may now be called to retrieve the filtered instances.

Overrides:: batchFinished in class Filter

Returns:: true if there are instances pending output.
Throws:: java.lang.IllegalStateException - if no input structure has been defined.; java.lang.Exception - if there was a problem finishing the batch.

globalInfo

public java.lang.String globalInfo()

Returns a string describing this filter

Returns:: a description of the filter suitable for displaying in the explorer/experimenter gui

getOutputWordCounts

public boolean getOutputWordCounts()

Gets whether output instances contain 0 or 1 indicating word presence, or word counts.

Returns:: true if word counts should be output.

setOutputWordCounts

public void setOutputWordCounts(boolean outputWordCounts)

Sets whether output instances contain 0 or 1 indicating word presence, or word counts.

Parameters:: outputWordCounts - true if word counts should be output.

outputWordCountsTipText

public java.lang.String outputWordCountsTipText()

Returns the tip text for this property

Returns:: tip text for this property suitable for displaying in the explorer/experimenter gui

getDelimiters

public java.lang.String getDelimiters()

Get the value of delimiters.

Returns:: Value of delimiters.

setDelimiters

public void setDelimiters(java.lang.String newDelimiters)

Set the value of delimiters.

delimitersTipText

public java.lang.String delimitersTipText()

Returns the tip text for this property

Returns:: tip text for this property suitable for displaying in the explorer/experimenter gui

getSelectedRange

public Range getSelectedRange()

Get the value of m_SelectedRange.

Returns:: Value of m_SelectedRange.

setSelectedRange

public void setSelectedRange(java.lang.String newSelectedRange)

Set the value of m_SelectedRange.

Parameters:: newSelectedRange - Value to assign to m_SelectedRange.

getAttributeNamePrefix

public java.lang.String getAttributeNamePrefix()

Get the attribute name prefix.

Returns:: The current attribute name prefix.

setAttributeNamePrefix

public void setAttributeNamePrefix(java.lang.String newPrefix)

Set the attribute name prefix.

Parameters:: newPrefix - String to use as the attribute name prefix.

attributeNamePrefixTipText

public java.lang.String attributeNamePrefixTipText()

Returns the tip text for this property

Returns:: tip text for this property suitable for displaying in the explorer/experimenter gui

getWordsToKeep

public int getWordsToKeep()

Gets the number of words (per class if there is a class attribute assigned) to attempt to keep.

Returns:: the target number of words in the output vector (per class if assigned).

setWordsToKeep

public void setWordsToKeep(int newWordsToKeep)

Sets the number of words (per class if there is a class attribute assigned) to attempt to keep.

Parameters:: newWordsToKeep - the target number of words in the output vector (per class if assigned).

wordsToKeepTipText

public java.lang.String wordsToKeepTipText()

Returns the tip text for this property

Returns:: tip text for this property suitable for displaying in the explorer/experimenter gui

getTFTransform

public boolean getTFTransform()

Gets whether the word frequencies should be transformed into log(1+fij) where fij is the frequency of word i in document (instance) j.

Returns:: true if word frequencies are to be transformed.

setTFTransform

public void setTFTransform(boolean TFTransform)

Sets whether the word frequencies should be transformed into log(1+fij) where fij is the frequency of word i in document (instance) j.

TFTransformTipText

public java.lang.String TFTransformTipText()

Returns the tip text for this property

Returns:: tip text for this property suitable for displaying in the explorer/experimenter gui

getIDFTransform

public boolean getIDFTransform()

Gets whether the word frequencies in a document should be transformed into:
fij*log(num of Docs/num of Docs with word i)
where fij is the frequency of word i in document (instance) j.

Returns:: true if the word frequencies are to be transformed.

setIDFTransform

public void setIDFTransform(boolean IDFTransform)

Sets whether the word frequencies in a document should be transformed into:
fij*log(num of Docs/num of Docs with word i)
where fij is the frequency of word i in document (instance) j.

IDFTransformTipText

public java.lang.String IDFTransformTipText()

Returns the tip text for this property

Returns:: tip text for this property suitable for displaying in the explorer/experimenter gui

getNormalizeDocLength

public boolean getNormalizeDocLength()

Gets whether the word frequencies for a document (instance) should be normalized.

Returns:: true if word frequencies are to be normalized.

setNormalizeDocLength

public void setNormalizeDocLength(boolean normalizeDocLength)

Sets whether the word frequencies for a document (instance) should be normalized.

normalizeDocLengthTipText

public java.lang.String normalizeDocLengthTipText()

Returns the tip text for this property

Returns:: tip text for this property suitable for displaying in the explorer/experimenter gui

getOnlyAlphabeticTokens

public boolean getOnlyAlphabeticTokens()

Gets whether the tokens are to be formed only from contiguous alphabetic sequences. The delimiter string is ignored if this is true.

Returns:: true if tokens are to be formed from contiguous alphabetic characters.

setOnlyAlphabeticTokens

public void setOnlyAlphabeticTokens(boolean tokenizeOnlyAlphabeticSequences)

Sets whether tokens are to be formed only from contiguous alphabetic character sequences. The delimiter string is ignored if this option is set to true.

onlyAlphabeticTokensTipText

public java.lang.String onlyAlphabeticTokensTipText()

Returns the tip text for this property.

Returns:: tip text for this property suitable for displaying in the explorer/experimenter gui

getLowerCaseTokens

public boolean getLowerCaseTokens()

Gets whether the tokens are to be downcased.

Returns:: true if the tokens are to be downcased.

setLowerCaseTokens

public void setLowerCaseTokens(boolean downCaseTokens)

Sets whether the tokens are to be downcased. (This does not affect non-alphabetic characters in tokens.)

Parameters:: downCaseTokens - should be true if only lower case tokens are to be formed.

lowerCaseTokensTipText

public java.lang.String lowerCaseTokensTipText()

Returns the tip text for this property.

Returns:: tip text for this property suitable for displaying in the explorer/experimenter gui

getUseStoplist

public boolean getUseStoplist()

Gets whether the words on the stoplist are to be ignored (the stoplist is in weka.core.StopWords).

Returns:: true if the words on the stoplist are to be ignored.

setUseStoplist

public void setUseStoplist(boolean useStoplist)

Sets whether the words that are on a stoplist are to be ignored (the stoplist is in weka.core.StopWords).

Parameters:: useStoplist - true if the tokens that are on a stoplist are to be ignored.

useStoplistTipText

public java.lang.String useStoplistTipText()

Returns the tip text for this property.

Returns:: tip text for this property suitable for displaying in the explorer/experimenter gui

sortArray

private static void sortArray(int[] array)

determineSelectedRange

private void determineSelectedRange()

determineDictionary

private void determineDictionary()

convertInstance

private void convertInstance(Instance instance)
                      throws java.lang.Exception

Throws:: java.lang.Exception

convertInstancewoDocNorm

private int convertInstancewoDocNorm(Instance instance,
                                     FastVector v)

main

public static void main(java.lang.String[] argv)

Main method for testing this class.

Parameters:: argv - should contain arguments to the filter: use -h for help
