java.lang.Object weka.filters.Filter weka.filters.unsupervised.attribute.StringToWordVector
Converts String attributes into a set of attributes representing word occurrence information from the text contained in the strings. The set of words (attributes) is determined by the first batch filtered (typically training data).
Nested Class Summary | |
private class |
StringToWordVector.AlphabeticStringTokenizer
|
private class |
StringToWordVector.Count
Used to store word counts for dictionary selection based on a threshold. |
Field Summary | |
private double |
avgDocLength
Contains the average length of documents (among the first batch of instances aka training data). |
private java.lang.String |
delimiters
Delimiters used in tokenization |
private int[] |
docsCounts
Contains the number of documents (instances) a particular word appears in. |
private java.util.TreeMap |
m_Dictionary
Contains a mapping of valid words to attribute indexes |
private boolean |
m_FirstBatchDone
True if the first batch has been done |
private boolean |
m_IDFTransform
True if word frequencies should be transformed into fij*log(numOfDocs/numOfDocsWithWordi) |
private boolean |
m_lowerCaseTokens
True if all tokens should be downcased |
private boolean |
m_normalizeDocLength
True if document's (instance's) word frequencies are to be normalized. |
private boolean |
m_onlyAlphabeticTokens
True if tokens are to be formed only from alphabetic sequences of characters. |
private boolean |
m_OutputCounts
True if output instances should contain word frequency rather than boolean 0 or 1. |
private java.lang.String |
m_Prefix
A String prefix for the attribute names |
protected Range |
m_SelectedRange
Range of columns to convert to word vectors |
private boolean |
m_TFTransform
True if word frequencies should be transformed into log(1+fi) where fi is the frequency of word i |
private boolean |
m_useStoplist
True if tokens that are on a stoplist are to be ignored. |
private int |
m_WordsToKeep
The default number of words (per class if there is a class attribute assigned) to attempt to keep. |
private int |
numInstances
Contains the number of documents (instances) in the input format from which the dictionary is created. |
Fields inherited from class weka.filters.Filter |
m_NewBatch |
Constructor Summary | |
StringToWordVector()
Default constructor. |
|
StringToWordVector(int wordsToKeep)
Constructor that allows specification of the target number of words in the output. |
Method Summary | |
java.lang.String |
attributeNamePrefixTipText()
Returns the tip text for this property |
boolean |
batchFinished()
Signify that this batch of input to the filter is finished. |
private void |
convertInstance(Instance instance)
|
private int |
convertInstancewoDocNorm(Instance instance,
FastVector v)
|
java.lang.String |
delimitersTipText()
Returns the tip text for this property |
private void |
determineDictionary()
|
private void |
determineSelectedRange()
|
java.lang.String |
getAttributeNamePrefix()
Get the attribute name prefix. |
java.lang.String |
getDelimiters()
Get the value of delimiters. |
boolean |
getIDFTransform()
Gets whether the word frequencies in a document should be transformed into fij*log(numOfDocs/numOfDocsWithWordi), where fij is the frequency of word i in document (instance) j. |
boolean |
getLowerCaseTokens()
Gets whether the tokens are to be downcased. |
boolean |
getNormalizeDocLength()
Gets whether the word frequencies for a document (instance) should be normalized. |
boolean |
getOnlyAlphabeticTokens()
Gets whether the tokens are to be formed only from contiguous alphabetic sequences. |
java.lang.String[] |
getOptions()
Gets the current settings of the filter. |
boolean |
getOutputWordCounts()
Gets whether output instances contain 0 or 1 indicating word presence, or word counts. |
Range |
getSelectedRange()
Get the value of m_SelectedRange. |
boolean |
getTFTransform()
Gets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j. |
boolean |
getUseStoplist()
Gets whether the words on the stoplist are to be ignored (the stoplist is in weka.core.StopWords). |
int |
getWordsToKeep()
Gets the number of words (per class if there is a class attribute assigned) to attempt to keep. |
java.lang.String |
globalInfo()
Returns a string describing this filter |
java.lang.String |
IDFTransformTipText()
Returns the tip text for this property |
boolean |
input(Instance instance)
Input an instance for filtering. |
java.util.Enumeration |
listOptions()
Returns an enumeration describing the available options |
java.lang.String |
lowerCaseTokensTipText()
Returns the tip text for this property. |
static void |
main(java.lang.String[] argv)
Main method for testing this class. |
java.lang.String |
normalizeDocLengthTipText()
Returns the tip text for this property |
java.lang.String |
onlyAlphabeticTokensTipText()
Returns the tip text for this property. |
java.lang.String |
outputWordCountsTipText()
Returns the tip text for this property |
void |
setAttributeNamePrefix(java.lang.String newPrefix)
Set the attribute name prefix. |
void |
setDelimiters(java.lang.String newDelimiters)
Set the value of delimiters. |
void |
setIDFTransform(boolean IDFTransform)
Sets whether the word frequencies in a document should be transformed into fij*log(numOfDocs/numOfDocsWithWordi), where fij is the frequency of word i in document (instance) j. |
boolean |
setInputFormat(Instances instanceInfo)
Sets the format of the input instances. |
void |
setLowerCaseTokens(boolean downCaseTokens)
Sets whether the tokens are to be downcased. |
void |
setNormalizeDocLength(boolean normalizeDocLength)
Sets whether the word frequencies for a document (instance) should be normalized. |
void |
setOnlyAlphabeticTokens(boolean tokenizeOnlyAlphabeticSequences)
Sets whether tokens are to be formed only from contiguous alphabetic character sequences. |
void |
setOptions(java.lang.String[] options)
Parses a given list of options controlling the behaviour of this object. |
void |
setOutputWordCounts(boolean outputWordCounts)
Sets whether output instances contain 0 or 1 indicating word presence, or word counts. |
void |
setSelectedRange(java.lang.String newSelectedRange)
Set the value of m_SelectedRange. |
void |
setTFTransform(boolean TFTransform)
Sets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j. |
void |
setUseStoplist(boolean useStoplist)
Sets whether the words that are on a stoplist are to be ignored (the stoplist is in weka.core.StopWords). |
void |
setWordsToKeep(int newWordsToKeep)
Sets the number of words (per class if there is a class attribute assigned) to attempt to keep. |
private static void |
sortArray(int[] array)
|
java.lang.String |
TFTransformTipText()
Returns the tip text for this property |
java.lang.String |
useStoplistTipText()
Returns the tip text for this property. |
java.lang.String |
wordsToKeepTipText()
Returns the tip text for this property |
Methods inherited from class weka.filters.Filter |
batchFilterFile, bufferInput, copyStringValues, copyStringValues, filterFile, flushInput, getInputFormat, getInputStringIndex, getOutputFormat, getOutputStringIndex, getStringIndices, inputFormat, inputFormatPeek, isOutputFormatDefined, numPendingOutput, output, outputFormat, outputFormatPeek, outputPeek, push, resetQueue, setOutputFormat, useFilter |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
private java.lang.String delimiters
protected Range m_SelectedRange
private java.util.TreeMap m_Dictionary
private boolean m_FirstBatchDone
private boolean m_OutputCounts
private java.lang.String m_Prefix
private int[] docsCounts
private int numInstances
private double avgDocLength
private int m_WordsToKeep
private boolean m_TFTransform
private boolean m_normalizeDocLength
private boolean m_IDFTransform
private boolean m_onlyAlphabeticTokens
private boolean m_lowerCaseTokens
private boolean m_useStoplist
Constructor Detail |
public StringToWordVector()
public StringToWordVector(int wordsToKeep)
wordsToKeep
- the number of words in the output vector (per class
if assigned).
Method Detail |
public java.util.Enumeration listOptions()
listOptions
in interface OptionHandler
public void setOptions(java.lang.String[] options) throws java.lang.Exception
-C
Output word counts rather than boolean word presence.
-D delimiter_characters
Specify set of delimiter characters
(default: " \n\t.,:'\"()?!")
-R index1,index2-index4,...
Specify list of string attributes to convert to words.
(default: all string attributes)
-P attribute_name_prefix
Specify a prefix for the created attribute names.
(default: "")
-W number_of_words_to_keep
Specify number of word fields to create.
Other, less useful words will be discarded.
(default: 1000)
-A
Only tokenize contiguous alphabetic sequences.
-L
Convert all tokens to lower case before adding to the dictionary.
-S
Do not add words to the dictionary which are on the stop list.
-T
Transform word frequencies to log(1+fij) where fij is frequency of word i
in document j.
-I
Transform word frequencies to fij*log(numOfDocs/numOfDocsWithWordi)
where fij is frequency of word i in document j.
-N
Normalize word frequencies for each document (instance). The frequencies
are normalized to the average length of the documents specified in the input
format.
setOptions
in interface OptionHandler
options
- the list of options as an array of strings
java.lang.Exception
- if an option is not supported
public java.lang.String[] getOptions()
getOptions
in interface OptionHandler
public boolean setInputFormat(Instances instanceInfo) throws java.lang.Exception
setInputFormat
in class Filter
instanceInfo
- an Instances object containing the input
instance structure (any instances contained in the object are
ignored - only the structure is required).
java.lang.Exception
- if the input format can't be set
successfully
public boolean input(Instance instance) throws java.lang.Exception
input
in class Filter
instance
- the input instance.
java.lang.IllegalStateException
- if no input structure has been defined.
java.lang.Exception
- if the input instance was not of the correct
format or if there was a problem with the filtering.
public boolean batchFinished() throws java.lang.Exception
batchFinished
in class Filter
java.lang.IllegalStateException
- if no input structure has been defined.
java.lang.Exception
- if there was a problem finishing the batch.
public java.lang.String globalInfo()
public boolean getOutputWordCounts()
public void setOutputWordCounts(boolean outputWordCounts)
outputWordCounts
- true if word counts should be output.
public java.lang.String outputWordCountsTipText()
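The difference between word-presence and word-count output can be illustrated with a small standalone sketch. It does not use the Weka classes: `WordVectorSketch`, `toVector`, and the toy dictionary are hypothetical names, and the sketch assumes the dictionary maps each kept word to an index in the range 0 to size-1, loosely mirroring the role described for `m_Dictionary`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

public class WordVectorSketch {

    // Builds a vector with one slot per dictionary word.
    // outputCounts == false: each slot is 0 or 1 (word presence).
    // outputCounts == true:  each slot holds the number of occurrences.
    // Tokens that are not in the dictionary are skipped.
    static double[] toVector(List<String> tokens,
                             TreeMap<String, Integer> dictionary,
                             boolean outputCounts) {
        double[] vector = new double[dictionary.size()];
        for (String token : tokens) {
            Integer index = dictionary.get(token);
            if (index == null) {
                continue; // word was not selected for the dictionary
            }
            vector[index] = outputCounts ? vector[index] + 1 : 1;
        }
        return vector;
    }

    public static void main(String[] args) {
        // Hypothetical dictionary: word -> attribute index (assumed 0-based).
        TreeMap<String, Integer> dictionary = new TreeMap<>();
        dictionary.put("cat", 0);
        dictionary.put("dog", 1);

        List<String> tokens = Arrays.asList("cat", "cat", "bird", "dog");
        System.out.println(Arrays.toString(toVector(tokens, dictionary, false)));
        System.out.println(Arrays.toString(toVector(tokens, dictionary, true)));
    }
}
```

With the toy input above, the presence vector marks both dictionary words with 1, while the count vector records that "cat" occurred twice.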
public java.lang.String getDelimiters()
public void setDelimiters(java.lang.String newDelimiters)
public java.lang.String delimitersTipText()
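As a rough illustration of delimiter-based tokenization, the following standalone sketch splits a string on a set of delimiter characters using `java.util.StringTokenizer`. It is not the filter's own code: `DelimiterTokenizerSketch` and `tokenize` are hypothetical names, and the delimiter string is an assumed reading of the default listed under the -D option.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class DelimiterTokenizerSketch {

    // Assumed delimiter set: space, newline, tab, and common punctuation.
    static final String DELIMITERS = " \n\t.,:'\"()?!";

    // Splits text on any of the delimiter characters; optionally lower-cases
    // each token, in the spirit of the lowerCaseTokens setting.
    static List<String> tokenize(String text, String delimiters, boolean lowerCase) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text, delimiters);
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            tokens.add(lowerCase ? token.toLowerCase() : token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [hello, world, hello, again]
        System.out.println(tokenize("Hello, world! (Hello again.)", DELIMITERS, true));
    }
}
```

Passing a different delimiter string changes where tokens are split, which is the effect `setDelimiters` is described as controlling.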
public Range getSelectedRange()
public void setSelectedRange(java.lang.String newSelectedRange)
newSelectedRange
- Value to assign to m_SelectedRange.
public java.lang.String getAttributeNamePrefix()
public void setAttributeNamePrefix(java.lang.String newPrefix)
newPrefix
- String to use as the attribute name prefix.
public java.lang.String attributeNamePrefixTipText()
public int getWordsToKeep()
public void setWordsToKeep(int newWordsToKeep)
newWordsToKeep
- the target number of words in the output
vector (per class if assigned).
public java.lang.String wordsToKeepTipText()
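The phrase "attempt to keep" suggests that the number of words retained is a target rather than an exact figure. One plausible way to select a dictionary by a count threshold is sketched below. This is an assumption for illustration, not the filter's documented algorithm; `DictionaryPruneSketch` and `selectWords` are hypothetical names.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class DictionaryPruneSketch {

    // Keeps every word whose count is at least the count of the
    // wordsToKeep-th most frequent word. Ties at the threshold are all
    // kept, so the result can contain more than wordsToKeep words.
    static Set<String> selectWords(Map<String, Integer> counts, int wordsToKeep) {
        if (wordsToKeep <= 0) {
            throw new IllegalArgumentException("wordsToKeep must be positive");
        }
        int[] sorted = counts.values().stream().mapToInt(Integer::intValue).toArray();
        Arrays.sort(sorted); // ascending order
        int threshold = sorted.length <= wordsToKeep
                ? 1
                : sorted[sorted.length - wordsToKeep];
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= threshold) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("a", 5);
        counts.put("b", 3);
        counts.put("c", 3);
        counts.put("d", 1);
        // Target is 2 words, but "b" and "c" tie at the threshold.
        // prints [a, b, c]
        System.out.println(selectWords(counts, 2));
    }
}
```

In this sketch a tie at the threshold count is the reason the result can exceed the requested number of words.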
public boolean getTFTransform()
public void setTFTransform(boolean TFTransform)
public java.lang.String TFTransformTipText()
public boolean getIDFTransform()
public void setIDFTransform(boolean IDFTransform)
public java.lang.String IDFTransformTipText()
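The two frequency transforms named above, log(1+fij) and fij*log(numOfDocs/numOfDocsWithWordi), can be written out directly. The sketch below is standalone and makes two assumptions that this page does not state: the logarithm is the natural log, and when both transforms are enabled they are applied in sequence (TF first, then IDF). `TfIdfSketch` and its method names are hypothetical.

```java
public class TfIdfSketch {

    // TF transform: log(1 + fij), where fij is the frequency of word i
    // in document j. Natural log is an assumption.
    static double tfTransform(double fij) {
        return Math.log(1.0 + fij);
    }

    // IDF transform: fij * log(numDocs / numDocsWithWord).
    // A word that appears in every document is scaled to zero.
    static double idfTransform(double fij, int numDocs, int numDocsWithWord) {
        return fij * Math.log((double) numDocs / numDocsWithWord);
    }

    public static void main(String[] args) {
        double rawFrequency = 3.0; // word occurs 3 times in the document
        int numDocs = 100;         // documents in the first batch
        int numDocsWithWord = 10;  // documents containing the word

        double tf = tfTransform(rawFrequency);
        // Assumed combination order: IDF applied to the TF-transformed value.
        double tfIdf = idfTransform(tf, numDocs, numDocsWithWord);

        System.out.println("TF only:      " + tf);
        System.out.println("IDF only:     " + idfTransform(rawFrequency, numDocs, numDocsWithWord));
        System.out.println("TF then IDF:  " + tfIdf);
    }
}
```

The IDF factor depends only on document counts from the first batch, which is consistent with the `docsCounts` and `numInstances` fields being described as derived from that batch.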
public boolean getNormalizeDocLength()
public void setNormalizeDocLength(boolean normalizeDocLength)
public java.lang.String normalizeDocLengthTipText()
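Document-length normalization rescales a document's word frequencies so that its length matches the average document length. How "length" is measured is not stated on this page; the standalone sketch below assumes the Euclidean norm of the frequency vector. `DocLengthNormSketch` and `normalize` are hypothetical names.

```java
import java.util.Arrays;

public class DocLengthNormSketch {

    // Scales the frequencies so the vector's Euclidean length equals
    // avgDocLength. An all-zero document is returned unchanged (all zeros).
    static double[] normalize(double[] frequencies, double avgDocLength) {
        double sumOfSquares = 0.0;
        for (double f : frequencies) {
            sumOfSquares += f * f;
        }
        double docLength = Math.sqrt(sumOfSquares);
        double[] normalized = new double[frequencies.length];
        if (docLength == 0.0) {
            return normalized; // nothing to scale
        }
        for (int i = 0; i < frequencies.length; i++) {
            normalized[i] = frequencies[i] * avgDocLength / docLength;
        }
        return normalized;
    }

    public static void main(String[] args) {
        // Document length is sqrt(3^2 + 4^2) = 5; target average length is 10,
        // so every frequency is doubled.
        // prints [6.0, 8.0]
        System.out.println(Arrays.toString(normalize(new double[] {3.0, 4.0}, 10.0)));
    }
}
```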
public boolean getOnlyAlphabeticTokens()
public void setOnlyAlphabeticTokens(boolean tokenizeOnlyAlphabeticSequences)
public java.lang.String onlyAlphabeticTokensTipText()
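Forming tokens only from contiguous alphabetic sequences means that digits and punctuation act as separators. A minimal standalone sketch follows; it assumes "alphabetic" means the ASCII letters a-z and A-Z, which this page does not specify. `AlphabeticTokensSketch` and `alphabeticTokens` are hypothetical names, not the nested `AlphabeticStringTokenizer` class.

```java
import java.util.ArrayList;
import java.util.List;

public class AlphabeticTokensSketch {

    // Returns maximal runs of ASCII letters; every other character
    // ends the current token and is discarded.
    static List<String> alphabeticTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean isLetter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            if (isLetter) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString()); // flush the trailing token
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [abc, def, x, ray]
        System.out.println(alphabeticTokens("abc123def, x-ray"));
    }
}
```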
public boolean getLowerCaseTokens()
public void setLowerCaseTokens(boolean downCaseTokens)
downCaseTokens
- should be true if only lower case tokens are
to be formed.
public java.lang.String lowerCaseTokensTipText()
public boolean getUseStoplist()
public void setUseStoplist(boolean useStoplist)
useStoplist
- true if the tokens that are on a stoplist are to be
ignored.
public java.lang.String useStoplistTipText()
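Lower-casing and stoplist filtering can be pictured as a small post-processing step on the token stream. The standalone sketch below uses a tiny made-up stoplist rather than `weka.core.StopWords`, and it assumes that stoplist lookups are case-insensitive; `StoplistFilterSketch`, `STOP_WORDS`, and `filterTokens` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StoplistFilterSketch {

    // Hypothetical stoplist for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "of", "and"));

    // Optionally lower-cases each token, then optionally drops tokens
    // found on the stoplist (lookup is done on the lower-cased form).
    static List<String> filterTokens(List<String> tokens,
                                     boolean lowerCaseTokens,
                                     boolean useStoplist) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            String word = lowerCaseTokens ? token.toLowerCase() : token;
            if (useStoplist && STOP_WORDS.contains(word.toLowerCase())) {
                continue; // ignore stop words
            }
            result.add(word);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "Cat", "of", "Doom");
        // prints [cat, doom]
        System.out.println(filterTokens(tokens, true, true));
        // prints [The, Cat, of, Doom]
        System.out.println(filterTokens(tokens, false, false));
    }
}
```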
private static void sortArray(int[] array)
private void determineSelectedRange()
private void determineDictionary()
private void convertInstance(Instance instance) throws java.lang.Exception
java.lang.Exception
private int convertInstancewoDocNorm(Instance instance, FastVector v)
public static void main(java.lang.String[] argv)
argv
- should contain arguments to the filter:
use -h for help