# TF-IDF
The prototypical application of TF-IDF transformations involves document collections, where each element represents a document. Documents are represented in a bag-of-words format, i.e. a dictionary whose keys are words and whose values are the number of times the word occurs in the document. For more details and further reading, see the reference section.
The TF-IDF transformation performs the following computation
$$
\text{TF-IDF}(w, d) = \mathrm{tf}(w, d) \cdot \log\left(\frac{N}{f(w)}\right)
$$
where $\mathrm{tf}(w, d)$ is the number of times word $w$ occurs in document $d$, $N$ is the total number of documents, and $f(w)$ is the number of documents in which $w$ occurs.
The transformed output is a column of type dict, where each key is a word in the document and each value is that word's TF-IDF score.
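The computation above can be sketched in pure Python over a list of bag-of-words dictionaries. This is an illustrative reference implementation of the formula, not GraphLab's internal code:

```python
import math

def tf_idf(docs):
    """Compute TF-IDF scores for a list of bag-of-words dicts."""
    n = len(docs)
    # f(w): the number of documents that contain each word.
    df = {}
    for doc in docs:
        for word in doc:
            df[word] = df.get(word, 0) + 1
    # TF-IDF(w, d) = tf(w, d) * log(N / f(w))
    return [{w: tf * math.log(n / df[w]) for w, tf in doc.items()}
            for doc in docs]

docs = [{'this': 1, 'is': 1, 'a': 2, 'sample': 1},
        {'this': 1, 'is': 1, 'another': 2, 'example': 3}]
scores = tf_idf(docs)
```

Note that a word appearing in every document (such as 'this' above) gets a score of 0, since log(N / N) = 0.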
The behavior of TF-IDF for each supported input column type is as follows:

- dict: Each (key, value) pair is treated as the count associated with the key for this row. A common example is a dict element containing a bag-of-words representation of a document, where each key is a word and each value is the number of times that word occurs in the document. All non-numeric values are ignored.
- list: The list is converted to bag-of-words format, where the keys are the unique elements in the list and the values are the counts of those elements. After this step, the behavior is identical to dict.
- string: Behaves identically to a dict, where the dictionary is generated by converting the string into a bag-of-words format. For example, 'I really like really fluffy dogs' would be converted to {'I': 1, 'really': 2, 'like': 1, 'fluffy': 1, 'dogs': 1}.
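The list-to-bag-of-words and string-to-bag-of-words conversions described above can be sketched with `collections.Counter`. This illustrates the described behavior; the exact tokenization GraphLab applies may differ:

```python
from collections import Counter

# list: count the occurrences of each unique element.
bow_list = dict(Counter(['a', 'good', 'example']))

# string: split into tokens, then count as with a list.
bow_str = dict(Counter('I really like really fluffy dogs'.split()))
# bow_str == {'I': 1, 'really': 2, 'like': 1, 'fluffy': 1, 'dogs': 1}
```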
import graphlab as gl
# Create data.
sf = gl.SFrame({'a': ['1','2','3'], 'b' : [2,3,4]})
# Create a TFIDF transformer.
from graphlab.toolkits.feature_engineering import TFIDF
encoder = gl.feature_engineering.create(sf, TFIDF('a'))
# Transform the data.
transformed_sf = encoder.transform(sf)
Columns:
a dict
b int
Rows: 3
Data:
+---------------------------+---+
| a | b |
+---------------------------+---+
| {'1': 1.0986122886681098} | 2 |
| {'2': 1.0986122886681098} | 3 |
| {'3': 1.0986122886681098} | 4 |
+---------------------------+---+
[3 rows x 2 columns]
# Save the transformer.
encoder.save('save-path')
# Retrieve the document frequency of each term.
encoder['document_frequencies']
Columns:
feature_column str
term str
document_frequency int
Rows: 3
Data:
+----------------+------+--------------------+
| feature_column | term | document_frequency |
+----------------+------+--------------------+
| a | 1 | 1 |
| a | 2 | 1 |
| a | 3 | 1 |
+----------------+------+--------------------+
[3 rows x 3 columns]
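The scores in the transformed output above can be checked directly against the formula: there are N = 3 documents, each term appears in exactly one of them (f = 1), and each occurs once in its row (tf = 1), so every score is 1 · log(3 / 1):

```python
import math

# tf = 1, N = 3, f(w) = 1 for every term in the example above.
score = 1 * math.log(3 / 1)
print(score)  # 1.0986122886681098, matching the transformed column
```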
# For list columns:
l1 = ['a','good','example']
l2 = ['a','better','example']
sf = gl.SFrame({'a' : [l1,l2]})
tfidf = gl.feature_engineering.TFIDF('a')
fit_tfidf = tfidf.fit(sf)
transformed_sf = fit_tfidf.transform(sf)
Columns:
a dict
Rows: 2
Data:
+-------------------------------+
| a |
+-------------------------------+
| {'a': 0.0, 'good': 0.69314... |
| {'better': 0.6931471805599... |
+-------------------------------+
[2 rows x 1 columns]
# For string columns:
sf = gl.SFrame({'a' : ['a good example', 'a better example']})
tfidf = gl.feature_engineering.TFIDF('a')
fit_tfidf = tfidf.fit(sf)
transformed_sf = fit_tfidf.transform(sf)
Columns:
a dict
Rows: 2
Data:
+-------------------------------+
| a |
+-------------------------------+
| {'a': 0.0, 'good': 0.69314... |
| {'better': 0.6931471805599... |
+-------------------------------+
[2 rows x 1 columns]
# For dictionary columns:
sf = gl.SFrame(
{'docs': [{'this': 1, 'is': 1, 'a': 2, 'sample': 1},
{'this': 1, 'is': 1, 'another': 2, 'example': 3}]})
tfidf = gl.feature_engineering.TFIDF('docs')
fit_tfidf = tfidf.fit(sf)
transformed_sf = fit_tfidf.transform(sf)
Columns:
docs dict
Rows: 2
Data:
+-------------------------------+
| docs |
+-------------------------------+
| {'this': 0.0, 'a': 1.38629... |
| {'this': 0.0, 'is': 0.0, '... |
+-------------------------------+
[2 rows x 1 columns]