Suppose we have the following corpus:
Document 1: I like drinking water. They also like drinking water.
Document 2: They like drinking coffee.
Represent every word as a |V| × 1 vector, where the entry at the word's index is 1 and all other entries are 0.
Bag of words = the sum of the one-hot vectors of all words in a document.
- The dot product of any two distinct one-hot vectors is zero, so one-hot vectors cannot capture word similarity.
| | also | coffee | drinking | I | like | They | water |
|---|---|---|---|---|---|---|---|
| Document 1 | 1 | 0 | 2 | 1 | 2 | 1 | 2 |
| Document 2 | 0 | 1 | 1 | 0 | 1 | 1 | 0 |
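A minimal sketch that reproduces the table above, assuming a tokenizer that simply strips periods and splits on whitespace:

```python
import numpy as np

docs = ["I like drinking water. They also like drinking water.",
        "They like drinking coffee."]

# Tokenize: strip periods and split on whitespace.
tokenized = [d.replace('.', '').split() for d in docs]

# Build the vocabulary (sorted case-insensitively for a stable column order).
vocab = sorted({w for doc in tokenized for w in doc}, key=str.lower)
word2idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the |V| x 1 one-hot vector of a word."""
    v = np.zeros(len(vocab), dtype=int)
    v[word2idx[word]] = 1
    return v

# Bag of words = sum of the one-hot vectors of the words in a document.
bow = np.array([sum(one_hot(w) for w in doc) for doc in tokenized])
print(vocab)  # ['also', 'coffee', 'drinking', 'I', 'like', 'They', 'water']
print(bow)    # [[1 0 2 1 2 1 2], [0 1 1 0 1 1 0]]
```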
- High dimensionality (vocabulary size × number of documents) and sparsity
- Ignores word order
- Low-frequency words are usually filtered out
```python
def n_grams(text, n):
    # Return all contiguous n-grams of a token list.
    return [text[i:i+n] for i in range(len(text) - n + 1)]

a = ['I', 'love', 'you']
print(n_grams(a, 2))  # [['I', 'love'], ['love', 'you']]
```
TF = word count / total word count in one document. E.g., TF(water, Document 1) = 2/9.
IDF = log(N/n), where N is the number of documents in the corpus and n is the number of documents in which the word appears. E.g., IDF(They) = log(2/2) = 0.
| | also | coffee | drinking | I | like | They | water |
|---|---|---|---|---|---|---|---|
| Document 1 | 1/9*log(2/1) | 0 | 2/9*log(2/2)=0 | 1/9*log(2/1) | 2/9*log(2/2)=0 | 1/9*log(2/2)=0 | 2/9*log(2/1) |
| Document 2 | 0 | 1/4*log(2/1) | 1/4*log(2/2)=0 | 0 | 1/4*log(2/2)=0 | 1/4*log(2/2)=0 | 0 |
- TF-IDF penalises common words that appear across many documents in the corpus.
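A minimal TF-IDF sketch for the same toy corpus, reusing the `bow` matrix from the bag-of-words sketch above (natural log, matching the log(N/n) definition used in the table):

```python
import numpy as np

# Term frequency: word count divided by the document's total word count.
tf = bow / bow.sum(axis=1, keepdims=True)

# Inverse document frequency: log(N / n_w), where N is the number of documents
# and n_w the number of documents containing word w.
N = bow.shape[0]
n_w = (bow > 0).sum(axis=0)
idf = np.log(N / n_w)

tfidf = tf * idf
print(np.round(tfidf, 3))  # rows = documents, columns = vocab order above
```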
Co-occurrence: the number of times two words appear together within a context window. A context window is specified by its size and direction. With window size 2 and both directions (ignoring punctuation), the corpus above gives the following counts (a sketch for building this matrix follows the table):
| | also | coffee | drinking | I | like | They | water |
|---|---|---|---|---|---|---|---|
| also | 0 | 0 | 1 | 0 | 1 | 1 | 1 |
| coffee | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| drinking | 1 | 1 | 0 | 1 | 3 | 2 | 2 |
| I | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| like | 1 | 1 | 3 | 1 | 0 | 2 | 2 |
| They | 1 | 0 | 2 | 0 | 2 | 0 | 1 |
| water | 1 | 0 | 2 | 0 | 2 | 1 | 0 |
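A minimal sketch that builds this symmetric co-occurrence matrix, assuming periods are stripped and the window slides over each whole document (window size 2, both directions):

```python
import numpy as np

docs = ["I like drinking water. They also like drinking water.",
        "They like drinking coffee."]
tokenized = [d.replace('.', '').split() for d in docs]

vocab = sorted({w for doc in tokenized for w in doc}, key=str.lower)
word2idx = {w: i for i, w in enumerate(vocab)}
window = 2

cooccur = np.zeros((len(vocab), len(vocab)), dtype=int)
for doc in tokenized:
    for i, center in enumerate(doc):
        # Count every word within `window` positions on either side of the center word.
        for j in range(max(0, i - window), min(len(doc), i + window + 1)):
            if j != i:
                cooccur[word2idx[center], word2idx[doc[j]]] += 1

print(cooccur)  # should match the table above
```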
- Still high-dimensional (|V| × |V|) and sparse
- Dimensionality can be reduced by applying SVD, as in the example below
```python
# Input texts:
# 1. I enjoy flying.
# 2. I like NLP.
# 3. I like deep learning.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

words = ['I', 'like', 'enjoy', 'deep', 'learning', 'NLP', 'flying', '.']

# Co-occurrence counts (window size 1) over the three sentences above.
word_cooccur_matrix = np.array(
    [[0, 2, 1, 0, 0, 0, 0, 0],
     [2, 0, 0, 1, 0, 1, 0, 0],
     [1, 0, 0, 0, 0, 0, 1, 0],
     [0, 1, 0, 0, 1, 0, 0, 0],
     [0, 0, 0, 1, 0, 0, 0, 1],
     [0, 1, 0, 0, 0, 0, 0, 1],
     [0, 0, 1, 0, 0, 0, 0, 1],
     [0, 0, 0, 0, 1, 1, 1, 0]])

# SVD of the co-occurrence matrix; the first two columns of U give a
# 2-D embedding of each word.
U, s, V = np.linalg.svd(word_cooccur_matrix)
plt.xlim((-1, 1))
plt.ylim((-1, 1))
for i in range(len(words)):
    plt.text(U[i, 0], U[i, 1], words[i])
```
- Two models: CBOW and Skip-gram [2]
- Two training algorithms: hierarchical softmax and negative sampling.
- Distributional hypothesis: similar words have similar contexts.
- Speed: CBOW is faster, Skip-gram is slower.
- Infrequent words: CBOW does worse, Skip-gram does better. Why?
- Training data: CBOW works with smaller datasets, Skip-gram needs larger datasets.
  - CBOW smoothes over a lot of the distributional information (by treating an entire context as one observation), which is useful for smaller datasets. Skip-gram treats each context-target pair as a new observation, and tends to do better with larger datasets.
Negative samples are selected with probability proportional to f(w)^(3/4); this power makes less frequent words be sampled more often than their raw frequency would suggest. Frequent words (such as "the" and "and") are subsampled: each word occurrence in a sentence may be deleted with a probability that depends on its frequency and on the sampling-rate hyper-parameter (`sample` in gensim `Word2Vec`, default value 1e-3). A sketch of both tricks follows.
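A small sketch of both tricks on the toy corpus above; the discard formula 1 - sqrt(t/f(w)) is the one stated in the word2vec paper, and actual implementations (including gensim) use a slightly different variant:

```python
import numpy as np

# Unigram counts from the toy corpus: also, coffee, drinking, I, like, They, water.
counts = np.array([1, 1, 3, 1, 3, 2, 2], dtype=float)
freq = counts / counts.sum()

# Negative-sampling distribution: proportional to f(w)^(3/4).
neg_prob = freq ** 0.75
neg_prob /= neg_prob.sum()

# Subsampling of frequent words: discard each occurrence of w with
# probability 1 - sqrt(t / f(w)), where t is the `sample` hyper-parameter.
t = 1e-3
discard_prob = np.clip(1.0 - np.sqrt(t / freq), 0.0, 1.0)

# On a 13-token toy corpus every word looks "frequent", so discard_prob is
# high for all words; subsampling is meant for large corpora.
print(np.round(neg_prob, 3))
print(np.round(discard_prob, 3))
```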
- Hierarchical softmax (better for infrequent words) vs. negative sampling (better for frequent words, better with low-dimensional vectors). Why?
- Sub-sampling of frequent words: can improve both accuracy and speed for large datasets (useful values are in the range 1e-5 to 1e-3). Why?
- Dimensionality of the word vectors: usually more is better, but not always.
- Context (window) size: for Skip-gram usually around 10, for CBOW around 5.
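A hedged sketch of how these hyper-parameters map onto gensim's `Word2Vec` arguments (assuming gensim 4.x, where the dimensionality argument is named `vector_size`; `tokenized` is the toy token list from the sketches above, so the resulting vectors are only illustrative):

```python
from gensim.models import Word2Vec

model = Word2Vec(
    sentences=tokenized,  # any iterable of token lists
    vector_size=100,      # dimensionality of the word vectors
    window=10,            # context window size (around 10 for Skip-gram, 5 for CBOW)
    sg=1,                 # 1 = Skip-gram, 0 = CBOW
    hs=0,                 # 0 = use negative sampling instead of hierarchical softmax
    negative=5,           # number of negative samples per target
    sample=1e-3,          # subsampling threshold for frequent words
    min_count=1,          # keep all words in this toy corpus
)
print(model.wv['water'].shape)  # (100,)
```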
For building word vectors, fastText is extremely fast. The input format is a plain-text corpus file with one sentence per line, the words of each line segmented by whitespace.
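For comparison, the same one-sentence-per-line, whitespace-segmented format can be streamed from Python with gensim's `LineSentence` (the file name here is hypothetical):

```python
from gensim.models.word2vec import LineSentence

# corpus.txt contains lines such as:
#   I like drinking water .
#   They also like drinking water .
#   They like drinking coffee .
sentences = LineSentence('corpus.txt')  # yields lists of whitespace-separated tokens
for tokens in sentences:
    print(tokens)
```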
I did a very simple performance comparison of gensim and fastText with the same hyper-parameters; the gensim results seem slightly better than those of fastText (without n-grams).
- gensim_Word2vec_cbow_hs_model_test.py: gensim with the CBOW model and the hierarchical softmax trick.
- gensim_Word2vec_sg_ns_model_test.py: gensim with the Skip-gram model and the negative sampling trick.
- fastText_Word2vec_cbow_hs_model_test.bash: fastText with the CBOW model and the hierarchical softmax trick.
- fastText_Word2vec_sg_ns_model_test.bash: fastText with the Skip-gram model and the negative sampling trick.
The input of the above scripts is word_vector_108000.cs, which contains 108,000 documents (15,000 training and 3,000 testing documents for each category, 6 categories in total).
I also implemented Word2vec in TensorFlow based on CS20SI. The following image shows the results; they look good enough.
- Word order is ignored within each window.
- Generally, a narrower window size leads to better performance on syntactic tests, while a wider window size leads to better performance on semantic tests.
- Key drawback: polysemy is not handled; different meanings of the same word share one embedding.
Reference:
- [1] http://web.stanford.edu/class/cs224n/lecture_notes/cs224n-2017-notes1.pdf
- [2] https://mp.weixin.qq.com/s/Rd3-ypRYiJObi-e2JDeOjQ
Global Vectors for word representation (GloVe) uses global statistics to predict the probability of word j appearing in the context of word i, with a weighted least-squares objective:

J = sum_{i,j} f(X_ij) * (w_i . w'_j + b_i + b'_j - log X_ij)^2

where X is the word-word co-occurrence matrix, w_i and w'_j are the word and context vectors, b_i and b'_j are their biases, and f is a weighting function that down-weights rare co-occurrences and caps very frequent ones.
The norm |GloVe vector| can be viewed as a keyword score: the larger a word's co-occurrence counts, the larger its |GloVe vector|.
Skip-gram tries to capture word co-occurrence one window at a time; GloVe tries to capture all word co-occurrence information across the whole corpus.
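An illustrative sketch of the weighting function and the least-squares objective above (x_max = 100 and alpha = 3/4 are the defaults from the GloVe paper; this is not the reference implementation):

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    # f(X_ij): down-weights rare co-occurrences, caps very frequent ones at 1.
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_ctx, b, b_ctx, X):
    # J = sum_{i,j} f(X_ij) * (w_i . w'_j + b_i + b'_j - log X_ij)^2,
    # summed over the nonzero entries of the co-occurrence matrix X.
    i, j = np.nonzero(X)
    diff = (W[i] * W_ctx[j]).sum(axis=1) + b[i] + b_ctx[j] - np.log(X[i, j])
    return np.sum(glove_weight(X[i, j]) * diff ** 2)

# Example with the toy co-occurrence matrix `cooccur` built earlier:
# V, d = cooccur.shape[0], 50
# W, W_ctx = 0.1 * np.random.randn(V, d), 0.1 * np.random.randn(V, d)
# b, b_ctx = np.zeros(V), np.zeros(V)
# print(glove_loss(W, W_ctx, b, b_ctx, cooccur.astype(float)))
```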
Reference:
- http://web.stanford.edu/class/cs224n/lecture_notes/cs224n-2017-notes2.pdf
Let the word embedding dimension be 100; each word is then embedded as a point in a 100-dimensional space.
- Learn word embeddings from a large corpus.
- Transfer the word embeddings to a new task (see the loading sketch below).
- Whether or not to fine-tune the embeddings depends on the corpus size of the new task. For NER, pretrained word embeddings are a prerequisite for better accuracy.
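A sketch of the transfer step using gensim's `KeyedVectors` to load vectors in word2vec format (the file path and task vocabulary here are hypothetical):

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical pretrained file in word2vec binary format.
kv = KeyedVectors.load_word2vec_format('pretrained_vectors.bin', binary=True)

task_vocab = ['water', 'coffee', 'tea', '<unk>']  # vocabulary of the new task
dim = kv.vector_size  # e.g., 100

# Build the embedding matrix for the new task; rows for out-of-vocabulary
# words are initialized randomly and can be fine-tuned on the new task.
embedding_matrix = np.zeros((len(task_vocab), dim))
for idx, word in enumerate(task_vocab):
    embedding_matrix[idx] = kv[word] if word in kv else np.random.normal(scale=0.1, size=dim)
```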
pmi(x,y) = log(p(x,y) / (p(x) * p(y)))
- If pmi(x, y) > 0, the two words co-occur more often than they would if they were independent, i.e., they are dependent.
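A small sketch that estimates PMI from the co-occurrence matrix built earlier (pairs with zero counts are left at 0, a common convention):

```python
import numpy as np

def pmi_matrix(cooccur):
    # pmi(x, y) = log( p(x, y) / (p(x) * p(y)) ), with probabilities estimated
    # from co-occurrence counts; the matrix is symmetric, so row marginals
    # equal column marginals.
    total = cooccur.sum()
    p_xy = cooccur / total
    p_x = cooccur.sum(axis=1) / total
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log(p_xy / np.outer(p_x, p_x))
    pmi[~np.isfinite(pmi)] = 0.0  # zero counts give log(0); treat them as 0
    return pmi

# print(np.round(pmi_matrix(cooccur), 2))
```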