Problem about input embeddings generated by other algo. #2

Open
geekinglcq opened this issue Oct 23, 2016 · 5 comments

Comments

@geekinglcq

Hi, I noticed that in section 6.1 of your paper, because optimizing a likelihood function that includes both Z and V is inefficient, you chose to divide the process into two stages: first obtain the word embeddings, then take them as input in the second stage.

I wonder if it's OK to input embeddings generated by another algorithm (e.g. word2vec) instead of PSDVec.

I've tried it and got some weird results. My corpus includes 10,000 docs containing 3,223,788 valid words. The input embeddings were generated with w2v.

In iter 1 the loglike is 1.3e11, in iter 2 it is 0.7e11, and as the process continues the loglike keeps decreasing. Hence the best result always occurs after the first iteration instead of the last round. The output is quite reasonable based on "Most relevant words", but the strange behaviour of the likelihood really bothers me.

@askerlee
Owner

askerlee commented Oct 23, 2016

Thank you for trying our code.

Yes, you can use embeddings generated by other algorithms such as w2v. In theory there should be no significant performance difference; however, in my experiments, using w2v embeddings usually yielded worse results. I guess the reason might be that w2v embeddings have a less clear probabilistic interpretation.
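
In case it helps, here is a minimal sketch of training word2vec with gensim and exporting the vectors to a plain-text "word v1 v2 ..." file. Whether topicvec accepts this exact format is an assumption, so check the embedding loader in the repo and adapt as needed; the corpus path `docs.txt` is hypothetical.

```python
# Minimal sketch, assuming topicvec can read a word2vec-style plain-text
# embedding file ("<vocab_size> <dim>" header, then "word v1 v2 ..." per line).
# "docs.txt" is a hypothetical corpus file with one tokenized document per line.
from gensim.models import Word2Vec

sentences = [line.split() for line in open("docs.txt", encoding="utf-8")]

# gensim >= 4.0 uses `vector_size`; older versions call the parameter `size`.
model = Word2Vec(sentences, vector_size=500, window=5, min_count=5, workers=4)

# Export in the plain-text word2vec format.
model.wv.save_word2vec_format("w2v_embeddings.txt", binary=False)
```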

I never observed monotonically decreasing loglikes in my experiments. Sometimes the loglike decreases for a few iterations and then increases again, which is usually caused by overly large GD steps. But monotonically decreasing loglikes... I have no idea. Sorry. If you don't mind, you could upload your data somewhere and I'll try to figure out the reason.
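
As a hedged illustration of the "GD step too big" point, here is a self-contained toy (not code from this repo): gradient ascent on a simple Gaussian log-likelihood, where the step size is halved whenever the log-likelihood drops.

```python
# Toy example only (not topicvec code): gradient ascent on a Gaussian
# log-likelihood with a deliberately large initial step. Whenever the
# log-likelihood decreases, the step is halved and the update is retried.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=1000)

def loglike(mu):
    # Gaussian log-likelihood up to an additive constant (sigma = 1).
    return -0.5 * np.sum((data - mu) ** 2)

mu, step = 0.0, 0.01          # step chosen too large on purpose
prev_ll = loglike(mu)
for it in range(20):
    grad = np.sum(data - mu)  # d(loglike)/d(mu)
    new_mu = mu + step * grad
    ll = loglike(new_mu)
    if ll < prev_ll:          # loglike dropped: the step was too big
        step *= 0.5
        continue
    mu, prev_ll = new_mu, ll
    print(f"iter {it}: mu = {mu:.3f}, loglike = {ll:.1f}, step = {step:g}")
```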

Alternatively, you may try the embeddings I trained on Wikipedia. You can find their Dropbox URLs in the GitHub file "vectors&corpus_download_addr.txt".

@geekinglcq
Author

Hi, since I wanted to apply it to Chinese, I didn't use your embeddings.

I found that the loglike is normal when the corpus contains only one doc.
Here I've uploaded the log, topicvec output, corpus, and the word embeddings / unigram table I use. I tried 5 runs, with 10, 100, 1000, 10000, and 20000 docs respectively, named cn_top_a/b/c/d/e.

Thank you for your help.

@gabrer

gabrer commented May 6, 2017

I'm writing in this "discussion" because I think my question should be "on topic" :)

If I want to use alternative word embeddings (e.g. word2vec), should I also generate the "top1gram" file?
If so, can I generate the "top1gram" with the "gramcount.pl" script and the word embeddings with an external tool? Is there any relation between the "top1gram" file and the word embedding structure?
(For example, some matching between indices...)
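
For what it's worth, here is a rough sketch of the kind of consistency check one might run between the two files. The "word<TAB>count" top1gram layout and the plain-text embedding format assumed below are guesses rather than the documented formats, and the file names are placeholders.

```python
# Hedged sketch: check that every word in the unigram table also has an embedding.
# The file formats ("word<TAB>count" lines for top1gram, a word2vec-style text
# file for the embeddings) and the file names are assumptions, not repo facts.
def unigram_words(path):
    words = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) >= 2:
                words.add(parts[0])
    return words

def embedding_words(path):
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the "<vocab_size> <dim>" header, if present
        return {line.split(" ", 1)[0] for line in f}

missing = unigram_words("top1gram.txt") - embedding_words("w2v_embeddings.txt")
print(f"{len(missing)} unigram-table words have no embedding")
```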

@askerlee
Owner

askerlee commented May 6, 2017 via email

@gabrer

gabrer commented May 7, 2017

Ok, great! Thank you for your help!
And congratulations on your work :)
