Problem about input embeddings generated by other algo. #2

Open
geekinglcq opened this issue Oct 23, 2016 · 5 comments

Comments

@geekinglcq

Hi, I noticed that in section 6.1 of your paper, because optimizing a likelihood function that includes both Z and V is inefficient, you chose to divide the process into two stages: first obtain the word embeddings, then take them as input in the second stage.

I wonder if it's OK to input embeddings generated by another algorithm (e.g. word2vec) instead of PSDVec.

I've tried it and got some weird results. My corpus includes 10,000 docs containing 3,223,788 valid words. The input embeddings were generated with w2v.

In iter 1 the loglike is 1.3e11, in iter 2 it is 0.7e11, and as the process continues the loglike keeps decreasing. Hence the best result always occurs after the first iteration instead of the last round. The output is quite reasonable based on "Most relevant words", but the strange behaviour of the likelihood really bothers me.

@askerlee
Owner

askerlee commented Oct 23, 2016

Thank you for trying our code.

Yes, you can use embeddings generated by other algorithms such as w2v. In theory there should be no significant performance difference; however, in my experiments, using w2v embeddings usually yielded worse results. I guess the reason might be that w2v embeddings have a less clear probabilistic interpretation.
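
In case it helps, here is a minimal sketch of training word2vec with gensim and exporting the vectors to a plain-text "word v1 v2 ..." file. Whether topicvec accepts this exact format is an assumption, so check the embedding loader in the repo and adapt as needed; the corpus path `docs.txt` is hypothetical.

```python
# Minimal sketch, assuming topicvec can read a word2vec-style plain-text
# embedding file ("<vocab_size> <dim>" header, then "word v1 v2 ..." per line).
# "docs.txt" is a hypothetical corpus file with one tokenized document per line.
from gensim.models import Word2Vec

sentences = [line.split() for line in open("docs.txt", encoding="utf-8")]

# gensim >= 4.0 uses `vector_size`; older versions call the parameter `size`.
model = Word2Vec(sentences, vector_size=500, window=5, min_count=5, workers=4)

# Export in the plain-text word2vec format.
model.wv.save_word2vec_format("w2v_embeddings.txt", binary=False)
```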

I never observed monotonically decreasing loglikes in my experiments. Sometimes the loglike decreases for a few iterations and then increases again, which is usually caused by overly large GD steps. But monotonically decreasing loglikes... I have no idea. Sorry. If you don't mind, you could upload your data somewhere and I'll try to figure out the reason.
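
As a hedged illustration of the "GD step too big" point, here is a self-contained toy (not code from this repo): gradient ascent on a simple Gaussian log-likelihood, where the step size is halved whenever the log-likelihood drops.

```python
# Toy example only (not topicvec code): gradient ascent on a Gaussian
# log-likelihood with a deliberately large initial step. Whenever the
# log-likelihood decreases, the step is halved and the update is retried.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=1000)

def loglike(mu):
    # Gaussian log-likelihood up to an additive constant (sigma = 1).
    return -0.5 * np.sum((data - mu) ** 2)

mu, step = 0.0, 0.01          # step chosen too large on purpose
prev_ll = loglike(mu)
for it in range(20):
    grad = np.sum(data - mu)  # d(loglike)/d(mu)
    new_mu = mu + step * grad
    ll = loglike(new_mu)
    if ll < prev_ll:          # loglike dropped: the step was too big
        step *= 0.5
        continue
    mu, prev_ll = new_mu, ll
    print(f"iter {it}: mu = {mu:.3f}, loglike = {ll:.1f}, step = {step:g}")
```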

Alternatively, you may try the embeddings I trained on Wikipedia. You can find their Dropbox URLs in the GitHub file "vectors&corpus_download_addr.txt".

@geekinglcq
Author

Hi, since I wanted to apply it to Chinese, I didn't use your embeddings.

I found that the loglike is normal when the corpus contains only one doc.
Here I've uploaded the log, topicvec output, corpus, and the word embeddings / unigram table I use. I tried 5 runs, with 10, 100, 1000, 10000, and 20000 docs respectively, named cn_top_a/b/c/d/e.

Thank you for your help.

@gabrer

gabrer commented May 6, 2017

I'm writing in this "discussion" because I think my question should be "on topic" :)

If I want to use alternative word embeddings (e.g. word2vec), should I also generate the "top1gram" file?
If so, can I generate the "top1gram" with the "gramcount.pl" script and the word embeddings with an external tool? Is there any relation between the "top1gram" file and the word embedding structure?
(For example, some matching between indices...)
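
For what it's worth, here is a rough sketch of the kind of consistency check one might run between the two files. The "word<TAB>count" top1gram layout and the plain-text embedding format assumed below are guesses rather than the documented formats, and the file names are placeholders.

```python
# Hedged sketch: check that every word in the unigram table also has an embedding.
# The file formats ("word<TAB>count" lines for top1gram, a word2vec-style text
# file for the embeddings) and the file names are assumptions, not repo facts.
def unigram_words(path):
    words = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) >= 2:
                words.add(parts[0])
    return words

def embedding_words(path):
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the "<vocab_size> <dim>" header, if present
        return {line.split(" ", 1)[0] for line in f}

missing = unigram_words("top1gram.txt") - embedding_words("w2v_embeddings.txt")
print(f"{len(missing)} unigram-table words have no embedding")
```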

@askerlee
Owner

askerlee commented May 6, 2017 via email

@gabrer

gabrer commented May 7, 2017

Ok, great! Thank you for your help!
And congratulations on your work :)
