Problem with input embeddings generated by other algorithms #2
Comments
Thank you for trying our code. Yes, you can use embeddings generated by other algorithms such as word2vec. In theory there should be no significant performance difference; however, in my experience, using word2vec embeddings usually yields worse results. I guess the reason might be that word2vec embeddings have a less clear probabilistic interpretation. I never encountered monotonically decreasing loglikes in my experiments. Sometimes the loglike decreases for a few iterations and then increases again, and that's usually caused by too-large GD steps. But monotonically decreasing loglikes... I have no idea, sorry. If you don't mind, you could upload your data somewhere and I'll try to figure out the reason. Alternatively, you may try the embeddings I trained on Wikipedia; their Dropbox URLs are listed in the GitHub file "vectors&corpus_download_addr.txt".
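For readers who want to try the same thing, below is a minimal sketch (not part of this repo) of training word2vec embeddings with gensim and exporting them in the standard word2vec text format. The file names, the toy corpus, and the assumption that this code can consume the standard text format are mine.

```python
# Minimal sketch: train word2vec with gensim and export a plain-text vector file.
# Whether the repo reads exactly this format is an assumption; adapt as needed.
from gensim.models import Word2Vec

# `sentences` is an iterable of tokenized documents; replace with your real corpus.
sentences = [["主题", "模型", "词", "向量"], ["词", "向量", "训练", "语料"]]

# gensim >= 4.0 uses `vector_size`; older versions call this parameter `size`.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Standard word2vec text format: a "<vocab_size> <dim>" header, then one word per line.
model.wv.save_word2vec_format("w2v_embeddings.vec", binary=False)
```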
Hi, because I wanted to apply it to Chinese, I didn't use your embeddings. I found that the loglike is normal when the corpus contains only one doc. Thank you for your help.
I'm writing in this "discussion" because I think my question is "on topic" :) If I want to use alternative word embeddings (e.g. word2vec), should I also generate the "top1gram" file? If so, can I generate the "top1gram" with the "gramcount.pl" script and the word embeddings with an external tool? Is there any relation between the "top1gram" file and the word embedding structure (for example, some matching between indices)?
Yeah, you can do that. The top1gram file is actually used to get unigram probabilities.
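For concreteness, here is a rough Python sketch of deriving unigram counts and probabilities from a tokenized corpus. The repo itself generates the top1gram file with gramcount.pl; the file layout written below (word, count, probability per line) is only a guess, not the script's actual output format.

```python
# Rough sketch: count unigrams in a tokenized corpus and write a per-word file
# from which unigram probabilities can be read. The actual layout produced by
# gramcount.pl may differ.
from collections import Counter

def count_unigrams(docs):
    """docs: iterable of token lists. Returns a Counter of word frequencies."""
    counts = Counter()
    for doc in docs:
        counts.update(doc)
    return counts

docs = [["主题", "模型", "词", "向量"], ["词", "向量", "训练"]]
counts = count_unigrams(docs)
total = sum(counts.values())

with open("top1gram.txt", "w", encoding="utf-8") as f:
    for word, c in counts.most_common():
        f.write("%s\t%d\t%.8f\n" % (word, c, c / total))  # word, count, unigram prob
```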
Ok, great! Thank you for your help!
Hi, I noticed that in Section 6.1 of your paper, because optimizing the likelihood over both Z and V is inefficient, you chose to split the process into two stages: first learn the word embeddings, then take them as input in the second stage.
I wonder whether it is OK to input embeddings generated by another algorithm (e.g. word2vec) instead of PSDVec.
I've tried it and got some weird results. My corpus includes 10,000 docs containing 3,223,788 valid words, and the input embeddings were generated with word2vec.
In iteration 1 the loglike is 1.3e11, in iteration 2 it is 0.7e11, and as the process continues the loglike keeps decreasing, so the best result always occurs after the first iteration instead of the last one. The output is quite reasonable judging by the "Most relevant words", but the strange behaviour of the likelihood really bothers me.
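Given the maintainer's remark above about too-large GD steps, one practical workaround is to track the log-likelihood each iteration, keep a snapshot of the best-scoring parameters, and shrink the step size whenever the log-likelihood drops. The `step()`, `loglike()`, and `get_params()` hooks below are hypothetical placeholders, not this repo's actual API.

```python
# Generic sketch: guard a gradient-descent training loop against a decreasing
# log-likelihood by keeping the best parameters seen so far and shrinking the
# step size after any drop. The model interface here is hypothetical.
import copy

def fit_with_guard(model, n_iters=100, lr=0.1, shrink=0.5):
    best_ll, best_params = float("-inf"), None
    for it in range(n_iters):
        model.step(lr)                 # one gradient-descent update (hypothetical)
        ll = model.loglike()           # current log-likelihood (hypothetical)
        if ll > best_ll:
            best_ll = ll
            best_params = copy.deepcopy(model.get_params())
        else:
            lr *= shrink               # step looks too large; shrink it
        print("iter %d: loglike %.4g, lr %.4g" % (it + 1, ll, lr))
    return best_params, best_ll
```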