If I want to try it on another language, how do I train 25000-180000-500-BLK-8.0.vec.npy? and.. #3
Comments
top1grams-wiki.txt is generated by a Perl script, https://github.com/askerlee/topicvec/blob/master/psdvec/gramcount.pl. You could generate it using the Chinese Wikipedia text as input. gramcount.pl will also generate top2grams-wiki.txt (two separate runs are needed for top1grams* and top2grams*). Then you use https://github.com/askerlee/topicvec/blob/master/psdvec/factorize.py to generate 25000-180000-500-BLK-8.0.vec, with both top1grams* and top2grams* as input. You can find an example in https://github.com/askerlee/topicvec/blob/master/psdvec/PSDVec.pdf.
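For reference, once factorize.py has produced the embedding file, it can be inspected with plain Python. The sketch below is a minimal loader that assumes the .vec file uses a word2vec-style text layout (a header line with vocabulary size and dimension, then one word followed by its vector per line); the actual layout is whatever factorize.py writes, so check its output before relying on this, and `load_vec` is just an illustrative helper name.

```python
# Minimal sketch for loading the generated embeddings, ASSUMING a
# word2vec-style text layout: "<vocab_size> <dim>" on the first line,
# then "<word> <v1> <v2> ... <v_dim>" on each following line.
# Verify against the actual output of factorize.py before relying on this.
import numpy as np

def load_vec(path, encoding="utf-8"):
    embeddings = {}
    with open(path, encoding=encoding) as f:
        header = f.readline().split()
        vocab_size, dim = int(header[0]), int(header[1])
        for line in f:
            parts = line.rstrip().split()
            if len(parts) != dim + 1:
                continue  # skip malformed or blank lines
            word = parts[0]
            embeddings[word] = np.array(parts[1:], dtype=np.float32)
    return embeddings

# Example usage (file name taken from the question above):
# vecs = load_vec("25000-180000-500-BLK-8.0.vec")
# print(len(vecs))
```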
Roger that!
I noticed that the number of words in "top1grams" is different from the number of words in the word embedding. E.g., for the Wiki dataset, "top1grams" has 286441 words while the word embedding has 180000. Does it matter?
It doesn't matter. Words in the word embedding file should be a subset of those in top1grams.txt. Extra words in top1grams.txt will be ignored.
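For a new corpus this subset relationship is easy to sanity-check. Below is a small sketch; it assumes that the first whitespace-separated field of each line in top1grams-wiki.txt is the word, and that the .vec file follows the text layout assumed in the loader above. Both assumptions should be verified against the actual files produced by gramcount.pl and factorize.py.

```python
# Sketch of a subset check: every word in the embedding file should also
# appear in top1grams-wiki.txt; extra top1grams words are harmless.
# ASSUMES the first whitespace-separated field of each line is the word.
def read_words(path, skip_header=False, encoding="utf-8"):
    words = set()
    with open(path, encoding=encoding) as f:
        if skip_header:
            f.readline()  # skip the "<vocab_size> <dim>" header of the .vec file
        for line in f:
            fields = line.split()
            if fields:
                words.add(fields[0])
    return words

top1_words = read_words("top1grams-wiki.txt")
emb_words = read_words("25000-180000-500-BLK-8.0.vec", skip_header=True)

missing = emb_words - top1_words
print(f"{len(emb_words)} embedding words, {len(top1_words)} top1gram words, "
      f"{len(missing)} embedding words missing from top1grams")
```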
Hi Askerlee, thank you as usual! :) I had a problem with "Mstep_sample_topwords" and thought it was caused by the gap between these two counts. However, it was due to the number of words in the word embedding being smaller than "Mstep_sample_topwords". I fixed it. Thanks!
I see. I didn't consider this situation, as I mainly use my own embeddings. Yeah, it's better to fix it. Thanks for finding this out.
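For anyone hitting the same error before a fix lands in the repo: the symptom is that Mstep_sample_topwords exceeds the vocabulary size of the loaded embedding. A hypothetical guard is to clamp the parameter to the vocabulary size; the function below is only an illustration of that idea, not the actual topicvec code.

```python
# Hypothetical guard, not the actual topicvec code: if the configured
# Mstep_sample_topwords exceeds the number of words in the loaded
# embedding, clamp it to the vocabulary size so the M-step sampling
# never indexes past the end of the vocabulary.
def clamp_topwords(Mstep_sample_topwords, vocab_size):
    if Mstep_sample_topwords > vocab_size:
        print("Warning: Mstep_sample_topwords (%d) > vocabulary size (%d); clamping."
              % (Mstep_sample_topwords, vocab_size))
        return vocab_size
    return Mstep_sample_topwords

# Example: with an embedding of 150000 words and Mstep_sample_topwords=180000,
# the value would be clamped to 150000.
```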
Hi all,
if I want to try it on another language, how can I train 25000-180000-500-BLK-8.0.vec.npy and get top1grams-wiki.txt?
For example, for Chinese, I have a pre-trained w2v model of the Chinese Wikipedia. Can I get these files from this pre-trained model?
Thanks!