If I want to try it on another language, how can I train 25000-180000-500-BLK-8.0.vec.npy? and.. #3

Open
zzks opened this issue Jan 27, 2017 · 6 comments

Comments

zzks commented Jan 27, 2017

Hi all,

If I want to try it on another language, how can I train 25000-180000-500-BLK-8.0.vec.npy and get top1grams-wiki.txt?
For example, for Chinese: I have a pre-trained w2v model of the Chinese Wikipedia. Can I get these files from that pre-trained model?
Thanks!

askerlee commented Jan 27, 2017

top1grams-wiki.txt is generated by the Perl script https://github.com/askerlee/topicvec/blob/master/psdvec/gramcount.pl. You can generate it using the Chinese Wikipedia text as input. gramcount.pl also generates top2grams-wiki.txt (two separate runs are needed, one for top1grams* and one for top2grams*). Then you use https://github.com/askerlee/topicvec/blob/master/psdvec/factorize.py to generate 25000-180000-500-BLK-8.0.vec, with both top1grams* and top2grams* as input.

You can find an example in https://github.com/askerlee/topicvec/blob/master/psdvec/PSDVec.pdf.
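For reference, a quick sanity check of the factorization output might look like the sketch below. It assumes the .vec.npy file is a NumPy array saved with numpy.save, and that the numbers in the filename (25000 / 180000 / 500) encode the core vocabulary size, total vocabulary size, and embedding dimensionality; that reading of the name is an assumption, not something stated in this thread.

```python
# Hedged sketch: load the factorized embedding and inspect its shape.
# Assumption: the .vec.npy file is a NumPy save of the embedding matrix
# (allow_pickle covers the case where it is an object array, e.g. vocab + matrix).
import numpy as np

emb = np.load("25000-180000-500-BLK-8.0.vec.npy", allow_pickle=True)
arr = np.asarray(emb)
print(arr.dtype, arr.shape)
# If the array is a plain word-by-dimension matrix, one would expect roughly
# 180000 rows by 500 columns for the released English Wikipedia model.
```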

zzks commented Jan 27, 2017

Roger that!
Thank you for the quick response & detailed reply!

gabrer commented May 10, 2017

I noticed that the number of words in "top1grams" differs from the number of words in the word embedding. E.g., for the Wiki dataset, "top1grams" has 286441 words while the word embedding has 180000.
Does it matter?
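(For what it's worth, one way to quantify the mismatch is a check like the one below. It assumes top1grams-wiki.txt lists one word per line, optionally followed by its count, and that the embedding's word list can be loaded separately; both are assumptions, not a description of the repository's actual file formats.)

```python
# Hedged sketch: compare the unigram vocabulary with the embedding vocabulary.
# Assumptions (not stated in this thread): top1grams-wiki.txt has one word per
# line, optionally followed by its count; `embedding_words` is the word list
# that accompanies the .vec.npy matrix, loaded however the repo stores it.
def vocab_overlap(top1gram_path, embedding_words):
    with open(top1gram_path, encoding="utf-8") as f:
        top1 = {line.split()[0] for line in f if line.strip()}
    emb = set(embedding_words)
    return len(top1), len(emb), len(top1 & emb)

# Example call (hypothetical):
#   vocab_overlap("top1grams-wiki.txt", embedding_words)
# With the Wiki data this should report roughly 286441 unigrams vs. 180000 embedded words.
```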

askerlee commented May 11, 2017 via email

gabrer commented May 11, 2017

Hi Askerlee,
thank you as usual! :)
I ran into a problem with "Mstep_sample_topwords" and thought it was caused by the gap between these two counts. However, it turned out that the number of words in the word embedding was smaller than "Mstep_sample_topwords". I have fixed it.
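For instance, a minimal sketch of that fix, with "Mstep_sample_topwords" taken from the thread and `vocab_size` standing in for however many words the loaded embedding actually contains, could be:

```python
# Hedged sketch: cap Mstep_sample_topwords at the embedding's vocabulary size,
# since sampling more "top words" than the embedding contains triggers the error.
# `vocab_size` is a stand-in name here, not a specific API of this repository.
def clamp_topwords(Mstep_sample_topwords, vocab_size):
    return min(Mstep_sample_topwords, vocab_size)

# e.g. clamp_topwords(12000, 10000) -> 10000
```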

Thanks!

askerlee commented May 11, 2017 via email
