If I want to try it on another language, how can I train 25000-180000-500-BLK-8.0.vec.npy? and.. #3

Open
zzks opened this issue Jan 27, 2017 · 6 comments

Comments

zzks commented Jan 27, 2017

Hi all,

If I want to try it on another language, how can I train 25000-180000-500-BLK-8.0.vec.npy and get top1grams-wiki.txt?
For example, for Chinese: I have a pre-trained w2v model of the Chinese Wikipedia. Can I get these files from that pre-trained model?
Thanks!

askerlee commented Jan 27, 2017

top1grams-wiki.txt is generated by the Perl script https://github.com/askerlee/topicvec/blob/master/psdvec/gramcount.pl. You can generate it using the Chinese Wikipedia text as input. gramcount.pl also generates top2grams-wiki.txt (two separate runs are needed, one for top1grams* and one for top2grams*). Then you use https://github.com/askerlee/topicvec/blob/master/psdvec/factorize.py to generate 25000-180000-500-BLK-8.0.vec, with both top1grams* and top2grams* as input.

You can find an example in https://github.com/askerlee/topicvec/blob/master/psdvec/PSDVec.pdf.
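For reference, a quick sanity check of the factorization output might look like the sketch below. It assumes the .vec.npy file is a NumPy array saved with numpy.save, and that the numbers in the filename (25000 / 180000 / 500) encode the core vocabulary size, total vocabulary size, and embedding dimensionality; that reading of the name is an assumption, not something stated in this thread.

```python
# Hedged sketch: load the factorized embedding and inspect its shape.
# Assumption: the .vec.npy file is a NumPy save of the embedding matrix
# (allow_pickle covers the case where it is an object array, e.g. vocab + matrix).
import numpy as np

emb = np.load("25000-180000-500-BLK-8.0.vec.npy", allow_pickle=True)
arr = np.asarray(emb)
print(arr.dtype, arr.shape)
# If the array is a plain word-by-dimension matrix, one would expect roughly
# 180000 rows by 500 columns for the released English Wikipedia model.
```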

zzks commented Jan 27, 2017

Roger that!
Thank you for the quick response & detailed reply!

gabrer commented May 10, 2017

I noticed that the number of words in "top1grams" differs from the number of words in the word embedding. E.g., for the Wiki dataset, "top1grams" has 286441 words while the word embedding has 180000.
Does it matter?
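(For what it's worth, one way to quantify the mismatch is a check like the one below. It assumes top1grams-wiki.txt lists one word per line, optionally followed by its count, and that the embedding's word list can be loaded separately; both are assumptions, not a description of the repository's actual file formats.)

```python
# Hedged sketch: compare the unigram vocabulary with the embedding vocabulary.
# Assumptions (not stated in this thread): top1grams-wiki.txt has one word per
# line, optionally followed by its count; `embedding_words` is the word list
# that accompanies the .vec.npy matrix, loaded however the repo stores it.
def vocab_overlap(top1gram_path, embedding_words):
    with open(top1gram_path, encoding="utf-8") as f:
        top1 = {line.split()[0] for line in f if line.strip()}
    emb = set(embedding_words)
    return len(top1), len(emb), len(top1 & emb)

# Example call (hypothetical):
#   vocab_overlap("top1grams-wiki.txt", embedding_words)
# With the Wiki data this should report roughly 286441 unigrams vs. 180000 embedded words.
```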

askerlee commented May 11, 2017 via email

gabrer commented May 11, 2017

Hi Askerlee,
thank you as usual! :)
I ran into a problem with "Mstep_sample_topwords" and thought it was caused by the gap between these two counts. However, it turned out that the number of words in the word embedding was smaller than "Mstep_sample_topwords". I have fixed it.
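For instance, a minimal sketch of that fix, with "Mstep_sample_topwords" taken from the thread and `vocab_size` standing in for however many words the loaded embedding actually contains, could be:

```python
# Hedged sketch: cap Mstep_sample_topwords at the embedding's vocabulary size,
# since sampling more "top words" than the embedding contains triggers the error.
# `vocab_size` is a stand-in name here, not a specific API of this repository.
def clamp_topwords(Mstep_sample_topwords, vocab_size):
    return min(Mstep_sample_topwords, vocab_size)

# e.g. clamp_topwords(12000, 10000) -> 10000
```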

Thanks!

askerlee commented May 11, 2017 via email
