Is the embedding model trainable during the training process? #42
Hi, thanks for the question, and for carefully studying the code! We have experimented with various ways of initializing the word embeddings: training_mode='emb' means random initialization, while training_mode='e2e' means training the embeddings end-to-end. For all the main experiments in the paper (except the ablations) we use --training_mode='e2e' to train the embeddings end-to-end. Inside the training code, the embedding step happens here:
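Roughly speaking, the end-to-end mode amounts to something like the sketch below (the names, sizes, and shapes are hypothetical, not the repository's exact code): the embedding table is an ordinary model parameter, its output is the x_start of the diffusion process, and the diffusion loss gradient therefore updates it.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only.
vocab_size, in_channels = 821, 16

# The word embedding is an ordinary learnable parameter of the model.
word_embedding = nn.Embedding(vocab_size, in_channels)

def embed_and_noise(input_ids, t, alphas_cumprod):
    """Map discrete tokens to continuous vectors, then add diffusion noise.

    Because word_embedding.weight requires grad and x_start feeds the diffusion
    loss, the gradient flows back into the embedding table: that is what
    training the embeddings "end-to-end" means here.
    """
    x_start = word_embedding(input_ids)            # (batch, seq_len, in_channels)
    noise = torch.randn_like(x_start)
    a_bar = alphas_cumprod[t].view(-1, 1, 1)       # (batch, 1, 1)
    x_t = a_bar.sqrt() * x_start + (1.0 - a_bar).sqrt() * noise
    return x_start, x_t, noise
```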
For decoding, we actually load the trained embedding, as shown in
Hope this helps.
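For reference, decode-time rounding with a trained embedding typically looks something like the following sketch: the trained table is read from the checkpoint and each denoised vector is mapped to its nearest embedding row. The checkpoint key name here is an assumption; the actual loading code is in batch_decode.py / text_sample.py.

```python
import torch

# Load the trained embedding table from the checkpoint used for decoding.
# The key name "word_embedding.weight" is an assumption for illustration.
state = torch.load("ema_0.9999_200000.pt", map_location="cpu")
emb_weight = state["word_embedding.weight"]         # (vocab_size, in_channels)

def round_to_tokens(x_denoised):
    """Map each denoised vector to the id of the nearest trained embedding row."""
    # x_denoised: (batch, seq_len, in_channels)
    diffs = x_denoised.unsqueeze(-2) - emb_weight    # (batch, seq_len, vocab_size, in_channels)
    dists = diffs.pow(2).sum(-1)                     # squared Euclidean distances
    return dists.argmin(dim=-1)                      # (batch, seq_len) token ids
```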
Thanks for your reply. I got a better understanding of the code from your response. I believe your code would be more readable if you explained it a bit more! Previously, I thought 'e2e' meant 'English2English' (forgive me).
However, I wonder why you load the weight of 'word_embedding' into the weight of 'lm_head'. As far as I know, the dimension of 'word_embedding' is (vocab_size, in_channels), while the dimension of 'lm_head' is (in_channels, vocab_size). Should the parameters of 'lm_head' be learnable instead of using the same weights as 'word_embedding'? Can you please give me some hints regarding this implementation? Thanks a lot.
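For what it's worth, here is a small PyTorch sketch of the shapes involved (standard PyTorch conventions with hypothetical sizes, not the repo's code): nn.Linear stores its weight as (out_features, in_features), so a head mapping in_channels to vocab_size actually holds a (vocab_size, in_channels) matrix, the same shape as the embedding table, which is what makes the weight sharing possible.

```python
import torch.nn as nn

# Illustration of the shapes involved (standard PyTorch conventions; hypothetical sizes).
vocab_size, in_channels = 821, 16

word_embedding = nn.Embedding(vocab_size, in_channels)
lm_head = nn.Linear(in_channels, vocab_size, bias=False)

# nn.Linear keeps its weight as (out_features, in_features) = (vocab_size, in_channels),
# the same shape as the embedding table, so one matrix can be shared between the two.
print(word_embedding.weight.shape)  # torch.Size([821, 16])
print(lm_head.weight.shape)         # torch.Size([821, 16])

# Weight tying: the head reuses the embedding parameters (and they remain learnable).
lm_head.weight = word_embedding.weight
```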
Hi, thanks for providing the code. However, I am confused regarding the embedding layer.
In the train.py script, the model weight is loaded from ema_0.9999_200000.pt for the 'roc' dataset. This indicates that the embedding layer uses pre-trained parameters. But for other datasets, or for experiment = 'random', the embedding layer is randomly initialized. So, first of all, I guess that this embedding model is trained during the training process. Am I right?
Nevertheless, when we decode the text batches and sample texts with batch_decode.py and text_sample.py, it turns out that the embedding model loads the weights of the randomly initialized model, which means that the embedding layer is not trained during the training process. This is very weird, isn't it?
To summarize, I am uncertain about why you do not load a well-trained embedding layer when you decode the batches but instead adopt a randomly initialized one.
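One way to test this directly is to build the model (with its randomly initialized embedding), record the embedding weights, then load the checkpoint and see whether load_state_dict replaces them. The stand-in model and key names below are assumptions for illustration, not the repository's exact classes:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the diffusion model, just to illustrate the check;
# the real model class and checkpoint keys live in the repo and may differ.
class TinyModel(nn.Module):
    def __init__(self, vocab_size=821, in_channels=16):
        super().__init__()
        self.word_embedding = nn.Embedding(vocab_size, in_channels)

model = TinyModel()                                   # embedding randomly initialized here
before = model.word_embedding.weight.detach().clone()

# If the checkpoint stores the trained embedding table, loading it should
# overwrite the random initialization above.
state = torch.load("ema_0.9999_200000.pt", map_location="cpu")
model.load_state_dict(state, strict=False)

after = model.word_embedding.weight.detach()
print("embedding replaced by checkpoint:", not torch.equal(before, after))
```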