Reproducibility on GLUE #9

Open
paul-grundmann opened this issue Jul 31, 2021 · 6 comments

@paul-grundmann

Hi,
I am currently working on reproducing the results from the paper on the GLUE benchmark. However, my current results are very far from those reported in the paper. Have you already run experiments in this direction, or were you able to reproduce the scores?

I have a running implementation compatible with Huggingface if you want to try it out:
https://github.com/paul-grundmann/transformers/blob/fnet/src/transformers/models/fnet/modeling_fnet.py

In my case, it seems that the model steadily learns on the masked language modeling task but does not improve on downstream tasks at all, even after 200k pre-training steps.

@erksch
Owner

erksch commented Jul 31, 2021

No, I have not yet evaluated on downstream tasks, but it's definitely in the pipeline. Maybe I can get some runs going this weekend. That said, I did some fine-tuning on some private tasks and it did pretty well, so I don't expect many problems. What implementation are you using for fine-tuning on GLUE?

PS: I see you are a fellow Berliner working on FNet; maybe we can connect outside of GitHub some time :)

@erksch
Owner

erksch commented Jul 31, 2021

Also, are you planning to contribute FNet to HuggingFace? I think that would be a cool thing, too.

@paul-grundmann
Author

> I see you are a fellow Berliner working on FNet; maybe we can connect outside of GitHub some time :)

Yes sure :)

My plan was to use the model for some downstream tasks with long documents. I just thought it would be easier to implement everything in the Huggingface ecosystem to leverage existing implementations of GLUE and such. But yes, if everything goes well, then of course it would be a great idea to contribute the model and source code to Huggingface.

For evaluation, I used the run_glue.py script in the examples with the following parameters.

python run_glue.py \
  --task_name qnli \
  --model_name_or_path ./fnet \
  --tokenizer_name bert-base-uncased \
  --output_dir glue \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 64 \
  --learning_rate 1e-5 \
  --num_train_epochs 3 \
  --dataloader_num_workers 8 \
  --max_seq_length 128

I tested SST2, CoLA, and QNLI, but the model did not improve on any of those tasks, neither when pre-trained with my custom pre-training scripts nor with Huggingface's run_mlm.py.

But of course I cannot rule out that it is due to my implementation...

@erksch
Owner

erksch commented Aug 1, 2021

I guess you are not using the official checkpoint converted to fit your Huggingface model, since you also seem to be using a different tokenizer. So I conclude that you ran a pre-training from scratch. On what dataset? For how long? What was the MLM score? Maybe the model is just not trained enough yet to handle fine-tuning.

@erksch
Owner

erksch commented Aug 1, 2021

I just ran SST2 from the FNet base checkpoint converted to PyTorch and it learned pretty smoothly.
But I only got 0.89 validation accuracy as opposed to the 0.95 stated in the paper for FNet base.

Epochs: 3
Learning rate: 1.5e-5
Batch size: 16 for all sets
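
For reference, a minimal sketch of what an equivalent SST2 run with these hyperparameters could look like using the Huggingface Trainer. This is an illustrative assumption, not the exact code behind the run above; the checkpoint path ./fnet and the bert-base-uncased tokenizer are placeholders for a Huggingface-compatible FNet checkpoint.

import numpy as np
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Placeholder paths: a Huggingface-compatible FNet checkpoint and tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("./fnet", num_labels=2)

# GLUE SST2: single-sentence classification, tokenized to max length 128.
dataset = load_dataset("glue", "sst2").map(
    lambda batch: tokenizer(batch["sentence"], truncation=True, max_length=128),
    batched=True,
)

def compute_metrics(eval_pred):
    # Simple validation accuracy from the classifier logits.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}

args = TrainingArguments(
    output_dir="glue-sst2",
    num_train_epochs=3,              # epochs: 3
    learning_rate=1.5e-5,            # learning rate: 1.5e-5
    per_device_train_batch_size=16,  # batch size: 16 for all sets
    per_device_eval_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.evaluate()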

@paul-grundmann
Author

Hi Erik, I just ran some internal benchmarks on a custom pre-trained FNet base (12 layers). In doing so, I realized that I had forgotten the attention mask in my implementation and that I had some padded inputs in both my training and my downstream tasks. So I adjusted my implementation to simply multiply the attention mask with the embeddings in the Fourier layer. This seems to be working. GLUE is still significantly worse than with a normal BERT base, but the results are no longer purely random (~84% accuracy on SST2, ~11% correlation on CoLA).
In our internal retrieval benchmarks it is also 50%-75% as good as BERT, depending on the task. If you want to play around with it, I can send you the weights; they should work with the Huggingface implementation. The model was pre-trained for 125k steps with a learning rate of 7e-4 and a batch size of 2048 on an English Wikipedia and PubMed corpus.
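
For illustration, a minimal sketch of the masking fix described above, assuming a PyTorch FNet-style Fourier sublayer. The function name and shapes are placeholders and this is not the exact code from the linked fork.

import torch

def fourier_sublayer(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states:  (batch, seq_len, hidden)
    # attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding.
    # Zero out padded positions so they do not leak into the Fourier mixing.
    hidden_states = hidden_states * attention_mask.unsqueeze(-1).to(hidden_states.dtype)
    # FNet token mixing: FFT over the hidden dimension, then over the sequence
    # dimension, keeping only the real part.
    return torch.fft.fft(torch.fft.fft(hidden_states, dim=-1), dim=-2).real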
