
Poor results on Squad 1.0 #4

Open
rishabhjoshi opened this issue May 9, 2021 · 6 comments

@rishabhjoshi

Hi, I wanted to augment the SQuAD 1.0 dataset (answerable questions, not unanswerable ones). I trained a standard RoBERTa MRC model using the transformers library, which achieved 86.16 Exact Match and 92.31 F1 on the validation data.
I also trained an autoencoder for 100 epochs as described, and the loss came down to about 0.04 with essentially perfect reconstruction.
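For reference, here is a minimal sketch of how those EM/F1 numbers can be computed with the Hugging Face `evaluate` library; this is not the authors' evaluation script, just one standard way to get the two metrics:

```python
# Minimal sketch: computing SQuAD-style EM/F1 with the Hugging Face
# `evaluate` library. The example id and answer text are toy values.
import evaluate

squad_metric = evaluate.load("squad")

predictions = [{"id": "56be4db0acb8001400a502ec",
                "prediction_text": "Denver Broncos"}]
references = [{"id": "56be4db0acb8001400a502ec",
               "answers": {"text": ["Denver Broncos"], "answer_start": [177]}}]

results = squad_metric.compute(predictions=predictions, references=references)
print(results)  # e.g. {'exact_match': 100.0, 'f1': 100.0}
```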

I then tried running CRQDA to augment 30,000 samples. I removed the "NEG" parameter and added "SPAN = True" and "para", with the same hyperparameters (epsilon) that you used. Out of 30,000 samples, only 1,800 had generated questions that passed the selection filter (Jaccard similarity >= 0.3).
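For context, the Jaccard filter mentioned above boils down to a token-set overlap test. A minimal sketch (the 0.3 threshold matches the one used here; whitespace tokenization is an assumption, and the actual CRQDA code may tokenize differently):

```python
# Minimal sketch of the Jaccard-similarity selection filter: keep a
# generated question only if its token-set overlap with the original
# question is at least the threshold.
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def select(original: str, generated: str, threshold: float = 0.3) -> bool:
    return jaccard(original, generated) >= threshold

# Example:
print(select("what year did the war end", "which year did the war end"))  # True
```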

After manual inspection, I see that most of the generated questions are gibberish (especially those with Jaccard similarity <= 0.8).
Can you share some insight into what might be going wrong, and why the results are so poor given that the MRC and autoencoder models train perfectly?

Any help would be greatly appreciated!
Thanks!

@dayihengliu
Owner


Here are some suggestions that might help you:

  1. "the generated questions are gibberish" -> Your autoencoder might be overfitted. You could try pretraining the autoencoder on a large-scale Wikipedia corpus.
  2. "Out of 30000 samples, only 1800 samples had questions generated which were selected (the jaccard was >= 0.3)." -> It seems the modification step size used at the inference stage is too large; you can tune this hyperparameter (see the sketch after this list).
  3. You could also try setting "SPAN = False" and only adding "para".
  4. When fine-tuning the MRC model on the augmented dataset, you may need to adjust the "warmup steps" hyperparameter to account for the increased amount of training data.
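To illustrate point 2: CRQDA-style inference takes gradient steps on the question embedding under a frozen MRC model, and the step size controls how far each revision moves. A rough sketch of that loop follows; all names and the loss function are illustrative, not the repository's actual API:

```python
# Rough sketch of gradient-based embedding editing with a tunable step
# size (the hyperparameter referred to in point 2). All names here are
# illustrative; see the CRQDA repository for the actual implementation.
import torch

def edit_embedding(z, mrc_loss_fn, step_size=1e-3, n_steps=10):
    """Take small gradient steps on the question embedding `z` to move
    the (frozen) MRC model's loss in the desired direction. A step size
    that is too large pushes `z` off the manifold the autoencoder was
    trained on, so the decoded question degenerates into gibberish --
    consistent with the symptoms reported above."""
    z = z.clone().detach().requires_grad_(True)
    for _ in range(n_steps):
        loss = mrc_loss_fn(z)           # frozen MRC model, differentiable in z
        grad, = torch.autograd.grad(loss, z)
        with torch.no_grad():
            z -= step_size * grad       # smaller step_size => more conservative edits
    return z.detach()                   # decode z back into a question afterwards
```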

@rishabhjoshi
Author

Hi, 1) we trained the autoencoder on the 2M corpus that you provide, replacing the "train_file" in this script: https://github.com/dayihengliu/CRQDA/blob/master/crqda/run_train.sh.
2) I will experiment with lower step sizes. Personally, I don't see why the step-size hyperparameter would differ for the SQuAD 1.0 dataset, considering SQuAD 2.0 is just SQuAD 1.0 plus some unanswerable questions; however, I will still experiment with more step sizes.
3) I tried SPAN = False as well, keeping only "para", and still got poor results.
4) I have not reached this step, as the number of augmented data points is too low (about 1,000 out of a possible 30,000 attempts).

Would it be possible for you to release your trained autoencoder?
Also, would it be possible to release the augmented dataset including the other samples (not just the unanswerable ones)?

Thanks!

@dayihengliu
Owner


This work was done during my internship at Microsoft, but I have since left Microsoft. So far, I can only find the augmented unanswerable questions and the well-trained RoBERTa SQuAD 2.0 MRC model. Regarding the autoencoder, you can refer to https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/LanguageModeling/BERT#quick-start-guide to download and preprocess the Wikipedia dataset.
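If the NVIDIA pipeline is inconvenient, one alternative source of Wikipedia text (not the pipeline the authors used) is a preprocessed dump from the Hugging Face `datasets` hub:

```python
# Alternative way to obtain Wikipedia text for autoencoder pretraining,
# via Hugging Face `datasets`. This is NOT the NVIDIA pipeline referenced
# above; it is just a convenient substitute source of raw article text.
from datasets import load_dataset

# The "20220301.en" snapshot is pre-built, so no Apache Beam is needed.
wiki = load_dataset("wikipedia", "20220301.en", split="train")
print(wiki[0]["title"])
print(wiki[0]["text"][:200])  # raw article text to feed the autoencoder
```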

@TingFree

@rishabhjoshi Hi, did you solve this problem? I want to build on CRQDA, but judging from your description the code may be hard to run. If you solved it, could you share the augmented SQuAD 1.1 (answerable) data with me? Thanks!

@rishabhjoshi
Author

@TingFree I was not able to reproduce the results on the SQuAD 1.0 dataset. I was hoping to get the authors' MRC model and autoencoder (although the MRC model and autoencoder I trained myself are quite good). I tried multiple hyperparameter settings but could never match the results the authors report for SQuAD 2.0.

@TingFree

@rishabhjoshi Hi, have you reproduced CRQDA on SQuAD 2.0? I mean, with the same results as in the paper.
