
Poor results on Squad 1.0 #4

Open
rishabhjoshi opened this issue May 9, 2021 · 6 comments

@rishabhjoshi

Hi, I wanted to augment the SQuAD 1.0 dataset (answerable questions, not unanswerable ones). I trained a standard RoBERTa MRC model using the transformers library, which achieved 86.16 Exact Match and 92.31 F1 on the validation data.
I also trained an autoencoder for 100 epochs as described, and the loss came down to about 0.04 with essentially perfect reconstruction.
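For reference, here is a minimal sketch of how those EM/F1 numbers can be computed with the Hugging Face `evaluate` library; this is not the authors' evaluation script, just one standard way to get the two metrics:

```python
# Minimal sketch: computing SQuAD-style EM/F1 with the Hugging Face
# `evaluate` library. The example id and answer text are toy values.
import evaluate

squad_metric = evaluate.load("squad")

predictions = [{"id": "56be4db0acb8001400a502ec",
                "prediction_text": "Denver Broncos"}]
references = [{"id": "56be4db0acb8001400a502ec",
               "answers": {"text": ["Denver Broncos"], "answer_start": [177]}}]

results = squad_metric.compute(predictions=predictions, references=references)
print(results)  # e.g. {'exact_match': 100.0, 'f1': 100.0}
```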

I then tried running CRQDA to augment 30,000 samples. I removed the "NEG" parameter and added "SPAN = True" and "para", with the same hyperparameters (epsilon) that you used. Out of 30,000 samples, only 1,800 had generated questions that passed the selection filter (Jaccard similarity >= 0.3).
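For context, the Jaccard filter mentioned above boils down to a token-set overlap test. A minimal sketch (the 0.3 threshold matches the one used here; whitespace tokenization is an assumption, and the actual CRQDA code may tokenize differently):

```python
# Minimal sketch of the Jaccard-similarity selection filter: keep a
# generated question only if its token-set overlap with the original
# question is at least the threshold.
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def select(original: str, generated: str, threshold: float = 0.3) -> bool:
    return jaccard(original, generated) >= threshold

# Example:
print(select("what year did the war end", "which year did the war end"))  # True
```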

After manual inspection, I see that most of the generated questions are gibberish (especially those with Jaccard similarity <= 0.8).
Can you share some insight into what might be going wrong, and why the results are so poor given that the MRC and autoencoder models train perfectly?

Any help would be greatly appreciated!
Thanks!

@dayihengliu
Owner


Here are some suggestions that might help you:

  1. "the generated questions are gibberish" -> Your autoencoder might be overfitted. You could try pretraining the autoencoder on a large-scale Wikipedia corpus.
  2. "Out of 30000 samples, only 1800 samples had questions generated which were selected (the jaccard was >= 0.3)." -> It seems the modification step size used at the inference stage is too large; you can tune this hyperparameter (see the sketch after this list).
  3. You could also try setting "SPAN = False" and only adding "para".
  4. When fine-tuning the MRC model on the augmented dataset, you may need to adjust the "warmup steps" hyperparameter to account for the increased amount of training data.
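To illustrate point 2: CRQDA-style inference takes gradient steps on the question embedding under a frozen MRC model, and the step size controls how far each revision moves. A rough sketch of that loop follows; all names and the loss function are illustrative, not the repository's actual API:

```python
# Rough sketch of gradient-based embedding editing with a tunable step
# size (the hyperparameter referred to in point 2). All names here are
# illustrative; see the CRQDA repository for the actual implementation.
import torch

def edit_embedding(z, mrc_loss_fn, step_size=1e-3, n_steps=10):
    """Take small gradient steps on the question embedding `z` to move
    the (frozen) MRC model's loss in the desired direction. A step size
    that is too large pushes `z` off the manifold the autoencoder was
    trained on, so the decoded question degenerates into gibberish --
    consistent with the symptoms reported above."""
    z = z.clone().detach().requires_grad_(True)
    for _ in range(n_steps):
        loss = mrc_loss_fn(z)           # frozen MRC model, differentiable in z
        grad, = torch.autograd.grad(loss, z)
        with torch.no_grad():
            z -= step_size * grad       # smaller step_size => more conservative edits
    return z.detach()                   # decode z back into a question afterwards
```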

@rishabhjoshi
Author

Hi, 1) we trained the autoencoder on the 2M corpus that you provide, replacing the "train_file" in this script: https://github.com/dayihengliu/CRQDA/blob/master/crqda/run_train.sh.
2) I will experiment with lower step sizes. Personally, I don't see why the step-size hyperparameter would differ for the SQuAD 1.0 dataset, considering SQuAD 2.0 is just SQuAD 1.0 plus some unanswerable questions; however, I will still experiment with more step sizes.
3) I tried SPAN = False as well, keeping only "para", and still got poor results.
4) I have not reached this step, as the number of augmented data points is too low (about 1,000 out of a possible 30,000 attempts).

Would it be possible for you to release your trained autoencoder?
Also, would it be possible to release the augmented dataset including the other samples (not just the unanswerable ones)?

Thanks!

@dayihengliu
Owner


This work was done during my internship at Microsoft, but I have since left Microsoft. So far, I can only find the augmented unanswerable questions and the well-trained RoBERTa SQuAD 2.0 MRC model. Regarding the autoencoder, you can refer to https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/LanguageModeling/BERT#quick-start-guide to download and preprocess the Wikipedia dataset.
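If the NVIDIA pipeline is inconvenient, one alternative source of Wikipedia text (not the pipeline the authors used) is a preprocessed dump from the Hugging Face `datasets` hub:

```python
# Alternative way to obtain Wikipedia text for autoencoder pretraining,
# via Hugging Face `datasets`. This is NOT the NVIDIA pipeline referenced
# above; it is just a convenient substitute source of raw article text.
from datasets import load_dataset

# The "20220301.en" snapshot is pre-built, so no Apache Beam is needed.
wiki = load_dataset("wikipedia", "20220301.en", split="train")
print(wiki[0]["title"])
print(wiki[0]["text"][:200])  # raw article text to feed the autoencoder
```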

@TingFree

@rishabhjoshi Hi, did you solve this problem? I want to build on CRQDA, but judging from your description the code may be hard to run. If you solved it, could you share the augmented SQuAD 1.1 (answerable) data with me? Thanks!

@rishabhjoshi
Author

@TingFree I was not able to reproduce the results on the SQuAD 1.0 dataset. I was hoping to get the authors' MRC model and autoencoder (although the MRC model and autoencoder I trained myself are quite good). I tried multiple hyperparameter settings but could never match the results the authors report for SQuAD 2.0.

@TingFree

@rishabhjoshi Hi, have you reproduced CRQDA on SQuAD 2.0? I mean, with the same results as in the paper.
