
Add BERTLMScore #3

Draft · wants to merge 12 commits into master

Conversation

@dldk-gael (Contributor) commented Apr 28, 2020

Hi,
I propose to add a BERT-based language-modeling score.

I implemented the idea described in Effective Sentence Scoring Method Using BERT for Speech Recognition. In that paper, the authors report better results from a bidirectional LM (compared to a unidirectional LM such as GPT-2).

The idea is to successively mask each token and retrieve the probability that BERT gives to that token in the output.

For now, I pass all the masked sentences to BERT in one single batch, so it may cause memory problems for very long sentences.
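
Concretely, the per-sentence scoring looks roughly like this (a minimal sketch rather than the PR's actual code; it assumes Hugging Face's BertForMaskedLM and, as described above, sends all masked copies in a single batch):

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def bert_tokens_log_prob(sentence: str) -> torch.Tensor:
    ids = tokenizer.encode(sentence, return_tensors="pt")[0]  # [CLS] t1 ... tn [SEP]
    n = ids.size(0) - 2  # number of real tokens
    # One masked copy per token: row i has token i+1 replaced by [MASK].
    batch = ids.repeat(n, 1)
    positions = torch.arange(1, n + 1)
    batch[torch.arange(n), positions] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(batch)[0]  # (n, seq_len, vocab_size)
    log_probs = logits.log_softmax(dim=-1)
    # Log-probability BERT gives to each original token at its masked position;
    # summing these yields the sentence's pseudo-log-likelihood.
    return log_probs[torch.arange(n), positions, ids[positions]]
```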

Also, I did not manage to run the tests with poetry as you asked.
I am not familiar with automatic testing and poetry, so I will have to look into it.

@simonepri simonepri changed the title add BERTLMScore Add BERTLMScore Apr 28, 2020
@simonepri (Owner)

> For now, I pass all the masked sentences to BERT in one single batch, so it may cause memory problems for very long sentences.

Can you please add a flag that lets the user set a batch size?

> Also, I did not manage to run the tests with poetry as you asked.

There is a bug in the branch you are using (fixed here: df67f70) that prevents you from running poetry.
Please run the following on your branch to fetch the fix:

git fetch upstream
git rebase upstream/master

> I am not familiar with automatic testing and poetry, so I will have to look into it.

All the tests run automatically every time you push changes to this PR.

If you click on Details and expand the line with the red cross, you will see exactly what the problem is.

@dldk-gael (Contributor, Author)

OK, I succeeded in getting black and poetry to work.
I also copied the test file you made for GPT-2, and it is interesting to see that it does not pass two of the tests:

I need to finish this project, but I do not have enough time.
I need to finish this project, and I do not have enough time.

and

I can walk there.
I can ambulate there.

The second one is more surprising.

I will add the flag option.

@dldk-gael (Contributor, Author)

I added the possibility to feed the masked sentences to the model in batches. I have also tried to add new test functions (to check that this does not change the score results); however, they are not picked up when I run poetry run task test (by the way, there is a small typo in the readme: it asks you to run poetry run task testS). I will look into it tomorrow.

@simonepri (Owner)

I would like to get #4 merged before reviewing this. Since we are changing the API, we will have to update this PR later anyway.

@simonepri simonepri self-requested a review April 30, 2020 15:00
@simonepri simonepri marked this pull request as draft April 30, 2020 22:19
@simonepri (Owner) commented May 16, 2020

Things to do here before we can merge this:

  • Refactor the code to match the new API
  • Update the tests to match the ones implemented for GPT-2

@dldk-gael (Contributor, Author)

Hi Simone, I agree and will work on that next week.

@simonepri (Owner)

Sure, no hurry! The comment was meant as a reminder for the future.

@dldk-gael (Contributor, Author)

Hi @simonepri,

I have begun working on adapting BERTScore to the new API, and I have a question about the batching part.
In BERTScore, you cannot directly use BatchedLMScorer, because you don't batch by sentence but by masked sentence.
For instance, suppose the user sets a batch size of 4 and inputs 2 sentences:

I like tennis
I like to eat potatoes

I will in fact feed 8 sentences to BERT:

MASK like tennis
I MASK tennis
I like MASK
MASK like to eat potatoes
I MASK to eat potatoes
I like MASK eat potatoes
I like to MASK potatoes
I like to eat MASK

And so my first batch will consist of:

MASK like tennis
I MASK tennis
I like MASK
MASK like to eat potatoes

and the second of:

I MASK to eat potatoes
I like MASK eat potatoes
I like to MASK potatoes
I like to eat MASK

So what I have done for now is the following:

  1. Encode every sentence provided by the user in one shot using BatchEncoding.
  2. Expand the input ids, replacing each sentence with its masked variants.
  3. Feed them to BERT in batches of batch_size.
  4. Re-split the results by sentence.

The code is not finished yet (not tested, not cleaned, ...), but just to give you an idea:
https://gist.github.com/dldk-gael/2025f0584f781716697463c62a9779ee
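
For reference, a minimal sketch of steps 1-4 above with cross-sentence batching (helper names are hypothetical, and padding/attention masks are omitted for brevity; the gist contains the actual work in progress):

```python
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def masked_variants(sentences):
    """Yield (sentence_index, masked_position, input_ids) for every token."""
    for i, sentence in enumerate(sentences):
        ids = tokenizer.encode(sentence)  # [CLS] t1 ... tn [SEP]
        for pos in range(1, len(ids) - 1):
            masked = list(ids)
            masked[pos] = tokenizer.mask_token_id
            yield i, pos, torch.tensor(masked)

def iter_batches(sentences, batch_size):
    """Group masked variants from *all* sentences into batches of batch_size,
    so one batch can mix variants coming from different sentences."""
    batch = []
    for item in masked_variants(sentences):
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
```

With batch_size = 4 and the two sentences above, the first batch holds the three variants of "I like tennis" plus the first variant of "I like to eat potatoes", matching the split shown earlier. Because the variants are generated lazily, only one batch of sequences needs to exist in memory at a time.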

My question is: in your opinion, what is the best way to proceed given this batching specificity? Do I simply override _tokens_log_prob from BatchedLMScorer?

@simonepri (Owner)

What I would do is the following:

First, you put your code inside a function called _tokens_log_prob_for_sentence.
Then, inside _tokens_log_prob_for_batch, you simply do a for loop that calls _tokens_log_prob_for_sentence for every sentence in the batch.

In _tokens_log_prob_for_sentence, you then generate the masked sentences and batch them according to the batch_size field (which is what you are already doing, I guess).

How does that sound?
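
A hypothetical skeleton of that structure (the BatchedLMScorer internals are assumed, and the per-sentence body is left as a placeholder):

```python
def _tokens_log_prob_for_batch(self, sentences):
    # Process one sentence at a time to keep the masking logic self-contained.
    return [self._tokens_log_prob_for_sentence(s) for s in sentences]

def _tokens_log_prob_for_sentence(self, sentence):
    # Generate the masked copies of `sentence` here and feed them to the
    # model in chunks of self.batch_size, as described above.
    ...
```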

PS: You don't want to tokenize all the sentences at once if you then process them independently. Doing so only introduces unnecessary padding. Just process each sentence one by one using encode_plus instead.

Notes on the code:

Here https://gist.github.com/dldk-gael/2025f0584f781716697463c62a9779ee#file-bert_score-py-L12-L23

You don't need this. Just pass return_special_tokens_mask=True when you call encode_plus, and then do the following:

tokens_mask = encoding["special_tokens_mask"] == 0
ids = encoding["input_ids"][tokens_mask]
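
For context, a self-contained version of that snippet (the tensors are built explicitly so the boolean indexing works regardless of the tokenizer's return types; the model name is illustrative):

```python
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

enc = tokenizer.encode_plus("I like tennis", return_special_tokens_mask=True)
input_ids = torch.tensor(enc["input_ids"])           # [CLS] i like tennis [SEP]
special = torch.tensor(enc["special_tokens_mask"])   # 1 at [CLS]/[SEP], 0 elsewhere

tokens_mask = special == 0       # True at regular-token positions
ids = input_ids[tokens_mask]     # input ids with [CLS]/[SEP] dropped
```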

@dldk-gael (Contributor, Author) commented May 25, 2020

By doing so, when the user inputs a lot of small sentences but uses a large batch size (for instance, because they have a lot of GPU memory), we won't fill the batch on each forward pass.
And sometimes the problem will be the reverse. That's why I think one batch should not necessarily contain only masked sentences coming from a single sentence.

@simonepri (Owner)

Yes, that's true, but it also makes the logic much easier to maintain.
I would rather implement a simple version first, test it, and then improve it if it's worth it.

Anyway, if you find a clean way to batch multiple sentences, then sure, go ahead and do it.
If you do so, just be careful not to build all the masked sequences at once.
Say the user inputs 64 sentences, each 64 tokens long: you don't want to store 64² sequences of tokens in memory just to process up to batch_size sequences at a time.

@dldk-gael (Contributor, Author)

Hi @simonepri,
I gave the second option a try, because BERT's scoring method requires many more operations than GPT-2's, so I want to use the full batch capacity in all cases.
It is working, even if there are still some tests to update (and some minor changes to make, such as handling the empty-string case correctly), but because I am not respecting the BatchedLMScorer structure here, I would like your opinion on this draft before going further.

@dldk-gael (Contributor, Author) commented Jun 11, 2020

Hi @simonepri, did you have time to take a look?
For your information, I next plan to compare the different scorers using the BLiMP dataset. I don't know if you have used this dataset before (https://arxiv.org/pdf/1912.00582.pdf), but it is really related to what we are trying to achieve here! (And it is available on huggingface/nlp! :) )

@dldk-gael (Contributor, Author)

Quick update: I ran some tests on the BLiMP dataset. With the GPT-2 (base) score I get 0.9 accuracy vs. 0.85 with the BERT score. So I agree we can go back to a more easy-to-maintain version of the batch handling within BERT.
(By the way, I observe a huge drop when using the hmean reduction strategy, but I did not understand the idea behind that one.)

@simonepri (Owner)

@dldk-gael Sorry, I don't have time at the moment to review the code.
I should be able to come back to you in about a week. Ping me in case I forget.

> but I did not understand the idea behind that one

What do you mean? It's just the harmonic mean of the individual token probabilities. I added it just for completeness (i.e., to have all three Pythagorean means). It wasn't meant to be particularly useful.
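
For reference, over the token probabilities $p_1, \dots, p_n$ the three reductions are:

$$\mathrm{amean} = \frac{1}{n}\sum_{i=1}^{n} p_i, \qquad \mathrm{gmean} = \Big(\prod_{i=1}^{n} p_i\Big)^{1/n}, \qquad \mathrm{hmean} = \frac{n}{\sum_{i=1}^{n} 1/p_i}.$$

Since the harmonic mean is dominated by the smallest $p_i$, a single low-probability token can sink the whole sentence score, which may explain the drop observed above.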

@dldk-gael (Contributor, Author)

Hi @simonepri, are you still planning to work on this one?
