BERT model prediction is slow, consider a more practical implementation #197

Open
alex-movila opened this issue Jul 10, 2019 · 18 comments

@alex-movila

There is a new approach from Alibaba described here.
ALIBABA AI BEATS HUMANS IN READING-COMPREHENSION TEST

A Deep Cascade Model for Multi-Document Reading Comprehension

It seems faster with better scalability. Not sure there is code available.

@fmikaelian
Collaborator

Hi @alex-movila

Thank you for recommending this implementation!

Speed and scalability are indeed very important for using cdQA in production. At this stage, our primary focus is to make end-to-end question answering easy for our users, but we might need to dig deeper into this new approach in the future, probably once Alibaba has released some code (I couldn't find it yet).

We'll follow their updates on the topic closely 😉

@alex-movila
Author

There are some other tips here to make BERT better suited for production:
https://hanxiao.github.io/2019/01/02/Serving-Google-BERT-in-Production-using-Tensorflow-and-ZeroMQ/

@andrelmfarias
Collaborator

andrelmfarias commented Jul 17, 2019

Thanks for the tips, @alex-movila!

However, I am afraid they are not compatible with cdQA, for two reasons:

  1. These tips are for the TensorFlow version of BERT, while we use the PyTorch version provided by Hugging Face.

  2. The bert-as-service project has one main feature: mapping a variable-length sentence to a fixed-length vector. This is a good feature for Information Retrieval (it could be used for the Retriever component of cdQA, for example - for now we use a TF-IDF approach; see the sketch below). But it is not useful for the Question Answering part of the pipeline (i.e. the Reader), where we apply the BertForQuestionAnswering model.

I will take a deeper look at it to see if I can extract anything useful for our package.
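
For illustration, a minimal sketch of such an embedding-based Retriever could look like the following, assuming a bert-as-service server has already been launched with `bert-serving-start` (the documents and question below are placeholders):

```python
# Sketch: dense retrieval with bert-as-service embeddings instead of TF-IDF.
# Assumes a bert-serving-start server is already running on the default ports.
import numpy as np
from bert_serving.client import BertClient
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "BERT is a transformer-based language model.",
    "TF-IDF ranks documents using term-frequency statistics.",
    "cdQA combines a retriever and a reader for question answering.",
]

bc = BertClient()                   # connects to the running server
doc_vectors = bc.encode(documents)  # one fixed-length vector per document

def retrieve(question, top_k=2):
    """Return the top_k documents most similar to the question."""
    question_vector = bc.encode([question])
    scores = cosine_similarity(question_vector, doc_vectors)[0]
    best = np.argsort(scores)[::-1][:top_k]
    return [(documents[i], float(scores[i])) for i in best]

print(retrieve("How does cdQA answer questions?"))
```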

@alex-movila
Author

alex-movila commented Jul 17, 2019

I think the idea of quantization + pruning could be useful for making BERT smaller.
Also, for production there is a need to handle concurrent requests and to load the model only once, at initialization.
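
As a rough illustration of the quantization idea (not something cdQA provides today), PyTorch's dynamic quantization can convert a BERT reader's linear layers to int8 weights for CPU inference; the checkpoint name below is just a public SQuAD-finetuned example:

```python
# Sketch: dynamic quantization of a BERT QA model for CPU inference.
# Assumes a recent PyTorch (with torch.quantization) and the transformers package.
import torch
from transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained(
    "bert-large-uncased-whole-word-masking-finetuned-squad"
)
model.eval()

# Convert the weights of all nn.Linear modules to int8; activations stay in float32.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# quantized_model can now be used in place of model for CPU inference;
# the size reduction and speedup would need to be benchmarked for cdQA.
```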

Here is another idea, which I think should be easy: implement the reader with AllenNLP (see the sketch below):
https://allenai.github.io/bi-att-flow/
https://allennlp.org/models

It seems to be about two lines of code; I will try it myself.
The problem remains, though, that it was not trained on non-answerable questions.
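
The "two lines" would presumably look something like this; the BiDAF archive URL is illustrative (taken from the AllenNLP model zoo) and may have moved:

```python
# Sketch: using an AllenNLP BiDAF predictor as a reader.
# The model archive URL is illustrative; check https://allennlp.org/models for the current one.
from allennlp.predictors.predictor import Predictor

predictor = Predictor.from_path(
    "https://allennlp.s3.amazonaws.com/models/bidaf-model-2017.09.15-charpad.tar.gz"
)

result = predictor.predict(
    question="What is cdQA?",
    passage="cdQA is an end-to-end closed-domain question answering system.",
)
print(result["best_span_str"])  # the extracted answer span
```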

@alex-movila
Author

Related:
Here is another paper, for the XLM model, which is available at huggingface:
Large Memory Layers with Product Keys (https://arxiv.org/abs/1907.05242)
"outperforms a baseline transformer model with 24 layers, while being twice faster at inference time"

@jacobdanovitch

FWIW, using `QAPipeline(reader='models/bert_qa_vGPU-sklearn.joblib', predict_batch_size=128, verbose_logging=True).to('cuda')` gives me very reasonable inference time over a pretty large set of documents.

Obviously this might not be feasible for everyone; I'm just running this in Colab.

@alex-movila
Author

Well, we must consider production, where we have 1000 users doing inference concurrently.
Also, not everyone has a GPU.

@fmikaelian
Collaborator

I think if we can achieve #209 and make cdQA modular (having the possibility to choose different retriever/reader implementations), hopefully everyone will be able to find a solution that fits their specific constraints.

@jacobdanovitch

Well, we must consider production, where we have 1000 users doing inference concurrently.
Also, not everyone has a GPU.

For sure on the first point, it obviously doesn't scale. Just wanted to point out that it does work on some level.

That said, I think it might be reasonable to expect the availability of a GPU, even if not a large one. I don't know much about deploying these models, but would inference on CPU really be feasible?

@andrelmfarias
Collaborator

andrelmfarias commented Jul 22, 2019

but would inference on CPU really be feasible?

Currently, inference on CPU is feasible but slow (about 10 to 20s per inference depending on the question and the CPU).

We are thinking about some solutions to improve inference time on CPU, although it will probably take some time for these solutions to be implemented.

If you are interested in contributing to cdQA by working on this particular issue, that would be very helpful and we could implement the solutions sooner 😃

@lewtun

lewtun commented Oct 3, 2019

Would DistilBERT be one way to achieve faster inference? Hugging Face have already fine-tuned it on SQuAD v1.1: https://huggingface.co/transformers/model_doc/distilbert.html#distilbertforquestionanswering

In their blog post they quote a 60% speedup in inference time compared to BERT, and that's on a CPU.
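
As a standalone sketch (independent of cdQA, and assuming a recent version of the transformers library with the pipeline API), the distilled SQuAD checkpoint can be queried like this:

```python
# Sketch: standalone question answering with the DistilBERT SQuAD checkpoint.
# Assumes a recent version of the transformers library with the pipeline API.
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="distilbert-base-uncased-distilled-squad",
)

result = qa(
    question="Why is DistilBERT attractive for production?",
    context="DistilBERT is a distilled version of BERT that is smaller and "
            "faster at inference time while keeping most of BERT's accuracy.",
)
print(result["answer"], result["score"])
```

Plugging this into cdQA would still require matching the Reader interface, so this is only an illustration of the model itself.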

@alex-movila
Author

Now we have ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations.

@andrelmfarias
Collaborator

Would DistilBERT be one way to achieve faster inference? Hugging Face have already fine-tuned it on SQuAD v1.1: https://huggingface.co/transformers/model_doc/distilbert.html#distilbertforquestionanswering

In their blog post they quote a 60% speedup in inference time compared to BERT, and that's on a CPU.

Yes, it's in our plans to add support for DistilBERT and other Transformer-based models as well. We will be adding them soon.

There's also TinyBERT: https://arxiv.org/abs/1909.10351

If they ever publish the model, we can add it to cdQA too.

@lewtun

lewtun commented Oct 5, 2019

Thanks for the information @andrelmfarias - is the general procedure for adding new models the one listed in this PR?

@morningstar899

There is a new approach from Alibaba described here.
ALIBABA AI BEATS HUMANS IN READING-COMPREHENSION TEST

A Deep Cascade Model for Multi-Document Reading Comprehension

It seems faster with better scalability. Not sure there is code available.

Google just released "PAWS: Paraphrase Adversaries from Word Scrambling"; maybe it can benefit cdQA as well?
https://github.com/google-research-datasets/paws

@Valdegg

Valdegg commented Mar 16, 2020

Any news?

I've been testing cdQA for production, but it takes around 7 seconds to answer a question, which is OK but kind of annoying for the user.

Can I add a faster model, like DistilBERT?

@ehutt

ehutt commented Mar 17, 2020

Do you know where the bottleneck is? Is it the QA inference with BERT that is slow? What about the retriever step?

I know you are using TF-IDF, but have you experimented with other representations at all?

@ehutt

ehutt commented Mar 17, 2020

@andrelmfarias

We are thinking about some solutions to improve inference time on CPU, although it will probably take some time for these solutions to be implemented.

If you are interested in contributing to cdQA by working on this particular issue, that would be very helpful and we could implement the solutions sooner 😃

Certainly not everyone will be interested in using a GPU for inference, but I am!

I have been experimenting with some pretrained models from the NVIDIA NeMo (Neural Modules) framework - they are based on Hugging Face pretrained BERT checkpoints but optimized for fine-tuning and inference on GPUs. Since they are based on the Hugging Face models and use a PyTorch backend, I wonder if they could be easily plugged into cdQA as a Reader?

Any idea where to start if I want to try this? I have a couple of engineers who might be interested in making this happen.
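
Purely as a hypothetical starting point (the class and method names below are assumptions, not cdQA's actual Reader API, and a Hugging Face pipeline stands in for a NeMo checkpoint since the NeMo loading code would depend on their own API), a wrapper might look like this; the real work would be aligning it with the interface defined in the cdQA source:

```python
# Hypothetical sketch of a reader wrapper; the sklearn-style fit/predict interface
# shown here is an assumption and would need to match cdQA's actual Reader API.
from sklearn.base import BaseEstimator
from transformers import pipeline

class HuggingFaceReader(BaseEstimator):
    """Wraps a Hugging Face QA pipeline behind an sklearn-style estimator."""

    def __init__(self, model_name="distilbert-base-uncased-distilled-squad", device=-1):
        self.model_name = model_name
        self.device = device  # -1 = CPU, 0 = first GPU
        self._qa = pipeline("question-answering", model=model_name, device=device)

    def fit(self, X=None, y=None):
        # The underlying checkpoint is already fine-tuned on SQuAD, so fit is a no-op here.
        return self

    def predict(self, question, paragraphs):
        """Score every candidate paragraph and return the highest-scoring answer."""
        candidates = [self._qa(question=question, context=p) for p in paragraphs]
        return max(candidates, key=lambda c: c["score"])

# Illustrative usage:
# reader = HuggingFaceReader(device=0)  # device=0 to run on the first GPU
# reader.predict("What is cdQA?", ["cdQA is a QA system.", "BERT is a language model."])
```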
