BERT model prediction is slow, consider more practical implementation #197
Hi @alex-movila, thank you for recommending this implementation! Speed and scalability are indeed very important to us. We'll follow their updates on the topic closely 😉
There are some other tips here to make BERT better suited for production:
Thanks for the tips @alex-movila! However, I am afraid they are not compatible with cdQA as it stands. I will take a deeper look to see if I can extract some useful ideas for our package.
I think the idea of quantization + pruning could be useful to make BERT smaller. Here is another idea, which I think should be easy: implementing the reader with AllenNLP seems to take about two lines of code. I will try it myself.
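For the quantization part, here is a minimal sketch of post-training dynamic quantization with PyTorch, assuming a Hugging Face `transformers` BERT reader; the checkpoint name is only illustrative, and the actual speedup/accuracy trade-off would need to be measured:

```python
import torch
from transformers import BertForQuestionAnswering

# Illustrative SQuAD-finetuned checkpoint; swap in whatever reader cdQA uses.
model = BertForQuestionAnswering.from_pretrained(
    "bert-large-uncased-whole-word-masking-finetuned-squad"
)
model.eval()

# Convert the Linear layers to int8 weights; activations are quantized
# dynamically at inference time. This shrinks the model in memory and
# typically speeds up CPU inference.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```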
Related:
Fwiw, running inference on a GPU works well enough for me. Obviously this might not be feasible for everyone; I'm just running this in Colab.
Well, we must consider production, where we have 1000 users doing inference concurrently.
I think if we can achieve #209 and make ...
For sure on the first point, it obviously doesn't scale. Just wanted to point out that it does work on some level. That said, I think it might be reasonable to expect the availability of a GPU, even if not a large one. I don't know much about deploying these models, but would inference on CPU really be feasible?
Currently, inference on CPU is feasible but slow (about 10 to 20 s per inference, depending on the question and the CPU). We are thinking about some solutions to improve inference time on CPU, although it will probably take some time for these solutions to be implemented. If you are interested in contributing to cdQA, you are welcome to do so.
Would DistilBERT be one way to achieve faster inference? Huggingface have already fine-tuned it on SQuAD v1.1: https://huggingface.co/transformers/model_doc/distilbert.html#distilbertforquestionanswering In their blog post they quote a 60% speedup in inference time compared to BERT, and that's on a CPU.
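If DistilBERT support lands, using that fine-tuned checkpoint through the `transformers` question-answering pipeline is straightforward. A minimal sketch outside of cdQA (question and context are placeholders):

```python
from transformers import pipeline

# SQuAD-distilled DistilBERT checkpoint from the Hugging Face model hub.
qa = pipeline(
    "question-answering",
    model="distilbert-base-uncased-distilled-squad",
)

result = qa(
    question="What makes inference faster?",
    context="DistilBERT is a smaller, distilled version of BERT that runs "
            "noticeably faster than BERT-large, especially on CPU.",
)
print(result["answer"], result["score"])
```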
Now we have ALBERT:
Yes, it's in our plans to add support for DistilBERT and other Transformer-based models as well. We will be adding them soon. There's also TinyBERT: https://arxiv.org/abs/1909.10351 If they ever publish the model, we can add it to cdQA too.
Thanks for the information @andrelmfarias - is the general procedure for adding new models the one listed in this PR?
Google just released "PAWS: Paraphrase Adversaries from Word Scrambling"; maybe it can benefit cdQA as well?
Any news? I've been testing cdQA for production, but it takes about 7 seconds to answer a question, which is OK but kind of annoying for the user. Can I add a faster model, like DistilBERT?
Do you know where the bottleneck is? Is it the QA inference with BERT that is slow? What about the retriever step? I know you are using tf-idf, but have you experimented with other representations at all?
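To help narrow down where the time goes, here is a rough, standalone sketch of a TF-IDF retrieval step (scikit-learn, placeholder documents and query) that can be timed separately from the BERT reader; the real cdQA retriever may be configured differently:

```python
import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder corpus; in practice these would be the indexed paragraphs.
documents = ["first paragraph ...", "second paragraph ...", "third paragraph ..."]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
doc_matrix = vectorizer.fit_transform(documents)

start = time.perf_counter()
query_vec = vectorizer.transform(["which paragraph answers my question?"])
scores = cosine_similarity(query_vec, doc_matrix).ravel()
top_k = scores.argsort()[::-1][:2]  # indices of the best-matching paragraphs
elapsed = time.perf_counter() - start

print(top_k, scores[top_k], f"retrieval took {elapsed * 1000:.1f} ms")
```

If retrieval turns out to be cheap, the ~7 s is almost certainly dominated by the BERT reader, which points back to distillation, quantization, or a GPU.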
Certainly not everyone will be interested in using a GPU for inference, but I am! I have been experimenting with some pretrained models from the NVIDIA NeMo (Neural Modules) framework - they are based on Hugging Face pretrained BERT checkpoints but optimized for fine-tuning and inference on GPUs. Since they are based on the Hugging Face models and use a PyTorch backend, I wonder if they could be easily plugged into cdQA as a Reader? Any idea where to start if I want to try this? I have a couple of engineers who might be interested in making this happen.
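Purely as a starting point, one way to prototype this is to wrap any backend (NeMo, Hugging Face, etc.) behind a small scikit-learn-style adapter and only later conform it to cdQA's actual Reader interface. The class below is hypothetical, not cdQA's real API; it just illustrates the general shape such an adapter might take:

```python
from sklearn.base import BaseEstimator


class CustomReader(BaseEstimator):
    """Hypothetical adapter: delegates span prediction to any QA backend."""

    def __init__(self, qa_fn):
        # qa_fn(question: str, context: str) -> (answer: str, score: float),
        # e.g. a closure around a NeMo or Hugging Face QA model on GPU.
        self.qa_fn = qa_fn

    def fit(self, X=None, y=None):
        # Backend is assumed to be pretrained; nothing to fit here.
        return self

    def predict(self, question, paragraphs):
        # Score every retrieved paragraph and return the best (answer, score).
        candidates = [self.qa_fn(question, p) for p in paragraphs]
        return max(candidates, key=lambda pair: pair[1])
```

The real work would be matching cdQA's expected inputs and outputs, but an adapter like this keeps the GPU-specific code isolated from the rest of the pipeline.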
There is a new approach from AliBaba described here.
ALIBABA AI BEATS HUMANS IN READING-COMPREHENSION TEST
A Deep Cascade Model for Multi-Document Reading Comprehension
It seems faster, with better scalability. I'm not sure whether code is available.