BERT model prediction is slow, consider a more practical implementation #197

Open
alex-movila opened this issue Jul 10, 2019 · 18 comments

@alex-movila

There is a new approach from Alibaba described here.
ALIBABA AI BEATS HUMANS IN READING-COMPREHENSION TEST

A Deep Cascade Model for Multi-Document Reading Comprehension

It seems faster with better scalability. Not sure there is code available.

@fmikaelian
Collaborator

Hi @alex-movila

Thank you for recommending this implementation!

Speed and scalability are indeed very important for using cdQA in production. At this stage, our primary focus is to make end-to-end question answering easy for our users, but we might need to dig deeper into this new approach in the future, probably once Alibaba has released some code (I couldn't find it yet).

We'll follow their updates on the topic closely 😉

@alex-movila
Author

There are some other tips here to make BERT better suited for production:
https://hanxiao.github.io/2019/01/02/Serving-Google-BERT-in-Production-using-Tensorflow-and-ZeroMQ/

@andrelmfarias
Collaborator

andrelmfarias commented Jul 17, 2019

Thanks for the tips, @alex-movila!

However, I am afraid they are not compatible with cdQA, for two reasons:

  1. These tips are for the TensorFlow version of BERT, while we use the PyTorch version provided by Hugging Face.

  2. The bert-as-service project has one main feature: mapping a variable-length sentence to a fixed-length vector. This is a good feature for Information Retrieval (it could be used for the Retriever component of cdQA, for example - for now we use a TF-IDF approach; see the sketch below). But it is not useful for the Question Answering part of the pipeline (i.e. the Reader), where we apply the BertForQuestionAnswering model.

I will take a deeper look at it to see if I can extract anything useful for our package.
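
For illustration, a minimal sketch of such an embedding-based Retriever could look like the following, assuming a bert-as-service server has already been launched with `bert-serving-start` (the documents and question below are placeholders):

```python
# Sketch: dense retrieval with bert-as-service embeddings instead of TF-IDF.
# Assumes a bert-serving-start server is already running on the default ports.
import numpy as np
from bert_serving.client import BertClient
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "BERT is a transformer-based language model.",
    "TF-IDF ranks documents using term-frequency statistics.",
    "cdQA combines a retriever and a reader for question answering.",
]

bc = BertClient()                   # connects to the running server
doc_vectors = bc.encode(documents)  # one fixed-length vector per document

def retrieve(question, top_k=2):
    """Return the top_k documents most similar to the question."""
    question_vector = bc.encode([question])
    scores = cosine_similarity(question_vector, doc_vectors)[0]
    best = np.argsort(scores)[::-1][:top_k]
    return [(documents[i], float(scores[i])) for i in best]

print(retrieve("How does cdQA answer questions?"))
```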

@alex-movila
Author

alex-movila commented Jul 17, 2019

I think the idea of quantization + pruning could be useful for making BERT smaller.
Also, for production there is a need to handle concurrent requests and to load the model only once, at initialization.
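
As a rough illustration of the quantization idea (not something cdQA provides today), PyTorch's dynamic quantization can convert a BERT reader's linear layers to int8 weights for CPU inference; the checkpoint name below is just a public SQuAD-finetuned example:

```python
# Sketch: dynamic quantization of a BERT QA model for CPU inference.
# Assumes a recent PyTorch (with torch.quantization) and the transformers package.
import torch
from transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained(
    "bert-large-uncased-whole-word-masking-finetuned-squad"
)
model.eval()

# Convert the weights of all nn.Linear modules to int8; activations stay in float32.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# quantized_model can now be used in place of model for CPU inference;
# the size reduction and speedup would need to be benchmarked for cdQA.
```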

Here is another idea, which I think should be easy: implement the reader with AllenNLP (see the sketch below):
https://allenai.github.io/bi-att-flow/
https://allennlp.org/models

It seems to be about two lines of code; I will try it myself.
The problem remains, though, that it was not trained on non-answerable questions.
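
The "two lines" would presumably look something like this; the BiDAF archive URL is illustrative (taken from the AllenNLP model zoo) and may have moved:

```python
# Sketch: using an AllenNLP BiDAF predictor as a reader.
# The model archive URL is illustrative; check https://allennlp.org/models for the current one.
from allennlp.predictors.predictor import Predictor

predictor = Predictor.from_path(
    "https://allennlp.s3.amazonaws.com/models/bidaf-model-2017.09.15-charpad.tar.gz"
)

result = predictor.predict(
    question="What is cdQA?",
    passage="cdQA is an end-to-end closed-domain question answering system.",
)
print(result["best_span_str"])  # the extracted answer span
```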

@alex-movila
Author

Related:
Here is another paper, for the XLM model, which is available at huggingface:
Large Memory Layers with Product Keys (https://arxiv.org/abs/1907.05242)
"outperforms a baseline transformer model with 24 layers, while being twice faster at inference time"

@jacobdanovitch

FWIW, using `QAPipeline(reader='models/bert_qa_vGPU-sklearn.joblib', predict_batch_size=128, verbose_logging=True).to('cuda')` gives me very reasonable inference time over a pretty large set of documents.

Obviously this might not be feasible for everyone; I'm just running this in Colab.

@alex-movila
Author

Well, we must consider production, where we have 1000 users doing inference concurrently.
Also, not everyone has a GPU.

@fmikaelian
Collaborator

I think if we can achieve #209 and make cdQA modular (having the possibility to choose different retriever/reader implementations), hopefully everyone will be able to find a solution that fits their specific constraints.

@jacobdanovitch

Well, we must consider production, where we have 1000 users doing inference concurrently.
Also, not everyone has a GPU.

For sure on the first point, it obviously doesn't scale. Just wanted to point out that it does work on some level.

That said, I think it might be reasonable to expect the availability of a GPU, even if not a large one. I don't know much about deploying these models, but would inference on CPU really be feasible?

@andrelmfarias
Collaborator

andrelmfarias commented Jul 22, 2019

but would inference on CPU really be feasible?

Currently, inference on CPU is feasible but slow (about 10 to 20s per inference depending on the question and the CPU).

We are thinking about some solutions to improve inference time on CPU, although it will probably take some time for these solutions to be implemented.

If you are interested in contributing to cdQA by working on this particular issue, that would be very helpful and we could implement the solutions sooner 😃

@lewtun

lewtun commented Oct 3, 2019

Would DistilBERT be one way to achieve faster inference? Hugging Face have already fine-tuned it on SQuAD v1.1: https://huggingface.co/transformers/model_doc/distilbert.html#distilbertforquestionanswering

In their blog post they quote a 60% speedup in inference time compared to BERT, and that's on a CPU.
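
As a standalone sketch (independent of cdQA, and assuming a recent version of the transformers library with the pipeline API), the distilled SQuAD checkpoint can be queried like this:

```python
# Sketch: standalone question answering with the DistilBERT SQuAD checkpoint.
# Assumes a recent version of the transformers library with the pipeline API.
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="distilbert-base-uncased-distilled-squad",
)

result = qa(
    question="Why is DistilBERT attractive for production?",
    context="DistilBERT is a distilled version of BERT that is smaller and "
            "faster at inference time while keeping most of BERT's accuracy.",
)
print(result["answer"], result["score"])
```

Plugging this into cdQA would still require matching the Reader interface, so this is only an illustration of the model itself.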

@alex-movila
Author

Now we have ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations.

@andrelmfarias
Collaborator

Would DistilBERT be one way to achieve faster inference? Hugging Face have already fine-tuned it on SQuAD v1.1: https://huggingface.co/transformers/model_doc/distilbert.html#distilbertforquestionanswering

In their blog post they quote a 60% speedup in inference time compared to BERT, and that's on a CPU.

Yes, it's in our plans to add support for DistilBERT and other Transformer-based models as well. We will be adding them soon.

There's also TinyBERT: https://arxiv.org/abs/1909.10351

If they ever publish the model, we can add it to cdQA too.

@lewtun

lewtun commented Oct 5, 2019

Thanks for the information @andrelmfarias - is the general procedure for adding new models the one listed in this PR?

@morningstar899

There is a new approach from Alibaba described here.
ALIBABA AI BEATS HUMANS IN READING-COMPREHENSION TEST

A Deep Cascade Model for Multi-Document Reading Comprehension

It seems faster with better scalability. Not sure there is code available.

Google just released "PAWS: Paraphrase Adversaries from Word Scrambling"; maybe it can benefit cdQA as well?
https://github.com/google-research-datasets/paws

@Valdegg

Valdegg commented Mar 16, 2020

Any news?

I've been testing cdQA for production, but it takes around 7 seconds to answer a question, which is OK but kind of annoying for the user.

Can I add a faster model, like DistilBERT?

@ehutt

ehutt commented Mar 17, 2020

Do you know where the bottleneck is? Is it the QA inference with BERT that is slow? What about the retriever step?

I know you are using TF-IDF, but have you experimented with other representations at all?

@ehutt

ehutt commented Mar 17, 2020

@andrelmfarias

We are thinking about some solutions to improve inference time on CPU, although it will probably take some time for these solutions to be implemented.

If you are interested in contributing to cdQA by working on this particular issue, that would be very helpful and we could implement the solutions sooner 😃

Certainly not everyone will be interested in using a GPU for inference, but I am!

I have been experimenting with some pretrained models from the NVIDIA NeMo (Neural Modules) framework - they are based on Hugging Face pretrained BERT checkpoints but optimized for fine-tuning and inference on GPUs. Since they are based on the Hugging Face models and use a PyTorch backend, I wonder if they could be easily plugged into cdQA as a Reader?

Any idea where to start if I want to try this? I have a couple of engineers who might be interested in making this happen.
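
Purely as a hypothetical starting point (the class and method names below are assumptions, not cdQA's actual Reader API, and a Hugging Face pipeline stands in for a NeMo checkpoint since the NeMo loading code would depend on their own API), a wrapper might look like this; the real work would be aligning it with the interface defined in the cdQA source:

```python
# Hypothetical sketch of a reader wrapper; the sklearn-style fit/predict interface
# shown here is an assumption and would need to match cdQA's actual Reader API.
from sklearn.base import BaseEstimator
from transformers import pipeline

class HuggingFaceReader(BaseEstimator):
    """Wraps a Hugging Face QA pipeline behind an sklearn-style estimator."""

    def __init__(self, model_name="distilbert-base-uncased-distilled-squad", device=-1):
        self.model_name = model_name
        self.device = device  # -1 = CPU, 0 = first GPU
        self._qa = pipeline("question-answering", model=model_name, device=device)

    def fit(self, X=None, y=None):
        # The underlying checkpoint is already fine-tuned on SQuAD, so fit is a no-op here.
        return self

    def predict(self, question, paragraphs):
        """Score every candidate paragraph and return the highest-scoring answer."""
        candidates = [self._qa(question=question, context=p) for p in paragraphs]
        return max(candidates, key=lambda c: c["score"])

# Illustrative usage:
# reader = HuggingFaceReader(device=0)  # device=0 to run on the first GPU
# reader.predict("What is cdQA?", ["cdQA is a QA system.", "BERT is a language model."])
```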
