Upgrade tgi to 2.3.1 #225
Conversation
…2161) install triton because GPTQParams needs it. Signed-off-by: Wang, Yi A <[email protected]>
* feat: add pre commit step to force schema update when router changes
* fix: prefer improved update_doc and start server and compare
* fix: adjust typo
* fix: adjust revert typo
* fix: update workflow to use update_doc md command
* feat: improve workflow to check openapi schema too
* fix: adjust timeout for CI
* fix: adjust raise condition and install server in ci
* fix: install protoc before server
* feat: improve update doc and add command to print router schema
* fix: adjust autodoc workflow
* fix: explicitly install protoc and python
* fix: allow trailing space in openapi schema diff
This reverts commit 2bbb7fa.
Adding "longrope" for phi-3
…2166)
* Refactor dead code.
* First working step.
* Remove a lot of duplicated code.
* More dead code.
* More cleanup.
* Fix Santacoder test.
* Fixing the simple tests.
* Fixing sharding.
* Fixes for VLM.
* Fixing santacoder (num_kv_heads hardcoded).
* Removing more dead code.
* Fixing `config.n_head`.
* Stopping earlier because of `<end_of_utterance>` in idefics2.
* Addresses comments.
* Removing the dead code.
* Fuse back mistral into FlashCausalLM.
* Finish removal.
* Fixing docs + causal_lm `batch_class`.
* Fixing docs + causal.lm.
* Add default to Gemma Causality.
* Default value for gemma/gemma2.
* Wrong default.
* Add more representative Llama GPTQ test: the Llama GPTQ test is updated to use a model with the commonly-used quantizer config format and activation sorting. The old test is kept around (but renamed) since it tests the format produced by `text-generation-server quantize`.
* Add support for manually triggering a release build
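For reference, the "commonly-used quantizer config format" mentioned above is the quantizer metadata typically shipped alongside GPTQ checkpoints (e.g. a `quantize_config.json`). An illustrative example; the exact keys and values vary per model and are shown here only as an assumption of the usual layout:

```python
# Illustrative example of the quantizer config that commonly accompanies
# GPTQ checkpoints; values are examples, not taken from this PR.
quantize_config = {
    "bits": 4,
    "group_size": 128,
    "desc_act": True,   # activation reordering ("act-order" / activation sorting)
    "sym": True,
    "quant_method": "gptq",
}
```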
* Consistently take `prefix` in model constructors
* Release test check fix
* Misc refactor-related fixes
* Update idefics_causal_lm.py: fix syntax issues
* Fix dbrx & opt model prefix bug
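To illustrate what "consistently take `prefix`" means in practice: every weight lookup is derived from the prefix handed to the constructor rather than from a hard-coded module path, so models like dbrx and opt resolve the right tensors regardless of how they are nested. A minimal sketch; the class name, constructor signature, and weight-name layout are illustrative assumptions:

```python
# Minimal sketch of a layer constructor that derives all weight names from
# the `prefix` it is given; names and signatures are illustrative only.
class AttentionLayer:
    def __init__(self, prefix: str, config, weights):
        # e.g. prefix = "model.layers.0.self_attn"
        self.q_proj = weights.get_tensor(f"{prefix}.q_proj.weight")
        self.k_proj = weights.get_tensor(f"{prefix}.k_proj.weight")
        self.v_proj = weights.get_tensor(f"{prefix}.v_proj.weight")
```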
Fix number of KV heads
We wouldn't allocate any memory in multi-query (1 KV head). Fixes Starcoder et al.
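To make the arithmetic concrete: the KV-cache allocation scales with the number of KV heads per shard, so if that count incorrectly comes out as zero for a multi-query model, nothing is reserved. The sketch below illustrates one way this can happen; the division-by-shards mechanism and all names are assumptions for illustration, not the exact code in this PR:

```python
# Sketch of how a multi-query model (1 KV head) can end up with an empty
# KV cache: naive integer division of the KV head count by the shard count
# rounds 1 // 2 down to 0, so the per-shard allocation has zero elements.
def kv_cache_elements(num_kv_heads, num_shards, head_size, num_blocks, block_size):
    broken = (num_kv_heads // num_shards) * head_size * num_blocks * block_size
    # One possible fix: replicate the single KV head across shards instead
    # of sharding it, so the per-shard count never drops below one.
    fixed = max(num_kv_heads // num_shards, 1) * head_size * num_blocks * block_size
    return broken, fixed

print(kv_cache_elements(num_kv_heads=1, num_shards=2, head_size=128,
                        num_blocks=1024, block_size=16))
# -> (0, 2097152): the broken path allocates nothing for Starcoder-style MQA.
```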
Signed-off-by: Wang, Yi A <[email protected]>
huggingface#2190) Update to metrics 0.23.0, which works with metrics-exporter-prometheus 0.15.1. Signed-off-by: Wang, Yi A <[email protected]>
* fix nccl issue
* add note in dockerfile
* use v2.22.3 that also fixes @samsamoa's repro
* poetry actually can't handle the conflict between torch and nccl
* set LD_PRELOAD
* Updating the self check
* Fix.
* Revert the CLI.
* cli.
* Space.
* Revert cargo update.
…e#2194)

Quantized weights were loaded in the `Weights` class, but this was getting quite unwieldy: every higher-level method to load weights was a long conditional covering all the different quantizers.

This change moves loading of quantized weights out of the `Weights` class. This is done by defining a simple `WeightsLoader` interface that is implemented by `Exl2WeightsLoader`, `GPTQWeightsLoader`, and `MarlinWeightsLoader`. These implementations live in the quantizers' respective modules. The `Weights` class provides the low-level load operations (such as loading tensors or sharded tensors), but delegates loads that need quantizer-specific weight processing to a loader. The loaders still use the low-level functionality provided by `Weights`.

I initially tried making a hierarchy where a class like `GPTQWeights` would inherit from `Weights`. But it is not very flexible (e.g. it does not work well with the new weight storage mock used in tests) and the implicit indirections made the code harder to follow.
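A minimal sketch of the shape of that delegation; the method names and signatures below are illustrative assumptions, not the exact ones in the codebase:

```python
from abc import ABC, abstractmethod

class WeightsLoader(ABC):
    """Quantizer-specific weight loading, kept out of the generic Weights class."""

    @abstractmethod
    def get_weights_col(self, weights: "Weights", prefix: str):
        """Load a column-sharded weight with quantizer-specific processing."""

class GPTQWeightsLoader(WeightsLoader):
    def get_weights_col(self, weights, prefix):
        # Uses the low-level primitives that Weights still provides.
        qweight = weights.get_sharded(f"{prefix}.qweight", dim=1)
        scales = weights.get_sharded(f"{prefix}.scales", dim=1)
        return qweight, scales

class Weights:
    def __init__(self, loader: WeightsLoader):
        self.loader = loader

    def get_sharded(self, name: str, dim: int):
        ...  # low-level tensor / sharded-tensor loading stays here

    def get_weights_col(self, prefix: str):
        # Delegate quantizer-specific processing to the configured loader.
        return self.loader.get_weights_col(self, prefix)
```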
Use FP8 GPTQ-Marlin kernels to enable FP8 support on CUDA GPUs with compute capability >=8.0 and <8.9. Co-authored-by: Florian Zimmermeister <[email protected]>
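For illustration, the compute-capability gate described here can be expressed with PyTorch's device-capability query. A hedged sketch; the dispatch helper itself is hypothetical:

```python
import torch

def use_fp8_gptq_marlin() -> bool:
    # Hypothetical helper: route FP8 through the GPTQ-Marlin kernels on GPUs
    # with compute capability >= 8.0 and < 8.9 (e.g. A100/A10); 8.9 and above
    # (Ada, Hopper) have native FP8 support and can use other kernels.
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    capability = major + minor / 10
    return 8.0 <= capability < 8.9
```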
Signed-off-by: yuanwu <[email protected]>
@yuanwu2017, please test whether there is any performance regression for llama2, llama3.1, and llava-next with this PR.
Signed-off-by: yuanwu <[email protected]>
Subsequent updates will remove this code. Signed-off-by: yuanwu <[email protected]>
Signed-off-by: yuanwu <[email protected]>
@regisss Please help review the patch. We are preparing the test report and will send it to you later.
Signed-off-by: yuanwu <[email protected]>
@regisss @yao-matrix I have run the performance benchmarks for v2.0.6 and v2.3.1 and did not find any performance regression. Please help review the patch.
This reverts commit c6f023a.
@yuanwu2017, please resolve the branch conflicts, thanks.
Signed-off-by: yuanwu <[email protected]>
LGTM!
What does this PR do?

Fixes # (issue)

Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.