
Upgrade tgi to 2.3.1 #225

Merged 345 commits on Dec 19, 2024

Conversation

yuanwu2017 (Collaborator)
What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guidelines,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

sywangyi and others added 30 commits September 24, 2024 03:58
…2161)

install triton because GPTQParams needs it.

Signed-off-by: Wang, Yi A <[email protected]>
* feat: add pre commit step to force schema update when router changes

* fix: prefer improved update_doc and start server and compare

* fix: adjust typo

* fix: adjust revert typo

* fix: update workflow to use update_doc md command

* feat: improve workflow to check openapi schema too

* fix: adjust timeout for CI

* fix: adjust raise condition and install server in ci

* fix: install protoc before server

* feat: improve update doc and add command to print router schema

* fix: adjust autodoc workflow

* fix: explicitly install protoc and python

* fix: allow trailing space in openapi schema diff
)

* Fixing missing `object` field for regular completions.

* Fixing docs by re-adding missing `Prompt`.
…2166)

* Refactor dead code.

* First working step.

* Remove a lot of duplicated code.

* More dead code.

* More cleanup.

* Fix Santacoder test.

* Fixing the simple tests.

* Fixing sharding.

* Fixes for VLM.

* Fixing santacoder (num_kv_heads hardcoded).

* Removing more dead code.

* Fixing `config.n_head`.

* Stopping earlier because of `<end_of_utterance>` in idefics2.

* Addresses comments.

* Removing the dead code.

* Fuse back mistral into FlashCausalLM.

* Finish removal.

* Fixing docs + causal_lm `batch_class`.

* Fixing docs + causal.lm.

* Add default to Gemma Causality.

* Default value for gemma/gemma2.

* Wrong default.
* Add more representative Llama GPTQ test

The Llama GPTQ test is updated to use a model with the commonly-used
quantizer config format and activation sorting. The old test is
kept around (but renamed) since it tests the format produced by
`text-generation-server quantize`.

* Add support for manually triggering a release build
* Consistently take `prefix` in model constructors

* Release test check fix

* Misc refactor-related fixes
* Update idefics_causal_lm.py

Fix syntax issues

* fix dbrx & opt model prefix bug
We wouldn't allocate any memory in multi-query (1 KV head). Fixes
Starcoder et al.
huggingface#2190)

update to metrics 0.23.0; alternatively, metrics-exporter-prometheus 0.15.1 would also work

Signed-off-by: Wang, Yi A <[email protected]>
* fix nccl issue

* add note in dockerfile

* use v2.22.3 that also fixes @samsamoa's repro

* poetry actually can't handle the conflict between torch and nccl

* set LD_PRELOAD
* Updating the self check

* Fix.

* Revert the CLI .

* cli.

* Space.

* Revert cargo update.
…e#2194)

Quantized weights were loaded in the `Weights` class, but this was
getting quite unwieldy, where every higher level method to load weights
was a long conditional to cover all the different quantizers.

This change moves loading of quantized weights out of the `Weights`
class. This is done by defining a simple `WeightsLoader` interface
that is implemented by `Exl2WeightsLoader`, `GPTQWeightsLoader`,
and `MarlinWeightsLoader`. These implementations are in the quantizers'
respective modules. The `Weights` class provides the low-level load
operations (such as loading tensors or sharded tensors), but delegates
loads that need quantizer-specific weight processing to a loader. The
loaders still use the low-level functionality provided by `Weights`.

I initially tried making a hierarchy where a class like `GPTQWeights`
would inherit from `Weights`. But it is not very flexible (e.g. does
not work well with the new weight storage mock used in tests) and
the implicit indirections made the code harder to follow.
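The delegation described above can be sketched roughly as follows. This is a minimal illustration, not the actual TGI implementation: the class names `Weights`, `WeightsLoader`, and `GPTQWeightsLoader` come from the PR description, but the method names (`get_tensor`, `get_weights_col`) and the tensor naming scheme are assumptions made for the example.

```python
from abc import ABC, abstractmethod


class WeightsLoader(ABC):
    """Interface implemented by each quantizer's loader
    (e.g. Exl2WeightsLoader, GPTQWeightsLoader, MarlinWeightsLoader)."""

    @abstractmethod
    def get_weights_col(self, weights: "Weights", prefix: str):
        """Load and post-process the weights for one column-parallel layer."""


class Weights:
    """Low-level tensor store (simplified stand-in for the real class)."""

    def __init__(self, tensors: dict, loader: WeightsLoader):
        self.tensors = tensors
        self.loader = loader  # quantizer-specific loader

    def get_tensor(self, name: str):
        # Low-level load operation provided by Weights itself.
        return self.tensors[name]

    def get_weights_col(self, prefix: str):
        # Quantizer-specific processing is delegated to the loader,
        # which in turn uses the low-level ops on this object.
        return self.loader.get_weights_col(self, prefix)


class GPTQWeightsLoader(WeightsLoader):
    def get_weights_col(self, weights: Weights, prefix: str):
        # GPTQ checkpoints store several tensors per layer
        # (hypothetical names for illustration).
        qweight = weights.get_tensor(f"{prefix}.qweight")
        scales = weights.get_tensor(f"{prefix}.scales")
        return {"qweight": qweight, "scales": scales}
```

With this shape, swapping quantizers means constructing `Weights` with a different loader; the higher-level model code never branches on the quantization format.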
Use FP8 GPTQ-Marlin kernels to enable FP8 support on CUDA GPUs
with compute capability >=8.0 and <8.9.

Co-authored-by: Florian Zimmermeister <[email protected]>
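The capability gate in that commit can be expressed as a simple predicate. The function name and shape below are hypothetical, written only to illustrate the stated condition (>= 8.0 and < 8.9; devices at 8.9 and above have native FP8 support and don't need the Marlin fallback).

```python
def use_fp8_gptq_marlin(major: int, minor: int) -> bool:
    """Illustrative check: should the FP8 GPTQ-Marlin kernels be used?

    Per the commit message, they cover CUDA GPUs with compute
    capability >= 8.0 and < 8.9 (e.g. A100 at 8.0, but not H100 at 9.0).
    """
    capability = (major, minor)
    return (8, 0) <= capability < (8, 9)
```

In practice the device capability would come from something like `torch.cuda.get_device_capability()`, which returns such a `(major, minor)` tuple.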
mandy-li (Collaborator)

@yuanwu2017, please test for any performance regression for llama2, llama3.1, and llava-next with this PR.

yuanwu2017 (Collaborator, Author) commented Dec 8, 2024

@regisss Please help review the patch. We are preparing the test report and will send it to you later.

yuanwu2017 (Collaborator, Author) commented Dec 11, 2024

@regisss @yao-matrix I have run the performance benchmarks for v2.0.6 and v2.3.1 and found no performance regression. Please help review the patch.
[image: benchmark results]

yao-matrix (Collaborator)

@yuanwu2017, please resolve the branch conflicts, thanks.

(1 similar comment)

yuanwu2017 changed the title from "Upgrade to 2.3.1" to "Upgrade tgi to 2.3.1" on Dec 18, 2024
regisss left a comment

LGTM!

regisss merged commit 5291f65 into huggingface:habana-main on Dec 19, 2024
yuanwu2017 deleted the 2.3.0 branch on January 12, 2025