Upgrade tgi to 2.3.1 #225
Conversation
…2161) install triton because GPTQParams needs it. Signed-off-by: Wang, Yi A <[email protected]>
* feat: add pre commit step to force schema update when router changes
* fix: prefer improved update_doc and start server and compare
* fix: adjust typo
* fix: adjust revert typo
* fix: update workflow to use update_doc md command
* feat: improve workflow to check openapi schema too
* fix: adjust timeout for CI
* fix: adjust raise condition and install server in ci
* fix: install protoc before server
* feat: improve update doc and add command to print router schema
* fix: adjust autodoc workflow
* fix: explicitly install protoc and python
* fix: allow trailing space in openapi schema diff
This reverts commit 2bbb7fa.
Adding "longrope" for phi-3
…2166)
* Refactor dead code.
* First working step.
* Remove a lot of duplicated code.
* More dead code.
* More cleanup.
* Fix Santacoder test.
* Fixing the simple tests.
* Fixing sharding.
* Fixes for VLM.
* Fixing santacoder (num_kv_heads hardcoded).
* Removing more dead code.
* Fixing `config.n_head`.
* Stopping earlier because of `<end_of_utterance>` in idefics2.
* Addresses comments.
* Removing the dead code.
* Fuse back mistral into FlashCausalLM.
* Finish removal.
* Fixing docs + causal_lm `batch_class`.
* Fixing docs + causal.lm.
* Add default to Gemma Causality.
* Default value for gemma/gemma2.
* Wrong default.
* Add more representative Llama GPTQ test: the Llama GPTQ test is updated to use a model with the commonly-used quantizer config format and activation sorting. The old test is kept around (but renamed) since it tests the format produced by `text-generation-server quantize`.
* Add support for manually triggering a release build
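For reference, the "commonly-used quantizer config format" mentioned above is the quantizer metadata typically shipped alongside GPTQ checkpoints (e.g. a `quantize_config.json`). An illustrative example; the exact keys and values vary per model and are shown here only as an assumption of the usual layout:

```python
# Illustrative example of the quantizer config that commonly accompanies
# GPTQ checkpoints; values are examples, not taken from this PR.
quantize_config = {
    "bits": 4,
    "group_size": 128,
    "desc_act": True,   # activation reordering ("act-order" / activation sorting)
    "sym": True,
    "quant_method": "gptq",
}
```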
* Consistently take `prefix` in model constructors
* Release test check fix
* Misc refactor-related fixes
* Update idefics_causal_lm.py: fix syntax issues
* Fix dbrx & opt model prefix bug
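To illustrate what "consistently take `prefix`" means in practice: every weight lookup is derived from the prefix handed to the constructor rather than from a hard-coded module path, so models like dbrx and opt resolve the right tensors regardless of how they are nested. A minimal sketch; the class name, constructor signature, and weight-name layout are illustrative assumptions:

```python
# Minimal sketch of a layer constructor that derives all weight names from
# the `prefix` it is given; names and signatures are illustrative only.
class AttentionLayer:
    def __init__(self, prefix: str, config, weights):
        # e.g. prefix = "model.layers.0.self_attn"
        self.q_proj = weights.get_tensor(f"{prefix}.q_proj.weight")
        self.k_proj = weights.get_tensor(f"{prefix}.k_proj.weight")
        self.v_proj = weights.get_tensor(f"{prefix}.v_proj.weight")
```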
Fix number of KV heads
We wouldn't allocate any memory in multi-query (1 KV head). Fixes Starcoder et al.
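To make the arithmetic concrete: the KV-cache allocation scales with the number of KV heads per shard, so if that count incorrectly comes out as zero for a multi-query model, nothing is reserved. The sketch below illustrates one way this can happen; the division-by-shards mechanism and all names are assumptions for illustration, not the exact code in this PR:

```python
# Sketch of how a multi-query model (1 KV head) can end up with an empty
# KV cache: naive integer division of the KV head count by the shard count
# rounds 1 // 2 down to 0, so the per-shard allocation has zero elements.
def kv_cache_elements(num_kv_heads, num_shards, head_size, num_blocks, block_size):
    broken = (num_kv_heads // num_shards) * head_size * num_blocks * block_size
    # One possible fix: replicate the single KV head across shards instead
    # of sharding it, so the per-shard count never drops below one.
    fixed = max(num_kv_heads // num_shards, 1) * head_size * num_blocks * block_size
    return broken, fixed

print(kv_cache_elements(num_kv_heads=1, num_shards=2, head_size=128,
                        num_blocks=1024, block_size=16))
# -> (0, 2097152): the broken path allocates nothing for Starcoder-style MQA.
```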
Signed-off-by: Wang, Yi A <[email protected]>
huggingface#2190) Update to metrics 0.23.0, which works with metrics-exporter-prometheus 0.15.1. Signed-off-by: Wang, Yi A <[email protected]>
* fix nccl issue
* add note in dockerfile
* use v2.22.3 that also fixes @samsamoa's repro
* poetry actually can't handle the conflict between torch and nccl
* set LD_PRELOAD
* Updating the self check
* Fix.
* Revert the CLI.
* cli.
* Space.
* Revert cargo update.
…e#2194)

Quantized weights were loaded in the `Weights` class, but this was getting quite unwieldy: every higher-level method to load weights was a long conditional covering all the different quantizers.

This change moves loading of quantized weights out of the `Weights` class. This is done by defining a simple `WeightsLoader` interface that is implemented by `Exl2WeightsLoader`, `GPTQWeightsLoader`, and `MarlinWeightsLoader`. These implementations live in the quantizers' respective modules. The `Weights` class provides the low-level load operations (such as loading tensors or sharded tensors), but delegates loads that need quantizer-specific weight processing to a loader. The loaders still use the low-level functionality provided by `Weights`.

I initially tried making a hierarchy where a class like `GPTQWeights` would inherit from `Weights`. But it is not very flexible (e.g. it does not work well with the new weight storage mock used in tests) and the implicit indirections made the code harder to follow.
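A minimal sketch of the shape of that delegation; the method names and signatures below are illustrative assumptions, not the exact ones in the codebase:

```python
from abc import ABC, abstractmethod

class WeightsLoader(ABC):
    """Quantizer-specific weight loading, kept out of the generic Weights class."""

    @abstractmethod
    def get_weights_col(self, weights: "Weights", prefix: str):
        """Load a column-sharded weight with quantizer-specific processing."""

class GPTQWeightsLoader(WeightsLoader):
    def get_weights_col(self, weights, prefix):
        # Uses the low-level primitives that Weights still provides.
        qweight = weights.get_sharded(f"{prefix}.qweight", dim=1)
        scales = weights.get_sharded(f"{prefix}.scales", dim=1)
        return qweight, scales

class Weights:
    def __init__(self, loader: WeightsLoader):
        self.loader = loader

    def get_sharded(self, name: str, dim: int):
        ...  # low-level tensor / sharded-tensor loading stays here

    def get_weights_col(self, prefix: str):
        # Delegate quantizer-specific processing to the configured loader.
        return self.loader.get_weights_col(self, prefix)
```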
Use FP8 GPTQ-Marlin kernels to enable FP8 support on CUDA GPUs with compute capability >=8.0 and <8.9. Co-authored-by: Florian Zimmermeister <[email protected]>
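For illustration, the compute-capability gate described here can be expressed with PyTorch's device-capability query. A hedged sketch; the dispatch helper itself is hypothetical:

```python
import torch

def use_fp8_gptq_marlin() -> bool:
    # Hypothetical helper: route FP8 through the GPTQ-Marlin kernels on GPUs
    # with compute capability >= 8.0 and < 8.9 (e.g. A100/A10); 8.9 and above
    # (Ada, Hopper) have native FP8 support and can use other kernels.
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    capability = major + minor / 10
    return 8.0 <= capability < 8.9
```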
Signed-off-by: yuanwu <[email protected]>
@yuanwu2017, please test whether there is any performance regression for llama2, llama3.1, and llava-next with this PR.
Signed-off-by: yuanwu <[email protected]>
Subsequent updates will remove this code. Signed-off-by: yuanwu <[email protected]>
Signed-off-by: yuanwu <[email protected]>
@regisss Please help review the patch. We are preparing the test report and will send it to you later.
Signed-off-by: yuanwu <[email protected]>
@regisss @yao-matrix I have run the performance benchmarks for v2.0.6 and v2.3.1 and did not find any performance regression. Please help review the patch.
This reverts commit c6f023a.
@yuanwu2017, please resolve the branch conflicts, thanks.
Signed-off-by: yuanwu <[email protected]>
LGTM!
What does this PR do?

Fixes # (issue)

Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.