On-The-Fly Quantization for Inference appears not to be working as per documentation. #2748
Comments
Could you check the size of the key-value cache in both cases? The memory freed up by quantization is used to increase the size of the key-value cache, so that more requests can be in flight simultaneously and prefix caching gives larger benefits. See e.g.:
@danieldk we launch the container via docker run ..... I don't see "KV-cache" in these logs. When I run text-generation-launcher from inside the running container I don't see "KV-cache" on the console either. Can you advise?
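A minimal sketch of how we could check for that log line from outside, assuming the container started from the image above is the only one running from it (the exact message wording may differ between TGI versions):

```bash
# Find the TGI container ID (assumes only one container was started from this image).
CID=$(docker ps -q --filter ancestor=registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct-test)

# Search the launcher output for the KV-cache warmup message.
docker logs "$CID" 2>&1 | grep -i "kv-cache"
```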
That's odd, the KV-cache size is logged unconditionally at the info level during warmup. It was only added in TGI 2.4.0, so the message wouldn't be logged in 2.0.5 (though the same applies there: the additional memory is used to make a larger KV cache). Example run with 2.4.0:
Thanks @danieldk, I verified that "KV-cache" is present in the 2.4.0 TGI console logs but not in previous versions. I also noticed that the logged KV-cache blocks: 123449, size: 1 entry matches the inferred MAX_BATCH_TOTAL_TOKENS parameter, i.e. KV-cache blocks: 123449 and MAX_BATCH_TOTAL_TOKENS = 123449. Looking at this HF blog, I'm beginning to get the picture that the TGI inference engine is doing more optimization (prefill & decode) than we originally expected from the TGI launcher documentation.
With a better understanding of the TGI memory optimizations for inference, we still have a few outstanding questions.
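To sanity-check that reading, here is a rough back-of-the-envelope estimate of the KV-cache size those 123449 blocks imply. This is a sketch only: it assumes the published Meta-Llama-3.1-8B config (32 layers, 8 KV heads, head dimension 128), a 16-bit KV cache, one token per block as logged, and ignores paged-attention overhead.

```bash
# Approximate KV-cache bytes per token:
# 2 (K and V) x 32 layers x 8 KV heads x 128 head dim x 2 bytes (fp16/bf16)
BYTES_PER_TOKEN=$((2 * 32 * 8 * 128 * 2))   # 131072 bytes = 128 KiB
BLOCKS=123449                               # KV-cache blocks reported at warmup (1 token each)
echo "KV cache ~ $((BYTES_PER_TOKEN * BLOCKS / 1024 / 1024)) MiB"   # ~15 GiB total, split across shards
```

If the memory freed by quantization is folded into a larger KV cache like this, nvidia-smi would report roughly the same utilization with and without --quantize, which matches what we observed.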
System Info
Platform: Dell 760xa with 4x L40S GPUs
OS Description: Ubuntu 22.04.5 LTS
GPU: NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4
Python: 3.10.12
Docker: 26.1.5
Model: [Deploy Meta-Llama-3.1-8b-Instruct | Dell Enterprise Hub by Hugging Face](https://dell.huggingface.co/authenticated/models/meta-llama/Meta-Llama-3.1-8b-Instruct/deploy/docker)
Tested with two versions of model containers:
registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct-test -> TGI 2.4.0
registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct -> TGI 2.0.5.dev0
Information
Tasks
Reproduction
docker run -it --shm-size 1g -p 80:80 --gpus 2 -e NUM_SHARD=2 -e MAX_BATCH_PREFILL_TOKENS=16182 -e MAX_INPUT_TOKENS=8000 -e MAX_TOTAL_TOKENS=8192 registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct-test
nvidia-smi
Note the GPU memory usage, e.g. 27629 MiB for each GPU.
docker run -it --shm-size 1g -p 80:80 --gpus 2 -e NUM_SHARD=2 -e MAX_BATCH_PREFILL_TOKENS=16182 -e MAX_INPUT_TOKENS=8000 -e MAX_TOTAL_TOKENS=8192 registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct-test --quantize bitsandbytes
nvidia-smi
Note the GPU memory usage, e.g. 27670 MiB for the 1st GPU and 27027 MiB for the 2nd GPU.
TGI Container Versions 2.4.0 & 2.0.5.dev0 (Current DEH version):
registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct-test -> TGI 2.4.0
registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct -> TGI 2.0.5.dev0
Single and dual GPUs:
docker run -it --shm-size 1g -p 80:80 --gpus 1 -e NUM_SHARD=1
docker run -it --shm-size 1g -p 80:80 --gpus 2 -e NUM_SHARD=2
Quantize options (a comparison sketch follows after this list):
bitsandbytes, bitsandbytes-fp4, bitsandbytes-nf4, fp8, eetq
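One way to compare these options side by side is to launch each variant, record nvidia-smi, and pull the KV-cache warmup line. This is a sketch only: it reuses the image and token settings from the reproduction above, runs detached instead of -it, and the fixed sleep is an arbitrary wait for model load and warmup.

```bash
IMG=registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct-test

for Q in bitsandbytes bitsandbytes-fp4 bitsandbytes-nf4 fp8 eetq; do
  echo "=== --quantize $Q ==="
  # Start TGI detached with the same settings as the reproduction above, plus the quantize flag.
  CID=$(docker run -d --shm-size 1g -p 80:80 --gpus 2 -e NUM_SHARD=2 \
        -e MAX_BATCH_PREFILL_TOKENS=16182 -e MAX_INPUT_TOKENS=8000 -e MAX_TOTAL_TOKENS=8192 \
        "$IMG" --quantize "$Q")
  sleep 300   # crude wait for model load + warmup; adjust as needed
  nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader
  docker logs "$CID" 2>&1 | grep -i "kv-cache"
  docker rm -f "$CID"
done
```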
Expected behavior
See attached PDF
HF-TICKET-Quantization-Results.pdf
Results.
Running e.g. Llama 3.1 8B Instruct with --quantize bitsandbytes, we see minor or insignificant differences in GPU memory utilization.
Note: Both TGI Container Versions show similar signatures.
On-the-fly quantization for inference doesn't appear to be working as expected.
Do bitsandbytes-fp4 and bitsandbytes-nf4 work?
Does fp8 quantization work for on-the-fly inference quantization?
Does eetq quantization work for on-the-fly inference quantization?
Does on-the-fly quantization work with multi-GPU instances?
Should different input-token configs be used to see meaningful quantization results? E.g.
{ MAX_BATCH_PREFILL_TOKENS=16182, MAX_INPUT_TOKENS=8000, MAX_TOTAL_TOKENS=8192 }
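On the last question: since TGI grows the KV cache to fill whatever memory quantization frees, changing the input-token config alone may not change what nvidia-smi reports. One way to make the weight savings visible is to pin the KV-cache budget instead. This is a sketch only, assuming the MAX_BATCH_TOTAL_TOKENS environment variable (or the launcher's --cuda-memory-fraction flag) is honored by these containers; the value 32768 is an arbitrary example.

```bash
# Cap the KV cache explicitly so that the model-weight savings from --quantize
# show up as lower nvidia-smi usage instead of a larger KV cache.
docker run -it --shm-size 1g -p 80:80 --gpus 2 -e NUM_SHARD=2 \
  -e MAX_BATCH_PREFILL_TOKENS=16182 -e MAX_INPUT_TOKENS=8000 -e MAX_TOTAL_TOKENS=8192 \
  -e MAX_BATCH_TOTAL_TOKENS=32768 \
  registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct-test \
  --quantize bitsandbytes
```

With the KV cache pinned to the same number of tokens in both runs, the difference reported by nvidia-smi should roughly equal the reduction in model-weight memory from quantization.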