Taking about 40 minutes to generate one sentence. Is this speed normal? #186
Comments
Same here with an RTX A4000 using llama3:8b.
I guess one split group of layers is around 4GB. Is there a way to load more groups at a time on GPUs that have more VRAM?
In my experience disk speed is the primary bottleneck, at least in my config (RTX 3050 6GB). If the model is stored on my spinning HDD, I'm sitting at just above 2 minutes per token using Qwen2.5-Coder-32B-Instruct with 4-bit compression. If I create a ramdisk and store the model there (I have 32 gigs of system memory, so the 17.566 gigs the compressed model takes up easily fits), using the layer_shards_saving_path parameter of AutoModel.from_pretrained, then I'm down to 13 seconds per token, which makes everything MUCH more usable.

@kingdoom1 and @parsa-pico: if the whole model can fit in your GPU, which should be the case for 7B and 8B models on a 3090 and an A4000, then it would be best to just use something else to run the models. At least from my experience, AirLLM is really only good for running models larger than you have VRAM for and is not well suited to smaller models. I assume there is some overhead in the VRAM management even when the whole model fits in VRAM at once.

@ggaaooppeenngg I think this is what the prefetching parameter is for, but I really don't know, because with it set to true or false I'm not seeing much of a difference in speed or GPU usage. The README says it's only about a 10% benefit. I don't have any hard evidence to support it, but I assume that loading the model from disk is still likely to be the biggest bottleneck.
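For anyone who wants to try the ramdisk approach described above, here is a minimal sketch. The `compression` and `layer_shards_saving_path` arguments are the ones mentioned in this thread and in the AirLLM README; the mount point `/mnt/ramdisk`, the Hugging Face model ID, the 1 GB headroom, and both helper function names are assumptions made for illustration. Check the free-space figure against your own setup before relying on it.

```python
import shutil
from pathlib import Path

GB = 10 ** 9  # decimal gigabytes; an assumption about what "gigs" means above


def has_room_for_shards(shard_dir: str, model_gb: float, headroom_gb: float = 1.0) -> bool:
    """Return True if the filesystem holding shard_dir has room for the layer shards."""
    free_bytes = shutil.disk_usage(shard_dir).free
    return free_bytes >= (model_gb + headroom_gb) * GB


def load_with_shards_on(shard_dir: str, model_id: str):
    """Load a model through AirLLM, writing the split layer shards to shard_dir."""
    # Imported lazily so the free-space helper works even without AirLLM installed.
    from airllm import AutoModel

    return AutoModel.from_pretrained(
        model_id,
        compression="4bit",                  # 4-bit compression, as in the comment above
        layer_shards_saving_path=shard_dir,  # where the per-layer shards are stored
    )


if __name__ == "__main__":
    ramdisk = "/mnt/ramdisk"  # hypothetical mount point, adjust to your system
    if Path(ramdisk).is_dir() and has_room_for_shards(ramdisk, model_gb=17.566):
        model = load_with_shards_on(ramdisk, "Qwen/Qwen2.5-Coder-32B-Instruct")
    else:
        print("Ramdisk missing or too small; shards would not fit.")
```

The space check runs first because a ramdisk that fills up partway through splitting the model fails late and wastes the time already spent.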
I have set the input maxlength to 128 and the output maxlength to 128 as well. The speed of output is very slow, taking about 40 minutes to generate one sentence. I am using the Qwen-2.5 7B model. Is this speed normal? My GPU is an NVIDIA 3090 with 12GB of VRAM, and it's using around 5GB.
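To put the 40 minutes in the same units as the per-token timings reported in this thread, here is a quick back-of-the-envelope sketch. It assumes the sentence used the full 128-token output limit, which the post does not confirm; a shorter sentence would mean an even slower per-token rate. The function names are made up for illustration.

```python
def seconds_per_token(total_minutes: float, tokens: int) -> float:
    """Average seconds spent on each generated token."""
    return total_minutes * 60 / tokens


def minutes_for(tokens: int, sec_per_token: float) -> float:
    """Total minutes needed to generate `tokens` tokens at a given rate."""
    return tokens * sec_per_token / 60


# 40 minutes for (at most) 128 output tokens, as reported in this post.
print(seconds_per_token(40, 128))  # 18.75

# For comparison: 128 tokens at the 13 s/token ramdisk figure from the thread.
print(round(minutes_for(128, 13), 1))  # 27.7
```

Under that assumption, 40 minutes works out to roughly 19 seconds per token, which is in the same range as the layer-by-layer loading speeds others report here.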