Getting reasonable performance on dual RTX 3090 and 128gb #85
Comments
Hi, thanks for your interest in ktransformers. DeepSeek-V2's Q4_K_M requires 136GB of RAM; with only 128GB, the data will frequently swap in and out of RAM, which slashes your generation speed. My advice is to add more RAM or to use the IQ4_XS format model (125GB).
Hi Azure, thanks for the reply. Unfortunately I am using a consumer motherboard on this setup, so the RAM is maxed out at 128GB. However, I tried the IQ4_XS format with the no-optimize config and the results are better:
prompt eval count: 26 token(s)
When I try to load it with the default DeepSeek-V2-Chat-multi-gpu.yaml, I get the following CUDA error as it starts to load onto the second GPU:
Loading the Q4_K_M with the same config completes correctly, but suffers from the aforementioned bad performance. Would it be possible to eventually leverage the extra 24GB of VRAM (plus the ~12GB unused on the first GPU) to load a larger model than the system RAM can handle? That is, is there a way to set up the optimize config to offload more of the model onto the GPUs to compensate?
This is a bug; I just fixed it. As for your question: you could consider modifying your YAML to offload some of the experts from CPU to GPU and make use of your extra VRAM. You can find a detailed tutorial here.
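For reference, here is a minimal sketch of the kind of rule that tutorial describes; the layer range, device name, op name, and kwargs below are assumptions modeled on its examples, not a tested config:

```yaml
# Illustrative only: pin the experts of layers 4-5 to the first GPU using the
# Marlin kernel, leaving every other expert layer on the existing CPU rule.
# Place this before the generic "^model\.layers\..*\.mlp\.experts$" rule.
- match:
    name: "^model\\.layers\\.([4-5])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      generate_device: "cuda:0"
      generate_op: "KExpertsMarlin"   # assumed name of the Marlin expert backend
  recursive: False
```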
Thanks for the update! I will test this throughout the weekend. Do you have an intuition about which parameters I should try first? I tried with the "ktransformers.operators.experts.KTransformersExperts" class but triggered an OOM with even one layer offloaded. Not sure where to go next and would love your input.
Which backend are you using for KTransformersExperts?
I am using the following modification to the YAML:
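A minimal sketch of this kind of rule (the layer index, device, and op names shown are assumptions, not the exact snippet used):

```yaml
# Hypothetical single-layer offload: experts of layer 0 placed on cuda:0.
# generate_op picks the backend; "KExpertsMarlin" vs "KExpertsTorch" is the
# Marlin-vs-Torch choice discussed below (both op names are assumed here).
- match:
    name: "^model\\.layers\\.(0)\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      generate_device: "cuda:0"
      generate_op: "KExpertsMarlin"
  recursive: False
```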
If I use the Marlin backend, the VRAM usage on the first GPU hits ~22GB during loading, then settles down to ~12GB after loading.
When I try to generate anything with the web UI, I get the following error in the command line:
The same error occurs if I load it with the Torch expert backend; its traceback likewise includes "During handling of the above exception, another exception occurred:". Any ideas on how to debug this?
+1 to this. I am also impacted. I have an RTX 3090, and I hit the same error when I try to offload the 0th and 1st expert layers. @Azure-Tang, please let us know if there is a fix.
Hi,
First off, thanks for all the work you guys have put into this.
I am trying to run DeepSeek-Coder-V2-Instruct-0724-GGUF Q4_K_M with reasonable performance but cannot figure it out. When I use the default configuration of the "DeepSeek-V2-Chat-multi-gpu.yaml" optimize file, I get about 0.7 t/s. I have tried to load some of the expert layers onto cuda:0 and cuda:1 but hit OOM errors when more than one layer is used. Example YAML match:
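A sketch of the kind of match rules meant here, splitting expert layers across the two GPUs; layer indices, devices, and op names are assumptions:

```yaml
# Hypothetical split: experts of layer 0 go to cuda:0 and layer 1 to cuda:1,
# while all remaining expert layers stay on CPU via the later catch-all rule.
- match:
    name: "^model\\.layers\\.(0)\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      generate_device: "cuda:0"
      generate_op: "KExpertsMarlin"
  recursive: False
- match:
    name: "^model\\.layers\\.(1)\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      generate_device: "cuda:1"
      generate_op: "KExpertsMarlin"
  recursive: False
```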
Has anyone been able to achieve reasonable results with this sort of setup?
System:
13th Gen Intel(R) Core(TM) i5-13600K
128GB DDR4-3200 (4 x 32GB)
2x RTX 3090