High VRAM usage with Blocks to swap on ROCM #1776
Unfortunately I don't know why it doesn't work with ROCm. Could you try the
That does seem to help a little. I lowered the blocks to swap to 33, and these were my results. In one attempt the following happened: at first VRAM climbed as normal to 12.3 GiB (instead of 11 GiB as in the earlier 36-block experiment). There it stabilized, growing by only a few MB, going up and down and trending upward by maybe a MB every few seconds, in what could look like some sort of memory leak. I assume this is the swapping actually working. But then, after a few more seconds of this (and no visible progress on an iteration), it SIGABRT'ed with:
To me that looks like some kind of segfault. It doesn't always happen, however: when I run it again, it stabilizes in the same way, but instead of segfaulting it prints:
then crashes with the traceback: Traceback
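A note on that slow creep while it looks stable: one way to tell real tensor growth from the caching allocator simply holding more memory is to log allocated versus reserved VRAM each step. This is only a sketch, not sd-scripts code; it assumes a ROCm build of PyTorch, where the torch.cuda API is backed by HIP, and the helper name log_vram is made up.

```python
# Sketch only (not sd-scripts code): log allocated vs. reserved VRAM per step.
# "allocated" is memory held by live tensors; "reserved" is what PyTorch's
# caching allocator keeps from the driver, so a growing gap between the two
# points at fragmentation rather than a leak.
import torch

def log_vram(step: int) -> None:
    alloc = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"step {step}: allocated {alloc:.2f} GiB / reserved {reserved:.2f} GiB")
```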
This is a little better: ~1 GB free instead of ~400 MB, but it still falls a little short of the 1.85 GB it wants. The behavior is similar with the Blockwise Fused Optimizer and with AdamW as well. Maybe there is some hope, though, given that the block swapping looks like it might actually be somewhat working. Oh, and I forgot to mention (switching branches reminded me): to get adafactor working I needed to apply this patch:
```diff
diff --git a/library/train_util.py b/library/train_util.py
index 8b5cf21..d6ff231 100644
--- a/library/train_util.py
+++ b/library/train_util.py
@@ -5014,7 +5014,7 @@ def get_scheduler_fix(args, optimizer: Optimizer, num_processes: int):
         assert (
             type(optimizer) == transformers.optimization.Adafactor
         ), f"adafactor scheduler must be used with Adafactor optimizer / adafactor schedulerはAdafactorオプティマイザと同時に使ってください"
-        initial_lr = float(name.split(":")[1])
+        initial_lr = args.learning_rate
         # logger.info(f"adafactor scheduler init lr {initial_lr}")
         return wrap_check_needless_num_warmup_steps(transformers.optimization.AdafactorSchedule(optimizer, initial_lr))
```

Thanks again for the quick answer!
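For context on what the patch touches, here is a minimal standalone sketch of how transformers' Adafactor and AdafactorSchedule fit together; it is illustrative only, not sd-scripts code, and the dummy model plus the learning_rate value are made up. AdafactorSchedule is a proxy scheduler that mirrors the learning rate Adafactor computes internally, and initial_lr is just what it reports before the first step, which is the value the patch takes from args.learning_rate instead of parsing it out of the scheduler name.

```python
# Illustrative sketch only (not sd-scripts code). The dummy model and the
# learning_rate value are placeholders.
import torch
import transformers

model = torch.nn.Linear(8, 8)   # dummy module, only to have parameters
learning_rate = 1e-4            # stands in for args.learning_rate

optimizer = transformers.optimization.Adafactor(
    model.parameters(),
    lr=None,                    # let Adafactor derive its own step size
    scale_parameter=True,
    relative_step=True,
    warmup_init=True,
)
# AdafactorSchedule only mirrors the LR the optimizer is using; initial_lr is
# what it reports before the first optimizer step.
lr_scheduler = transformers.optimization.AdafactorSchedule(optimizer, initial_lr=learning_rate)
```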
Hey, I was testing out flux dreambooth on my 16GB VRAM AMD GPU with blocks to swap = 36, CPU Checkpoint offloading, and Memory Efficient Save.
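For readers unfamiliar with the option: "blocks to swap" trades VRAM for host-device transfers by keeping only part of the model on the GPU at a time. The toy sketch below just illustrates that general idea and is not sd-scripts' actual implementation; forward_with_block_swap and its arguments are made-up names.

```python
# Toy illustration of the general block-swapping idea (NOT sd-scripts' actual
# implementation): keep only the block currently executing on the GPU and park
# the rest in CPU memory, paying for it with per-block transfers.
import torch

def forward_with_block_swap(blocks, x, device="cuda"):
    for block in blocks:
        block.to(device)   # load this block's weights into VRAM
        x = block(x)
        block.to("cpu")    # evict the block again to free VRAM
    return x
```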
I see in #1764 that a value of 36 on NVIDIA should bring VRAM usage down to around 6 GB. Instead, what I see is ~5.4 GB of usage while caching latents; usage then drops to ~300 MB during a long pause of loading state dicts while everything is loaded into RAM.
It then starts rising slowly to ~9.6GB, before it reaches
It then quickly rises to ~11GB of usage, printing
And then it spikes up to 15 GB and ultimately fails to allocate 1.85 GB, printing the traceback:
Traceback
I've tried a few different configurations, like turning on/off sdpa, enabling and disabling full fp16 training.
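For reference, one way to see how much of the 16 GB is reserved but unallocated (fragmentation) versus genuinely in use when that allocation fails is to dump PyTorch's memory summary at the OOM. This is a sketch only, not something sd-scripts does; train_step and step_with_oom_report are hypothetical names standing in for one training iteration.

```python
# Sketch only: wrap one training iteration so that, if the allocator raises an
# out-of-memory error, PyTorch's memory summary is printed before re-raising.
# `train_step` is a hypothetical stand-in, not an sd-scripts function.
import torch

def step_with_oom_report(train_step, *args, **kwargs):
    try:
        return train_step(*args, **kwargs)
    except torch.cuda.OutOfMemoryError:
        # Allocated vs. reserved segments hint at fragmentation vs. exhaustion.
        print(torch.cuda.memory_summary())
        raise
```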
The command used:
config.toml
The commit I am using:
264328d117dc5d17772ec0bdbac2b9f0cf4695f5
If you need any more detail, or if I can help test in any other way, I would be more than happy to do so.
Or maybe I have some settings wrong, in which case I'm sorry for any trouble I may have caused.
Thank you in advance!