I'm running this on RunPod with a couple of 4090s and the newest version of KohyaGUI, which doesn't update the requirements correctly, so I had to do that manually. I'm not sure whether this is a requirements problem or something else, but I figured it's better to share it now.
I'm training with Flux1D2pro as the base model; it might be related to that, but I don't think so.
There's also this message. I'm not sure what it is, but I've been seeing it for days now:
Could not load library libnvrtc.so.12. Error: libnvrtc.so.12: cannot open shared object file: No such file or directory
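In case it helps, here's a quick check I can run inside the venv to see whether libnvrtc.so.12 is present at all. This is just a sketch and assumes the cu12 wheel layout, where the nvidia-cuda-nvrtc-cu12 pip package ships the library under site-packages/nvidia/cuda_nvrtc/lib/ (that path is an assumption about this setup, not something from the log):

# Diagnostic sketch: look for libnvrtc.so.12 in site-packages and try to load it.
# Assumes the cu12 wheel layout (nvidia-cuda-nvrtc-cu12 -> nvidia/cuda_nvrtc/lib/).
import ctypes
import glob
import os
import site

candidates = []
for sp in site.getsitepackages():
    candidates += glob.glob(os.path.join(sp, "nvidia", "cuda_nvrtc", "lib", "libnvrtc.so.12"))

print("found:", candidates)
if candidates:
    ctypes.CDLL(candidates[0])  # raises OSError if the library can't actually be loaded
    print("libnvrtc.so.12 loads fine; the loader path (e.g. LD_LIBRARY_PATH) may be the issue")
else:
    print("libnvrtc.so.12 not found; nvidia-cuda-nvrtc-cu12 may be missing from the requirements")

The full log and traceback: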
removing old checkpoint: ./outputs/v4_woman_shuffle/v4_aesthetic_juicer2-step00000850.safetensors
steps:   2%|▏         | 919/50025 [1:46:14<94:36:57, 6.94s/it, avr_loss=0.318]
Could not load library libnvrtc.so.12. Error: libnvrtc.so.12: cannot open shared object file: No such file or directory
Could not load library libnvrtc.so.12. Error: libnvrtc.so.12: cannot open shared object file: No such file or directory
Could not load library libnvrtc.so.12. Error: libnvrtc.so.12: cannot open shared object file: No such file or directory
Could not load library libnvrtc.so.12. Error: libnvrtc.so.12: cannot open shared object file: No such file or directory
Could not load library libnvrtc.so.12. Error: libnvrtc.so.12: cannot open shared object file: No such file or directory
[rank1]: Traceback (most recent call last):
[rank1]: File "/workspace/kohya_ss/sd-scripts/flux_train_network.py", line 574, in <module>
[rank1]: trainer.train(args)
[rank1]: File "/workspace/kohya_ss/sd-scripts/train_network.py", line 1226, in train
[rank1]: accelerator.backward(loss)
[rank1]: File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 2159, in backward
[rank1]: loss.backward(**kwargs)
[rank1]: File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/_tensor.py", line 581, in backward
[rank1]: torch.autograd.backward(
[rank1]: File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 347, in backward
[rank1]: _engine_run_backward(
[rank1]: File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
[rank1]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank1]: File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 1125, in unpack_hook
[rank1]: frame.recompute_fn(*args)
[rank1]: File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 1519, in recompute_fn
[rank1]: fn(*args, **kwargs)
[rank1]: File "/workspace/kohya_ss/sd-scripts/library/flux_models.py", line 725, in _forward
[rank1]: attn = attention(q, k, v, pe=pe, attn_mask=attn_mask)
[rank1]: File "/workspace/kohya_ss/sd-scripts/library/flux_models.py", line 451, in attention
[rank1]: x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
[rank1]: RuntimeError: Expected mha_graph->execute(handle, variant_pack, workspace_ptr.get()).is_good() to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)
W1119 06:10:40.225000 18754 venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 18826 closing signal SIGTERM
E1119 06:10:41.545000 18754 venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 18827) of binary: /workspace/kohya_ss/venv/bin/python
Traceback (most recent call last):
File "/workspace/kohya_ss/venv/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1097, in launch_command
multi_gpu_launcher(args)
File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher
distrib_run.run(args)
File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
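The actual crash comes from torch.nn.functional.scaled_dot_product_attention in flux_models.py, and the "mha_graph->execute(...)" wording makes me suspect the cuDNN SDPA backend. As a stopgap I'm considering disabling that backend and letting PyTorch fall back to flash / memory-efficient attention. This is only a sketch of the idea, not a confirmed fix, and it would have to go near the top of the training entry point before any forward pass:

# Workaround sketch (unverified): steer scaled_dot_product_attention away from the
# cuDNN backend; flash and memory-efficient attention stay enabled.
import torch

# enable_cudnn_sdp exists in recent PyTorch releases; guard with hasattr just in case.
if torch.cuda.is_available() and hasattr(torch.backends.cuda, "enable_cudnn_sdp"):
    torch.backends.cuda.enable_cudnn_sdp(False)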