Crashed randomly in the middle of cooking #1795

Open
AbstractEyes opened this issue Nov 19, 2024 · 0 comments

AbstractEyes commented Nov 19, 2024

I'm running this on RunPod with a couple of 4090s and the newest version of the Kohya GUI, which doesn't update the requirements correctly, so I had to install them manually. I'm not sure whether this is a requirements problem or something else, but I figured it's better to share it now.
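
In case it helps, here's a minimal sketch (my own, not part of kohya_ss; the requirements.txt path is an assumption) that compares the exact pins in requirements.txt against what is actually installed in the venv:

```python
# Sketch: compare exact pins in requirements.txt with the installed versions.
# The path is an assumption; point it at your kohya_ss checkout.
from importlib.metadata import version, PackageNotFoundError
from pathlib import Path

req_file = Path("/workspace/kohya_ss/requirements.txt")  # assumed location

for raw in req_file.read_text().splitlines():
    line = raw.split("#", 1)[0].split(";", 1)[0].strip()  # drop comments / env markers
    if "==" not in line:
        continue  # only check exact pins
    name, pinned = (part.strip() for part in line.split("==", 1))
    try:
        installed = version(name)
    except PackageNotFoundError:
        print(f"{name}: MISSING (pinned {pinned})")
        continue
    if installed != pinned:
        print(f"{name}: installed {installed}, pinned {pinned}")
```

Any MISSING or mismatched line would point at the manual requirements update rather than the trainer itself.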

I'm training with Flux1D2pro as the base model; it might be related to that, but I don't think so.

There's also this message; I'm not sure what it means, but I've been seeing it for days now:
Could not load library libnvrtc.so.12. Error: libnvrtc.so.12: cannot open shared object file: No such file or directory
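
As a quick check (again my own sketch, not from the repo), this tries to load libnvrtc.so.12 with ctypes and then looks for the copy that the nvidia-cuda-nvrtc-cu12 wheel normally ships inside the venv; the exact wheel layout is an assumption:

```python
# Sketch: can libnvrtc.so.12 be resolved from this venv at all?
import ctypes
import glob
import os
import sysconfig

try:
    ctypes.CDLL("libnvrtc.so.12")
    print("libnvrtc.so.12 loaded from the default search path")
except OSError as exc:
    print(f"default load failed: {exc}")
    site_packages = sysconfig.get_paths()["purelib"]
    # The nvidia-cuda-nvrtc-cu12 wheel usually unpacks the library under
    # nvidia/cuda_nvrtc/lib/ (assumed layout).
    hits = glob.glob(os.path.join(site_packages, "nvidia", "**", "libnvrtc.so.12"),
                     recursive=True)
    print("copies inside the venv:", hits or "none found")
```

If a copy does exist inside the venv, pointing LD_LIBRARY_PATH at its directory before launching (or reinstalling nvidia-cuda-nvrtc-cu12) is one thing worth trying; that's a guess, not a confirmed fix.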

removing old checkpoint: ./outputs/v4_woman_shuffle/v4_aesthetic_juicer2-step00000850.safetensors
steps:   2%|██                                                                                                                     | 919/50025 [1:46:14<94:36:57,  6.94s/it, avr_loss=0.318]Could not load library libnvrtc.so.12. Error: libnvrtc.so.12: cannot open shared object file: No such file or directory
Could not load library libnvrtc.so.12. Error: libnvrtc.so.12: cannot open shared object file: No such file or directory
Could not load library libnvrtc.so.12. Error: libnvrtc.so.12: cannot open shared object file: No such file or directory
Could not load library libnvrtc.so.12. Error: libnvrtc.so.12: cannot open shared object file: No such file or directory
Could not load library libnvrtc.so.12. Error: libnvrtc.so.12: cannot open shared object file: No such file or directory
[rank1]: Traceback (most recent call last):
[rank1]:   File "/workspace/kohya_ss/sd-scripts/flux_train_network.py", line 574, in <module>
[rank1]:     trainer.train(args)
[rank1]:   File "/workspace/kohya_ss/sd-scripts/train_network.py", line 1226, in train
[rank1]:     accelerator.backward(loss)
[rank1]:   File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 2159, in backward
[rank1]:     loss.backward(**kwargs)
[rank1]:   File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/_tensor.py", line 581, in backward
[rank1]:     torch.autograd.backward(
[rank1]:   File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 347, in backward
[rank1]:     _engine_run_backward(
[rank1]:   File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
[rank1]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank1]:   File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 1125, in unpack_hook
[rank1]:     frame.recompute_fn(*args)
[rank1]:   File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 1519, in recompute_fn
[rank1]:     fn(*args, **kwargs)
[rank1]:   File "/workspace/kohya_ss/sd-scripts/library/flux_models.py", line 725, in _forward
[rank1]:     attn = attention(q, k, v, pe=pe, attn_mask=attn_mask)
[rank1]:   File "/workspace/kohya_ss/sd-scripts/library/flux_models.py", line 451, in attention
[rank1]:     x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
[rank1]: RuntimeError: Expected mha_graph->execute(handle, variant_pack, workspace_ptr.get()).is_good() to be true, but got false.  (Could this error message be improved?  If so, please report an enhancement request to PyTorch.)
W1119 06:10:40.225000 18754 venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 18826 closing signal SIGTERM
E1119 06:10:41.545000 18754 venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 18827) of binary: /workspace/kohya_ss/venv/bin/python
Traceback (most recent call last):
  File "/workspace/kohya_ss/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1097, in launch_command
    multi_gpu_launcher(args)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher
    distrib_run.run(args)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
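
The failing call is torch.nn.functional.scaled_dot_product_attention in flux_models.py, and the mha_graph error looks like it comes from the cuDNN attention backend. A possible workaround sketch, assuming a recent PyTorch 2.x where these backend toggles exist (whether it actually avoids this particular crash is unverified), would be to disable the cuDNN SDPA backend near the top of the training script:

```python
# Workaround sketch (assumption, not a confirmed fix): force
# scaled_dot_product_attention to skip the cuDNN backend that raises the
# mha_graph error, e.g. near the top of flux_train_network.py.
import torch

torch.backends.cuda.enable_cudnn_sdp(False)          # skip the cuDNN SDPA kernel
torch.backends.cuda.enable_flash_sdp(True)           # keep flash attention
torch.backends.cuda.enable_mem_efficient_sdp(True)   # keep memory-efficient attention
torch.backends.cuda.enable_math_sdp(True)            # plain math fallback
```

This would fall back to the flash / memory-efficient kernels, likely at some speed cost.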