Crashed randomly in the middle of cooking #1795

Open
AbstractEyes opened this issue Nov 19, 2024 · 0 comments

AbstractEyes commented Nov 19, 2024

I'm running this on RunPod with a couple of 4090s and the newest version of the Kohya GUI, which doesn't update the requirements correctly, so I had to install them manually. I'm not sure whether this is a requirements problem or something else, but I figured it's better to share it now.
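
In case it helps, here's a minimal sketch (my own, not part of kohya_ss; the requirements.txt path is an assumption) that compares the exact pins in requirements.txt against what is actually installed in the venv:

```python
# Sketch: compare exact pins in requirements.txt with the installed versions.
# The path is an assumption; point it at your kohya_ss checkout.
from importlib.metadata import version, PackageNotFoundError
from pathlib import Path

req_file = Path("/workspace/kohya_ss/requirements.txt")  # assumed location

for raw in req_file.read_text().splitlines():
    line = raw.split("#", 1)[0].split(";", 1)[0].strip()  # drop comments / env markers
    if "==" not in line:
        continue  # only check exact pins
    name, pinned = (part.strip() for part in line.split("==", 1))
    try:
        installed = version(name)
    except PackageNotFoundError:
        print(f"{name}: MISSING (pinned {pinned})")
        continue
    if installed != pinned:
        print(f"{name}: installed {installed}, pinned {pinned}")
```

Any MISSING or mismatched line would point at the manual requirements update rather than the trainer itself.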

I'm training with Flux1D2pro as the base model; it might be related to that, but I don't think so.

There's also this message; I'm not sure what it means, but I've been seeing it for days now:
Could not load library libnvrtc.so.12. Error: libnvrtc.so.12: cannot open shared object file: No such file or directory
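
As a quick check (again my own sketch, not from the repo), this tries to load libnvrtc.so.12 with ctypes and then looks for the copy that the nvidia-cuda-nvrtc-cu12 wheel normally ships inside the venv; the exact wheel layout is an assumption:

```python
# Sketch: can libnvrtc.so.12 be resolved from this venv at all?
import ctypes
import glob
import os
import sysconfig

try:
    ctypes.CDLL("libnvrtc.so.12")
    print("libnvrtc.so.12 loaded from the default search path")
except OSError as exc:
    print(f"default load failed: {exc}")
    site_packages = sysconfig.get_paths()["purelib"]
    # The nvidia-cuda-nvrtc-cu12 wheel usually unpacks the library under
    # nvidia/cuda_nvrtc/lib/ (assumed layout).
    hits = glob.glob(os.path.join(site_packages, "nvidia", "**", "libnvrtc.so.12"),
                     recursive=True)
    print("copies inside the venv:", hits or "none found")
```

If a copy does exist inside the venv, pointing LD_LIBRARY_PATH at its directory before launching (or reinstalling nvidia-cuda-nvrtc-cu12) is one thing worth trying; that's a guess, not a confirmed fix.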

removing old checkpoint: ./outputs/v4_woman_shuffle/v4_aesthetic_juicer2-step00000850.safetensors
steps:   2%|██                                                                                                                     | 919/50025 [1:46:14<94:36:57,  6.94s/it, avr_loss=0.318]Could not load library libnvrtc.so.12. Error: libnvrtc.so.12: cannot open shared object file: No such file or directory
Could not load library libnvrtc.so.12. Error: libnvrtc.so.12: cannot open shared object file: No such file or directory
Could not load library libnvrtc.so.12. Error: libnvrtc.so.12: cannot open shared object file: No such file or directory
Could not load library libnvrtc.so.12. Error: libnvrtc.so.12: cannot open shared object file: No such file or directory
Could not load library libnvrtc.so.12. Error: libnvrtc.so.12: cannot open shared object file: No such file or directory
[rank1]: Traceback (most recent call last):
[rank1]:   File "/workspace/kohya_ss/sd-scripts/flux_train_network.py", line 574, in <module>
[rank1]:     trainer.train(args)
[rank1]:   File "/workspace/kohya_ss/sd-scripts/train_network.py", line 1226, in train
[rank1]:     accelerator.backward(loss)
[rank1]:   File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 2159, in backward
[rank1]:     loss.backward(**kwargs)
[rank1]:   File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/_tensor.py", line 581, in backward
[rank1]:     torch.autograd.backward(
[rank1]:   File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 347, in backward
[rank1]:     _engine_run_backward(
[rank1]:   File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
[rank1]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank1]:   File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 1125, in unpack_hook
[rank1]:     frame.recompute_fn(*args)
[rank1]:   File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 1519, in recompute_fn
[rank1]:     fn(*args, **kwargs)
[rank1]:   File "/workspace/kohya_ss/sd-scripts/library/flux_models.py", line 725, in _forward
[rank1]:     attn = attention(q, k, v, pe=pe, attn_mask=attn_mask)
[rank1]:   File "/workspace/kohya_ss/sd-scripts/library/flux_models.py", line 451, in attention
[rank1]:     x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
[rank1]: RuntimeError: Expected mha_graph->execute(handle, variant_pack, workspace_ptr.get()).is_good() to be true, but got false.  (Could this error message be improved?  If so, please report an enhancement request to PyTorch.)
W1119 06:10:40.225000 18754 venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 18826 closing signal SIGTERM
E1119 06:10:41.545000 18754 venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 18827) of binary: /workspace/kohya_ss/venv/bin/python
Traceback (most recent call last):
  File "/workspace/kohya_ss/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1097, in launch_command
    multi_gpu_launcher(args)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher
    distrib_run.run(args)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
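
The failing call is torch.nn.functional.scaled_dot_product_attention in flux_models.py, and the mha_graph error looks like it comes from the cuDNN attention backend. A possible workaround sketch, assuming a recent PyTorch 2.x where these backend toggles exist (whether it actually avoids this particular crash is unverified), would be to disable the cuDNN SDPA backend near the top of the training script:

```python
# Workaround sketch (assumption, not a confirmed fix): force
# scaled_dot_product_attention to skip the cuDNN backend that raises the
# mha_graph error, e.g. near the top of flux_train_network.py.
import torch

torch.backends.cuda.enable_cudnn_sdp(False)          # skip the cuDNN SDPA kernel
torch.backends.cuda.enable_flash_sdp(True)           # keep flash attention
torch.backends.cuda.enable_mem_efficient_sdp(True)   # keep memory-efficient attention
torch.backends.cuda.enable_math_sdp(True)            # plain math fallback
```

This would fall back to the flash / memory-efficient kernels, likely at some speed cost.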