
RuntimeError: Attempting to deserialize object on CUDA device 1 but torch.cuda.device_count() is 1. #32

Open
leonel-os opened this issue Jul 26, 2021 · 5 comments

@leonel-os

leonel-os commented Jul 26, 2021

I installed all the dependencies, including CUDA, but when I run:

sh scripts/run_car.sh

I get the following error:

Load config from yml file: configs/car.yml
Loading configs from configs/car.yml

{'checkpoint_dir': 'results/car', 'save_checkpoint_freq': 500, 'keep_num_checkpoint': 2, 'use_logger': True, 'log_freq': 100, 'joint_train': False, 'independent': False, 'reset_weight': True, 'save_results': True, 'num_stage': 4, 'flip1_cfg': [False, False, False, False], 'flip3_cfg': [False, False, False, False], 'stage_len_dict': {'step1': 700, 'step2': 700, 'step3': 600}, 'stage_len_dict2': {'step1': 200, 'step2': 500, 'step3': 400}, 'image_size': 128, 'load_gt_depth': False, 'img_list_path': 'data/car/list.txt', 'img_root': 'data/car', 'latent_root': 'data/car/latents', 'model_name': 'gan2shape_car', 'category': 'car', 'share_weight': True, 'relative_enc': False, 'use_mask': True, 'add_mean_L': True, 'add_mean_V': True, 'min_depth': 0.9, 'max_depth': 1.1, 'xyz_rotation_range': 60, 'xy_translation_range': 0.1, 'z_translation_range': 0, 'collect_iters': 100, 'batchsize': 8, 'lr': 0.0001, 'lam_perc': 0.5, 'lam_smooth': 0.01, 'lam_regular': 0.01, 'view_mvn_path': 'checkpoints/view_light/view_mvn.pth', 'light_mvn_path': 'checkpoints/view_light/light_mvn.pth', 'rand_light': [-1, 1, -0.2, 0.8, -0.1, 0.6, -0.6], 'channel_multiplier': 2, 'gan_size': 512, 'gan_ckpt': 'checkpoints/stylegan2/stylegan2-car-config-f.pt', 'F1_d': 2, 'rot_center_depth': 1.0, 'fov': 10, 'tex_cube_size': 2, 'config': 'configs/car.yml', 'seed': 0, 'num_workers': 4, 'distributed': True}

Setting up Perceptual loss...
Loading model from: /home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/stylegan2/stylegan2-pytorch/lpips/weights/v0.1/vgg.pth
Traceback (most recent call last):
  File "run.py", line 31, in <module>
    trainer = Trainer(cfgs, GAN2Shape)
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/trainer.py", line 23, in __init__
    self.model = model(cfgs)
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/model.py", line 89, in __init__
    model='net-lin', net='vgg', use_gpu=True, gpu_ids=[torch.device(self.rank)]
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/stylegan2/stylegan2-pytorch/lpips/__init__.py", line 22, in __init__
    self.model.initialize(model=model, net=net, use_gpu=use_gpu, colorspace=colorspace, spatial=self.spatial, gpu_ids=gpu_ids)
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/stylegan2/stylegan2-pytorch/lpips/dist_model.py", line 75, in initialize
    self.net.load_state_dict(torch.load(model_path, **kw), strict=False)
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/serialization.py", line 529, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/serialization.py", line 702, in _legacy_load
    result = unpickler.load()
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/serialization.py", line 665, in persistent_load
    deserialized_objects[root_key] = restore_location(obj, location)
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/serialization.py", line 740, in restore_location
    return default_restore_location(storage, str(map_location))
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/serialization.py", line 156, in default_restore_location
    result = fn(storage, location)
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/serialization.py", line 132, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/serialization.py", line 126, in validate_cuda_device
    device, torch.cuda.device_count()))
RuntimeError: Attempting to deserialize object on CUDA device 1 but torch.cuda.device_count() is 1. Please use torch.load with map_location to map your storages to an existing device.
RuntimeError: Attempting to deserialize object on CUDA device 3 but torch.cuda.device_count() is 1. Please use torch.load with map_location to map your storages to an existing device.
RuntimeError: Attempting to deserialize object on CUDA device 2 but torch.cuda.device_count() is 1. Please use torch.load with map_location to map your storages to an existing device.
...[net-lin [vgg]] initialized
...Done

Please, I need help. Thanks.

@XingangPan
Owner

@leonel-os It seems that your machine has only 1 GPU, while our scripts require at least 4 GPUs. You need to revise the run_car.sh script to run on one GPU: change CUDA_VISIBLE_DEVICES=0,1,2,3 to CUDA_VISIBLE_DEVICES=0 and disable distributed training. However, you may get sub-optimal quality with only one GPU, so I suggest running on more GPUs if possible.
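
For reference, the error message itself also points at torch.load's map_location argument. A minimal sketch of that remedy, in case you prefer to patch the loading call instead of the script (the path and device below are placeholders, not the repo's actual loading code):

# Sketch only: map saved CUDA storages onto an existing device,
# as the RuntimeError suggests. The checkpoint path is a placeholder.
import torch

ckpt_path = "lpips/weights/v0.1/vgg.pth"  # placeholder path
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
state_dict = torch.load(ckpt_path, map_location=device)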

@leonel-os
Author

leonel-os commented Jul 28, 2021

@XingangPan thanks, changing the run_car.sh configuration fixed the error.

EXP=car
CONFIG=car
GPUS=1
PORT=${PORT:-29577}

mkdir -p results/${EXP}
CUDA_VISIBLE_DEVICES=0 \
python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
    run.py \
    --launcher pytorch \
    --config configs/${CONFIG}.yml \
    2>&1 | tee results/${EXP}/log.txt

But now I'm getting the following error, related to CUDA out of memory:

sh scripts/run_car.sh
Load config from yml file: configs/car.yml
Loading configs from configs/car.yml
{'checkpoint_dir': 'results/car', 'save_checkpoint_freq': 500, 'keep_num_checkpoint': 2, 'use_logger': True, 'log_freq': 100, 'joint_train': False, 'independent': False, 'reset_weight': True, 'save_results': True, 'num_stage': 4, 'flip1_cfg': [False, False, False, False], 'flip3_cfg': [False, False, False, False], 'stage_len_dict': {'step1': 700, 'step2': 700, 'step3': 600}, 'stage_len_dict2': {'step1': 200, 'step2': 500, 'step3': 400}, 'image_size': 128, 'load_gt_depth': False, 'img_list_path': 'data/car/list.txt', 'img_root': 'data/car', 'latent_root': 'data/car/latents', 'model_name': 'gan2shape_car', 'category': 'car', 'share_weight': False, 'relative_enc': False, 'use_mask': True, 'add_mean_L': True, 'add_mean_V': True, 'min_depth': 0.9, 'max_depth': 1.1, 'xyz_rotation_range': 60, 'xy_translation_range': 0.1, 'z_translation_range': 0, 'collect_iters': 100, 'batchsize': 8, 'lr': 0.0001, 'lam_perc': 0.5, 'lam_smooth': 0.01, 'lam_regular': 0.01, 'view_mvn_path': 'checkpoints/view_light/view_mvn.pth', 'light_mvn_path': 'checkpoints/view_light/light_mvn.pth', 'rand_light': [-1, 1, -0.2, 0.8, -0.1, 0.6, -0.6], 'channel_multiplier': 2, 'gan_size': 512, 'gan_ckpt': 'checkpoints/stylegan2/stylegan2-car-config-f.pt', 'F1_d': 2, 'rot_center_depth': 1.0, 'fov': 10, 'tex_cube_size': 2, 'config': 'configs/car.yml', 'seed': 0, 'num_workers': 2, 'distributed': True}
Setting up Perceptual loss...
Loading model from: /home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/stylegan2/stylegan2-pytorch/lpips/weights/v0.1/vgg.pth
...[net-lin [vgg]] initialized
...Done
Loading images...
Traceback (most recent call last):
  File "run.py", line 34, in <module>
    trainer.train()
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/trainer.py", line 158, in train
    self.setup_data(epoch)
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/trainer.py", line 78, in setup_data
    self.latent_list[epoch])
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/model.py", line 149, in setup_target
    self.load_latent()
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/model.py", line 248, in load_latent
    self.latent_w, self.gan_im = get_w_img(self.w_path)
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/model.py", line 227, in get_w_img
    truncation=self.truncation, randomize_noise=False)
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/stylegan2/stylegan2-pytorch/model.py", line 595, in forward
    out = conv2(out, latent[:, i + 1], noise=noise2)
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/stylegan2/stylegan2-pytorch/model.py", line 350, in forward
    out = self.conv(input, style)
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/darkayserleo/Documentos/Tesis/GAN2Shape/gan2shape/stylegan2/stylegan2-pytorch/model.py", line 287, in forward
    out = F.conv2d(input, weight, padding=self.padding, groups=batch)
RuntimeError: CUDA out of memory. Tried to allocate 72.00 MiB (GPU 0; 1.95 GiB total capacity; 901.72 MiB already allocated; 99.88 MiB free; 928.00 MiB reserved in total by PyTorch)
Traceback (most recent call last):
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/darkayserleo/anaconda3/envs/unsup3d/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/darkayserleo/anaconda3/envs/unsup3d/bin/python', '-u', 'run.py', '--local_rank=0', '--launcher', 'pytorch', '--config', 'configs/car.yml']' returned non-zero exit status 1.

I don't know how to fix it. I read some forums and they say I need to change the batch size and/or the number of workers, is that right? What else do I need to change to run this demo?
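
(For reference, both values come from the config passed to run.py; the keys 'batchsize': 8 and 'num_workers': 4 appear in the config dump above. A minimal sketch of lowering them with PyYAML, assuming you would rather script the change than edit configs/car.yml by hand; the new values are only illustrative, not the authors' recommendation:)

# Sketch only: lower the memory-related settings in configs/car.yml.
# 'batchsize' and 'num_workers' are the keys shown in the config dump above;
# the values below are illustrative placeholders.
import yaml  # the repo already parses these .yml configs

cfg_path = "configs/car.yml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

cfg["batchsize"] = 2     # was 8
cfg["num_workers"] = 2   # was 4

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f)

(Editing the file by hand preserves its comments and key order; safe_dump rewrites the file without them.)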

Thanks in advance

Leonel

@hito-Chen

Hello, I have encountered the same problem; changing the batch size / number of workers does not work. Have you solved it yet? Looking forward to your reply. Thanks!

@dellshan

dellshan commented Mar 9, 2023

(base) [yshan@saturn12 GAN2Shape]$ sh scripts/run_car.sh
/data/yshan/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
/data/yshan/anaconda3/lib/python3.9/site-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
warnings.warn(
StyleGAN2: Optimized CUDA op FusedLeakyReLU not available, using native PyTorch fallback.
StyleGAN2: Optimized CUDA op UpFirDn2d not available, using native PyTorch fallback.
Load config from yml file: configs/car.yml
Loading configs from configs/car.yml
{'checkpoint_dir': 'results/car', 'save_checkpoint_freq': 500, 'keep_num_checkpoint': 2, 'use_logger': True, 'log_freq': 100, 'joint_train': False, 'independent': False, 'reset_weight': True, 'save_results': True, 'num_stage': 4, 'flip1_cfg': [False, False, False, False], 'flip3_cfg': [False, False, False, False], 'stage_len_dict': {'step1': 700, 'step2': 700, 'step3': 600}, 'stage_len_dict2': {'step1': 200, 'step2': 500, 'step3': 400}, 'image_size': 128, 'load_gt_depth': False, 'img_list_path': 'data/car/list.txt', 'img_root': 'data/car', 'latent_root': 'data/car/latents', 'model_name': 'gan2shape_car', 'category': 'car', 'share_weight': True, 'relative_enc': False, 'use_mask': True, 'add_mean_L': True, 'add_mean_V': True, 'min_depth': 0.9, 'max_depth': 1.1, 'xyz_rotation_range': 60, 'xy_translation_range': 0.1, 'z_translation_range': 0, 'collect_iters': 100, 'batchsize': 8, 'lr': 0.0001, 'lam_perc': 0.5, 'lam_smooth': 0.01, 'lam_regular': 0.01, 'view_mvn_path': 'checkpoints/view_light/view_mvn.pth', 'light_mvn_path': 'checkpoints/view_light/light_mvn.pth', 'rand_light': [-1, 1, -0.2, 0.8, -0.1, 0.6, -0.6], 'channel_multiplier': 2, 'gan_size': 512, 'gan_ckpt': 'checkpoints/stylegan2/stylegan2-car-config-f.pt', 'F1_d': 2, 'rot_center_depth': 1.0, 'fov': 10, 'tex_cube_size': 2, 'config': 'configs/car.yml', 'seed': 0, 'num_workers': 4, 'distributed': True}
Setting up Perceptual loss...
/data/yshan/anaconda3/lib/python3.9/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/data/yshan/anaconda3/lib/python3.9/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing weights=VGG16_Weights.IMAGENET1K_V1. You can also use weights=VGG16_Weights.DEFAULT to get the most up-to-date weights.
warnings.warn(msg)
Loading model from: /data/yshan/GAN2Shape/gan2shape/stylegan2/stylegan2-pytorch/lpips/weights/v0.1/vgg.pth
...[net-lin [vgg]] initialized
...Done
Traceback (most recent call last):
  File "/data/yshan/GAN2Shape/run.py", line 31, in <module>
    trainer = Trainer(cfgs, GAN2Shape)
  File "/data/yshan/GAN2Shape/gan2shape/trainer.py", line 23, in __init__
    self.model = model(cfgs)
  File "/data/yshan/GAN2Shape/gan2shape/model.py", line 92, in __init__
    self.renderer = Renderer(cfgs, self.image_size)
  File "/data/yshan/GAN2Shape/gan2shape/renderer/renderer.py", line 44, in __init__
    self.inv_K_origin = torch.inverse(K).unsqueeze(0)
RuntimeError: Error in dlopen: libtorch_cuda_linalg.so: cannot open shared object file: No such file or directory
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 209663) of binary: /data/yshan/anaconda3/bin/python
Traceback (most recent call last):
  File "/data/yshan/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/data/yshan/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/yshan/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/data/yshan/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/data/yshan/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/data/yshan/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/data/yshan/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/yshan/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

run.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-03-10_00:31:49
host : saturn12.ihpc.uts.edu.au
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 209663)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

@dellshan

dellshan commented Mar 9, 2023

Hi, I got this error when using your config:
EXP=car
CONFIG=car
GPUS=1
PORT=${PORT:-29577}

mkdir -p results/${EXP}
CUDA_VISIBLE_DEVICES=0 \
python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
    run.py \
    --launcher pytorch \
    --config configs/${CONFIG}.yml \
    2>&1 | tee results/${EXP}/log.txt
