Training does not work with 1 GPU #2

Open · kata44 opened this issue Sep 18, 2021 · 6 comments

kata44 commented Sep 18, 2021

There seems to be a problem with the contrastive loss when training with 1 GPU; training only works when setting no_insgen=true.

The output is:

Setting up augmentation...
Distributing across 1 GPUs...
Distributing Contrastive Heads across 1 GPUS...
Setting up training phases...
Setting up contrastive training phases...
Exporting sample images...
Initializing logs...
2021-09-18 04:23:26.767334: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Training for 25000 kimg...

Traceback (most recent call last):
  File "train.py", line 583, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "train.py", line 576, in main
    subprocess_fn(rank=0, args=args, temp_dir=temp_dir)
  File "train.py", line 421, in subprocess_fn
    training_loop.training_loop(rank=rank, **args)
  File "/home/katarina/ML/insgen/training/training_loop.py", line 326, in training_loop
    loss.accumulate_gradients(phase=phase.name, real_img=real_img, real_c=real_c, gen_z=gen_z, gen_c=gen_c, sync=sync, gain=gain, cl_phases=cl_phases, D_ema=D_ema, g_fake_cl=not no_cl_on_g, **cl_loss_weight)
  File "/home/katarina/ML/insgen/training/contrastive_loss.py", line 156, in accumulate_gradients
    loss_Dreal = loss_Dreal + lw_real_cl * self.run_cl(real_img_tmp, real_c, sync, Dphase.module, D_ema, loss_name='D_cl')
  File "/home/katarina/ML/insgen/training/contrastive_loss.py", line 71, in run_cl
    loss = contrastive_head(logits0, logits1, loss_only=loss_only, update_q=update_q)
  File "/home/katarina/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/katarina/ML/insgen/training/contrastive_head.py", line 183, in forward
    self._dequeue_and_enqueue(k)
  File "/home/katarina/.local/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/katarina/ML/insgen/training/contrastive_head.py", line 51, in _dequeue_and_enqueue
    keys = concat_all_gather(keys)
  File "/home/katarina/.local/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/katarina/ML/insgen/training/contrastive_head.py", line 197, in concat_all_gather
    for _ in range(torch.distributed.get_world_size())]
  File "/home/katarina/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 748, in get_world_size
    return _get_group_size(group)
  File "/home/katarina/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 274, in _get_group_size
    default_pg = _get_default_group()
  File "/home/katarina/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 358, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
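
For reference, the call that actually fails is torch.distributed.get_world_size() inside concat_all_gather, which requires torch.distributed.init_process_group() to have been called first. A minimal standalone sketch (not taken from the repo) of the usual guard:

import torch.distributed as dist

def safe_world_size():
    # get_world_size() raises "Default process group has not been initialized"
    # unless init_process_group() was called, so fall back to 1 in that case.
    if dist.is_available() and dist.is_initialized():
        return dist.get_world_size()
    return 1

print(safe_world_size())  # prints 1 in a plain single-process run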

kata44 commented Sep 22, 2021

I attempted this fix:

diff --git a/training/contrastive_head.py b/training/contrastive_head.py
index e09367e..4517bac 100644
--- a/training/contrastive_head.py
+++ b/training/contrastive_head.py
@@ -189,10 +189,15 @@ class CLHead(torch.nn.Module):
 
 @torch.no_grad()
 def concat_all_gather(tensor):
+
+    if not torch.distributed.is_initialized():
+        return tensor
+
     """
     Performs all_gather operation on the provided tensors.
     *** Warning ***: torch.distributed.all_gather has no gradient.
     """
+
     tensors_gather = [torch.ones_like(tensor)
         for _ in range(torch.distributed.get_world_size())]
     torch.distributed.all_gather(tensors_gather, tensor, async_op=False)
diff --git a/training/training_loop.py b/training/training_loop.py
index a09c5a1..efbef17 100755
--- a/training/training_loop.py
+++ b/training/training_loop.py
@@ -398,9 +398,11 @@ def training_loop(
             snapshot_data = dict(training_set_kwargs=dict(training_set_kwargs))
             for name, module in [('G', G), ('D', D), ('G_ema', G_ema), ('augment_pipe', augment_pipe), ('D_ema', D_ema), ('DHead', DHead), ('GHead', GHead)]:
                 if module is not None:
-                    if name in ['DHead', 'GHead']:
-                        module = module.module
                     if num_gpus > 1:
+
+                        if name in ['DHead', 'GHead']:
+                            module = module.module
+
                         misc.check_ddp_consistency(module, ignore_regex=r'.*\.w_avg')
                     module = copy.deepcopy(module).eval().requires_grad_(False).cpu()
                 snapshot_data[name] = module

However, this halves training throughput with InsGen enabled, and according to the paper, "the extra computing load is extremely small and the training efficiency is barely affected", so I assume this is not doing the right thing.

@Johnson-yue

@kata44 I also want to train with only 1 GPU, and I think modifying only the concat_all_gather() function is not correct.

As the code shows in _batch_shuffle_ddp() (https://github.com/genforce/insgen/blob/52bda7cfe59094fbb2f533a0355fff1392b0d380/training/contrastive_head.py#L73-L75) and _batch_unshuffle_ddp(), those helpers also rely on torch.distributed operations.
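
A minimal sketch of single-GPU replacements for those two helpers, assuming they follow the MoCo-style signatures (shuffle returns the batch plus an unshuffle index, unshuffle reverses it); this is an illustration, not the repo's actual code:

import torch

@torch.no_grad()
def _batch_shuffle_single_gpu(x):
    # With one process there is nothing to shuffle across GPUs, so return the
    # batch unchanged together with an identity unshuffle index.
    idx_unshuffle = torch.arange(x.shape[0], device=x.device)
    return x, idx_unshuffle

@torch.no_grad()
def _batch_unshuffle_single_gpu(x, idx_unshuffle):
    # Inverse of the identity shuffle above.
    return x[idx_unshuffle]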

RuoyuGuo commented Jan 4, 2022

Hi, I am not familiar with multi-GPU training, but I think the bug is triggered by def _dequeue_and_enqueue(...) in contrastive_head.py.

Now look at line 51, keys = concat_all_gather(keys). I guess this line only concatenates distributed tensors from different GPUs, which is unnecessary for 1 GPU, so I simply delete this line when training on 1 GPU.
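
For what it's worth, with a single process the gather would just return the local keys anyway, so instead of deleting the line one can guard it and keep a single code path for 1 and N GPUs. A sketch assuming a MoCo-style queue update (self.queue, self.queue_ptr and self.K are illustrative names, not necessarily the repo's; concat_all_gather is the repo helper patched above):

import torch

@torch.no_grad()
def _dequeue_and_enqueue(self, keys):
    # Only gather across processes when a process group actually exists;
    # on a single GPU the local keys already cover the whole batch.
    if torch.distributed.is_initialized() and torch.distributed.get_world_size() > 1:
        keys = concat_all_gather(keys)

    batch_size = keys.shape[0]
    ptr = int(self.queue_ptr)
    assert self.K % batch_size == 0  # queue length must be divisible by the batch size

    # Overwrite the oldest entries in the key queue (MoCo-style ring buffer).
    self.queue[:, ptr:ptr + batch_size] = keys.T
    self.queue_ptr[0] = (ptr + batch_size) % self.K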

49xxy commented Jun 13, 2022

> Hi, I am not familiar with multi-GPU training, but I think the bug is triggered by def _dequeue_and_enqueue(...) in contrastive_head.py.
>
> Now look at line 51; I guess that line only concatenates distributed tensors from different GPUs, which is unnecessary for 1 GPU. So when training on 1 GPU I simply delete the line keys = concat_all_gather(keys).

Have you solved this problem? Can you train with one GPU?

49xxy commented Jul 14, 2022

> Hi, I am not familiar with multi-GPU training, but I think the bug is triggered by def _dequeue_and_enqueue(...) in contrastive_head.py.
>
> Now look at line 51; I guess that line only concatenates distributed tensors from different GPUs, which is unnecessary for 1 GPU. So when training on 1 GPU I simply delete the line keys = concat_all_gather(keys).

Hi! Can I delete this line and still train normally?

@GilesBathgate

I think the issue is simply that the process group needs to be initialised even if there is only one GPU; see the patch in #5.
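
For anyone hitting this in the meantime, a minimal sketch of initialising a single-process group so the torch.distributed calls work unchanged (the exact approach in #5 may differ; the address and port below are illustrative):

import os
import torch

def init_single_process_group():
    # Create a default process group with world_size=1 so that collectives
    # such as all_gather become single-process no-ops instead of raising
    # "Default process group has not been initialized".
    if torch.distributed.is_initialized():
        return
    os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
    os.environ.setdefault('MASTER_PORT', '29500')
    backend = 'nccl' if torch.cuda.is_available() else 'gloo'
    torch.distributed.init_process_group(backend=backend, rank=0, world_size=1)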
