
Problem arises when I train the updown_nocaps setting #9

Closed
chenxy99 opened this issue Dec 10, 2019 · 5 comments
chenxy99 commented Dec 10, 2019

Hi, thanks a lot for this great dataset.

First, I followed the instructions in 'How to setup this codebase?' to set up my anaconda env (updown) and set the token through the EvalAI CLI.

Next, following the instructions in 'How to train your captioner?', I ran the script:

python scripts/train.py --config configs/updown_nocaps_val.yaml --config-override OPTIM.BATCH_SIZE 250 --checkpoint-every 10 --gpu-ids 0 --serialization-dir checkpoints/updown-baseline

I find that the script runs through to the validation part in scripts/train.py and prints the first 25 captions with their image IDs (train.py lines 261-263):

# Print first 25 captions with their Image ID.
for k in range(25):
    print(predictions[k]["image_id"], predictions[k]["caption"])

Then, however, the script seems to hang, as shown in the screenshot below:
[screenshot: image_cap]

After waiting a long time (at least half an hour), I got an error message:
[screenshot: image_error]

I suspect there is some problem with evalai, but conda list shows that evalai is installed. Another possibility is that when I ran pip install -r requirements.txt, I got a message that some modules are incompatible. I have tried a lot, but I cannot find a combination in which all of the modules are compatible.
[screenshot: conda_install]

So I hope that you can provide an environment.yaml (from conda env export > environment.yaml) so that I can try the same versions of all the modules.
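(For reference, an environment exported with conda env export can be recreated elsewhere with conda env create -f environment.yaml.)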

If that is not the cause, could you help me solve this problem?

Thanks a lot again.

kdexd (Collaborator) commented Dec 10, 2019

Hi @chenxy99 — glad you liked our work! It looks like your predictions are not being uploaded to EvalAI for validation. Can you double-check that your compute machine has internet access? Also, please try to save one prediction file as JSON (and remove the --evalai-submit flag), then go to evalai.cloudcv.org and submit it manually. Let me know if EvalAI does not accept your file, thanks!
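For reference, a minimal sketch of dumping the predictions to a JSON file for a manual upload, assuming predictions is the same list of {"image_id": ..., "caption": ...} dicts that train.py prints during validation:

import json

# Assumption: `predictions` holds the records printed during validation,
# e.g. [{"image_id": 42, "caption": "a dog lying on a couch"}, ...]
with open("predictions.json", "w") as f:
    json.dump(predictions, f)

The resulting predictions.json can then be uploaded through the EvalAI web interface.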

chenxy99 (Author) commented

Thanks for your help. It works well.

kdexd (Collaborator) commented Dec 12, 2019

That's great, glad it worked!

chenxy99 (Author) commented

Hello, I find that your code works well in the single-GPU scenario, but in the multi-GPU setting it seems to have a problem. I use the script below:
python scripts/train.py --config configs/updown_plus_cbs_nocaps_val.yaml --config-override OPTIM.BATCH_SIZE 250 --checkpoint-every 10 --gpu-ids 0 1 --serialization-dir checkpoints/updown_plus_cbs_test
During the second evaluation on the nocaps val split, an error occurs:

  0%|          | 19/70000 [07:59<152:06:17, 7.82s/it]
Traceback (most recent call last):   | 324/750 [01:36<02:04, 3.42it/s]
  File "scripts/train.py", line 239, in <module>
    num_constraints=batch.get("num_constraints", None),
  File "/home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
    return self.gather(outputs, self.output_device)
  File "/home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
    return gather_map(outputs)
  File "/home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 61, in gather_map
    for k in out))
  File "/home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 61, in <genexpr>
    for k in out))
  File "/home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 54, in gather_map
    return Gather.apply(target_device, dim, outputs)
  File "/home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/cuda/comm.py", line 165, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: Gather got an input of invalid size: got [3, 4], but expected [3, 20] (gather at /pytorch/torch/csrc/cuda/comm.cpp:238)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7fa99d5c7441 in /home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7fa99d5c6d7a in /home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: torch::cuda::gather(c10::ArrayRef<at::Tensor>, long, c10::optional<int>) + 0x55a (0x7fa99c95138a in /home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #3: <unknown function> + 0x5a230c (0x7fa9dcd3830c in /home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x130cfc (0x7fa9dc8c6cfc in /home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #15: THPFunction_apply(_object*, _object*) + 0x6b1 (0x7fa9dcb49481 in /home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

It seems there is some problem with the output of CBS:

RuntimeError: Gather got an input of invalid size: got [3, 4], but expected [3, 20] (gather at /pytorch/torch/csrc/cuda/comm.cpp:238)

I would like to know how I can fix this problem.
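For context, nn.DataParallel gathers each GPU's outputs along the batch dimension and requires all other dimensions to match, but CBS decoding can terminate at different caption lengths on each replica (here [3, 4] vs. [3, 20]), which is what trips the gather. A minimal sketch of the usual workaround, padding the decoded captions to a fixed length before they leave forward (the helper name, max_decoding_steps, and pad_index below are illustrative, not names from this codebase):

import torch

def pad_to_max_length(predictions: torch.Tensor,
                      max_decoding_steps: int = 20,
                      pad_index: int = 0) -> torch.Tensor:
    # DataParallel can only gather same-shaped tensors (apart from the
    # batch dim), so right-pad short caption tensors up to a fixed length.
    batch_size, num_steps = predictions.size()
    if num_steps < max_decoding_steps:
        padding = predictions.new_full(
            (batch_size, max_decoding_steps - num_steps), pad_index
        )
        predictions = torch.cat([predictions, padding], dim=1)
    return predictions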

Thanks a lot again for your help.

chenxy99 reopened this Dec 25, 2019
chenxy99 (Author) commented Jan 31, 2020

Hello @kdexd, since this morning, every time I submit my evaluation JSON file to EvalAI, the status stays at 'submitted'. I have waited about 4 hours, but I still cannot get the 'finished' status (it usually takes about 1 minute to change to 'finished'). I would like to know whether something is wrong with EvalAI for nocaps since this morning, and hopefully you can help me solve this issue.

Thanks a lot.
