
Problem arises when I train the updown_nocaps setting #9

Closed
chenxy99 opened this issue Dec 10, 2019 · 5 comments
chenxy99 commented Dec 10, 2019

Hi, thanks a lot for this great dataset.

First, I followed the instructions in 'How to setup this codebase?' to set up my anaconda env (updown) and set the token through the EvalAI CLI.

Next, following the instructions in 'How to train your captioner?', I ran the script:

python scripts/train.py --config configs/updown_nocaps_val.yaml --config-override OPTIM.BATCH_SIZE 250 --checkpoint-every 10 --gpu-ids 0 --serialization-dir checkpoints/updown-baseline

I find that the script runs through to the validation part in scripts/train.py and prints the first 25 captions with their image IDs (train.py lines 261-263):

# Print first 25 captions with their Image ID.
for k in range(25):
    print(predictions[k]["image_id"], predictions[k]["caption"])

Then, however, the script seems to hang, as shown in the screenshot below:
[screenshot: image_cap]

After waiting a long time (at least half an hour), I got an error message:
[screenshot: image_error]

I suspect there is some problem with evalai, but conda list shows that evalai is installed. Another possibility is that when I ran pip install -r requirements.txt, I got a message that some modules are incompatible. I have tried a lot, but I cannot find a combination in which all of the modules are compatible.
[screenshot: conda_install]

So I hope that you can provide an environment.yaml (from conda env export > environment.yaml) so that I can try the same versions of all the modules.
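(For reference, an environment exported with conda env export can be recreated elsewhere with conda env create -f environment.yaml.)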

If that is not the cause, could you help me solve this problem?

Thanks a lot again.

kdexd (Collaborator) commented Dec 10, 2019

Hi @chenxy99 — glad you liked our work! It looks like your predictions are not being uploaded to EvalAI for validation. Can you double-check that your compute machine has internet access? Also, please try to save one prediction file as JSON (and remove the --evalai-submit flag), then go to evalai.cloudcv.org and submit it manually. Let me know if EvalAI does not accept your file, thanks!
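For reference, a minimal sketch of dumping the predictions to a JSON file for a manual upload, assuming predictions is the same list of {"image_id": ..., "caption": ...} dicts that train.py prints during validation:

import json

# Assumption: `predictions` holds the records printed during validation,
# e.g. [{"image_id": 42, "caption": "a dog lying on a couch"}, ...]
with open("predictions.json", "w") as f:
    json.dump(predictions, f)

The resulting predictions.json can then be uploaded through the EvalAI web interface.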

chenxy99 (Author) commented

Thanks for your help. It works well.

kdexd (Collaborator) commented Dec 12, 2019

That's great, glad it worked!

chenxy99 (Author) commented

Hello, I find that your code works well in the single-GPU scenario, but in the multi-GPU setting it seems to have a problem. I use the script below:
python scripts/train.py --config configs/updown_plus_cbs_nocaps_val.yaml --config-override OPTIM.BATCH_SIZE 250 --checkpoint-every 10 --gpu-ids 0 1 --serialization-dir checkpoints/updown_plus_cbs_test
During the second evaluation on the nocaps val split, an error occurs:

  0%|          | 19/70000 [07:59<152:06:17, 7.82s/it]
Traceback (most recent call last):   | 324/750 [01:36<02:04, 3.42it/s]
  File "scripts/train.py", line 239, in <module>
    num_constraints=batch.get("num_constraints", None),
  File "/home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
    return self.gather(outputs, self.output_device)
  File "/home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
    return gather_map(outputs)
  File "/home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 61, in gather_map
    for k in out))
  File "/home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 61, in <genexpr>
    for k in out))
  File "/home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 54, in gather_map
    return Gather.apply(target_device, dim, outputs)
  File "/home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/cuda/comm.py", line 165, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: Gather got an input of invalid size: got [3, 4], but expected [3, 20] (gather at /pytorch/torch/csrc/cuda/comm.cpp:238)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7fa99d5c7441 in /home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7fa99d5c6d7a in /home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: torch::cuda::gather(c10::ArrayRef<at::Tensor>, long, c10::optional<int>) + 0x55a (0x7fa99c95138a in /home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #3: <unknown function> + 0x5a230c (0x7fa9dcd3830c in /home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x130cfc (0x7fa9dc8c6cfc in /home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #15: THPFunction_apply(_object*, _object*) + 0x6b1 (0x7fa9dcb49481 in /home/xianyu/anaconda3/envs/updown/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

It seems there is some problem with the output of CBS:

RuntimeError: Gather got an input of invalid size: got [3, 4], but expected [3, 20] (gather at /pytorch/torch/csrc/cuda/comm.cpp:238)

I would like to know how I can fix this problem.
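For context, nn.DataParallel gathers each GPU's outputs along the batch dimension and requires all other dimensions to match, but CBS decoding can terminate at different caption lengths on each replica (here [3, 4] vs. [3, 20]), which is what trips the gather. A minimal sketch of the usual workaround, padding the decoded captions to a fixed length before they leave forward (the helper name, max_decoding_steps, and pad_index below are illustrative, not names from this codebase):

import torch

def pad_to_max_length(predictions: torch.Tensor,
                      max_decoding_steps: int = 20,
                      pad_index: int = 0) -> torch.Tensor:
    # DataParallel can only gather same-shaped tensors (apart from the
    # batch dim), so right-pad short caption tensors up to a fixed length.
    batch_size, num_steps = predictions.size()
    if num_steps < max_decoding_steps:
        padding = predictions.new_full(
            (batch_size, max_decoding_steps - num_steps), pad_index
        )
        predictions = torch.cat([predictions, padding], dim=1)
    return predictions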

Thanks a lot again for your help.

chenxy99 reopened this Dec 25, 2019
chenxy99 (Author) commented Jan 31, 2020

Hello @kdexd, since this morning, every time I submit my evaluation JSON file to EvalAI, the status stays at 'submitted'. I have waited about 4 hours, but I still cannot get the 'finished' status (it usually takes about 1 minute to change to 'finished'). I would like to know whether something is wrong with EvalAI for nocaps since this morning, and hopefully you can help me solve this issue.

Thanks a lot.
