cuDNN error #17

Open
LeftAttention opened this issue Nov 17, 2021 · 4 comments

Comments

@LeftAttention

While executing the training script, I encountered the following error.

Traceback (most recent call last):
  File "train.py", line 72, in <module>
    model.optimize_parameters()
  File "/home/DMFN/models/inpainting_model.py", line 177, in optimize_parameters
    l_g_total.backward()
  File "/home/anaconda3/envs/dmfn/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/anaconda3/envs/dmfn/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([2, 256, 64, 64], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(256, 256, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams 
    data_type = CUDNN_DATA_FLOAT
    padding = [1, 1, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0x7fb8e80d6120
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 2, 256, 64, 64, 
    strideA = 1048576, 4096, 64, 1, 
output: TensorDescriptor 0x7fb8e80c8380
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 2, 256, 64, 64, 
    strideA = 1048576, 4096, 64, 1, 
weight: FilterDescriptor 0x7fb8e80d2500
    type = CUDNN_DATA_FLOAT
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 256, 256, 3, 3, 
Pointer addresses: 
    input: 0x7fb936000000
    output: 0x7fb93a000000
    weight: 0x7fb924e90200
Additional pointer addresses: 
    grad_output: 0x7fb93a000000
    grad_weight: 0x7fb924e90200
Backward filter algorithm: 5

When I ran the suggested code snippet, I did not get any errors or warnings.

After typing python -m torch.utils.collect_env I got the following.

Collecting environment information...
PyTorch version: 1.8.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 16.04.7 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~16.04) 9.4.0
Clang version: Could not collect
CMake version: version 3.21.3

Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce GTX 1070
Nvidia driver version: 455.45.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn.so.5.1.10
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn.so.8.0.4
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.4
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.4
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.4
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.4
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.4
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.4
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] efficientnet-pytorch==0.6.3
[pip3] facenet-pytorch==2.5.2
[pip3] numpy==1.19.5
[pip3] numpydoc==1.1.0
[pip3] pytorch-fid==0.1.1
[pip3] pytorch-ignite==0.4.7
[pip3] pytorch2keras==0.2.4
[pip3] pytorch3d==0.5.0
[pip3] segmentation-models-pytorch==0.1.3
[pip3] torch==1.8.0+cu111
[pip3] torch-geometric==1.7.2
[pip3] torch-model-archiver==0.4.0
[pip3] torch-scatter==2.0.7
[pip3] torch-sparse==0.6.10
[pip3] torch-workflow-archiver==0.1.0
[pip3] torchaudio==0.8.0
[pip3] torchserve==0.4.0
[pip3] torchvision==0.9.0+cu111
[conda] blas                      1.0                         mkl  
[conda] efficientnet-pytorch      0.6.3                    pypi_0    pypi
[conda] facenet-pytorch           2.5.2                    pypi_0    pypi
[conda] mkl                       2021.3.0           h06a4308_520  
[conda] mkl-service               2.4.0            py37h7f8727e_0  
[conda] mkl_fft                   1.3.0            py37h42c9631_2  
[conda] mkl_random                1.2.2            py37h51133e4_0  
[conda] numpy                     1.19.5                   pypi_0    pypi
[conda] numpydoc                  1.1.0              pyhd3eb1b0_1  
[conda] pytorch-fid               0.1.1                    pypi_0    pypi
[conda] pytorch-ignite            0.4.7                    pypi_0    pypi
[conda] pytorch2keras             0.2.4                    pypi_0    pypi
[conda] pytorch3d                 0.5.0                    pypi_0    pypi
[conda] segmentation-models-pytorch 0.1.3                    pypi_0    pypi
[conda] torch                     1.8.0+cu111              pypi_0    pypi
[conda] torch-geometric           1.7.2                    pypi_0    pypi
[conda] torch-model-archiver      0.4.0                    pypi_0    pypi
[conda] torch-scatter             2.0.7                    pypi_0    pypi
[conda] torch-sparse              0.6.10                   pypi_0    pypi
[conda] torch-workflow-archiver   0.1.0                    pypi_0    pypi
[conda] torchaudio                0.8.0                    pypi_0    pypi
[conda] torchserve                0.4.0                    pypi_0    pypi
[conda] torchsul                  0.1.26                   pypi_0    pypi
[conda] torchvision               0.9.0+cu111              pypi_0    pypi

Could you please guide me on this?

@Zheng222
Owner

@LeftAttention

DMFN/train.py, line 17 in b6f2258:

torch.backends.cudnn.benchmark = True

You can try changing this to torch.backends.cudnn.benchmark = False.
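As a minimal sketch of the change suggested above, the flag can be set once at the top of the training script, before any model is built. The helper name `configure_cudnn` is hypothetical, and the optional `enabled=False` switch (which turns cuDNN off entirely so PyTorch falls back to its native kernels) is an extra isolation step assumed here, not something proposed in this thread.

```python
import torch


def configure_cudnn(benchmark: bool = False, enabled: bool = True) -> dict:
    """Set the cuDNN flags and return the values PyTorch now reports.

    `configure_cudnn` is a hypothetical helper name used for illustration.
    """
    # benchmark=False stops cuDNN from auto-tuning convolution algorithms.
    torch.backends.cudnn.benchmark = benchmark
    # enabled=False (assumption: optional isolation step) bypasses cuDNN.
    torch.backends.cudnn.enabled = enabled
    return {
        "benchmark": torch.backends.cudnn.benchmark,
        "enabled": torch.backends.cudnn.enabled,
    }


settings = configure_cudnn(benchmark=False)
print(settings)
```

If the error disappears only with `enabled=False`, that would point at the cuDNN code path rather than at the model itself.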

@LeftAttention
Author

LeftAttention commented Nov 20, 2021

I tried that, but the issue remains. The first batch runs fine, but the second batch throws this error during backpropagation. Initially I thought this might be caused by a different input format, but that is not the cause. On CPU it works. I have not been able to figure out the cause of this issue.
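One way to narrow down a failure that only appears on the second batch is sketched below. CUDA kernels run asynchronously, so an error can surface at a later call than the one that caused it; setting `CUDA_LAUNCH_BLOCKING=1` before the first CUDA call makes the traceback point at the failing operation. The `check_finite` helper is hypothetical, and checking the loss for NaN or Inf is an assumption about what might be worth ruling out, not a diagnosis of this issue.

```python
import os

# Must be set before the first CUDA call so kernel launches run synchronously.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch


def check_finite(name: str, tensor: torch.Tensor) -> torch.Tensor:
    """Raise ValueError if `tensor` contains NaN or Inf (hypothetical helper)."""
    if not bool(torch.isfinite(tensor).all()):
        raise ValueError(f"{name} contains NaN or Inf values")
    return tensor


# Stand-in for the real loss; in the training loop this would be l_g_total,
# checked right before l_g_total.backward().
device = "cuda" if torch.cuda.is_available() else "cpu"
loss = check_finite("l_g_total", torch.tensor(1.5, device=device))
print(loss.item())
```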

@Zheng222
Owner

Zheng222 commented Nov 20, 2021

@LeftAttention You can refer to my environment.

PyTorch version: 1.9.0+cu102
Clang version: Could not collect
CMake version: version 3.16.6
Libc version: glibc-2.17

Python version: 3.6 (64-bit runtime)
Python platform: Linux-4.15.0-142-generic-x86_64-with-Ubuntu-16.04-xenial
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti

Nvidia driver version: 440.33.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn.so.8.2.1
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.2.1
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.2.1
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.2.1
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.2.1
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.2.1
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.2.1
HIP runtime version: N/A
MIOpen runtime version: N/A
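Because `collect_env` only lists the cuDNN libraries it finds on disk ("Probably one of the following"), a short check like the sketch below can report the versions that PyTorch itself sees at runtime, which makes two environments easier to compare. The function name `runtime_versions` is hypothetical; the calls it wraps are standard PyTorch attributes.

```python
import torch


def runtime_versions() -> dict:
    """Collect the versions PyTorch reports at runtime (hypothetical helper)."""
    return {
        "torch": torch.__version__,
        # CUDA version PyTorch was built against (None for CPU-only builds).
        "cuda_build": torch.version.cuda,
        # cuDNN version as an integer, or None if cuDNN is unavailable.
        "cudnn": torch.backends.cudnn.version(),
        "cuda_available": torch.cuda.is_available(),
    }


info = runtime_versions()
for key, value in info.items():
    print(f"{key}: {value}")
```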

@LeftAttention
Author

Thanks. I will check.
