Torch 1.12.0 and Taichi cannot use CUDA at the same time. #5502

Closed
xuan-li opened this issue Jul 24, 2022 · 6 comments · Fixed by #5891
Labels
potential bug Something that looks like a bug but not yet confirmed

Comments

xuan-li commented Jul 24, 2022

Describe the bug
If Taichi is initialized with the GPU backend, Torch cannot run its backward pass.

PyTorch version: 1.12.0

To Reproduce

import taichi as ti
import torch

device = torch.device("cuda:0")
ti.init(arch=ti.gpu)

x = torch.tensor([1.], requires_grad=True, device=device)
loss = x ** 2
loss.backward()

Log/Screenshots

[Taichi] version 1.0.4, llvm 10.0.0, commit 2827db2c, linux, python 3.9.12
[Taichi] Starting on arch=cuda
Traceback (most recent call last):
  File "******", line 10, in <module>
    loss.backward()
  File "******/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "******/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Event device type CUDA does not match blocking stream's device type CPU.

Additional comments
ti diagnose:

[Taichi] version 1.0.4, llvm 10.0.0, commit 2827db2c, linux, python 3.9.12

*******************************************
**      Taichi Programming Language      **
*******************************************

Docs:   https://docs.taichi-lang.org/
GitHub: https://github.com/taichi-dev/taichi/
Forum:  https://forum.taichi.graphics/

Taichi system diagnose:

python: 3.9.12 (main, Jun  1 2022, 11:38:51) 
[GCC 7.5.0]
system: linux
executable: /home/xuan/miniconda3/envs/dl/bin/python
platform: Linux-5.13.0-51-generic-x86_64-with-glibc2.31
architecture: 64bit ELF
uname: uname_result(system='Linux', node='Wanzi', release='5.13.0-51-generic', version='#58~20.04.1-Ubuntu SMP Tue Jun 14 11:29:12 UTC 2022', machine='x86_64')
locale: en_US.UTF-8
PATH: /home/xuan/miniconda3/envs/dl/bin:/home/xuan/miniconda3/condabin:/snap/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
PYTHONPATH: ['/home/xuan/miniconda3/envs/dl/bin', '/home/xuan/miniconda3/envs/dl/lib/python39.zip', '/home/xuan/miniconda3/envs/dl/lib/python3.9', '/home/xuan/miniconda3/envs/dl/lib/python3.9/lib-dynload', '/home/xuan/miniconda3/envs/dl/lib/python3.9/site-packages']

No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.4 LTS
Release:	20.04
Codename:	focal



import: <module 'taichi' from '/home/xuan/miniconda3/envs/dl/lib/python3.9/site-packages/taichi/__init__.py'>

cc: False
cpu: True
metal: False
opengl: True
cuda: True
MESA-INTEL: warning: Performance support disabled, consider sysctl dev.i915.perf_stream_paranoid=0

vulkan: True

`glewinfo` not available: [Errno 2] No such file or directory: 'glewinfo'

Sat Jul 23 20:38:44 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 20%   37C    P0    N/A /  75W |    241MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1115      G   /usr/lib/xorg/Xorg                 35MiB |
|    0   N/A  N/A      1711      G   /usr/lib/xorg/Xorg                120MiB |
|    0   N/A  N/A      1847      G   /usr/bin/gnome-shell                9MiB |
|    0   N/A  N/A      2284      G   ...AAAAAAAAA= --shared-files        6MiB |
+-----------------------------------------------------------------------------+

[Taichi] version 1.0.4, llvm 10.0.0, commit 2827db2c, linux, python 3.9.12

[Taichi] version 1.0.4, llvm 10.0.0, commit 2827db2c, linux, python 3.9.12
[Taichi] Starting on arch=x64

[Taichi] version 1.0.4, llvm 10.0.0, commit 2827db2c, linux, python 3.9.12
[Taichi] Starting on arch=opengl

[Taichi] version 1.0.4, llvm 10.0.0, commit 2827db2c, linux, python 3.9.12
[Taichi] Starting on arch=cuda

[Taichi] version 1.0.4, llvm 10.0.0, commit 2827db2c, linux, python 3.9.12

*******************************************
**      Taichi Programming Language      **
*******************************************

Docs:   https://docs.taichi-lang.org/
GitHub: https://github.com/taichi-dev/taichi/
Forum:  https://forum.taichi.graphics/

                                 TAICHI EXAMPLES                                  
 ──────────────────────────────────────────────────────────────────────────────── 
  0: ad_gravity               21: keyboard                42: odop_solar          
  1: comet                    22: laplace                 43: patterns            
  2: cornell_box              23: mandelbrot_zoom         44: pbf2d               
  3: diff_sph                 24: marching_squares        45: physarum            
  4: euler                    25: mass_spring_3d_ggui     46: print_offset        
  5: explicit_activation      26: mass_spring_game        47: rasterizer          
  6: export_mesh              27: mass_spring_game_ggui   48: regression          
  7: export_ply               28: mciso_advanced          49: sdf_renderer        
  8: export_videos            29: mgpcg                   50: simple_derivative   
  9: fem128                   30: mgpcg_advanced          51: simple_texture      
  10: fem128_ggui             31: minimal                 52: simple_uv           
  11: fem99                   32: minimization            53: stable_fluid        
  12: fractal                 33: mpm128                  54: stable_fluid_ggui   
  13: fractal3d_ggui          34: mpm128_ggui             55: stable_fluid_graph  
  14: fullscreen              35: mpm3d                   56: taichi_bitmasked    
  15: game_of_life            36: mpm3d_ggui              57: taichi_dynamic      
  16: gui_image_io            37: mpm88                   58: taichi_logo         
  17: gui_widgets             38: mpm88_graph             59: taichi_sparse       
  18: implicit_fem            39: mpm99                   60: tutorial            
  19: implicit_mass_spring    40: mpm_lagrangian_forces   61: vortex_rings        
  20: initial_value_problem   41: nbody                   62: waterwave           
 ──────────────────────────────────────────────────────────────────────────────── 
Running example minimal ...
[Taichi] Starting on arch=x64
42.0
>>> Running time: 0.16s
42

Consider attaching this log when maintainers ask about system information.
>>> Running time: 9.21s
@xuan-li xuan-li added the potential bug Something that looks like a bug but not yet confirmed label Jul 24, 2022
@taichi-gardener taichi-gardener moved this to Untriaged in Taichi Lang Jul 24, 2022
xuan-li (Author) commented Jul 24, 2022

Taichi can work with PyTorch 1.10.0.

@xuan-li xuan-li changed the title Torch and Taichi cannot use CUDA at the same time. Torch 1.12.0 and Taichi cannot use CUDA at the same time. Jul 24, 2022
lin-hitonami (Contributor) commented:

I reproduced the error too. @erizmr, can you look into this?

erizmr (Contributor) commented Jul 25, 2022

I am looking into it.

k-ye (Member) commented Jul 25, 2022

FYI: #2190 and #4944

@lin-hitonami lin-hitonami moved this from Untriaged to Todo in Taichi Lang Jul 29, 2022
@qiao-bo qiao-bo moved this from Todo to In Progress in Taichi Lang Aug 12, 2022
pableeto commented:
Hi, I have also run into the same error.
I've tried different PyTorch versions: it seems 1.11 and 1.12 have this issue, while 1.10 does not.

turbo0628 (Member) commented Aug 26, 2022

I did some inspection, inspired by this PyTorch issue.
First, pip install cuda-python.

Full code with CUDA driver helpers:

import torch
import taichi as ti
from cuda import cuda, cudart

def ASSERT_DRV(err):
    """
    This is a helper function to turn CUDA messages into errors when
    appropriate, since by default the CUDA package doesn't raise
    Python errors, it returns error messages
    """
    if isinstance(err, cuda.CUresult):
        if err != cuda.CUresult.CUDA_SUCCESS:
            raise RuntimeError("Cuda Error: {}".format(err))
    elif isinstance(err, cudart.cudaError_t):
        if err != cudart.cudaError_t.cudaSuccess:
            raise RuntimeError("Cudart Error: {}".format(err))
    else:
        raise RuntimeError("Unknown error type: {}".format(err))


def print_existing_contexts():
    valid_contexts = []
    while True:
        err, cuda_context = cuda.cuCtxPopCurrent()
        try:
            ASSERT_DRV(err)
        except RuntimeError:
            break
        else:
            valid_contexts.append(cuda_context)

    print("Existing, valid contexts: ", valid_contexts)

    for curr_ctx in reversed(valid_contexts):
        err, = cuda.cuCtxPushCurrent(curr_ctx)
        ASSERT_DRV(err)
        
device = torch.device("cuda:0")
print("===AFTER TORCH DEVICE INIT===")
print_existing_contexts()
print(torch._C._cuda_hasPrimaryContext(0))
x = torch.tensor([1.], requires_grad=True, device=device)
print("===AFTER TORCH TENSOR INIT===")
print_existing_contexts()
ti.init(arch=ti.gpu, log_level=ti.TRACE)
print("===AFTER TI INIT===")
print_existing_contexts()
print("Torch has primary context", torch._C._cuda_hasPrimaryContext(0))
loss = x**2
loss.backward()
print(torch._C._cuda_hasPrimaryContext(0))

This ordering (creating the Torch tensor before ti.init) works. What we see from the log:
[screenshot: trace log showing the existing CUDA contexts]

Taichi ignored the PyTorch CUDA context and created its own.
If we change the initialization order:

ti.init(arch=ti.gpu, log_level=ti.TRACE)
print("===AFTER TI INIT===")
print_existing_contexts()
device = torch.device("cuda:0")
print("===AFTER TORCH DEVICE INIT===")
print_existing_contexts()
print(torch._C._cuda_hasPrimaryContext(0))
x = torch.tensor([1.], requires_grad=True, device=device)
print("===AFTER TORCH TENSOR INIT===")
print_existing_contexts()
# print("TAICHI "ti._lib.core.get_primary_ctx_state())
print("Torch has primary context", torch._C._cuda_hasPrimaryContext(0))
loss = x**2
loss.backward()
print(torch._C._cuda_hasPrimaryContext(0))

We encounter the error.
[screenshot: trace log showing the error]

Torch just fetches the CUcontext created by Taichi, and that CUDA context is not synced.

That said, to work with PyTorch, we should pop Taichi's CUDA context at the end of ti.init; PyTorch then creates its own primary context. When Taichi needs the CUDA context in a subsequent execution, it always sets the current context to its own ctx pointer, so popping it out is fine for Taichi.
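Below is a rough user-side sketch of that idea, using the same cuda-python bindings as above: pop whatever context ti.init() leaves current before Torch touches the GPU. This only illustrates the proposed behavior, not the actual fix (which belongs inside Taichi's runtime); the popped handle ti_ctx is kept here just to show what was removed.

import taichi as ti
import torch
from cuda import cuda

ti.init(arch=ti.gpu)

# Pop the context that ti.init() left current on this thread, so that
# PyTorch falls back to (or lazily creates) its own primary context.
err, ti_ctx = cuda.cuCtxPopCurrent()
assert err == cuda.CUresult.CUDA_SUCCESS

device = torch.device("cuda:0")
x = torch.tensor([1.], requires_grad=True, device=device)
loss = x ** 2
loss.backward()  # autograd now runs against PyTorch's own context
print(x.grad)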
