Torch 1.12.0 and Taichi cannot use CUDA at the same time. #5502

Closed
xuan-li opened this issue Jul 24, 2022 · 6 comments · Fixed by #5891
Labels
potential bug Something that looks like a bug but not yet confirmed

Comments

xuan-li commented Jul 24, 2022

Describe the bug
If Taichi is initialized with the GPU backend, Torch cannot run its backward pass.

PyTorch version: 1.12.0

To Reproduce

import taichi as ti
import torch

device = torch.device("cuda:0")
ti.init(arch=ti.gpu)

x = torch.tensor([1.], requires_grad=True, device=device)
loss = x ** 2
loss.backward()

Log/Screenshots

[Taichi] version 1.0.4, llvm 10.0.0, commit 2827db2c, linux, python 3.9.12
[Taichi] Starting on arch=cuda
Traceback (most recent call last):
  File "******", line 10, in <module>
    loss.backward()
  File "******/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "******/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Event device type CUDA does not match blocking stream's device type CPU.

Additional comments
ti diagnose:

[Taichi] version 1.0.4, llvm 10.0.0, commit 2827db2c, linux, python 3.9.12

*******************************************
**      Taichi Programming Language      **
*******************************************

Docs:   https://docs.taichi-lang.org/
GitHub: https://github.com/taichi-dev/taichi/
Forum:  https://forum.taichi.graphics/

Taichi system diagnose:

python: 3.9.12 (main, Jun  1 2022, 11:38:51) 
[GCC 7.5.0]
system: linux
executable: /home/xuan/miniconda3/envs/dl/bin/python
platform: Linux-5.13.0-51-generic-x86_64-with-glibc2.31
architecture: 64bit ELF
uname: uname_result(system='Linux', node='Wanzi', release='5.13.0-51-generic', version='#58~20.04.1-Ubuntu SMP Tue Jun 14 11:29:12 UTC 2022', machine='x86_64')
locale: en_US.UTF-8
PATH: /home/xuan/miniconda3/envs/dl/bin:/home/xuan/miniconda3/condabin:/snap/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
PYTHONPATH: ['/home/xuan/miniconda3/envs/dl/bin', '/home/xuan/miniconda3/envs/dl/lib/python39.zip', '/home/xuan/miniconda3/envs/dl/lib/python3.9', '/home/xuan/miniconda3/envs/dl/lib/python3.9/lib-dynload', '/home/xuan/miniconda3/envs/dl/lib/python3.9/site-packages']

No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.4 LTS
Release:	20.04
Codename:	focal



import: <module 'taichi' from '/home/xuan/miniconda3/envs/dl/lib/python3.9/site-packages/taichi/__init__.py'>

cc: False
cpu: True
metal: False
opengl: True
cuda: True
MESA-INTEL: warning: Performance support disabled, consider sysctl dev.i915.perf_stream_paranoid=0

vulkan: True

`glewinfo` not available: [Errno 2] No such file or directory: 'glewinfo'

Sat Jul 23 20:38:44 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 20%   37C    P0    N/A /  75W |    241MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1115      G   /usr/lib/xorg/Xorg                 35MiB |
|    0   N/A  N/A      1711      G   /usr/lib/xorg/Xorg                120MiB |
|    0   N/A  N/A      1847      G   /usr/bin/gnome-shell                9MiB |
|    0   N/A  N/A      2284      G   ...AAAAAAAAA= --shared-files        6MiB |
+-----------------------------------------------------------------------------+

[Taichi] version 1.0.4, llvm 10.0.0, commit 2827db2c, linux, python 3.9.12

[Taichi] version 1.0.4, llvm 10.0.0, commit 2827db2c, linux, python 3.9.12
[Taichi] Starting on arch=x64

[Taichi] version 1.0.4, llvm 10.0.0, commit 2827db2c, linux, python 3.9.12
[Taichi] Starting on arch=opengl

[Taichi] version 1.0.4, llvm 10.0.0, commit 2827db2c, linux, python 3.9.12
[Taichi] Starting on arch=cuda

[Taichi] version 1.0.4, llvm 10.0.0, commit 2827db2c, linux, python 3.9.12

*******************************************
**      Taichi Programming Language      **
*******************************************

Docs:   https://docs.taichi-lang.org/
GitHub: https://github.com/taichi-dev/taichi/
Forum:  https://forum.taichi.graphics/

                                 TAICHI EXAMPLES                                  
 ──────────────────────────────────────────────────────────────────────────────── 
  0: ad_gravity               21: keyboard                42: odop_solar          
  1: comet                    22: laplace                 43: patterns            
  2: cornell_box              23: mandelbrot_zoom         44: pbf2d               
  3: diff_sph                 24: marching_squares        45: physarum            
  4: euler                    25: mass_spring_3d_ggui     46: print_offset        
  5: explicit_activation      26: mass_spring_game        47: rasterizer          
  6: export_mesh              27: mass_spring_game_ggui   48: regression          
  7: export_ply               28: mciso_advanced          49: sdf_renderer        
  8: export_videos            29: mgpcg                   50: simple_derivative   
  9: fem128                   30: mgpcg_advanced          51: simple_texture      
  10: fem128_ggui             31: minimal                 52: simple_uv           
  11: fem99                   32: minimization            53: stable_fluid        
  12: fractal                 33: mpm128                  54: stable_fluid_ggui   
  13: fractal3d_ggui          34: mpm128_ggui             55: stable_fluid_graph  
  14: fullscreen              35: mpm3d                   56: taichi_bitmasked    
  15: game_of_life            36: mpm3d_ggui              57: taichi_dynamic      
  16: gui_image_io            37: mpm88                   58: taichi_logo         
  17: gui_widgets             38: mpm88_graph             59: taichi_sparse       
  18: implicit_fem            39: mpm99                   60: tutorial            
  19: implicit_mass_spring    40: mpm_lagrangian_forces   61: vortex_rings        
  20: initial_value_problem   41: nbody                   62: waterwave           
 ──────────────────────────────────────────────────────────────────────────────── 
Running example minimal ...
[Taichi] Starting on arch=x64
42.0
>>> Running time: 0.16s
42

Consider attaching this log when maintainers ask about system information.
>>> Running time: 9.21s
@xuan-li xuan-li added the potential bug Something that looks like a bug but not yet confirmed label Jul 24, 2022
@taichi-gardener taichi-gardener moved this to Untriaged in Taichi Lang Jul 24, 2022
xuan-li (Author) commented Jul 24, 2022

Taichi can work with PyTorch 1.10.0.

@xuan-li xuan-li changed the title Torch and Taichi cannot use CUDA at the same time. Torch 1.12.0 and Taichi cannot use CUDA at the same time. Jul 24, 2022
lin-hitonami (Contributor) commented:

I reproduced the error too. @erizmr, can you look into this?

erizmr (Contributor) commented Jul 25, 2022

I am looking into it.

k-ye (Member) commented Jul 25, 2022

FYI: #2190 and #4944

@lin-hitonami lin-hitonami moved this from Untriaged to Todo in Taichi Lang Jul 29, 2022
@qiao-bo qiao-bo moved this from Todo to In Progress in Taichi Lang Aug 12, 2022
pableeto commented:
Hi, I have also run into the same error.
I've tried different PyTorch versions: it seems 1.11 and 1.12 have this issue, while 1.10 does not.

turbo0628 (Member) commented Aug 26, 2022

I did some inspection, inspired by this PyTorch issue.
First, pip install cuda-python.

Full code with CUDA driver helpers:

import torch
import taichi as ti
from cuda import cuda, cudart

def ASSERT_DRV(err):
    """
    This is a helper function to turn CUDA messages into errors when
    appropriate, since by default the CUDA package doesn't raise
    Python errors, it returns error messages
    """
    if isinstance(err, cuda.CUresult):
        if err != cuda.CUresult.CUDA_SUCCESS:
            raise RuntimeError("Cuda Error: {}".format(err))
    elif isinstance(err, cudart.cudaError_t):
        if err != cudart.cudaError_t.cudaSuccess:
            raise RuntimeError("Cudart Error: {}".format(err))
    else:
        raise RuntimeError("Unknown error type: {}".format(err))


def print_existing_contexts():
    valid_contexts = []
    while True:
        err, cuda_context = cuda.cuCtxPopCurrent()
        try:
            ASSERT_DRV(err)
        except RuntimeError:
            break
        else:
            valid_contexts.append(cuda_context)

    print("Existing, valid contexts: ", valid_contexts)

    for curr_ctx in reversed(valid_contexts):
        err, = cuda.cuCtxPushCurrent(curr_ctx)
        ASSERT_DRV(err)
        
device = torch.device("cuda:0")
print("===AFTER TORCH DEVICE INIT===")
print_existing_contexts()
print(torch._C._cuda_hasPrimaryContext(0))
x = torch.tensor([1.], requires_grad=True, device=device)
print("===AFTER TORCH TENSOR INIT===")
print_existing_contexts()
ti.init(arch=ti.gpu, log_level=ti.TRACE)
print("===AFTER TI INIT===")
print_existing_contexts()
print("Torch has primary context", torch._C._cuda_hasPrimaryContext(0))
loss = x**2
loss.backward()
print(torch._C._cuda_hasPrimaryContext(0))

This ordering (creating the Torch tensor before ti.init) works. What we see from the log:
[screenshot: trace log showing the existing CUDA contexts]

Taichi ignored the PyTorch CUDA context and created its own.
If we change the initialization order:

ti.init(arch=ti.gpu, log_level=ti.TRACE)
print("===AFTER TI INIT===")
print_existing_contexts()
device = torch.device("cuda:0")
print("===AFTER TORCH DEVICE INIT===")
print_existing_contexts()
print(torch._C._cuda_hasPrimaryContext(0))
x = torch.tensor([1.], requires_grad=True, device=device)
print("===AFTER TORCH TENSOR INIT===")
print_existing_contexts()
# print("TAICHI "ti._lib.core.get_primary_ctx_state())
print("Torch has primary context", torch._C._cuda_hasPrimaryContext(0))
loss = x**2
loss.backward()
print(torch._C._cuda_hasPrimaryContext(0))

We encounter the error.
[screenshot: trace log showing the error]

Torch just fetches the CUcontext created by Taichi, and that CUDA context is not synced.

That said, to work with PyTorch, we should pop Taichi's CUDA context at the end of ti.init; PyTorch then creates its own primary context. When Taichi needs the CUDA context in a subsequent execution, it always sets the current context to its own ctx pointer, so popping it out is fine for Taichi.
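Below is a rough user-side sketch of that idea, using the same cuda-python bindings as above: pop whatever context ti.init() leaves current before Torch touches the GPU. This only illustrates the proposed behavior, not the actual fix (which belongs inside Taichi's runtime); the popped handle ti_ctx is kept here just to show what was removed.

import taichi as ti
import torch
from cuda import cuda

ti.init(arch=ti.gpu)

# Pop the context that ti.init() left current on this thread, so that
# PyTorch falls back to (or lazily creates) its own primary context.
err, ti_ctx = cuda.cuCtxPopCurrent()
assert err == cuda.CUresult.CUDA_SUCCESS

device = torch.device("cuda:0")
x = torch.tensor([1.], requires_grad=True, device=device)
loss = x ** 2
loss.backward()  # autograd now runs against PyTorch's own context
print(x.grad)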
