
dynamic node raises CUDA error with torch training codes #4944

Open
jhonsonlaid opened this issue May 10, 2022 · 2 comments
Labels
question Question on using Taichi

Comments

jhonsonlaid commented May 10, 2022

  • Environment:

    • [Taichi] version 1.0.0, llvm 10.0.0, commit 6a15da8, linux, python 3.8.5
    • [Torch] : 1.8.1+cu101
    • [GPU] : 1080Ti
  • Description:
    For issue #4937, I have to use a dynamic node, but it raises a CUDA error after some iterations.

[E 05/10/22 16:10:18.990 50390] [cuda_driver.h:operator()@87] CUDA Error CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered while calling stream_synchronize (cuStreamSynchronize)

On Windows with an RTX 2060 8GB, it raises a different error. (The code works when the dynamic node is replaced with a dense field, or when loss.backward() is removed.)

RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
  • Sample code:
import torch
import torch.nn.functional as F
import torchvision
import taichi as ti
import logging

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(message)s')

ti.init(arch=ti.cuda, device_memory_GB=2)
grp_res = ti.field(ti.i32)
_grp_res_pixel = ti.root.dynamic(
    ti.i, 32 * 1024)
_grp_res_pixel.place(grp_res)

# Dense-field alternative that avoids the crash:
# _grp_res_pixel = ti.root.dense(
#     ti.i, 32 * 1024)
# _grp_res_pixel.place(grp_res)

device = 'cuda'
model = torchvision.models.resnet18().to(device)
model.train()
optimizer = torch.optim.Adam(model.parameters())


@ti.kernel
def fake_deactivation():
    # Zero out the field; a manual stand-in for deactivate_all() when
    # grp_res is placed under a dense SNode instead of a dynamic one.
    for i in range(grp_res.shape[0]):
        grp_res[i] = 0


for i in range(100):
    _grp_res_pixel.deactivate_all()
    # fake_deactivation()

    x = torch.randn(32, 3, 224, 224).to(device)
    y = model(x)
    target = torch.randint(1000, (32,), dtype=torch.int64).to(device)
    loss = F.cross_entropy(y, target)

    model.zero_grad()
    loss.backward()

    torch.nn.utils.clip_grad_norm_(
        model.parameters(), 5)

    optimizer.step()
    logging.info(f'{i}: {loss.item()}')
@jhonsonlaid jhonsonlaid added the question Question on using Taichi label May 10, 2022
@taichi-ci-bot taichi-ci-bot moved this to Untriaged in Taichi Lang May 10, 2022
@jhonsonlaid jhonsonlaid changed the title CUDA Error, dynamic rnode, with torch training codes dynamic rnode raise CUDA error with torch training codes May 10, 2022
@jhonsonlaid jhonsonlaid changed the title dynamic rnode raise CUDA error with torch training codes dynamic node raises CUDA error with torch training codes May 10, 2022
k-ye (Member) commented May 13, 2022

Not sure if this is related to the shared CUDA context issue; see #2190.

keunhong commented May 3, 2024

Were you able to figure this out?

Projects
Status: Backlog
Development

No branches or pull requests

3 participants