
dynamic node raises CUDA error with torch training codes #4944

Open
jhonsonlaid opened this issue May 10, 2022 · 2 comments
Labels
question Question on using Taichi

Comments

jhonsonlaid commented May 10, 2022

  • Environment:

    • [Taichi] version 1.0.0, llvm 10.0.0, commit 6a15da8, linux, python 3.8.5
    • [Torch] : 1.8.1+cu101
    • [GPU] : 1080Ti
  • Description:
    For issue #4937, I have to use a dynamic node, but it raises a CUDA error after some iterations.

[E 05/10/22 16:10:18.990 50390] [cuda_driver.h:operator()@87] CUDA Error CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered while calling stream_synchronize (cuStreamSynchronize)

On Windows with an RTX 2060 8GB, it raises a different error. (The code works when the dynamic node is replaced with a dense field, or when loss.backward() is removed.)

RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
  • Sample code:
import torch
import torch.nn.functional as F
import torchvision
import taichi as ti
import logging

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(message)s')

ti.init(arch=ti.cuda, device_memory_GB=2)
grp_res = ti.field(ti.i32)
_grp_res_pixel = ti.root.dynamic(
    ti.i, 32 * 1024)
_grp_res_pixel.place(grp_res)

# Dense-field alternative that avoids the crash:
# _grp_res_pixel = ti.root.dense(
#     ti.i, 32 * 1024)
# _grp_res_pixel.place(grp_res)

device = 'cuda'
model = torchvision.models.resnet18().to(device)
model.train()
optimizer = torch.optim.Adam(model.parameters())


@ti.kernel
def fake_deactivation():
    # Zero out the field; a manual stand-in for deactivate_all() when
    # grp_res is placed under a dense SNode instead of a dynamic one.
    for i in range(grp_res.shape[0]):
        grp_res[i] = 0


for i in range(100):
    _grp_res_pixel.deactivate_all()
    # fake_deactivation()

    x = torch.randn(32, 3, 224, 224).to(device)
    y = model(x)
    target = torch.randint(1000, (32,), dtype=torch.int64).to(device)
    loss = F.cross_entropy(y, target)

    model.zero_grad()
    loss.backward()

    torch.nn.utils.clip_grad_norm_(
        model.parameters(), 5)

    optimizer.step()
    logging.info(f'{i}: {loss.item()}')
@jhonsonlaid jhonsonlaid added the question Question on using Taichi label May 10, 2022
@taichi-ci-bot taichi-ci-bot moved this to Untriaged in Taichi Lang May 10, 2022
@jhonsonlaid jhonsonlaid changed the title CUDA Error, dynamic rnode, with torch training codes dynamic rnode raise CUDA error with torch training codes May 10, 2022
@jhonsonlaid jhonsonlaid changed the title dynamic rnode raise CUDA error with torch training codes dynamic node raises CUDA error with torch training codes May 10, 2022
k-ye (Member) commented May 13, 2022

Not sure if this is related to the shared CUDA context issue; see #2190.

keunhong commented May 3, 2024

Were you able to figure this out?

Projects
Status: Backlog
Development

No branches or pull requests

3 participants