CUDA kernels should not make const copies of global and closure variables #9084

Open
gmarkall opened this issue Jul 21, 2023 · 9 comments · May be fixed by NVIDIA/numba-cuda#15

@gmarkall
Member

Noted whilst investigating https://numba.discourse.group/t/avoid-multiple-copies-of-large-numpy-array-in-closure/2017

In many cases Numba will copy global and closure variables as constant arrays inside jitted functions and kernels (and it will always attempt to do this for kernels). This is a problem because constant memory is limited to 64 KiB in CUDA. The following simple example:

from numba import cuda
import numpy as np

N = 100000
closed_array = np.ones(N)

@cuda.jit(cache=True)
def kernel(r, x):
    r[0] = closed_array[x]

r = np.zeros(1)

kernel[1, 1](r, 2)
print(r[0])

will fail at link time, because the 100,000 float64 elements (800,000 bytes, i.e. 0xc3500) far exceed the 0x10000-byte (64 KiB) constant memory limit:

numba.cuda.cudadrv.driver.LinkerError: [218] Call to cuLinkAddData results in CUDA_ERROR_INVALID_PTX
ptxas error   : File uses too much global constant data (0xc3500 bytes, 0x10000 max)

Since we already have a mechanism for users to specify when a const array should be created, constant arrays should never be created implicitly in CUDA kernels; they should always be opt-in. Making this change is not a breaking change, because Numba makes no guarantee about whether a copy is made or not: https://numba.readthedocs.io/en/stable/reference/pysemantics.html#global-and-closure-variables

Numba may or may not copy global variables referenced inside a compiled function.
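
For comparison, the existing opt-in mechanism is cuda.const.array_like(). A minimal sketch of its explicit use with a small array that fits comfortably within the 64 KiB constant memory limit (the array name and contents here are illustrative):

from numba import cuda
import numpy as np

SMALL = np.arange(16, dtype=np.float64)

@cuda.jit
def const_kernel(r, x):
    # Explicitly opt in to a constant-memory copy of the (small) host array.
    c = cuda.const.array_like(SMALL)
    r[0] = c[x]

r = np.zeros(1)
const_kernel[1, 1](r, 2)
print(r[0])  # 2.0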

The following change provides a proof-of-concept that demonstrates creating references to arrays instead of const copies:

diff --git a/numba/core/base.py b/numba/core/base.py
index 1c0a5ee5a..0728a731e 100644
--- a/numba/core/base.py
+++ b/numba/core/base.py
@@ -1086,7 +1086,7 @@ class BaseContext(object):
         llvoidptr = self.get_value_type(types.voidptr)
         addr = self.get_constant(types.uintp, intaddr).inttoptr(llvoidptr)
         # Use a unique name by embedding the address value
-        symname = 'numba.dynamic.globals.{:x}'.format(intaddr)
+        symname = 'numba_dynamic_globals_{:x}'.format(intaddr)
         gv = cgutils.add_global_variable(mod, llvoidptr, symname)
         # Use linkonce linkage to allow merging with other GV of the same name.
         # And, avoid optimization from assuming its value.
diff --git a/numba/cuda/target.py b/numba/cuda/target.py
index 492f375f6..7b44da626 100644
--- a/numba/cuda/target.py
+++ b/numba/cuda/target.py
@@ -69,6 +69,7 @@ VALID_CHARS = re.compile(r'[^a-z0-9]', re.I)
 class CUDATargetContext(BaseContext):
     implement_powi_as_math_call = True
     strict_alignment = True
+    allow_dynamic_globals = True
 
     def __init__(self, typingctx, target='cuda'):
         super().__init__(typingctx, target)
@@ -292,43 +293,20 @@ class CUDATargetContext(BaseContext):
         Unlike the parent version.  This returns a a pointer in the constant
         addrspace.
         """
-
-        lmod = builder.module
-
-        constvals = [
-            self.get_constant(types.byte, i)
-            for i in iter(arr.tobytes(order='A'))
-        ]
-        constaryty = ir.ArrayType(ir.IntType(8), len(constvals))
-        constary = ir.Constant(constaryty, constvals)
-
-        addrspace = nvvm.ADDRSPACE_CONSTANT
-        gv = cgutils.add_global_variable(lmod, constary.type, "_cudapy_cmem",
-                                         addrspace=addrspace)
-        gv.linkage = 'internal'
-        gv.global_constant = True
-        gv.initializer = constary
-
-        # Preserve the underlying alignment
-        lldtype = self.get_data_type(aryty.dtype)
-        align = self.get_abi_sizeof(lldtype)
-        gv.align = 2 ** (align - 1).bit_length()
-
-        # Convert to generic address-space
-        ptrty = ir.PointerType(ir.IntType(8))
-        genptr = builder.addrspacecast(gv, ptrty, 'generic')
-
         # Create array object
-        ary = self.make_array(aryty)(self, builder)
+        dataptr = arr.device_ctypes_pointer.value
+        data = self.add_dynamic_addr(builder, dataptr, info=str(type(dataptr)))
+        rt_addr = self.add_dynamic_addr(builder, id(arr), info=str(type(arr)))
         kshape = [self.get_constant(types.intp, s) for s in arr.shape]
         kstrides = [self.get_constant(types.intp, s) for s in arr.strides]
-        self.populate_array(ary, data=builder.bitcast(genptr, ary.data.type),
+        cary = self.make_array(aryty)(self, builder)
+        self.populate_array(cary, data=builder.bitcast(data, cary.data.type),
                             shape=kshape,
                             strides=kstrides,
-                            itemsize=ary.itemsize, parent=ary.parent,
+                            itemsize=arr.dtype.itemsize, parent=rt_addr,
                             meminfo=None)
 
-        return ary._getvalue()
+        return cary._getvalue()
 
     def insert_const_string(self, mod, string):
         """

This does require the example to be modified so that the data is already on the device:

from numba import cuda
import numpy as np

N = 100000
closed_array = cuda.to_device(np.ones(N))

@cuda.jit(cache=True)
def kernel(r, x):
    r[0] = closed_array[x]

r = np.zeros(1)

kernel[1, 1](r, 2)
print(r[0])

The example now works as expected:

$ python repro.py 
1.0

To complete the implementation, the following considerations need to be addressed:

  • Caching needs to be disabled for kernels that reference globals and closure variables. The docstring for BaseContext.add_dynamic_addr() suggests that adding a dynamic address disables caching, but this does not appear to be the case for CUDA, at least. (A rough sketch of such a guard follows this list.)
  • Copying data to / from the device at launch time for global and closure variables that live on the host needs to be considered.
  • The CUDA documentation should clarify that const copies are never made automatically, and point to the const array constructor (cuda.const.array_like()) for explicit construction.
  • Tests need to be added for both the global and the closure variable cases.
  • Any other memory management considerations as required (e.g. behaviour across multiple devices, avoiding leaks when making implicit copies from the host, etc.)
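
A rough sketch of the caching guard mentioned above (hypothetical, not the actual implementation; it assumes the CUDA code library exposes the same has_dynamic_globals flag that the CPU caching path already consults):

def check_cachable(codelib, kernel_name):
    # add_dynamic_addr() embeds runtime pointers into the module, so a kernel
    # built that way cannot safely be reloaded from a cache in a new process.
    if getattr(codelib, 'has_dynamic_globals', False):
        raise RuntimeError(
            f"cannot cache kernel {kernel_name!r}: it references the runtime "
            "addresses of global or closure arrays"
        )
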
@gmarkall added the CUDA and bug - incorrect behavior labels on Jul 21, 2023
@gmarkall
Member Author

Noted also in the triage discussion - we will need to hold a reference to referenced globals and closure variables to ensure their lifetime exceeds that of the dispatcher.
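
A hypothetical sketch of what holding those references could look like (the names are illustrative, not Numba API):

class _ReferencedGlobals:
    """Pins arrays whose device pointers were embedded in generated code."""

    def __init__(self):
        self._pinned = []  # strong references, owned by the dispatcher

    def pin(self, arr):
        # Keeping a reference ensures the device allocation outlives any
        # kernel that has its address baked in.
        self._pinned.append(arr)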

@gmarkall added this to the 0.59.0-rc1 milestone on Jul 25, 2023
@gmarkall self-assigned this on Jul 25, 2023
@yxqd

yxqd commented Jul 27, 2023

@gmarkall I would like to report one error I encountered. I patched my local numba installation with your patch and it works well for my current use case. I then tried to use your test script above as a quick check of Numba's treatment of global/closure arrays. The first time I ran the script it succeeded, but all later runs throw an error like this:

Traceback (most recent call last):                                                                                                                                                                                  
  File "/dv/neutron/mcvine/acc/tests/./test_array_in_closure.py", line 17, in <module>                                                                                                                 
    kernel[1, 1](r, 2)                                                                                                                                                                                              
  File "/miniconda3/envs/dev-mcvine-acc-numba_0.57.1/lib/python3.9/site-packages/numba/cuda/dispatcher.py", line 541, in __call__                                                                      
    return self.dispatcher.call(args, self.griddim, self.blockdim,                                                                                                                                                  
  File "/miniconda3/envs/dev-mcvine-acc-numba_0.57.1/lib/python3.9/site-packages/numba/cuda/dispatcher.py", line 677, in call                                                                          
    kernel.launch(args, griddim, blockdim, stream, sharedmem)                                                                                                                                                       
  File "/miniconda3/envs/dev-mcvine-acc-numba_0.57.1/lib/python3.9/site-packages/numba/cuda/dispatcher.py", line 373, in launch                                                                        
    wb()                                                                                                                                                                                                            
  File "/miniconda3/envs/dev-mcvine-acc-numba_0.57.1/lib/python3.9/site-packages/numba/cuda/args.py", line 62, in <lambda>                                                                             
    retr.append(lambda: devary.copy_to_host(self.value, stream=stream))                                                                                                                                             
  File "/miniconda3/envs/dev-mcvine-acc-numba_0.57.1/lib/python3.9/site-packages/numba/cuda/cudadrv/devices.py", line 232, in _require_cuda_context                                                    
    return fn(*args, **kws)                                                                                                                                                                                         
  File "/miniconda3/envs/dev-mcvine-acc-numba_0.57.1/lib/python3.9/site-packages/numba/cuda/cudadrv/devicearray.py", line 277, in copy_to_host                                                         
    _driver.device_to_host(hostary, self, self.alloc_size,                                                                                                                                                          
  File "/miniconda3/envs/dev-mcvine-acc-numba_0.57.1/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 3247, in device_to_host                                                           
    fn(host_pointer(dst), device_pointer(src), size, *varargs)                                                                                                                                                      
  File "/miniconda3/envs/dev-mcvine-acc-numba_0.57.1/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 331, in safe_cuda_api_call                                                        
    self._check_ctypes_error(fname, retcode)                                                                                                                                                                        
  File "/miniconda3/envs/dev-mcvine-acc-numba_0.57.1/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 399, in _check_ctypes_error                                                       
    raise CudaAPIError(retcode, msg)                                                                                                                                                                                
numba.cuda.cudadrv.driver.CudaAPIError: [700] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR   

Does this have something to do with the lifetime management you mentioned above?

@gmarkall
Member Author

Are you caching the kernel?

@yxqd

yxqd commented Jul 28, 2023

Got it. Thanks! Turning caching off solves the problem.
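
(Concretely, that is the modified repro above with caching left off:)

from numba import cuda
import numpy as np

N = 100000
closed_array = cuda.to_device(np.ones(N))

@cuda.jit  # no cache=True: the kernel embeds the runtime address of closed_array
def kernel(r, x):
    r[0] = closed_array[x]

r = np.zeros(1)
kernel[1, 1](r, 2)
print(r[0])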

@gmarkall
Member Author

No problem. A final patch implementing this would raise an error when attempting to cache a kernel that makes references to globals and/or closure variables.

@yxqd

yxqd commented Sep 14, 2023

@gmarkall I wonder if there is a timeline for when an official patch for this will be released? Thanks!

@gmarkall
Member Author

@yxqd This is in the 0.59 milestone, so it should be part of that release.

@gmarkall
Member Author

0.59 seems to be aimed for 2023-12: #8971

@yxqd

yxqd commented Sep 14, 2023

Got it. Thanks!
