CUDA kernels should not make const copies of global and closure variables #9084

Open
gmarkall opened this issue Jul 21, 2023 · 9 comments · May be fixed by NVIDIA/numba-cuda#15

@gmarkall
Member

Noted whilst investigating https://numba.discourse.group/t/avoid-multiple-copies-of-large-numpy-array-in-closure/2017

In many cases Numba will copy global and closure variables as constant arrays inside jitted functions and kernels (and it will always attempt to do this for kernels). This is a problem because constant memory is limited to 64 KiB in CUDA. The following simple example:

from numba import cuda
import numpy as np

N = 100000
closed_array = np.ones(N)

@cuda.jit(cache=True)
def kernel(r, x):
    r[0] = closed_array[x]

r = np.zeros(1)

kernel[1, 1](r, 2)
print(r[0])

will fail at link time, because the 100,000 float64 elements (800,000 bytes, i.e. 0xc3500) far exceed the 0x10000-byte (64 KiB) constant memory limit:

numba.cuda.cudadrv.driver.LinkerError: [218] Call to cuLinkAddData results in CUDA_ERROR_INVALID_PTX
ptxas error   : File uses too much global constant data (0xc3500 bytes, 0x10000 max)

Since we already have a mechanism for users to specify when a const array should be created, constant arrays should never be created implicitly in CUDA kernels; they should always be opt-in. Making this change is not a breaking change, because Numba makes no guarantee about whether a copy is made or not: https://numba.readthedocs.io/en/stable/reference/pysemantics.html#global-and-closure-variables

Numba may or may not copy global variables referenced inside a compiled function.
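
For comparison, the existing opt-in mechanism is cuda.const.array_like(). A minimal sketch of its explicit use with a small array that fits comfortably within the 64 KiB constant memory limit (the array name and contents here are illustrative):

from numba import cuda
import numpy as np

SMALL = np.arange(16, dtype=np.float64)

@cuda.jit
def const_kernel(r, x):
    # Explicitly opt in to a constant-memory copy of the (small) host array.
    c = cuda.const.array_like(SMALL)
    r[0] = c[x]

r = np.zeros(1)
const_kernel[1, 1](r, 2)
print(r[0])  # 2.0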

The following change provides a proof-of-concept that demonstrates creating references to arrays instead of const copies:

diff --git a/numba/core/base.py b/numba/core/base.py
index 1c0a5ee5a..0728a731e 100644
--- a/numba/core/base.py
+++ b/numba/core/base.py
@@ -1086,7 +1086,7 @@ class BaseContext(object):
         llvoidptr = self.get_value_type(types.voidptr)
         addr = self.get_constant(types.uintp, intaddr).inttoptr(llvoidptr)
         # Use a unique name by embedding the address value
-        symname = 'numba.dynamic.globals.{:x}'.format(intaddr)
+        symname = 'numba_dynamic_globals_{:x}'.format(intaddr)
         gv = cgutils.add_global_variable(mod, llvoidptr, symname)
         # Use linkonce linkage to allow merging with other GV of the same name.
         # And, avoid optimization from assuming its value.
diff --git a/numba/cuda/target.py b/numba/cuda/target.py
index 492f375f6..7b44da626 100644
--- a/numba/cuda/target.py
+++ b/numba/cuda/target.py
@@ -69,6 +69,7 @@ VALID_CHARS = re.compile(r'[^a-z0-9]', re.I)
 class CUDATargetContext(BaseContext):
     implement_powi_as_math_call = True
     strict_alignment = True
+    allow_dynamic_globals = True
 
     def __init__(self, typingctx, target='cuda'):
         super().__init__(typingctx, target)
@@ -292,43 +293,20 @@ class CUDATargetContext(BaseContext):
         Unlike the parent version.  This returns a a pointer in the constant
         addrspace.
         """
-
-        lmod = builder.module
-
-        constvals = [
-            self.get_constant(types.byte, i)
-            for i in iter(arr.tobytes(order='A'))
-        ]
-        constaryty = ir.ArrayType(ir.IntType(8), len(constvals))
-        constary = ir.Constant(constaryty, constvals)
-
-        addrspace = nvvm.ADDRSPACE_CONSTANT
-        gv = cgutils.add_global_variable(lmod, constary.type, "_cudapy_cmem",
-                                         addrspace=addrspace)
-        gv.linkage = 'internal'
-        gv.global_constant = True
-        gv.initializer = constary
-
-        # Preserve the underlying alignment
-        lldtype = self.get_data_type(aryty.dtype)
-        align = self.get_abi_sizeof(lldtype)
-        gv.align = 2 ** (align - 1).bit_length()
-
-        # Convert to generic address-space
-        ptrty = ir.PointerType(ir.IntType(8))
-        genptr = builder.addrspacecast(gv, ptrty, 'generic')
-
         # Create array object
-        ary = self.make_array(aryty)(self, builder)
+        dataptr = arr.device_ctypes_pointer.value
+        data = self.add_dynamic_addr(builder, dataptr, info=str(type(dataptr)))
+        rt_addr = self.add_dynamic_addr(builder, id(arr), info=str(type(arr)))
         kshape = [self.get_constant(types.intp, s) for s in arr.shape]
         kstrides = [self.get_constant(types.intp, s) for s in arr.strides]
-        self.populate_array(ary, data=builder.bitcast(genptr, ary.data.type),
+        cary = self.make_array(aryty)(self, builder)
+        self.populate_array(cary, data=builder.bitcast(data, cary.data.type),
                             shape=kshape,
                             strides=kstrides,
-                            itemsize=ary.itemsize, parent=ary.parent,
+                            itemsize=arr.dtype.itemsize, parent=rt_addr,
                             meminfo=None)
 
-        return ary._getvalue()
+        return cary._getvalue()
 
     def insert_const_string(self, mod, string):
         """

This does require the example to be modified so that the data is already on the device:

from numba import cuda
import numpy as np

N = 100000
closed_array = cuda.to_device(np.ones(N))

@cuda.jit(cache=True)
def kernel(r, x):
    r[0] = closed_array[x]

r = np.zeros(1)

kernel[1, 1](r, 2)
print(r[0])

The example now works as expected:

$ python repro.py 
1.0

To complete the implementation, the following considerations need to be addressed:

  • Caching needs to be disabled for kernels that reference globals and closure variables. The docstring for BaseContext.add_dynamic_addr() suggests that adding a dynamic address disables caching, but this does not appear to be the case for CUDA, at least. (A rough sketch of such a guard follows this list.)
  • Copying data to / from the device at launch time for global and closure variables that live on the host needs to be considered.
  • The CUDA documentation should clarify that const copies are never made automatically, and point to the const array constructor (cuda.const.array_like()) for explicit construction.
  • Tests need to be added for both the global and the closure variable cases.
  • Any other memory management considerations as required (e.g. behaviour across multiple devices, avoiding leaks when making implicit copies from the host, etc.)
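
A rough sketch of the caching guard mentioned above (hypothetical, not the actual implementation; it assumes the CUDA code library exposes the same has_dynamic_globals flag that the CPU caching path already consults):

def check_cachable(codelib, kernel_name):
    # add_dynamic_addr() embeds runtime pointers into the module, so a kernel
    # built that way cannot safely be reloaded from a cache in a new process.
    if getattr(codelib, 'has_dynamic_globals', False):
        raise RuntimeError(
            f"cannot cache kernel {kernel_name!r}: it references the runtime "
            "addresses of global or closure arrays"
        )
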
@gmarkall added the CUDA and bug - incorrect behavior labels on Jul 21, 2023
@gmarkall
Member Author

Noted also in the triage discussion - we will need to hold a reference to referenced globals and closure variables to ensure their lifetime exceeds that of the dispatcher.
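
A hypothetical sketch of what holding those references could look like (the names are illustrative, not Numba API):

class _ReferencedGlobals:
    """Pins arrays whose device pointers were embedded in generated code."""

    def __init__(self):
        self._pinned = []  # strong references, owned by the dispatcher

    def pin(self, arr):
        # Keeping a reference ensures the device allocation outlives any
        # kernel that has its address baked in.
        self._pinned.append(arr)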

@gmarkall added this to the 0.59.0-rc1 milestone on Jul 25, 2023
@gmarkall self-assigned this on Jul 25, 2023
@yxqd

yxqd commented Jul 27, 2023

@gmarkall I would like to report one error I encountered. I patched my local numba installation with your patch and it works well for my current use case. I then tried to use your test script above as a quick check of Numba's treatment of global/closure arrays. The first time I ran the script it succeeded, but all later runs throw an error like this:

Traceback (most recent call last):                                                                                                                                                                                  
  File "/dv/neutron/mcvine/acc/tests/./test_array_in_closure.py", line 17, in <module>                                                                                                                 
    kernel[1, 1](r, 2)                                                                                                                                                                                              
  File "/miniconda3/envs/dev-mcvine-acc-numba_0.57.1/lib/python3.9/site-packages/numba/cuda/dispatcher.py", line 541, in __call__                                                                      
    return self.dispatcher.call(args, self.griddim, self.blockdim,                                                                                                                                                  
  File "/miniconda3/envs/dev-mcvine-acc-numba_0.57.1/lib/python3.9/site-packages/numba/cuda/dispatcher.py", line 677, in call                                                                          
    kernel.launch(args, griddim, blockdim, stream, sharedmem)                                                                                                                                                       
  File "/miniconda3/envs/dev-mcvine-acc-numba_0.57.1/lib/python3.9/site-packages/numba/cuda/dispatcher.py", line 373, in launch                                                                        
    wb()                                                                                                                                                                                                            
  File "/miniconda3/envs/dev-mcvine-acc-numba_0.57.1/lib/python3.9/site-packages/numba/cuda/args.py", line 62, in <lambda>                                                                             
    retr.append(lambda: devary.copy_to_host(self.value, stream=stream))                                                                                                                                             
  File "/miniconda3/envs/dev-mcvine-acc-numba_0.57.1/lib/python3.9/site-packages/numba/cuda/cudadrv/devices.py", line 232, in _require_cuda_context                                                    
    return fn(*args, **kws)                                                                                                                                                                                         
  File "/miniconda3/envs/dev-mcvine-acc-numba_0.57.1/lib/python3.9/site-packages/numba/cuda/cudadrv/devicearray.py", line 277, in copy_to_host                                                         
    _driver.device_to_host(hostary, self, self.alloc_size,                                                                                                                                                          
  File "/miniconda3/envs/dev-mcvine-acc-numba_0.57.1/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 3247, in device_to_host                                                           
    fn(host_pointer(dst), device_pointer(src), size, *varargs)                                                                                                                                                      
  File "/miniconda3/envs/dev-mcvine-acc-numba_0.57.1/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 331, in safe_cuda_api_call                                                        
    self._check_ctypes_error(fname, retcode)                                                                                                                                                                        
  File "/miniconda3/envs/dev-mcvine-acc-numba_0.57.1/lib/python3.9/site-packages/numba/cuda/cudadrv/driver.py", line 399, in _check_ctypes_error                                                       
    raise CudaAPIError(retcode, msg)                                                                                                                                                                                
numba.cuda.cudadrv.driver.CudaAPIError: [700] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR   

Does this have something to do with the lifetime management you mentioned above?

@gmarkall
Member Author

Are you caching the kernel?

@yxqd

yxqd commented Jul 28, 2023

Got it. Thanks! Turning caching off solves the problem.
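
(Concretely, that is the modified repro above with caching left off:)

from numba import cuda
import numpy as np

N = 100000
closed_array = cuda.to_device(np.ones(N))

@cuda.jit  # no cache=True: the kernel embeds the runtime address of closed_array
def kernel(r, x):
    r[0] = closed_array[x]

r = np.zeros(1)
kernel[1, 1](r, 2)
print(r[0])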

@gmarkall
Member Author

No problem. A final patch implementing this would raise an error when attempting to cache a kernel that makes references to globals and/or closure variables.

@yxqd

yxqd commented Sep 14, 2023

@gmarkall I wonder if there is a timeline for when an official patch for this will be released? Thanks!

@gmarkall
Member Author

@yxqd This is in the 0.59 milestone, so it should be part of that release.

@gmarkall
Member Author

0.59 seems to be aimed for 2023-12: #8971

@yxqd

yxqd commented Sep 14, 2023

Got it. Thanks!
