diff --git a/dev/.documenter-siteinfo.json b/dev/.documenter-siteinfo.json index 1ffa8362..4aaaa3ad 100644 --- a/dev/.documenter-siteinfo.json +++ b/dev/.documenter-siteinfo.json @@ -1 +1 @@ -{"documenter":{"julia_version":"1.10.7","generation_timestamp":"2025-01-08T18:21:18","documenter_version":"1.8.0"}} \ No newline at end of file +{"documenter":{"julia_version":"1.10.7","generation_timestamp":"2025-01-09T13:07:17","documenter_version":"1.8.0"}} \ No newline at end of file diff --git a/dev/api/index.html b/dev/api/index.html index 78f158c8..302103fa 100644 --- a/dev/api/index.html +++ b/dev/api/index.html @@ -3,4 +3,4 @@ function gtag(){dataLayer.push(arguments);} gtag('js', new Date()); gtag('config', 'UA-154489943-2', {'page_path': location.pathname + location.search + location.hash}); -

AMDGPU API Reference

Indexing

AMDGPU.Device.gridItemDimFunction
gridItemDim()::ROCDim3

Returns the size of the grid in workitems. This behaviour is different from CUDA where gridDim gives the size of the grid in blocks.

source

Use these functions for compatibility with CUDA.jl.

Synchronization

AMDGPU.Device.sync_workgroup_countFunction
sync_workgroup_count(predicate::Cint)::Cint

Identical to sync_workgroup, with the additional feature that it evaluates the predicate for all workitems in the workgroup and returns the number of workitems for which predicate evaluates to non-zero.

source
AMDGPU.Device.sync_workgroup_andFunction
sync_workgroup_and(predicate::Cint)::Cint

Identical to sync_workgroup, with the additional feature that it evaluates the predicate for all workitems in the workgroup and returns non-zero if and only if predicate evaluates to non-zero for all of them.

source
AMDGPU.Device.sync_workgroup_orFunction
sync_workgroup_or(predicate::Cint)::Cint

Identical to sync_workgroup, with the additional feature that it evaluates the predicate for all workitems in the workgroup and returns non-zero if and only if predicate evaluates to non-zero for any of them.

source
+

AMDGPU API Reference

Indexing

AMDGPU.Device.gridItemDimFunction
gridItemDim()::ROCDim3

Returns the size of the grid in workitems. This behaviour is different from CUDA where gridDim gives the size of the grid in blocks.

source
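
As a rough illustration of the difference, the arithmetic below relates the two conventions. It is a CPU-only sketch and assumes that the `gridsize` passed to `@roc` counts workgroups, as the `groupsize=256 gridsize=4` launch in the atomics example suggests; the variable names are illustrative and not part of the API.

```julia
# CPU-only arithmetic, no GPU needed. All values are hypothetical.
groupsize = 256   # workitems per workgroup
gridsize = 4      # workgroups in the grid (assumption: `@roc gridsize` counts workgroups)

grid_in_workitems = gridsize * groupsize   # the quantity `gridItemDim().x` describes
grid_in_groups = gridsize                  # the quantity CUDA's `gridDim().x` describes (blocks)

println((grid_in_workitems, grid_in_groups))
```

Under that assumption, code ported from CUDA that multiplies `gridDim` by `blockDim` to get the total thread count should not repeat the multiplication when it starts from `gridItemDim`.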

Use these functions for compatibility with CUDA.jl.

Synchronization

AMDGPU.Device.sync_workgroup_countFunction
sync_workgroup_count(predicate::Cint)::Cint

Identical to sync_workgroup, with the additional feature that it evaluates the predicate for all workitems in the workgroup and returns the number of workitems for which predicate evaluates to non-zero.

source
AMDGPU.Device.sync_workgroup_andFunction
sync_workgroup_and(predicate::Cint)::Cint

Identical to sync_workgroup, with the additional feature that it evaluates the predicate for all workitems in the workgroup and returns non-zero if and only if predicate evaluates to non-zero for all of them.

source
AMDGPU.Device.sync_workgroup_orFunction
sync_workgroup_or(predicate::Cint)::Cint

Identical to sync_workgroup, with the additional feature that it evaluates the predicate for all workitems in the workgroup and returns non-zero if and only if predicate evaluates to non-zero for any of them.

source
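
The three reductions can be summarized with a small host-side reference model. This is a sketch of the documented semantics only, not device code; `predicates` is a hypothetical vector holding the `Cint` value each workitem of one workgroup would pass, and the model returns exactly 1 where the device functions only promise a non-zero value.

```julia
# CPU reference model of the workgroup-wide reductions; not device code.
# `predicates[k]` stands for the predicate passed by workitem `k`.
predicates = Cint[1, 0, 3, 0]

count_result = Cint(count(!iszero, predicates))  # models sync_workgroup_count
and_result   = Cint(all(!iszero, predicates))    # models sync_workgroup_and
or_result    = Cint(any(!iszero, predicates))    # models sync_workgroup_or

println((count_result, and_result, or_result))
```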
diff --git a/dev/devices/index.html b/dev/devices/index.html index 99a8f282..26b28f3a 100644 --- a/dev/devices/index.html +++ b/dev/devices/index.html @@ -6,5 +6,5 @@

Devices

In AMDGPU, all GPU devices are auto-detected by the runtime, if they're supported.

AMDGPU maintains a global default device. The default device is relevant for all kernel and GPUArray operations. If one is not specified via @roc or an equivalent interface, then the default device is used for those operations, which affects compilation and kernel launch.

The device bound to the current Julia task is accessible via the AMDGPU.device method. The list of available devices can be queried with the AMDGPU.devices method.

If you have a HIPDevice object, you can also switch the device with AMDGPU.device!. This will switch it only within the task it is called from.

xd1 = AMDGPU.ones(Float32, 16) # On `AMDGPU.device()` device.
 
 AMDGPU.device!(AMDGPU.devices()[2]) # Switch to second device.
-xd2 = AMDGPU.ones(Float32, 16) # On second device.

Additionally, devices have an associated numeric ID. This value is bounded between 1 and length(AMDGPU.devices()), and device 1 is the default device when AMDGPU is first loaded. The ID of the device associated with the current task can be queried with AMDGPU.device_id and changed with AMDGPU.device_id!.

AMDGPU.deviceFunction
device()::HIPDevice

Get currently active device. This device is used when launching kernels via @roc.

source
device(A::ROCArray) -> HIPDevice

Return the device associated with the array A.

source
AMDGPU.device!Function
device!(device::HIPDevice)

Switch current device being used. This switches only for a task inside which it is called.

source
AMDGPU.device_idFunction
device_id() -> Int
-device_id(device::HIPDevice) -> Int

Returns the numerical device ID for device or for the current AMDGPU.device().

source
AMDGPU.device_id!Function
device_id!(idx::Integer)

Sets the current device to AMDGPU.devices()[idx]. See device_id for details on the numbering semantics.

source

Device Properties

AMDGPU.HIP.propertiesFunction
properties(dev::HIPDevice)::hipDeviceProp_t

Get all properties for the device. See HIP documentation for hipDeviceProp_t for the meaning of each field.

source
+xd2 = AMDGPU.ones(Float32, 16) # On second device.

Additionally, devices have an associated numeric ID. This value is bounded between 1 and length(AMDGPU.devices()), and device 1 is the default device when AMDGPU is first loaded. The ID of the device associated with the current task can be queried with AMDGPU.device_id and changed with AMDGPU.device_id!.

AMDGPU.HIP.devicesFunction
devices()

Get list of all devices.

source
AMDGPU.deviceFunction
device()::HIPDevice

Get currently active device. This device is used when launching kernels via @roc.

source
device(A::ROCArray) -> HIPDevice

Return the device associated with the array A.

source
AMDGPU.device!Function
device!(device::HIPDevice)

Switch current device being used. This switches only for a task inside which it is called.

source
AMDGPU.device_idFunction
device_id() -> Int
+device_id(device::HIPDevice) -> Int

Returns the numerical device ID for device or for the current AMDGPU.device().

source
AMDGPU.device_id!Function
device_id!(idx::Integer)

Sets the current device to AMDGPU.devices()[idx]. See device_id for details on the numbering semantics.

source

Device Properties

AMDGPU.HIP.nameFunction
name(dev::HIPDevice)::String

Get name of the device.

source
AMDGPU.HIP.wavefrontsizeFunction
wavefrontsize(d::HIPDevice)::Cint

Get size of the wavefront. AMD GPUs support either 32 or 64.

source
AMDGPU.HIP.gcn_archFunction
gcn_arch(d::HIPDevice)::String

Get GCN architecture for the device.

source
AMDGPU.HIP.device_idFunction
device_id(d::HIPDevice)

Zero-based device ID as expected by HIP functions. Differs from AMDGPU.device_id method by 1.

source
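
The off-by-one relationship between the two numbering schemes can be sketched as follows. The helpers are hypothetical and exist only for illustration; they are not part of AMDGPU.jl and run on the CPU.

```julia
# Hypothetical helpers illustrating the two numbering schemes.
to_hip_id(amdgpu_id::Integer) = amdgpu_id - 1   # 1-based (AMDGPU.device_id) -> 0-based (HIP)
to_amdgpu_id(hip_id::Integer) = hip_id + 1      # 0-based (HIP) -> 1-based (AMDGPU.device_id)

for id in 1:3
    println("AMDGPU.device_id = ", id, "  <->  HIP.device_id = ", to_hip_id(id))
end
```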
AMDGPU.HIP.propertiesFunction
properties(dev::HIPDevice)::hipDeviceProp_t

Get all properties for the device. See HIP documentation for hipDeviceProp_t for the meaning of each field.

source
diff --git a/dev/exceptions/index.html b/dev/exceptions/index.html index 63f20cee..b84ceddd 100644 --- a/dev/exceptions/index.html +++ b/dev/exceptions/index.html @@ -25,4 +25,4 @@ [4] synchronize() @ AMDGPU ~/.julia/dev/AMDGPU/src/highlevel.jl:154 [5] top-level scope - @ REPL[5]:1

Kernel-thrown exceptions are thrown during the host synchronization AMDGPU.synchronize or on the next kernel launch.

A kernel that hits an exception writes the exception information into a pre-allocated host buffer. Once complete, the wavefront throwing the exception will lock the buffer to prevent other wavefronts from overwriting the exception and stop itself, but other wavefronts will continue executing.

+ @ REPL[5]:1

Kernel-thrown exceptions are thrown during the host synchronization AMDGPU.synchronize or on the next kernel launch.

A kernel that hits an exception writes the exception information into a pre-allocated host buffer. Once complete, the wavefront throwing the exception will lock the buffer to prevent other wavefronts from overwriting the exception and stop itself, but other wavefronts will continue executing.

diff --git a/dev/hostcall/index.html b/dev/hostcall/index.html index 20b0e2fd..8acfb32d 100644 --- a/dev/hostcall/index.html +++ b/dev/hostcall/index.html @@ -17,4 +17,4 @@ AMDGPU.synchronize(; stop_hostcalls=true) # Stop hostcall. AMDGPU.Device.free!(hc) # Free hostcall buffers. -@assert Array(y)[1] ≈ 42f0

In this example, HostCallHolder is used to create and launch HostCall. HostCallHolder contains the HostCall structure itself that is passed to kernel, a task that is spawned on creation and some additional info for controlling the lifetime of the task.

First argument is a function we want to execute when we call the hostcall. In this case we add 42f0 to input argument x and return the result.

Second and third arguments are the return type Float32 and the tuple of types of input arguments Tuple{Float32}.

hostcall! is used to execute the function on the host, wait on the result, and obtain the return values. At the moment, it is performed once per workgroup.

Continuous Host-Call

By default, hostcalls can be used only once. After executing the function on the host, the task finishes and exits.

However, if you need your hostcall to live indefinitely, pass continuous=true keyword argument to HostCallHolder(...; continuous=true).

To then stop the hostcall, call Device.non_continuous!(hc) or Device.finish!(hc) on the HostCallHolder.

The difference between them is that non_continuous! will allow calling hostcall one more time before exiting, while finish! will exit immediately.

finish! can be used on any HostCallHolder to force-exit the running hostcall task.

Free hostcall buffers

For custom hostcalls it is important to call AMDGPU.Device.free! once kernel has finished to free buffers that hostcall used in the process.

+@assert Array(y)[1] ≈ 42f0

In this example, HostCallHolder is used to create and launch HostCall. HostCallHolder contains the HostCall structure itself that is passed to kernel, a task that is spawned on creation and some additional info for controlling the lifetime of the task.

First argument is a function we want to execute when we call the hostcall. In this case we add 42f0 to input argument x and return the result.

Second and third arguments are the return type Float32 and the tuple of types of input arguments Tuple{Float32}.

hostcall! is used to execute the function on the host, wait on the result, and obtain the return values. At the moment, it is performed once per workgroup.

Continuous Host-Call

By default, hostcalls can be used only once. After executing the function on the host, the task finishes and exits.

However, if you need your hostcall to live indefinitely, pass continuous=true keyword argument to HostCallHolder(...; continuous=true).

To then stop the hostcall, call Device.non_continuous!(hc) or Device.finish!(hc) on the HostCallHolder.

The difference between them is that non_continuous! will allow calling hostcall one more time before exiting, while finish! will exit immediately.

finish! can be used on any HostCallHolder to force-exit the running hostcall task.

Free hostcall buffers

For custom hostcalls it is important to call AMDGPU.Device.free! once kernel has finished to free buffers that hostcall used in the process.

diff --git a/dev/index.html b/dev/index.html index 5f90d593..6885a60a 100644 --- a/dev/index.html +++ b/dev/index.html @@ -17,4 +17,4 @@ # Default is "none", which does not apply any limitation. hard_memory_limit = "none" # Notice a space between the value and percentage sign. -# hard_memory_limit = "80 %" +# hard_memory_limit = "80 %" diff --git a/dev/kernel_programming/index.html b/dev/kernel_programming/index.html index df1a4ef0..e24a19fd 100644 --- a/dev/kernel_programming/index.html +++ b/dev/kernel_programming/index.html @@ -6,7 +6,7 @@

Kernel Programming

Launch Configuration

While an almost arbitrarily large number of workitems can be executed per kernel launch, the hardware can only support executing a limited number of wavefronts at one time.

To alleviate this, the compiler calculates the "occupancy" of each compiled kernel (which is the number of wavefronts that can be simultaneously executing on the GPU), and passes this information to the hardware; the hardware then launches a limited number of wavefronts at once, based on the kernel's "occupancy" values.

The rest of the wavefronts are not launched until hardware resources become available, which means that a kernel with better occupancy will see more of its wavefronts executing simultaneously (which often leads to better performance). Suffice to say, it's important to know the occupancy of kernels if you want the best performance.

Like CUDA.jl, AMDGPU.jl has the ability to calculate kernel occupancy, with the launch_configuration function:

kernel = @roc launch=false mykernel(args...)
 occupancy = AMDGPU.launch_configuration(kernel)
 @show occupancy.gridsize
-@show occupancy.groupsize

Specifically, launch_configuration calculates the occupancy of mykernel(args...), and then calculates an optimal groupsize based on the occupancy. This value can then be used to select the groupsize for the kernel:

@roc groupsize=occupancy.groupsize mykernel(args...)
AMDGPU.@rocMacro
@roc [kwargs...] func(args...)

High-level interface for launching kernels on GPU. Upon a first call it will be compiled, subsequent calls will re-use the compiled object.

Several keyword arguments are supported:

  • launch::Bool = true: whether to launch the kernel. If false, then returns a compiled kernel which can be launched by calling it and passing arguments.
  • Arguments that influence kernel compilation, see AMDGPU.Compiler.hipfunction.
  • Arguments that influence kernel launch, see AMDGPU.Runtime.HIPKernel.
source
AMDGPU.Runtime.HIPKernelType
(ker::HIPKernel)(args::Vararg{Any, N}; kwargs...)

Launch compiled HIPKernel by passing arguments to it.

The following kwargs are supported:

  • gridsize::ROCDim = 1: Size of the grid.
  • groupsize::ROCDim = 1: Size of the workgroup.
  • shmem::Integer = 0: Amount of dynamically-allocated shared memory in bytes.
  • stream::HIP.HIPStream = AMDGPU.stream(): Stream on which to launch the kernel.
source
AMDGPU.Compiler.hipfunctionFunction
hipfunction(f::F, tt::TT = Tuple{}; kwargs...)

Compile Julia function f to a HIP kernel given a tuple of argument's types tt that it accepts.

The following kwargs are supported:

  • name::Union{String, Nothing} = nothing: A unique name to give a compiled kernel.
  • unsafe_fp_atomics::Bool = true: Whether to use 'unsafe' floating-point atomics. AMD GPU devices support fast atomic read-modify-write (RMW) operations on floating-point values. On single- or double-precision floating-point values this may generate a hardware RMW instruction that is faster than emulating the atomic operation using an atomic compare-and-swap (CAS) loop.
source

Atomics

AMDGPU.jl relies on Atomix.jl for atomics.

Example of a kernel that computes atomic max:

using AMDGPU
+@show occupancy.groupsize

Specifically, launch_configuration calculates the occupancy of mykernel(args...), and then calculates an optimal groupsize based on the occupancy. This value can then be used to select the groupsize for the kernel:

@roc groupsize=occupancy.groupsize mykernel(args...)
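
Once a groupsize has been chosen, the number of workgroups still has to cover the whole problem. The CPU-only sketch below shows that arithmetic, assuming `gridsize` counts workgroups; `n` and `groupsize` are hypothetical stand-ins for the array length and for `occupancy.groupsize`.

```julia
n = 1000          # hypothetical number of elements to process
groupsize = 256   # stand-in for `occupancy.groupsize`

gridsize = cld(n, groupsize)      # round up so that every element is covered
launched = gridsize * groupsize   # total number of workitems launched
idle = launched - n               # workitems that have no element to process

println((gridsize, launched, idle))
```

Because rounding up usually launches more workitems than there are elements, the kernel needs a bounds check on its global index before touching the array.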
AMDGPU.@rocMacro
@roc [kwargs...] func(args...)

High-level interface for launching kernels on GPU. Upon a first call it will be compiled, subsequent calls will re-use the compiled object.

Several keyword arguments are supported:

  • launch::Bool = true: whether to launch the kernel. If false, then returns a compiled kernel which can be launched by calling it and passing arguments.
  • Arguments that influence kernel compilation, see AMDGPU.Compiler.hipfunction.
  • Arguments that influence kernel launch, see AMDGPU.Runtime.HIPKernel.
source
AMDGPU.Runtime.HIPKernelType
(ker::HIPKernel)(args::Vararg{Any, N}; kwargs...)

Launch compiled HIPKernel by passing arguments to it.

The following kwargs are supported:

  • gridsize::ROCDim = 1: Size of the grid.
  • groupsize::ROCDim = 1: Size of the workgroup.
  • shmem::Integer = 0: Amount of dynamically-allocated shared memory in bytes.
  • stream::HIP.HIPStream = AMDGPU.stream(): Stream on which to launch the kernel.
source
AMDGPU.Compiler.hipfunctionFunction
hipfunction(f::F, tt::TT = Tuple{}; kwargs...)

Compile Julia function f to a HIP kernel given a tuple of argument's types tt that it accepts.

The following kwargs are supported:

  • name::Union{String, Nothing} = nothing: A unique name to give a compiled kernel.
  • unsafe_fp_atomics::Bool = true: Whether to use 'unsafe' floating-point atomics. AMD GPU devices support fast atomic read-modify-write (RMW) operations on floating-point values. On single- or double-precision floating-point values this may generate a hardware RMW instruction that is faster than emulating the atomic operation using an atomic compare-and-swap (CAS) loop.
source

Atomics

AMDGPU.jl relies on Atomix.jl for atomics.

Example of a kernel that computes atomic max:

using AMDGPU
 
 function ker_atomic_max!(target, source, indices)
     i = workitemIdx().x + (workgroupIdx().x - 0x1) * workgroupDim().x
@@ -20,7 +20,7 @@
 source = ROCArray(rand(UInt32, n))
 indices = ROCArray(rand(1:bins, n))
 target = ROCArray(zeros(UInt32, bins))
-@roc groupsize=256 gridsize=4 ker_atomic_max!(target, source, indices)

Device Intrinsics

Wavefront-Level Primitives

AMDGPU.Device.activelaneFunction
activelane()::Cuint

Get id of the current lane within a wavefront/warp.

julia> function ker!(x)
+@roc groupsize=256 gridsize=4 ker_atomic_max!(target, source, indices)
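
A host-side reference is handy for checking the result of such a kernel. The sketch below assumes that the kernel stores into `target[indices[i]]` the maximum of its current value and `source[i]`; the function name `atomic_max_reference` and the small inputs are hypothetical, and everything runs on the CPU.

```julia
# CPU reference for a binned maximum; not device code.
# Assumption: the kernel computes target[indices[i]] = max(target[indices[i]], source[i]).
function atomic_max_reference(source, indices, bins)
    target = zeros(eltype(source), bins)
    for (value, bin) in zip(source, indices)
        target[bin] = max(target[bin], value)
    end
    return target
end

source = UInt32[5, 9, 2, 7]
indices = [1, 2, 1, 2]
reference = atomic_max_reference(source, indices, 2)
println(reference)
```

After synchronizing, the device result copied back with `Array(target)` can be compared against a reference computed from the host copies of the same inputs.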

Device Intrinsics

Wavefront-Level Primitives

AMDGPU.Device.activelaneFunction
activelane()::Cuint

Get id of the current lane within a wavefront/warp.

julia> function ker!(x)
            i = AMDGPU.Device.activelane()
            x[i + 1] = i
            return
@@ -33,7 +33,7 @@
 
 julia> Array(x)
 1×8 Matrix{Int32}:
- 0  1  2  3  4  5  6  7
source
AMDGPU.Device.ballotFunction
ballot(predicate::Bool)::UInt64

Return a value whose Nth bit is set if and only if predicate evaluates to true for the Nth lane and the lane is active.

julia> function ker!(x)
+ 0  1  2  3  4  5  6  7
source
AMDGPU.Device.ballotFunction
ballot(predicate::Bool)::UInt64

Return a value whose Nth bit is set if and only if predicate evaluates to true for the Nth lane and the lane is active.

julia> function ker!(x)
            x[1] = AMDGPU.Device.ballot(true)
            return
        end
@@ -45,7 +45,7 @@
 
 julia> x
 1-element ROCArray{UInt64, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
- 0x00000000ffffffff
source
AMDGPU.Device.ballot_syncFunction
ballot_sync(mask::UInt64, predicate::Bool)::UInt64

Evaluate predicate for all non-exited threads in mask and return an integer whose Nth bit is set if and only if predicate is true for the Nth thread of the wavefront and the Nth thread is active.

julia> function ker!(x)
+ 0x00000000ffffffff
source
AMDGPU.Device.ballot_syncFunction
ballot_sync(mask::UInt64, predicate::Bool)::UInt64

Evaluate predicate for all non-exited threads in mask and return an integer whose Nth bit is set if and only if predicate is true for the Nth thread of the wavefront and the Nth thread is active.

julia> function ker!(x)
            i = AMDGPU.Device.activelane()
            if i % 2 == 0
                mask = 0x0000000055555555 # Only even threads.
@@ -60,7 +60,7 @@
 julia> @roc groupsize=32 ker!(x);
 
 julia> bitstring(Array(x)[1])
-"0000000000000000000000000000000001010101010101010101010101010101"
source
AMDGPU.Device.bpermuteFunction
bpermute(addr::Integer, val::Cint)::Cint

Read data stored in val from the lane VGPR (vector general purpose register) given by addr.

The permute instruction moves data between lanes but still uses the notion of byte addressing, as do other LDS instructions. Hence, the value in the addr VGPR should be desired_lane_id * 4, since VGPR values are 4 bytes wide.

Example below shifts all values in the wavefront by 1 to the "left".

julia> function ker!(x)
+"0000000000000000000000000000000001010101010101010101010101010101"
source
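
The bit layout returned by `ballot` and `ballot_sync` can be modeled on the host. This is a simplified sketch rather than device code: `ballot_model` is a hypothetical helper, lanes are numbered from 0, and inactive or masked-out lanes are modeled as a `false` predicate.

```julia
# CPU model of the ballot bit layout: bit N is set when lane N votes `true`.
# `predicates[k]` is the predicate of lane `k - 1`.
function ballot_model(predicates::AbstractVector{Bool})
    bits = UInt64(0)
    for (k, p) in enumerate(predicates)
        p && (bits |= UInt64(1) << (k - 1))
    end
    return bits
end

all_lanes = ballot_model(fill(true, 32))                    # every lane of a 32-lane launch votes
even_lanes = ballot_model([iseven(lane) for lane in 0:31])  # only even lanes vote

println(bitstring(even_lanes))
```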
AMDGPU.Device.bpermuteFunction
bpermute(addr::Integer, val::Cint)::Cint

Read data stored in val from the lane VGPR (vector general purpose register) given by addr.

The permute instruction moves data between lanes but still uses the notion of byte addressing, as do other LDS instructions. Hence, the value in the addr VGPR should be desired_lane_id * 4, since VGPR values are 4 bytes wide.

Example below shifts all values in the wavefront by 1 to the "left".

julia> function ker!(x)
            i::Cint = AMDGPU.Device.activelane()
            # `addr` points to the next immediate lane.
            addr = ((i + 1) % 8) * 4 # VGPRs are 4 bytes wide
@@ -76,7 +76,7 @@
 
 julia> x
 1×8 ROCArray{Int32, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
- 1  2  3  4  5  6  7  0
source
AMDGPU.Device.permuteFunction
permute(addr::Integer, val::Cint)::Cint

Put data stored in val to the lane VGPR (vector general purpose register) given by addr.

Example below shifts all values in the wavefront by 1 to the "right".

julia> function ker!(x)
+ 1  2  3  4  5  6  7  0
source
AMDGPU.Device.permuteFunction
permute(addr::Integer, val::Cint)::Cint

Put data stored in val to the lane VGPR (vector general purpose register) given by addr.

Example below shifts all values in the wavefront by 1 to the "right".

julia> function ker!(x)
            i::Cint = AMDGPU.Device.activelane()
            # `addr` points to the next immediate lane.
            addr = ((i + 1) % 8) * 4 # VGPRs are 4 bytes wide
@@ -92,7 +92,7 @@
 
 julia> x
 1×8 ROCArray{Int32, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
- 7  0  1  2  3  4  5  6
source
AMDGPU.Device.shflFunction
shfl(val, lane, width = wavefrontsize())

Read data stored in val from a lane (this is a more high-level op than bpermute).

If lane is outside the range [0:width - 1], the value returned corresponds to the value held by the lane modulo width (within the same subsection).

julia> function ker!(x)
+ 7  0  1  2  3  4  5  6
source
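
The difference between `bpermute` (each lane pulls a value) and `permute` (each lane pushes its value) can be modeled on the host. The sketch below is a simplified CPU model with hypothetical helper names; it assumes that each lane holds its own lane ID as `val` and that the addresses form a permutation of the lanes.

```julia
# CPU model; lanes are 0-based and `vals[l + 1]` is the value held by lane `l`.
# `addrs` holds byte addresses, i.e. lane_id * 4.

# bpermute ("pull"): lane l reads the value of lane addrs[l] ÷ 4.
bpermute_model(addrs, vals) = [vals[a ÷ 4 + 1] for a in addrs]

# permute ("push"): lane l writes its value into lane addrs[l] ÷ 4.
# Assumes `addrs` is a permutation, so every output slot is written once.
function permute_model(addrs, vals)
    out = similar(vals)
    for (k, a) in enumerate(addrs)
        out[a ÷ 4 + 1] = vals[k]
    end
    return out
end

lanes = collect(0:7)
addrs = ((lanes .+ 1) .% 8) .* 4   # each lane addresses the next lane, wrapping around

pulled = bpermute_model(addrs, lanes)   # values move one lane to the "left"
pushed = permute_model(addrs, lanes)    # values move one lane to the "right"
println((pulled, pushed))
```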
AMDGPU.Device.shflFunction
shfl(val, lane, width = wavefrontsize())

Read data stored in val from a lane (this is a more high-level op than bpermute).

If lane is outside the range [0:width - 1], the value returned corresponds to the value held by the lane modulo width (within the same subsection).

julia> function ker!(x)
            i::UInt32 = AMDGPU.Device.activelane()
            x[i + 1] = AMDGPU.Device.shfl(i, i + 1)
            return
@@ -118,7 +118,7 @@
 
 julia> Int.(x)
 1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
- 1  2  3  0  5  6  7  4
source
AMDGPU.Device.shfl_syncFunction
shfl_sync(mask::UInt64, val, lane, width = wavefrontsize())

Synchronize threads according to a mask and read data stored in val from a lane ID.

source
AMDGPU.Device.shfl_upFunction
shfl_up(val, δ, width = wavefrontsize())

Same as shfl, but instead of specifying lane ID, accepts δ that is subtracted from the current lane ID. I.e. read from a lane with lower ID relative to the caller.

julia> function ker!(x)
+ 1  2  3  0  5  6  7  4
source
AMDGPU.Device.shfl_syncFunction
shfl_sync(mask::UInt64, val, lane, width = wavefrontsize())

Synchronize threads according to a mask and read data stored in val from a lane ID.

source
AMDGPU.Device.shfl_upFunction
shfl_up(val, δ, width = wavefrontsize())

Same as shfl, but instead of specifying lane ID, accepts δ that is subtracted from the current lane ID. I.e. read from a lane with lower ID relative to the caller.

julia> function ker!(x)
            i = AMDGPU.Device.activelane()
            x[i + 1] = AMDGPU.Device.shfl_up(i, 1)
            return
@@ -131,7 +131,7 @@
 
 julia> x
 1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
- 0  0  1  2  3  4  5  6
source
AMDGPU.Device.shfl_up_syncFunction
shfl_up_sync(mask::UInt64, val, δ, width = wavefrontsize())

Synchronize threads according to a mask and read data stored in val from a lane with lower ID relative to the caller.

source
AMDGPU.Device.shfl_downFunction
shfl_down(val, δ, width = wavefrontsize())

Same as shfl, but instead of specifying lane ID, accepts δ that is added to the current lane ID. I.e. read from a lane with higher ID relative to the caller.

julia> function ker!(x)
+ 0  0  1  2  3  4  5  6
source
AMDGPU.Device.shfl_up_syncFunction
shfl_up_sync(mask::UInt64, val, δ, width = wavefrontsize())

Synchronize threads according to a mask and read data stored in val from a lane with lower ID relative to the caller.

source
AMDGPU.Device.shfl_downFunction
shfl_down(val, δ, width = wavefrontsize())

Same as shfl, but instead of specifying lane ID, accepts δ that is added to the current lane ID. I.e. read from a lane with higher ID relative to the caller.

julia> function ker!(x)
            i = AMDGPU.Device.activelane()
            x[i + 1] = AMDGPU.Device.shfl_down(i, 1, 8)
            return
@@ -144,7 +144,7 @@
 
 julia> x
 1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
- 1  2  3  4  5  6  7  7
source
AMDGPU.Device.shfl_down_syncFunction
shfl_down_sync(mask::UInt64, val, δ, width = wavefrontsize())

Synchronize threads according to a mask and read data stored in val from a lane with higher ID relative to the caller.

source
AMDGPU.Device.shfl_xorFunction
shfl_xor(val, lane_mask, width = wavefrontsize())

Same as shfl, but instead of specifying lane ID, performs bitwise XOR of the caller's lane ID with the lane_mask.

julia> function ker!(x)
+ 1  2  3  4  5  6  7  7
source
AMDGPU.Device.shfl_down_syncFunction
shfl_down_sync(mask::UInt64, val, δ, width = wavefrontsize())

Synchronize threads according to a mask and read data stored in val from a lane with higher ID relative to the caller.

source
AMDGPU.Device.shfl_xorFunction
shfl_xor(val, lane_mask, width = wavefrontsize())

Same as shfl, but instead of specifying lane ID, performs bitwise XOR of the caller's lane ID with the lane_mask.

julia> function ker!(x)
            i = AMDGPU.Device.activelane()
            x[i + 1] = AMDGPU.Device.shfl_xor(i, 1)
            return
@@ -157,7 +157,7 @@
 
 julia> x
 1×8 ROCArray{Int64, 2, AMDGPU.Runtime.Mem.HIPBuffer}:
- 1  0  3  2  5  4  7  6
source
AMDGPU.Device.shfl_xor_syncFunction
shfl_xor_sync(mask::UInt64, val, lane_mask, width = wavefrontsize())

Synchronize threads according to a mask and read data stored in val from a lane according to a bitwise XOR of the caller's lane ID with the lane_mask.

source
AMDGPU.Device.any_syncFunction
any_sync(mask::UInt64, predicate::Bool)::Bool

Evaluate predicate for all non-exited threads in mask and return non-zero if and only if predicate evaluates to non-zero for any of them.

julia> function ker!(x)
+ 1  0  3  2  5  4  7  6
source
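
The lane selection rules of the shuffle family can be compared side by side with a host-side model. This is a simplified sketch, not the device implementation: the `*_model` helpers are hypothetical, lanes are 0-based, the number of lanes is assumed to be a multiple of `width`, and a source lane outside the caller's segment is modeled as the caller reading its own value.

```julia
# Simplified CPU model of the shuffle family; `vals[l + 1]` is the value held by lane `l`.
segment_base(lane, width) = (lane ÷ width) * width

# shfl: read from lane `lanes[k] mod width` inside the caller's segment.
function shfl_model(vals, lanes, width)
    return [vals[segment_base(k - 1, width) + mod(lanes[k], width) + 1] for k in eachindex(vals)]
end

# Shared logic for the relative variants; `source_lane` maps the caller's lane to the source lane.
function shuffle_model(source_lane, vals, width)
    return map(eachindex(vals)) do k
        self = k - 1
        base = segment_base(self, width)
        src = source_lane(self)
        # Sources outside the caller's segment fall back to the caller's own value.
        src = (base <= src < base + width) ? src : self
        vals[src + 1]
    end
end

shfl_up_model(vals, δ, width) = shuffle_model(self -> self - δ, vals, width)
shfl_down_model(vals, δ, width) = shuffle_model(self -> self + δ, vals, width)
shfl_xor_model(vals, lane_mask, width) = shuffle_model(self -> self ⊻ lane_mask, vals, width)

vals = collect(0:7)
println(shfl_model(vals, vals .+ 1, 4))
println(shfl_up_model(vals, 1, 8))
println(shfl_down_model(vals, 1, 8))
println(shfl_xor_model(vals, 1, 8))
```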
AMDGPU.Device.shfl_xor_syncFunction
shfl_xor_sync(mask::UInt64, val, lane_mask, width = wavefrontsize())

Synchronize threads according to a mask and read data stored in val from a lane according to a bitwise XOR of the caller's lane ID with the lane_mask.

source
AMDGPU.Device.any_syncFunction
any_sync(mask::UInt64, predicate::Bool)::Bool

Evaluate predicate for all non-exited threads in mask and return non-zero if and only if predicate evaluates to non-zero for any of them.

julia> function ker!(x)
            i = AMDGPU.Device.activelane()
            if i % 2 == 0
                mask = 0x0000000055555555 # Only even threads.
@@ -173,7 +173,7 @@
 
 julia> x
 1-element ROCArray{Bool, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
- 1
source
AMDGPU.Device.all_syncFunction
all_sync(mask::UInt64, predicate::Bool)::Bool

Evaluate predicate for all non-exited threads in mask and return non-zero if and only if predicate evaluates to non-zero for all of them.

julia> function ker!(x)
+ 1
source
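
The effect of the mask on `any_sync`, and on the closely related `all_sync`, can be modeled on the host. The helpers below are hypothetical and only sketch the documented semantics on the CPU: bit `l` of `mask` selects lane `l` (0-based), and only selected lanes contribute their predicate.

```julia
# CPU model of masked any/all; `predicates[l + 1]` is the predicate of lane `l`.
in_mask(mask::UInt64, lane) = (mask >> lane) & 0x1 == 0x1

any_sync_model(mask, predicates) =
    any(predicates[l + 1] for l in 0:length(predicates)-1 if in_mask(mask, l))
all_sync_model(mask, predicates) =
    all(predicates[l + 1] for l in 0:length(predicates)-1 if in_mask(mask, l))

even_mask = 0x0000000055555555                 # selects even lanes only
preds = [iseven(lane) for lane in 0:31]        # predicate is true on even lanes

println((any_sync_model(even_mask, preds), all_sync_model(even_mask, preds)))
```

With a mask that also selects odd lanes, the same predicates no longer satisfy `all`, which is why the mask and the set of participating lanes need to agree.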
AMDGPU.Device.all_syncFunction
all_sync(mask::UInt64, predicate::Bool)::Bool

Evaluate predicate for all non-exited threads in mask and return non-zero if and only if predicate evaluates to non-zero for all of them.

julia> function ker!(x)
            i = AMDGPU.Device.activelane()
            if i % 2 == 0
                mask = 0x0000000055555555 # Only even threads.
@@ -189,4 +189,4 @@
 
 julia> x
 1-element ROCArray{Bool, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
- 1
source
+ 1
source
diff --git a/dev/logging/index.html b/dev/logging/index.html index f5710a51..aade2db2 100644 --- a/dev/logging/index.html +++ b/dev/logging/index.html @@ -9,4 +9,4 @@ fill!(B, 1f0) C = Array(B) end -@show logs[1] +@show logs[1]
diff --git a/dev/memory/index.html b/dev/memory/index.html index 532b15d5..f276e139 100644 --- a/dev/memory/index.html +++ b/dev/memory/index.html @@ -62,4 +62,4 @@ xd * xd # Freeing is a no-op for `xd`, since `xd` does not own the underlying memory. -AMDGPU.unsafe_free!(xd) # No-op.

Notice mandatory ; lock=false keyword, this is to be able to differentiate between host & device pointers.

+AMDGPU.unsafe_free!(xd) # No-op.

Notice mandatory ; lock=false keyword, this is to be able to differentiate between host & device pointers.

diff --git a/dev/printing/index.html b/dev/printing/index.html index 1269ae23..b8499a6d 100644 --- a/dev/printing/index.html +++ b/dev/printing/index.html @@ -38,4 +38,4 @@ My index is 1 # :grid -My index is 1

Differences to @cuprintf

Similar to CUDA's @cuprintf, @rocprintf is a printf-compatible macro which takes a format string and arguments, and commands the host CPU to display it as formatted text. However, in contrast to @cuprintf, we use AMDGPU's hostcall and Julia's Printf stdlib to implement this. This means that anything that Printf can print, so can @rocprintf (assuming such an object can be represented on the GPU). The macro is also handled as a regular hostcall, which means that argument types are checked at compile time (although currently, any errors while printing will be detected on the host, and will terminate the kernel).

+My index is 1

Differences to @cuprintf

Similar to CUDA's @cuprintf, @rocprintf is a printf-compatible macro which takes a format string and arguments, and commands the host CPU to display it as formatted text. However, in contrast to @cuprintf, we use AMDGPU's hostcall and Julia's Printf stdlib to implement this. This means that anything that Printf can print, so can @rocprintf (assuming such an object can be represented on the GPU). The macro is also handled as a regular hostcall, which means that argument types are checked at compile time (although currently, any errors while printing will be detected on the host, and will terminate the kernel).

diff --git a/dev/profiling/index.html b/dev/profiling/index.html index 2671adc9..0a3d6a28 100644 --- a/dev/profiling/index.html +++ b/dev/profiling/index.html @@ -34,4 +34,4 @@ @roc groupsize=groupsize gridsize=gridsize mycopy!(dst, src) end AMDGPU.synchronize() - ...

Running profiling again and visualizing results we now see that kernel launches are adjacent to each other and that the average wall duration is lower.

Zoomed out | Zoomed in
image | image

Debugging

Use HIP_LAUNCH_BLOCKING=1 to synchronize immediately after launching GPU kernels. This makes it possible to pinpoint the exact kernel that caused the exception.

+ ...

Running profiling again and visualizing results we now see that kernel launches are adjacent to each other and that the average wall duration is lower.

Zoomed out | Zoomed in
image | image

Debugging

Use HIP_LAUNCH_BLOCKING=1 to synchronize immediately after launching GPU kernels. This makes it possible to pinpoint the exact kernel that caused the exception.

diff --git a/dev/quickstart/index.html b/dev/quickstart/index.html index b0607558..782e78ea 100644 --- a/dev/quickstart/index.html +++ b/dev/quickstart/index.html @@ -28,4 +28,4 @@ julia> @roc groupsize=groupsize gridsize=gridsize vadd!(c_d, a_d, b_d); julia> Array(c_d) ≈ c -true

The easiest way to launch a GPU kernel is with the @roc macro, specifying groupsize and gridsize to cover the full array, and calling it like a regular function.

Keep in mind that kernel launches are asynchronous, meaning that you need to synchronize before you can use the result (e.g. with AMDGPU.synchronize). However, GPU <-> CPU transfers synchronize implicitly.

The grid is the domain over which the entire kernel executes. The hardware automatically splits the grid into multiple workgroups, and the kernel does not complete until all workgroups complete.

Like OpenCL, AMDGPU has the concept of "workitems", "workgroups", and the "grid". A workitem is a single thread of execution, capable of performing arithmetic operations. Workitems are grouped into "wavefronts" ("warps" in CUDA), which share the same compute unit and execute the same instructions simultaneously. The workgroup is a logical unit of compute supported by hardware; it comprises multiple wavefronts, which share resources (specifically local memory) and can be efficiently synchronized. A workgroup may be executed by one or multiple hardware compute units, making it often the only dimension of importance for smaller kernel launches.

Notice how we explicitly specify that this function does not return a value by adding the return statement. This is necessary for all GPU kernels and we can enforce it by adding a return, return nothing, or even nothing at the end of the kernel. If this statement is omitted, Julia will attempt to return the value of the last evaluated expression, in this case a Float64, which will cause a compilation failure as kernels cannot return values.

Naming conventions

Throughout this example we use terms like "work group" and "work item". These terms are used by the Khronos consortium and their APIs including OpenCL and Vulkan, as well as the HSA foundation.

NVIDIA, on the other hand, uses some different terms in their CUDA API, which might be confusing to some users porting their kernels from CUDA to AMDGPU.

As a quick summary, here is a mapping of the most common terms:

AMDGPU | CUDA
workitemIdx | threadIdx
workgroupIdx | blockIdx
workgroupDim | blockDim
gridItemDim | No equivalent
gridGroupDim | gridDim
groupsize | threads
gridsize | blocks
stream | stream
+true

The easiest way to launch a GPU kernel is with the @roc macro, specifying groupsize and gridsize to cover the full array, and calling it like a regular function.

Keep in mind that kernel launches are asynchronous, meaning that you need to synchronize before you can use the result (e.g. with AMDGPU.synchronize). However, GPU <-> CPU transfers synchronize implicitly.

The grid is the domain over which the entire kernel executes. The hardware automatically splits the grid into multiple workgroups, and the kernel does not complete until all workgroups complete.

Like OpenCL, AMDGPU has the concept of "workitems", "workgroups", and the "grid". A workitem is a single thread of execution, capable of performing arithmetic operations. Workitems are grouped into "wavefronts" ("warps" in CUDA), which share the same compute unit and execute the same instructions simultaneously. The workgroup is a logical unit of compute supported by hardware; it comprises multiple wavefronts, which share resources (specifically local memory) and can be efficiently synchronized. A workgroup may be executed by one or multiple hardware compute units, making it often the only dimension of importance for smaller kernel launches.

Notice how we explicitly specify that this function does not return a value by adding the return statement. This is necessary for all GPU kernels and we can enforce it by adding a return, return nothing, or even nothing at the end of the kernel. If this statement is omitted, Julia will attempt to return the value of the last evaluated expression, in this case a Float64, which will cause a compilation failure as kernels cannot return values.
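The implicit-return behaviour described above can be seen on the CPU without any GPU involved. The following is a minimal sketch in plain Julia; the function names `vadd_no_return!` and `vadd_with_return!` are hypothetical and only serve to contrast the two forms.

```julia
# Without an explicit return, Julia returns the value of the last
# evaluated expression. For an assignment, that is the assigned value.
function vadd_no_return!(c, a, b)
    c[1] = a[1] + b[1]
end

# A bare `return` makes the function return `nothing`,
# which is what a GPU kernel is required to do.
function vadd_with_return!(c, a, b)
    c[1] = a[1] + b[1]
    return
end

a = [1.0]
b = [2.0]
c = [0.0]

r1 = vadd_no_return!(c, a, b)    # a Float64
r2 = vadd_with_return!(c, a, b)  # nothing
```

Only the second form has the shape a kernel needs; the first would hand a Float64 back to the caller.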

Naming conventions

Throughout this example we use terms like "work group" and "work item". These terms are used by the Khronos consortium and their APIs including OpenCL and Vulkan, as well as the HSA foundation.

NVIDIA, on the other hand, uses some different terms in their CUDA API, which might be confusing to some users porting their kernels from CUDA to AMDGPU.

As a quick summary, here is a mapping of the most common terms:

AMDGPU | CUDA
workitemIdx | threadIdx
workgroupIdx | blockIdx
workgroupDim | blockDim
gridItemDim | No equivalent
gridGroupDim | gridDim
groupsize | threads
gridsize | blocks
stream | stream
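To make the mapping concrete, here is a CPU-only sketch of how these quantities typically combine into a linear index. It emulates the launch with ordinary loops and assumes, as the table suggests, that gridsize counts workgroups (like CUDA blocks). The helper `global_index` and the variable names are hypothetical, not part of AMDGPU.jl.

```julia
# 1-based linear index of a workitem, mirroring the usual kernel idiom
# (workgroupIdx - 1) * workgroupDim + workitemIdx.
global_index(workgroup_idx, workgroup_dim, workitem_idx) =
    (workgroup_idx - 1) * workgroup_dim + workitem_idx

N = 1000                      # number of array elements to cover
groupsize = 256               # workitems per workgroup (CUDA: threads)
gridsize = cld(N, groupsize)  # number of workgroups (CUDA: blocks)

# Emulate every workitem of every workgroup and mark the element it owns.
covered = falses(N)
for workgroup_idx in 1:gridsize, workitem_idx in 1:groupsize
    i = global_index(workgroup_idx, groupsize, workitem_idx)
    if i <= N                 # bounds check, as a real kernel would need
        covered[i] = true
    end
end

# Total workitems in the grid: what gridItemDim reports,
# i.e. gridGroupDim multiplied by workgroupDim.
grid_items = gridsize * groupsize
```

Because the grid is rounded up to a whole number of workgroups, it holds more workitems than there are elements, which is why the bounds check is needed.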
diff --git a/dev/streams/index.html b/dev/streams/index.html index 89786285..81b1466c 100644 --- a/dev/streams/index.html +++ b/dev/streams/index.html @@ -9,6 +9,6 @@ x = AMDGPU.stream!(() -> AMDGPU.ones(Float32, 16), stream)
stream = AMDGPU.HIPStream()
 @roc stream=stream kernel(...)

Streams also have an inherent priority, which allows control of kernel submission latency and on-device scheduling preference with respect to kernels submitted on other streams. There are three priorities: normal (the default), low, and high priority.

Priority of the default stream can be set with AMDGPU.priority!. Alternatively, it can be set at stream creation time:

low_prio = HIPStream(:low)
 high_prio = HIPStream(:high)
-normal_prio = HIPStream(:normal) # or just omit "priority"
AMDGPU.stream (Function)
stream()::HIPStream

Get the HIP stream that should be used as the default one for the currently executing task.

source
AMDGPU.stream! (Function)
stream!(s::HIPStream)

Change the default stream to be used within the same Julia task.

source
stream!(f::Base.Callable, stream::HIPStream)

Change the default stream to be used within the same Julia task, execute f and revert to the original stream.

Returns:

Return value of the function f.

source
AMDGPU.priority! (Function)
priority!(p::Symbol)

Change the priority of the default stream. Accepted values are :normal (the default), :low and :high.

source
priority!(f::Base.Callable, priority::Symbol)

Change the priority of the default stream, execute f and revert to the original priority. Accepted values are :normal (the default), :low and :high.

Returns:

Return value of the function f.

source
AMDGPU.HIP.HIPStream (Type)
HIPStream(priority::Symbol = :normal)

Arguments:

  • priority::Symbol: Priority of the stream: :normal, :high or :low.

Create a HIPStream with the given priority. The device is the default device that's currently in use.

source
HIPStream(stream::hipStream_t)

Create a HIPStream from a hipStream_t handle. The device is the default device that's currently in use.

source

Synchronization

AMDGPU.jl by default uses non-blocking stream synchronization with AMDGPU.synchronize to work correctly with TLS and Hostcall.

Users, however, can switch to blocking synchronization globally with the nonblocking_synchronization preference, or per call with AMDGPU.synchronize(; blocking=true). Blocking synchronization might offer slightly lower latency.

You can also synchronize after an expression with the AMDGPU.@sync macro, which executes the given expression and synchronizes afterwards (using AMDGPU.synchronize under the hood).

AMDGPU.@sync begin
+normal_prio = HIPStream(:normal) # or just omit "priority"
AMDGPU.stream (Function)
stream()::HIPStream

Get the HIP stream that should be used as the default one for the currently executing task.

source
AMDGPU.stream! (Function)
stream!(s::HIPStream)

Change the default stream to be used within the same Julia task.

source
stream!(f::Base.Callable, stream::HIPStream)

Change the default stream to be used within the same Julia task, execute f and revert to the original stream.

Returns:

Return value of the function f.

source
AMDGPU.priority! (Function)
priority!(p::Symbol)

Change the priority of the default stream. Accepted values are :normal (the default), :low and :high.

source
priority!(f::Base.Callable, priority::Symbol)

Change the priority of the default stream, execute f and revert to the original priority. Accepted values are :normal (the default), :low and :high.

Returns:

Return value of the function f.

source
AMDGPU.HIP.HIPStream (Type)
HIPStream(priority::Symbol = :normal)

Arguments:

  • priority::Symbol: Priority of the stream: :normal, :high or :low.

Create a HIPStream with the given priority. The device is the default device that's currently in use.

source
HIPStream(stream::hipStream_t)

Create a HIPStream from a hipStream_t handle. The device is the default device that's currently in use.

source
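The stream functions documented above can be combined roughly as follows. This is a sketch only: it assumes a working AMD GPU with AMDGPU.jl installed, and the wrapper `stream_demo` is a hypothetical name. The explicit synchronization before data leaves each scope is a conservative choice for the example, not something the docstrings require.

```julia
using AMDGPU

function stream_demo()
    # Create a high-priority stream and run an allocation on it;
    # stream! restores the previous default stream afterwards.
    high = AMDGPU.HIPStream(:high)
    x = AMDGPU.stream!(() -> AMDGPU.ones(Float32, 16), high)
    AMDGPU.synchronize(high)   # wait for work queued on `high`

    # Temporarily lower the priority of the default stream;
    # priority! restores the original priority afterwards.
    y = AMDGPU.priority!(:low) do
        tmp = AMDGPU.ones(Float32, 16) .* 2f0
        AMDGPU.synchronize()   # finish before the priority is reverted
        tmp
    end

    # Copy the results back to the host.
    return Array(x), Array(y)
end
```

On a machine with a supported GPU, `xs, ys = stream_demo()` returns two host arrays of 16 elements each.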

Synchronization

AMDGPU.jl by default uses non-blocking stream synchronization with AMDGPU.synchronize to work correctly with TLS and Hostcall.

Users, however, can switch to blocking synchronization globally with the nonblocking_synchronization preference, or per call with AMDGPU.synchronize(; blocking=true). Blocking synchronization might offer slightly lower latency.

You can also synchronize after an expression with the AMDGPU.@sync macro, which executes the given expression and synchronizes afterwards (using AMDGPU.synchronize under the hood).

AMDGPU.@sync begin
     @roc ...
-end

Finally, you can perform full device synchronization with AMDGPU.device_synchronize.

AMDGPU.synchronize (Function)
synchronize(stream::HIPStream = stream(); blocking::Bool = false)

Wait until all kernels executing on stream have completed.

If there are running HostCalls, then blocking must be false. Additionally, if you want to stop host calls afterwards, provide the stop_hostcalls=true keyword argument.

source
AMDGPU.@sync (Macro)
@sync ex

Run expression ex on the currently active stream and synchronize the GPU on that stream afterwards.

See also: synchronize.

source
AMDGPU.HIP.device_synchronize (Function)

Blocks until all kernels on all streams have completed. Uses the currently active device.

source
+end

Finally, you can perform full device synchronization with AMDGPU.device_synchronize.

AMDGPU.synchronize (Function)
synchronize(stream::HIPStream = stream(); blocking::Bool = false)

Wait until all kernels executing on stream have completed.

If there are running HostCalls, then blocking must be false. Additionally, if you want to stop host calls afterwards, provide the stop_hostcalls=true keyword argument.

source
AMDGPU.@sync (Macro)
@sync ex

Run expression ex on the currently active stream and synchronize the GPU on that stream afterwards.

See also: synchronize.

source
AMDGPU.HIP.device_synchronize (Function)

Blocks until all kernels on all streams have completed. Uses the currently active device.

source
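The three synchronization options described in this section can be compared side by side. The following is a minimal sketch, assuming a working AMD GPU; the kernel `add_one!` and the wrapper `sync_demo` are hypothetical names introduced for illustration.

```julia
using AMDGPU

# Hypothetical kernel: each workitem increments one element.
function add_one!(x)
    i = workitemIdx().x + (workgroupIdx().x - 1) * workgroupDim().x
    if i <= length(x)
        @inbounds x[i] += 1f0
    end
    return
end

function sync_demo(n = 1024)
    x = AMDGPU.zeros(Float32, n)
    groupsize = 256
    gridsize = cld(n, groupsize)

    # 1. Launch, then wait on the current stream (non-blocking by default).
    @roc groupsize=groupsize gridsize=gridsize add_one!(x)
    AMDGPU.synchronize()

    # 2. Launch inside AMDGPU.@sync, which synchronizes afterwards.
    AMDGPU.@sync begin
        @roc groupsize=groupsize gridsize=gridsize add_one!(x)
    end

    # 3. Launch, then wait for all streams on the active device.
    @roc groupsize=groupsize gridsize=gridsize add_one!(x)
    AMDGPU.device_synchronize()

    # The GPU -> CPU transfer synchronizes implicitly as well.
    return Array(x)
end
```

Passing `blocking=true` to `AMDGPU.synchronize` in step 1 would select blocking synchronization for that call. That is only valid here because the kernel makes no hostcalls.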