CubeCL v0.3.0 Release Notes

This release expands platform support, extends the language, and improves performance. Runtime support now includes AMD GPUs via ROCm/HIP, and a new SPIR-V compiler improves wgpu performance on Vulkan. The CubeCL language supports more Rust syntax and gains compile-time constants, improved generics, enums, and a reworked macro system.

Language Features

Added support for numeric constants by @booti386 in #112
Added for ... in loop syntax for immutable arrays, tensors and slices by @wingertge in #119
Added if as a value expression by @wingertge in #120
Added select (ternary) operations by @wingertge in #152
Implemented support for generic functions in impl blocks by @nathanielsimard in #189
Added support for Enum + Const Match by @nathanielsimard in #145
Added support for numeric match at runtime by @wingertge in #143
Added support for comptime arrays available as runtime constants by @wingertge in #147
Added features for each supported datatype by @wingertge in #193
Reimplemented macro to make writing kernels more ergonomic by @wingertge in #80
Cleaned up the macro and optimized branch operations by @wingertge in #118
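
To give a feel for the syntax these items refer to, here is a minimal sketch in ordinary host Rust. It is not a CubeCL kernel and does not use the cubecl crate: the `Activation` enum, the `select` helper, and the `apply` function are all hypothetical names made up for illustration. It only shows the Rust constructs listed above (for ... in over an immutable slice, if as a value expression, a select-style ternary, and match on an enum) in one place.

```rust
/// Hypothetical enum, used only to illustrate `match` on an enum value.
#[derive(Clone, Copy)]
enum Activation {
    Relu,
    Identity,
}

/// Hypothetical stand-in for a select (ternary) operation: both candidate
/// values are already computed and the condition only picks one of them.
fn select(cond: bool, if_true: f32, if_false: f32) -> f32 {
    if cond { if_true } else { if_false }
}

/// Clamps each element to at most 6.0, then applies the chosen activation.
fn apply(input: &[f32], act: Activation) -> Vec<f32> {
    let mut out = Vec::with_capacity(input.len());
    // `for ... in` over an immutable slice.
    for &x in input {
        // `if` used as a value expression.
        let clamped = if x > 6.0 { 6.0 } else { x };
        // `match` on an enum picks the operation.
        let y = match act {
            Activation::Relu => select(clamped > 0.0, clamped, 0.0),
            Activation::Identity => clamped,
        };
        out.push(y);
    }
    out
}
```

Calling `apply(&[-1.0, 2.0, 8.0], Activation::Relu)` yields `[0.0, 2.0, 6.0]`. How each construct is lowered inside an actual CubeCL kernel is described in the linked pull requests, not by this sketch.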

Runtime Improvements

CUDA

Improved CUDA compiler by @nathanielsimard in #88
Fixed CUDA architecture version by @nathanielsimard in #89
Fixed native vector types by @nathanielsimard in #92
Fixed CUDA support for different ranks by @nathanielsimard in #124
Improved CMMA configuration by @nathanielsimard in #146
Added support for SSA bindings for CUDA by @wingertge in #153
Fixed various CUDA bugs by @nathanielsimard in #168

WGPU

Fixed WGPU memory corruption for CubeCount::Dynamic by @ArthurBrussee in #156
Added support for autotuning on WebGPU and more precise timings by @ArthurBrussee in #167
Fixed overflow when max page == 4GB on WASM by @ArthurBrussee in #194
Merged cubecl-wgpu and cubecl-wgpu-spirv by @wingertge in #184

HIP/ROCm

Added support for ROCm HIP by @syl20bnr in #183
Added half precision support to HIP by @syl20bnr in #201
Limited cubecl-hip to Linux targets only by @syl20bnr in #205

SPIR-V

Added SPIR-V compiler by @wingertge in #155
Fixed casting, powf and alignment for SPIR-V by @wingertge in #188

Optimization & Performance

Added value-based partial redundancy elimination by @wingertge in #169
Added prefetching to into_contiguous by @wingertge in #181
Added block merging by @wingertge in #163
Added round and bitwise or operations by @laggui in #99
Skipped zero initialization of workgroup memory by @ArthurBrussee in #125
CMMA Optimizations:
- CMMA: cube dispatch strategy by @louisfd in #126
- Reuse lhs frag strategy by @louisfd in #132
- Inverted k and n loops by @louisfd in #131
- Continuous warp loading by @louisfd in #138
- Relative warp IDs by @louisfd in #144
- Relaxed the b_m = b_n requirement by @louisfd in #148
- New strategy for num compute planes + many refactors by @louisfd in #150
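
As background for the partial redundancy elimination entry above, the following hand-written Rust sketch shows the general idea of the technique. It is not CubeCL's optimizer pass and makes no claim about how #169 is implemented; `before` and `after` are hypothetical functions written for illustration. In `before`, the product is computed on only one branch and then recomputed after the join, under a different name (`c` holds the same value as `a`, which is what a value-based analysis can detect). In `after`, the computation is inserted on the path where it was missing, so the later recomputation becomes fully redundant and is replaced by a reuse.

```rust
/// Before the transformation: `a * b` is available on only one path
/// into the final expression, so it is *partially* redundant there.
fn before(a: i32, b: i32, flag: bool) -> i32 {
    let mut acc = 0;
    if flag {
        acc = a * b; // computed only when `flag` is true
    }
    let c = a; // same value as `a`, different name
    acc + c * b // recomputed on every path
}

/// After the transformation: the product is computed exactly once on
/// each path and reused at the join point.
fn after(a: i32, b: i32, flag: bool) -> i32 {
    let t;
    let mut acc = 0;
    if flag {
        t = a * b;
        acc = t;
    } else {
        t = a * b; // inserted so the value is available on this path too
    }
    acc + t // reuse instead of recomputing
}
```

Both functions return the same result for every input, but `after` performs one multiplication per call where `before` performs two whenever `flag` is true.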

Infrastructure

Added profiling support by @nathanielsimard in #137
Improved compilation arguments by @nathanielsimard in #141
Added simple benchmarking capabilities by @jbelanich in #190
Added periodic memory cleanup by @ArthurBrussee in #178
Reworked memory management and added ExclusivePages as a memory management option by @ArthurBrussee in #158
Fixed concurrency problems with autotune by @nathanielsimard in #200
Improved timing methods for benchmarking by @jbelanich in #190
Fixed CI for Rust 1.82 by @nathanielsimard in #182
Migrated xtask to tracel-xtask by @syl20bnr in #93
Updated CI workflow and badges by @syl20bnr in #96