sync : llama.cpp #1160

Merged: 56 commits merged into master from sync-llama.cpp-25-03-27 on Mar 27, 2025
Conversation

ggerganov
Member

No description provided.

danbev and others added 30 commits March 27, 2025 09:06
…247)

This commit updates the compilation of default.metallib to skip the
intermediate .air (Apple Intermediate Representation) file.

The motivation for this change is to simplify the custom command a
little and avoid generating and then removing the .air file.
This patch nudges llama.cpp a bit toward being supported on PoCL, which
doesn't support OpenCL C CL2.0. The issue is solved by querying the
device for the supported OpenCL C versions and using the highest one
available.
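For context, a minimal sketch of that version-query approach, assuming the standard OpenCL host API (the actual llama.cpp code differs in structure and error handling):

```cpp
// Hypothetical sketch: derive -cl-std= from what the device reports
// instead of hard-coding CL2.0, which PoCL does not accept.
#include <CL/cl.h>
#include <cstdio>
#include <string>

static std::string pick_cl_std(cl_device_id dev) {
    char buf[256] = {0};
    // returns a string of the form "OpenCL C <major>.<minor> ..."
    clGetDeviceInfo(dev, CL_DEVICE_OPENCL_C_VERSION, sizeof(buf), buf, NULL);
    int major = 1, minor = 2;
    sscanf(buf, "OpenCL C %d.%d", &major, &minor);
    char opt[32];
    snprintf(opt, sizeof(opt), "-cl-std=CL%d.%d", major, minor);
    return opt; // e.g. "-cl-std=CL1.2" on PoCL, "-cl-std=CL3.0" elsewhere
}
```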
* Fix backend search path

* replace .native() with '/'

* reverted .native()
…per block between host and device code. (llama/12177)

refactor mmqv to unify the calculation of nwarps and rows per block between host and device code.

---------

Co-authored-by: Johannes Gäßler <[email protected]>
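As an illustration of that unification, a single constexpr helper compiled for both host and device removes any chance of the two sides disagreeing; the names and formulas below are made up for the sketch, not the actual mmvq values:

```cpp
// One definition used by both the kernel launch (host) and the kernel
// body (device), so nwarps/rows-per-block can never drift apart.
#if defined(__CUDACC__) || defined(__HIPCC__)
#define MMVQ_HOST_DEVICE __host__ __device__
#else
#define MMVQ_HOST_DEVICE
#endif

MMVQ_HOST_DEVICE constexpr int mmvq_nwarps(int ncols_y) {
    return ncols_y <= 4 ? 4 : 2;  // illustrative values
}

MMVQ_HOST_DEVICE constexpr int mmvq_rows_per_block(int ncols_y) {
    return ncols_y == 1 ? 1 : 2;  // illustrative values
}
```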
* tests: run mul_mat_id with a larger N

* vulkan: fix bug in coopmat1 mul_mat_id
When fattn-wmma was ported over to warp64, various bits that also touch fattn-vec were
converted to a selectable warp size. However, the fattn-vec kernels don't work with
64-wide warps for now, so we need to avoid launching them with parameters for warp64.
…llama/12399)

* sycl : support non-contiguous tensors in binary ops

* sycl : silence unused variable warning

---------

Co-authored-by: Stanisław Szymczyk <[email protected]>
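To make the non-contiguous case concrete, here is a scalar sketch in ggml's ne/nb convention (element counts plus per-dimension byte strides); the SYCL kernel itself is parallel, but the addressing math is the same:

```cpp
#include <cstddef>
#include <cstdint>

struct tensor_view {
    float * data;
    int64_t ne[4]; // elements per dimension
    size_t  nb[4]; // byte stride per dimension
};

// Address an element through the strides rather than a flat index,
// which is what makes non-contiguous (e.g. permuted) tensors work.
static float * elem(const tensor_view & t, int64_t i0, int64_t i1, int64_t i2, int64_t i3) {
    return (float *)((char *) t.data + i0*t.nb[0] + i1*t.nb[1] + i2*t.nb[2] + i3*t.nb[3]);
}

static void add_noncontig(const tensor_view & a, const tensor_view & b, tensor_view & d) {
    for (int64_t i3 = 0; i3 < d.ne[3]; ++i3)
    for (int64_t i2 = 0; i2 < d.ne[2]; ++i2)
    for (int64_t i1 = 0; i1 < d.ne[1]; ++i1)
    for (int64_t i0 = 0; i0 < d.ne[0]; ++i0)
        *elem(d, i0, i1, i2, i3) = *elem(a, i0, i1, i2, i3) + *elem(b, i0, i1, i2, i3);
}
```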
* SYCL: set extras only on GGML_TYPE_Q4_0

* release tensor_extras in reset buffer interface
* cmake: Factor out compiler flag function from ggml

llama.cpp's build requires it, too, and we may want to make use of it
without add_subdirectory(ggml).

* cmake: Enable building against system ggml

This facilitates package maintenance for Linux distributions, where the
libggml library most likely will be shipped as an individual package
upon which a llama.cpp package depends.
…s checking (llama/12273)

* vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking
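The padding trick itself is a one-liner; with N rounded up to the tile size, the coopmat2 inner loop never needs a bounds check on a partial tile:

```cpp
#include <cstddef>

// Round n up to the next multiple of the tile size.
static size_t pad_to(size_t n, size_t multiple) {
    return (n + multiple - 1) / multiple * multiple; // pad_to(1000, 64) == 1024
}
```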
* vulkan: subgroup size test

* Vulkan: Add device architecture enum and logic to recognize AMD generations

* vulkan: use new architecture logic to specify subgroup size

* Initial vulkan subgroup size tuning for RDNA3

* vulkan: commonize RDNA subgroup tuning

* vulkan: override subgroup size if required_subgroup_size = 0

* vulkan: disable warp 32 for RDNA3

* vulkan: fine tuned RDNA1 subgroup sizes

* vulkan: adjusted subgroup size map

* vulkan: fixed RDNA2 subgroup map

---------

Co-authored-by: 0cc4m <[email protected]>
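The shape of the resulting dispatch logic, sketched with illustrative enum names and sizes (the merged table is more detailed, and the tuned values are not reproduced here):

```cpp
#include <cstdint>

enum class vk_device_architecture { OTHER, AMD_RDNA1, AMD_RDNA2, AMD_RDNA3 };

// Returning 0 means "no override": let the driver pick, per the
// "override subgroup size if required_subgroup_size = 0" commit.
static uint32_t required_subgroup_size(vk_device_architecture arch) {
    switch (arch) {
        case vk_device_architecture::AMD_RDNA1: return 64; // illustrative
        case vk_device_architecture::AMD_RDNA2: return 64; // illustrative
        case vk_device_architecture::AMD_RDNA3: return 64; // warp 32 disabled on RDNA3
        default:                                 return 0;
    }
}
```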
It's already found by FindVulkan.cmake in the parent CMakeLists
* Enable CUDA Graph on CTK < 12.x

The `cudaGraphExecUpdate` API changed in CTK 12.x, so CUDA graph support had been disabled on older CUDA toolkits. This change enables CUDA graph support on CTK < 12.x by falling back to the older API there.

* Fix compilation errors with MUSA

* Disable CUDA Graph for MUSA
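The gist of the compatibility shim, as a sketch against the CUDA runtime API (cudaGraphExecUpdate took an error-node/result pair before 12.0 and a cudaGraphExecUpdateResultInfo struct from 12.0 on):

```cpp
#include <cuda_runtime.h>

// Try to update an instantiated graph in place; on failure the caller
// would re-instantiate the graph.
static bool try_graph_update(cudaGraphExec_t exec, cudaGraph_t graph) {
#if CUDART_VERSION >= 12000
    cudaGraphExecUpdateResultInfo info;
    return cudaGraphExecUpdate(exec, graph, &info) == cudaSuccess;
#else
    // pre-12.x signature
    cudaGraphNode_t error_node = nullptr;
    cudaGraphExecUpdateResult result;
    return cudaGraphExecUpdate(exec, graph, &error_node, &result) == cudaSuccess;
#endif
}
```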
* ggml: Add op l2_norm (a scalar sketch follows this commit block)

Signed-off-by: Molly Sophia <[email protected]>

* ggml: Add op rwkv_wkv7

Signed-off-by: Molly Sophia <[email protected]>

* llama: Add support for RWKV7 and ARWKV7 models

Signed-off-by: Molly Sophia <[email protected]>

* llama: fix inference with RWKV6Qwen2

Signed-off-by: Molly Sophia <[email protected]>

* llama: add more (a)rwkv7 variants in size

Signed-off-by: Molly Sophia <[email protected]>

* Apply code-format changes

Signed-off-by: Molly Sophia <[email protected]>

* fix MUSA build

Signed-off-by: Molly Sophia <[email protected]>

* llama: fix shape error with rwkv using llama-parallel

Signed-off-by: Molly Sophia <[email protected]>

---------

Signed-off-by: Molly Sophia <[email protected]>
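For reference, the l2_norm op added above computes the usual L2 row normalization; a scalar sketch (eps guards the all-zero row, and the actual kernels are vectorized per backend):

```cpp
#include <algorithm>
#include <cmath>

static void l2_norm_row(const float * x, float * y, int n, float eps) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += x[i] * x[i];
    }
    const float scale = 1.0f / std::max(std::sqrt(sum), eps); // y = x / max(||x||, eps)
    for (int i = 0; i < n; ++i) {
        y[i] = x[i] * scale;
    }
}
```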
…e option (llama/12371)

* alberto changes

* enable sycl graphs by env variable (see the opt-out sketch after this commit block)

* fixed compilation warnings in ggml-sycl.cpp

* renamed graph variables

* fix markdown in docs/backend/SYCL.md

Co-authored-by: Romain Biessy <[email protected]>

* fix markdown in docs/backend/SYCL.md again

* compiling graphs by default, renamed graph_enable to graph_disable

---------

Co-authored-by: Romain Biessy <[email protected]>
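The opt-out sketch referenced above: after the rename, graphs are compiled by default and an environment variable disables them. The variable name below is an assumption for illustration, not verified against the merged code:

```cpp
#include <cstdlib>

static bool sycl_graphs_enabled() {
    // assumed variable name; set to a non-zero value to disable graphs
    const char * v = std::getenv("GGML_SYCL_DISABLE_GRAPH");
    return v == nullptr || v[0] == '0';
}
```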
lhez and others added 26 commits March 27, 2025 09:06
* opencl: more profiling timing

* opencl: generate trace for profiling (see the trace-writer sketch after this commit block)

* opencl: reduce profiling overhead

* Populate profiling timing info at the end rather than after each
  kernel run

* opencl: fix for chrome tracing
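The trace-writer sketch referenced above: kernel timings dumped as complete ("ph":"X") events in the Chrome Trace Event format, viewable in chrome://tracing or Perfetto. The bookkeeping struct is illustrative:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

struct kernel_timing {
    std::string name;
    uint64_t    start_us; // taken from OpenCL event profiling info
    uint64_t    dur_us;
};

static void write_chrome_trace(const std::vector<kernel_timing> & evs, const char * path) {
    FILE * f = fopen(path, "w");
    fprintf(f, "[\n");
    for (size_t i = 0; i < evs.size(); ++i) {
        fprintf(f, "  {\"name\":\"%s\",\"ph\":\"X\",\"pid\":1,\"tid\":1,\"ts\":%llu,\"dur\":%llu}%s\n",
                evs[i].name.c_str(),
                (unsigned long long) evs[i].start_us,
                (unsigned long long) evs[i].dur_us,
                i + 1 < evs.size() ? "," : "");
    }
    fprintf(f, "]\n");
    fclose(f);
}
```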
I've been seeing significantly worse performance for tg with flash attention
enabled vs disabled, and it seems to be related to the submit heuristic.
Change the heuristic to check how many bytes worth of weight matrix are
used and flush every 100MB, and ramp up after the first few submits.
This seems to resolve the issue, and also increases perf for non-FA a bit.
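A sketch of that heuristic under stated assumptions (only the 100MB figure comes from the text; the ramp schedule and struct names below are illustrative):

```cpp
#include <cstddef>

struct submit_state {
    size_t bytes_since_submit = 0;
    int    submit_count       = 0;
};

// Count the bytes of weight matrices each node reads; flush once the
// running total crosses the threshold. Early submits use smaller
// thresholds so the GPU starts working before the full 100MB accumulates.
static bool should_submit(submit_state & s, size_t weight_bytes) {
    s.bytes_since_submit += weight_bytes;
    const size_t threshold = s.submit_count < 3
        ? (size_t)(16u << 20) << s.submit_count // 16MB, 32MB, 64MB ramp
        : (size_t)(100u << 20);                 // then roughly every 100MB
    if (s.bytes_since_submit >= threshold) {
        s.bytes_since_submit = 0;
        s.submit_count++;
        return true;
    }
    return false;
}
```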
* ci: add visionOS build workflow

Add a new GitHub Actions workflow for building on visionOS with CMake and Xcode.

* ggml: Define _DARWIN_C_SOURCE for visionOS to fix missing u_xxx typedefs

* ci: remove define hacks for u_xxx system types

---------

Co-authored-by: Giovanni Petrantoni <[email protected]>
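The u_xxx fix in miniature: on Darwin platforms, the BSD typedefs in <sys/types.h> are hidden under strict POSIX conformance unless _DARWIN_C_SOURCE is defined before the header is first included:

```cpp
// Must come before the first include of <sys/types.h> in the TU.
#if defined(__APPLE__) && !defined(_DARWIN_C_SOURCE)
#define _DARWIN_C_SOURCE
#endif
#include <sys/types.h> // now provides u_int, u_char, u_short, ...
```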
…a/12183)

- Find out active blocks per SM using cudaOccupancyMaxActiveBlocksPerMultiprocessor API. Use this value to determine the optimal parallel_blocks value.
- Prefer vector flash attention kernels over MMA kernel for BS=1

Fixes Issue: #12182
---------

Co-authored-by: Johannes Gäßler <[email protected]>
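A sketch of the occupancy-driven choice (the kernel symbol, block size, and shared-memory amount are placeholders; the real code feeds these per kernel variant):

```cpp
#include <cuda_runtime.h>

// Ask the runtime how many blocks of this kernel fit per SM, then scale
// by the SM count to size parallel_blocks to the whole device.
template <typename Kernel>
static int pick_parallel_blocks(Kernel k, int block_size, size_t dyn_smem, int device) {
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, k, block_size, dyn_smem);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    return blocks_per_sm * prop.multiProcessorCount;
}
```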
…architecture (llama/12332)

* Add block interleaving support for Q4_K quantization

* Remove whitespaces and fix CI/CD issues

* Update pointer of bsums from int16_t to const int16_t

* Add vector version of quantize_q8_K_4x8 function

* Update code formatting based on review comments
* [SYCL] Fix build on Windows when ccache enabled (llama/9954)

* take effect only on Windows and force it to icl

---------

Co-authored-by: Romain Biessy <[email protected]>
* Vulkan: RTE rounding for cpy to quant (see the scalar RTE sketch after this commit block)

Co-Authored-By: Jeff Bolz <[email protected]>

* remove trailing whitespace

* avoid duplicating pipeline_cpy_f32_quant

* fix copypasting issue

* remove duplicated code

---------

Co-authored-by: Jeff Bolz <[email protected]>
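The scalar RTE sketch referenced above: under the default floating-point environment, nearbyintf rounds to nearest-even, which is what the shader change enforces on the copy-to-quant path (the quantization step is simplified here to a single scale, with clamping omitted):

```cpp
#include <cmath>
#include <cstdint>

static int8_t quantize_rte(float x, float inv_scale) {
    // round-to-nearest-even, unlike a plain (int8_t) cast which truncates
    return (int8_t) std::nearbyintf(x * inv_scale);
}
```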
* tests: add mul_mat perf/functional tests for p021/nc vulkan shaders

* vulkan: Optimize mul_mat_vec p021 and nc shaders.

These shaders are used in attention calculations, and when the KV cache grows
large they start to dominate the run time. For the nc shader (which is called
with large 'k' dimension), use unrolling and vector loads. For the p021 shader
(which is called with large 'm' and small 'k' dimensions), take advantage of
grouped query attention to reuse loads from the A matrix for the whole group,
and reduce the number of workgroups (too much overhead from tiny dispatches).

Using subgroupAdd in the p021 shader also helps, so it is used conditionally.
* musa: refine compute capability

Signed-off-by: Xiaodong Ye <[email protected]>

* Address review comments

Signed-off-by: Xiaodong Ye <[email protected]>

---------

Signed-off-by: Xiaodong Ye <[email protected]>
* ggml : fix quantized cpy op

ggml-ci

* tests : add cpy tests for all types

ggml-ci

* tests : add BF16 copy tests

ggml-ci

* tests : fix loop for same-type copy

ggml-ci

* tests : add option to permute the dst tensor

ggml-ci
The OOB calculation could be wrong if the last iteration was during one of
the unrolled loops. Adjust the unrolling counts to avoid this. Add a couple
of new backend tests that hit this failure on NVIDIA GPUs.
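The pattern behind the fix, in scalar form: run the unrolled body only while a full unroll's worth of elements remains, and let a tail loop cover the remainder so the final iteration cannot index past n:

```cpp
static float sum_unrolled(const float * x, int n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i = 0;
    for (; i + 4 <= n; i += 4) { // full groups of 4 only
        s0 += x[i + 0];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; ++i) {         // remainder: no out-of-bounds reads
        s0 += x[i];
    }
    return (s0 + s1) + (s2 + s3);
}
```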
ggml-cpu : bug fix related to KleidiAI LHS packing

Signed-off-by: Dan Johansson <[email protected]>
* ggml : fix MUL_MAT_ID repack with Q8_K

ggml-ci

* ggml : improve repack templates

ggml-ci
* metal : refactor mat-vec code

ggml-ci

* metal : rename all_sum -> sum_all

ggml-ci

* metal : fix comments [no ci]

* metal : fix nr constant [no ci]

* metal : mv q6_K support nr0 > 1

ggml-ci

* metal : reduce register pressure

ggml-ci

* metal : fix typo [no ci]

* metal : reduce register pressure

ggml-ci
* SYCL: implement memset ggml backend buffer interface

* use GGML_ABORT macro

* Do not wait for all queues to finish for memset operation
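In SYCL 2020 terms, the operation reduces to a single queue::memset, awaited on its own event instead of synchronizing every queue; a simplified sketch of the buffer-interface entry point (names are illustrative):

```cpp
#include <sycl/sycl.hpp>

static void buffer_memset(sycl::queue & q, void * base, size_t offset,
                          int value, size_t size) {
    // wait only for this memset, not for all queues to drain
    q.memset((char *) base + offset, value, size).wait();
}
```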
This change upstreams llamafile's CPU matrix
multiplication kernels for the ppc64le ISA using MMA
builtins. This patch handles matrix multiplication
between quantised datatypes, block_q4_0 and
block_q8_0.

This change results in a 5% - 50% improvement
in total speed (i.e. all tokens/total time) across
various batch sizes.

The patch is tested with Meta-Llama-3-8B,
Mistral-7B, and Llama-2-7B-chat-hf models on an
IBM POWER10 machine.

Signed-off-by: Amrita H S <[email protected]>
@ggerganov ggerganov merged commit 660def0 into master Mar 27, 2025
11 checks passed
@ggerganov ggerganov deleted the sync-llama.cpp-25-03-27 branch March 27, 2025 07:35