shahondin1624 ddebb5ddf6 turboquant: squash-merge TheTom/llama-cpp-turboquant feature/turboquant-kv-cache

Squashes the entire TurboQuant KV-cache feature branch from
https://github.com/TheTom/llama-cpp-turboquant (tip 5aeb2fdbe) onto our master.

Includes: TurboQuant KV-cache types (turbo2_0, turbo3_0, turbo4_0, tq3_1s,
tq4_1s), GGML_OP_TURBO_WHT op, CUDA + Metal kernels (including TQ-rotated
mul_mm path), CPU reference paths, HIP template instances, perplexity tooling,
and 18 post-upstream-sync fixes (CVE-2026-21869 server clamp, HIP FA pool
retention, n_head_v reshape, sparse-V CUDA gating, etc.).

Conflict-resolution notes (review carefully before depending on these paths):

- common/arg.cpp, common/speculative.cpp: master's refactored speculative API
  kept (params.speculative.types / ngram_mod struct, per-sinfo n_low/i_last).

- ggml-cuda/fattn.cu: head-size exclusion lists unioned (now exclude both 192
  and 640 alongside other sizes).

- ggml-cuda/ggml-cuda.cu: both master's ADD/SUB/MUL/DIV F16 widening AND
  TurboQuant's GGML_OP_TURBO_WHT support cases kept.

- ggml-metal-device.h/.cpp: master's new get_pipeline_mul_mv_ext signature
  (const ggml_tensor * op) kept; TurboQuant's get_pipeline_turbo_wht added.

- ggml-metal-ops.cpp: TurboQuant's TQ-rotated mul_mm path preserved; non-TQ
  else-branch adapted to master's pipeline.nr0/nr1/nsg dispatch API.

- ggml-vulkan.cpp: master's spec-constant-driven flash_attn pipeline iteration
  taken (over TurboQuant's CREATE_FA-per-type macro approach). TURBO3_0 added
  to the fa_kv_ok lambda for type validation.

- ggml-vulkan/flash_attn_base.glsl, vulkan-shaders-gen.cpp: master's new
  spec-constant FA shader generation kept; TurboQuant's DATA_A_TURBO3_0 macro
  path NOT carried over. *** Vulkan TURBO3_0 flash-attention paths need
  re-implementation against the new spec-constant API. *** Vulkan TURBO3_0
  inference will likely fail until that work is redone.

Squash base: 7fc1c4ef78 (TheTom's last upstream merge point).

2026-05-19 15:13:49 +02:00

TurboQuant on AMD Instinct MI300X & MI355X (ROCm/HIP)

Summary

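TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs correctly on AMD Instinct MI300X (gfx942) and MI355X (gfx950). MI300X requires zero code changes. MI355X requires a small patch that adds CDNA4 (gfx950) architecture support, starting with new arch defines in the HIP vendor header (three files in total, listed under "Code Changes for gfx950").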

Test Environment

Component	MI300X	MI355X
GPU	MI300X (gfx942), 192 GB HBM3	MI355X (gfx950), 288 GB HBM3e
ROCm	7.0.2	7.0.1
Wave Size	64	64
Build	`-DAMDGPU_TARGETS="gfx942"`	`-DAMDGPU_TARGETS="gfx950"`
Model	Qwen2.5-1.5B Q4_K_M (1.04 GiB)	same

WHT Kernel Correctness

Standalone roundtrip test (forward WHT → inverse WHT) confirms the Walsh-Hadamard Transform kernel works correctly on HIP with 64-wide wavefronts:

=== TurboQuant WHT Roundtrip Test (HIP/gfx942) ===
Total elements: 512 (4 heads x 128 dim)
Forward WHT zeros: 0 / 512
Roundtrip max error: 2.980232e-07
Roundtrip RMSE:      6.816018e-08
Result: PASS ✅

The kernel uses shared memory and __syncthreads() rather than warp shuffles, so it does not depend on the warp width and runs unmodified on AMD's 64-thread wavefronts.
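The roundtrip check above can be approximated on the CPU with a short reference sketch. It assumes the orthonormal form of the Walsh-Hadamard transform (scaled by 1/sqrt(n)), which is its own inverse; the actual GGML_OP_TURBO_WHT kernel may use a different scaling or additional sign flips. The names `wht` and `roundtrip_errors` are hypothetical and do not come from the repository. Python floats are 64-bit, so the errors it reports will be smaller than the float32 figures in the log.

```python
import math
import random


def wht(vec):
    """Orthonormal Walsh-Hadamard transform; len(vec) must be a power of two."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        # Butterfly stage: combine elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # assumed normalization, makes the transform self-inverse
    return [x * scale for x in out]


def roundtrip_errors(n_heads=4, head_dim=128, seed=0):
    """Apply the transform twice per head and return (max abs error, RMSE)."""
    rng = random.Random(seed)
    max_err = 0.0
    sq_sum = 0.0
    for _ in range(n_heads):
        x = [rng.gauss(0.0, 1.0) for _ in range(head_dim)]
        y = wht(wht(x))  # forward, then inverse (same operation under this scaling)
        for a, b in zip(x, y):
            err = abs(a - b)
            max_err = max(max_err, err)
            sq_sum += err * err
    return max_err, math.sqrt(sq_sum / (n_heads * head_dim))


if __name__ == "__main__":
    max_err, rmse = roundtrip_errors()
    print(f"Roundtrip max error: {max_err:e}")
    print(f"Roundtrip RMSE:      {rmse:e}")
```

The default arguments mirror the 4 heads x 128 dim layout of the test above. A GPU-side check would compare kernel output against a reference like this one with a float32 tolerance.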

Performance Results

MI300X (single GPU, Qwen2.5-1.5B Q4_K_M)

KV Cache	pp512 (tok/s)	tg128 (tok/s)	Prefill vs f16	Decode vs f16
f16	24,453 ± 230	181.2 ± 2.0	baseline	baseline
turbo3	~25,200	~160	+3%	88%
turbo4	25,427 ± 17	161.1 ± 0.2	+4%	89%

MI355X (single GPU, Qwen2.5-1.5B Q4_K_M)

KV Cache	pp512 (tok/s)	tg128 (tok/s)	Prefill vs f16	Decode vs f16
f16+FA	40,013 ± 902	254.5 ± 1.0	baseline	baseline
turbo3	39,140 ± 475	162.3 ± 0.1	98%	64%
turbo4	39,232 ± 508	214.1 ± 0.7	98%	84%

Key Observations

MI300X prefill is 3-4% faster with TurboQuant, plausibly because less KV cache data is written to HBM. MI355X does not show this gain (98% of the f16+FA baseline).
MI300X decode at 88-89% of f16 — consistent with Apple Silicon community results.
MI355X turbo4 decode at 84% — turbo4 outperforms turbo3 in decode due to simpler 4-bit dequant.
MI355X turbo3 decode at 64% — the 3-bit codebook + sign extraction is more expensive on gfx950.
MI355X non-FA MMQ path crashes (xf32 MFMA issue) — turbo types force FA and work correctly.
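The percentages quoted in these observations can be recomputed from the mean throughput values in the two tables. The sketch below is only a sanity check; the dictionary layout and the function name `relative_percent` are hypothetical, and the approximate MI300X turbo3 row is left out because the table gives only rounded figures for it.

```python
# Mean throughput in tok/s copied from the tables above: (pp512, tg128).
RESULTS = {
    "MI300X": {"f16": (24453, 181.2), "turbo4": (25427, 161.1)},
    "MI355X": {
        "f16+FA": (40013, 254.5),
        "turbo3": (39140, 162.3),
        "turbo4": (39232, 214.1),
    },
}
# Baseline row used for each GPU.
BASELINE = {"MI300X": "f16", "MI355X": "f16+FA"}


def relative_percent(gpu, kv_type):
    """Return (prefill %, decode %) of kv_type relative to that GPU's baseline."""
    base_pp, base_tg = RESULTS[gpu][BASELINE[gpu]]
    pp, tg = RESULTS[gpu][kv_type]
    return round(100 * pp / base_pp), round(100 * tg / base_tg)


if __name__ == "__main__":
    for gpu, rows in RESULTS.items():
        for kv_type in rows:
            prefill, decode = relative_percent(gpu, kv_type)
            print(f"{gpu} {kv_type}: prefill {prefill}%, decode {decode}%")
```

The comparison uses the mean values only and ignores the ± spread, so differences of 1-2% between cache types may fall within run-to-run noise.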

Build Instructions

git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache

# MI300X (gfx942): works without code changes
cmake -B build -DGGML_HIP=ON -DCMAKE_BUILD_TYPE=Release -DAMDGPU_TARGETS="gfx942"
cmake --build build --config Release -j

# MI355X (gfx950): requires the CDNA4 define patch (see "Code Changes for gfx950" below)
cmake -B build -DGGML_HIP=ON -DCMAKE_BUILD_TYPE=Release -DAMDGPU_TARGETS="gfx950"
cmake --build build --config Release -j

# Test
HIP_VISIBLE_DEVICES=0 ./build/bin/llama-bench \
  -m model.gguf -ctk turbo3 -ctv turbo3 -ngl 99 -r 3 -p 512 -n 128
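To compare several KV cache types in one pass, the test command above can be generated per type. The sketch below only builds and prints the command lines; it uses the flags shown in this document (-m, -ctk, -ctv, -ngl, -r, -p, -n, and -fa 1 from Known Limitations). The helper names `bench_command` and `sweep` are hypothetical, and adding -fa 1 to every non-turbo type is an assumption based on the MI355X note, not something the benchmark requires on MI300X.

```python
import shlex


def bench_command(model_path, kv_type, bench_bin="./build/bin/llama-bench", force_fa=False):
    """Build the llama-bench argument list for one KV cache type."""
    cmd = [
        bench_bin,
        "-m", model_path,
        "-ctk", kv_type,
        "-ctv", kv_type,
        "-ngl", "99",
        "-r", "3",
        "-p", "512",
        "-n", "128",
    ]
    if force_fa:
        # Assumption: request flash attention explicitly for non-turbo types.
        cmd += ["-fa", "1"]
    return cmd


def sweep(model_path, kv_types=("f16", "turbo3", "turbo4")):
    """Return one command per KV cache type; turbo types are left without -fa."""
    return [
        bench_command(model_path, t, force_fa=not t.startswith("turbo"))
        for t in kv_types
    ]


if __name__ == "__main__":
    for cmd in sweep("model.gguf"):
        print("HIP_VISIBLE_DEVICES=0 " + shlex.join(cmd))
```

Printing the commands instead of executing them keeps the sketch side-effect free; the output can be pasted into a shell or passed to `subprocess.run` once the binary path is confirmed.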

Code Changes for gfx950 (MI355X)

Three files modified to add CDNA4 (gfx950) architecture support:

ggml/src/ggml-cuda/vendors/hip.h — Add CDNA4 define for __gfx950__, include in CDNA family
ggml/src/ggml-cuda/common.cuh — Add GGML_CUDA_CC_CDNA4 constant and GGML_CUDA_CC_IS_CDNA4 macro
ggml/src/ggml-cuda/mma.cuh — Route CDNA4 to compatible MFMA instructions (bf16_1k, i32x16x32_i8, f32x16x4f32 — NOT xf32 which doesn't exist on gfx950)

Known Limitations

MI355X non-FA MMQ crashes: The default (non-flash-attention) matrix multiply path crashes on gfx950 because it uses the xf32 MFMA instruction (mfma_f32_16x16x8_xf32), which is not available there. TurboQuant types force flash attention and work correctly. Standard f16/q8_0 KV cache types need the -fa 1 flag on MI355X.
llama-cli text output: Interactive mode produces empty tokens on ROCm (display issue), but llama-bench confirms computation is correct.

Tested By

Andy Luo (@andyluo7)

AMD Instinct MI300X (gfx942), ROCm 7.0.2 — April 2026
AMD Instinct MI355X (gfx950), ROCm 7.0.1 — April 2026
