ddebb5ddf6
Squashes the entire TurboQuant KV-cache feature branch from https://github.com/TheTom/llama-cpp-turboquant (tip 5aeb2fdbe) onto our master.
Includes: TurboQuant KV-cache types (turbo2_0, turbo3_0, turbo4_0, tq3_1s, tq4_1s), GGML_OP_TURBO_WHT op, CUDA + Metal kernels (including TQ-rotated mul_mm path), CPU reference paths, HIP template instances, perplexity tooling, and 18 post-upstream-sync fixes (CVE-2026-21869 server clamp, HIP FA pool retention, n_head_v reshape, sparse-V CUDA gating, etc.).
Conflict-resolution notes (review carefully before depending on these paths):
- common/arg.cpp, common/speculative.cpp: master's refactored speculative API kept (params.speculative.types / ngram_mod struct, per-sinfo n_low/i_last).
- ggml-cuda/fattn.cu: head-size exclusion lists unioned (now exclude both 192 and 640 alongside other sizes).
- ggml-cuda/ggml-cuda.cu: both master's ADD/SUB/MUL/DIV F16 widening AND TurboQuant's GGML_OP_TURBO_WHT support cases kept.
- ggml-metal-device.h/.cpp: master's new get_pipeline_mul_mv_ext signature (const ggml_tensor * op) kept; TurboQuant's get_pipeline_turbo_wht added.
- ggml-metal-ops.cpp: TurboQuant's TQ-rotated mul_mm path preserved; non-TQ else-branch adapted to master's pipeline.nr0/nr1/nsg dispatch API.
- ggml-vulkan.cpp: master's spec-constant-driven flash_attn pipeline iteration taken (over TurboQuant's CREATE_FA-per-type macro approach). TURBO3_0 added to the fa_kv_ok lambda for type validation.
- ggml-vulkan/flash_attn_base.glsl, vulkan-shaders-gen.cpp: master's new spec-constant FA shader generation kept; TurboQuant's DATA_A_TURBO3_0 macro path NOT carried over.
*** Vulkan TURBO3_0 flash-attention paths need re-implementation against the new spec-constant API. ***
Vulkan TURBO3_0 inference will likely fail until that work is redone.
Squash base: 7fc1c4ef78 (TheTom's last upstream merge point).
TurboQuant on AMD Instinct MI300X & MI355X (ROCm/HIP)
Summary
TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs correctly on AMD Instinct MI300X (gfx942) and MI355X (gfx950). MI300X requires zero code changes. MI355X requires a small patch that adds CDNA4 (gfx950) architecture support in three files: the HIP vendor header, common.cuh, and mma.cuh.
Test Environment
| Component | MI300X | MI355X |
|---|---|---|
| GPU | MI300X (gfx942), 192 GB HBM3 | MI355X (gfx950), 288 GB HBM3e |
| ROCm | 7.0.2 | 7.0.1 |
| Wave Size | 64 | 64 |
| Build | -DAMDGPU_TARGETS="gfx942" | -DAMDGPU_TARGETS="gfx950" |
| Model | Qwen2.5-1.5B Q4_K_M (1.04 GiB) | same |
WHT Kernel Correctness
Standalone roundtrip test (forward WHT → inverse WHT) confirms the Walsh-Hadamard Transform kernel works correctly on HIP with 64-wide wavefronts:
=== TurboQuant WHT Roundtrip Test (HIP/gfx942) ===
Total elements: 512 (4 heads x 128 dim)
Forward WHT zeros: 0 / 512
Roundtrip max error: 2.980232e-07
Roundtrip RMSE: 6.816018e-08
Result: PASS ✅
The kernel uses shared memory + __syncthreads() (no warp shuffles), so it does not depend on the warp width and works unmodified with the 64-thread wavefronts of these GCN-derived CDNA GPUs.
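To make the roundtrip check concrete, here is a minimal host-side sketch of the same idea in plain C++. It is an illustration under stated assumptions, not the HIP kernel: the names `wht_inplace`, `RoundtripStats`, and `wht_roundtrip` are hypothetical, the input data is synthetic, and the orthonormal 1/sqrt(n) scaling is an assumption chosen so that the forward transform is its own inverse. The real kernel's normalization is not stated above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// In-place fast Walsh-Hadamard transform over one head of length n.
// n must be a power of two. Scaling by 1/sqrt(n) (an assumption of this
// sketch) makes the transform orthonormal, so applying it twice returns
// the original input up to floating-point rounding.
void wht_inplace(float * x, std::size_t n) {
    assert(n != 0 && (n & (n - 1)) == 0);
    for (std::size_t h = 1; h < n; h *= 2) {
        for (std::size_t i = 0; i < n; i += 2 * h) {
            for (std::size_t j = i; j < i + h; ++j) {
                const float a = x[j];
                const float b = x[j + h];
                x[j]     = a + b;  // butterfly: sum
                x[j + h] = a - b;  // butterfly: difference
            }
        }
    }
    const float scale = 1.0f / std::sqrt(static_cast<float>(n));
    for (std::size_t i = 0; i < n; ++i) {
        x[i] *= scale;
    }
}

struct RoundtripStats {
    std::size_t forward_zeros;  // exact zeros after the forward pass
    float       max_err;        // max |roundtrip - input|
    float       rmse;           // root-mean-square roundtrip error
};

// Forward WHT, then inverse WHT (the same transform), per head.
RoundtripStats wht_roundtrip(std::size_t n_heads, std::size_t head_dim) {
    const std::size_t total = n_heads * head_dim;
    std::vector<float> ref(total);
    for (std::size_t i = 0; i < total; ++i) {
        ref[i] = std::sin(0.37f * static_cast<float>(i + 1));  // synthetic input
    }
    std::vector<float> buf = ref;

    for (std::size_t h = 0; h < n_heads; ++h) {
        wht_inplace(buf.data() + h * head_dim, head_dim);
    }
    std::size_t zeros = 0;
    for (float v : buf) {
        if (v == 0.0f) ++zeros;
    }
    for (std::size_t h = 0; h < n_heads; ++h) {
        wht_inplace(buf.data() + h * head_dim, head_dim);
    }

    double sum_sq  = 0.0;
    float  max_err = 0.0f;
    for (std::size_t i = 0; i < total; ++i) {
        const float e = std::fabs(buf[i] - ref[i]);
        if (e > max_err) max_err = e;
        sum_sq += static_cast<double>(e) * e;
    }
    return { zeros, max_err, static_cast<float>(std::sqrt(sum_sq / total)) };
}
```

Calling `wht_roundtrip(4, 128)` mirrors the 4 heads x 128 dim shape of the test above. The exact error figures will differ from the GPU run because the input data and summation order are different.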
Performance Results
MI300X (single GPU, Qwen2.5-1.5B Q4_K_M)
| KV Cache | pp512 (tok/s) | tg128 (tok/s) | Prefill vs f16 | Decode vs f16 |
|---|---|---|---|---|
| f16 | 24,453 ± 230 | 181.2 ± 2.0 | baseline | baseline |
| turbo3 | ~25,200 | ~160 | +3% | 88% |
| turbo4 | 25,427 ± 17 | 161.1 ± 0.2 | +4% | 89% |
MI355X (single GPU, Qwen2.5-1.5B Q4_K_M)
| KV Cache | pp512 (tok/s) | tg128 (tok/s) | Prefill vs f16 | Decode vs f16 |
|---|---|---|---|---|
| f16+FA | 40,013 ± 902 | 254.5 ± 1.0 | baseline | baseline |
| turbo3 | 39,140 ± 475 | 162.3 ± 0.1 | 98% | 64% |
| turbo4 | 39,232 ± 508 | 214.1 ± 0.7 | 98% | 84% |
Key Observations
- MI300X prefill is faster with TurboQuant (+3-4%) — less KV cache data to write to HBM.
- MI300X decode at 88-89% of f16 — consistent with Apple Silicon community results.
- MI355X turbo4 decode at 84% — turbo4 outperforms turbo3 in decode due to simpler 4-bit dequant.
- MI355X turbo3 decode at 64% — the 3-bit codebook + sign extraction is more expensive on gfx950.
- MI355X non-FA MMQ path crashes (xf32 MFMA issue) — turbo types force FA and work correctly.
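The "vs f16" columns are plain throughput ratios against the f16 row of the same table. As a small sketch of that arithmetic (the helper name `percent_of_baseline` is hypothetical and only used here for illustration):

```cpp
#include <cassert>
#include <cmath>

// Throughput of a KV-cache type relative to the f16 baseline, expressed as
// a percentage rounded to the nearest integer.
int percent_of_baseline(double tok_s, double baseline_tok_s) {
    assert(baseline_tok_s > 0.0);
    return static_cast<int>(std::lround(100.0 * tok_s / baseline_tok_s));
}
```

For example, the MI300X turbo4 decode figure is `percent_of_baseline(161.1, 181.2)`, and the MI355X turbo3 decode figure is `percent_of_baseline(162.3, 254.5)`. A prefill value above 100 corresponds to the "+N%" entries in the MI300X table.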
Build Instructions
git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
# MI300X (gfx942) — works without code changes
cmake -B build -DGGML_HIP=ON -DCMAKE_BUILD_TYPE=Release -DAMDGPU_TARGETS="gfx942"
cmake --build build --config Release -j
# MI355X (gfx950) — requires CDNA4 define patch (see commit)
cmake -B build -DGGML_HIP=ON -DCMAKE_BUILD_TYPE=Release -DAMDGPU_TARGETS="gfx950"
cmake --build build --config Release -j
# Test
HIP_VISIBLE_DEVICES=0 ./build/bin/llama-bench \
-m model.gguf -ctk turbo3 -ctv turbo3 -ngl 99 -r 3 -p 512 -n 128
Code Changes for gfx950 (MI355X)
Three files modified to add CDNA4 (gfx950) architecture support:
- ggml/src/ggml-cuda/vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- ggml/src/ggml-cuda/common.cuh: Add GGML_CUDA_CC_CDNA4 constant and GGML_CUDA_CC_IS_CDNA4 macro
- ggml/src/ggml-cuda/mma.cuh: Route CDNA4 to compatible MFMA instructions (bf16_1k, i32x16x32_i8, f32x16x4f32; NOT xf32, which doesn't exist on gfx950)
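The first change follows a common preprocessor pattern: map the compiler-provided target macro to a generation macro, then fold each generation into one family macro. The sketch below shows only that pattern. All `SKETCH_*` names and both helper functions are hypothetical stand-ins, not the identifiers or layout of the actual patch, and the constant values added to common.cuh are deliberately not reproduced because they are not given here.

```cpp
#include <cassert>
#include <cstring>

// Per-target generation macros. __gfx950__ and __gfx942__ are defined by
// the AMD device compiler for the matching target; a plain host build
// defines neither of them.
#if defined(__gfx950__)
#define SKETCH_CDNA4
#endif
#if defined(__gfx942__)
#define SKETCH_CDNA3
#endif

// Family macro: any known generation counts as part of the family, so
// code guarded by the family macro also covers the new target.
#if defined(SKETCH_CDNA4) || defined(SKETCH_CDNA3)
#define SKETCH_CDNA
#endif

// Reports which generation this translation unit was compiled for.
const char * sketch_cdna_generation() {
#if defined(SKETCH_CDNA4)
    return "CDNA4";
#elif defined(SKETCH_CDNA3)
    return "CDNA3";
#else
    return "none";  // e.g. a host-only compile
#endif
}

// True when the family macro is set for this translation unit.
bool sketch_is_cdna_family() {
#if defined(SKETCH_CDNA)
    return true;
#else
    return false;
#endif
}
```

The point of the family macro is that existing code paths guarded by it pick up gfx950 automatically, while generation-specific guards (such as the MFMA routing in mma.cuh) can still exclude instructions the new target lacks.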
Known Limitations
- MI355X non-FA MMQ crashes: The default (non-flash-attention) matrix multiply path crashes on gfx950 because the xf32 MFMA instruction (mfma_f32_16x16x8_xf32) is not available. TurboQuant types force flash attention and work correctly. Standard f16/q8_0 KV cache types need the -fa 1 flag on MI355X.
- llama-cli text output: Interactive mode produces empty tokens on ROCm (display issue), but llama-bench confirms computation is correct.
Tested By
Andy Luo (@andyluo7)
- AMD Instinct MI300X (gfx942), ROCm 7.0.2 — April 2026
- AMD Instinct MI355X (gfx950), ROCm 7.0.1 — April 2026