ddebb5ddf6
Squashes the entire TurboQuant KV-cache feature branch from https://github.com/TheTom/llama-cpp-turboquant (tip 5aeb2fdbe) onto our master.

Includes: TurboQuant KV-cache types (turbo2_0, turbo3_0, turbo4_0, tq3_1s, tq4_1s), GGML_OP_TURBO_WHT op, CUDA + Metal kernels (including TQ-rotated mul_mm path), CPU reference paths, HIP template instances, perplexity tooling, and 18 post-upstream-sync fixes (CVE-2026-21869 server clamp, HIP FA pool retention, n_head_v reshape, sparse-V CUDA gating, etc.).

Conflict-resolution notes (review carefully before depending on these paths):

- common/arg.cpp, common/speculative.cpp: master's refactored speculative API kept (params.speculative.types / ngram_mod struct, per-sinfo n_low/i_last).
- ggml-cuda/fattn.cu: head-size exclusion lists unioned (now exclude both 192 and 640 alongside other sizes).
- ggml-cuda/ggml-cuda.cu: both master's ADD/SUB/MUL/DIV F16 widening AND TurboQuant's GGML_OP_TURBO_WHT support cases kept.
- ggml-metal-device.h/.cpp: master's new get_pipeline_mul_mv_ext signature (const ggml_tensor * op) kept; TurboQuant's get_pipeline_turbo_wht added.
- ggml-metal-ops.cpp: TurboQuant's TQ-rotated mul_mm path preserved; non-TQ else-branch adapted to master's pipeline.nr0/nr1/nsg dispatch API.
- ggml-vulkan.cpp: master's spec-constant-driven flash_attn pipeline iteration taken (over TurboQuant's CREATE_FA-per-type macro approach). TURBO3_0 added to the fa_kv_ok lambda for type validation.
- ggml-vulkan/flash_attn_base.glsl, vulkan-shaders-gen.cpp: master's new spec-constant FA shader generation kept; TurboQuant's DATA_A_TURBO3_0 macro path NOT carried over.

*** Vulkan TURBO3_0 flash-attention paths need re-implementation against the new spec-constant API. ***
Vulkan TURBO3_0 inference will likely fail until that work is redone.

Squash base: 7fc1c4ef78 (TheTom's last upstream merge point).
github: [TheTom]