ddebb5ddf6
Squashes the entire TurboQuant KV-cache feature branch from https://github.com/TheTom/llama-cpp-turboquant (tip5aeb2fdbe) onto our master. Includes: TurboQuant KV-cache types (turbo2_0, turbo3_0, turbo4_0, tq3_1s, tq4_1s), GGML_OP_TURBO_WHT op, CUDA + Metal kernels (including TQ-rotated mul_mm path), CPU reference paths, HIP template instances, perplexity tooling, and 18 post-upstream-sync fixes (CVE-2026-21869 server clamp, HIP FA pool retention, n_head_v reshape, sparse-V CUDA gating, etc.). Conflict-resolution notes (review carefully before depending on these paths): - common/arg.cpp, common/speculative.cpp: master's refactored speculative API kept (params.speculative.types / ngram_mod struct, per-sinfo n_low/i_last). - ggml-cuda/fattn.cu: head-size exclusion lists unioned (now exclude both 192 and 640 alongside other sizes). - ggml-cuda/ggml-cuda.cu: both master's ADD/SUB/MUL/DIV F16 widening AND TurboQuant's GGML_OP_TURBO_WHT support cases kept. - ggml-metal-device.h/.cpp: master's new get_pipeline_mul_mv_ext signature (const ggml_tensor * op) kept; TurboQuant's get_pipeline_turbo_wht added. - ggml-metal-ops.cpp: TurboQuant's TQ-rotated mul_mm path preserved; non-TQ else-branch adapted to master's pipeline.nr0/nr1/nsg dispatch API. - ggml-vulkan.cpp: master's spec-constant-driven flash_attn pipeline iteration taken (over TurboQuant's CREATE_FA-per-type macro approach). TURBO3_0 added to the fa_kv_ok lambda for type validation. - ggml-vulkan/flash_attn_base.glsl, vulkan-shaders-gen.cpp: master's new spec-constant FA shader generation kept; TurboQuant's DATA_A_TURBO3_0 macro path NOT carried over. *** Vulkan TURBO3_0 flash-attention paths need re-implementation against the new spec-constant API. *** Vulkan TURBO3_0 inference will likely fail until that work is redone. Squash base:7fc1c4ef78(TheTom's last upstream merge point).
16 lines
527 B
JSON
16 lines
527 B
JSON
{
|
|
"model": "/tmp/qwen2.5-7b-instruct-tq4_1s.gguf",
|
|
"bench_args": "-p 0 -n 128",
|
|
"tg128": 69.2,
|
|
"ppl": 7.599,
|
|
"ppl_threshold": 0.1,
|
|
"ppl_file": "/mnt/ai/data/wikitext-2-raw/wiki.test.raw",
|
|
"target_file": "ggml/src/ggml-cuda/mmvq-tq.cu",
|
|
"no_convert": true,
|
|
"coherence_prompts": [
|
|
{"prompt": "What is the capital of France? One word.", "expect": "Paris"},
|
|
{"prompt": "What is 2+2? Just the number.", "expect": "4"},
|
|
{"prompt": "Who wrote Romeo and Juliet? One name.", "expect": "Shakespeare"}
|
|
]
|
|
}
|