Squashes 8 commits from https://github.com/am17an/llama.cpp.git mtp-clean
adding Multi-Token Prediction speculative decoding support, primarily for
Qwen3.5 / Qwen3.6 models with native MTP heads.
Mode: COMMON_SPECULATIVE_TYPE_DRAFT_MTP (CLI: --draft-mtp).
Per PR: ~1.8-2x speedup with ~75% draft acceptance using 3 draft tokens.
Currently requires --parallel 1.
Files touched: 26 (1319 insertions, 136 deletions). Hot spots:
- src/models/qwen35.cpp, qwen35moe.cpp — MTP head integration
- common/speculative.{cpp,h} — new MTP draft path
- src/llama-{context,model,memory}.* — MTP-specific KV cache
- convert_hf_to_gguf.py — converter writes MTP head tensors
Merge applied cleanly against current master (TurboQuant tip 7161dee3f).
Source branch HEAD: e7b484815 (add need_embd in speculative).
Note: upstream PR remains open as of this merge; performance regressions
were under discussion. Watch upstream for the final API.
Two fixes surfaced by running the full test suite against the squash-merged
turboquant branch, plus one CMake registration.
1. ggml-cuda/ggml-cuda.cu (GET_ROWS supports_op)
Removed TQ3_1S/TQ4_1S from the CUDA/HIP GET_ROWS supports_op switch.
TheTom's branch advertised these as supported but never added the matching
cases to getrows.cu — a latent bug present on both his branch and master.
master's test-backend-ops triggers it; the scheduler will now route
get_rows on TQ types to CPU.
2. ggml-cuda/fattn.cu (HIP head-size gate)
Master's get_best_fattn_kernel falls through to BEST_FATTN_KERNEL_TILE as
default. On HIP, fattn-tile.cu only instantiates head sizes 64, 128, 256,
320, 512 (576/640 exceed local memory limits per #ifndef GGML_USE_HIP).
Without this gate, supports_op returns true for unsupported sizes and the
dispatch aborts. Now returns BEST_FATTN_KERNEL_NONE on HIP for head sizes
the tile kernel cannot compile, letting the scheduler fall back to CPU.
3. tests/CMakeLists.txt (test-turbo-quant registration)
TheTom added tests/test-turbo-quant.c (CPU round-trip diagnostic for
turbo3/turbo4 quant→dequant→inverse-WHT) but never wired it into the
build. Registered as a ctest entry linked against ggml + libm.
Test status with these fixes:
- CPU (build-cpu): 51/51 ctest pass, including new test-turbo-quant.
- HIP (build-hip, gfx1151): 50/50 ctest pass with GGML_CUDA_DISABLE_GRAPHS=1
and test-backend-ops excluded. test-backend-ops itself runs 13674/13677
internal cases; the 3 remaining failures (CLAMP f16 → inf, bf16 FA graph
capture) are pre-existing master-side regressions on RDNA3.5+HIP that
reproduce on plain master and are unrelated to TurboQuant.
Squashes the entire TurboQuant KV-cache feature branch from
https://github.com/TheTom/llama-cpp-turboquant (tip 5aeb2fdbe) onto our master.
Includes: TurboQuant KV-cache types (turbo2_0, turbo3_0, turbo4_0, tq3_1s,
tq4_1s), GGML_OP_TURBO_WHT op, CUDA + Metal kernels (including TQ-rotated
mul_mm path), CPU reference paths, HIP template instances, perplexity tooling,
and 18 post-upstream-sync fixes (CVE-2026-21869 server clamp, HIP FA pool
retention, n_head_v reshape, sparse-V CUDA gating, etc.).
Conflict-resolution notes (review carefully before depending on these paths):
- common/arg.cpp, common/speculative.cpp: master's refactored speculative API
kept (params.speculative.types / ngram_mod struct, per-sinfo n_low/i_last).
- ggml-cuda/fattn.cu: head-size exclusion lists unioned (now exclude both 192
and 640 alongside other sizes).
- ggml-cuda/ggml-cuda.cu: both master's ADD/SUB/MUL/DIV F16 widening AND
TurboQuant's GGML_OP_TURBO_WHT support cases kept.
- ggml-metal-device.h/.cpp: master's new get_pipeline_mul_mv_ext signature
(const ggml_tensor * op) kept; TurboQuant's get_pipeline_turbo_wht added.
- ggml-metal-ops.cpp: TurboQuant's TQ-rotated mul_mm path preserved; non-TQ
else-branch adapted to master's pipeline.nr0/nr1/nsg dispatch API.
- ggml-vulkan.cpp: master's spec-constant-driven flash_attn pipeline iteration
taken (over TurboQuant's CREATE_FA-per-type macro approach). TURBO3_0 added
to the fa_kv_ok lambda for type validation.
- ggml-vulkan/flash_attn_base.glsl, vulkan-shaders-gen.cpp: master's new
spec-constant FA shader generation kept; TurboQuant's DATA_A_TURBO3_0 macro
path NOT carried over. *** Vulkan TURBO3_0 flash-attention paths need
re-implementation against the new spec-constant API. *** Vulkan TURBO3_0
inference will likely fail until that work is redone.
Squash base: 7fc1c4ef78 (TheTom's last upstream merge point).
* opencl: add q5_0 moe support
* opencl: add q5_1 moe support
* opencl: avoid potential leak
* opencl: suppress unused var warning when building for non-Adreno
---------
Co-authored-by: Li He <lih@qti.qualcomm.com>
* server, webui: accept continue_final_message flag for vLLM API compat
Add the continue_final_message body flag from the vLLM and transformers
API. When set together with add_generation_prompt false, it triggers the
existing prefill_assistant code path, regardless of the server side
opt.prefill_assistant option. Mutual exclusion with add_generation_prompt
true is enforced, matching vLLM behavior.
WebUI sends continue_final_message and add_generation_prompt false on
the Continue button, with the matching opt in option on the chat service.
Pure API alignment, no change to the prefill logic itself. Paves the way
for the upcoming per-template prefill plumbing in common/chat.
* test: add coverage for continue_final_message vLLM compat flag
Two cases on top of the existing assistant prefill coverage. First,
continue_final_message true with add_generation_prompt false produces
the same rendered prompt as the prefill_assistant heuristic, proving
the new flag is a correct alias of the existing path. Second, both
flags set to true is rejected with HTTP 400, matching the
vLLM/transformers mutual exclusion contract.
* chore: update webui build output
* server, webui : support continue generation on reasoning models (#22727)
Remove the throw blocking assistant prefill on reasoning models and
orchestrate thinking tags around the prefilled message so the parser
routes the next stream chunks correctly. WebUI drops the reasoning
guard on the Continue button, sends reasoning_content with the
prefilled message and persists partial reasoning on stop so the CoT
survives reload and resume.
Scope : templates with a simple thinking_start_tag / thinking_end_tag
pair. Channel-based templates like GPT-OSS are out of scope, pending
a per-template prefill API in common/chat.
First step toward #21754.
* chore: update webui build output
* server: reject reasoning prefill on channel based templates
* ggml-zendnn : add runtime env var GGML_ZENDNN_ADAPTIVE_FALLBACK to control adaptive fallback (default: enabled)
* ggml-zendnn : restore original fallback logic when adaptive fallback is disabled
* hexagon: add hvx_vec_repl helpers and use those for splat-from-vtcm usecase
* hmx-mm: optimize per-group scale handling
* hmx-fa: optimize slope load from vtcm
* hmx-fa: use aligned access where possible in hmx-utils
* hexagon: add hvx_vec_repl_2x_f16 helper and consolidate repl helpers
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
* fix(mixed-types): use f32 for precision and update the shared memory calculation logic for f32
* fix(unary): correct the gelu, gelu quick and gelu erf functions
* fix(flash-attn-tile): fix the hardcode v type
* fix(flash_attn): fix tile path
* fix: pass editorconfig and address the type conflicts
* fix: remove reduant pipeline keys
* fix: remove inline min/max group size functions and revert the flash attn path order
* fix: use clamp to avoid NaN for GELU
* fix: use the right range for exp, 80 is safer for f32 exp
* model-conversion : add causal-convert-mmproj target [no ci]
This commit adds a new Make target that only converts the mmproj model.
The motivation for this that the causal-convert-mm-model target will
convert both the test model and the mmproj model which is nice when the
model model conversion is finalized. But during development it is nice
to be able to just convert the mmproj model and not have to wait for
the often more time consuming text model conversion.
* add path model path validation check
* working llama-eval mc and math suite
* multi source llama-eval
* Add readme
* add checkpointing
* examples: add llama-server simulator for testing eval scripts
Add a standalone Python script that simulates a llama-server HTTP endpoint
for testing the eval script. The simulator:
- Implements /v1/chat/completions endpoint with OpenAI-compatible format
- Loads AIME dataset from HuggingFace with local caching
- Uses Levenshtein distance for intelligent question matching
- Supports configurable success rate for correct/wrong answer generation
- Provides debug logging for troubleshooting
Also includes test scripts and documentation for testing and understanding
the simulator functionality.
* examples: refactor test-simulator.sh for better readability
Extract repeating question string into TEST_QUESTION variable and
create make_request() helper function to reduce code duplication.
Add proper error handling for error responses.
* docs: update llama-eval-discussion.md with session work summary
Add summary of llama-server-simulator implementation work including
features, testing results, technical decisions, and refactoring.
* examples: add simplified llama-eval-new.py for AIME evaluation
- Create new simplified evaluation script focused only on AIME
- Implement EvalState and Processor dataclasses for structured state management
- Add real-time feedback showing correct/incorrect status per case
- Abstract grading interface for external grader support
- Use structured JSON output for eval state
- Apply HuggingFace dataset caching to avoid repeated downloads
- Remove Levenshtein matching - eval script only sends requests and validates answers
* docs: remove README.md from llama-eval
* examples: implement flexible grader system for answer validation
- Add Grader class supporting regex and CLI-based grading
- Implement built-in regex patterns for AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande
- Add CLI grader interface: python script.py --answer <pred> --expected <gold>
- Add HF telemetry disable to avoid warnings
- Support exact match requirement for regex patterns
- Add 30-second timeout for CLI grader
- Handle both boxed and plain text formats for AIME answers
* examples: use HF_HUB_OFFLINE to avoid HF Hub warnings
* examples: remove HF_HUB_OFFLINE to allow dataset download
* examples: use cached dataset path to avoid HF Hub requests
* examples: use cached dataset path in simulator to avoid HF Hub requests
* docs: update llama-eval-discussion.md with session work summary
* examples: add threading support and model parameter to llama-eval-new.py
- Add ThreadPoolExecutor for parallel request processing controlled by --threads
- Add --model argument to specify model name in request data
- Refactor process() to use thread-safe _process_single_case() method
- Update progress tracking to work with concurrent execution
* docs: update llama-eval-discussion.md with threading and model parameter updates
- Add threading support implementation details
- Document ThreadPoolExecutor usage and thread safety
- Add model parameter implementation details
- Include testing results for both features
* examples: add task summary table to llama-eval-new.py
* eval : print progress
* eval : add prompts
* test : fix path
* sim : fix answer matching
* eval : support multiple dataset runs
* minor
* improve grader
* docs
* remove old files
* datasets : add gsm8k
* add gpqa + sampling + docs
* rename
* grader : improve example answers
* cont
* datasets : add aime2025
* grader : update prompt
* grade : improve regex + logs
* datasets : fix aime2025
* cleanup
* add AGENTS.md
* ignore errors
* resume eval
* cleanup
* fix counts
* simplify
* fix prompts
* add html
* store full response
* add tokens
* resoning and error handling
* refactor
* track total time
* remove junk
* eval : unify "judge" terminology to "grader"
Replace all occurrences of "judge" with "grader" for consistency
across the codebase (CLI args, Grader class fields, help text).
Assisted-by: llama.cpp:local pi
* eval : add Wilson score confidence interval to results
Compute 95% CI on-the-fly from completed cases. Displayed in
terminal output, HTML report, and JSON state.
* llama-eval : add per-task generation speed from server timings
Extract predicted_per_second from the server timings response and store
it as tps_gen per task. Display in console progress, print_all_tasks,
and HTML report.
Assisted-by: llama.cpp:local pi
* llama-eval : add per-task generation time from server timings
Extract predicted_ms from the server timings response and store it as
t_gen_ms per task. Display in seconds with one decimal digit in console
progress, print_all_tasks, and HTML report.
Assisted-by: llama.cpp:local pi
* llama-eval : rename display, escaped, and count variables to use prefix convention
- _display suffix → display_ prefix (answer, tokens, tps, t_gen)
- _escaped suffix → escaped_ prefix (response, prompt, reasoning)
- _count suffix → n_ prefix (correct, incorrect, pending)
Assisted-by: llama.cpp:local pi
* llama-eval : support multiple evaluation endpoints with dynamic task distribution
- Add ServerConfig dataclass (url, threads, name)
- Accept comma-separated --server, --threads, --server-name CLI args
- Dynamic shared-queue task distribution across servers (fast servers do more work)
- One ThreadPoolExecutor per server, workers pull from shared Queue
- Track which server processed each task (server_name in results)
- Thread-safe EvalState with threading.Lock for concurrent mutations
- Server column in HTML report and console output
- Backward compatible: single server works as before
Assisted-by: llama.cpp:local pi
* llama-server-simulator : replace Flask with stdlib http.server
- Use HTTPServer + BaseHTTPRequestHandler instead of Flask
- RequestHandler handles POST /v1/chat/completions
- Server runs in daemon thread with clean Ctrl+C shutdown
- Remove flask and unused asdict imports
Assisted-by: llama.cpp:local pi
* llama-eval : update README with PR link and quick-start examples
Assisted-by: llama.cpp:local pi
* llama-eval : track model name in eval state and verify on resume
- Store model_name in EvalState and JSON output
- Display model in HTML summary table
- Verify --model matches stored model when resuming
Assisted-by: llama.cpp:local pi
* llama-server-simulator : fix comment - Dice coefficient, not Levenshtein
Assisted-by: llama.cpp:local pi
* llama-eval : require --grader-model or --model when using --grader-type llm
Assisted-by: llama.cpp:local pi
* llama-eval : protect dump() with lock for thread safety
Assisted-by: llama.cpp:local pi
* llama-eval : compact HTML report output
- Replace verbose summary table with single inline bar
- Shorten status text: '✓'/'✗'/'–'/'!' instead of full words
- Flatten CSS: remove box-shadows, border-radius, reduce padding
- Use system-ui font, 13px table, 12px details
- Conditional reasoning section (only shown when present)
- Single toggle JS function instead of two
- Shorter column headers
Assisted-by: llama.cpp:local pi
* llama-eval : check server connectivity on startup
- Hit /v1/models for each server before evaluation
- Exit with error if any server is unreachable
- Print comma-separated model IDs per server in startup output
- Sequential checks, no retries, no timeout override
Assisted-by: llama.cpp:local pi
* llama-eval : use server1/server2 instead of gpu1/gpu2 in README
Assisted-by: llama.cpp:local pi
---------
Co-authored-by: gatbontonpc <gatbontonpc@gmail.com>
* convert : add split() method to LoraTorchTensor
* Fix python type-check
* Fix flake8 Lint
* fix: handle positional dim arg in torch.split dispatch
* Fix type-check again
* Fix type-checks
* Remove unit test per reviewers feedback
* work around ty deficiency
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Q4_1 MoE CLC pass sanity check
* remove unnecessary code
* opencl: remove unnecessary asserts and reformat
* opencl: fix supports_op for q4_1 moe
* q4_1 moe is supported by Adreno with certain shapes
---------
Co-authored-by: Li He <lih@qti.qualcomm.com>
`im2col_cuda` and `im2col_3d_cuda` both dispatch with
`block_nums.y = OW`. CUDA caps grid Y at 65535. Conv1d encoders on
raw 16 kHz audio with T > 65535 (~ 4 s) trip the limit -- e.g. SEANet
at 11 s lands at OW = 176000 -- and the launch returns
`invalid configuration argument`.
Clamp `block_nums.y` to `MIN(OW, MAX_GRIDDIM_Y)` and loop inside the
kernel with stride `MAX_GRIDDIM_Y`. Same in-kernel stride pattern
already used for the z axis (`MAX_GRIDDIM_Z`). Both 2D `im2col_kernel`
and 3D `im2col_3d_kernel` need the same fix. Bit-identical for
OW <= 65535 (single iteration of the new outer loop).
Tested on T4 / Jetson Orin with a SEANet encoder running on 11 s /
16 kHz audio (im2col reaching OW ~ 176000); pre-fix launch returns
`invalid configuration argument`, post-fix runs to completion.
Existing test-backend-ops im2col cases unchanged.
* cuda: tighten snake fusion type checks for all operands (defensive, sync vulkan)
* cuda: reject snake fusion when ne[2] or ne[3] > 1 (mirror vulkan PR review)
* cuda: merge type_ok and types_ok into a single types_ok (address am17an review)
* cuda: filter ADD/SUB/MUL/DIV in supports_op to F32/F16
bin_bcast only dispatches F32/F16 type triplets, mirror the
vulkan filter so unsupported types fall back through cpy
instead of aborting.
* test-backend-ops: extend snake_fuse to rank-4 with ne[2]/ne[3] > 1 cases
* spec : refactor
* spec : drop support for incompatible vocabs
* spec : update common_speculative_init()
* cont : pass seq_id
* cont : dedup ctx_seq_rm_type
* server : sketch the ctx_dft decode loop
* server : draft prompt cache and checkpoints
* server : improve ctx names
* server, spec : transition to unified spec context
* cont : sync main and drft contexts
* cont : async drft eval when possible
* cont : handle non-ckpt models
* cont : pass correct n_past for drafting
* cont : process images throught the draft context
* spec : handle draft running out of context
* server : fix mtmd draft processing
* server : fix URL for draft model
* server : add comment
* server : clean-up + dry
* speculative-simple : update
* spec : fix n_past type
* server : fix slot ctx_drft ptr
* tools : update readme
* naming : improve consistency
* spec : refactor for multi-sequence speculative context
* cont : prepare params
* cont : prepare params
* spec : support parallel drafts
* server : support parallel drafting
* llama : reuse device buffers when possible
* server, spec : clean-up
* cont : clean-up
* cont : minor
* spec : reset `drafting` flag at the end
* spec : introduce `common_speculative_process()`
* spec : allow for multiple spec types (chain of speculators)
* replace old type field of type common_speculative_type in the
common_params_speculative struct with a vector to allow multiple
types to be specified
* introduce common_get_enabled_speculative_impls(const std::vector<enum common_speculative_type>)
to figure out which implementations the user has enabled
* introduce common_speculative_type_from_names(const std::vector<std::string> & names)
to parse the already user provided spec types
* all speculators run sequentially, best one wins (we verify its drafted tokens)
* maximize expected accepted tokens for current round by calculating the
product between the probability of accepting current token (n_acc_tokens / n_gen_drafts)
and the draft's length
---------
Co-authored-by: Petros Sideris <petros.sideris@nokia.com>
This commit updates the command line arguments to use the correct names
and values which are now required.
The motivation for this change is that currently running the example
command as is will generate the following errors:
```console
error while handling argument "--color": error: unknown value for --color: '--sampling-seq'
usage:
-co, --color [on|off|auto] Colorize output to distinguish prompt and user input from generations
('on', 'off', or 'auto', default: 'auto')
'auto' enables colors when output is to a terminal
error while handling argument "-fa": error: unknown value for --flash-attn: '--temp'
usage:
-fa, --flash-attn [on|off|auto] set Flash Attention use ('on', 'off', or 'auto', default: 'auto')
(env: LLAMA_ARG_FLASH_ATTN)
error while handling argument "--draft-max": the argument has been removed. use --spec-draft-n-max or --spec-ngram-mod-n-max
usage:
--draft, --draft-n, --draft-max N the argument has been removed. use --spec-draft-n-max or
--spec-ngram-mod-n-max
(env: LLAMA_ARG_DRAFT_MAX)
error while handling argument "--draft-min": the argument has been removed. use --spec-draft-n-min or --spec-ngram-mod-n-min
usage:
--draft-min, --draft-n-min N the argument has been removed. use --spec-draft-n-min or
--spec-ngram-mod-n-min
(env: LLAMA_ARG_DRAFT_MIN)
```
* convert : add image break token fallback
This commit adds a image_break_token_id fallback for mistral where the
config contains a image_break_token_id of -1:
```console
"vision_encoder": {
"image_token_id": 10,
"image_break_token_id": -1,
...
```
But the tokenizer.json has this token:
```console
115 "id": 12,
116 "content": "[IMG_BREAK]",
117 "single_word": false,
118 "lstrip": false,
119 "rstrip": false,
120 "normalized": false,
121 "special": true
122 },
```
If we look in convert_hf_to_gguf.py we have:
```python
elif self.is_mistral_format:
# hparams is already vision config here so norm_eps is only defined in global_config.
self.hparams["norm_eps"] = self.global_config.get("norm_eps", None)
assert self.hparams["norm_eps"] is not None, "norm_eps not found in params.json"
if self.use_break_tok:
self.img_break_tok_id = self.find_vparam(["image_break_token_id"])
```
The motivation for this is that currently converting this models
results in the following error:
```console
load_hparams: model size: 5131.60 MiB
load_hparams: metadata size: 0.15 MiB
clip_init: failed to load model 'models/mmproj-Mistral-Medium-3.5-128B.gguf': operator(): unable to find tensor v.token_embd.img_break
mtmd_init_from_file: error: Failed to load CLIP model from models/mmproj-Mistral-Medium-3.5-128B.gguf
Failed to load vision model from models/mmproj-Mistral-Medium-3.5-128B.gguf
```
With this fallback the model loads successfully.
Resolves: https://github.com/ggml-org/llama.cpp/issues/22901
* Revert "convert : add image break token fallback"
This reverts commit 292e40cfdf.
* convert : add image break token fallback
This commit adds a image_break_token_id fallback for mistral where the
config contains a image_break_token_id of -1:
```console
"vision_encoder": {
"image_token_id": 10,
"image_break_token_id": -1,
...
```
But the tokenizer.json has this token:
```console
115 "id": 12,
116 "content": "[IMG_BREAK]",
117 "single_word": false,
118 "lstrip": false,
119 "rstrip": false,
120 "normalized": false,
121 "special": true
122 },
```
If we look in convert_hf_to_gguf.py we have:
```python
elif self.is_mistral_format:
# hparams is already vision config here so norm_eps is only defined in global_config.
self.hparams["norm_eps"] = self.global_config.get("norm_eps", None)
assert self.hparams["norm_eps"] is not None, "norm_eps not found in params.json"
if self.use_break_tok:
self.img_break_tok_id = self.find_vparam(["image_break_token_id"])
```
The motivation for this is that currently converting this models
results in the following error:
```console
load_hparams: model size: 5131.60 MiB
load_hparams: metadata size: 0.15 MiB
clip_init: failed to load model 'models/mmproj-Mistral-Medium-3.5-128B.gguf': operator(): unable to find tensor v.token_embd.img_break
mtmd_init_from_file: error: Failed to load CLIP model from models/mmproj-Mistral-Medium-3.5-128B.gguf
Failed to load vision model from models/mmproj-Mistral-Medium-3.5-128B.gguf
```
With this fallback the model loads successfully.
Co-authored-by: Pascal <admin@serveurperso.com>
Resolves: https://github.com/ggml-org/llama.cpp/issues/22901
* convert : allow zero value for img_break_tok_id