The HIP branch in launch_fattn used raw hipMalloc / hipFree /
hipStreamSynchronize(main_stream) for the K/V f16 dequant temp buffers
(introduced to avoid pool retention OOM). These three calls are illegal
during HIP graph capture and abort cudaStreamEndCapture with
hipErrorStreamCaptureUnsupported, manifesting as the "ROCm error" at
ggml-cuda.cu:104 when running models like Qwen3.6-27B-Dense and
Qwen3.6-35B-A3B-Q8 with -fa 1 on gfx1151. The previous workaround was
to set GGML_CUDA_DISABLE_GRAPHS=1.
Probe cudaStreamIsCapturing on entry; when a capture is in progress,
use ggml_cuda_pool_alloc<half> (legal in capture). Outside capture,
behavior is unchanged, so the OOM avoidance that the raw-alloc branch
was added for is preserved.
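The allocation choice described above can be modelled in plain C++ without
a GPU. This is only a sketch under stated assumptions: capture_state,
fake_pool and kv_temp are hypothetical names, standing in for the result of
cudaStreamIsCapturing, the ggml CUDA pool and the temp-buffer logic in
launch_fattn; std::malloc/std::free stand in for the raw hipMalloc/hipFree
path.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical stand-in for the capture status reported by the runtime.
enum class capture_state { none, active };

// Hypothetical stand-in for the backend pool: memory is reserved up front,
// so handing out a slice makes no runtime allocation call. Unlike a real
// pool, this bump allocator never takes memory back.
struct fake_pool {
    std::vector<unsigned char> storage;
    std::size_t used = 0;
    explicit fake_pool(std::size_t capacity) : storage(capacity) {}
    void * alloc(std::size_t n) {
        if (used + n > storage.size()) {
            return nullptr;
        }
        void * p = storage.data() + used;
        used += n;
        return p;
    }
};

// Temp buffer that picks its allocation path from the capture state.
struct kv_temp {
    void * ptr = nullptr;
    bool from_pool = false;
    kv_temp(capture_state st, fake_pool & pool, std::size_t n) {
        if (st == capture_state::active) {
            ptr = pool.alloc(n);   // pool path: allowed while capturing
            from_pool = true;
        } else {
            ptr = std::malloc(n);  // raw path: kept outside capture
        }
    }
    ~kv_temp() {
        if (!from_pool) {
            std::free(ptr);        // only the raw path frees eagerly
        }
    }
    kv_temp(const kv_temp &) = delete;
    kv_temp & operator=(const kv_temp &) = delete;
};
```

The point of the split is that the raw path releases memory as soon as the
buffer goes out of scope, while the pool path avoids any allocation call
during capture.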
Also: ggml_cuda_error wrote only via GGML_LOG_ERROR, which llama-bench
silences with llama_null_log_callback, so the actual hipError was
invisible. Mirror the message to stderr with fflush so failures stay
diagnosable from bench. Expand the inline CUDA_CHECK around
cudaStreamEndCapture / cudaGraphInstantiate / cudaGraphLaunch to print
which graph step failed plus the cgraph's first/last op for context.
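A minimal sketch of the mirroring idea, not the actual ggml_cuda_error
code: report_error, graph_step_context and log_callback are hypothetical
names, and a std::ostream is used in place of fprintf(stderr, ...) so the
behavior can be checked without touching the real stderr.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>   // std::ostringstream is handy for capturing the mirror
#include <string>

// Hypothetical callback type. A callback that drops its input models a
// silenced logger, which is how the original error became invisible.
using log_callback = void (*)(const char * text, void * user_data);

// Send the message to the logger and also to a second stream
// (std::cerr in real use), flushing so it survives an abort.
inline void report_error(log_callback cb, void * user_data,
                         std::ostream & mirror, const std::string & msg) {
    if (cb != nullptr) {
        cb(msg.c_str(), user_data);
    }
    mirror << msg << '\n';
    mirror.flush();
}

// Build the extra context line: which graph step failed, plus the first
// and last op names of the graph being processed.
inline std::string graph_step_context(const std::string & step,
                                      const std::string & first_op,
                                      const std::string & last_op) {
    return step + " failed (first op: " + first_op +
           ", last op: " + last_op + ")";
}
```

In real use the mirror argument would be std::cerr, so the message appears
even when the log callback discards everything.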
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two fixes surfaced by running the full test suite against the squash-merged
turboquant branch, plus one CMake registration.
1. ggml-cuda/ggml-cuda.cu (GET_ROWS supports_op)
Removed TQ3_1S/TQ4_1S from the CUDA/HIP GET_ROWS supports_op switch.
TheTom's branch advertised these as supported but never added the matching
cases to getrows.cu, a latent bug present on both his branch and master.
test-backend-ops on master triggers it; the scheduler now routes
get_rows on TQ types to CPU.
2. ggml-cuda/fattn.cu (HIP head-size gate)
Master's get_best_fattn_kernel falls through to BEST_FATTN_KERNEL_TILE as
the default. On HIP, fattn-tile.cu only instantiates head sizes 64, 128,
256, 320, 512 (576/640 exceed local memory limits per #ifndef
GGML_USE_HIP). Without this gate, supports_op returns true for unsupported
sizes and the dispatch aborts. get_best_fattn_kernel now returns
BEST_FATTN_KERNEL_NONE on HIP for head sizes the tile kernel cannot
compile, letting the scheduler fall back to CPU.
3. tests/CMakeLists.txt (test-turbo-quant registration)
TheTom added tests/test-turbo-quant.c (CPU round-trip diagnostic for
turbo3/turbo4 quant→dequant→inverse-WHT) but never wired it into the
build. It is now registered as a ctest entry linked against ggml + libm.
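The HIP head-size gate from item 2 can be sketched as below. This is an
illustration only: fattn_choice, hip_tile_has_head_size and
pick_default_kernel are hypothetical names, and the size list is copied
from the description in item 2 rather than from fattn-tile.cu.

```cpp
#include <cassert>

// Head sizes the tile kernel is instantiated for on HIP, as listed in
// item 2 above (assumed, not read from the source file).
static const int hip_tile_head_sizes[] = {64, 128, 256, 320, 512};

// Hypothetical two-value stand-in for the kernel selection result.
enum class fattn_choice { none, tile };

inline bool hip_tile_has_head_size(int head_size) {
    for (int s : hip_tile_head_sizes) {
        if (s == head_size) {
            return true;
        }
    }
    return false;
}

// Default-case selection: on HIP, refuse head sizes the tile kernel was
// never compiled for, so the caller can fall back to another backend.
inline fattn_choice pick_default_kernel(bool is_hip, int head_size) {
    if (is_hip && !hip_tile_has_head_size(head_size)) {
        return fattn_choice::none;
    }
    return fattn_choice::tile;
}
```

Returning "none" from the selection step, instead of aborting later in the
dispatch, is what lets supports_op report the op as unsupported.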
Test status with these fixes:
- CPU (build-cpu): 51/51 ctest pass, including new test-turbo-quant.
- HIP (build-hip, gfx1151): 50/50 ctest pass with GGML_CUDA_DISABLE_GRAPHS=1
and test-backend-ops excluded. test-backend-ops itself passes 13674/13677
internal cases; the 3 remaining failures (CLAMP f16 → inf, bf16 FA graph
capture) are pre-existing master-side regressions on RDNA3.5+HIP that
reproduce on plain master and are unrelated to TurboQuant.
Squashes the entire TurboQuant KV-cache feature branch from
https://github.com/TheTom/llama-cpp-turboquant (tip 5aeb2fdbe) onto our master.
Includes: TurboQuant KV-cache types (turbo2_0, turbo3_0, turbo4_0, tq3_1s,
tq4_1s), GGML_OP_TURBO_WHT op, CUDA + Metal kernels (including TQ-rotated
mul_mm path), CPU reference paths, HIP template instances, perplexity tooling,
and 18 post-upstream-sync fixes (CVE-2026-21869 server clamp, HIP FA pool
retention, n_head_v reshape, sparse-V CUDA gating, etc.).
Conflict-resolution notes (review carefully before depending on these paths):
- common/arg.cpp, common/speculative.cpp: master's refactored speculative API
kept (params.speculative.types / ngram_mod struct, per-sinfo n_low/i_last).
- ggml-cuda/fattn.cu: head-size exclusion lists unioned (192 and 640 are
now both excluded, alongside the other sizes).
- ggml-cuda/ggml-cuda.cu: both master's ADD/SUB/MUL/DIV F16 widening AND
TurboQuant's GGML_OP_TURBO_WHT support cases kept.
- ggml-metal-device.h/.cpp: master's new get_pipeline_mul_mv_ext signature
(const ggml_tensor * op) kept; TurboQuant's get_pipeline_turbo_wht added.
- ggml-metal-ops.cpp: TurboQuant's TQ-rotated mul_mm path preserved; non-TQ
else-branch adapted to master's pipeline.nr0/nr1/nsg dispatch API.
- ggml-vulkan.cpp: master's spec-constant-driven flash_attn pipeline iteration
taken (over TurboQuant's CREATE_FA-per-type macro approach). TURBO3_0 added
to the fa_kv_ok lambda for type validation.
- ggml-vulkan/flash_attn_base.glsl, vulkan-shaders-gen.cpp: master's new
spec-constant FA shader generation kept; TurboQuant's DATA_A_TURBO3_0 macro
path NOT carried over. *** Vulkan TURBO3_0 flash-attention paths need
re-implementation against the new spec-constant API. *** Vulkan TURBO3_0
inference will likely fail until that work is redone.
Squash base: 7fc1c4ef78 (TheTom's last upstream merge point).
Add graphs reused counter to the per-slot timing output, printed via
llama_perf_context().
Assisted-by: llama.cpp:local pi
Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
* save-load-state : refactor into separate phase functions
- Split monolithic main() into 4 self-contained phase functions, each
managing its own context/sampler/batch lifecycle
- Each function tokenizes internally using its local ctx instance
- main() is now a clean orchestrator: init -> run phases -> assert results
- Proper resource cleanup on every exit path (return {} on error)
Assisted-by: llama.cpp:local pi
* save-load-state : use params.out_file instead of separate state_file
- Remove state_file parameter from all phase functions
- Each function accesses params.out_file directly
- Initialize params.out_file in main alongside params.prompt
Assisted-by: llama.cpp:local pi
* save-load-state : use smart pointers for ctx and smpl
- Replace raw llama_context* with llama_context_ptr
- Replace raw llama_sampler* with llama_sampler_ptr
- Remove all manual llama_free() and llama_sampler_free() calls
- Keep llama_batch as raw (managed manually with llama_batch_free)
Assisted-by: llama.cpp:local pi
* save-load-state : add local llama_batch_ptr RAII wrapper
- Add llama_batch_ptr struct holding llama_batch by value
- Calls llama_batch_free() in destructor
- Eliminates all manual llama_batch_free() calls
Assisted-by: llama.cpp:local pi
* save-load-state : replace printf/fprintf with logging macros
- Add log.h include
- Replace fprintf(stderr, ...) errors with LOG_ERR
- Replace fprintf(stderr, ...) info with LOG_TRC
- Replace printf output with LOG
Assisted-by: llama.cpp:local pi
* save-load-state : refactor tests to check results inline
Each follow-up phase now accepts an expected result and performs
the comparison internally instead of collecting results in main().
Assisted-by: llama.cpp:local pi
* save-load-state : improve test output readability
Add phase labels, remove redundant run prefixes, and show
PASS after each test.
Assisted-by: llama.cpp:local pi
* pi : add rule about git signing
* save-load-state : simplify llama_batch_ptr
Change get() to return a reference and remove operator*().
Use batch.get() throughout for consistency.
Assisted-by: llama.cpp:local pi
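A minimal sketch of the wrapper shape described in the llama_batch_ptr
commits, assuming nothing about the actual test source: fake_batch and
fake_batch_free are hypothetical stand-ins for llama_batch and
llama_batch_free so the example compiles on its own, and batch_ptr is an
illustrative name.

```cpp
#include <cassert>

// Hypothetical stand-in for llama_batch (held by value, like the original).
struct fake_batch {
    int n_tokens;
};

// Counts releases so the RAII behavior can be observed.
static int g_free_calls = 0;

// Hypothetical stand-in for llama_batch_free().
inline void fake_batch_free(fake_batch /*batch*/) {
    ++g_free_calls;
}

// RAII wrapper: owns the batch by value and frees it exactly once.
struct batch_ptr {
    fake_batch batch;

    explicit batch_ptr(fake_batch b) : batch(b) {}
    ~batch_ptr() { fake_batch_free(batch); }

    // Non-copyable, so the batch cannot be freed twice.
    batch_ptr(const batch_ptr &) = delete;
    batch_ptr & operator=(const batch_ptr &) = delete;

    // get() returns a reference; there is no operator*().
    fake_batch & get() { return batch; }
};
```

With this shape every exit path releases the batch when the wrapper goes
out of scope, which is what removes the manual free calls.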
* save-load-state : extract generate_tokens helper
Factor out the repeated token generation loop into a shared
helper function used by all phases.
Assisted-by: llama.cpp:local pi
* save-load-state : update comments to use test terminology
Replace "Phase" with "Test" and list each test's steps
as bullet points.
Assisted-by: llama.cpp:local pi
* save-load-state : rename test functions
Rename to test_baseline, test_state_load, test_seq_cp_host,
test_seq_cp_device. Update comments and logs accordingly.
Assisted-by: llama.cpp:local pi
* pi : add rule to never git push without confirmation
Assisted-by: llama.cpp:local pi
* common : add model_only option to common_init_from_params
Add bool model_only parameter to skip context creation,
sampler init, and context-dependent setup.
Use in save-load-state to initialize only the model,
with each test creating its own context.
Assisted-by: llama.cpp:local pi
---------
Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
* llama-eval : add per-problem summary table to HTML reports
- Add chunk_idx and problem_idx to TaskState and saved case dicts
- Group completed cases by problem_idx in dump_html()
- Render per-problem summary table before individual task table
- Columns: Problem (zero-padded), Runs, Correct (n/r),
Tokens (min/avg/max), T/s (min/avg/max), Gen s (min/avg/max)
- Sorted by problem index, monospace font, right-aligned numbers
- Colspan headers for grouped stats, auto width
- Simulator: add /v1/models endpoint, timings in response,
template-aware question matching, --dataset arg (aime/aime2025)
Assisted-by: llama.cpp:local pi
* llama-eval : add tabs for Detailed and Summary tables, apply monospace font globally
- Wrap Detailed and Summary tables in switchable tabs (Detailed active by default)
- Remove summary-section wrapper, use tab labels instead
- Apply monospace font to all tables and the top bar
Assisted-by: llama.cpp:local pi
* llama-eval : redesign top bar as CSS grid label/value pairs
- Replace flat span list with 4-column grid layout (2 pairs per row)
- Labels in muted color (#888), values in dark (#222)
- Bold dataset name and model name
- Remove media query; always use 4 columns
Assisted-by: llama.cpp:local pi
* llama-eval : use realistic token counts and throughput in simulator
- comp_tokens: [30, 80] → [10000, 60000]
- tps_gen: derived → uniform [90.0, 110.0]
- t_gen_ms: now computed from tokens/tps
Assisted-by: llama.cpp:local pi
* llama-eval : color Answer column green/red based on correctness
Use the same .correct/.incorrect CSS classes on the Answer column
to make correct answers green and incorrect answers red.
Assisted-by: llama.cpp:local pi
* llama-eval : fix pyright errors from max(..., key=len) type inference
Use key=lambda x: len(x) instead of key=len so the type checker
infers the return type as str instead of Sized, fixing:
- unresolved-attribute: Object of type Sized has no attribute lower
- not-subscriptable: Cannot subscript object of type Sized
Assisted-by: llama.cpp:local pi
With the introduction of MTP we can have multiple compute contexts for
the same RPC device. In this case, last_graph_uid is not updated properly
when switching between contexts. Fix this by moving last_graph_uid to
the device context, so that it is always updated.
closes: #23242
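The sketch below illustrates why device-level state helps. It is not the
RPC backend code: device_state, compute_context and needs_resend are
hypothetical names, and the assumption that a matching uid lets the sender
skip re-sending the graph is mine, not something stated above.

```cpp
#include <cassert>
#include <cstdint>

// State shared by every compute context of one device.
struct device_state {
    std::uint64_t last_graph_uid = 0;
};

// A compute context only refers to its device; it keeps no uid of its own.
struct compute_context {
    device_state * dev;
};

// Returns true when the graph has to be sent again. Because the uid lives
// on the device, a switch to another context is always noticed.
inline bool needs_resend(compute_context & ctx, std::uint64_t graph_uid) {
    if (ctx.dev->last_graph_uid == graph_uid) {
        return false;
    }
    ctx.dev->last_graph_uid = graph_uid;
    return true;
}
```

If each context cached its own uid instead, context A could still believe
graph 1 is current after context B had replaced it with graph 2.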
* ggml-hexagon: add PAD op HVX kernel
Implements GGML_OP_PAD on the Hexagon HTP backend using HVX vectorized
kernels. Supports zero-padding and circular padding across all 4 tensor
dimensions.
* hex-ggml: remove duplicate op cases (merge conflict)
* hex-pad: fix editorconfig checks and macro alignment
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
* docker: add OCI image labels to all published images
* docker: propagate OCI labels as manifest and index annotations
* docker: drop hardcoded org URL and revert accidental intel version bump
The OCI image url and source are now driven by build args with a
sensible default. The workflow passes the actual repository url so fork
builds get labels pointing at the fork instead of upstream. Also
restores the IGC, compute runtime, and IGDGMM versions in the intel
Dockerfile labeled stage which I accidentally bumped in the first
commit.
* docker: add skip_s390x workflow_dispatch input for fast test runs
Lets maintainers and PR authors trigger the docker workflow without the
s390x build target, which depends on the IBM Z runner and is by far the
slowest job in the matrix. The flag filters the s390x row out of the
build matrix before merge_matrix is derived, so the merge job sees a
consistent shape too.
Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com>
---------
Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com>
* refactor: Scope console logs to `DEV` + `VITE_DEBUG` env vars
* refactor: skip MCP proxy probe when no server requires it
* refactor: suppress expected disconnect errors during MCP client shutdown
* refactor: Deduplicate requests
* refactor: deduplicate model fetching across ROUTER and MODEL modes
* refactor: Clean up models logic
* chore: Add `.env.example` file
* refactor: replace client-side CORS proxy probe with server status flag
* refactor: Post-review fixes
* test: add vitest client setup with API fetch mocks
* common : delegate assistant continuation to template handler
* server : implement echo parameter to exclude assistant prefill in the response
* server : fix tests for prefill
* server : use existing llama template
* cont : clean up
* ci/run: set explicit SPIR-V Headers search path for macOS vulkan CI
For whatever reason, the files are under an additional sub-path
`vulkan/` under the cmake directory, which matches neither the
current LunarG macOS Vulkan SDK structure (`lib/cmake/SPIRV-Headers`)
nor what gets installed when you run the cmake build+install for
SPIRV-Headers itself, at least on Linux (`share/cmake/SPIRV-Headers`).
This allows SPIRV-Headers to be found, as the CI runner's setup
currently does not seem to include the relevant path in its
list of search locations.
* ggml-vulkan/CMakeLists: add a check for SPIRV-Headers
The SPIRV-Headers project installs its CMake package when it is built
and installed. Receiving an error during the configuration step is
generally preferred to receiving an error in the middle of a build.
The --embd-normalize flag was registered only for the embedding and debug
examples, so llama-server rejected it and the /embedding handler used a
hard-coded default of 2 (Euclidean/L2 normalization). Add
LLAMA_EXAMPLE_SERVER to the flag's example set and read
params.embd_normalize as the handler's default. The per-request
"embd_normalize" body field continues to override it.
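The resulting precedence can be sketched as a small function. The name
resolve_embd_normalize and the pointer-based "optional" argument are
illustrative assumptions; the server's actual JSON handling is not shown.

```cpp
#include <cassert>

// params_default: value taken from params.embd_normalize (set by the flag).
// request_value:  points at the request's "embd_normalize" field, or is
//                 nullptr when the request body does not contain it.
inline int resolve_embd_normalize(int params_default, const int * request_value) {
    if (request_value != nullptr) {
        return *request_value;  // per-request field wins
    }
    return params_default;      // otherwise fall back to the flag value
}
```

With the flag left at 2 and no body field, the handler still uses 2, so
existing clients see no change.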