Commit Graph

9145 Commits

Author SHA1 Message Date
shahondin1624 cdd851c05a mtp: squash-merge am17an/mtp-clean (upstream PR #22673)
Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Has been cancelled
Python check requirements.txt / check-requirements (push) Has been cancelled
Python Type-Check / python type-check (push) Has been cancelled
CI / ubuntu-22-hip (push) Has been cancelled
flake8 Lint / Lint (push) Has been cancelled
CI (android) / android (push) Failing after 3m46s
CI (android) / android-ndk (push) Failing after 5s
CI (sanitize) / ubuntu-latest-sanitizer (Debug, ADDRESS) (push) Failing after 8s
CI (sanitize) / ubuntu-latest-sanitizer (Debug, THREAD) (push) Failing after 8s
CI (sanitize) / ubuntu-latest-sanitizer (Debug, UNDEFINED) (push) Failing after 8s
CI / build-cmake-pkg (push) Successful in 13m58s
CI / android-arm64 (push) Failing after 13s
CI / ubuntu-latest-rpc (push) Failing after 10s
CI / ubuntu-latest-cuda (push) Failing after 6s
Release / android-arm64 (push) Failing after 28s
Server (sanitize) / server (RelWithDebInfo, ADDRESS) (push) Failing after 6s
Server (sanitize) / server (RelWithDebInfo, UNDEFINED) (push) Failing after 5s
Server / server (default) (push) Failing after 6s
Server / server (backend-sampling) (push) Failing after 6s
Build Actions Cache / ubuntu-24-vulkan-cache (push) Has been cancelled
Build Actions Cache / ubuntu-24-openvino-cache (push) Has been cancelled
Build Actions Cache / windows-2022-rocm-cache (push) Has been cancelled
Close inactive issues / close-issues (push) Has been cancelled
CI (apple) / macOS-latest-swift (generic/platform=iOS) (push) Has been cancelled
CI (apple) / macOS-latest-swift (generic/platform=macOS) (push) Has been cancelled
CI (apple) / macOS-latest-swift (generic/platform=tvOS) (push) Has been cancelled
CI (riscv) / ubuntu-riscv64-native-sanitizer (Debug, ADDRESS) (push) Has been cancelled
Server (self-hosted) / server-metal (GPUx2) (push) Has been cancelled
Server (self-hosted) / server-metal (GPUx1) (push) Has been cancelled
CI (3rd-party) / ubuntu-24-llguidance (push) Has been cancelled
CI (apple) / macOS-latest-ios (push) Has been cancelled
CI (apple) / macos-latest-ios-xcode (push) Has been cancelled
CI (apple) / macOS-latest-tvos (push) Has been cancelled
CI (apple) / macOS-latest-visionos (push) Has been cancelled
CI (cann) / openEuler-latest-cann (aarch64, Release, 310p, off) (push) Has been cancelled
CI (cann) / openEuler-latest-cann (aarch64, Release, 910b, off) (push) Has been cancelled
CI (cann) / openEuler-latest-cann (aarch64, Release, 910b, on) (push) Has been cancelled
CI (cann) / openEuler-latest-cann (x86, Release, 310p, off) (push) Has been cancelled
CI (cann) / openEuler-latest-cann (x86, Release, 910b, off) (push) Has been cancelled
CI (cann) / openEuler-latest-cann (x86, Release, 910b, on) (push) Has been cancelled
CI (riscv) / ubuntu-riscv64-native-sanitizer (Debug, THREAD) (push) Has been cancelled
CI (openvino) / ubuntu-24-openvino-CPU (push) Has been cancelled
CI (openvino) / ubuntu-24-openvino-GPU (push) Has been cancelled
CI (self-hosted) / ggml-ci-nvidia-cuda (push) Has been cancelled
CI (self-hosted) / ggml-ci-nvidia-vulkan-cm (push) Has been cancelled
CI (riscv) / ubuntu-riscv64-native-sanitizer (Debug, UNDEFINED) (push) Has been cancelled
CI (sycl) / ubuntu-24-sycl (fp16, ON) (push) Has been cancelled
CI (self-hosted) / ggml-ci-nvidia-vulkan-cm2 (push) Has been cancelled
CI (self-hosted) / ggml-ci-mac-metal (push) Has been cancelled
CI (self-hosted) / ggml-ci-mac-webgpu (push) Has been cancelled
CI (self-hosted) / ggml-ci-mac-vulkan (push) Has been cancelled
CI (self-hosted) / ggml-ci-linux-intel-vulkan (push) Has been cancelled
CI (self-hosted) / ggml-ci-win-intel-vulkan (push) Has been cancelled
CI (self-hosted) / ggml-ci-intel-openvino-gpu-low-perf (push) Has been cancelled
CI (sycl) / ubuntu-24-sycl (fp32, OFF) (push) Has been cancelled
CI (sycl) / windows-latest-sycl (push) Has been cancelled
CI (vulkan) / ubuntu-24-vulkan-llvmpipe (push) Has been cancelled
CI / macOS-latest-arm64 (push) Has been cancelled
CI / macOS-latest-x64 (push) Has been cancelled
CI / macOS-latest-arm64-webgpu (push) Has been cancelled
CI / ubuntu-cpu (arm64, ubuntu-24.04-arm) (push) Has been cancelled
CI / ubuntu-cpu (ppc64le, ubuntu-24.04-ppc64le) (push) Has been cancelled
CI / ubuntu-cpu (s390x, ubuntu-24.04-s390x) (push) Has been cancelled
CI / ubuntu-cpu (x64, ubuntu-22.04) (push) Has been cancelled
CI / ubuntu-24-vulkan (arm64, ubuntu-24.04-arm) (push) Has been cancelled
CI / ubuntu-24-vulkan (x64, ubuntu-24.04) (push) Has been cancelled
CI / ubuntu-24-webgpu (push) Has been cancelled
CI / ubuntu-24-webgpu-wasm (push) Has been cancelled
CI / ubuntu-22-musa (push) Has been cancelled
CI / windows-latest (arm64, llvm-arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON) (push) Has been cancelled
CI / windows-latest (arm64, llvm-arm64-opencl-adreno, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DCMAKE_PREFIX_PATH="$env:RUNNER_TEMP/opencl-arm64-release" -DGGML_OPENCL=ON -DGGML_OPENCL_USE_ADRENO_KERNELS=ON) (push) Has been cancelled
CI / windows-latest (x64, cpu-x64 (static), -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DBUILD_SHARED_LIBS=OFF) (push) Has been cancelled
CI / windows-latest (x64, openblas-x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_OPENMP=OFF -DGGML_BLAS=ON -DG… (push) Has been cancelled
CI / windows-latest (x64, vulkan-x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_VULKAN=ON) (push) Has been cancelled
CI / windows-2022-cuda (12.4) (push) Has been cancelled
CI / windows-latest-hip (push) Has been cancelled
CI / ubuntu-cpu-riscv64-native (push) Has been cancelled
CI / ggml-ci-x64-cpu-low-perf (push) Has been cancelled
CI / ggml-ci-arm64-cpu-low-perf (push) Has been cancelled
CI / ggml-ci-x64-cpu-high-perf (push) Has been cancelled
CI / ggml-ci-arm64-cpu-high-perf (push) Has been cancelled
CI / ggml-ci-arm64-cpu-high-perf-sve (push) Has been cancelled
CI / ggml-ci-arm64-cpu-kleidiai (push) Has been cancelled
CI / ggml-ci-arm64-cpu-kleidiai-graviton4 (push) Has been cancelled
Code Style Checker / model-naming (push) Has been cancelled
EditorConfig Checker / editorconfig (push) Has been cancelled
Release / macOS-cpu (arm64, arm64, -DGGML_METAL_USE_BF16=ON -DGGML_METAL_EMBED_LIBRARY=ON, macos-14) (push) Has been cancelled
Release / macOS-cpu (arm64, arm64-kleidiai, -DGGML_METAL_USE_BF16=ON -DGGML_METAL_EMBED_LIBRARY=ON -DGGML_CPU_KLEIDIAI=ON, macos-14) (push) Has been cancelled
Release / macOS-cpu (x64, x64, -DGGML_METAL=OFF -DCMAKE_OSX_DEPLOYMENT_TARGET=13.3, macos-15-intel) (push) Has been cancelled
Release / ubuntu-cpu (arm64, ubuntu-24.04-arm) (push) Has been cancelled
Release / ubuntu-cpu (s390x, ubuntu-24.04-s390x) (push) Has been cancelled
Release / ubuntu-cpu (x64, ubuntu-22.04) (push) Has been cancelled
Release / ubuntu-vulkan (arm64, ubuntu-24.04-arm) (push) Has been cancelled
Release / ubuntu-vulkan (x64, ubuntu-22.04) (push) Has been cancelled
Release / ubuntu-24-openvino (push) Has been cancelled
Release / windows-cpu (arm64) (push) Has been cancelled
Release / windows-cpu (x64) (push) Has been cancelled
Release / windows (arm64, opencl-adreno, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DCMAKE_PREFIX_PATH="$env:RUNNER_TEMP/opencl-arm64-release" -DGGML_OPENCL=ON -DGGML_OPENCL_USE_ADRENO_KERNELS=ON, ggml-opencl) (push) Has been cancelled
Release / windows (x64, vulkan, -DGGML_VULKAN=ON, ggml-vulkan) (push) Has been cancelled
Release / windows-cuda (12.4) (push) Has been cancelled
Release / windows-cuda (13.1) (push) Has been cancelled
Release / windows-sycl (push) Has been cancelled
Release / ubuntu-24-sycl (fp16, ON) (push) Has been cancelled
Release / ubuntu-24-sycl (fp32, OFF) (push) Has been cancelled
Release / ubuntu-22-rocm (7.2.1, x64, gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1151;gfx1150;gfx1200;gfx1201) (push) Has been cancelled
Release / windows-hip (gfx1150;gfx1151;gfx1200;gfx1201;gfx1100;gfx1101;gfx1102;gfx1030;gfx1031;gfx1032, radeon) (push) Has been cancelled
Release / ios-xcode-build (push) Has been cancelled
Release / openEuler-cann (aarch64, Release, 310p, off) (push) Has been cancelled
Release / openEuler-cann (aarch64, Release, 910b, on) (push) Has been cancelled
Release / openEuler-cann (x86, Release, 310p, off) (push) Has been cancelled
Release / openEuler-cann (x86, Release, 910b, on) (push) Has been cancelled
Release / release (push) Has been cancelled
Server / server-windows (push) Has been cancelled
Server (self-hosted) / server-metal (GPUx2, backend-sampling) (push) Has been cancelled
Server (self-hosted) / server-metal (GPUx1, backend-sampling) (push) Has been cancelled
Publish Docker image / Create shared tags from digests (push) Has been cancelled
Publish Docker image / Create and push git tag (push) Has been cancelled
Publish Docker image / Prepare Docker matrices (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Registry (push) Has been cancelled
CI (msys) / windows-msys2 (Release, clang-x86_64, CLANG64) (push) Has been cancelled
CI (msys) / windows-msys2 (Release, ucrt-x86_64, UCRT64) (push) Has been cancelled
CI (cross) / debian-13-loongarch64-cpu-cross (push) Has been cancelled
CI (cross) / debian-13-loongarch64-vulkan-cross (push) Has been cancelled
CI (cross) / ubuntu-24-riscv64-cpu-spacemit-ime-cross (push) Has been cancelled
Update Winget Package / Update Winget Package (push) Has been skipped
Squashes 8 commits from https://github.com/am17an/llama.cpp.git mtp-clean
adding Multi-Token Prediction speculative decoding support, primarily for
Qwen3.5 / Qwen3.6 models with native MTP heads.

Mode: COMMON_SPECULATIVE_TYPE_DRAFT_MTP (CLI: --draft-mtp).
Per PR: ~1.8-2x speedup with ~75% draft acceptance using 3 draft tokens.
Currently requires --parallel 1.

Files touched: 26 (1319 insertions, 136 deletions). Hot spots:
- src/models/qwen35.cpp, qwen35moe.cpp — MTP head integration
- common/speculative.{cpp,h} — new MTP draft path
- src/llama-{context,model,memory}.* — MTP-specific KV cache
- convert_hf_to_gguf.py — converter writes MTP head tensors

Merge applied cleanly against current master (TurboQuant tip 7161dee3f).
Source branch HEAD: e7b484815 (add need_embd in speculative).

Note: upstream PR remains open as of this merge; performance regressions
were under discussion. Watch upstream for the final API.
2026-05-14 01:03:05 +02:00
shahondin1624 7161dee3f3 turboquant: post-merge integration fixes from test validation
CI (android) / android (push) Failing after 5m23s
CI (android) / android-ndk (push) Failing after 2m23s
CI (sanitize) / ubuntu-latest-sanitizer (Debug, ADDRESS) (push) Failing after 15s
CI (sanitize) / ubuntu-latest-sanitizer (Debug, THREAD) (push) Failing after 8s
CI (sanitize) / ubuntu-latest-sanitizer (Debug, UNDEFINED) (push) Failing after 8s
CI (cann) / openEuler-latest-cann (aarch64, Release, 310p, off) (push) Has been cancelled
CI (cann) / openEuler-latest-cann (aarch64, Release, 910b, off) (push) Has been cancelled
CI (cann) / openEuler-latest-cann (aarch64, Release, 910b, on) (push) Has been cancelled
CI (cann) / openEuler-latest-cann (x86, Release, 310p, off) (push) Has been cancelled
CI (cann) / openEuler-latest-cann (x86, Release, 910b, off) (push) Has been cancelled
CI (cann) / openEuler-latest-cann (x86, Release, 910b, on) (push) Has been cancelled
Release / android-arm64 (push) Failing after 24s
CI / build-cmake-pkg (push) Successful in 14m47s
CI / android-arm64 (push) Failing after 11s
CI / ubuntu-latest-rpc (push) Failing after 8s
CI (3rd-party) / ubuntu-24-llguidance (push) Has been cancelled
CI (apple) / macOS-latest-ios (push) Has been cancelled
CI (apple) / macos-latest-ios-xcode (push) Has been cancelled
CI (apple) / macOS-latest-tvos (push) Has been cancelled
CI (apple) / macOS-latest-visionos (push) Has been cancelled
CI (apple) / macOS-latest-swift (generic/platform=iOS) (push) Has been cancelled
CI (apple) / macOS-latest-swift (generic/platform=macOS) (push) Has been cancelled
CI (apple) / macOS-latest-swift (generic/platform=tvOS) (push) Has been cancelled
CI / ubuntu-latest-cuda (push) Failing after 3m49s
CI (riscv) / ubuntu-riscv64-native-sanitizer (Debug, ADDRESS) (push) Has been cancelled
CI (riscv) / ubuntu-riscv64-native-sanitizer (Debug, THREAD) (push) Has been cancelled
CI (riscv) / ubuntu-riscv64-native-sanitizer (Debug, UNDEFINED) (push) Has been cancelled
CI (self-hosted) / ggml-ci-mac-metal (push) Has been cancelled
CI (openvino) / ubuntu-24-openvino-CPU (push) Has been cancelled
CI (openvino) / ubuntu-24-openvino-GPU (push) Has been cancelled
Server (sanitize) / server (RelWithDebInfo, ADDRESS) (push) Failing after 38s
Server (sanitize) / server (RelWithDebInfo, UNDEFINED) (push) Failing after 4s
Server / server (default) (push) Failing after 5s
Server / server (backend-sampling) (push) Failing after 4s
CI (self-hosted) / ggml-ci-mac-vulkan (push) Has been cancelled
CI (self-hosted) / ggml-ci-linux-intel-vulkan (push) Has been cancelled
CI (self-hosted) / ggml-ci-win-intel-vulkan (push) Has been cancelled
CI (self-hosted) / ggml-ci-intel-openvino-gpu-low-perf (push) Has been cancelled
CI (self-hosted) / ggml-ci-nvidia-cuda (push) Has been cancelled
CI (self-hosted) / ggml-ci-nvidia-vulkan-cm (push) Has been cancelled
CI (self-hosted) / ggml-ci-nvidia-vulkan-cm2 (push) Has been cancelled
CI (sycl) / ubuntu-24-sycl (fp16, ON) (push) Has been cancelled
CI (self-hosted) / ggml-ci-mac-webgpu (push) Has been cancelled
CI (sycl) / ubuntu-24-sycl (fp32, OFF) (push) Has been cancelled
CI (sycl) / windows-latest-sycl (push) Has been cancelled
CI (virtgpu) / ubuntu-24-virtgpu (push) Has been cancelled
CI (vulkan) / ubuntu-24-vulkan-llvmpipe (push) Has been cancelled
CI / ubuntu-24-vulkan (arm64, ubuntu-24.04-arm) (push) Has been cancelled
CI / ubuntu-24-vulkan (x64, ubuntu-24.04) (push) Has been cancelled
CI / windows-latest (x64, openblas-x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_OPENMP=OFF -DGGML_BLAS=ON -DG… (push) Has been cancelled
CI / windows-latest (x64, vulkan-x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_VULKAN=ON) (push) Has been cancelled
CI / windows-2022-cuda (12.4) (push) Has been cancelled
CI / ubuntu-cpu (arm64, ubuntu-24.04-arm) (push) Has been cancelled
CI / ggml-ci-arm64-cpu-kleidiai-graviton4 (push) Has been cancelled
CI / macOS-latest-arm64 (push) Has been cancelled
CI / macOS-latest-x64 (push) Has been cancelled
CI / macOS-latest-arm64-webgpu (push) Has been cancelled
CI / ubuntu-cpu (ppc64le, ubuntu-24.04-ppc64le) (push) Has been cancelled
CI / ubuntu-cpu (s390x, ubuntu-24.04-s390x) (push) Has been cancelled
CI / ubuntu-cpu (x64, ubuntu-22.04) (push) Has been cancelled
CI / ubuntu-24-webgpu (push) Has been cancelled
CI / ubuntu-24-webgpu-wasm (push) Has been cancelled
CI / ubuntu-22-hip (push) Has been cancelled
CI / ubuntu-22-musa (push) Has been cancelled
CI / windows-latest (arm64, llvm-arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON) (push) Has been cancelled
CI / windows-latest (arm64, llvm-arm64-opencl-adreno, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DCMAKE_PREFIX_PATH="$env:RUNNER_TEMP/opencl-arm64-release" -DGGML_OPENCL=ON -DGGML_OPENCL_USE_ADRENO_KERNELS=ON) (push) Has been cancelled
CI / windows-latest (x64, cpu-x64 (static), -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DBUILD_SHARED_LIBS=OFF) (push) Has been cancelled
CI / windows-latest-hip (push) Has been cancelled
CI / ubuntu-cpu-riscv64-native (push) Has been cancelled
CI / ggml-ci-x64-cpu-low-perf (push) Has been cancelled
CI / ggml-ci-arm64-cpu-low-perf (push) Has been cancelled
CI / ggml-ci-x64-cpu-high-perf (push) Has been cancelled
CI / ggml-ci-arm64-cpu-high-perf (push) Has been cancelled
CI / ggml-ci-arm64-cpu-high-perf-sve (push) Has been cancelled
CI / ggml-ci-arm64-cpu-kleidiai (push) Has been cancelled
Code Style Checker / model-naming (push) Has been cancelled
EditorConfig Checker / editorconfig (push) Has been cancelled
flake8 Lint / Lint (push) Has been cancelled
Python Type-Check / python type-check (push) Has been cancelled
Release / macOS-cpu (arm64, arm64, -DGGML_METAL_USE_BF16=ON -DGGML_METAL_EMBED_LIBRARY=ON, macos-14) (push) Has been cancelled
Release / macOS-cpu (arm64, arm64-kleidiai, -DGGML_METAL_USE_BF16=ON -DGGML_METAL_EMBED_LIBRARY=ON -DGGML_CPU_KLEIDIAI=ON, macos-14) (push) Has been cancelled
Release / macOS-cpu (x64, x64, -DGGML_METAL=OFF -DCMAKE_OSX_DEPLOYMENT_TARGET=13.3, macos-15-intel) (push) Has been cancelled
Release / ubuntu-cpu (arm64, ubuntu-24.04-arm) (push) Has been cancelled
Release / ubuntu-cpu (s390x, ubuntu-24.04-s390x) (push) Has been cancelled
Release / ubuntu-cpu (x64, ubuntu-22.04) (push) Has been cancelled
Release / windows-cpu (arm64) (push) Has been cancelled
Release / ubuntu-vulkan (arm64, ubuntu-24.04-arm) (push) Has been cancelled
Release / ubuntu-vulkan (x64, ubuntu-22.04) (push) Has been cancelled
Release / ubuntu-24-openvino (push) Has been cancelled
Release / windows-cpu (x64) (push) Has been cancelled
Release / windows (arm64, opencl-adreno, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DCMAKE_PREFIX_PATH="$env:RUNNER_TEMP/opencl-arm64-release" -DGGML_OPENCL=ON -DGGML_OPENCL_USE_ADRENO_KERNELS=ON, ggml-opencl) (push) Has been cancelled
Release / windows (x64, vulkan, -DGGML_VULKAN=ON, ggml-vulkan) (push) Has been cancelled
Release / windows-cuda (12.4) (push) Has been cancelled
Release / windows-cuda (13.1) (push) Has been cancelled
Release / windows-sycl (push) Has been cancelled
Release / ubuntu-24-sycl (fp16, ON) (push) Has been cancelled
Release / ubuntu-24-sycl (fp32, OFF) (push) Has been cancelled
Release / ubuntu-22-rocm (7.2.1, x64, gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1151;gfx1150;gfx1200;gfx1201) (push) Has been cancelled
Release / windows-hip (gfx1150;gfx1151;gfx1200;gfx1201;gfx1100;gfx1101;gfx1102;gfx1030;gfx1031;gfx1032, radeon) (push) Has been cancelled
Release / ios-xcode-build (push) Has been cancelled
Release / openEuler-cann (aarch64, Release, 310p, off) (push) Has been cancelled
Release / openEuler-cann (aarch64, Release, 910b, on) (push) Has been cancelled
Release / openEuler-cann (x86, Release, 310p, off) (push) Has been cancelled
Release / openEuler-cann (x86, Release, 910b, on) (push) Has been cancelled
Release / release (push) Has been cancelled
Server / server-windows (push) Has been cancelled
Build Actions Cache / ubuntu-24-vulkan-cache (push) Has been cancelled
Build Actions Cache / ubuntu-24-openvino-cache (push) Has been cancelled
Build Actions Cache / windows-2022-rocm-cache (push) Has been cancelled
Server (self-hosted) / server-metal (GPUx2, backend-sampling) (push) Has been cancelled
Server (self-hosted) / server-metal (GPUx2) (push) Has been cancelled
Server (self-hosted) / server-metal (GPUx1) (push) Has been cancelled
Server (self-hosted) / server-metal (GPUx1, backend-sampling) (push) Has been cancelled
HIP quality check / ubuntu-22-hip-quality-check (push) Has been cancelled
Two fixes surfaced by running the full test suite against the squash-merged
turboquant branch, plus one CMake registration.

1. ggml-cuda/ggml-cuda.cu (GET_ROWS supports_op)
   Removed TQ3_1S/TQ4_1S from the CUDA/HIP GET_ROWS supports_op switch.
   TheTom's branch advertised these as supported but never added the matching
   cases to getrows.cu — a latent bug present on both his branch and master.
   master's test-backend-ops triggers it; the scheduler will now route
   get_rows on TQ types to CPU.

2. ggml-cuda/fattn.cu (HIP head-size gate)
   Master's get_best_fattn_kernel falls through to BEST_FATTN_KERNEL_TILE as
   default. On HIP, fattn-tile.cu only instantiates head sizes 64, 128, 256,
   320, 512 (576/640 exceed local memory limits per #ifndef GGML_USE_HIP).
   Without this gate, supports_op returns true for unsupported sizes and the
   dispatch aborts. Now returns BEST_FATTN_KERNEL_NONE on HIP for head sizes
   the tile kernel cannot compile, letting the scheduler fall back to CPU.

3. tests/CMakeLists.txt (test-turbo-quant registration)
   TheTom added tests/test-turbo-quant.c (CPU round-trip diagnostic for
   turbo3/turbo4 quant→dequant→inverse-WHT) but never wired it into the
   build. Registered as a ctest entry linked against ggml + libm.

Test status with these fixes:
- CPU (build-cpu): 51/51 ctest pass, including new test-turbo-quant.
- HIP (build-hip, gfx1151): 50/50 ctest pass with GGML_CUDA_DISABLE_GRAPHS=1
  and test-backend-ops excluded. test-backend-ops itself runs 13674/13677
  internal cases; the 3 remaining failures (CLAMP f16 → inf, bf16 FA graph
  capture) are pre-existing master-side regressions on RDNA3.5+HIP that
  reproduce on plain master and are unrelated to TurboQuant.
2026-05-14 00:38:58 +02:00
shahondin1624 15a6a36b59 turboquant: squash-merge TheTom/llama-cpp-turboquant feature/turboquant-kv-cache
Python Type-Check / python type-check (push) Has been cancelled
Squashes the entire TurboQuant KV-cache feature branch from
https://github.com/TheTom/llama-cpp-turboquant (tip 5aeb2fdbe) onto our master.

Includes: TurboQuant KV-cache types (turbo2_0, turbo3_0, turbo4_0, tq3_1s,
tq4_1s), GGML_OP_TURBO_WHT op, CUDA + Metal kernels (including TQ-rotated
mul_mm path), CPU reference paths, HIP template instances, perplexity tooling,
and 18 post-upstream-sync fixes (CVE-2026-21869 server clamp, HIP FA pool
retention, n_head_v reshape, sparse-V CUDA gating, etc.).

Conflict-resolution notes (review carefully before depending on these paths):

- common/arg.cpp, common/speculative.cpp: master's refactored speculative API
  kept (params.speculative.types / ngram_mod struct, per-sinfo n_low/i_last).

- ggml-cuda/fattn.cu: head-size exclusion lists unioned (now exclude both 192
  and 640 alongside other sizes).

- ggml-cuda/ggml-cuda.cu: both master's ADD/SUB/MUL/DIV F16 widening AND
  TurboQuant's GGML_OP_TURBO_WHT support cases kept.

- ggml-metal-device.h/.cpp: master's new get_pipeline_mul_mv_ext signature
  (const ggml_tensor * op) kept; TurboQuant's get_pipeline_turbo_wht added.

- ggml-metal-ops.cpp: TurboQuant's TQ-rotated mul_mm path preserved; non-TQ
  else-branch adapted to master's pipeline.nr0/nr1/nsg dispatch API.

- ggml-vulkan.cpp: master's spec-constant-driven flash_attn pipeline iteration
  taken (over TurboQuant's CREATE_FA-per-type macro approach). TURBO3_0 added
  to the fa_kv_ok lambda for type validation.

- ggml-vulkan/flash_attn_base.glsl, vulkan-shaders-gen.cpp: master's new
  spec-constant FA shader generation kept; TurboQuant's DATA_A_TURBO3_0 macro
  path NOT carried over. *** Vulkan TURBO3_0 flash-attention paths need
  re-implementation against the new spec-constant API. *** Vulkan TURBO3_0
  inference will likely fail until that work is redone.

Squash base: 7fc1c4ef78 (TheTom's last upstream merge point).
2026-05-13 23:01:46 +02:00
shaofeiqi ec562eb673 opencl: add q5_0 and q5_1 MoE for Adreno (#22985)
* opencl: add q5_0 moe support

* opencl: add q5_1 moe support

* opencl: avoid potential leak

* opencl: suppress unused var warning when building for non-Adreno

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
2026-05-13 11:57:31 -07:00
Pascal 95d469a915 server, webui: accept continue_final_message flag for vLLM API compat (#23012)
* server, webui: accept continue_final_message flag for vLLM API compat

Add the continue_final_message body flag from the vLLM and transformers
API. When set together with add_generation_prompt false, it triggers the
existing prefill_assistant code path, regardless of the server side
opt.prefill_assistant option. Mutual exclusion with add_generation_prompt
true is enforced, matching vLLM behavior.

WebUI sends continue_final_message and add_generation_prompt false on
the Continue button, with the matching opt in option on the chat service.

Pure API alignment, no change to the prefill logic itself. Paves the way
for the upcoming per-template prefill plumbing in common/chat.

* test: add coverage for continue_final_message vLLM compat flag

Two cases on top of the existing assistant prefill coverage. First,
continue_final_message true with add_generation_prompt false produces
the same rendered prompt as the prefill_assistant heuristic, proving
the new flag is a correct alias of the existing path. Second, both
flags set to true is rejected with HTTP 400, matching the
vLLM/transformers mutual exclusion contract.

* chore: update webui build output
2026-05-13 20:47:58 +02:00
lhez 1e4579fbb8 opencl: fix crash when warming up MoE on Adreno (#22876) 2026-05-13 11:24:33 -07:00
Masashi Yoshimura 527045bfb0 flush the gpu profile timestamp before the queryset is overflowed (#22995) 2026-05-13 10:22:44 -07:00
Aleksander Grygier 2dfeca31cc webui: Deduplicate model aliases in data + handle single/multiple aliases in UI (#22979)
* fix: Deduplicate aliases + display single alias instead of default name or 2+ aliases as tags

* refactor: Address review comments
2026-05-13 16:39:36 +02:00
Pascal 46be24d121 webui: preserve system message on edit cancel (#22911)
* webui: preserve system message on edit cancel when content is not the placeholder

* chore: update webui build output
2026-05-13 16:16:02 +02:00
Ravi Panchumarthy 7e16646015 docs : Update OPENVINO.md (#22959)
Updated OPENVINO.md with Validated models and quantizations

Co-authored-by: Haarika Madaka <haarika.madaka@intel.com>
2026-05-13 17:12:15 +03:00
Max Krasnyansky ad96bb8c0c hexagon: add unary tanh op (#22999) 2026-05-13 06:59:28 -07:00
Xuan-Son Nguyen e75cd5efb5 download: do not exit() on error (#23008) b9134 2026-05-13 15:14:58 +02:00
Pascal 5d44db6008 server, webui: support continue generation on reasoning models (#22727)
* server, webui : support continue generation on reasoning models (#22727)

Remove the throw blocking assistant prefill on reasoning models and
orchestrate thinking tags around the prefilled message so the parser
routes the next stream chunks correctly. WebUI drops the reasoning
guard on the Continue button, sends reasoning_content with the
prefilled message and persists partial reasoning on stop so the CoT
survives reload and resume.

Scope : templates with a simple thinking_start_tag / thinking_end_tag
pair. Channel-based templates like GPT-OSS are out of scope, pending
a per-template prefill API in common/chat.

First step toward #21754.

* chore: update webui build output

* server: reject reasoning prefill on channel based templates
b9133
2026-05-13 11:09:51 +02:00
Xuan-Son Nguyen 3796c94bad ci: validate model naming convention (#22680)
* ci: validate model naming convention

* bring back dedicated ec workflow

* add missing jobs

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-13 10:59:37 +02:00
Georgi Gerganov 634275fbbb spec : update CLI arguments for better consistency (#22964)
* spec : update CLI arguments for better consistency

* cont : fix CLI arg message
b9131
2026-05-13 09:15:39 +03:00
Sigbjørn Skjæret bcfe63fc53 llama-eval : enable type check (#22988) 2026-05-13 09:14:24 +03:00
Sachin Sharma 61af07c22d ggml-zendnn : adaptive fallback to CPU backend for small batch sizes (#22681)
* ggml-zendnn : add runtime env var GGML_ZENDNN_ADAPTIVE_FALLBACK to control adaptive fallback (default: enabled)

* ggml-zendnn : restore original fallback logic when adaptive fallback is disabled
b9129
2026-05-13 09:13:47 +03:00
Trivikram Reddy 856c3adac1 hexagon: eliminate scalar VTCM loads via HVX splat helpers (#22993)
* hexagon: add hvx_vec_repl helpers and use those for splat-from-vtcm usecase

* hmx-mm: optimize per-group scale handling

* hmx-fa: optimize slope load from vtcm

* hmx-fa: use aligned access where possible in hmx-utils

* hexagon: add hvx_vec_repl_2x_f16 helper and consolidate repl helpers

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
b9128
2026-05-12 17:28:02 -07:00
yzyyzyhhh a9883db8ee opencl: add opt-in Adreno xmem F16xF32 GEMM for prefill (#22755)
* ggml-opencl: add Adreno xmem F16xF32 GEMM for prefill

* ggml-opencl: address Adreno xmem review comments

* ggml-opencl: align xmem gemm kernel naming

---------

Co-authored-by: Your Name <your@email.com>
b9127
2026-05-12 13:10:37 -07:00
fredzillman cce09f0b2b convert : fix Pixtral 12B --mistral-format conversion (3 bugs) (#22981) 2026-05-12 21:46:01 +02:00
Aleksander Grygier dded58b450 webui: Fix Chat Screen Form box disappearing + autoscroll issues on WebKit (#22977)
* debug: Scroll/Sticky issues

* fix: UI improvements

* refactor: Remove unneeded logic

* fix: Better logic for initial load of messages
2026-05-12 20:41:11 +02:00
Xuan-Son Nguyen 7bfe120c21 mtmd, server, common: expose modalities to /v1/models (#22952)
* mtmd, server, common: expose modalities to /v1/models

* fix build

* rename to mtmd_caps
b9124
2026-05-12 19:08:07 +02:00
Masashi Yoshimura 927dada6c9 ggml-webgpu: Enables running gpt-oss-20b (#22906)
* Enable to run gpt-oss-20b and refactor mulmat-q

* disable test-backend-ops in ubuntu-24-webgpu
b9123
2026-05-12 07:27:40 -07:00
Chen Yuan 239a497e5f ggml-webgpu: address precision issues for multimodal (#22808)
* fix(mixed-types): use f32 for precision and update the shared memory calculation logic for f32

* fix(unary): correct the gelu, gelu quick and gelu erf functions

* fix(flash-attn-tile): fix the hardcode v type

* fix(flash_attn): fix tile path

* fix: pass editorconfig and address the type conflicts

* fix: remove reduant pipeline keys

* fix: remove inline min/max group size functions and revert the flash attn path order

* fix: use clamp to avoid NaN for GELU

* fix: use the right range for exp, 80 is safer for f32 exp
b9122
2026-05-12 07:27:04 -07:00
Daniel Bevenius 89730c8d26 model-conversion : add causal-convert-mmproj target [no ci] (#22969)
* model-conversion : add causal-convert-mmproj target [no ci]

This commit adds a new Make target that only converts the mmproj model.

The motivation for this that the causal-convert-mm-model target will
convert both the test model and the mmproj model which is nice when the
model model conversion is finalized. But during development it is nice
to be able to just convert the mmproj model and not have to wait for
the often more time consuming text model conversion.

* add path model path validation check
2026-05-12 15:15:40 +02:00
Georgi Gerganov fde69a3607 examples : add llama-eval (#21152)
* working llama-eval mc and math suite

* multi source llama-eval

* Add readme

* add checkpointing

* examples: add llama-server simulator for testing eval scripts

Add a standalone Python script that simulates a llama-server HTTP endpoint
for testing the eval script. The simulator:

- Implements /v1/chat/completions endpoint with OpenAI-compatible format
- Loads AIME dataset from HuggingFace with local caching
- Uses Levenshtein distance for intelligent question matching
- Supports configurable success rate for correct/wrong answer generation
- Provides debug logging for troubleshooting

Also includes test scripts and documentation for testing and understanding
the simulator functionality.

* examples: refactor test-simulator.sh for better readability

Extract repeating question string into TEST_QUESTION variable and
create make_request() helper function to reduce code duplication.
Add proper error handling for error responses.

* docs: update llama-eval-discussion.md with session work summary

Add summary of llama-server-simulator implementation work including
features, testing results, technical decisions, and refactoring.

* examples: add simplified llama-eval-new.py for AIME evaluation

- Create new simplified evaluation script focused only on AIME
- Implement EvalState and Processor dataclasses for structured state management
- Add real-time feedback showing correct/incorrect status per case
- Abstract grading interface for external grader support
- Use structured JSON output for eval state
- Apply HuggingFace dataset caching to avoid repeated downloads
- Remove Levenshtein matching - eval script only sends requests and validates answers

* docs: remove README.md from llama-eval

* examples: implement flexible grader system for answer validation

- Add Grader class supporting regex and CLI-based grading
- Implement built-in regex patterns for AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande
- Add CLI grader interface: python script.py --answer <pred> --expected <gold>
- Add HF telemetry disable to avoid warnings
- Support exact match requirement for regex patterns
- Add 30-second timeout for CLI grader
- Handle both boxed and plain text formats for AIME answers

* examples: use HF_HUB_OFFLINE to avoid HF Hub warnings

* examples: remove HF_HUB_OFFLINE to allow dataset download

* examples: use cached dataset path to avoid HF Hub requests

* examples: use cached dataset path in simulator to avoid HF Hub requests

* docs: update llama-eval-discussion.md with session work summary

* examples: add threading support and model parameter to llama-eval-new.py

- Add ThreadPoolExecutor for parallel request processing controlled by --threads
- Add --model argument to specify model name in request data
- Refactor process() to use thread-safe _process_single_case() method
- Update progress tracking to work with concurrent execution

* docs: update llama-eval-discussion.md with threading and model parameter updates

- Add threading support implementation details
- Document ThreadPoolExecutor usage and thread safety
- Add model parameter implementation details
- Include testing results for both features

* examples: add task summary table to llama-eval-new.py

* eval : print progress

* eval : add prompts

* test : fix path

* sim : fix answer matching

* eval : support multiple dataset runs

* minor

* improve grader

* docs

* remove old files

* datasets : add gsm8k

* add gpqa + sampling + docs

* rename

* grader : improve example answers

* cont

* datasets : add aime2025

* grader : update prompt

* grade : improve regex + logs

* datasets : fix aime2025

* cleanup

* add AGENTS.md

* ignore errors

* resume eval

* cleanup

* fix counts

* simplify

* fix prompts

* add html

* store full response

* add tokens

* resoning and error handling

* refactor

* track total time

* remove junk

* eval : unify "judge" terminology to "grader"

Replace all occurrences of "judge" with "grader" for consistency
across the codebase (CLI args, Grader class fields, help text).

Assisted-by: llama.cpp:local pi

* eval : add Wilson score confidence interval to results

Compute 95% CI on-the-fly from completed cases. Displayed in
terminal output, HTML report, and JSON state.

* llama-eval : add per-task generation speed from server timings

Extract predicted_per_second from the server timings response and store
it as tps_gen per task. Display in console progress, print_all_tasks,
and HTML report.

Assisted-by: llama.cpp:local pi

* llama-eval : add per-task generation time from server timings

Extract predicted_ms from the server timings response and store it as
t_gen_ms per task. Display in seconds with one decimal digit in console
progress, print_all_tasks, and HTML report.

Assisted-by: llama.cpp:local pi

* llama-eval : rename display, escaped, and count variables to use prefix convention

- _display suffix → display_ prefix (answer, tokens, tps, t_gen)
- _escaped suffix → escaped_ prefix (response, prompt, reasoning)
- _count suffix → n_ prefix (correct, incorrect, pending)

Assisted-by: llama.cpp:local pi

* llama-eval : support multiple evaluation endpoints with dynamic task distribution

- Add ServerConfig dataclass (url, threads, name)
- Accept comma-separated --server, --threads, --server-name CLI args
- Dynamic shared-queue task distribution across servers (fast servers do more work)
- One ThreadPoolExecutor per server, workers pull from shared Queue
- Track which server processed each task (server_name in results)
- Thread-safe EvalState with threading.Lock for concurrent mutations
- Server column in HTML report and console output
- Backward compatible: single server works as before

Assisted-by: llama.cpp:local pi

* llama-server-simulator : replace Flask with stdlib http.server

- Use HTTPServer + BaseHTTPRequestHandler instead of Flask
- RequestHandler handles POST /v1/chat/completions
- Server runs in daemon thread with clean Ctrl+C shutdown
- Remove flask and unused asdict imports

Assisted-by: llama.cpp:local pi

* llama-eval : update README with PR link and quick-start examples

Assisted-by: llama.cpp:local pi

* llama-eval : track model name in eval state and verify on resume

- Store model_name in EvalState and JSON output
- Display model in HTML summary table
- Verify --model matches stored model when resuming

Assisted-by: llama.cpp:local pi

* llama-server-simulator : fix comment - Dice coefficient, not Levenshtein

Assisted-by: llama.cpp:local pi

* llama-eval : require --grader-model or --model when using --grader-type llm

Assisted-by: llama.cpp:local pi

* llama-eval : protect dump() with lock for thread safety

Assisted-by: llama.cpp:local pi

* llama-eval : compact HTML report output

- Replace verbose summary table with single inline bar
- Shorten status text: '✓'/'✗'/'–'/'!' instead of full words
- Flatten CSS: remove box-shadows, border-radius, reduce padding
- Use system-ui font, 13px table, 12px details
- Conditional reasoning section (only shown when present)
- Single toggle JS function instead of two
- Shorter column headers

Assisted-by: llama.cpp:local pi

* llama-eval : check server connectivity on startup

- Hit /v1/models for each server before evaluation
- Exit with error if any server is unreachable
- Print comma-separated model IDs per server in startup output
- Sequential checks, no retries, no timeout override

Assisted-by: llama.cpp:local pi

* llama-eval : use server1/server2 instead of gpu1/gpu2 in README

Assisted-by: llama.cpp:local pi

---------

Co-authored-by: gatbontonpc <gatbontonpc@gmail.com>
2026-05-12 15:07:00 +03:00
Masato Nakasaka ef93e98d01 vulkan: Fix Windows performance regression on Intel GPU BF16 workloads for Xe2 and newer (#22461)
* refactor

* Use l_warptile only when coopamt is available for BF16
b9119
2026-05-12 12:15:34 +02:00
Jeff Bolz 706fbd8ab6 vulkan: Check shared memory size for mmq shaders (#22693) b9118 2026-05-12 11:41:58 +02:00
Sigbjørn Skjæret fa62042af9 ci : bump ty to 0.0.35 (#22961) 2026-05-12 11:34:10 +02:00
AesSedai 4178259130 mtmd: add MiMo v2.5 vision (#22883)
* mimo-v2.5: vision support

* mimo-v2.5: use fused qkv for vision

* mimi-v2.5: fix f16 vision overflow

* mimo-v2.5: comment cleanups

* mimo-v2.5: Flash doesn't have mmproj
more cleanup
remember to use filter_tensors

* mimo-v2.5: fix trailing whitespace
b9116
2026-05-12 11:11:14 +02:00
Jesus Talavera 78fbbc2c07 convert : add split() to LoraTorchTensor in LoRA converter (#22832)
* convert : add split() method to LoraTorchTensor

* Fix python type-check

* Fix flake8 Lint

* fix: handle positional dim arg in torch.split dispatch

* Fix type-check again

* Fix type-checks

* Remove unit test per reviewers feedback

* work around ty deficiency

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b9115
2026-05-12 08:17:04 +03:00
guyfischman da44953329 metal : promote mul_mv/mul_mm batch divisors to function constants (#22711)
* metal : promote mul_mv/mul_mm batch divisors to function constants

* metal : take op directly in get_pipeline_mul_mv_ext
b9114
2026-05-12 08:15:02 +03:00
Shawn Gu 1ec7ba0c14 opencl: add q4_1 MoE for Adreno (#22856)
* Q4_1 MoE CLC pass sanity check

* remove unnecessary code

* opencl: remove unnecessary asserts and reformat

* opencl: fix supports_op for q4_1 moe

* q4_1 moe is supported by Adreno with certain shapes

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
b9113
2026-05-11 11:57:26 -07:00
CrispStrobe 8e1f9d0834 CUDA: handle OW > 65535 in im2col (2D and 3D) (#22944)
`im2col_cuda` and `im2col_3d_cuda` both dispatch with
`block_nums.y = OW`. CUDA caps grid Y at 65535. Conv1d encoders on
raw 16 kHz audio with T > 65535 (~ 4 s) trip the limit -- e.g. SEANet
at 11 s lands at OW = 176000 -- and the launch returns
`invalid configuration argument`.

Clamp `block_nums.y` to `MIN(OW, MAX_GRIDDIM_Y)` and loop inside the
kernel with stride `MAX_GRIDDIM_Y`. Same in-kernel stride pattern
already used for the z axis (`MAX_GRIDDIM_Z`). Both 2D `im2col_kernel`
and 3D `im2col_3d_kernel` need the same fix. Bit-identical for
OW <= 65535 (single iteration of the new outer loop).

Tested on T4 / Jetson Orin with a SEANet encoder running on 11 s /
16 kHz audio (im2col reaching OW ~ 176000); pre-fix launch returns
`invalid configuration argument`, post-fix runs to completion.
Existing test-backend-ops im2col cases unchanged.
b9112
2026-05-11 19:48:29 +02:00
Pascal e936660760 Ggml/cuda snake fusion hardening (#22912)
* cuda: tighten snake fusion type checks for all operands (defensive, sync vulkan)

* cuda: reject snake fusion when ne[2] or ne[3] > 1 (mirror vulkan PR review)

* cuda: merge type_ok and types_ok into a single types_ok (address am17an review)

* cuda: filter ADD/SUB/MUL/DIV in supports_op to F32/F16

bin_bcast only dispatches F32/F16 type triplets, mirror the
vulkan filter so unsupported types fall back through cpy
instead of aborting.

* test-backend-ops: extend snake_fuse to rank-4 with ne[2]/ne[3] > 1 cases
2026-05-11 18:42:08 +02:00
willjoha ef22b3e4ac docs: fix metrics endpoint description in server README (#22879)
* docs: fix metrics endpoint description in server README

Required model query parameter for router mode described.

Removed metrics:
- llamacpp:kv_cache_usage_ratio
- llamacpp:kv_cache_tokens

Added metrics:
- llamacpp:prompt_seconds_total
- llamacpp:tokens_predicted_seconds_total
- llamacpp:n_decode_total
- llamacpp:n_busy_slots_per_decode

* server: fix metrics type for n_busy_slots_per_decode metric
b9110
2026-05-11 18:32:26 +02:00
Georgi Gerganov 68e7ea3eab spec : parallel drafting support (#22838)
* spec : refactor

* spec : drop support for incompatible vocabs

* spec : update common_speculative_init()

* cont : pass seq_id

* cont : dedup ctx_seq_rm_type

* server : sketch the ctx_dft decode loop

* server : draft prompt cache and checkpoints

* server : improve ctx names

* server, spec : transition to unified spec context

* cont : sync main and drft contexts

* cont : async drft eval when possible

* cont : handle non-ckpt models

* cont : pass correct n_past for drafting

* cont : process images throught the draft context

* spec : handle draft running out of context

* server : fix mtmd draft processing

* server : fix URL for draft model

* server : add comment

* server : clean-up + dry

* speculative-simple : update

* spec : fix n_past type

* server : fix slot ctx_drft ptr

* tools : update readme

* naming : improve consistency

* spec : refactor for multi-sequence speculative context

* cont : prepare params

* cont : prepare params

* spec : support parallel drafts

* server : support parallel drafting

* llama : reuse device buffers when possible

* server, spec : clean-up

* cont : clean-up

* cont : minor

* spec : reset `drafting` flag at the end

* spec : introduce `common_speculative_process()`

* spec : allow for multiple spec types (chain of speculators)

* replace old type field of type common_speculative_type in the
  common_params_speculative struct with a vector to allow multiple
  types to be specified

* introduce common_get_enabled_speculative_impls(const std::vector<enum common_speculative_type>)
  to figure out which implementations the user has enabled

* introduce common_speculative_type_from_names(const std::vector<std::string> & names)
  to parse the already user provided spec types

* all speculators run sequentially, best one wins (we verify its drafted tokens)

* maximize expected accepted tokens for current round by calculating the
  product between the probability of accepting current token (n_acc_tokens / n_gen_drafts)
  and the draft's length

---------

Co-authored-by: Petros Sideris <petros.sideris@nokia.com>
b9109
2026-05-11 19:09:43 +03:00
Kevin Pouget 928b486b0c ggml-virtgpu: Add a GHA build check (#22943)
* [ggml-virtgpu] Add a GHA build check

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-11 21:38:22 +08:00
Daniel Bevenius 7dbb0e998a examples : update args speculative-simple README.md [no ci] (#22938)
This commit updates the command line arguments to use the correct names
and values which are now required.

The motivation for this change is that currently running the example
command as is will generate the following errors:
```console
error while handling argument "--color": error: unknown value for --color: '--sampling-seq'

usage:
-co,   --color [on|off|auto]            Colorize output to distinguish prompt and user input from generations
                                        ('on', 'off', or 'auto', default: 'auto')
                                        'auto' enables colors when output is to a terminal

error while handling argument "-fa": error: unknown value for --flash-attn: '--temp'

usage:
-fa,   --flash-attn [on|off|auto]       set Flash Attention use ('on', 'off', or 'auto', default: 'auto')
                                        (env: LLAMA_ARG_FLASH_ATTN)

error while handling argument "--draft-max": the argument has been removed. use --spec-draft-n-max or --spec-ngram-mod-n-max

usage:
--draft, --draft-n, --draft-max N       the argument has been removed. use --spec-draft-n-max or
                                        --spec-ngram-mod-n-max
                                        (env: LLAMA_ARG_DRAFT_MAX)

error while handling argument "--draft-min": the argument has been removed. use --spec-draft-n-min or --spec-ngram-mod-n-min

usage:
--draft-min, --draft-n-min N            the argument has been removed. use --spec-draft-n-min or
                                        --spec-ngram-mod-n-min
                                        (env: LLAMA_ARG_DRAFT_MIN)
```
2026-05-11 14:00:57 +03:00
Jeff Bolz dd9280a664 vulkan: Support asymmetric FA in scalar/mmq/coopmat1 paths (#22589) b9106 2026-05-11 12:49:03 +02:00
Oliver Simons 8cef8201a1 CUDA: directly include cuda/iterator (#22936)
Before, we relied on a transient import from `cub/cub.cuh`, which is
bad practice to do as cub may not always expose cuda/iterator
b9105
2026-05-11 12:16:38 +02:00
Daniel Bevenius f5636f8fc7 convert : add image break token fallback (#22914)
* convert : add image break token fallback

This commit adds a image_break_token_id fallback for mistral where the
config contains a image_break_token_id of -1:
```console
  "vision_encoder": {
    "image_token_id": 10,
    "image_break_token_id": -1,
    ...
```
But the tokenizer.json has this token:
```console
115       "id": 12,
116       "content": "[IMG_BREAK]",
117       "single_word": false,
118       "lstrip": false,
119       "rstrip": false,
120       "normalized": false,
121       "special": true
122     },
```
If we look in convert_hf_to_gguf.py we have:
```python
        elif self.is_mistral_format:
            # hparams is already vision config here so norm_eps is only defined in global_config.
            self.hparams["norm_eps"] = self.global_config.get("norm_eps", None)
            assert self.hparams["norm_eps"] is not None, "norm_eps not found in params.json"
            if self.use_break_tok:
                self.img_break_tok_id = self.find_vparam(["image_break_token_id"])
```

The motivation for this is that currently converting this models
results in the following error:
```console
load_hparams: model size:         5131.60 MiB
load_hparams: metadata size:      0.15 MiB
clip_init: failed to load model 'models/mmproj-Mistral-Medium-3.5-128B.gguf': operator(): unable to find tensor v.token_embd.img_break

mtmd_init_from_file: error: Failed to load CLIP model from models/mmproj-Mistral-Medium-3.5-128B.gguf

Failed to load vision model from models/mmproj-Mistral-Medium-3.5-128B.gguf
```

With this fallback the model loads successfully.

Resolves: https://github.com/ggml-org/llama.cpp/issues/22901

* Revert "convert : add image break token fallback"

This reverts commit 292e40cfdf.

* convert : add image break token fallback

This commit adds a image_break_token_id fallback for mistral where the
config contains a image_break_token_id of -1:
```console
  "vision_encoder": {
    "image_token_id": 10,
    "image_break_token_id": -1,
    ...
```
But the tokenizer.json has this token:
```console
115       "id": 12,
116       "content": "[IMG_BREAK]",
117       "single_word": false,
118       "lstrip": false,
119       "rstrip": false,
120       "normalized": false,
121       "special": true
122     },
```
If we look in convert_hf_to_gguf.py we have:
```python
        elif self.is_mistral_format:
            # hparams is already vision config here so norm_eps is only defined in global_config.
            self.hparams["norm_eps"] = self.global_config.get("norm_eps", None)
            assert self.hparams["norm_eps"] is not None, "norm_eps not found in params.json"
            if self.use_break_tok:
                self.img_break_tok_id = self.find_vparam(["image_break_token_id"])
```

The motivation for this is that currently converting this models
results in the following error:
```console
load_hparams: model size:         5131.60 MiB
load_hparams: metadata size:      0.15 MiB
clip_init: failed to load model 'models/mmproj-Mistral-Medium-3.5-128B.gguf': operator(): unable to find tensor v.token_embd.img_break

mtmd_init_from_file: error: Failed to load CLIP model from models/mmproj-Mistral-Medium-3.5-128B.gguf

Failed to load vision model from models/mmproj-Mistral-Medium-3.5-128B.gguf
```

With this fallback the model loads successfully.

Co-authored-by: Pascal <admin@serveurperso.com>

Resolves: https://github.com/ggml-org/llama.cpp/issues/22901

* convert : allow zero value for img_break_tok_id
2026-05-11 12:07:17 +02:00
Alessandro de Oliveira Faria (A.K.A.CABELO) 838374375c vendor : update cpp-httplib to 0.44.0 (#22919) b9103 2026-05-11 08:47:13 +02:00
Neo Zhang 7d442abf5c [SYCL] Add OP im2col_3d (#22903)
* add im2col_3d

* format code

* update the ops.md
b9102
2026-05-11 08:01:47 +03:00
Georgi Gerganov 389ff61d77 server : print warning when HTTP timeout exceeded (#22907) b9101 2026-05-10 22:00:18 +03:00
Tim Neumann 2e97c5f96f backend sampling: support returning post-sampling probs (#22622)
* server: Never return 0.0 post-sampling probabilities

* backend sampling: support returning post-sampling probs
b9100
2026-05-10 19:12:02 +02:00
Alessandro de Oliveira Faria (A.K.A.CABELO) 5d5d2e15d2 vendor : update cpp-httplib to 0.43.4 (#22888) b9099 2026-05-10 18:46:54 +02:00
Oliver Walsh 2b2babd124 ggml-virtgpu : include missing mutex header (#22810)
Add missing `#include <mutex>` in ggml-backend-device.cpp.

Fixes: #22809

Signed-off-by: Oliver Walsh <owalsh@redhat.com>
2026-05-10 17:32:41 +02:00
Georgi Gerganov 0b047287fe sync : ggml b9097 2026-05-10 17:00:11 +03:00
Georgi Gerganov efbada936f ggml : bump version to 0.11.1 (ggml/1484) 2026-05-10 17:00:11 +03:00