llama.cpp

Author	SHA1	Message	Date
shahondin1624	cdd851c05a	mtp: squash-merge am17an/mtp-clean (upstream PR #22673)
Squashes 8 commits from https://github.com/am17an/llama.cpp.git mtp-clean adding Multi-Token Prediction speculative decoding support, primarily for Qwen3.5 / Qwen3.6 models with native MTP heads.
Mode: COMMON_SPECULATIVE_TYPE_DRAFT_MTP (CLI: --draft-mtp).
Per PR: ~1.8-2x speedup with ~75% draft acceptance using 3 draft tokens. Currently requires --parallel 1.
Files touched: 26 (1319 insertions, 136 deletions). Hot spots:
- src/models/qwen35.cpp, qwen35moe.cpp: MTP head integration
- common/speculative.{cpp,h}: new MTP draft path
- src/llama-{context,model,memory}.*: MTP-specific KV cache
- convert_hf_to_gguf.py: converter writes MTP head tensors
Merge applied cleanly against current master (TurboQuant tip `7161dee3f`). Source branch HEAD: `e7b484815` (add need_embd in speculative).
Note: upstream PR remains open as of this merge; performance regressions were under discussion. Watch upstream for the final API.	2026-05-14 01:03:05 +02:00
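As a rough sanity check on the figures quoted in the MTP commit message (3 draft tokens, ~75% acceptance, ~1.8-2x speedup), the sketch below computes the expected number of tokens emitted per verification pass. It assumes a simplified model that is not taken from the PR: each draft token is accepted independently with the same probability, and drafting stops at the first rejection. The function name `expected_tokens_per_pass` is hypothetical. The result is an upper bound on speedup because it ignores the cost of running the MTP head and of verification.

```python
def expected_tokens_per_pass(p_accept: float, n_draft: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Assumption (not from the PR): every draft token is accepted
    independently with probability p_accept, and the first rejection
    discards the remaining drafts. The target model always contributes
    one token itself, so the expectation is 1 + p + p^2 + ... + p^n.
    """
    if not 0.0 <= p_accept <= 1.0:
        raise ValueError("p_accept must be in [0, 1]")
    if n_draft < 0:
        raise ValueError("n_draft must be non-negative")
    return sum(p_accept ** i for i in range(n_draft + 1))


if __name__ == "__main__":
    # Figures quoted in the commit message: 3 draft tokens, ~75% acceptance.
    print(expected_tokens_per_pass(0.75, 3))  # 2.734375
```

Under this model the ideal gain is about 2.7 tokens per pass, so a measured 1.8-2x speedup is plausible once drafting and verification overhead are subtracted. The PR may define "acceptance" differently, in which case the numbers shift.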
shahondin1624	7161dee3f3	turboquant: post-merge integration fixes from test validation
Two fixes surfaced by running the full test suite against the squash-merged turboquant branch, plus one CMake registration.
1. ggml-cuda/ggml-cuda.cu (GET_ROWS supports_op)
   Removed TQ3_1S/TQ4_1S from the CUDA/HIP GET_ROWS supports_op switch. TheTom's branch advertised these as supported but never added the matching cases to getrows.cu, a latent bug present on both his branch and master. master's test-backend-ops triggers it; the scheduler will now route get_rows on TQ types to CPU.
2. ggml-cuda/fattn.cu (HIP head-size gate)
   Master's get_best_fattn_kernel falls through to BEST_FATTN_KERNEL_TILE as default. On HIP, fattn-tile.cu only instantiates head sizes 64, 128, 256, 320, 512 (576/640 exceed local memory limits per #ifndef GGML_USE_HIP). Without this gate, supports_op returns true for unsupported sizes and the dispatch aborts. Now returns BEST_FATTN_KERNEL_NONE on HIP for head sizes the tile kernel cannot compile, letting the scheduler fall back to CPU.
3. tests/CMakeLists.txt (test-turbo-quant registration)
   TheTom added tests/test-turbo-quant.c (CPU round-trip diagnostic for turbo3/turbo4 quant→dequant→inverse-WHT) but never wired it into the build. Registered as a ctest entry linked against ggml + libm.
Test status with these fixes:
- CPU (build-cpu): 51/51 ctest pass, including new test-turbo-quant.
- HIP (build-hip, gfx1151): 50/50 ctest pass with GGML_CUDA_DISABLE_GRAPHS=1 and test-backend-ops excluded. test-backend-ops itself runs 13674/13677 internal cases; the 3 remaining failures (CLAMP f16 → inf, bf16 FA graph capture) are pre-existing master-side regressions on RDNA3.5+HIP that reproduce on plain master and are unrelated to TurboQuant.	2026-05-14 00:38:58 +02:00
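The HIP head-size gate described in fix 2 of the commit above can be illustrated with a small model of the decision. This is a Python sketch of the logic only, not the C++ in ggml-cuda/fattn.cu: the enum, the function name `pick_fallback_kernel`, and the `is_hip` parameter are hypothetical, and the non-HIP branch is simplified to the unconditional tile default that the message describes. Only the list of HIP head sizes and the two kernel names come from the commit message.

```python
from enum import Enum


class FattnKernel(Enum):
    # Named after BEST_FATTN_KERNEL_NONE / BEST_FATTN_KERNEL_TILE in the message.
    NONE = "BEST_FATTN_KERNEL_NONE"
    TILE = "BEST_FATTN_KERNEL_TILE"


# Head sizes the tile kernel instantiates on HIP, per the commit message.
HIP_TILE_HEAD_SIZES = frozenset({64, 128, 256, 320, 512})


def pick_fallback_kernel(head_size: int, is_hip: bool) -> FattnKernel:
    """Model of the default branch after the fix (hypothetical helper).

    Before the fix the default was always TILE. After it, HIP builds
    report NONE for head sizes the tile kernel was never compiled for,
    so supports_op can return false and the scheduler falls back to CPU.
    """
    if is_hip and head_size not in HIP_TILE_HEAD_SIZES:
        return FattnKernel.NONE
    # Simplification: other backends keep the unconditional tile default.
    return FattnKernel.TILE


if __name__ == "__main__":
    for size in (128, 320, 576, 640):
        print(size, pick_fallback_kernel(size, is_hip=True).value)
```

Returning a "no kernel" value from the selector, rather than asserting later in the dispatch, keeps the capability query and the dispatch in agreement, which is what prevents the abort.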
shahondin1624	15a6a36b59	turboquant: squash-merge TheTom/llama-cpp-turboquant feature/turboquant-kv-cache
Squashes the entire TurboQuant KV-cache feature branch from https://github.com/TheTom/llama-cpp-turboquant (tip `5aeb2fdbe`) onto our master.
Includes: TurboQuant KV-cache types (turbo2_0, turbo3_0, turbo4_0, tq3_1s, tq4_1s), GGML_OP_TURBO_WHT op, CUDA + Metal kernels (including TQ-rotated mul_mm path), CPU reference paths, HIP template instances, perplexity tooling, and 18 post-upstream-sync fixes (CVE-2026-21869 server clamp, HIP FA pool retention, n_head_v reshape, sparse-V CUDA gating, etc.).
Conflict-resolution notes (review carefully before depending on these paths):
- common/arg.cpp, common/speculative.cpp: master's refactored speculative API kept (params.speculative.types / ngram_mod struct, per-sinfo n_low/i_last).
- ggml-cuda/fattn.cu: head-size exclusion lists unioned (now exclude both 192 and 640 alongside other sizes).
- ggml-cuda/ggml-cuda.cu: both master's ADD/SUB/MUL/DIV F16 widening AND TurboQuant's GGML_OP_TURBO_WHT support cases kept.
- ggml-metal-device.h/.cpp: master's new get_pipeline_mul_mv_ext signature (const ggml_tensor * op) kept; TurboQuant's get_pipeline_turbo_wht added.
- ggml-metal-ops.cpp: TurboQuant's TQ-rotated mul_mm path preserved; non-TQ else-branch adapted to master's pipeline.nr0/nr1/nsg dispatch API.
- ggml-vulkan.cpp: master's spec-constant-driven flash_attn pipeline iteration taken (over TurboQuant's CREATE_FA-per-type macro approach). TURBO3_0 added to the fa_kv_ok lambda for type validation.
- ggml-vulkan/flash_attn_base.glsl, vulkan-shaders-gen.cpp: master's new spec-constant FA shader generation kept; TurboQuant's DATA_A_TURBO3_0 macro path NOT carried over.
  * Vulkan TURBO3_0 flash-attention paths need re-implementation against the new spec-constant API.
  * Vulkan TURBO3_0 inference will likely fail until that work is redone.
Squash base: `7fc1c4ef78` (TheTom's last upstream merge point).	2026-05-13 23:01:46 +02:00
shaofeiqi	ec562eb673	opencl: add q5_0 and q5_1 MoE for Adreno (#22985 ) * opencl: add q5_0 moe support * opencl: add q5_1 moe support * opencl: avoid potential leak * opencl: suppress unused var warning when building for non-Adreno --------- Co-authored-by: Li He <lih@qti.qualcomm.com>	2026-05-13 11:57:31 -07:00
Pascal	95d469a915	server, webui: accept continue_final_message flag for vLLM API compat (#23012 ) * server, webui: accept continue_final_message flag for vLLM API compat Add the continue_final_message body flag from the vLLM and transformers API. When set together with add_generation_prompt false, it triggers the existing prefill_assistant code path, regardless of the server side opt.prefill_assistant option. Mutual exclusion with add_generation_prompt true is enforced, matching vLLM behavior. WebUI sends continue_final_message and add_generation_prompt false on the Continue button, with the matching opt in option on the chat service. Pure API alignment, no change to the prefill logic itself. Paves the way for the upcoming per-template prefill plumbing in common/chat. * test: add coverage for continue_final_message vLLM compat flag Two cases on top of the existing assistant prefill coverage. First, continue_final_message true with add_generation_prompt false produces the same rendered prompt as the prefill_assistant heuristic, proving the new flag is a correct alias of the existing path. Second, both flags set to true is rejected with HTTP 400, matching the vLLM/transformers mutual exclusion contract. * chore: update webui build output	2026-05-13 20:47:58 +02:00
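The flag interaction described in the continue_final_message commit above can be sketched as a small request check. This is an illustrative Python model, not the server's C++ code: the function name `wants_assistant_prefill`, the `BadRequest` class, and the default of `True` for a missing `add_generation_prompt` are assumptions. The existing prefill heuristic is reduced to a single boolean standing in for the server-side prefill_assistant option.

```python
class BadRequest(Exception):
    """Hypothetical error type standing in for an HTTP 400 response."""

    status = 400


def wants_assistant_prefill(body: dict, server_prefill_assistant: bool) -> bool:
    """Decide whether the assistant-prefill path should be taken.

    Rules taken from the commit message:
      * continue_final_message=true with add_generation_prompt=false
        triggers prefill regardless of the server-side option.
      * continue_final_message=true with add_generation_prompt=true
        is rejected (mutual exclusion, HTTP 400).
    Assumption: a missing add_generation_prompt is treated as true.
    """
    continue_final = bool(body.get("continue_final_message", False))
    add_gen_prompt = bool(body.get("add_generation_prompt", True))

    if continue_final and add_gen_prompt:
        raise BadRequest(
            "continue_final_message and add_generation_prompt are mutually exclusive"
        )
    if continue_final:
        return True
    # Otherwise defer to the server-side option (simplified heuristic).
    return server_prefill_assistant


if __name__ == "__main__":
    request = {"continue_final_message": True, "add_generation_prompt": False}
    print(wants_assistant_prefill(request, server_prefill_assistant=False))
```

Both cases covered by the commit's tests appear here: the explicit flag yields the same decision as the existing prefill path, and setting both flags to true is rejected.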
lhez	1e4579fbb8	opencl: fix crash when warming up MoE on Adreno (#22876 )	2026-05-13 11:24:33 -07:00
Masashi Yoshimura	527045bfb0	flush the gpu profile timestamp before the queryset is overflowed (#22995 )	2026-05-13 10:22:44 -07:00
Aleksander Grygier	2dfeca31cc	webui: Deduplicate model aliases in data + handle single/multiple aliases in UI (#22979 ) * fix: Deduplicate aliases + display single alias instead of default name or 2+ aliases as tags * refactor: Address review comments	2026-05-13 16:39:36 +02:00
Pascal	46be24d121	webui: preserve system message on edit cancel (#22911 ) * webui: preserve system message on edit cancel when content is not the placeholder * chore: update webui build output	2026-05-13 16:16:02 +02:00
Ravi Panchumarthy	7e16646015	docs : Update OPENVINO.md (#22959 ) Updated OPENVINO.md with Validated models and quantizations Co-authored-by: Haarika Madaka <haarika.madaka@intel.com>	2026-05-13 17:12:15 +03:00
Max Krasnyansky	ad96bb8c0c	hexagon: add unary tanh op (#22999 )	2026-05-13 06:59:28 -07:00
Xuan-Son Nguyen	e75cd5efb5	download: do not exit() on error (#23008 ) b9134	2026-05-13 15:14:58 +02:00
Pascal	5d44db6008	server, webui: support continue generation on reasoning models (#22727 ) * server, webui : support continue generation on reasoning models (#22727) Remove the throw blocking assistant prefill on reasoning models and orchestrate thinking tags around the prefilled message so the parser routes the next stream chunks correctly. WebUI drops the reasoning guard on the Continue button, sends reasoning_content with the prefilled message and persists partial reasoning on stop so the CoT survives reload and resume. Scope : templates with a simple thinking_start_tag / thinking_end_tag pair. Channel-based templates like GPT-OSS are out of scope, pending a per-template prefill API in common/chat. First step toward #21754. * chore: update webui build output * server: reject reasoning prefill on channel based templates b9133	2026-05-13 11:09:51 +02:00
Xuan-Son Nguyen	3796c94bad	ci: validate model naming convention (#22680 ) * ci: validate model naming convention * bring back dedicated ec workflow * add missing jobs --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-05-13 10:59:37 +02:00
Georgi Gerganov	634275fbbb	spec : update CLI arguments for better consistency (#22964 ) * spec : update CLI arguments for better consistency * cont : fix CLI arg message b9131	2026-05-13 09:15:39 +03:00
Sigbjørn Skjæret	bcfe63fc53	llama-eval : enable type check (#22988 )	2026-05-13 09:14:24 +03:00
Sachin Sharma	61af07c22d	ggml-zendnn : adaptive fallback to CPU backend for small batch sizes (#22681 ) * ggml-zendnn : add runtime env var GGML_ZENDNN_ADAPTIVE_FALLBACK to control adaptive fallback (default: enabled) * ggml-zendnn : restore original fallback logic when adaptive fallback is disabled b9129	2026-05-13 09:13:47 +03:00
Trivikram Reddy	856c3adac1	hexagon: eliminate scalar VTCM loads via HVX splat helpers (#22993 ) * hexagon: add hvx_vec_repl helpers and use those for splat-from-vtcm usecase * hmx-mm: optimize per-group scale handling * hmx-fa: optimize slope load from vtcm * hmx-fa: use aligned access where possible in hmx-utils * hexagon: add hvx_vec_repl_2x_f16 helper and consolidate repl helpers --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com> b9128	2026-05-12 17:28:02 -07:00
yzyyzyhhh	a9883db8ee	opencl: add opt-in Adreno xmem F16xF32 GEMM for prefill (#22755 ) * ggml-opencl: add Adreno xmem F16xF32 GEMM for prefill * ggml-opencl: address Adreno xmem review comments * ggml-opencl: align xmem gemm kernel naming --------- Co-authored-by: Your Name <your@email.com> b9127	2026-05-12 13:10:37 -07:00
fredzillman	cce09f0b2b	convert : fix Pixtral 12B --mistral-format conversion (3 bugs) (#22981 )	2026-05-12 21:46:01 +02:00
Aleksander Grygier	dded58b450	webui: Fix Chat Screen Form box disappearing + autoscroll issues on WebKit (#22977 ) * debug: Scroll/Sticky issues * fix: UI improvements * refactor: Remove unneeded logic * fix: Better logic for initial load of messages	2026-05-12 20:41:11 +02:00
Xuan-Son Nguyen	7bfe120c21	mtmd, server, common: expose modalities to /v1/models (#22952 ) * mtmd, server, common: expose modalities to /v1/models * fix build * rename to mtmd_caps b9124	2026-05-12 19:08:07 +02:00
Masashi Yoshimura	927dada6c9	ggml-webgpu: Enables running gpt-oss-20b (#22906 ) * Enable to run gpt-oss-20b and refactor mulmat-q * disable test-backend-ops in ubuntu-24-webgpu b9123	2026-05-12 07:27:40 -07:00
Chen Yuan	239a497e5f	ggml-webgpu: address precision issues for multimodal (#22808 ) * fix(mixed-types): use f32 for precision and update the shared memory calculation logic for f32 * fix(unary): correct the gelu, gelu quick and gelu erf functions * fix(flash-attn-tile): fix the hardcode v type * fix(flash_attn): fix tile path * fix: pass editorconfig and address the type conflicts * fix: remove reduant pipeline keys * fix: remove inline min/max group size functions and revert the flash attn path order * fix: use clamp to avoid NaN for GELU * fix: use the right range for exp, 80 is safer for f32 exp b9122	2026-05-12 07:27:04 -07:00
Daniel Bevenius	89730c8d26	model-conversion : add causal-convert-mmproj target [no ci] (#22969 ) * model-conversion : add causal-convert-mmproj target [no ci] This commit adds a new Make target that only converts the mmproj model. The motivation for this is that the causal-convert-mm-model target converts both the text model and the mmproj model, which is nice once the model conversion is finalized. But during development it is nice to be able to convert just the mmproj model and not have to wait for the often more time-consuming text model conversion. * add path model path validation check	2026-05-12 15:15:40 +02:00
Georgi Gerganov	fde69a3607	examples : add llama-eval (#21152 ) * working llama-eval mc and math suite * multi source llama-eval * Add readme * add checkpointing * examples: add llama-server simulator for testing eval scripts Add a standalone Python script that simulates a llama-server HTTP endpoint for testing the eval script. The simulator: - Implements /v1/chat/completions endpoint with OpenAI-compatible format - Loads AIME dataset from HuggingFace with local caching - Uses Levenshtein distance for intelligent question matching - Supports configurable success rate for correct/wrong answer generation - Provides debug logging for troubleshooting Also includes test scripts and documentation for testing and understanding the simulator functionality. * examples: refactor test-simulator.sh for better readability Extract repeating question string into TEST_QUESTION variable and create make_request() helper function to reduce code duplication. Add proper error handling for error responses. * docs: update llama-eval-discussion.md with session work summary Add summary of llama-server-simulator implementation work including features, testing results, technical decisions, and refactoring. 
* examples: add simplified llama-eval-new.py for AIME evaluation - Create new simplified evaluation script focused only on AIME - Implement EvalState and Processor dataclasses for structured state management - Add real-time feedback showing correct/incorrect status per case - Abstract grading interface for external grader support - Use structured JSON output for eval state - Apply HuggingFace dataset caching to avoid repeated downloads - Remove Levenshtein matching - eval script only sends requests and validates answers * docs: remove README.md from llama-eval * examples: implement flexible grader system for answer validation - Add Grader class supporting regex and CLI-based grading - Implement built-in regex patterns for AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande - Add CLI grader interface: python script.py --answer <pred> --expected <gold> - Add HF telemetry disable to avoid warnings - Support exact match requirement for regex patterns - Add 30-second timeout for CLI grader - Handle both boxed and plain text formats for AIME answers * examples: use HF_HUB_OFFLINE to avoid HF Hub warnings * examples: remove HF_HUB_OFFLINE to allow dataset download * examples: use cached dataset path to avoid HF Hub requests * examples: use cached dataset path in simulator to avoid HF Hub requests * docs: update llama-eval-discussion.md with session work summary * examples: add threading support and model parameter to llama-eval-new.py - Add ThreadPoolExecutor for parallel request processing controlled by --threads - Add --model argument to specify model name in request data - Refactor process() to use thread-safe _process_single_case() method - Update progress tracking to work with concurrent execution * docs: update llama-eval-discussion.md with threading and model parameter updates - Add threading support implementation details - Document ThreadPoolExecutor usage and thread safety - Add model parameter implementation details - Include testing results for both features * 
examples: add task summary table to llama-eval-new.py * eval : print progress * eval : add prompts * test : fix path * sim : fix answer matching * eval : support multiple dataset runs * minor * improve grader * docs * remove old files * datasets : add gsm8k * add gpqa + sampling + docs * rename * grader : improve example answers * cont * datasets : add aime2025 * grader : update prompt * grade : improve regex + logs * datasets : fix aime2025 * cleanup * add AGENTS.md * ignore errors * resume eval * cleanup * fix counts * simplify * fix prompts * add html * store full response * add tokens * resoning and error handling * refactor * track total time * remove junk * eval : unify "judge" terminology to "grader" Replace all occurrences of "judge" with "grader" for consistency across the codebase (CLI args, Grader class fields, help text). Assisted-by: llama.cpp:local pi * eval : add Wilson score confidence interval to results Compute 95% CI on-the-fly from completed cases. Displayed in terminal output, HTML report, and JSON state. * llama-eval : add per-task generation speed from server timings Extract predicted_per_second from the server timings response and store it as tps_gen per task. Display in console progress, print_all_tasks, and HTML report. Assisted-by: llama.cpp:local pi * llama-eval : add per-task generation time from server timings Extract predicted_ms from the server timings response and store it as t_gen_ms per task. Display in seconds with one decimal digit in console progress, print_all_tasks, and HTML report. 
Assisted-by: llama.cpp:local pi * llama-eval : rename display, escaped, and count variables to use prefix convention - _display suffix → display_ prefix (answer, tokens, tps, t_gen) - _escaped suffix → escaped_ prefix (response, prompt, reasoning) - _count suffix → n_ prefix (correct, incorrect, pending) Assisted-by: llama.cpp:local pi * llama-eval : support multiple evaluation endpoints with dynamic task distribution - Add ServerConfig dataclass (url, threads, name) - Accept comma-separated --server, --threads, --server-name CLI args - Dynamic shared-queue task distribution across servers (fast servers do more work) - One ThreadPoolExecutor per server, workers pull from shared Queue - Track which server processed each task (server_name in results) - Thread-safe EvalState with threading.Lock for concurrent mutations - Server column in HTML report and console output - Backward compatible: single server works as before Assisted-by: llama.cpp:local pi * llama-server-simulator : replace Flask with stdlib http.server - Use HTTPServer + BaseHTTPRequestHandler instead of Flask - RequestHandler handles POST /v1/chat/completions - Server runs in daemon thread with clean Ctrl+C shutdown - Remove flask and unused asdict imports Assisted-by: llama.cpp:local pi * llama-eval : update README with PR link and quick-start examples Assisted-by: llama.cpp:local pi * llama-eval : track model name in eval state and verify on resume - Store model_name in EvalState and JSON output - Display model in HTML summary table - Verify --model matches stored model when resuming Assisted-by: llama.cpp:local pi * llama-server-simulator : fix comment - Dice coefficient, not Levenshtein Assisted-by: llama.cpp:local pi * llama-eval : require --grader-model or --model when using --grader-type llm Assisted-by: llama.cpp:local pi * llama-eval : protect dump() with lock for thread safety Assisted-by: llama.cpp:local pi * llama-eval : compact HTML report output - Replace verbose summary table with single 
inline bar - Shorten status text: '✓'/'✗'/'–'/'!' instead of full words - Flatten CSS: remove box-shadows, border-radius, reduce padding - Use system-ui font, 13px table, 12px details - Conditional reasoning section (only shown when present) - Single toggle JS function instead of two - Shorter column headers Assisted-by: llama.cpp:local pi * llama-eval : check server connectivity on startup - Hit /v1/models for each server before evaluation - Exit with error if any server is unreachable - Print comma-separated model IDs per server in startup output - Sequential checks, no retries, no timeout override Assisted-by: llama.cpp:local pi * llama-eval : use server1/server2 instead of gpu1/gpu2 in README Assisted-by: llama.cpp:local pi --------- Co-authored-by: gatbontonpc <gatbontonpc@gmail.com>	2026-05-12 15:07:00 +03:00
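The llama-eval commit above mentions a 95% Wilson score confidence interval computed on the fly from completed cases. A minimal sketch of that calculation follows, assuming the standard Wilson formula with z = 1.96; the function name `wilson_interval` and its handling of zero completed cases are illustrative assumptions, not taken from the script itself.

```python
import math


def wilson_interval(n_correct: int, n_total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if n_total == 0:
        # Assumption: with no completed cases the interval is uninformative.
        return (0.0, 1.0)
    p = n_correct / n_total
    z2 = z * z
    denom = 1.0 + z2 / n_total
    center = (p + z2 / (2.0 * n_total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n_total + z2 / (4.0 * n_total * n_total)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


if __name__ == "__main__":
    lo, hi = wilson_interval(8, 10)
    print(f"8/10 correct: [{lo:.3f}, {hi:.3f}]")
```

Unlike the naive normal approximation, the interval stays inside [0, 1] and remains usable when only a few cases have completed, which suits a running display.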
Masato Nakasaka	ef93e98d01	vulkan: Fix Windows performance regression on Intel GPU BF16 workloads for Xe2 and newer (#22461 ) * refactor * Use l_warptile only when coopmat is available for BF16 b9119	2026-05-12 12:15:34 +02:00
Jeff Bolz	706fbd8ab6	vulkan: Check shared memory size for mmq shaders (#22693 ) b9118	2026-05-12 11:41:58 +02:00
Sigbjørn Skjæret	fa62042af9	ci : bump ty to 0.0.35 (#22961 )	2026-05-12 11:34:10 +02:00
AesSedai	4178259130	mtmd: add MiMo v2.5 vision (#22883 ) * mimo-v2.5: vision support * mimo-v2.5: use fused qkv for vision * mimo-v2.5: fix f16 vision overflow * mimo-v2.5: comment cleanups * mimo-v2.5: Flash doesn't have mmproj more cleanup remember to use filter_tensors * mimo-v2.5: fix trailing whitespace b9116	2026-05-12 11:11:14 +02:00
Jesus Talavera	78fbbc2c07	convert : add split() to LoraTorchTensor in LoRA converter (#22832 ) * convert : add split() method to LoraTorchTensor * Fix python type-check * Fix flake8 Lint * fix: handle positional dim arg in torch.split dispatch * Fix type-check again * Fix type-checks * Remove unit test per reviewers feedback * work around ty deficiency --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b9115	2026-05-12 08:17:04 +03:00
guyfischman	da44953329	metal : promote mul_mv/mul_mm batch divisors to function constants (#22711 ) * metal : promote mul_mv/mul_mm batch divisors to function constants * metal : take op directly in get_pipeline_mul_mv_ext b9114	2026-05-12 08:15:02 +03:00
Shawn Gu	1ec7ba0c14	opencl: add q4_1 MoE for Adreno (#22856 ) * Q4_1 MoE CLC pass sanity check * remove unnecessary code * opencl: remove unnecessary asserts and reformat * opencl: fix supports_op for q4_1 moe * q4_1 moe is supported by Adreno with certain shapes --------- Co-authored-by: Li He <lih@qti.qualcomm.com> b9113	2026-05-11 11:57:26 -07:00
CrispStrobe	8e1f9d0834	CUDA: handle OW > 65535 in im2col (2D and 3D) (#22944 ) `im2col_cuda` and `im2col_3d_cuda` both dispatch with `block_nums.y = OW`. CUDA caps grid Y at 65535. Conv1d encoders on raw 16 kHz audio with T > 65535 (~ 4 s) trip the limit -- e.g. SEANet at 11 s lands at OW = 176000 -- and the launch returns `invalid configuration argument`. Clamp `block_nums.y` to `MIN(OW, MAX_GRIDDIM_Y)` and loop inside the kernel with stride `MAX_GRIDDIM_Y`. Same in-kernel stride pattern already used for the z axis (`MAX_GRIDDIM_Z`). Both 2D `im2col_kernel` and 3D `im2col_3d_kernel` need the same fix. Bit-identical for OW <= 65535 (single iteration of the new outer loop). Tested on T4 / Jetson Orin with a SEANet encoder running on 11 s / 16 kHz audio (im2col reaching OW ~ 176000); pre-fix launch returns `invalid configuration argument`, post-fix runs to completion. Existing test-backend-ops im2col cases unchanged. b9112	2026-05-11 19:48:29 +02:00
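The im2col fix above clamps `block_nums.y` to `MIN(OW, MAX_GRIDDIM_Y)` and loops inside the kernel with stride `MAX_GRIDDIM_Y`. The sketch below simulates that index pattern on the host in Python, purely to illustrate that every output column is visited exactly once; it is not the kernel code, and the function name `covered_indices` is hypothetical.

```python
MAX_GRIDDIM_Y = 65535  # CUDA's limit on gridDim.y, as stated in the commit


def covered_indices(ow: int, max_griddim_y: int = MAX_GRIDDIM_Y) -> list[int]:
    """Simulate which output-width indices a clamped grid visits."""
    grid_y = min(ow, max_griddim_y)  # clamped launch dimension
    visited = []
    for block_y in range(grid_y):    # stands in for blockIdx.y
        i = block_y
        while i < ow:                # in-kernel loop over the remaining columns
            visited.append(i)
            i += max_griddim_y       # stride by the grid limit
    return visited


if __name__ == "__main__":
    ow = 176000  # the SEANet example size quoted in the commit
    assert sorted(covered_indices(ow)) == list(range(ow))
    print("all", ow, "columns covered exactly once")
```

For `OW <= 65535` the inner loop runs a single iteration per block, which matches the commit's note that results are bit-identical in that range.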
Pascal	e936660760	Ggml/cuda snake fusion hardening (#22912 ) * cuda: tighten snake fusion type checks for all operands (defensive, sync vulkan) * cuda: reject snake fusion when ne[2] or ne[3] > 1 (mirror vulkan PR review) * cuda: merge type_ok and types_ok into a single types_ok (address am17an review) * cuda: filter ADD/SUB/MUL/DIV in supports_op to F32/F16 bin_bcast only dispatches F32/F16 type triplets, mirror the vulkan filter so unsupported types fall back through cpy instead of aborting. * test-backend-ops: extend snake_fuse to rank-4 with ne[2]/ne[3] > 1 cases	2026-05-11 18:42:08 +02:00
willjoha	ef22b3e4ac	docs: fix metrics endpoint description in server README (#22879 ) * docs: fix metrics endpoint description in server README Required model query parameter for router mode described. Removed metrics: - llamacpp:kv_cache_usage_ratio - llamacpp:kv_cache_tokens Added metrics: - llamacpp:prompt_seconds_total - llamacpp:tokens_predicted_seconds_total - llamacpp:n_decode_total - llamacpp:n_busy_slots_per_decode * server: fix metrics type for n_busy_slots_per_decode metric b9110	2026-05-11 18:32:26 +02:00
Georgi Gerganov	68e7ea3eab	spec : parallel drafting support (#22838 ) * spec : refactor * spec : drop support for incompatible vocabs * spec : update common_speculative_init() * cont : pass seq_id * cont : dedup ctx_seq_rm_type * server : sketch the ctx_dft decode loop * server : draft prompt cache and checkpoints * server : improve ctx names * server, spec : transition to unified spec context * cont : sync main and drft contexts * cont : async drft eval when possible * cont : handle non-ckpt models * cont : pass correct n_past for drafting * cont : process images throught the draft context * spec : handle draft running out of context * server : fix mtmd draft processing * server : fix URL for draft model * server : add comment * server : clean-up + dry * speculative-simple : update * spec : fix n_past type * server : fix slot ctx_drft ptr * tools : update readme * naming : improve consistency * spec : refactor for multi-sequence speculative context * cont : prepare params * cont : prepare params * spec : support parallel drafts * server : support parallel drafting * llama : reuse device buffers when possible * server, spec : clean-up * cont : clean-up * cont : minor * spec : reset `drafting` flag at the end * spec : introduce `common_speculative_process()` * spec : allow for multiple spec types (chain of speculators) * replace old type field of type common_speculative_type in the common_params_speculative struct with a vector to allow multiple types to be specified * introduce common_get_enabled_speculative_impls(const std::vector<enum common_speculative_type>) to figure out which implementations the user has enabled * introduce common_speculative_type_from_names(const std::vector<std::string> & names) to parse the already user provided spec types * all speculators run sequentially, best one wins (we verify its drafted tokens) * maximize expected accepted tokens for current round by calculating the product between the probability of accepting current token 
(n_acc_tokens / n_gen_drafts) and the draft's length --------- Co-authored-by: Petros Sideris <petros.sideris@nokia.com> b9109	2026-05-11 19:09:43 +03:00
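The commit above describes running all speculators sequentially and keeping the draft with the largest product of acceptance probability (`n_acc_tokens / n_gen_drafts`) and draft length. A rough sketch of that selection rule is shown below; the `DraftCandidate` class, the helper names, and the choice to score a speculator with no history as 0 are assumptions made for illustration, not the llama.cpp implementation.

```python
from dataclasses import dataclass, field


@dataclass
class DraftCandidate:
    name: str                 # hypothetical label for the speculator
    n_acc_tokens: int         # accepted-token counter named in the commit
    n_gen_drafts: int         # generated-draft counter named in the commit
    draft: list[int] = field(default_factory=list)  # drafted token ids


def expected_accepted(c: DraftCandidate) -> float:
    """Score = (n_acc_tokens / n_gen_drafts) * draft length."""
    if c.n_gen_drafts == 0:
        return 0.0  # assumption: no history means no expected gain
    return (c.n_acc_tokens / c.n_gen_drafts) * len(c.draft)


def pick_best(candidates: list[DraftCandidate]) -> DraftCandidate:
    """Return the candidate whose draft maximizes the score."""
    return max(candidates, key=expected_accepted)


if __name__ == "__main__":
    cands = [
        DraftCandidate("ngram", 30, 100, list(range(8))),  # 0.3 * 8 = 2.4
        DraftCandidate("draft", 60, 100, list(range(5))),  # 0.6 * 5 = 3.0
    ]
    print(pick_best(cands).name)
```

The rule trades length against reliability: a long draft from a speculator that is rarely accepted can lose to a shorter draft from a more accurate one.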
Kevin Pouget	928b486b0c	ggml-virtgpu: Add a GHA build check (#22943 ) * [ggml-virtgpu] Add a GHA build check * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-05-11 21:38:22 +08:00
Daniel Bevenius	7dbb0e998a	examples : update args speculative-simple README.md [no ci] (#22938 ) This commit updates the command line arguments to use the correct names and values which are now required. The motivation for this change is that currently running the example command as is will generate the following errors: ```console error while handling argument "--color": error: unknown value for --color: '--sampling-seq' usage: -co, --color [on\|off\|auto] Colorize output to distinguish prompt and user input from generations ('on', 'off', or 'auto', default: 'auto') 'auto' enables colors when output is to a terminal error while handling argument "-fa": error: unknown value for --flash-attn: '--temp' usage: -fa, --flash-attn [on\|off\|auto] set Flash Attention use ('on', 'off', or 'auto', default: 'auto') (env: LLAMA_ARG_FLASH_ATTN) error while handling argument "--draft-max": the argument has been removed. use --spec-draft-n-max or --spec-ngram-mod-n-max usage: --draft, --draft-n, --draft-max N the argument has been removed. use --spec-draft-n-max or --spec-ngram-mod-n-max (env: LLAMA_ARG_DRAFT_MAX) error while handling argument "--draft-min": the argument has been removed. use --spec-draft-n-min or --spec-ngram-mod-n-min usage: --draft-min, --draft-n-min N the argument has been removed. use --spec-draft-n-min or --spec-ngram-mod-n-min (env: LLAMA_ARG_DRAFT_MIN) ```	2026-05-11 14:00:57 +03:00
Jeff Bolz	dd9280a664	vulkan: Support asymmetric FA in scalar/mmq/coopmat1 paths (#22589 ) b9106	2026-05-11 12:49:03 +02:00
Oliver Simons	8cef8201a1	CUDA: directly include cuda/iterator (#22936 ) Before, we relied on a transient import from `cub/cub.cuh`, which is bad practice to do as cub may not always expose cuda/iterator b9105	2026-05-11 12:16:38 +02:00
Daniel Bevenius	f5636f8fc7	convert : add image break token fallback (#22914 ) * convert : add image break token fallback This commit adds a image_break_token_id fallback for mistral where the config contains a image_break_token_id of -1: ```console "vision_encoder": { "image_token_id": 10, "image_break_token_id": -1, ... ``` But the tokenizer.json has this token: ```console 115 "id": 12, 116 "content": "[IMG_BREAK]", 117 "single_word": false, 118 "lstrip": false, 119 "rstrip": false, 120 "normalized": false, 121 "special": true 122 }, ``` If we look in convert_hf_to_gguf.py we have: ```python elif self.is_mistral_format: # hparams is already vision config here so norm_eps is only defined in global_config. self.hparams["norm_eps"] = self.global_config.get("norm_eps", None) assert self.hparams["norm_eps"] is not None, "norm_eps not found in params.json" if self.use_break_tok: self.img_break_tok_id = self.find_vparam(["image_break_token_id"]) ``` The motivation for this is that currently converting this models results in the following error: ```console load_hparams: model size: 5131.60 MiB load_hparams: metadata size: 0.15 MiB clip_init: failed to load model 'models/mmproj-Mistral-Medium-3.5-128B.gguf': operator(): unable to find tensor v.token_embd.img_break mtmd_init_from_file: error: Failed to load CLIP model from models/mmproj-Mistral-Medium-3.5-128B.gguf Failed to load vision model from models/mmproj-Mistral-Medium-3.5-128B.gguf ``` With this fallback the model loads successfully. Resolves: https://github.com/ggml-org/llama.cpp/issues/22901 * Revert "convert : add image break token fallback" This reverts commit `292e40cfdf`. * convert : add image break token fallback This commit adds a image_break_token_id fallback for mistral where the config contains a image_break_token_id of -1: ```console "vision_encoder": { "image_token_id": 10, "image_break_token_id": -1, ... 
``` But the tokenizer.json has this token: ```console 115 "id": 12, 116 "content": "[IMG_BREAK]", 117 "single_word": false, 118 "lstrip": false, 119 "rstrip": false, 120 "normalized": false, 121 "special": true 122 }, ``` If we look in convert_hf_to_gguf.py we have: ```python elif self.is_mistral_format: # hparams is already vision config here so norm_eps is only defined in global_config. self.hparams["norm_eps"] = self.global_config.get("norm_eps", None) assert self.hparams["norm_eps"] is not None, "norm_eps not found in params.json" if self.use_break_tok: self.img_break_tok_id = self.find_vparam(["image_break_token_id"]) ``` The motivation for this is that currently converting this models results in the following error: ```console load_hparams: model size: 5131.60 MiB load_hparams: metadata size: 0.15 MiB clip_init: failed to load model 'models/mmproj-Mistral-Medium-3.5-128B.gguf': operator(): unable to find tensor v.token_embd.img_break mtmd_init_from_file: error: Failed to load CLIP model from models/mmproj-Mistral-Medium-3.5-128B.gguf Failed to load vision model from models/mmproj-Mistral-Medium-3.5-128B.gguf ``` With this fallback the model loads successfully. Co-authored-by: Pascal <admin@serveurperso.com> Resolves: https://github.com/ggml-org/llama.cpp/issues/22901 * convert : allow zero value for img_break_tok_id	2026-05-11 12:07:17 +02:00
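The fallback described above can be sketched as follows: if the config reports a negative `image_break_token_id`, look the `[IMG_BREAK]` token up in the tokenizer data instead. This is an illustrative sketch only. The function name `find_img_break_id` is hypothetical, and the assumption that the entries shown in the commit live under an `added_tokens` list follows the usual Hugging Face `tokenizer.json` layout rather than anything stated in the commit.

```python
from typing import Optional


def find_img_break_id(config_id: Optional[int], tokenizer_json: dict,
                      content: str = "[IMG_BREAK]") -> Optional[int]:
    """Return a usable image-break token id, or None if none can be found."""
    # 0 is a valid id, so test for None and negative values, not truthiness.
    if config_id is not None and config_id >= 0:
        return config_id
    # Assumption: special tokens are listed under "added_tokens".
    for tok in tokenizer_json.get("added_tokens", []):
        if tok.get("content") == content:
            return tok["id"]
    return None


if __name__ == "__main__":
    tokenizer = {"added_tokens": [{"id": 12, "content": "[IMG_BREAK]", "special": True}]}
    print(find_img_break_id(-1, tokenizer))
```

Checking `config_id >= 0` rather than relying on truthiness keeps a legitimate id of zero from being treated as missing, in line with the final bullet of the commit.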
Alessandro de Oliveira Faria (A.K.A.CABELO)	838374375c	vendor : update cpp-httplib to 0.44.0 (#22919 ) b9103	2026-05-11 08:47:13 +02:00
Neo Zhang	7d442abf5c	[SYCL] Add OP im2col_3d (#22903 ) * add im2col_3d * format code * update the ops.md b9102	2026-05-11 08:01:47 +03:00
Georgi Gerganov	389ff61d77	server : print warning when HTTP timeout exceeded (#22907 ) b9101	2026-05-10 22:00:18 +03:00
Tim Neumann	2e97c5f96f	backend sampling: support returning post-sampling probs (#22622 ) * server: Never return 0.0 post-sampling probabilities * backend sampling: support returning post-sampling probs b9100	2026-05-10 19:12:02 +02:00
Alessandro de Oliveira Faria (A.K.A.CABELO)	5d5d2e15d2	vendor : update cpp-httplib to 0.43.4 (#22888 ) b9099	2026-05-10 18:46:54 +02:00
Oliver Walsh	2b2babd124	ggml-virtgpu : include missing mutex header (#22810 ) Add missing `#include <mutex>` in ggml-backend-device.cpp. Fixes: #22809 Signed-off-by: Oliver Walsh <owalsh@redhat.com>	2026-05-10 17:32:41 +02:00
Georgi Gerganov	0b047287fe	sync : ggml b9097	2026-05-10 17:00:11 +03:00
Georgi Gerganov	efbada936f	ggml : bump version to 0.11.1 (ggml/1484)	2026-05-10 17:00:11 +03:00

9145 Commits