9239 Commits

Author SHA1 Message Date
shahondin1624 4c66df50ca hip: fix HIP graph capture crash for FA quantized KV f16 dequant
CI / build-cmake-pkg (push) Successful in 15m57s
CI / android-arm64 (push) Failing after 15s
CI / ubuntu-latest-rpc (push) Failing after 13s
CI / ubuntu-latest-cuda (push) Failing after 9s
Release / android-arm64 (push) Failing after 34s
Server / server (default) (push) Failing after 13s
Server / server (backend-sampling) (push) Failing after 17s
Server (self-hosted) / server-metal (GPUx2, backend-sampling) (push) Has been cancelled
CI (self-hosted) / Determine tag name (push) Has been cancelled
CI (self-hosted) / ggml-ci-nvidia-webgpu (push) Has been cancelled
CI / macOS-latest-arm64 (push) Has been cancelled
CI / macOS-latest-x64 (push) Has been cancelled
CI / macOS-latest-arm64-webgpu (push) Has been cancelled
CI / ubuntu-cpu (arm64, ubuntu-24.04-arm) (push) Has been cancelled
CI / ubuntu-cpu (ppc64le, ubuntu-24.04-ppc64le) (push) Has been cancelled
CI / ubuntu-cpu (s390x, ubuntu-24.04-s390x) (push) Has been cancelled
CI / ubuntu-cpu (x64, ubuntu-22.04) (push) Has been cancelled
CI / ubuntu-24-vulkan (arm64, ubuntu-24.04-arm) (push) Has been cancelled
CI / ubuntu-24-vulkan (x64, ubuntu-24.04) (push) Has been cancelled
CI / ubuntu-24-webgpu (push) Has been cancelled
CI / ubuntu-24-webgpu-wasm (push) Has been cancelled
CI / ubuntu-22-hip (push) Has been cancelled
CI / ubuntu-22-musa (push) Has been cancelled
CI / windows-latest (arm64, llvm-arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON) (push) Has been cancelled
CI / windows-latest (arm64, llvm-arm64-opencl-adreno, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DCMAKE_PREFIX_PATH="$env:RUNNER_TEMP/opencl-arm64-release" -DGGML_OPENCL=ON -DGGML_OPENCL_USE_ADRENO_KERNELS=ON) (push) Has been cancelled
CI / windows-latest (x64, cpu-x64 (static), -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DBUILD_SHARED_LIBS=OFF) (push) Has been cancelled
CI / windows-latest (x64, openblas-x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_OPENMP=OFF -DGGML_BLAS=ON -DG… (push) Has been cancelled
CI / windows-latest (x64, vulkan-x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_VULKAN=ON) (push) Has been cancelled
CI / windows-2022-cuda (12.4) (push) Has been cancelled
CI / windows-latest-hip (push) Has been cancelled
CI / ubuntu-cpu-riscv64-native (push) Has been cancelled
CI / ggml-ci-x64-cpu-low-perf (push) Has been cancelled
CI / ggml-ci-arm64-cpu-low-perf (push) Has been cancelled
CI / ggml-ci-x64-cpu-high-perf (push) Has been cancelled
CI / ggml-ci-arm64-cpu-high-perf (push) Has been cancelled
CI / ggml-ci-arm64-cpu-high-perf-sve (push) Has been cancelled
CI / ggml-ci-arm64-cpu-kleidiai (push) Has been cancelled
CI / ggml-ci-arm64-cpu-kleidiai-graviton4 (push) Has been cancelled
Code Style Checker / model-naming (push) Has been cancelled
EditorConfig Checker / editorconfig (push) Has been cancelled
HIP quality check / ubuntu-22-hip-quality-check (push) Has been cancelled
Release / macOS-cpu (arm64, arm64, -DGGML_METAL_USE_BF16=ON -DGGML_METAL_EMBED_LIBRARY=ON, macos-14) (push) Has been cancelled
Release / macOS-cpu (arm64, arm64-kleidiai, -DGGML_METAL_USE_BF16=ON -DGGML_METAL_EMBED_LIBRARY=ON -DGGML_CPU_KLEIDIAI=ON, macos-14) (push) Has been cancelled
Release / macOS-cpu (x64, x64, -DGGML_METAL=OFF -DCMAKE_OSX_DEPLOYMENT_TARGET=13.3, macos-15-intel) (push) Has been cancelled
Release / ubuntu-cpu (arm64, ubuntu-24.04-arm) (push) Has been cancelled
Release / ubuntu-cpu (s390x, ubuntu-24.04-s390x) (push) Has been cancelled
Release / ubuntu-cpu (x64, ubuntu-22.04) (push) Has been cancelled
Release / ubuntu-vulkan (arm64, ubuntu-24.04-arm) (push) Has been cancelled
Release / ubuntu-vulkan (x64, ubuntu-22.04) (push) Has been cancelled
Release / ubuntu-24-openvino (push) Has been cancelled
Release / windows-cpu (arm64) (push) Has been cancelled
Release / windows-cpu (x64) (push) Has been cancelled
Release / windows (arm64, opencl-adreno, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DCMAKE_PREFIX_PATH="$env:RUNNER_TEMP/opencl-arm64-release" -DGGML_OPENCL=ON -DGGML_OPENCL_USE_ADRENO_KERNELS=ON, ggml-opencl) (push) Has been cancelled
Release / windows (x64, vulkan, -DGGML_VULKAN=ON, ggml-vulkan) (push) Has been cancelled
Release / windows-cuda (12.4) (push) Has been cancelled
Release / windows-cuda (13.1) (push) Has been cancelled
Release / windows-sycl (push) Has been cancelled
Release / ubuntu-24-sycl (fp16, ON) (push) Has been cancelled
Release / ubuntu-24-sycl (fp32, OFF) (push) Has been cancelled
Release / ubuntu-22-rocm (7.2.1, x64, gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1151;gfx1150;gfx1200;gfx1201) (push) Has been cancelled
Release / windows-hip (gfx1150;gfx1151;gfx1200;gfx1201;gfx1100;gfx1101;gfx1102;gfx1030;gfx1031;gfx1032, radeon) (push) Has been cancelled
Release / ios-xcode-build (push) Has been cancelled
Release / openEuler-cann (aarch64, Release, 310p, off) (push) Has been cancelled
Release / openEuler-cann (aarch64, Release, 910b, on) (push) Has been cancelled
Release / openEuler-cann (x86, Release, 310p, off) (push) Has been cancelled
Release / openEuler-cann (x86, Release, 910b, on) (push) Has been cancelled
Server (self-hosted) / server-metal (GPUx2) (push) Has been cancelled
Server (self-hosted) / server-metal (GPUx1) (push) Has been cancelled
Server (self-hosted) / server-metal (GPUx1, backend-sampling) (push) Has been cancelled
Server (self-hosted) / server-kleidiai (CPUx1, kleidiai) (push) Has been cancelled
Server / server-windows (push) Has been cancelled
CI (self-hosted) / ggml-ci-nvidia-cuda (push) Has been cancelled
CI (self-hosted) / ggml-ci-nvidia-vulkan-cm (push) Has been cancelled
CI (self-hosted) / ggml-ci-nvidia-vulkan-cm2 (push) Has been cancelled
CI (self-hosted) / ggml-ci-mac-metal (push) Has been cancelled
CI (self-hosted) / ggml-ci-mac-webgpu (push) Has been cancelled
CI (self-hosted) / ggml-ci-mac-vulkan (push) Has been cancelled
CI (self-hosted) / ggml-ci-linux-intel-vulkan (push) Has been cancelled
CI (self-hosted) / ggml-ci-win-intel-vulkan (push) Has been cancelled
CI (self-hosted) / ggml-ci-intel-openvino-gpu-low-perf (push) Has been cancelled
Release / release (push) Has been cancelled
Release / ui-publish (push) Has been cancelled
The HIP branch in launch_fattn used raw hipMalloc / hipFree /
hipStreamSynchronize(main_stream) for the K/V f16 dequant temp buffers
(introduced to avoid pool retention OOM). These three calls are illegal
during HIP graph capture and abort cudaStreamEndCapture with
hipErrorStreamCaptureUnsupported, manifesting as the "ROCm error" at
ggml-cuda.cu:104 when running models like Qwen3.6-27B-Dense and
Qwen3.6-35B-A3B-Q8 with -fa 1 on gfx1151. Workaround was
GGML_CUDA_DISABLE_GRAPHS=1.

Probe cudaStreamIsCapturing on entry; when a capture is in progress
use ggml_cuda_pool_alloc<half> (legal in capture). Outside capture,
behavior is unchanged so the OOM-avoidance the raw-alloc branch was
added for is preserved.

Also: ggml_cuda_error wrote only via GGML_LOG_ERROR, which llama-bench
silences with llama_null_log_callback, so the actual hipError was
invisible. Mirror the message to stderr with fflush so failures stay
diagnosable from bench. Expand the inline CUDA_CHECK around
cudaStreamEndCapture / cudaGraphInstantiate / cudaGraphLaunch to print
which graph step failed plus the cgraph's first/last op for context.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 08:08:12 +02:00
Copilot a581eead32 hip: skip unsupported RDNA WMMA flash-attention cases
CI / build-cmake-pkg (push) Successful in 17m32s
CI / android-arm64 (push) Failing after 12s
CI / ubuntu-latest-rpc (push) Failing after 12s
CI / ubuntu-latest-cuda (push) Failing after 6s
Release / android-arm64 (push) Failing after 30s
Server / server (default) (push) Failing after 5s
Server / server (backend-sampling) (push) Failing after 5s
CI / ubuntu-22-hip (push) Has been cancelled
HIP quality check / ubuntu-22-hip-quality-check (push) Has been cancelled
CI (self-hosted) / Determine tag name (push) Has been cancelled
CI (self-hosted) / ggml-ci-nvidia-webgpu (push) Has been cancelled
CI / macOS-latest-arm64-webgpu (push) Has been cancelled
CI / macOS-latest-arm64 (push) Has been cancelled
CI / macOS-latest-x64 (push) Has been cancelled
CI / ubuntu-cpu (arm64, ubuntu-24.04-arm) (push) Has been cancelled
CI / ubuntu-cpu (ppc64le, ubuntu-24.04-ppc64le) (push) Has been cancelled
CI / ubuntu-cpu (s390x, ubuntu-24.04-s390x) (push) Has been cancelled
CI / ubuntu-cpu (x64, ubuntu-22.04) (push) Has been cancelled
CI / ubuntu-24-vulkan (arm64, ubuntu-24.04-arm) (push) Has been cancelled
CI / ubuntu-24-vulkan (x64, ubuntu-24.04) (push) Has been cancelled
CI / ubuntu-24-webgpu (push) Has been cancelled
CI / ubuntu-24-webgpu-wasm (push) Has been cancelled
CI / ubuntu-22-musa (push) Has been cancelled
CI / windows-latest (arm64, llvm-arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON) (push) Has been cancelled
CI / windows-latest (arm64, llvm-arm64-opencl-adreno, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DCMAKE_PREFIX_PATH="$env:RUNNER_TEMP/opencl-arm64-release" -DGGML_OPENCL=ON -DGGML_OPENCL_USE_ADRENO_KERNELS=ON) (push) Has been cancelled
CI / windows-latest (x64, cpu-x64 (static), -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DBUILD_SHARED_LIBS=OFF) (push) Has been cancelled
CI / windows-latest (x64, openblas-x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_OPENMP=OFF -DGGML_BLAS=ON -DG… (push) Has been cancelled
CI / windows-latest (x64, vulkan-x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_VULKAN=ON) (push) Has been cancelled
CI / windows-2022-cuda (12.4) (push) Has been cancelled
CI / windows-latest-hip (push) Has been cancelled
CI / ubuntu-cpu-riscv64-native (push) Has been cancelled
CI / ggml-ci-x64-cpu-low-perf (push) Has been cancelled
CI / ggml-ci-arm64-cpu-low-perf (push) Has been cancelled
CI / ggml-ci-x64-cpu-high-perf (push) Has been cancelled
CI / ggml-ci-arm64-cpu-high-perf (push) Has been cancelled
CI / ggml-ci-arm64-cpu-high-perf-sve (push) Has been cancelled
CI / ggml-ci-arm64-cpu-kleidiai (push) Has been cancelled
CI / ggml-ci-arm64-cpu-kleidiai-graviton4 (push) Has been cancelled
Code Style Checker / model-naming (push) Has been cancelled
EditorConfig Checker / editorconfig (push) Has been cancelled
Server (self-hosted) / server-metal (GPUx2) (push) Has been cancelled
Release / macOS-cpu (arm64, arm64, -DGGML_METAL_USE_BF16=ON -DGGML_METAL_EMBED_LIBRARY=ON, macos-14) (push) Has been cancelled
Release / macOS-cpu (arm64, arm64-kleidiai, -DGGML_METAL_USE_BF16=ON -DGGML_METAL_EMBED_LIBRARY=ON -DGGML_CPU_KLEIDIAI=ON, macos-14) (push) Has been cancelled
Server (self-hosted) / server-metal (GPUx1) (push) Has been cancelled
Release / macOS-cpu (x64, x64, -DGGML_METAL=OFF -DCMAKE_OSX_DEPLOYMENT_TARGET=13.3, macos-15-intel) (push) Has been cancelled
Release / ubuntu-cpu (arm64, ubuntu-24.04-arm) (push) Has been cancelled
Release / ubuntu-cpu (s390x, ubuntu-24.04-s390x) (push) Has been cancelled
Release / ubuntu-cpu (x64, ubuntu-22.04) (push) Has been cancelled
Release / ubuntu-vulkan (arm64, ubuntu-24.04-arm) (push) Has been cancelled
Release / ubuntu-vulkan (x64, ubuntu-22.04) (push) Has been cancelled
Release / ubuntu-24-openvino (push) Has been cancelled
Release / windows-cpu (arm64) (push) Has been cancelled
Release / windows-cpu (x64) (push) Has been cancelled
Release / windows (arm64, opencl-adreno, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DCMAKE_PREFIX_PATH="$env:RUNNER_TEMP/opencl-arm64-release" -DGGML_OPENCL=ON -DGGML_OPENCL_USE_ADRENO_KERNELS=ON, ggml-opencl) (push) Has been cancelled
Release / windows (x64, vulkan, -DGGML_VULKAN=ON, ggml-vulkan) (push) Has been cancelled
Release / windows-cuda (12.4) (push) Has been cancelled
Release / windows-cuda (13.1) (push) Has been cancelled
Release / windows-sycl (push) Has been cancelled
Release / ubuntu-24-sycl (fp16, ON) (push) Has been cancelled
Release / ubuntu-24-sycl (fp32, OFF) (push) Has been cancelled
Release / ubuntu-22-rocm (7.2.1, x64, gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1151;gfx1150;gfx1200;gfx1201) (push) Has been cancelled
Release / windows-hip (gfx1150;gfx1151;gfx1200;gfx1201;gfx1100;gfx1101;gfx1102;gfx1030;gfx1031;gfx1032, radeon) (push) Has been cancelled
Release / ios-xcode-build (push) Has been cancelled
Release / openEuler-cann (aarch64, Release, 310p, off) (push) Has been cancelled
Release / openEuler-cann (aarch64, Release, 910b, on) (push) Has been cancelled
Release / openEuler-cann (x86, Release, 310p, off) (push) Has been cancelled
Release / openEuler-cann (x86, Release, 910b, on) (push) Has been cancelled
Server (self-hosted) / server-metal (GPUx2, backend-sampling) (push) Has been cancelled
Server (self-hosted) / server-metal (GPUx1, backend-sampling) (push) Has been cancelled
Server (self-hosted) / server-kleidiai (CPUx1, kleidiai) (push) Has been cancelled
Server / server-windows (push) Has been cancelled
CI (self-hosted) / ggml-ci-nvidia-cuda (push) Has been cancelled
CI (self-hosted) / ggml-ci-nvidia-vulkan-cm (push) Has been cancelled
CI (self-hosted) / ggml-ci-nvidia-vulkan-cm2 (push) Has been cancelled
CI (self-hosted) / ggml-ci-mac-metal (push) Has been cancelled
CI (self-hosted) / ggml-ci-mac-webgpu (push) Has been cancelled
CI (self-hosted) / ggml-ci-mac-vulkan (push) Has been cancelled
CI (self-hosted) / ggml-ci-linux-intel-vulkan (push) Has been cancelled
CI (self-hosted) / ggml-ci-win-intel-vulkan (push) Has been cancelled
CI (self-hosted) / ggml-ci-intel-openvino-gpu-low-perf (push) Has been cancelled
Release / release (push) Has been cancelled
Release / ui-publish (push) Has been cancelled
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-19 23:42:54 +02:00
shahondin1624 2907ee9830 turboquant: post-merge integration fixes from test validation
CI (sycl) / ubuntu-24-sycl (fp16, ON) (push) Has been cancelled
CI (sycl) / ubuntu-24-sycl (fp32, OFF) (push) Has been cancelled
CI (sycl) / windows-latest-sycl (push) Has been cancelled
CI (virtgpu) / ubuntu-24-virtgpu (push) Has been cancelled
Check vendor / check-vendor (push) Has been cancelled
CI (vulkan) / ubuntu-24-vulkan-llvmpipe (push) Has been cancelled
CI (3rd-party) / ubuntu-24-llguidance (push) Has been cancelled
CI (apple) / macOS-latest-ios (push) Has been cancelled
CI (apple) / macos-latest-ios-xcode (push) Has been cancelled
CI (apple) / macOS-latest-tvos (push) Has been cancelled
CI (apple) / macOS-latest-visionos (push) Has been cancelled
CI (cann) / openEuler-latest-cann (aarch64, Release, 310p, off) (push) Has been cancelled
CI (cann) / openEuler-latest-cann (aarch64, Release, 910b, off) (push) Has been cancelled
CI (cann) / openEuler-latest-cann (aarch64, Release, 910b, on) (push) Has been cancelled
CI (cann) / openEuler-latest-cann (x86, Release, 310p, off) (push) Has been cancelled
CI (cann) / openEuler-latest-cann (x86, Release, 910b, off) (push) Has been cancelled
CI (cann) / openEuler-latest-cann (x86, Release, 910b, on) (push) Has been cancelled
CI (cross) / debian-13-loongarch64-cpu-cross (push) Has been cancelled
CI (cross) / debian-13-loongarch64-vulkan-cross (push) Has been cancelled
CI (cross) / ubuntu-24-riscv64-cpu-spacemit-ime-cross (push) Has been cancelled
CI (openvino) / ubuntu-24-openvino-CPU (push) Has been cancelled
CI (openvino) / ubuntu-24-openvino-GPU (push) Has been cancelled
CI (riscv) / ubuntu-riscv64-native-sanitizer (Debug, ADDRESS) (push) Has been cancelled
CI (riscv) / ubuntu-riscv64-native-sanitizer (Debug, THREAD) (push) Has been cancelled
CI (riscv) / ubuntu-riscv64-native-sanitizer (Debug, UNDEFINED) (push) Has been cancelled
Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Has been cancelled
flake8 Lint / Lint (push) Has been cancelled
CI (apple) / macOS-latest-swift (generic/platform=iOS) (push) Has been cancelled
CI (apple) / macOS-latest-swift (generic/platform=macOS) (push) Has been cancelled
CI (apple) / macOS-latest-swift (generic/platform=tvOS) (push) Has been cancelled
Python check requirements.txt / check-requirements (push) Has been cancelled
Python Type-Check / python type-check (push) Has been cancelled
CI (snapdragon) / android-ndk-snapdragon (push) Failing after 2m33s
CI (android) / android (push) Failing after 4m51s
CI (android) / android-ndk (push) Failing after 4s
CI (sanitize) / ubuntu-latest-sanitizer (Debug, ADDRESS) (push) Failing after 14s
CI (sanitize) / ubuntu-latest-sanitizer (Debug, THREAD) (push) Failing after 8s
CI (sanitize) / ubuntu-latest-sanitizer (Debug, UNDEFINED) (push) Failing after 9s
CI (UI) / Build static output (push) Failing after 8m40s
CI (UI) / UI Checks (push) Has been skipped
CI (UI) / E2E Tests (push) Has been skipped
CI (snapdragon) / linux-iot-snapdragon (push) Failing after 3m10s
CI (snapdragon) / Test on QDC Device (QCS9075M) (push) Has been skipped
CI (snapdragon) / Test on QDC Device (SM8750) (push) Has been skipped
CI (snapdragon) / Test on QDC Device (SM8850) (push) Has been skipped
CI / build-cmake-pkg (push) Successful in 15m28s
CI / android-arm64 (push) Failing after 10s
CI / ubuntu-latest-rpc (push) Failing after 8s
CI / ubuntu-latest-cuda (push) Failing after 4m22s
Release / android-arm64 (push) Failing after 1m10s
Server (sanitize) / server (RelWithDebInfo, ADDRESS) (push) Failing after 32s
Server (sanitize) / server (RelWithDebInfo, UNDEFINED) (push) Failing after 4s
Server / server (default) (push) Failing after 5s
Server / server (backend-sampling) (push) Failing after 4s
CI (self-hosted) / ggml-ci-intel-openvino-gpu-low-perf (push) Has been cancelled
CI (self-hosted) / Determine tag name (push) Has been cancelled
CI (self-hosted) / ggml-ci-nvidia-cuda (push) Has been cancelled
CI (self-hosted) / ggml-ci-nvidia-vulkan-cm (push) Has been cancelled
CI (self-hosted) / ggml-ci-nvidia-vulkan-cm2 (push) Has been cancelled
CI (self-hosted) / ggml-ci-nvidia-webgpu (push) Has been cancelled
CI (self-hosted) / ggml-ci-mac-metal (push) Has been cancelled
CI (self-hosted) / ggml-ci-mac-webgpu (push) Has been cancelled
CI (self-hosted) / ggml-ci-mac-vulkan (push) Has been cancelled
CI (self-hosted) / ggml-ci-linux-intel-vulkan (push) Has been cancelled
CI (self-hosted) / ggml-ci-win-intel-vulkan (push) Has been cancelled
CI / ggml-ci-arm64-cpu-kleidiai-graviton4 (push) Has been cancelled
CI / macOS-latest-arm64 (push) Has been cancelled
CI / macOS-latest-x64 (push) Has been cancelled
CI / macOS-latest-arm64-webgpu (push) Has been cancelled
CI / ubuntu-cpu (arm64, ubuntu-24.04-arm) (push) Has been cancelled
CI / ubuntu-cpu (ppc64le, ubuntu-24.04-ppc64le) (push) Has been cancelled
CI / ubuntu-24-vulkan (arm64, ubuntu-24.04-arm) (push) Has been cancelled
CI / ubuntu-24-vulkan (x64, ubuntu-24.04) (push) Has been cancelled
CI / windows-latest (x64, openblas-x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_OPENMP=OFF -DGGML_BLAS=ON -DG… (push) Has been cancelled
CI / windows-latest (x64, vulkan-x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_VULKAN=ON) (push) Has been cancelled
CI / windows-2022-cuda (12.4) (push) Has been cancelled
CI / ubuntu-cpu (s390x, ubuntu-24.04-s390x) (push) Has been cancelled
CI / ubuntu-cpu (x64, ubuntu-22.04) (push) Has been cancelled
CI / ubuntu-24-webgpu (push) Has been cancelled
CI / ubuntu-24-webgpu-wasm (push) Has been cancelled
CI / ubuntu-22-hip (push) Has been cancelled
CI / ubuntu-22-musa (push) Has been cancelled
CI / windows-latest (arm64, llvm-arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON) (push) Has been cancelled
Release / ubuntu-22-rocm (7.2.1, x64, gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1151;gfx1150;gfx1200;gfx1201) (push) Has been cancelled
CI / windows-latest (arm64, llvm-arm64-opencl-adreno, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DCMAKE_PREFIX_PATH="$env:RUNNER_TEMP/opencl-arm64-release" -DGGML_OPENCL=ON -DGGML_OPENCL_USE_ADRENO_KERNELS=ON) (push) Has been cancelled
CI / windows-latest (x64, cpu-x64 (static), -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DBUILD_SHARED_LIBS=OFF) (push) Has been cancelled
CI / windows-latest-hip (push) Has been cancelled
CI / ubuntu-cpu-riscv64-native (push) Has been cancelled
CI / ggml-ci-x64-cpu-low-perf (push) Has been cancelled
CI / ggml-ci-arm64-cpu-low-perf (push) Has been cancelled
CI / ggml-ci-x64-cpu-high-perf (push) Has been cancelled
CI / ggml-ci-arm64-cpu-high-perf (push) Has been cancelled
CI / ggml-ci-arm64-cpu-high-perf-sve (push) Has been cancelled
CI / ggml-ci-arm64-cpu-kleidiai (push) Has been cancelled
Code Style Checker / model-naming (push) Has been cancelled
EditorConfig Checker / editorconfig (push) Has been cancelled
HIP quality check / ubuntu-22-hip-quality-check (push) Has been cancelled
Release / macOS-cpu (arm64, arm64-kleidiai, -DGGML_METAL_USE_BF16=ON -DGGML_METAL_EMBED_LIBRARY=ON -DGGML_CPU_KLEIDIAI=ON, macos-14) (push) Has been cancelled
Release / macOS-cpu (x64, x64, -DGGML_METAL=OFF -DCMAKE_OSX_DEPLOYMENT_TARGET=13.3, macos-15-intel) (push) Has been cancelled
Release / ubuntu-24-sycl (fp16, ON) (push) Has been cancelled
Release / ubuntu-24-sycl (fp32, OFF) (push) Has been cancelled
Release / windows-hip (gfx1150;gfx1151;gfx1200;gfx1201;gfx1100;gfx1101;gfx1102;gfx1030;gfx1031;gfx1032, radeon) (push) Has been cancelled
Release / macOS-cpu (arm64, arm64, -DGGML_METAL_USE_BF16=ON -DGGML_METAL_EMBED_LIBRARY=ON, macos-14) (push) Has been cancelled
Release / ubuntu-cpu (arm64, ubuntu-24.04-arm) (push) Has been cancelled
Release / ubuntu-cpu (s390x, ubuntu-24.04-s390x) (push) Has been cancelled
Release / ubuntu-cpu (x64, ubuntu-22.04) (push) Has been cancelled
Release / ubuntu-vulkan (arm64, ubuntu-24.04-arm) (push) Has been cancelled
Release / ubuntu-vulkan (x64, ubuntu-22.04) (push) Has been cancelled
Release / ubuntu-24-openvino (push) Has been cancelled
Release / windows-cpu (arm64) (push) Has been cancelled
Release / windows-cpu (x64) (push) Has been cancelled
Release / windows (arm64, opencl-adreno, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DCMAKE_PREFIX_PATH="$env:RUNNER_TEMP/opencl-arm64-release" -DGGML_OPENCL=ON -DGGML_OPENCL_USE_ADRENO_KERNELS=ON, ggml-opencl) (push) Has been cancelled
Release / windows (x64, vulkan, -DGGML_VULKAN=ON, ggml-vulkan) (push) Has been cancelled
Release / windows-cuda (12.4) (push) Has been cancelled
Release / windows-cuda (13.1) (push) Has been cancelled
Release / windows-sycl (push) Has been cancelled
Release / ios-xcode-build (push) Has been cancelled
Release / openEuler-cann (aarch64, Release, 310p, off) (push) Has been cancelled
Release / openEuler-cann (aarch64, Release, 910b, on) (push) Has been cancelled
Release / openEuler-cann (x86, Release, 310p, off) (push) Has been cancelled
Release / openEuler-cann (x86, Release, 910b, on) (push) Has been cancelled
Release / release (push) Has been cancelled
Release / ui-publish (push) Has been cancelled
Server (self-hosted) / server-metal (GPUx1, backend-sampling) (push) Has been cancelled
Server (self-hosted) / server-metal (GPUx2, backend-sampling) (push) Has been cancelled
Server (self-hosted) / server-metal (GPUx2) (push) Has been cancelled
Server (self-hosted) / server-metal (GPUx1) (push) Has been cancelled
Server (self-hosted) / server-kleidiai (CPUx1, kleidiai) (push) Has been cancelled
Server / server-windows (push) Has been cancelled
Two fixes surfaced by running the full test suite against the squash-merged
turboquant branch, plus one CMake registration.

1. ggml-cuda/ggml-cuda.cu (GET_ROWS supports_op)
   Removed TQ3_1S/TQ4_1S from the CUDA/HIP GET_ROWS supports_op switch.
   TheTom's branch advertised these as supported but never added the matching
   cases to getrows.cu — a latent bug present on both his branch and master.
   master's test-backend-ops triggers it; the scheduler will now route
   get_rows on TQ types to CPU.

2. ggml-cuda/fattn.cu (HIP head-size gate)
   Master's get_best_fattn_kernel falls through to BEST_FATTN_KERNEL_TILE as
   default. On HIP, fattn-tile.cu only instantiates head sizes 64, 128, 256,
   320, 512 (576/640 exceed local memory limits per #ifndef GGML_USE_HIP).
   Without this gate, supports_op returns true for unsupported sizes and the
   dispatch aborts. Now returns BEST_FATTN_KERNEL_NONE on HIP for head sizes
   the tile kernel cannot compile, letting the scheduler fall back to CPU.

3. tests/CMakeLists.txt (test-turbo-quant registration)
   TheTom added tests/test-turbo-quant.c (CPU round-trip diagnostic for
   turbo3/turbo4 quant→dequant→inverse-WHT) but never wired it into the
   build. Registered as a ctest entry linked against ggml + libm.

Test status with these fixes:
- CPU (build-cpu): 51/51 ctest pass, including new test-turbo-quant.
- HIP (build-hip, gfx1151): 50/50 ctest pass with GGML_CUDA_DISABLE_GRAPHS=1
  and test-backend-ops excluded. test-backend-ops itself runs 13674/13677
  internal cases; the 3 remaining failures (CLAMP f16 → inf, bf16 FA graph
  capture) are pre-existing master-side regressions on RDNA3.5+HIP that
  reproduce on plain master and are unrelated to TurboQuant.
2026-05-19 15:13:55 +02:00
shahondin1624 ddebb5ddf6 turboquant: squash-merge TheTom/llama-cpp-turboquant feature/turboquant-kv-cache
Squashes the entire TurboQuant KV-cache feature branch from
https://github.com/TheTom/llama-cpp-turboquant (tip 5aeb2fdbe) onto our master.

Includes: TurboQuant KV-cache types (turbo2_0, turbo3_0, turbo4_0, tq3_1s,
tq4_1s), GGML_OP_TURBO_WHT op, CUDA + Metal kernels (including TQ-rotated
mul_mm path), CPU reference paths, HIP template instances, perplexity tooling,
and 18 post-upstream-sync fixes (CVE-2026-21869 server clamp, HIP FA pool
retention, n_head_v reshape, sparse-V CUDA gating, etc.).

Conflict-resolution notes (review carefully before depending on these paths):

- common/arg.cpp, common/speculative.cpp: master's refactored speculative API
  kept (params.speculative.types / ngram_mod struct, per-sinfo n_low/i_last).

- ggml-cuda/fattn.cu: head-size exclusion lists unioned (now exclude both 192
  and 640 alongside other sizes).

- ggml-cuda/ggml-cuda.cu: both master's ADD/SUB/MUL/DIV F16 widening AND
  TurboQuant's GGML_OP_TURBO_WHT support cases kept.

- ggml-metal-device.h/.cpp: master's new get_pipeline_mul_mv_ext signature
  (const ggml_tensor * op) kept; TurboQuant's get_pipeline_turbo_wht added.

- ggml-metal-ops.cpp: TurboQuant's TQ-rotated mul_mm path preserved; non-TQ
  else-branch adapted to master's pipeline.nr0/nr1/nsg dispatch API.

- ggml-vulkan.cpp: master's spec-constant-driven flash_attn pipeline iteration
  taken (over TurboQuant's CREATE_FA-per-type macro approach). TURBO3_0 added
  to the fa_kv_ok lambda for type validation.

- ggml-vulkan/flash_attn_base.glsl, vulkan-shaders-gen.cpp: master's new
  spec-constant FA shader generation kept; TurboQuant's DATA_A_TURBO3_0 macro
  path NOT carried over. *** Vulkan TURBO3_0 flash-attention paths need
  re-implementation against the new spec-constant API. *** Vulkan TURBO3_0
  inference will likely fail until that work is redone.

Squash base: 7fc1c4ef78 (TheTom's last upstream merge point).
2026-05-19 15:13:49 +02:00
Georgi Gerganov d14ce3dab4 llama : MTP clean-up (#23269)
* llama : disable equal splits for recurrent memory with partial rollback

* spec : re-enable p-min with MTP drafts

* spec : re-enable ngram spec in combination with RS rollback

* spec : fix ngram-map-* params

* spec : fix acceptance logic in combined ngram + draft configs

* graph : fix reuse for combined `token` + `embd` batches

* spec : log parameters for each speculative implementation

- add LOG_INF in each constructor with implementation type and parameters
- extract device string logic into common_speculative_get_devices_str()
- move 'adding speculative implementation' log from init into constructors

Assisted-by: llama.cpp:local pi

* spec : extend --spec-default with ngram-map-k4v

Assisted-by: llama.cpp:local pi

* minor : fix n_embd log

* args : update draft.n_max == 3 + regen docs

* spec : relax ngram-mod rejection thold to 0.25 @ 5 low

* logs : improve

* docs : update speculative decoding CLI argument documentation

- Add missing draft model CPU scheduling and tensor override parameters
- Update --spec-type to include all available types (excluding draft-eagle3 WIP)
- Fix default values to match implementation (n_max=3, n_min=0, p_min=0.0)
- Remove deprecated options (spec-draft-ctx-size, spec-draft-replace)
- Add environment variables for new parameters

Assisted-by: llama.cpp:local pi

* arg : step-back on adding k4v to the default spec config

* cont : fix name
2026-05-19 15:32:58 +03:00
Aleksander Grygier 6db130445d ui: Bump packages + address build warnings (#23300)
* chore: Update vulnerable packages

* chore: Formatting

* refactor: Update Tailwind CSS imports

* ci: Use `ubuntu-latest` for Unit/E2E UI tests

* chore: Bump package

* fix: Add missing tag

* refactor: Enums files naming
2026-05-19 10:16:04 +02:00
Sigbjørn Skjæret 4b262ab662 ci : install libssl-dev (#23325) 2026-05-19 11:11:04 +03:00
Sigbjørn Skjæret 00c461ce1a ci : install server kleidiai runner dependencies (#23259) 2026-05-19 09:06:56 +02:00
Pascal ccee426426 server-context: guarantee there is at least 1 token to decode (#23280) 2026-05-19 09:49:01 +03:00
Georgi Gerganov 3c81c8deea server : print graphs reused in slot timings (#23279)
Add graphs reused counter to the per-slot timing output, printed via
llama_perf_context().

Assisted-by: llama.cpp:local pi

Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
2026-05-19 09:46:58 +03:00
Georgi Gerganov cd963fee6a save-load-state : refactor tests and improve readability (#23196)
* save-load-state : refactor into separate phase functions

- Split monolithic main() into 4 self-contained phase functions, each
  managing its own context/sampler/batch lifecycle
- Each function tokenizes internally using its local ctx instance
- main() is now a clean orchestrator: init -> run phases -> assert results
- Proper resource cleanup on every exit path (return {} on error)

Assisted-by: llama.cpp:local pi

* save-load-state : use params.out_file instead of separate state_file

- Remove state_file parameter from all phase functions
- Each function accesses params.out_file directly
- Initialize params.out_file in main alongside params.prompt

Assisted-by: llama.cpp:local pi

* save-load-state : use smart pointers for ctx and smpl

- Replace raw llama_context* with llama_context_ptr
- Replace raw llama_sampler* with llama_sampler_ptr
- Remove all manual llama_free() and llama_sampler_free() calls
- Keep llama_batch as raw (managed manually with llama_batch_free)

Assisted-by: llama.cpp:local pi

* save-load-state : add local llama_batch_ptr RAII wrapper

- Add llama_batch_ptr struct holding llama_batch by value
- Calls llama_batch_free() in destructor
- Eliminates all manual llama_batch_free() calls

Assisted-by: llama.cpp:local pi

* save-load-state : replace printf/fprintf with logging macros

- Add log.h include
- Replace fprintf(stderr, ...) errors with LOG_ERR
- Replace fprintf(stderr, ...) info with LOG_TRC
- Replace printf output with LOG

Assisted-by: llama.cpp:local pi

* save-load-state : refactor tests to check results inline

Each follow-up phase now accepts an expected result and performs
the comparison internally instead of collecting results in main().

Assisted-by: llama.cpp:local pi

* save-load-state : improve test output readability

Add phase labels, remove redundant run prefixes, and show
PASS after each test.

Assisted-by: llama.cpp:local pi

* pi : add rule about git signing

* save-load-state : simplify llama_batch_ptr

Change get() to return a reference and remove operator*().
Use batch.get() throughout for consistency.

Assisted-by: llama.cpp:local pi

* save-load-state : extract generate_tokens helper

Factor out the repeated token generation loop into a shared
helper function used by all phases.

Assisted-by: llama.cpp:local pi

* save-load-state : update comments to use test terminology

Replace "Phase" with "Test" and list each test's steps
as bullet points.

Assisted-by: llama.cpp:local pi

* save-load-state : rename test functions

Rename to test_baseline, test_state_load, test_seq_cp_host,
test_seq_cp_device. Update comments and logs accordingly.

Assisted-by: llama.cpp:local pi

* pi : add rule to never git push without confirmation

Assisted-by: llama.cpp:local pi

* common : add model_only option to common_init_from_params

Add bool model_only parameter to skip context creation,
sampler init, and context-dependent setup.

Use in save-load-state to initialize only the model,
with each test creating its own context.

Assisted-by: llama.cpp:local pi

---------

Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
2026-05-19 09:46:34 +03:00
Georgi Gerganov d2e179a477 llama-eval : add per-task summary stats (#23151)
* llama-eval : add per-problem summary table to HTML reports

- Add chunk_idx and problem_idx to TaskState and saved case dicts
- Group completed cases by problem_idx in dump_html()
- Render per-problem summary table before individual task table
  - Columns: Problem (zero-padded), Runs, Correct (n/r),
    Tokens (min/avg/max), T/s (min/avg/max), Gen s (min/avg/max)
  - Sorted by problem index, monospace font, right-aligned numbers
  - Colspan headers for grouped stats, auto width
- Simulator: add /v1/models endpoint, timings in response,
  template-aware question matching, --dataset arg (aime/aime2025)

Assisted-by: llama.cpp:local pi

* llama-eval : add tabs for Detailed and Summary tables, apply monospace font globally

- Wrap Detailed and Summary tables in switchable tabs (Detailed active by default)
- Remove summary-section wrapper, use tab labels instead
- Apply monospace font to all tables and the top bar

Assisted-by: llama.cpp:local pi

* llama-eval : redesign top bar as CSS grid label/value pairs

- Replace flat span list with 4-column grid layout (2 pairs per row)
- Labels in muted color (#888), values in dark (#222)
- Bold dataset name and model name
- Removed media query, always uses 4 columns

Assisted-by: llama.cpp:local pi

* llama-eval : use realistic token counts and throughput in simulator

- comp_tokens: [30, 80] → [10000, 60000]
- tps_gen: derived → uniform [90.0, 110.0]
- t_gen_ms: now computed from tokens/tps

Assisted-by: llama.cpp:local pi

* llama-eval : color Answer column green/red based on correctness

Use the same .correct/.incorrect CSS classes on the Answer column
to make correct answers green and incorrect answers red.

Assisted-by: llama.cpp:local pi

* llama-eval : fix pyright errors from max(..., key=len) type inference

Use key=lambda x: len(x) instead of key=len so the type checker
infers the return type as str instead of Sized, fixing:
  - unresolved-attribute: Object of type Sized has no attribute lower
  - not-subscriptable: Cannot subscript object of type Sized

Assisted-by: llama.cpp:local pi
2026-05-19 09:46:05 +03:00
Reese Levine c85a242ed0 ggml-webgpu : extend GDN for K>1 (#23299) 2026-05-19 09:45:41 +03:00
Neo Zhang aabee047d8 [SCYL] add chapter for performance reference in SYCL.md (#23315)
* add chapter for performance reference

* rm unsupported GPU
2026-05-19 09:44:51 +03:00
Sigbjørn Skjæret f1c1c5c057 convert : filter lora tensor names (#23077) 2026-05-19 09:44:25 +03:00
Intel AI Get-to Market Customer Success and Solutions 439f1b193d sycl: add GGML_SYCL_USE_ASYNC_MEM_OP env toggle (#22153)
* sycl: add GGML_SYCL_USE_ASYNC_MEM_OP env toggle

Signed-off-by: Chun Tao <chun.tao@intel.com>

* Use async mem ops for correctness when SYCL graphs are explicitly on.

Signed-off-by: Tao, Chun <chun.tao@intel.com>

---------

Signed-off-by: Chun Tao <chun.tao@intel.com>
Signed-off-by: Tao, Chun <chun.tao@intel.com>
Co-authored-by: Chun Tao <chun.tao@intel.com>
2026-05-19 09:44:02 +03:00
Radoslav Gerganov c3e9ade6dd rpc : keep last_graph_uid in the device context (#23273)
With the introduction of MTP we can have multiple compute contexts for
the same RPC device. In this case last_graph_uid is not updated properly
when contexts are being switched. This patch fixes this by moving
last_graph_uid to the device context, making sure it is always updated.

closes: #23242
2026-05-19 09:42:36 +03:00
Pranav Dhinakar 9a532ae4ba hexagon: add support for TRI op (#22822)
* Hexagon: TRI HVX Kernel addition to ggml hexagon HTP ops and context

* addressed PR review comments for TRI op

* hexagon: clang format

* hex-unary: remove merge conflict markers

* hex-ggml: remove duplicate op cases (merge conflict)

* hex-ggml: fix editor config errors

---------

Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-05-18 14:04:57 -07:00
Pranav Dhinakar b7340443d4 ggml-hexagon: add PAD op HVX kernel (#23078)
* ggml-hexagon: add PAD op HVX kernel

Implements GGML_OP_PAD on the Hexagon HTP backend using HVX vectorized
kernels. Supports zero-padding and circular padding across all 4 tensor
dimensions.

* hex-ggml: remove duplicate op cases (merge conflict)

* hex-pad: fix editorconfig checks and macro alignment

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-05-18 13:39:36 -07:00
SamareshSingh 5cbaa5e69e docker : add OCI image labels for version and build date (#21653)
* docker: add OCI image labels to all published images

* docker: propagate OCI labels as manifest and index annotations

* docker: drop hardcoded org URL and revert accidental intel version bump

The OCI image url and source are now driven by build args with a sensible default. The workflow passes the actual repository url so fork builds get labels pointing at the fork instead of upstream. Also restores the IGC, compute runtime, and IGDGMM versions in the intel Dockerfile labeled stage which I accidentally bumped in the first commit.

* docker: add skip_s390x workflow_dispatch input for fast test runs

Lets maintainers and PR authors trigger the docker workflow without the s390x build target, which depends on the IBM Z runner and is by far the slowest job in the matrix. The flag filters the s390x row out of the build matrix before merge_matrix is derived, so the merge job sees a consistent shape too.

Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com>

---------

Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com>
2026-05-18 22:14:45 +02:00
Adrien Gallouët 45b455e66f common : remove hf cache migration (#23266)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-18 17:11:47 +02:00
Aleksander Grygier 3a9c1b854d ui: Update KaTeX package and clean up logs from sass warnings (#23275)
* ui: migrate katex imports to @use to resolve SCSS deprecation warnings

* ci: Use `ubuntu-slim` for CI (UI) workflow
2026-05-18 16:26:01 +02:00
Aleksander Grygier b9a2170fce feat: add scroll-to-bottom button to chat + prevent forced scroll down (#23270) 2026-05-18 16:17:21 +02:00
Aleksander Grygier 1ff0fc1384 ui: Refactor models store, MCP service, and gate logs behind VITE_DEBUG (#23236)
* refactor: Scope console logs to `DEV` + `VITE_DEBUG` env vars

* refactor: skip MCP proxy probe when no server requires it

* refactor: suppress expected disconnect errors during MCP client shutdown

* refactor: Deduplicate requests

* refactor: deduplicate model fetching across ROUTER and MODEL modes

* refactor: Clean up models logic

* chore: Add `.env.example` file

* refactor: replace client-side CORS proxy probe with server status flag

* refactor: Post-review fixes

* test: add vitest client setup with API fetch mocks
2026-05-18 16:09:40 +02:00
Aleksander Grygier a135ec0baa ui: Centralize monospace font styles in app.css (#23272) 2026-05-18 15:10:14 +02:00
Martin Andersson 232f466583 webui: fix Tailwind v4 utility classes missing when built via cmake (#23253) 2026-05-18 14:08:02 +02:00
Andrei 49c21f97cd llama: initialize pre-norm embedding mask flag (#23256) 2026-05-18 14:20:49 +03:00
Sigbjørn Skjæret 77e38d68f2 add myself to conversion (#23261) 2026-05-18 12:42:56 +02:00
Martin Klacer 053e01dff6 ci : added kleidiai-server to server-self-hosted workflow (#22435)
* kleidiai: added kleidiai-server to server-self-hosted workflow

 * Added KleidiAI-enabled Arm64 Linux llama-server CI/integration test
   workflow into the server-self-hosted.yml configuration file

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Change-Id: I032e33c525b7e26bc5d53719f638bee610cec1ee

* Added self-hosted executor for KleidiAI server workflow

Signed-off-by: Martin Klacer <martin.klacer@arm.com>

* Update .github/workflows/server-self-hosted.yml

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-18 11:14:57 +02:00
Georgi Gerganov c3f95c1f06 scripts : allow wc2wt with an existing branch (#23189) 2026-05-18 08:57:28 +03:00
Intel AI Get-to Market Customer Success and Solutions 0caf2a1d48 sycl: scalar SWAR byte-subtract in Q6_K MMVQ dot product (#22156)
Signed-off-by: Chun Tao <chun.tao@intel.com>
Co-authored-by: Chun Tao <chun.tao@intel.com>
2026-05-18 08:12:21 +03:00
Intel AI Get-to Market Customer Success and Solutions 5511965b19 sycl: route small f32 matmuls to oneMKL, bypass oneDNN (#22150)
Signed-off-by: Chun Tao <chun.tao@intel.com>
Co-authored-by: Chun Tao <chun.tao@intel.com>
2026-05-18 08:11:51 +03:00
Neo Zhang e98bcfec28 sycl : fix error when use -mg 1 error (#23140) 2026-05-18 08:11:19 +03:00
Incarnas 1867a0c692 update bid to match each layers MTP source (#23237)
* update bid to match each layers MTP source

* Update conversion/qwen.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-18 12:37:12 +08:00
Sigbjørn Skjæret dd7cad7197 cmake : do not check for bin install dir (#23234) 2026-05-18 02:33:14 +02:00
Gabe Goodhart 726704a160 feat: Support d_conv=15 for ssm-conv.cu (#23017)
Branch: ModalityConditionalAdapters
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2026-05-17 23:05:11 +02:00
Aldehir Rojas 87589042ca cmake : fix LLAMA_BUILD_UI logic (#23190) 2026-05-17 14:42:26 -04:00
Sigbjørn Skjæret e0de4c2419 cmake : do not install conversion script (#23204) 2026-05-17 18:07:21 +02:00
Oliver Simons 84c678242a CUDA: Continue directly including cuda/iterator (#23102)
Cont of #22936, forgot to update one site
2026-05-17 18:00:10 +02:00
Aman Gupta 3e12fbdea5 llama: avoid copying logits during prompt decode in MTP (#23198)
* llama: avoid copying logits during prompt decode in MTP

* review: update comment

* llama-graph: call set_output for t_h_pre_norm
2026-05-17 23:30:25 +08:00
Aldehir Rojas 39cf5d6191 common : delegate assistant continuation to underlying template handlers (#23089)
* common : delegate assistant continuation to template handler

* server : implement echo parameter to exclude assistant prefill in the response

* server : fix tests for prefill

* server : use existing llama template

* cont : clean up
2026-05-17 13:36:05 +02:00
Jan Ekström a6d6183dbc ggml-vulkan/CMakeLists: add a check for SPIRV-Headers (#22009)
* ci/run: set explicit SPIR-V Headers search path for macOS vulkan CI

For whatever reason, the files are under additional sub-path
`vulkan/` under the cmake directory, which does not match either
current LunarG macOS Vulkan SDK structure (`lib/cmake/SPIRV-Headers`),
nor what gets installed when you run the cmake build+install for
SPIRV-Headers itself on at least Linux (`share/cmake/SPIRV-Headers`).

This allows for SPIRV-Headers to be found, as currently the CI
runner's setup does not seem to include the relevant path in
list of search locations.

* ggml-vulkan/CMakeLists: add a check for SPIRV-Headers

This is installed by the project if it is built and installed.
Receiving an error during the configuration step is generally
preferred to receiving an error in the middle of a build.
2026-05-17 13:12:11 +02:00
Pascal fcae601e44 vulkan: add cpy bf16 -> f32 pipelines (#22677) 2026-05-17 11:31:20 +02:00
Jeff Bolz 7ba22c6a09 vulkan: Support unaligned tensors for ROPE (#22637) 2026-05-17 11:30:16 +02:00
Aldehir Rojas f4cc787b9f common : enable streaming JSON argument values (#23173)
* common : remove atomic from json arguments

* common : remove parsing logic on JSON arguments
2026-05-17 03:44:34 -05:00
Jeff Bolz 3fbadb06dc vulkan: fuse SSM_CONV + BIAS + SILU (#22653) 2026-05-17 10:25:50 +02:00
Rares Vernica 1a68ec9378 server : honor --embd-normalize CLI arg (#23125)
The --embd-normalize flag was registered only for the embedding and debug
examples, so llama-server rejected it and the /embedding handler used a
hard-coded default of 2 (L2). Add LLAMA_EXAMPLE_SERVER to the flag's
example set and read params.embd_normalize as the handler's default. The
per-request "embd_normalize" body field continues to override.
2026-05-17 09:39:04 +03:00
ddh0 a16cce81d3 ngram : reduce noisy logs (#23185)
* ngram : reduce noisy logs

* ngram : reduce noisy logs
2026-05-17 09:38:17 +03:00
Judd 4f13cb7424 webui: support video files as input (#22830) 2026-05-17 02:13:44 +02:00
Xuan-Son Nguyen b64739ea39 server: (router) alloc tmp buffer on heap (#23159) 2026-05-16 23:42:16 +02:00