ddebb5ddf6
Squashes the entire TurboQuant KV-cache feature branch from
https://github.com/TheTom/llama-cpp-turboquant (tip 5aeb2fdbe) onto our master.

Includes: TurboQuant KV-cache types (turbo2_0, turbo3_0, turbo4_0, tq3_1s, tq4_1s), GGML_OP_TURBO_WHT op, CUDA + Metal kernels (including TQ-rotated mul_mm path), CPU reference paths, HIP template instances, perplexity tooling, and 18 post-upstream-sync fixes (CVE-2026-21869 server clamp, HIP FA pool retention, n_head_v reshape, sparse-V CUDA gating, etc.).

Conflict-resolution notes (review carefully before depending on these paths):
- common/arg.cpp, common/speculative.cpp: master's refactored speculative API kept (params.speculative.types / ngram_mod struct, per-sinfo n_low/i_last).
- ggml-cuda/fattn.cu: head-size exclusion lists unioned (now exclude both 192 and 640 alongside other sizes).
- ggml-cuda/ggml-cuda.cu: both master's ADD/SUB/MUL/DIV F16 widening AND TurboQuant's GGML_OP_TURBO_WHT support cases kept.
- ggml-metal-device.h/.cpp: master's new get_pipeline_mul_mv_ext signature (const ggml_tensor * op) kept; TurboQuant's get_pipeline_turbo_wht added.
- ggml-metal-ops.cpp: TurboQuant's TQ-rotated mul_mm path preserved; non-TQ else-branch adapted to master's pipeline.nr0/nr1/nsg dispatch API.
- ggml-vulkan.cpp: master's spec-constant-driven flash_attn pipeline iteration taken (over TurboQuant's CREATE_FA-per-type macro approach). TURBO3_0 added to the fa_kv_ok lambda for type validation.
- ggml-vulkan/flash_attn_base.glsl, vulkan-shaders-gen.cpp: master's new spec-constant FA shader generation kept; TurboQuant's DATA_A_TURBO3_0 macro path NOT carried over.

*** Vulkan TURBO3_0 flash-attention paths need re-implementation against the new spec-constant API. ***
Vulkan TURBO3_0 inference will likely fail until that work is redone.

Squash base: 7fc1c4ef78 (TheTom's last upstream merge point).
110 lines
3.2 KiB
YAML
name: TurboQuant+ Release

on:
  push:
    tags:
      - 'tqp-v*'

env:
  CMAKE_ARGS: "-DLLAMA_BUILD_EXAMPLES=OFF -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_TOOLS=ON -DLLAMA_BUILD_SERVER=ON"

jobs:
  macos-metal:
    runs-on: macos-14

    steps:
      - name: Clone
        uses: actions/checkout@v6
        with:
          fetch-depth: 0

      - name: Build
        run: |
          cmake -B build \
            -DGGML_METAL_USE_BF16=ON \
            -DGGML_METAL_EMBED_LIBRARY=ON \
            -DCMAKE_INSTALL_RPATH='@loader_path' \
            -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON \
            ${{ env.CMAKE_ARGS }}
          cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)

      - name: Pack
        run: |
          cp LICENSE ./build/bin/
          tar -czvf turboquant-plus-${{ github.ref_name }}-macos-arm64-metal.tar.gz \
            -s ",./,turboquant-plus-${{ github.ref_name }}/," -C ./build/bin .

      - name: Upload
        uses: actions/upload-artifact@v6
        with:
          name: macos-arm64-metal
          path: turboquant-plus-${{ github.ref_name }}-macos-arm64-metal.tar.gz

  windows-cuda:
    runs-on: windows-2022

    strategy:
      matrix:
        cuda: ['12.4']

    steps:
      - name: Clone
        uses: actions/checkout@v6

      - name: Install Cuda Toolkit
        uses: ./.github/actions/windows-setup-cuda
        with:
          cuda_version: ${{ matrix.cuda }}

      - name: Install Ninja
        run: choco install ninja

      - name: Build
        shell: cmd
        run: |
          call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" x64
          cmake -S . -B build -G "Ninja Multi-Config" ^
            -DGGML_NATIVE=OFF ^
            -DGGML_CUDA=ON ^
            -DGGML_CUDA_FA_ALL_QUANTS=ON ^
            ${{ env.CMAKE_ARGS }}
          set /A NINJA_JOBS=%NUMBER_OF_PROCESSORS%-1
          cmake --build build --config Release -j %NINJA_JOBS%

      - name: Pack
        run: |
          cp LICENSE ./build/bin/Release/
          $dst='.\build\bin\Release\'
          robocopy "${{env.CUDA_PATH}}\bin" $dst cudart64_*.dll cublas64_*.dll cublasLt64_*.dll
          robocopy "${{env.CUDA_PATH}}\lib" $dst cudart64_*.dll cublas64_*.dll cublasLt64_*.dll
          robocopy "${{env.CUDA_PATH}}\bin\x64" $dst cudart64_*.dll cublas64_*.dll cublasLt64_*.dll
          7z a turboquant-plus-${{ github.ref_name }}-windows-x64-cuda${{ matrix.cuda }}.zip .\build\bin\Release\*

      - name: Upload
        uses: actions/upload-artifact@v6
        with:
          name: windows-x64-cuda${{ matrix.cuda }}
          path: turboquant-plus-${{ github.ref_name }}-windows-x64-cuda${{ matrix.cuda }}.zip

  release:
    needs: [macos-metal, windows-cuda]
    runs-on: ubuntu-latest
    permissions:
      contents: write

    steps:
      - name: Download artifacts
        uses: actions/download-artifact@v7
        with:
          path: ./release
          merge-multiple: true

      - name: Create Release
        uses: softprops/action-gh-release@v2
        with:
          tag_name: ${{ github.ref_name }}
          name: TurboQuant+ ${{ github.ref_name }}
          files: ./release/*
          draft: false
          prerelease: false