Compare commits

...

2938 Commits

Author SHA1 Message Date
Ed Addario e9192bec56 quantize : fix using combined imatrix GGUFs (multiple datasets) (#14973) 2025-07-30 21:11:56 +02:00
Daniel Bevenius 41e78c567e server : add support for embd_normalize parameter (#14964)
This commit adds support for the `embd_normalize` parameter in the
server code.

The motivation for this is that currently if the server is started with
a pooling type that is not `none`, then Euclidean/L2 normalization will
be the normalization method used for embeddings. However, this is not
always the desired behavior, and users may want to use other
normalization (or none) and this commit allows that.

Example usage:
```console
curl --request POST \
    --url http://localhost:8080/embedding \
    --header "Content-Type: application/json" \
    --data '{"input": "Hello world today", "embd_normalize": -1}
```
2025-07-30 18:07:11 +02:00
uvos ad4a700117 HIP: enable mfma mmq on gfx908 and gfx90a for select datatypes and shapes (#14949) 2025-07-30 17:38:06 +02:00
Georgi Gerganov e32a4ec60e sync : ggml
ggml-ci
2025-07-30 17:33:11 +03:00
Kai Pastor e228de9449 cmake : Fix BLAS link interface (ggml/1316) 2025-07-30 17:33:11 +03:00
Kai Pastor 73a8e5ca03 vulkan : fix 32-bit builds (ggml/1313)
The pipeline member can be cast to VkPipeline.
This is a VkPipeline_T* on 64 bit but a uint64_t on 32 bit.
Cf. VK_DEFINE_NON_DISPATCHABLE_HANDLE documentation.
2025-07-30 17:33:11 +03:00
Johannes Gäßler 92b8810ec7 CUDA: skip masked KV slices for all FA kernels (#14924) 2025-07-30 15:46:13 +02:00
Georgi Gerganov 00131d6eaf tests : update for LLAMA_SET_ROWS=1 (#14961)
* test-thread-safety : each context uses a single sequence

* embedding : handle --parallel argument

ggml-ci

* save-load : handle -np 1

ggml-ci

* thread-safety : avoid overriding threads, reduce test case arg

ggml-ci
2025-07-30 15:12:02 +03:00
Georgi Gerganov 1e15bfd42c graph : fix stack-use-after-return (#14960)
ggml-ci
2025-07-30 13:52:11 +03:00
Douglas Hanley a118d80233 embeddings: fix extraction of CLS pooling results (#14927)
* embeddings: fix extraction of CLS pooling results

* merge RANK pooling into CLS case for inputs
2025-07-30 08:25:05 +03:00
Xinpeng Dou 61550f8231 CANN: update ops docs (#14935)
* CANN:add ops docs

* CANN: update ops docs
2025-07-30 08:39:24 +08:00
uvos aa79524c51 HIP: remove the use of __HIP_PLATFORM_AMD__, explicitly support only AMD targets (#14945) 2025-07-29 20:23:04 +02:00
uvos b77d11179d HIP: add GGML_HIP_MMQ_MFMA option to allow disableing the MFMA path. (#14930)
This is useful for testing for regressions on GCN with CDNA hardware.

With GGML_HIP_MMQ_MFMA=Off and GGML_CUDA_FORCE_MMQ=On we can conveniently test the GCN code path on CDNA. As CDNA is just GCN renamed with MFMA added and limited use ACC registers, this provides a good alternative for regression testing when GCN hardware is not available.
2025-07-29 17:44:30 +02:00
uvos c7aa1364fd HIP: Ignore unsupported unroll transformation in fattn-vec (#14931)
llvm with the amdgcn target dose not support unrolling loops with conditional break statements, when those statements can not be resolved at compile time. Similar to other places in GGML lets simply ignore this warning.
2025-07-29 17:43:43 +02:00
kallewoof 1a67fcc306 common : avoid logging partial messages (which can contain broken UTF-8 sequences) (#14937)
* bug-fix: don't attempt to log partial parsed messages to avoid crash due to unfinished UTF-8 sequences
2025-07-29 17:05:38 +02:00
hipudding 204f2cf168 CANN: Add ggml_set_rows (#14943) 2025-07-29 22:36:43 +08:00
Sigbjørn Skjæret 138b288b59 cuda : add softcap fusion (#14907) 2025-07-29 14:22:03 +02:00
Johannes Gäßler bbd0f91779 server-bench: make seed choice configurable (#14929)
* server-bench: make seed choice configurable

* Update scripts/server-bench.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update scripts/server-bench.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix error formatting

* Update scripts/server-bench.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-29 10:40:50 +02:00
Aman Gupta 0a5036bee9 CUDA: add roll (#14919)
* CUDA: add roll

* Make everything const, use __restrict__
2025-07-29 14:45:18 +08:00
lhez 8ad7b3e65b opencl : add ops docs (#14910) 2025-07-28 18:50:17 +02:00
Leonard Mosescu bda62193b2 test-backend-ops : extend test case filtering (#14865)
* Extend test case filtering

1. Allow passing multiple (comma-separated?) ops to test-backend-ops. This can be convenient when working on a set of ops, when you'd want to test them together (but without having to run every single op). For example:

`test-backend-ops.exe test -o "ADD,RMS_NORM,ROPE,SILU,SOFT_MAX"`

2. Support full test-case variation string in addition to basic op names. This would make it easy to select a single variation, either for testing or for benchmarking. It can be particularly useful for profiling a particular variation (ex. a CUDA kernel), for example:

`test-backend-ops.exe perf -b CUDA0 -o "MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=2)"`

These two can be combined. As the current `-o`, this change doesn't try to detect/report an error if an filter doesn't name existing ops (ex. misspelled)

* Updating the usage help text

* Update tests/test-backend-ops.cpp
2025-07-28 18:04:27 +02:00
Radoslav Gerganov c556418b60 llama-bench : use local GPUs along with RPC servers (#14917)
Currently if RPC servers are specified with '--rpc' and there is a local
GPU available (e.g. CUDA), the benchmark will be performed only on the
RPC device(s) but the backend result column will say "CUDA,RPC" which is
incorrect. This patch is adding all local GPU devices and makes
llama-bench consistent with llama-cli.
2025-07-28 18:59:04 +03:00
xctan db16e2831c ggml-cpu : deduplicate scalar implementations (#14897)
* remove redundant code in riscv

* remove redundant code in arm

* remove redundant code in loongarch

* remove redundant code in ppc

* remove redundant code in s390

* remove redundant code in wasm

* remove redundant code in x86

* remove fallback headers

* fix x86 ggml_vec_dot_q8_0_q8_0
2025-07-28 17:40:24 +02:00
Akarshan Biswas cd1fce6d4f SYCL: Add set_rows support for quantized types (#14883)
* SYCL: Add set_rows support for quantized types

This commit adds support for GGML_OP_SET_ROWS operation for various
quantized tensor types (Q8_0, Q5_1, Q5_0, Q4_1, Q4_0, IQ4_NL) and BF16
type in the SYCL backend.

The quantization/dequantization copy kernels were moved from cpy.cpp
to cpy.hpp to make them available for set_rows.cpp.

This addresses part of the TODOs mentioned in the code.

* Use get_global_linear_id() instead

ggml-ci

* Fix formatting

ggml-ci

* Use const for ne11 and size_t variables in set_rows_sycl_q

ggml-ci

* Increase block size for q kernel to 256

ggml-ci

* Cleanup imports

* Add float.h to cpy.hpp
2025-07-28 20:32:15 +05:30
Xuan-Son Nguyen 00fa15fedc mtmd : add support for Voxtral (#14862)
* mtmd : add support for Voxtral

* clean up

* fix python requirements

* add [BEGIN_AUDIO] token

* also support Devstral conversion

* add docs and tests

* fix regression for ultravox

* minor coding style improvement

* correct project activation fn

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-28 15:01:48 +02:00
Johannes Gäßler 946b1f6859 CUDA: fix pointer incrementation in FA (#14916) 2025-07-28 14:30:22 +02:00
Dongliang Wei 6c6e397aff model : add support for SmallThinker series (#14898)
* support smallthinker

* support 20b softmax, 4b no sliding window

* new build_moe_ffn_from_probs, and can run 4b

* fix 4b rope bug

* fix python type check

* remove is_moe judge

* remove set_dense_start_swa_pattern function and modify set_swa_pattern function

* trim trailing whitespace

* remove get_vocab_base of SmallThinkerModel in convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* better whitespace

Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* use GGML_ASSERT for expert count validation

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Improve null pointer check for probs

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* use template parameter for SWA attention logic

* better whitespace

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* move the creation of inp_out_ids before the layer loop

* remove redundant judge for probs

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-28 13:47:00 +02:00
Alberto Cabrera Pérez afc0e89698 sycl: refactor quantization to q8_1 (#14815)
* sycl: quantization to q8_1 refactor

* Refactored src1 copy logic in op_mul_mat
2025-07-28 11:05:53 +01:00
Georgi Gerganov a5771c9eea ops : update BLAS (#14914) 2025-07-28 10:01:03 +02:00
Georgi Gerganov c35f9eaf09 ops : update Metal (#14912) 2025-07-28 08:22:56 +03:00
Georgi Gerganov 1f45f2890e sync : ggml 2025-07-28 08:15:01 +03:00
Kai Pastor 613c5095c3 cmake : Indent ggml-config.cmake (ggml/1310) 2025-07-28 08:15:01 +03:00
Ed Addario 7f97599581 quantize : update README.md (#14905)
* Update README.md

* Fix trailing whitespace

* Update README.md

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-27 23:31:11 +02:00
Ruben Ortlam bf78f5439e vulkan: add ops docs (#14900) 2025-07-27 15:33:08 +02:00
Akarshan Biswas bbfc849274 SYCL: add ops doc (#14901) 2025-07-27 17:52:58 +05:30
Daniel Bevenius ca0ef2dddb llama : clarify comment about pp and tg graphs [no ci] (#14895)
* llama : clarify comment about pp and tg graphs [no ci]

This commit clarifies the comment in `llama-context.cpp` regarding the
prefill prompt (pp), and token generation (tg) graphs.

The motivation for this is that I've struggled to remember these and had
to look them up more than once, so I thought it would be helpful to add
a comment that makes it clear what these stand for.

* squash! llama : clarify comment about pp and tg graphs [no ci]

Change "pp" to "prompt processing".
2025-07-27 12:10:51 +02:00
Erik Scholz 89d1029559 vulkan : add fp16 support for the conv_2d kernel (#14872)
* add f16 to conv_2d testing
* weaken conv2d test error threshold
2025-07-27 12:04:33 +02:00
Jeff Bolz f1a4e72de5 vulkan: skip empty set_rows to avoid invalid API usage (#14860) 2025-07-27 11:05:34 +02:00
Gabriel Larson 4762ad7316 model : make rope_yarn_log_mul optional for deepseek2 (#14896)
* make rope_yarn_log_mul optional for deepseek2

* default rope_yarn_log_mul = 0.0f
2025-07-27 11:18:37 +03:00
Shunta Saito 1dc9614e06 llama : fix kq_scale for the attention layers of PLaMo2 (#14892)
* Fix dimensions for expand

* Change dimensions to copy states to cache

* Fix the default value for plamo2 conversion

* Fix scale given to build_attn

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-27 09:38:44 +02:00
Aman Gupta 446595b9b3 Docs: add instructions for adding backends (#14889) 2025-07-27 09:36:43 +08:00
deepsek 66906cd82a HIP: Enable Matrix cores for MMQ Kernels, Enable stream-K for CDNA 3 (#14624)
This commit adds support for MFMA instructions to MMQ. CDNA1/GFX908 CDNA2/GFX90a and CDNA3/GFX942 are supported by the MFMA-enabled code path added by this commit. The code path and stream-k is only enabled on CDNA3 for now as it fails to outperform blas in all cases on the other devices.
Blas is currently only consistently outperformed on CDNA3 due to issues in the amd-provided blas libraries.
This commit also improves the awareness of MMQ towards different warp sizes and as a side effect improves the performance of all quant formats besides q4_0 and q4_1, which regress slightly, on GCN gpus.
2025-07-27 00:28:14 +02:00
hipudding 11dd5a44eb CANN: Implement GLU ops (#14884)
Implement REGLU, GEGLU, SWIGLU ops according to #14158
2025-07-26 17:56:18 +08:00
R0CKSTAR 9b8f3c6c77 musa: fix build warnings (unused variable) (#14869)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-26 10:36:02 +08:00
Aaron Teo c7f3169cd5 ggml-cpu : disable GGML_NNPA by default due to instability (#14880)
* docs: update s390x document for sentencepiece

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit e086c5e3a7)

* docs: update huggingface links + reword

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 8410b085ea)

* ggml-cpu: disable ggml-nnpa compile flag by default

fixes #14877

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 412f4c7c88)

* docs: update s390x build docs to reflect nnpa disable

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit c1eeae1d0c)

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-25 19:09:03 +02:00
Gabe Goodhart 793c0d7f46 metal: SSM_SCAN performance (#14743)
* feat: Add s_off as a parameter in the args struct

This may not be necessary, but it more closely mirrors the CUDA kernel

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* perf: Parallelize mamba2 SSM_SCAN metal kernel over d_state

This is a first attempt at optimizing the metal kernel. The changes here
are:

- Launch the kernel with a thread group of size d_state
- Use simd groups and shared memory to do the summation for the y
  computation

When tested with G4 tiny preview, this shows roughly a 3x speedup on
prefill and 15% speedup on decode.

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Update logic to correctly do the multi-layer parallel sum

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Correctly size the shared memory bufer and assert expected size relationships

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Compute block offsets once rather than once per token

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Use local variable for state recursion

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Use a secondary simd_sum instead of a for loop

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add assertion and comment about relationship between simd size and num simd groups

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Parallelize of d_state for mamba-1

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Parallel sum in SSM_CONV

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* Revert "feat: Parallel sum in SSM_CONV"

After discussion with @compilade, the size of the parallelism here is
not worth the cost in complexity or overhead of the parallel for.

https://github.com/ggml-org/llama.cpp/pull/14743#discussion_r2223395357

This reverts commit 16bc059660.

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Simplify shared memory sizing

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-Authored-By: Georgi Gerganov <ggerganov@gmail.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-25 10:47:39 -06:00
lhez ce111d39d6 opencl: add fused rms_norm_mul (#14841)
* opencl: add fused `rms_norm` + `mul`

* opencl: improve workgroup size for `rms_norm_mul`
2025-07-25 17:12:13 +02:00
wooksong e7fecba934 docs : update HOWTO‑add‑model.md for ModelBase and new model classes (#14874)
This patch updates the example in docs/development/HOWTO-add-model.md to
reflect recent changes after `TextModel` and `MmprojModel` were introduced.

It replaces the outdated `Model` base class with `TextModel` or `MmprojModel`
and updates the registration example accordingly.

Signed-off-by: Wook Song <wook16.song@samsung.com>
2025-07-25 16:25:05 +02:00
Oliver Simons e2b7621e7c ggml : remove invalid portPos specifiers from dot files (#14838)
Neither "g" nor "x" are valid portPos specifiers per the official
[graphviz documents](https://graphviz.org/docs/attr-types/portPos/):

> If a compass point is used, it must have the form "n","ne","e","se","s","sw","w","nw","c","_".

I tested locally for it to fall back to default portPos specifier if an
invalid portPos is specified. As a consequence, we can remove associated
code.
2025-07-25 14:29:57 +03:00
Georgi Gerganov c1dbea752a context : restore preemptive sched reset when LLAMA_SET_ROWS=0 (#14870)
ggml-ci
2025-07-25 14:28:06 +03:00
kiwi 749e0d27f0 mtmd : fix 32-bit narrowing issue in export-lora and mtmd clip (#14503)
* [fix] Fix 32-bit narrowing issue in export-lora and mtmd clip

* Update export-lora.cpp

* Update clip.cpp

* Update export-lora.cpp

* format: use space to replace tab
2025-07-25 13:08:04 +02:00
Chris Rohlf 64bf1c3744 rpc : check for null buffers in get/set/copy tensor endpoints (#14868) 2025-07-25 12:17:02 +02:00
Diego Devesa c12bbde372 sched : fix multiple evaluations of the same graph with pipeline parallelism (#14855)
ggml-ci
2025-07-25 11:07:26 +03:00
R0CKSTAR 3f4fc97f1d musa: upgrade musa sdk to rc4.2.0 (#14498)
* musa: apply mublas API changes

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: update musa version to 4.2.0

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: restore MUSA graph settings in CMakeLists.txt

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: disable mudnnMemcpyAsync by default

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: switch back to non-mudnn images

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* minor changes

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: restore rc in docker image tag

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-24 20:05:37 +01:00
Georgi Gerganov 2df255da3c sync : ggml
ggml-ci
2025-07-24 20:27:23 +03:00
Kai Pastor 60f816a79d cmake : fix usage issues (ggml/1257)
* CMake config: Create target only once

Fix error on repeated find_package(ggml).
For simplicity, check only for the top-level ggml::ggml.

* CMake config: Add CUDA link libs

* CMake config: Add OpenCL link libs

* CMake config: Use canonical find_dependency

Use set and append to control link lib variables.
Apply more $<LINK_ONLY...>.

* CMake config: Wire OpenMP dependency
2025-07-24 20:27:23 +03:00
Daniel Bevenius 5592f278b6 ggml-cpu : remove stdlib include from repack.cpp (ggml/1276)
This commit removes the inclusion of `<cstdlib>`.

The motivation for this change is that this source file does not seem to
use any functions from this header and the comment about `qsort` is a
little misleading/confusing.
2025-07-24 20:27:23 +03:00
Georgi Gerganov e4868d16d2 context : perform output reorder lazily upon access after sync (#14853)
* context : perform output reorder after lazily upon access after sync

ggml-ci

* cont : add TODO
2025-07-24 16:31:48 +03:00
Xuan-Son Nguyen 820de57d4f chat : fix kimi-k2 chat template (#14852) 2025-07-24 13:59:56 +02:00
Alberto Cabrera Pérez cb4a63aad6 sycl: fixed semantics of block offset calculation (#14814) 2025-07-24 11:09:57 +01:00
yummy 86f5623d90 llama : fix MiniCPM inference after Granite Four changes (#14850)
MiniCPM models use the llm_build_granite constructor which was changed
in the Granite Four PR to use hparams.rope_finetuned instead of a
use_rope parameter. MiniCPM models need rope enabled by default.

Fixes inference from gibberish to correct responses.
2025-07-24 11:50:51 +02:00
Pouya 39cffdf188 docs: add libcurl-dev install hint for Linux distros (#14801)
* docs: add libcurl-dev install hint for Linux distros

Signed-off-by: PouyaGhahramanian <PooyaGhahramanian@gmail.com>

* Update docs/build.md

---------

Signed-off-by: PouyaGhahramanian <PooyaGhahramanian@gmail.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-07-24 11:26:44 +02:00
Georgi Gerganov 065908cb09 metal : fix fusion across different encoders (#14849)
* metal : fix fusion across different encoders

ggml-ci

* cont : add assertion

ggml-ci
2025-07-24 10:24:05 +03:00
Donghyeon Jeong 4ec6291a24 sycl: fix undefined variable in work group size check (#14843) 2025-07-24 12:50:41 +08:00
jacekpoplawski a12363bbf0 convert : text-only support for GLM-4.1V-9B-Thinking (#14823)
* use language_model part only, ignore visual layers

* fix rope_dim calculation
2025-07-23 23:23:57 +02:00
Johannes Gäßler a86f52b285 CUDA: fix overflow in FA, tune performance (#14840) 2025-07-23 21:43:25 +02:00
Johannes Gäßler b284197df4 CUDA: fix compilation with GGML_CUDA_F16 (#14837) 2025-07-23 18:22:30 +02:00
Sigbjørn Skjæret 221c0e0c58 ci : correct label refactor->refactoring (#14832) 2025-07-23 14:27:54 +02:00
Johannes Gäßler 07a19e27a2 CUDA: fix quantized KV cache + multiple sequences (#14822)
* CUDA: fix quantized KV cache + multiple sequences

* Update ggml/src/ggml-cuda/fattn-common.cuh

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-23 14:08:09 +03:00
Georgi Gerganov 18f3b5ff9e tests : add non-cont K,V FA tests
ggml-ci
2025-07-23 14:08:09 +03:00
l3utterfly 7233358d29 memory : handle saving/loading null layers in recurrent memory (#14675)
* Update llama-memory-recurrent.cpp

handle saving/loading null layers in recurrent memory

* fixed styling issues and updated comments

* fix styling issue

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-23 11:16:41 +03:00
lixing-star 6c88b3bb25 ggml: fix loongarch quantize_row_q8_1 error (#14827) 2025-07-23 09:39:51 +03:00
chen fan 14c28dfc50 CANN: weight format to NZ for Ascend310P3 (#14407)
* weight format to nz for 310p

* remove quant weight format to nz

* clean code

* fix

* make the conditions for converting weights to NZ format consistent

* clean code
2025-07-23 11:58:00 +08:00
Aman Gupta 8c988fa41d CUDA: add fused rms norm (#14800) 2025-07-23 09:25:42 +08:00
Csaba Kecskemeti acd6cb1c41 ggml : model card yaml tab->2xspace (#14819) 2025-07-22 19:29:43 +03:00
Jeff Bolz 84712b6043 vulkan: fix rms_norm_mul to handle broadcasting dim0 (#14817) 2025-07-22 17:35:21 +02:00
Molly Sophia d4d1522b20 llama : add model type detection for rwkv7 7B&14B (#14816)
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2025-07-22 23:01:29 +08:00
Ed Addario d1aa0cc5d1 imatrix: add option to display importance score statistics for a given imatrix file (#12718)
* Add --show-statistics option

* Add --show-statistics logic

* Add tensor name parsing

* Tidy output format

* Fix typo in title

* Improve tensor influence ranking

* Add better statistics

* Change statistics' sort order

* Add Cosine Similarity

* Add header search path

* Change header search path to private

* Add weighted statistics per layer

* Update report title

* Refactor compute_statistics out of main

* Refactor compute_cossim out of load_imatrix

* Refactor compute_statistics out of load_imatrix

* Move imatrix statistics calculation into its own functions

* Add checks and validations

* Remove unnecessary include directory

* Rename labels

* Add m_stats getter and refactor compute_statistics out of load_imatrix

* Refactor variable names

* Minor cosmetic change

* Retrigger checks (empty commit)

* Rerun checks (empty commit)

* Fix unnecessary type promotion

Co-authored-by: compilade <git@compilade.net>

* Reverting change to improve code readability

* Rerun checks (empty commit)

* Rerun checks (empty commit)

* Rerun checks - third time's the Charm 🤞 (empty commit)

* Minor cosmetic change

* Update README

* Fix typo

* Update README

* Rerun checks (empty commit)

* Re-implement changes on top of #9400

* Update README.md

* Update README

* Update README.md

Co-authored-by: compilade <git@compilade.net>

* Update README.md

Co-authored-by: compilade <git@compilade.net>

* Update README.md

* Remove duplicate option in print_usage()

* Update README.md

* Update README.md

Co-authored-by: compilade <git@compilade.net>

* Update README.md

Co-authored-by: compilade <git@compilade.net>

* Remove input check

* Remove commented out code

---------

Co-authored-by: compilade <git@compilade.net>
2025-07-22 14:33:37 +02:00
stduhpf c8ade30036 Mtmd: add a way to select device for vision encoder (#14236)
* Mtmd: add a way to select device for vision encoder

* simplify

* format

* Warn user if manual device selection failed

* initialize backend to nullptr
2025-07-22 12:51:03 +02:00
Sigbjørn Skjæret e28c0b80c2 cuda : implement bf16 cpy ops and enable bf16 cont (#14763)
* implement bf16 cpy ops and enable bf16 cont

* deduplicate copy functions

* deduplicate checks
2025-07-22 12:33:10 +02:00
lhez 8e6f8bc875 opencl: remove unreachable return (#14806) 2025-07-22 08:53:30 +02:00
Molly Sophia adef81781a server : allow setting --reverse-prompt arg (#14799)
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2025-07-22 09:24:22 +08:00
R0CKSTAR 48b86c4fdb cuda: remove linking to cublasLt (#14790)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-22 07:45:26 +08:00
Sigbjørn Skjæret 38d3af1b73 opencl: fix im2col when KW!=KH (#14803) 2025-07-21 13:55:10 -07:00
rmatif 6c9ee3b17e opencl: add conv2d kernel (#14403)
* add conv2d kernel

* fix trailing whitespace

* whitespace fixe

* handle f16 input and f16 kernel, more opt

* resolve conflicts

* use enqueue_ndrange_kernel
2025-07-21 10:03:19 -07:00
Romain Biessy cd465d823c sycl: Fix im2col (#14797) 2025-07-21 18:39:29 +02:00
Charles Xu 922042601b kleidiai: add support for get_rows (#14676)
* kleidiai: add support for get_rows

* apply fixes based on code review

* apply more fixes based on code review
2025-07-21 16:49:52 +03:00
Radoslav Gerganov 2ba1333b35 docs : fix backends table in README.md (#14796) 2025-07-21 14:03:49 +02:00
Jeff Bolz c2e058f1b4 vulkan/cuda: Fix im2col when KW!=KH (#14789)
The tid is decomposed into "ow + ky*OW + kx*OW*KH". Change "ksize" to match.
2025-07-21 13:35:40 +02:00
Molly Sophia c82d48ec23 llama : fix --reverse-prompt crashing issue (#14794)
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2025-07-21 17:38:36 +08:00
IsaacDynamo b4efd77f8a server : add parse_special option to /tokenize endpoint (#14783) 2025-07-21 10:24:51 +03:00
Aman Gupta 2be60cbc27 docs : fix link for tools/perplexity in README.md (#14780) 2025-07-20 20:13:47 +02:00
rspOverflow b526ad2668 Documentation: Further revisions to the Vulkan section in build.md (#14785)
* Documentation: Revised and further improved the Vulkan instructions for Linux users in build.md.

* Minor: Revise step 2 of the Vulkan instructions for Linux users in build.md
2025-07-20 18:55:32 +02:00
Aman Gupta 938b785764 Clang-format: local files first + fix BinPacking (#14779) 2025-07-20 19:42:34 +08:00
0cc4m 36c153248f Contrib: add 0cc4m as codeowner for Vulkan backend (#14775) 2025-07-19 23:47:21 +03:00
Ervin Áron Tasnádi a979ca22db ggml: adds CONV_2D op and direct GEMM Vulkan implementation (#14316)
* ggml/ggml-vulkan/test-backend-ops: adds CONV_2D for Vulkan

* ggml-vulkan: adds f32 scalar shader to compute 2D convolution directly
with gemm (no need for im2col),

* test-backend-ops: adds test_case_ref to check the validity/performance of ops
against reference implementations having different graphs, adds tests

* * Performance fixes: minimized branch divergence, uses collectives to
  eliminate redundant calculation, macros removed.

* Kernel shared memory size check

* Updates test-backend-ops to support graphs for performance
  measurement.

* * Apple/Win32 compile errors fixed

* Subgroup size used to determine tile size -> fixes llvmpipe errors.

* Collectives disabled by default.

* Intel support is disabled as the performance is poor.

* Conv2d enabled for Intel with disabled collectives, disabled for Apple

* test-backend-ops modifications are reverted

* Trailing spaces and missing override fixed.

* Triggering pipeline relaunch.

* Code formatted with .clang-format.
2025-07-19 21:59:08 +02:00
compilade 90083283ec imatrix : use GGUF to store importance matrices (#9400)
* imatrix : allow processing multiple chunks per batch

* perplexity : simplify filling the batch

* imatrix : fix segfault when using a single chunk per batch

* imatrix : use GGUF to store imatrix data

* imatrix : fix conversion problems

* imatrix : use FMA and sort tensor names

* py : add requirements for legacy imatrix convert script

* perplexity : revert changes

* py : include imatrix converter requirements in toplevel requirements

* imatrix : avoid using designated initializers in C++

* imatrix : remove unused n_entries

* imatrix : allow loading mis-ordered tensors

Sums and counts tensors no longer need to be consecutive.

* imatrix : more sanity checks when loading multiple imatrix files

* imatrix : use ggml_format_name instead of std::string concatenation

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* quantize : use unused imatrix chunk_size with LLAMA_TRACE

* common : use GGUF for imatrix output by default

* imatrix : two-way conversion between old format and GGUF

* convert : remove imatrix to gguf python script

* imatrix : use the function name in more error messages

* imatrix : don't use FMA explicitly

This should make comparisons between the formats easier
because this matches the behavior of the previous version.

* imatrix : avoid returning from void function save_imatrix

* imatrix : support 3d tensors with MUL_MAT

* quantize : fix dataset name loading from gguf imatrix

* common : move string_remove_suffix from quantize and imatrix

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* imatrix : add warning when legacy format is written

* imatrix : warn when writing partial data, to help guess dataset coverage

Also make the legacy format store partial data
by using neutral values for missing data.
This matches what is done at read-time for the new format,
and so should get the same quality in case the old format is still used.

* imatrix : avoid loading model to convert or combine imatrix

* imatrix : avoid using imatrix.dat in README

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-19 12:51:22 -04:00
Peter0x44 d4b91ea7b2 vulkan: Add logging for bf16 features to ggml_vk_print_gpu_info (#13274) (#14707) 2025-07-19 17:58:03 +02:00
0cc4m 83f5872404 Vulkan: Fix fprintf format-security warning (#14770) 2025-07-19 17:47:53 +02:00
rspOverflow f0d4d176df Documentation: Update build.md's Vulkan section (#14736)
* Documentation: Rewrote and updated the "Without docker" portion of the Vulkan backend build documentation.

* Documentation: Reorganize build.md's Vulkan section.
2025-07-19 12:18:36 +02:00
Georgi Gerganov b17230917c sync : ggml 2025-07-19 11:46:50 +03:00
Georgi Gerganov bf9087f59a metal : fuse add, mul + add tests (#14596)
ggml-ci
2025-07-18 20:37:26 +03:00
Georgi Gerganov 9fb1042ce6 graph : fix graph reuse reset of params (#14760)
ggml-ci
2025-07-18 20:08:33 +03:00
Georgi Gerganov 2adf8d83ac parallel : add option for different RNG seeds (#14757)
ggml-ci
2025-07-18 17:33:41 +03:00
Oliver Simons 021cc28bef cuda : Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs (#14741)
* Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs

Gemma3n uses Matrix-Matrix addition as part of their input processing,
wrongly triggering CUDA_GRAPH disablement on NVGPUs even when batch-size
of 1 is used.

* Exclude `project_per_layer_input` by matching node names

This ensures that all other graphs which don't exhibit this pattern do
not have their behavior changed.

* Revert unnecessary formatting changes
2025-07-18 04:35:32 -07:00
Georgi Gerganov d498af3d5a graph : avoid huge warm-up graphs for MoE models (#14753)
* graph : avoid huge warm-up graphs for MoE models

ggml-ci

* cont : bump max nodes to 8x model tensors
2025-07-18 14:31:15 +03:00
Georgi Gerganov eacdeb5bfc model : fix build after merge conflict (#14754) 2025-07-18 11:53:55 +03:00
lgai-exaone e0cb5c5cb8 model : add EXAONE 4.0 support (#14630) 2025-07-18 10:45:49 +02:00
Aman Gupta f9a31eea06 CUDA: set_rows + cpy.cu refactor (#14712) 2025-07-18 14:54:18 +08:00
Georgi Gerganov 8f974bc1e9 graph : refactor context to not pass gf explicitly (#14629)
ggml-ci
2025-07-18 08:29:28 +03:00
Nexes the Elder 09651d09ff graph : Pass the graph placeholder message in debug mode (#14748)
Without that condition, this debug log clutters the screen every batch treated in the prompt processing, or every token generated in Kobold.cpp.
2025-07-18 07:25:54 +03:00
Neo Zhang Jianyu 349ea79fce use max work group size for device to replace the magic number (#14732) 2025-07-18 10:23:14 +08:00
Piotr Wilkin (ilintar) 670e1360cd convert : fix Ernie4.5 MoE without shared experts (#14746) 2025-07-18 01:17:16 +02:00
Wroclaw 760b4484e3 nix : use optionalAttrs for env mkDerivation attrset argument (#14726) 2025-07-17 15:18:16 -07:00
Piotr Wilkin (ilintar) cb887f1bc1 model: add Ernie 4.5 MoE support (#14658)
* Add Ernie4.5 MoE

* Fix Flake errors.

* Properly encode/decode MoE layer step

* Correct tensor mappings (.weight)

* Pass and read n_ff_exp

* n_ff_shexp calculation and further minor changes

* Rope fixes.

* .gitignore fix

* Add unit32 cast for Linux builds

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Further fixes from code review

* Fix trailing whitespace

* Reenable missing experts error

* Code style from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Fix non-MoE regression

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-17 23:15:32 +02:00
Georgi Gerganov d6fb3f6b49 kv-cache : fix k-shift for multiple streams (#14742)
ggml-ci
2025-07-17 20:52:33 +03:00
Georgi Gerganov 01612b7409 llama : reuse compute graphs (#14482)
* llama : reuse compute graphs

ggml-ci

* llama-bench : add graph reuse parameter

ggml-ci

* cont : remove the parameter and the sched resets

ggml-ci

* graph : rename update() to can_reuse()

ggml-ci

* params : remove is_same()

ggml-ci

* graph : set res->params in llm_graph_context constructor

ggml-ci

* graph : avoid set_max_nodes in llm_graph_result

ggml-ci

* kv-cache : reuse llama_context's graph result instance

ggml-ci

* context : reset the previous graph result upon memory updates

ggml-ci

* batch : llama_ubatch now carries its data instead of pointing to balloc

ggml-ci

* merge : fix build

ggml-ci

* graph : fix can_reuse() checks when flash-attention is disabled

* graph : move llm_graph_result impl in source file + debug env

ggml-ci
2025-07-17 19:08:33 +03:00
Tarek Dakhran 086cf81e88 llama : fix parallel processing for lfm2 (#14705) 2025-07-17 09:22:11 +02:00
Georgi Gerganov d9b691081c kv-cache : opt mask set input (#14600)
ggml-ci
2025-07-17 09:49:15 +03:00
Georgi Gerganov ad57d3edd2 batch : fix uninitialized has_cpl flag (#14733)
ggml-ci
2025-07-17 09:45:54 +03:00
Sigbjørn Skjæret 1ba45d4982 ci : disable failing vulkan crossbuilds (#14723) 2025-07-16 20:52:08 -03:00
Sigbjørn Skjæret 19e5943d9e convert : make hf token optional (#14717)
* make hf token optional

* fail if we can't get necessary tokenizer config
2025-07-16 23:17:43 +02:00
Diner Burger 496957e1cb llama : fix parameter order for hybrid memory initialization (#14725) 2025-07-16 21:17:25 +02:00
Reese Levine 21c021745d ggml: Add initial WebGPU backend (#14521)
* Minimal setup of webgpu backend with dawn. Just prints out the adapter and segfaults

* Initialize webgpu device

* Making progress on setting up the backend

* Finish more boilerplate/utility functions

* Organize file and work on alloc buffer

* Add webgpu_context to prepare for actually running some shaders

* Work on memset and add shader loading

* Work on memset polyfill

* Implement set_tensor as webgpu WriteBuffer, remove host_buffer stubs since webgpu doesn't support it

* Implement get_tensor and buffer_clear

* Finish rest of setup

* Start work on compute graph

* Basic mat mul working

* Work on emscripten build

* Basic WebGPU backend instructions

* Use EMSCRIPTEN flag

* Work on passing ci, implement 4d tensor multiplication

* Pass thread safety test

* Implement permuting for mul_mat and cpy

* minor cleanups

* Address feedback

* Remove division by type size in cpy op

* Fix formatting and add github action workflows for vulkan and metal (m-series) webgpu backends

* Fix name

* Fix macos dawn prefix path
2025-07-16 18:18:51 +03:00
tempstudio b0f0ecc3dc model : support output bias for qwen2 (#14711)
Co-authored-by: qwaqrm <qwaqrm@126.com>
2025-07-16 18:02:06 +03:00
Georgi Gerganov 225e7a1438 llama : add high-throughput mode (#14363)
* kv-cache : prepare K/V buffers for separation

ggml-ci

* batched-bench : fix oob write

ggml-ci

* llama : add "virtual sequences"

ggml-ci

* llama : use "stream" vs "virtual sequence"

ggml-ci

* graph : fix stream splitting when KV cache is not used

ggml-ci

* kv-cache : add multi-stream save/load support

ggml-ci

* llama : add "--attn-streams" flag

ggml-ci

* kv-cache : fix handling when find_slot fails

ggml-ci

* kv-cache : restore find_slot impl

ggml-ci

* kv-cache : add comments

* kv-cache : add bounds checks for sequence id

ggml-ci

* cont : add n_seq_max to batch allocr

ggml-ci

* kv-cache : perform stream copies lazily after llama_synchronize

ggml-ci

* kv-cache : avoid throwing exceptions across the C boundary

ggml-ci

* CUDA: 4D FlashAttention support (#14628)

* CUDA: 4D FlashAttention support

* CUDA: fix WMMA FA kernel

* llama : rename attn_streams -> kv_unified

ggml-ci

* common : rename kv_split -> kv_unified

ggml-ci

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-07-16 16:35:42 +03:00
Aman Gupta ab14019821 Support diffusion models: Add Dream 7B (#14644)
* Support diffusion models: Add Dream 7B

* Move diffusion to examples

* Move stuff to examples. Add patch to not use kv-cache

* Address review comments

* Make sampling fast

* llama: remove diffusion functions

* Add basic timings + cleanup

* More cleanup

* Review comments: better formating, use LOG instead std::cerr, re-use batch, use ubatch instead of max_length

* fixup!

* Review: move everything to diffusion-cli for now
2025-07-16 20:03:51 +08:00
Georgi Gerganov 64978340b0 ggml : add asserts (#14720)
* ggml : add asserts

ggml-ci

* cont : fix constant type

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-07-16 14:43:32 +03:00
Georgi Gerganov 6ffd4e9c44 server : pre-calculate EOG logit biases (#14721)
ggml-ci
2025-07-16 14:04:12 +03:00
Shunta Saito e4841d24d3 llama : fix parallel processing for plamo2 (#14716) 2025-07-16 12:12:22 +02:00
Georgi Gerganov 538cc77f7f server : fix handling of the ignore_eos flag (#14710)
ggml-ci
2025-07-16 12:13:57 +03:00
Johannes Gäßler 5cae766541 scripts: synthetic prompt mode for server-bench.py (#14695) 2025-07-16 09:33:28 +02:00
Sigbjørn Skjæret 4b91d6f71f convert : only check for tokenizer folder if we need it (#14704) 2025-07-16 08:52:04 +02:00
Sigbjørn Skjæret cf91f217f1 convert : add pre-computed hashes first to prevent order mishaps (#14701) 2025-07-16 08:51:12 +02:00
Min-Hua 79e0b68c17 llama: add LLAMA_API to deprecated llama_kv_self_seq_div (#14708)
Add LLAMA_API to fix the run-time error with llama-cpp-python in Windows env:
attributeError: function 'llama_kv_self_seq_div' not found.
Did you mean: 'llama_kv_self_seq_add'?

Although llama_kv_self_seq_div() has been marked deprecated but
it is necessary to export it to make llama-cpp-python happy.

Observed software version:
OS: windows
compiler: MSVC
llama-cpp-python: tag: v0.3.12-cu124
llama.cpp: tag: b5833

Signed-off-by: Min-Hua Chen <minhuadotchen@gmail.com>
Co-authored-by: Min-Hua Chen <minhua.chen@neuchips.ai>
2025-07-16 07:00:42 +03:00
Ed Addario c81f4192f9 gguf-py : dump bpw per layer and model in markdown mode (#14703) 2025-07-16 00:04:42 +02:00
Gabriel Larson 4a4f426944 model : add Kimi-K2 support (#14654)
* Kimi-K2 conversion

* add Kimi_K2  pre type

* Kimi-K2

* Kimi-K2 unicode

* Kimi-K2

* LLAMA_MAX_EXPERTS 384

* fix vocab iteration

* regex space fix

* add kimi-k2 to pre_computed_hashes

* Updated with kimi-k2 get_vocab_base_pre hash

* fix whitespaces

* fix flake errors

* remove more unicode.cpp whitespaces

* change set_vocab() flow

* add moonshotai-Kimi-K2.jinja to /models/templates/

* update moonshotai-Kimi-K2.jinja

* add kimi-k2 chat template

* add kimi-k2

* update NotImplementedError

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* except Exception

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* LLM_CHAT_TEMPLATE_KIMI_K2 if(add_ass){}

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-15 21:54:22 +02:00
Jeff Bolz ba1ceb3456 vulkan: fix noncontig check for mat_mul_id splitting (#14683)
* vulkan: fix noncontig check for mat_mul_id splitting

Remove supports_op check for > 4096 (splitting fixes this)

* vulkan: fix batched matmul dequant for Q*_K
2025-07-15 21:51:09 +02:00
Jeff Bolz 10a0351a97 vulkan: add RTE variants for glu/add/sub/mul/div (#14653) 2025-07-15 21:32:11 +02:00
Shunta Saito 68e37a61a7 model : add PLaMo-2 support (#14560)
* Add PLaMo-2 model using hybrid memory module

* Fix z shape

* Add cmath to include from llama-vocab.h

* Explicitly dequantize normalization weights before RoPE apply

* Revert unnecessary cast because the problem can be solved by excluding attn_k, attn_q when quantizing

* Use ATTN_K/Q_NORM for k,q weights to prevent quantization

* Remove SSM_BCDT that is not used from anywhere

* Do not duplicate embedding weights for output.weight

* Fix tokenizer encoding problem for multibyte strings

* Apply suggestion from @CISC

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Use LLM_FFN_SWIGLU instead of splitting ffn_gate and ffn_up

* Remove unnecessary part for Grouped Query Attention

* Fix how to load special token id to gguf

* Remove unused tensor mapping

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Remove llama_vocab_plamo2 class and replace it with llm_tokenizer_plamo2_session to follow the other tokenizer implementations

* Update src/llama-vocab.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Fix plamo2 tokenizer session to prevent multiple calls of build()

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-15 18:11:42 +02:00
R0CKSTAR cbc68be51d cuda: fix build warnings in set-rows.cu (unused variable) (#14687)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-15 15:28:53 +08:00
Anton Mitkov bdca38376f sycl: Hotfix for non dnnl codepath (#14677) 2025-07-14 18:12:42 +01:00
shalinib-ibm 55c509daf5 ggml : refactor llamafile_sgemm PPC code (#14673)
Remove un-necessary templates from class definition and packing functions
Reduce deeply nested conditionals, if-else switching in mnapck function
Replace repetitive code with inline functions in Packing functions

2 ~ 7% improvement in Q8 Model
15 ~ 50% improvement in Q4 Model

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
2025-07-14 16:16:42 +03:00
Aman Gupta 9c9e4fc635 llama-context: add ability to get logits (#14672) 2025-07-14 21:01:41 +08:00
Johannes Gäßler 494c5899cb scripts: benchmark for HTTP server throughput (#14668)
* scripts: benchmark for HTTP server throughput

* fix server connection reset
2025-07-14 13:14:30 +02:00
Akarshan Biswas 0f4c6ec0f1 SYCL: use 1D kernel for set_rows (#14618)
* SYCL: Use 1D kernel for set_rows

* Remove dangling comment

* Refactor and use ceil_div
2025-07-14 10:37:55 +01:00
Anton Mitkov 65a3ebb0aa sycl: Batched mulmat rework for oneDNN dispatch (#14617) 2025-07-14 10:37:35 +01:00
Molly Sophia 0d9226763c llama : add jinja template for rwkv-world (#14665)
* llama : add jinja template for rwkv-world

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-14 07:43:43 +08:00
Ed Addario 982e347255 quantize : fix minor logic flaw in --tensor-type (#14572) 2025-07-13 18:02:17 +02:00
Sigbjørn Skjæret 923e3ea2e3 cuda : add set rows for bf16 (#14664) 2025-07-13 15:01:24 +02:00
Yavor Ivanov e743cddb60 cuda : add ELU support (#14657) 2025-07-13 11:33:16 +02:00
Georgi Gerganov 05fec5bd29 ggml : add build-time message to remind about ggml_set_rows (#14661)
ggml-ci
2025-07-13 10:36:33 +03:00
Yavor Ivanov dcf7f2ea3c metal : Add missing unary ops Metal support (#14660) 2025-07-13 08:38:13 +03:00
Yavor Ivanov 84b396e051 cmake : Add CMake presets for Linux and GCC (#14656) 2025-07-13 08:12:36 +03:00
Tarek Dakhran c31e60647d tests : cover lfm2 cases in test_ssm_conv (#14651) 2025-07-12 19:10:14 +02:00
Tarek Dakhran 67eade1bf9 docs : add LFM2 to models section (#14650)
* readme : add LFM2 to models section

* fix copy paste...
2025-07-12 19:07:08 +02:00
Aman Gupta 7de5c7cab6 CUDA: add set rows for f32 and f16 (#14551)
* CUDA: add set rows for f32 and f16

* Review: change kernel params, use strides from host

* Use 1-d kernel

* Review: use int64_t for blockDim.x, rename nb->s for clarity
2025-07-12 16:31:38 +03:00
Georgi Gerganov 8eff95544e sync : ggml 2025-07-12 16:13:27 +03:00
Georgi Gerganov 3120413ccd vulkan : remove unused vars (#0)
ggml-ci
2025-07-12 14:25:44 +03:00
Georgi Gerganov 215535701d sync : ggml
ggml-ci
2025-07-12 14:25:44 +03:00
Acly 74bb294591 vulkan : implement bilinear interpolation (ggml/1291)
ggml-ci
2025-07-12 14:25:44 +03:00
Acly 3e303b1107 vulkan : implement ggml_roll (ggml/1290)
ggml-ci
2025-07-12 14:25:44 +03:00
Douglas Hanley 0c1df14b5f server : fix pooled embedding output (#14645) 2025-07-12 13:21:02 +03:00
Jeff Bolz b3ad3a0191 vulkan: support SET_ROWS (#14587)
* vulkan: support SET_ROWS

Add variants of the copy_to_quant shader that do the SET_ROWS operation.
Change these shaders to spread the work across the workgroup.
The memory access pattern is probably not great (one thread per quant block),
but should be fine for now.

* vulkan: optimize set_rows

Larger workgroups for non-quant types.
Set "norepeat" (there is manual repeat logic).
Use fastmod.
2025-07-12 12:12:26 +02:00
Jeff Bolz 98197e5c98 vulkan: optimizations for deepseek prompt processing (#14555)
* vulkan: allow unclamped loads in coopmat2 mul_mat_id shader

* vulkan: increase coopmat2 mul_mat_id tile size

* vulkan: optimize mat_mul_id row_ids search to batch loads, and port to coopmat1 path

* vulkan: use smaller FA row size when head size is large. applies to both scalar and CM2 paths (CM1 isn't used due to shared memory limits)
2025-07-12 11:51:58 +02:00
Tarek Dakhran f5e96b368f model : support LiquidAI LFM2 hybrid family (#14620)
**Important**
LFM2 was [merged ](https://github.com/huggingface/transformers/pull/39340)into transformers, but has not yet been released.
To convert into gguf, install transformers from source
```shell
pip install "transformers @ git+https://github.com/huggingface/transformers.git@main"
```
2025-07-11 20:27:01 +02:00
Slobodan Josic 756aa1020a HIP : Add HIP 7.0+ compatibility for hipBLAS compute types (#14634) 2025-07-11 18:55:00 +02:00
Georgi Gerganov aaa088d87f readme : add hot PRs (#14636)
* readme : add hot PRs

* cont

* readme : update title

* readme : hot PRs links

* cont
2025-07-11 16:07:55 +03:00
Georgi Gerganov 0d5375d54b llama : move enum llama_vocab_pre_type to implementation (#14631)
ggml-ci
2025-07-11 13:46:07 +03:00
Dowon 576c82eda2 vocab : add midm-2.0 model pre-tokenizer (#14626) 2025-07-11 09:36:04 +02:00
Gabe Goodhart 0aedae00e6 model : Granite Four (#13550)
* wip: llama : separate recurrent states from the KV cache

This will be necessary to support Jamba
(and other recurrent models mixed with Attention).

Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.

* llama : use std::find for seq_nodes in llama_rs_cache

* llama : state checkpoints for recurrent models

* llama : correctly handle more edge cases for the rs cache

* llama : rename many llama_kv_cache_* functions

* llama : remove useless return value for some llama_cache_* functions

* llama : rethink recurrent state cell counts

* llama : begin work on support for variable GQA

This will also be useful for Jamba if we consider the Mamba layers
to have 0 KV heads.

* llama : gracefully fail when not finding hybrid slot

* llama : support Jamba

* llama : fix BERT inference without KV cache

* convert-hf : check for unprocessed Jamba experts

* convert-hf : support Mini-Jamba conversion

* llama : fix Jamba quantization sanity checks

* llama : sequence-length-aware batch splitting

* llama : use equal-sequence-length sub-batches for recurrent models

* ggml : simplify SSM-related operators

* llama : make recurrent state slot allocation contiguous

* llama : adapt internal uses of batches to llama_ubatch

* llama : fix batch split output count for embeddings

* llama : minimize swaps when reordering logits

This reduces overhead when running hellaswag
on thousands of sequences with very small 100k params Mamba models.

* llama : fix edge case finding batch seq_id of split recurrent cell

This otherwise was a problem when running the HellaSwag benchmark
with small batch sizes, making it crash.

* llama : avoid copies for simple batch splits

* llama : use im2col and mul_mat to perform convolution for Mamba

This removes the need for ggml_ssm_conv!!!
But performance seems slighly worse on my system,
especially for prompt processing.
Maybe ggml_mul_mat isn't optimized for small row sizes?
More performance testing is necessary until GGML_OP_SSM_CONV is removed.

* ggml : make ggml_ssm_scan not modify its source tensors

* llama : fix shared recurrent tail cell count for small ubatch sizes

Otherwise it was impossible to run the 'parallel' example with '-ub 1'
with a Mamba or Jamba model.

* llama : fix .base() compilation error on Windows

* llama : allow doing the equivalent of SSM_CONV with SUM_ROWS and MUL

* ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors

The implementation already supported it,
and this makes Mamba's conv step slightly faster.

* llama : rename llama_cache to llama_past

This can be changed back later if the name change is wrong.
I was renaming the functions anyway to generalize kv-cache-related
functions to hybrid and recurrent model architectures.
I think llama_past is a better name than llama_cache for a combined
kv cache and recurrent state cache, because the states it contains
pretty much always come before the newly-added ones for any particular
sequence. Also 'llama_past_clear' sounds more obvious in what it does
than 'llama_kv_cache_clear'. The future is what the models generate.
(For embeddings, the kv cache isn't really used anyway)

Still, I'm open to better suggestions.

* examples : replace llama_kv_cache_seq_* with llama_past_seq_*

* mamba : fix non-contiguous usage of ggml_silu

* llama : initial Mamba-2 support

* ggml : SIMD ggml_ssm_scan for Mamba-2

* ggml : improve ggml_mul speed when masking recurrent states

* llama : support running Mamba-Codestral-7B-v0.1

* llama : fix Mamba-2 conv state saving

* ggml : make the ggml_mul fast broadcast path more consistently formatted

* llama : remove unused variable

* llama : add missing break

* convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present

The tokenzier.json of Mamba-Codestral-7B-v0.1 otherwise requires
workarounds to work correctly.

* llama : session saving and reloading for hybrid models

* convert_hf : fix Jamba conversion

* llama : fix mixed signedness comparison

* llama : use unused n_embd_k_gqa in k_shift

This also slightly reduces the diff from the master branch

* llama : begin renaming llama_past back to llama_kv_cache

* llama : avoid redundant state copy for Mamba 1 and 2

* metal : attempt to adapt SSM_SCAN for Mamba-2

* metal : fix SSM_SCAN pipeline scope

* metal : use log and exp instead of log1pf and expf in SSM_SCAN

* metal : remove unused arguments for SSM_SCAN

The max index is 31, so trimming the arguments is necessary.

* metal : add back n_seqs to SSM_SCAN args

Whoops, this is needed for the offset in the concatenated output.

* metal : fix SSM_SCAN state head offset

* metal : fix wrong number of tokens per sequence in SSM_SCAN

* ggml : remove unused fast broadcast path in GGML_MUL

This was initially added because states were masked with ggml_mul,
but this is no longer done and so this "optimisation" is no longer
necessary, or at least not worth the additional code complexity.

* ggml : avoid multiply by D in GGML_OP_SSM_SCAN

This makes the weight buft detection in src/llama.cpp simpler.

* convert : transpose Mamba-2 A, D and reshape SSM_NORM

This breaks existing conversions of Mamba-2 models
to avoid some reshapes.

Not sure if it's a good idea,
but it makes the graph slightly cleaner.

* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks

* convert : fix flake8 lint

* llama : remove implicit recurrent state rollbacks

* llama : partially apply clang-format style

* metal : fix confusion between ; and ,

* metal : add missing args for nb references in ssm_scan_f32_group

* metal : single-user mamba2 inference works

* kv-cache : remove const_cast when setting inputs for s_copy

And also fix multi-user inference for recurrent models
by using cell_id instead of i as the kv cell index
when populating s_copy.

* convert : avoid AutoConfig for Mamba and Mamba2 hparams

* kv-cache : allow context shift for recurrent models

* graph : fix recurrent state copies when avoiding copies

Works, but using lambda functions might not be that clean.

* ggml : fix mamba2 ssm scan when compiled with SVE

* ggml-cpu : reorder SVE FMA for consistency with other SIMD arches

* cuda : implement ssm scan for Mamba2

There is still room for improvement, but it works!

* cuda : adapt Mamba1 ssm scan to shape changes from Mamba2

* feat: Add conversion for Bamba models

This is borrowed and adapted from the original implementation
https://github.com/ggml-org/llama.cpp/pull/10810

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add Granite 4 conversion

This is a manual copy from my draft branch
https://github.com/gabe-l-hart/llama.cpp/blob/GraniteFourDraft/convert_hf_to_gguf.py#L5076

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Plumb bamba through llama-arch

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add bamba to llama_arch_is_hybrid_recurrent

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add optional mamba ssm_in bias tensor

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add template specialization for get_arr to load a vector<uint32_t> for layer index arr in hparams

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Use an explicit bool to determine mamaba vs mamba2

This allows other architectures like bamba and granitemoehybrid to use
mamab2 without a growing architecture `if` statement inside the mamba
implementation.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Isolate mamba(2) and granite attention layer building in static methods

This will allow these layer-builder methods to be used from other build
structs without complex inheritance.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use per-layer sizes in granite build_attention_layer

Also no need to pass in kv cache since it's already in the inp_attn

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: First (broken) pass at end-to-end Bamba implementation

It generates (garbage) tokens! Still lots of debugging to do.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Only do Granite multipliers if set

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Pull granite ffn portion into a static function and reuse in hybrid

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(py): Allow gguf duplicate keys if they match by value and type

This is helpful for hybrid models that want to do gguf param setting by
calling multiple parent classes without needing to make those parent
classes try/except on every attempt to set a gguf value.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor(py): Simplify granitemoehybrid conversion to use parents better

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add GRANITE_MOE_HYBRID through llama-arch

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Support GRANITE_MOE_HYBRID in llama-model

This re-uses the Bamba code paths heavily and simply adds the missing parts
for loading MoE and the shared expert.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* style: Fix flake8 errors

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix recurrent cache get after rebase

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix hybrid granite implementation for signature changes in build_mamba*_layer

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Refactor relationship between non-hybrid classes and hybrid impl to use mixins

The challenge here is to give both the non-hybrid classes (llm_build_mamba
and llm_build_granite) AND the hybrid class (llm_build_hybrid_mamba) access
to the same intermediate "base class" functionality (build_mamba*_layer,
build_granite_attention_layer) without running into trouble with diamond
inheritance of llm_graph_context. Due to the non-trivial initialization
that happens in llm_graph_context, diamond inheritance results in multiple
initializations of the common base which cause problems around the unique
ptrs. I wanted to get away from `self->` everywhere, but this is still a
bit cleaner than making those methods static I think.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Implement the full copy-paste version to duplicate the layer builders

This follows the pattern where the type of input is pinned to the type of
memory and that is used to dispatch to the correct version of `build_rs` /
`build_attn`. There's a lot of code duplication that can hopefully be
pulled into common functions in the graph later.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Rename llm_build_hybrid_mamba -> llm_build_granite_hybrid

I've got back-and-forth a lot about how/if to try to implement reuse of the
"child model" layer types for hybrid models. At the end of the day, I think
hybrid models are their own beast and even if their layers are inspired by
other models, they should maintain control of their own layer building (in
other words, the copy-paste method). Given that, the name should reflect
that this is not a generic hybrid model builder, but rather a granite-
specific hybrid model builder that can do MoE (granite 4) or dense (bamba).

As part if this, I also cleaned up dangling comments from previous attempts
at using static methods for reusability.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* mamba : fix mismatched new and delete size for llm_build_mamba

Subclasses of llm_graph_context cannot have extra fields,
because the called destructor is not the one from the subclass.
This otherwise would cause problems when runnning Mamba-(1|2) inference
when compiled -DGGML_SANITIZE_ADDRESS=ON

* memory : correctly handle failure in apply()

ggml-ci

* style: Remove TODO for adding first hybrid models to the switch

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix bad merge in tensor_mapping.py w/ SSM_NORM

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix bad merge resolution with variable renames/moves in llm_build_mamba

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* docs: Fix comment about duplicate key check

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Conform to standard way of initializing inp_out_ids

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* convert : fix jamba conv1d shape squeezing

* fix: Fix input initialization in granite_hybrid after removal of hybrid inputs

Branch: GraniteFourWithJamba

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use llm_graph_context_mamba in llm_build_granite_hybrid

Branch: GraniteFourWithJamba

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Refactor mamba2/granite/jamba/granite_hybrid relationships as mixins

The key is for the mixin classes (llm_graph_context_mamba,
llm_graph_context_granite) to use virtual inheritance from
llm_graph_context. This allows the common members to exist only once in the
class hierarchy. The downside is that llm_graph_context will be
re-initialized once for each parent (ie 2x for single mixin, 3x for two
mixins, etc...).

Branch: GraniteFourWithJamba

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* graph : add back hybrid memory graph input

But this time it contains the sub-cache graph inputs.
This *should* make it easier to handle updating the inputs
when caching the graph (eventually).

* model : add Jamba to Mamba-specific hparams printing

* fix: Fix input setup after upstream merge

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* jamba : remove redundant nullptr initializations

* model : remove unnecessary prefix for tensor loading constants

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* model : use ggml_swiglu_split for Mamba

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* feat: Add support for dense FFN in GraniteMoeHybrid

This was already partially supported via reusing the granite ffn builder,
and there may be models that leverage this architecture going forward. The
naming is a bit odd, but in the transformers version, it reuses the same
model class and simply has zero regular experts and a single shared expert
(which is the same as a single dense FFN).

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add support for dense FFN tensor names on c++ side

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use child inputs for Falcon H1 after merge resolution

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove unnecessary prefix on tensor constants

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* model : make falcon-h1 use shared mamba2 layer builder

* memory : avoid referring to KV in recurrent cache logs

* fix: Revert order changes for Falcon H1 to stay consistent with upstream

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* gguf-py : avoid adding duplicate tensor mappings for Jamba

Some of the tensor names are common with Llama4

* refactor: Collapse Bamba and GraniteMoeHybrid into GraniteHybrid

The only key difference is the use of rope which is now set via
rope_finetuned in the hparams

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Remove use of diamond inheritance

Per PR discussion, it's simpler to keep this with basic inheritance and not
introduce the complexity of virtual inheritance and multiple inheritance

https://github.com/ggml-org/llama.cpp/pull/13550#issuecomment-3053787556

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Log mamba params for Granite Hybrid

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove unused ssm_in_b

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Remove ATTENTION_LAYER_INDICES hparam in favor of n_head_kv

This matches how recurrent vs attention heads are identified for Jamba

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove unused template expansion for get_arr

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Review cleanup in convert_hf_to_gguf

The gist is to be explicit about which base class is being used with the
multiple inheritance setup

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Undo hidden warnings about duplicate identical keys in add_key_value

After further discussion, this encourages sloppy overwriting in the model
converters

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: If not using ROPE, context is "infinite"

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* doc: Add a comment outlining expected duplicate key warnings

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove unnecessary duplicate keys in converter

Co-authored-by: Francis Couture-Harpin <git@compilade.net>

(thanks for the sharp eyes and patience!)

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-11 02:20:13 +02:00
rmatif 6bdda13981 opencl: add tiled mul_mat_f16_f32 (#14535)
* add tiled mul_mat_f16_f32

* fix trailing whitespace

* add insightful comments
2025-07-10 14:58:12 -07:00
lhez 0b8855775c opencl: add set_rows for f16 and f32 (#14547)
* opencl: add `set_rows` for `f16` and `f32`

* opencl: better choose workgroup size for `set_rows`
2025-07-10 11:48:52 -07:00
Ryan Mangeno 4bb625b713 Smoldocling support (#14597)
* support for smoldocling

* fixed merge conflicts

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Gabe Goodhart <gabe.l.hart@gmail.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Gabe Goodhart <gabe.l.hart@gmail.com>

* merge conflicts

* pre tokenizer merge fix

* convert : fix smollm3 jinja template (#14586)

Signed-off-by: ryan-mangeno <ryanmangeno@gmail.com>

* support for smoldocling

Signed-off-by: ryan-mangeno <ryanmangeno@gmail.com>

* fixed merge conflicts

Signed-off-by: ryan-mangeno <ryanmangeno@gmail.com>

* Update src/llama-vocab.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.h

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* safetensors tensor mapping

Signed-off-by: ryan-mangeno <ryanmangeno@gmail.com>

* added back accidental removal of clean spaces for hunyuan

* Update src/llama-vocab.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* updated hash and reordererd model list

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-vocab.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update include/llama.h

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf_update.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-vocab.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* removed old tensor name

* removed tensor mappings -> handled by smolvlm

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Signed-off-by: ryan-mangeno <ryanmangeno@gmail.com>
Co-authored-by: Gabe Goodhart <gabe.l.hart@gmail.com>
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: compilade <git@compilade.net>
2025-07-10 19:41:00 +02:00
Aman Gupta 11ee0fea2a Docs: script to auto-generate ggml operations docs (#14598)
* Docs: script to auto-generate ggml operations docs

* Review: formatting changes + change github action

* Use built-in types instead of typing

* docs : add BLAS and Metal ops

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-10 23:29:01 +08:00
Eric Zhang a457551332 cmake : do not search for curl libraries by ourselves (#14613)
* cmake : do not search for curl libraries by ourselves

* run : do not search for curl libraries by ourselves
2025-07-10 15:29:05 +03:00
Akarshan Biswas 704bb7a71c SYCL: Initial set_rows kernel implementation (#14562)
* SYCL: Initial set_rows kernel implementation

* Revert max_threads to 256

* Refactor set_rows and address review comments

* Deduplicate conversion function

* Remove guard before kernel launch and refactor

* Fix and add back SFINAE
2025-07-10 09:29:38 +01:00
Xuan-Son Nguyen 435a6d10d6 llama : minor coding style fix for smollm3 (#14605) 2025-07-10 10:00:20 +03:00
Eric Zhang f9a867f592 cmake : bump llguidance version to v1.0.1 (#14609) 2025-07-10 08:19:37 +03:00
Eric Zhang ac44eb6c80 cmake : llguidance build parser library only (#14608) 2025-07-10 08:19:13 +03:00
compilade a57d1bcb3c cuda : support Falcon-H1 state size for SSM_SCAN (#14602) 2025-07-09 23:54:38 -04:00
Xuan-Son Nguyen cb9178f885 llama : remove llm_graph_input_one (#14603) 2025-07-09 23:09:28 +02:00
compilade 4a5686da22 llama : support Jamba hybrid Transformer-Mamba models (#7531)
* wip: llama : separate recurrent states from the KV cache

This will be necessary to support Jamba
(and other recurrent models mixed with Attention).

Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.

* llama : use std::find for seq_nodes in llama_rs_cache

* llama : state checkpoints for recurrent models

* llama : correctly handle more edge cases for the rs cache

* llama : rename many llama_kv_cache_* functions

* llama : remove useless return value for some llama_cache_* functions

* llama : rethink recurrent state cell counts

* llama : begin work on support for variable GQA

This will also be useful for Jamba if we consider the Mamba layers
to have 0 KV heads.

* llama : gracefully fail when not finding hybrid slot

* llama : support Jamba

* llama : fix BERT inference without KV cache

* convert-hf : check for unprocessed Jamba experts

* convert-hf : support Mini-Jamba conversion

* llama : fix Jamba quantization sanity checks

* llama : sequence-length-aware batch splitting

* llama : use equal-sequence-length sub-batches for recurrent models

* ggml : simplify SSM-related operators

* llama : make recurrent state slot allocation contiguous

* llama : adapt internal uses of batches to llama_ubatch

* llama : fix batch split output count for embeddings

* llama : minimize swaps when reordering logits

This reduces overhead when running hellaswag
on thousands of sequences with very small 100k params Mamba models.

* llama : fix edge case finding batch seq_id of split recurrent cell

This otherwise was a problem when running the HellaSwag benchmark
with small batch sizes, making it crash.

* llama : avoid copies for simple batch splits

* ggml : make ggml_ssm_scan not modify its source tensors

* llama : fix shared recurrent tail cell count for small ubatch sizes

Otherwise it was impossible to run the 'parallel' example with '-ub 1'
with a Mamba or Jamba model.

* llama : fix .base() compilation error on Windows

* llama : allow doing the equivalent of SSM_CONV with SUM_ROWS and MUL

* ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors

The implementation already supported it,
and this makes Mamba's conv step slightly faster.

* mamba : fix non-contiguous usage of ggml_silu

* llama : session saving and reloading for hybrid models

* convert_hf : fix Jamba conversion

* llama : fix mixed signedness comparison

* llama : use unused n_embd_k_gqa in k_shift

This also slightly reduces the diff from the master branch

* llama : begin renaming llama_past back to llama_kv_cache

* llama : remove implicit recurrent state rollbacks

* llama : partially apply clang-format style

* convert : fix jamba conv1d shape squeezing

* graph : add back hybrid memory graph input

But this time it contains the sub-cache graph inputs.
This *should* make it easier to handle updating the inputs
when caching the graph (eventually).

* model : add Jamba to Mamba-specific hparams printing

* jamba : remove redundant nullptr initializations

* model : remove unnecessary prefix for tensor loading constants

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* model : use ggml_swiglu_split for Mamba

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* model : make falcon-h1 use shared mamba2 layer builder

* memory : avoid referring to KV in recurrent cache logs

* gguf-py : avoid adding duplicate tensor mappings for Jamba

Some of the tensor names are common with Llama4

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-07-09 14:59:57 -04:00
Xuan-Son Nguyen 98bab638fb ggml : add ggml_scale_bias (#14417)
* ggml : add ggml_scale_bias

* ggml_vec_mad1_f32

* add more simd

* add CUDA

* sycl

* vulkan

* cann (placeholder)

* opencl

* will this fix cpu?

* fix cuda

* suggestions from coderabbit

* fix cann compile error

* vDSP_vsmsa

* rm __ARM_FEATURE_SVE

* use memcpy for op params

* make code looks more consistent

* use scalar for __ARM_FEATURE_SVE

* add x param to ggml_vec_mad1_f32
2025-07-09 18:16:12 +02:00
Miaoqian Lin 26a48ad699 ggml : prevent integer overflow in gguf tensor size calculation (#14595) 2025-07-09 14:33:53 +02:00
Dowon ffd59e7d18 model : add skt/A.X-4.0 model vocabulary (#14589) 2025-07-09 11:22:31 +03:00
Sigbjørn Skjæret 105554595f llama : remove unintended whitespace (#14592) 2025-07-09 10:19:50 +02:00
ibrahim khadraoui 04655063c4 model : add support for Falcon-H1 family (#14534)
* v1

* push more fixes

* another fix

* fix

* more fixes

* minor fix

* more cleaning on python code

* python fixes

* changed precision for multipliers float 32->64

* fixes

* another fix

* fix

* pre-norm -> norm

* fix

* Revert "fix"

This reverts commit 243e4d1a50.

* fix

* small fix ffn_norm

* try

* mix instead of max

* fix vocab size

* conflict solve

* fixed multipliers

* falcon-h1 specefic vocab resolved

* read arch from gguf.MODEL_ARCH

* mamba_d_ssm added to d_inner find_hparam

* remove unused functions from gguf_writer.py

* override modify_tensors instead of get_tensors

* fix conversion and d_inner

* added some cb functions for debugging puposes

* inp_out_ids moved outside of layers loop

* mup_vec create as float64

* fix rope_theta

* injected mup

* clean ups

* rm extra space

* rm unused MAMBA_CHUNK_SIZE

* rm unused key

* add bos False

* changed ROPE_TYPE

* cleaning debugging stuff

* cleaning debug quant

* fix comment

* some cleanups

* some cleanups

* Update src/llama-model-loader.cpp

* more cleanups

* moe cleanuips

* d_ssm -> d_inner;

* cleaning unused hparams

* cleanup

* more cleanups

* more cleanups on python conversion;

* minor cleanups

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* remove todo

* added falcon-h1

* tensor not required

* clean

* remove unneeded attributes

* more cleanups and fixed conversion

* remove final_norm

* flake8 fixes

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* flake8 fixes

* Update src/llama-hparams.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-arch.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* added hashes

* Update src/llama-arch.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update src/llama-vocab.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* update the update file

* Revert "update the update file"

This reverts commit 082ab4ad2a.

* fix: address suggestions

* fix: update convert_hf_to_gguf.py

* Update gguf-py/gguf/constants.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model-loader.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* d_inner fixed

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* reshaping ssm_norm for 34B

* removing generate_mup

* remove duplicates metadata keys

* rm comment

* final comment

* fix unused args

* fix constants

* fix bad merge

* Update src/llama-model.cpp

Co-authored-by: compilade <git@compilade.net>

* falcon-h1: remove unused ssm_in_b and bad merge

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* falcon-h1: fix last comment

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <git@compilade.net>

* falcon-h1: revert add_add_bos(False)

* falcon-h1: fix tied weights

* falcon-h1: remove whitespace

* falcon-h1: fix wrong size param

* falcon-h1: fix whitespace issues

---------

Co-authored-by: younesbelkada <younes.belkada@tii.ae>
Co-authored-by: Younes B <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: compilade <git@compilade.net>
2025-07-09 10:03:49 +02:00
Xuan-Son Nguyen 20b7bf8a32 convert : fix smollm3 jinja template (#14586) 2025-07-09 09:26:13 +03:00
Jeff Bolz 6efcd65945 vulkan: optimize flash attention split_k_reduce (#14554)
* vulkan: allow FA split_k with smaller KV values

* vulkan: spread split_k_reduce work across more threads

k_num can get rather large. Use the whole workgroup to reduce the M/L values.

Launch a thread for each element in the HSV dimension of the output. Helps a
lot for large HSV (like deepseek).
2025-07-08 20:11:42 +02:00
stevenkuang 699f4392a3 model : fix hunyuan moe chat template (#14584)
Signed-off-by: stevenkuang <stevenkuang@tencent.com>
2025-07-08 18:29:29 +02:00
Xuan-Son Nguyen 08382869a2 model : add SmolLM3 (#14581)
* Init - first pass.

* Model -> ModelBase.

* fix errors in conversion.

* Update the graph.

* up.

* up.

* wip

* cgraph ok

* rm redundant code

---------

Co-authored-by: Vaibhavs10 <vaibhavs10@gmail.com>
2025-07-08 18:07:01 +02:00
compilade bb4f7a9e4e memory : fix broken batch splits for recurrent cache (#14575)
Splits producing more than one ubatch per batch for recurrent models
were broken with #14512.

This fixes it by moving the completeness check after the ubatch split loop.
2025-07-08 18:37:47 +03:00
Jeff Bolz b8eeb8741d vulkan : fix rope with partial rotation and non-cont src (#14582) 2025-07-08 15:21:21 +02:00
Alawode Oluwandabira 17a1f0d2d4 server: Add ability to mount server at prefix (#14544)
* Add server_prefix

* Correct server path env

* Rename cli flag to --api-prefix

* Change all to api_prefix
2025-07-08 11:47:33 +03:00
Xuan-Son Nguyen 8f22dc0a53 model : add hunyuan moe (#14425)
* model : add hunyuan moe

* tokenizer ok

* fix tensor name

* cgraph init

* chat template

* wip

* almost working

* skip embed, fix bos

* cleanup

* yarn scaling

* cleanup

* correct rope type

* failed token fix

* ntk alpha freq_base

* tokenization working

* cleanup and pr changes

* vocab_size sanity check

* ntk alpha generic

* Update convert_hf_to_gguf.py

* Apply suggestions from code review

* fix regression

* fix style

---------

Co-authored-by: kooshi <1934337+kooshi@users.noreply.github.com>
2025-07-08 11:24:06 +03:00
Jeff Bolz 53903ae6fa vulkan: increase timeout for CI (#14574) 2025-07-08 09:38:31 +02:00
Georgi Gerganov 4d0dcd4a06 cuda : fix rope with partial rotation and non-cont src (#14580)
* cuda : fix rope non-cont

ggml-ci

* cont : fix multi-rope + add test

ggml-ci

* sycl : try fix

ggml-ci

* cont : fix sycl + clean-up cuda

ggml-ci
2025-07-08 10:15:21 +03:00
Aman Gupta 75c91de6e9 CUDA: add bilinear interpolation for upscale (#14563) 2025-07-08 10:11:18 +08:00
R0CKSTAR 68155c66f0 musa: fix build warnings (unused variable) (#14561)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-08 07:58:30 +08:00
Sigbjørn Skjæret e1a7059053 llama : fix incorrect minicpm3 v_states shape (#14571) 2025-07-07 23:35:35 +02:00
Sigbjørn Skjæret 12f55c302b llama : remove ggml_cont where possible (#14568) 2025-07-07 21:35:08 +02:00
Aman Gupta b9c3eefde1 CUDA: add bf16 and i32 to getrows (#14529) 2025-07-07 21:45:43 +08:00
Eve 6491d6e4f1 vulkan: increase LOAD_VEC_A to 8 (IQ1/IQ2) or 4 (IQ3) (#14485)
Commit taken from remyoudompheng's PR https://github.com/ggml-org/llama.cpp/pull/12260

Co-authored-by: Rémy Oudompheng <remyoudompheng@gmail.com>
2025-07-06 12:29:36 +02:00
Jeff Bolz e592be1575 vulkan: fix rms_norm+mul fusion (#14545)
The fused operation was grabbing the epsilon value from the wrong place.

Add an env var to disable fusion.

Add some missing checks for supported shapes/types.

Handle fused rms_norm+mul in check_results.
2025-07-06 10:08:16 +02:00
Jeff Bolz a0374a67e2 vulkan: Handle updated FA dim2/3 definition (#14518)
* vulkan: Handle updated FA dim2/3 definition

Pack mask boolean and n_head_log2 into a single dword to keep the push
constant block under the 128B limit.

* handle null mask for gqa

* allow gqa with dim3>1
2025-07-05 09:26:04 +02:00
Sigbjørn Skjæret ddef99522d server : fix assistant prefilling when content is an array (#14360) 2025-07-05 09:17:14 +02:00
Sigbjørn Skjæret 6681688146 opencl: add GELU_ERF (#14476) 2025-07-04 23:24:56 -07:00
Georgi Gerganov bac8bed248 eval-callback : check for empty input (#14539) 2025-07-05 07:18:09 +03:00
R0CKSTAR b81510a7b7 test-backend-ops: add support for specifying output format (#14368)
* test-backend-ops: add support for specifying output format

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Address review comments

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Add build_commit and build_number in test_result

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Address review comments

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* refactor

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Get build commit from ggml_commit()

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Merge errors into test_operation_info && address review comments

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Address review comments

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Address review comments

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* remove visitor nonsense

* remove visitor comment

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Address review comments

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

---------

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2025-07-05 12:10:53 +08:00
Georgi Gerganov ef797db357 metal : disable fast math in all quantize kernels (#14528)
ggml-ci
2025-07-04 19:19:09 +03:00
Georgi Gerganov 67d1ef23c6 batch : add optional for sequential equal split (#14511)
ggml-ci
2025-07-04 09:08:59 +03:00
Georgi Gerganov 7b50f7c025 graph : prepare for 4D mask (#14515)
ggml-ci
2025-07-04 09:05:36 +03:00
Georgi Gerganov c79184d2d1 batch : add n_used count (#14512)
ggml-ci
2025-07-04 09:04:59 +03:00
luyhcsu 499a8f5a78 CANN: Replace aclrtMemsetSync with aclnnInplaceZero operator (#14002)
Co-authored-by: luyuhong <luyuhong@kylinos.cn>
2025-07-04 11:50:07 +08:00
Sigbjørn Skjæret 28657a8229 ggml : implement GEGLU_ERF and GEGLU_QUICK ops (#14445) 2025-07-03 23:07:22 +02:00
lhez bee28421be opencl : broadcast for soft_max (#14510) 2025-07-03 20:22:24 +02:00
Jeff Bolz 2b72bedec1 vulkan: support mixed/deepseekR1 FA head sizes (#14509)
* vulkan: better parameterize FA by head sizes

* vulkan: support mixed/deepseekR1 FA head sizes
2025-07-03 20:21:14 +02:00
Johannes Gäßler c8c4495b8d ggml: backward pass for split swiglu (#14483) 2025-07-03 17:05:18 +02:00
Nicolò Scipione 7b63a71a6b Fix conditional enabling following arch checks for ggml-sycl (#14504)
Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
2025-07-03 11:00:03 +02:00
Xuan-Son Nguyen 0c2ee38ab7 convert : correct gemma 3n conversion (#14450)
* convert : correct gemma 3n conversion

* rm redundant code
2025-07-03 10:03:06 +02:00
Georgi Gerganov a70c8a0c4b kv-cache : use ggml_set_rows (#14285)
* kv-cache : use ggml_set_rows

ggml-ci

* graph : separate k and v indices

ggml-ci

* cont : remove redundant ifs

ggml-ci

* kv-cache : improve find_slot impl

* kv-cache : bounds-check when accessing slot_info indices

* kv-cache : add comments

ggml-ci

* ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends

ggml-ci
2025-07-03 10:53:35 +03:00
Georgi Gerganov 9067487c44 ggml : fix FA mask dim 2 and 3 (#14505)
* ggml : fix FA mask dim 2 and 3

ggml-ci

* backends : unsupport batched FA in CUDA and Vulkan

ggml-ci

* vulkan : disable FA for mask->ne[2] != 1
2025-07-03 10:46:57 +03:00
Georgi Gerganov d4cdd9c1c3 ggml : remove kompute backend (#14501)
ggml-ci
2025-07-03 07:48:32 +03:00
Aman Gupta 55c2646b45 CUDA: add dynamic shared mem to softmax, refactor general usage (#14497) 2025-07-03 07:45:11 +08:00
Sigbjørn Skjæret e75ba4c043 gguf-py : add support for chat template jinja files (#14508)
* add support for chat template jinja files

* remove gemma3n hack
2025-07-02 21:02:35 +02:00
compilade 5d46babdc2 llama : initial Mamba-2 support (#9126)
* llama : initial Mamba-2 support

* ggml : SIMD ggml_ssm_scan for Mamba-2

* ggml : improve ggml_mul speed when masking recurrent states

* llama : support running Mamba-Codestral-7B-v0.1

* llama : fix Mamba-2 conv state saving

* ggml : make the ggml_mul fast broadcast path more consistently formatted

* llama : remove unused variable

* llama : add missing break

* convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present

The tokenzier.json of Mamba-Codestral-7B-v0.1 otherwise requires
workarounds to work correctly.

* llama : avoid redundant state copy for Mamba 1 and 2

* metal : attempt to adapt SSM_SCAN for Mamba-2

* metal : fix SSM_SCAN pipeline scope

* metal : use log and exp instead of log1pf and expf in SSM_SCAN

* metal : remove unused arguments for SSM_SCAN

The max index is 31, so trimming the arguments is necessary.

* metal : add back n_seqs to SSM_SCAN args

Whoops, this is needed for the offset in the concatenated output.

* metal : fix SSM_SCAN state head offset

* metal : fix wrong number of tokens per sequence in SSM_SCAN

* ggml : remove unused fast broadcast path in GGML_MUL

This was initially added because states were masked with ggml_mul,
but this is no longer done and so this "optimisation" is no longer
necessary, or at least not worth the additional code complexity.

* ggml : avoid multiply by D in GGML_OP_SSM_SCAN

This makes the weight buft detection in src/llama.cpp simpler.

* convert : transpose Mamba-2 A, D and reshape SSM_NORM

This breaks existing conversions of Mamba-2 models
to avoid some reshapes.

Not sure if it's a good idea,
but it makes the graph slightly cleaner.

* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks

* convert : fix flake8 lint

* metal : fix confusion between ; and ,

* metal : add missing args for nb references in ssm_scan_f32_group

* metal : single-user mamba2 inference works

* kv-cache : remove const_cast when setting inputs for s_copy

And also fix multi-user inference for recurrent models
by using cell_id instead of i as the kv cell index
when populating s_copy.

* convert : avoid AutoConfig for Mamba and Mamba2 hparams

* kv-cache : allow context shift for recurrent models

* graph : fix recurrent state copies when avoiding copies

Works, but using lambda functions might not be that clean.

* ggml : fix mamba2 ssm scan when compiled with SVE

* ggml-cpu : reorder SVE FMA for consistency with other SIMD arches

* cuda : implement ssm scan for Mamba2

There is still room for improvement, but it works!

* cuda : adapt Mamba1 ssm scan to shape changes from Mamba2

* mamba : fix mismatched new and delete size for llm_build_mamba

Subclasses of llm_graph_context cannot have extra fields,
because the called destructor is not the one from the subclass.
This otherwise would cause problems when runnning Mamba-(1|2) inference
when compiled -DGGML_SANITIZE_ADDRESS=ON

* cuda : graceful fallback for Mamba-1 models with weird embd size
2025-07-02 13:10:24 -04:00
Georgi Gerganov e17991c466 sync : ggml
ggml-ci
2025-07-02 20:08:45 +03:00
Daniel Bevenius c46944aa25 ggml : add version function to get lib version (ggml/1286)
* ggml : add version function to get lib version

This commit adds a function `ggml_version()` to the ggml library that
returns the version of the library as a string.

The motivation for this is that it can be useful to be able to
programmatically check the version of the ggml library being used.

Usage:
```c
printf("GGML version: %s\n", ggml_version());
```
Output:
```console
GGML version: 0.0.2219
```

* ggml : add ggml_commit()

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-02 20:08:45 +03:00
Rotem Dan f3ed38d793 Set RPATH to "@loader_path" / "$ORIGIN" to ensure executables and dynamic libraries search for dependencies in their origin directory. (#14309) 2025-07-02 18:37:16 +02:00
Aman Gupta 55a1c5a5fd CUDA: add softmax broadcast (#14475)
* CUDA: add softmax broadcast

* Pass by const ref

* Review: Use blockDims for indexing, remove designated initializers

* Add TODO for noncontigous input/output
2025-07-02 15:48:33 +03:00
Johannes Gäßler 12a81af45f CUDA: broadcasting for FlashAttention mask (#14500) 2025-07-02 15:48:33 +03:00
Jeff Bolz 8875523eb3 vulkan: support softmax/FA batch and broadcast (#14449) 2025-07-02 15:48:33 +03:00
Georgi Gerganov ec68e84c32 ggml : support bcast ggml_soft_max_ext, ggml_flash_attn_ext (#14435)
ggml-ci
2025-07-02 15:48:33 +03:00
zhouwg 307e79d33d opencl : fix possible buffer overflow in dump_tensor (#14490) 2025-07-02 14:38:10 +02:00
Georgi Gerganov d7f5f4e578 simple-chat : fix context-exceeded condition (#14494)
* simple-chat : fix context-exceeded condition

ggml-ci

* cont : fix n_ctx_used computation

ggml-ci
2025-07-02 14:12:07 +03:00
Eric Zhang c8a4e470f6 opencl : skip empty nodes on cgraph compute (#14491) 2025-07-02 13:00:04 +02:00
lhez 603e43dc91 opencl : update upscale to support align corners (#14488) 2025-07-02 09:07:42 +02:00
Sigbjørn Skjæret 611ba4b264 ci : add OpenCL to labeler workflow (#14496) 2025-07-02 09:02:51 +02:00
Eric Zhang 85841e121d github : add OpenCL backend to issue templates (#14492) 2025-07-02 08:41:35 +03:00
Björn Ganster 68b3cd6514 ggml : Callback before abort (#14481)
* Add a callback that will be called just before abort. This allows apps without a console to display a message to the user and save data if needed.

* Return previous callback to allow callback chaining

* style fixes

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-07-02 08:19:31 +03:00
Georgi Gerganov de56944147 ci : disable fast-math for Metal GHA CI (#14478)
* ci : disable fast-math for Metal GHA CI

ggml-ci

* cont : remove -g flag

ggml-ci
2025-07-01 18:04:08 +03:00
Grzegorz Grasza 1b2aaf28ac Add Vulkan images to docker.md (#14472)
Right now it's not easy to find those.
2025-07-01 15:44:11 +02:00
Chenguang Li 343b6e94b6 CANN: update aclnnGroupedMatmulV2 to aclnnGroupedMatmulV3 (#14411)
* [CANN]update to aclnnGroupedMatmulV2

Signed-off-by: noemotiovon <757486878@qq.com>

* Support MUL_MAT_ID on 310p

Signed-off-by: noemotiovon <757486878@qq.com>

* fix editorconfig

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
2025-07-01 16:47:30 +08:00
Jeff Bolz 6a746cf9c4 vulkan: Split large mul_mat_id to fit in shared memory (#14451) 2025-07-01 10:43:08 +02:00
Sigbjørn Skjæret eff5e45443 add GELU_ERF (#14455) 2025-07-01 10:14:21 +02:00
Georgi Gerganov a6a47958a1 ggml : remove trailing whitespace (#0) 2025-07-01 11:06:39 +03:00
Georgi Gerganov f61c05d4b1 sync : ggml
ggml-ci
2025-07-01 11:06:39 +03:00
Acly 431b2c24f3 ggml-cpu : "align corners" for bilinear upscale/downscale (ggml/1285)
* add "align corners" mode for bilinear upscale, and allow downscaling
* add ggml_interpolate, deprecate ggml_upscale_ext, pass in align-corners as bit-flag
* test-backend-ops: replace ggml_upscale_ext with ggml_interpolate, add test cases for downscale and align-corners
2025-07-01 11:06:39 +03:00
Daniel Bevenius 497be7c01d ggml-quants : rename best_mad to best_error (ggml/1283)
This commit renames the variable `best_mad` to `best_error` in the
`make_qkx2_quants` function.

The motivation for this is that the name `best_mad` can be somewhat
confusing if mean absolute deviation (MAD) is not in use.
2025-07-01 11:06:39 +03:00
lhez 79b33b2317 opencl : add GEGLU, REGLU, SWIGLU (#14456) 2025-07-01 09:19:16 +02:00
Aman Gupta 0a5a3b5cdf Add Conv2d for CPU (#14388)
* Conv2D: Add CPU version

* Half decent

* Tiled approach for F32

* remove file

* Fix tests

* Support F16 operations

* add assert about size

* Review: further formatting fixes, add assert and use CPU version of fp32->fp16
2025-06-30 23:57:04 +08:00
Georgi Gerganov 745f11fed0 memory : correctly handle failure in apply() (#14438)
ggml-ci
2025-06-30 18:03:03 +03:00
Georgi Gerganov 5dd942de59 metal : disable fast-math for some cpy kernels (#14460)
* metal : disable fast-math for some cpy kernels

ggml-ci

* cont : disable for q4_1

ggml-ci

* cont : disable for iq4_nl

ggml-ci
2025-06-30 17:04:05 +03:00
Romain Biessy a7417f5594 ggml-cpu: sycl: Re-enable exp f16 (#14462) 2025-06-30 14:52:02 +02:00
Diego Devesa eb3fa2913e test-backend-ops : disable llama test (#14461) 2025-06-30 12:43:15 +02:00
xiaobing318 c839a2da1a cmake : Remove redundant include path in CMakeLists.txt (#14452)
* Update docker.yml

修改docker.yml文件中的内容使其停止周期性的运行该workflow,如果想要运行该workflow可以手动启动

* Remove redundant include path in CMakeLists.txt

The parent directory '..' was removed from the include directories for the ggml-cpu-feats target, to avoid unnecessary include paths.

* Enable scheduled Docker image builds

Uncomments the workflow schedule to trigger daily Docker image rebuilds at 04:12 UTC, improving automation and keeping images up to date.
2025-06-30 12:48:24 +03:00
Vedran Miletić e9b6350e61 scripts : make the shell scripts cross-platform (#14341) 2025-06-30 10:17:18 +02:00
matteo caf5681fcb server : support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client (#13196)
* initial commit for handling extra template kwargs

* enable_thinking and assistant prefill cannot be enabled at the same time

* can set chat_template_kwargs in command line

* added doc

* fixed formatting

* add support for extra context in generic template init

* coding standard: common/chat.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* coding standard:  common/chat.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Apply suggestions from code review

coding standard: cosmetic changes

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix merge conflict

* chat.cpp: simplify calls to apply to ensure systematic propagation of extra_context (+ the odd existing additional_context)

* normalize environment variable name

* simplify code

* prefill cannot be used with thinking models

* compatibility with the new reasoning-budget parameter

* fix prefill for non thinking models

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Olivier Chafik <olivier.chafik@gmail.com>
2025-06-29 20:02:53 +02:00
Renat 83790b0e7e server : fix appearance of the chats list context menu for Safari (#14322) 2025-06-29 19:29:57 +02:00
Akarshan Biswas f47c1d7106 SYCL: disable faulty fp16 exp kernel (#14395)
* SYCL: disable faulty fp16 CPU exponent for now

* Revert "SYCL: disable faulty fp16 CPU exponent for now"

This reverts commit ed0aab1ec3.

* SYCL: disable faulty fp16 CPU exponent for now

* Fix logic of disabling exponent kernel
2025-06-29 21:07:58 +05:30
Sigbjørn Skjæret a5d1fb6212 ggml : fix unmerged GGML_FPxx_TO_FPxx refactoring (#14443) 2025-06-29 14:38:10 +02:00
Sigbjørn Skjæret a0535ffa0d ggml : implement REGLU/GEGLU/SWIGLU ops (#14158)
* implement unary REGLU/GEGLU/SWIGLU cpu ops

* relax constraints

* duplicate shape of source

* fix ggml_vec_geglu_f16

* special case gated ops

* implement unary REGLU/GEGLU/SWIGLU cuda ops

* tighten constraints again

* refactor into GGML_GLU_OP

* metal : add glu kernels

ggml-ci

* add CUDA_GLU_BLOCK_SIZE [no ci]

* more constraints and use 64bit ints

ggml-ci

* 64bit multiplication [no ci]

* implement swapped variants (cpu/cuda)

* update comment [no ci]

ggml-ci

* Vulkan: Add GLU ops and shaders

* SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate

* ggml : implement GLU for split up/gate (#14181)

* implement GLU for split up/gate

* add tests for ggml_glu_split

* Vulkan: Implement glu_split logic and shader support

* add split to logging [no ci]

* SYCL: refactor element_size ops and add split up and gate support to gated kernels

* SYCL: switch GEGLU to use tanh approximation

---------

Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Akarshan <akarshan@menlo.ai>

* GGML: increase OP count in assertion

* Refactor: Optimize SYCL element-wise operations with unary function inlining

This commit refactors the SYCL element-wise operations to improve performance by:

- Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to reduce kernel launch overhead.
- Introducing helper functions `op_xxx` for each unary operation to encapsulate the logic.
- Replacing direct kernel calls with calls to these inlined functions.
- Using `__dpct_inline__` to encourage compiler inlining.
- Minor code cleanup and consistency improvements.

The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices.

* vulkan: Increase workgroup size for GLU, for performance (#14345)

* vulkan: Increase workgroup size for GLU, for performance

* vulkan: change GLU shaders to do one element per invocation rather than one row per workgroup

* merge fix

* metal : add support for split and swap

ggml-ci

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Akarshan <akarshan@menlo.ai>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-06-29 11:04:10 +02:00
Jeff Bolz bd9c981d72 vulkan: Add fusion support for RMS_NORM+MUL (#14366)
* vulkan: Add fusion support for RMS_NORM+MUL

- Add a use_count to ggml_tensor, so we can detect if an output is used more than once.
- Change the ggml-vulkan rms_norm shader to optionally multiply by another tensor.
- Add detection logic and basic fusion logic in ggml-vulkan.
- Add some testing support for fusion. Rather than computing one node at a time, allow
for computing the whole graph and just testing one node's results. Add rms_norm_mul tests
and enable a llama test.

* extract some common fusion logic

* fix -Winconsistent-missing-override

* move ggml_can_fuse to a common function

* build fix

* C and C++ versions of can_fuse

* move use count to the graph to avoid data races and double increments when used in multiple threads

* use hash table lookup to find node index

* change use_counts to be indexed by hash table slot

* minimize hash lookups

style fixes

* last node doesn't need single use.
fix type.
handle mul operands being swapped.

* remove redundant parameter

---------

Co-authored-by: slaren <slarengh@gmail.com>
2025-06-29 09:43:36 +02:00
Aman Gupta 27208bf657 CUDA: add bf16 and f32 support to cublas_mul_mat_batched (#14361)
* CUDA: add bf16 and f32 support to cublas_mul_mat_batched

* Review: add type traits and make function more generic

* Review: make check more explicit, add back comments, and fix formatting

* Review: fix formatting, remove useless type conversion, fix naming for bools
2025-06-29 01:30:53 +08:00
Jeff Bolz 63a7bb3c7e vulkan: handle noncontig in the final case of ggml_vk_get_cpy_pipeline (#14378) 2025-06-28 17:36:40 +02:00
Jeff Bolz 00d5282c7f vulkan: lock accesses of pinned_memory vector (#14333) 2025-06-28 17:17:09 +02:00
Weizhao Ouyang 566c16fcce model : add support for ERNIE 4.5 0.3B model (#14408)
Add Day-0 support for Baidu ERNIE 4.5 0.3B model.

Signed-off-by: Weizhao Ouyang <weizhao.ouyang@arm.com>
2025-06-28 16:08:21 +02:00
Xinpeng Dou b25e92774e fix async_mode bug (#14432) 2025-06-28 17:35:41 +08:00
Sigbjørn Skjæret 6609507a91 ci : fix windows build and release (#14431) 2025-06-28 09:57:07 +02:00
Jeff Bolz ceb1bf5a34 vulkan: Fix GGML_VULKAN_SHADER_DEBUG_INFO (#14427)
This setting needs to be passed through to vulkan-shaders-gen
2025-06-27 22:35:30 -05:00
Georgi Gerganov 72babea5de graph : make llm_graph_context destructor virtual (#14410)
ggml-ci
2025-06-27 21:42:02 +03:00
Georgi Gerganov 43678060c1 recurrent : call balloc split_reset() in init_batch() (#14414)
ggml-ci
2025-06-27 17:55:45 +03:00
Radoslav Gerganov 8d94219a4a ggml : add ggml_set_rows (#14274)
* ggml : add ggml_set_rows

Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using
indices from 'c'.

ref: #8366

* use I64 for indices

* ggml : add repeat impl for i64

* ggml : add ggml_is_contiguous_rows

* ggml : ggml_set_rows support broadcast

* ggml : ggml_set_rows support quantized dst

ggml-ci

* ggml : support GGML_TYPE_F32 ".from_float" trait

* ggml : ggml_set_rows update comment + better index name

* tests : add ggml_set_rows

* metal : add ggml_set_rows implementation

ggml-ci

* ggml : simplify forward_dup_f32

* ggml : fix supports_op

* tests : add comment to set_rows

* ggml : leave the repeat_i64 for a separate PR

ggml-ci

* ggml : set_rows use std::min instead of MIN

* ggml : better error message for set_rows unsupported type

* metal : perform op->type check only once

* tests : more consistent implementation + more tests

ggml-ci

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-27 16:41:40 +03:00
Sigbjørn Skjæret f667f1e624 convert : fix broken sentencepiece vocab (#14416) 2025-06-27 10:42:19 +02:00
Xuan-Son Nguyen 8846aace49 model : gemma3n text-only (#14400)
* gemma3n

* add llm_graph_input_one
2025-06-26 20:34:02 +03:00
bandoti a01047b041 cmake: regen vulkan shaders when shaders-gen sources change (#14398)
* Add shaders-gen sources as target deps
2025-06-26 13:46:53 -03:00
Sigbjørn Skjæret b25346221d llama : return mistral-v7-tekken as default template only (#14390) 2025-06-26 15:01:14 +02:00
Georgi Gerganov e8215dbb96 metal : add special-case mat-vec mul for ne00 == 4 (#14385)
ggml-ci
2025-06-26 15:51:19 +03:00
Georgi Gerganov 5783ae4359 metal : batch rows copy in a single threadgroup (#14384)
* metal : batch rows copy in a single threadgroup

ggml-ci

* metal : handle some edge cases when threadgroup size is not a power of 2

ggml-ci
2025-06-26 15:50:15 +03:00
Aaron Teo bf5bcd0b85 docs: update s390x documentation + add faq (#14389)
* docs: update s390x documentation + add faq

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* docs: add s390x z17 build q&a

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-06-26 12:41:41 +02:00
R0CKSTAR 716301d1b0 musa: enable fp16 mma (all) and cublas on qy2 (#13842)
* musa: enable fp16 mma (all) and cublas on qy2

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Address review comments

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Address review comments

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: disable MUL_MAT_ID (q2_k × f32) due to precision issues

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-06-26 12:11:59 +08:00
Aaron Teo 60ef23d6c1 ggml-cpu: enable IBM NNPA Vector Intrinsics (#14317)
* ggml-cpu: add nnpa compile flag

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 4a9f60c201)

* ggml-cpu: add fp16->fp32 nnpa first

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 8d4a7987f9)

* ggml-cpu: add fp32->fp16

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 0ff0d65162)

* ggml-cpu: better variable names

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 2f58bbcbb8)

* docs: update s390x docs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 01b929491b)

* ggml-cpu: add debugging prints to see if dlf16 is correct

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix print vs printf

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix float placeholder

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: ensure fp16 and fp32 load and stores are called

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fp16 load ensured to hit

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: remove sigint from fp16 store

for some reason, the function is not getting a hit when debugged with
    gdb. we will need to investigate further

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: activate nnpa for ggml_cpu_fp16_to_fp32

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: nnpa activate ggml_cpu_fp16_to_fp32 for 8 elements

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: nnpa switch to vec_xst test

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: switch to vec_xst for 4 element loops also

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: rework noop

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: remove noop, general code cleanup

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: clarify variable naming

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: activate nnpa for ggml_cpu_fp32_to_fp16

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: add breakpoint for debugging

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: test fix for conversion failure

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: disable fp32->fp16 nnpa conversions for now

there are some conversion failures in nnpa that requires the eyes of an
ibm stsm. will create a separate pr to introduce the fp32->fp16 change.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: switch to elif macro

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: reattempt fp32->fp16

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix typo

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: reattempt fp32->fp16

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix compiler types

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: change to typedef vector types

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: add 4 element loops for fp32->fp16

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: clarified vector naming

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: bring back fp32->fp16 store nnpa

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: activate nnpa fp32->fp16 or fp16->fp32 compute

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: add nnpa macro check in ggml-impl

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: add missing __func__

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: diagnose why __NNPA__ macro is not being defined

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: import vecintrin.h to fix compiler errors

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: update macro tests

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: move s390x typedef to own header file

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Revert "ggml-cpu: move s390x typedef to own header file"

This reverts commit 157f856c34.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: switch to importing ggml-cpu-impl instead

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix macro declaration

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: test more macros

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: add debug prints

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: bruteforce macro definitions

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: move macro definitions

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: add ggml-impl.h to cmakelists

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: switch to private macros

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: move s390x typedef to own header file

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 157f856c34)

* ggml-cpu: move things around

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: bring back compile macros

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: switch to quotes for import

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: add compiler error macro

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: add s390x detection in ggml-src

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: bring back compile definitions

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: undo cmakelists work

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Revert "ggml-cpu: move s390x typedef to own header file"

This reverts commit 18d79e1a30.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: remove typedefs.h

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: remove typedef from cmakelists

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: add ggml-impl.h future notes

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: add todo comment for future reference

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: clarify naming of dlf16

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: remove unnecessary target compile definitions

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: move nnpa fp16->fp32 and fp32->fp16 to simd-mappings

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* docs: update broken huggingface link for s390x

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix duplicate func names during compile

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Revert "ggml-cpu: fix duplicate func names during compile"

This reverts commit fbb733451f.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Revert "ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu"

This reverts commit bd288e8fa5.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: refactor fp16<->fp32 simd to ggml-cpu

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix missing simd-mappings.h import in quants.c

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix missing simd-mappings.h within repack

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix amx mmq missing simd-mappings.h

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: attempt at fixing loongarch failing build

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: move nnpa together with other fp16<->fp32 simd

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix wrong refactor of ggml-base

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164176555

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: remove dependency on ggml-cpu from ggml-base

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: rename all fp16<->fp32 macros to prefix with ggml_cpu

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164449406

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: remove mistaken fallback macro

fallback logic was already implemented but i was too sleepy to realise

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: move ggml_table_f32_f16 to ggml-cpu

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Revert "ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures"

This reverts commit 32a3533564.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Revert "ggml: move ggml_table_f32_f16 to ggml-cpu"

This reverts commit 9e40d984ad.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: move ggml_table_f32_f16 to ggml-cpu

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 9e40d984ad)

* ggml: move ggml_table_f32_f16 to ggml-cpu.c

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: extern c ggml_table_f32_f16 + chore docs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h

we rely on the variable declaration in ggml-cpu.c instead

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Revert "ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h"

This reverts commit f71b21d2f7.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: bring back ggml_table_f32_f16

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Revert "ggml-cpu: bring back ggml_table_f32_f16"

This reverts commit 2dce119178.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* fix ggml time initialization

* fix f32_f16 table init

* remove extra line

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: slaren <slarengh@gmail.com>
2025-06-25 23:49:04 +02:00
Sigbjørn Skjæret b193d53069 ggml : do not output unprintable characters on GGUF load failure (#14381) 2025-06-25 23:26:51 +02:00
Anton Mitkov 2bf9d539dd sycl: GGML_SYCL_DISABLE_OPT on by default for all Intel Devices (#13973) 2025-06-25 18:09:55 +02:00
lhez 73e53dc834 opencl: ref count ggml_backend_opencl_context and refactor profiling (#14254)
* Move profiling info into `ggml_backend_opencl_context`
* Add `enqueue_ndrange_kernel` to launch kernel
2025-06-24 11:46:25 -07:00
Georgi Gerganov 62af464227 batch : fix check for empty sequences in memory (#14364)
* batch : fix check for empty sequences in memory

ggml-ci

* cont : reuse the var

ggml-ci
2025-06-24 18:26:30 +03:00
Mathieu Baudier c148cf1946 cmake : use LLAMA_BUILD_NUMBER when defining LLAMA_INSTALL_VERSION (#14362) 2025-06-24 15:05:31 +02:00
Nigel Bosch 1b809cee22 server : move no API key doc to /health (#14352) 2025-06-24 10:59:11 +02:00
Sigbjørn Skjæret abf241045d main : honor --verbose-prompt on interactive prompts (#14350) 2025-06-24 09:31:00 +02:00
Bartowski 901e20bbe5 jinja : Add Mistral-Small-3.2-24B-Instruct-2506.jinja (#14349)
This will allow the use of tools on the llama-server
2025-06-24 09:17:58 +03:00
uvos 0142961a2e CUDA/HIP: optimize mmv paths taken for HIP devices (#14324)
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-06-24 01:12:56 +02:00
bandoti ce82bd0117 ci: add workflow for relocatable cmake package (#14346) 2025-06-23 15:30:51 -03:00
Jeff Bolz bf2a99e3cb vulkan: update windows SDK in release.yml (#14344) 2025-06-23 15:44:48 +02:00
Molly Sophia 72c6bc3f3d llama : better rwkv chat template and add missing inputs.use_jinja setting (#14336)
* llama-cli : add missing `inputs.use_jinja` setting

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama : better legacy chat template for rwkv

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2025-06-23 19:56:19 +08:00
Johannes Gäßler defe2158dd CUDA: mul_mat_v support for batch sizes > 1 (#14262)
* CUDA: mul_mat_v support for batch sizes > 1

* use 64 bit math for initial offset calculation
2025-06-23 13:11:31 +02:00
Georgi Gerganov 7b50d589a8 kv-cells : fix tracking of seq_pos (#14339)
* kv-cells : fix tracking of seq_pos during cache reuse

ggml-ci

* cont : improve error message

ggml-ci

* cont : add more comments
2025-06-23 12:27:35 +03:00
Jeff Bolz 3a9457df96 vulkan: update windows SDK in CI (#14334) 2025-06-23 10:19:24 +02:00
Ed Addario fa4a9f2a1c quantize : handle user-defined pruning of whole layers (blocks) (#13037) 2025-06-22 23:16:26 +02:00
Sigbjørn Skjæret 238005c2dc gguf-py : fix SpecialVocab parsing when post_processor is null (#14330) 2025-06-22 19:46:17 +02:00
Ruikai Peng 66aba7aca9 run : avoid double tokenization (#14327)
* run : avoid double tokenization by adopting common_tokenize heuristic

* build : fix windows gcc and clang warnings

* lint : fixed trailing whitepace

* run : fix is_first flag
2025-06-23 01:28:06 +08:00
Georgi Gerganov f1f5e82df6 examples : fix is_first logic for tokenization (#14329)
ggml-ci
2025-06-22 20:10:07 +03:00
uvos af3373f1ad HIP: enable vec fattn on RDNA4 (#14323) 2025-06-22 16:51:23 +02:00
yuiseki 5d5c066de8 mtmd : fix Pixtral OOM with large images by capping image_size to 1024 (#14326)
Mistral Small 2506 models using Pixtral vision encoder were running out
of GPU memory when processing images larger than 1024x1024 pixels due to
exponential memory growth from unlimited image size.

This fix applies the same 1024x1024 limit used by Qwen2VL models to
prevent OOM issues while maintaining compatibility with existing models.
2025-06-22 14:44:57 +02:00
Sigbjørn Skjæret 40bfa04c95 common : use std::string_view now that we target c++17 (#14319) 2025-06-22 08:37:43 +03:00
Aman Gupta aa064b2eb7 CUDA: add mean operation (#14313)
* CUDA: add mean operation

* add back sum_rows_f32_cuda

* Review: early exit if col!=0
2025-06-22 12:39:54 +08:00
Sigbjørn Skjæret aa0ef5c578 gguf-py : fix Qwen3-Embedding eos token (#14314) 2025-06-21 18:12:05 +02:00
Markus Tavenrath bb16041cae Add support for VK_EXT_debug_utils to add labels to Vulkan objects. (#13792)
* Add support for VK_EXT_debug_utils to add labels to Vulkan objects. In step 1 compute pipelines are getting labeled.

* remove #ifdef for debug utils and add queue marker.
2025-06-21 08:17:12 +02:00
Sigbjørn Skjæret 58cba76a9a gguf-py : fix TemplateProcessing pair when bos/eos is missing (#14312) 2025-06-21 07:33:21 +02:00
Georgi Gerganov 67ae5312e2 metal : fix thread-safety (#14300)
ggml-ci
2025-06-21 08:04:18 +03:00
Georgi Gerganov 692e3cdd0a memory : rename interface to llama_memory_context_i (#14296)
* memory : rename interface to llama_memory_context_i

ggml-ci

* cont : fix comments

* cont : use "mctx" for referencing a memory context

ggml-ci
2025-06-21 08:03:46 +03:00
Daniel Han b23fa0b3f4 convert : fix Llama 4 conversion (#14311) 2025-06-21 06:32:01 +02:00
Georgi Gerganov 06cbedfca1 sync : ggml
ggml-ci
2025-06-20 21:02:47 +03:00
Acly b7147673f2 Add ggml_roll (ggml/1274)
* ggml : add ggml_roll

* use set/get_op_params & std::min
2025-06-20 21:02:47 +03:00
David Chiu d860dd99a4 docs : fix the link to llama.h (#14293) 2025-06-20 19:43:35 +02:00
Aman Gupta c959f462a0 CUDA: add conv_2d_transpose (#14287)
* CUDA: add conv_2d_transpose

* remove direct include of cuda_fp16

* Review: add brackets for readability, remove ggml_set_param and add asserts
2025-06-20 22:48:24 +08:00
Sigbjørn Skjæret 22015b2092 lint : remove trailing whitepace (#14304) 2025-06-20 16:37:44 +02:00
Ruikai Peng dd6e6d0b6a vocab : prevent tokenizer overflow (#14301)
* vocab : prevent stack overflow in tokenize

* vocab : return error instead of aborting on oversized token count

* vocab : INT32_MIN from llama_tokenize on overflow
2025-06-20 07:13:06 -07:00
Nicolò Scipione 8308f98c7f sycl: add usage of enqueue_functions extension (#14244)
* Add header and namespace to use enqueue_functions extension

* Convert submit and parallel_for to use new extension in convert.cpp

* Convert submit and parallel_for to use extension in ggml-sycl.cpp

* Convert submit and parallel_for to use extension in gla.cpp

* Convert submit and parallel_for in mmq.cpp

* Convert submit and parallel_for in mmvq.cpp

* Convert submit and parallel_for in remaining files

* Convert all simple parallel_for to nd_launch from enqueue_functions
extension

* Wrapping extension in general function

Create a general function that enable the enqueue_functions extension if
it is enable in the compiler, otherwise call the general SYCL function
to launch kernels.

---------

Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
2025-06-20 15:07:21 +02:00
Christian Kastner 6369be0735 Implement GGML_CPU_ALL_VARIANTS for PowerPC (#14286)
* Add PowerPC feature detection and scoring

* ggml-cpu: Implement GGML_CPU_ALL_VARIANTS for PowerPC

* ggml-cpu: Delay some initializations until function is called

When using GGML_BACKEND_DL=ON, these initializations might use
instructions that are not supported by the current CPU.

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-06-20 14:17:32 +02:00
Sigbjørn Skjæret 88fc854b4b llama : improve sep token handling (#14272) 2025-06-20 14:04:09 +02:00
Diego Devesa e28c1b93fd cuda : synchronize graph capture and cublas handle destruction (#14288)
Workarounds an issue that may cause CUDA graph capture to fail when a cuBLAS handle is destroyed in a different thread
2025-06-20 13:57:36 +02:00
Georgi Gerganov d27b3ca175 ggml : fix repack work size for mul_mat_id (#14292)
ggml-ci
2025-06-20 11:19:15 +03:00
Charles Xu 9230dbe2c7 ggml: Update KleidiAI to v1.9.0 (#14277) 2025-06-20 10:51:01 +03:00
Georgi Gerganov 812939a9e9 model : more uniform output id handling (#14275)
* model : more uniform output id handling

ggml-ci

* cont : revert n_outputs < n_tokens optimization

ggml-ci

* cont : fix out_ids initialization

ggml-ci
2025-06-20 10:50:27 +03:00
Georgi Gerganov 4c9fdfbe15 ubatch : new splitting logic (#14217)
ggml-ci
2025-06-20 10:14:14 +03:00
Aman Gupta 9eaa51e7f0 CUDA: add conv_2d_dw (#14265)
* CUDA: add conv_2d_dw

* better naming

* simplify using template

* Review: fix operation ordering in ggml-cuda, use __forceinline__, use more const
2025-06-20 09:50:24 +08:00
Diego Devesa 8f71d0f3e8 ggml-cpu : remove unnecesary arm feature detection (#14281)
Support for Arm runtime feature detection has now been added to GGML_CPU_ALL_VARIANTS. This removes the old and not very functional code.
2025-06-19 21:24:14 +02:00
Alex Trotta 381174bbda gguf-py : make sentencepiece optional (#14200)
* Make sentencepiece optional

* Bump to 0.18.0

* Bump patch instead of minor

Co-authored-by: compilade <git@compilade.net>

---------

Co-authored-by: compilade <git@compilade.net>
2025-06-19 15:56:12 +02:00
aa956 d67341dc18 server : add server parameters for draft model cache type (#13782)
Co-authored-by: aa956 <27946957+aa956@users.noreply.github.com>
2025-06-19 16:01:03 +03:00
fanyang 456af35eb7 build : suppress gcc15 compile warnings (#14261)
* Change _contains_any() substrs to std::string_view and fix the find comparison logic.
2025-06-19 14:49:48 +02:00
Anton Mitkov 600e3e9b50 sycl: Cleanup codepaths in Get Rows in sycl backend (#14215)
Addresses unused reorder path
2025-06-19 11:40:21 +01:00
bashayer hijji fffcce535e llama-bench : add --no-warmup flag (#14224) (#14270)
Add no_warmup parameter to cmd_params struct and command-line parsing to allow users to skip warmup runs before benchmarking.

- Add no_warmup boolean field to cmd_params struct

- Add --no-warmup command-line argument parsing

- Add help text documentation for the new flag

- Wrap existing warmup logic in conditional check

- Maintain full backward compatibility (warmup enabled by default)

Addresses #14224
2025-06-19 12:24:12 +02:00
pqnet 5fc7856815 convert : fix remote option in Windows (#14100) 2025-06-19 12:21:40 +02:00
Aaron Teo faed5a5f5d llamafile : support s390x SIMD instruction set (#14273) 2025-06-19 11:48:54 +02:00
0cc4m 10bb545c5b Vulkan: Set device max size for host memory to avoid OOM warning and fallback to CPU buffer (#14249) 2025-06-19 09:15:42 +02:00
Gabe Goodhart edc4a29eff memory : Hybrid recurrent cache (#13979)
* feat: Add llama_model_is_hybrid API call

Also, split llama_model_is_recurrent into llm_arch_is_recurrent in
llama-arch with llama_model_is_recurrent delegating to
llm_arch_is_recurrent. The same split is done for hybird. This is needed
because there are places where the llama_model has not yet been initialized
but we need to check if the model is recurrent (specifically for the
per-layer recurrent check array in hparams).

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add c++ side constants for attention layer indices hparam

Branch: GraniteFour

* feat: Add support for distinguishing recurrent vs non-recurrent layers in hparams

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Auto-fill hparams.recurrent_layer_arr based on whether the model is recurrent

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: rename *_is_hybrid -> *_is_hybrid_recurrent

The implementation of the hybrid cache intentionally does not specify the
types of the child caches, so there was a naming mismatch with these
predicate functions that used "hybrid" to imply "hybrid recurrent."

Branch: HybridCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add layer filter to recurrent cache

Branch: HybridCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use per-layer sizing everywhere in kv caches

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: First pass at llama_kv_cache_hybrid_recurrent

This follows the pattern in iswa where the two child caches are held
explicitly to support the case where a model requires a single attention
cache and a single recurrent cache where each layer uses exactly one of the
caches.

This is a rewrite of the more generic approach in the original hybrid cache
PR: https://github.com/ggml-org/llama.cpp/pull/13276

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Construct hybrid recurrent cache for hybrid recurrent models

This includes a refactor of the create_memory logic to avoid needing to use
the arch enum explicitly unless a model needs explicit cache instantiation
logic beyond the standard logic for recurrent, hybrid, unified, and iswa.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix wrong bool condition for split equal in hybrid cache

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix shift logic to defer to unified cache

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Support hybrid recurrent in llama-graph

NOTE: I intentionally did not add support for s_mask since it will be going
away soon

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix logic for initializing inputs and attn layers for hybrid caches

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Update recurrent cache for changes to remove intermediate kv_cache interface

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix status for init_update sig for recurrent cache state

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Add missing padding to n_ctx for hybrid cache construction

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Update clear signature for data argument after rebase

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove errant virtual destructor leftover from previous impl attempt

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use per-layer n_embd_k/v_s calls for mamba (1) layers

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Remove n_embd_k/v_s from unified cache

No longer needed now that unified isn't also supporting recurrent

https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140761069

Branch: HybridRecurrentCache

* refactor: Remove layer index from n_embd_k/v_s

Now that it's not used at all in the unified cache, we don't need to use
the layer index to zero it out for attention layers.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Remove n_embd_k/v_gqa from recurrent cache

This is no longer needed now that there are separate implementations

https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140825128

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Allow custom layer filters for hybrid recurrent

This should help support architectures like Falcon H1 where there is
overlap between layers that need attention and recurrent caches.

https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140748922

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove logits_all after rebase

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove llama_model_is_hybrid_Recurrent public API

https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2141728423

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Use llama_memory_state_ptr for child states in hybrid memory state

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Overhaul build_recurrent_state / build_inp_s_copy to match attention pattern

https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738

This is a big overhaul to bring consistency between how inputs and per-
layer components are created for attention layers and recurrent layers. The
main changes are:

- Rename class llm_graph_input_s_copy -> llm_graph_input_rs
- Add a corresponding llm_graph_input_rs_hybrid_recurrent
- Rename build_inp_s_copy -> build_rs_inp_recurrent
- Add a corresponding build_rs_inp_hybrid_recurrent
- Rename build_recurrent_state -> build_rs to match build_attn w/
llm_graph_input_rs android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input
- Add a corresponding overload of build_rs w/
llm_graph_input_rs_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input
- Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to
llm_graph_input_attn_kv_unified
- Add a build_attn override that takes
llm_graph_input_attn_kv_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input

This makes the two paradigms fully consistent. The main drawback is the
code duplication in the build_attn and build_rs implementations where the
only difference between implementations is how they cast the memory state.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix resize vs reserve and skip null tensors in size computation

https://github.com/ggml-org/llama.cpp/pull/13979/files#r2149469788

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-Authored-By: @younesbelkada

* fix: Fix initialization of child states

Since initially writing this PR, the logic in the child state types changed
such that using the "init full" signature and keeping the ubatches on the
parent struct no longer worked.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Use a common build_recurrent_state method that is cache-agnostic

This reduces the code duplication between the different build_rs impls and
also retains a similar signature to the previous build_recurrent_state
method while standardizing on the input-dispatched build_rs implementation.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* recurrent : rework graph inputs + add TODOs

ggml-ci

* refactor: Make status and child states const in hybrid and iswa

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Rename llama_kv_cache_[recurrent|hybrid_recurrent] to remove kv cache

This removes the notion of "kv" from the interface names for these memory
types. There are still many references to kv in the implementation of the
recurrent memory which will need further adjustment.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor!: Rename all k/v related values for recurrent/hybrid to r/s

Anywhere that "kv_<state|cell|size|etc>" is used, I've used the more
generic "mem_" prefix. The specifics of "k" (key) translate to "r"
(recurrent state) and "v" (value) translate to "s" (state-space embedding
states).

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refacor: _recurrent -> _recr for brevity

It just _happens_ to have the same number of letters as _attn!

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* style: Fix spacing for ref

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: recurrent_layer() -> is_recurrent()

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* style: Fix spacing for size_s_bytes declaration

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-19 08:08:14 +03:00
Georgi Gerganov ed3290ab34 metal : add mean kernel (#14267)
* metal : add mean kernel

ggml-ci

* cont : dedup implementation

ggml-ci
2025-06-19 08:05:21 +03:00
Aaron Teo 8d94713654 docs: add s390x build documentation (#14264)
* docs: add s390x-specific build docs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* docs: add s390x model conversion steps

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* docs: s390x build indent

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* docs: update hyperlinks for s390x docs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* docs: update llama.h docs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* docs: s390x add accelerator and perf optimizations

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* docs: s390x indent blocks

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* docs: revert block indentation

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* docs: add support information for s390x

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* docs: s390x reword

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* docs: remove indentation for accelerator section s390x

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* docs: remove redundant words s390x

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* docs: reword for s390x

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* docs: s390x reword simd

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* docs: fix trailing whitespace for s390x

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-06-18 18:10:26 +01:00
Aaron Teo 50d2227953 ggml-cpu: reduce asm calls for hsum (#14037)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-06-18 18:10:08 +01:00
Aaron Teo 6231c5cd6d ggml-cpu: fix uncaught underscore terminators (#14023)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-06-18 18:06:49 +01:00
Charles Xu ef035803eb ggml: Add Apple support for GGML_CPU_ALL_VARIANTS (#14258) 2025-06-18 12:40:07 +01:00
Xuan-Son Nguyen 413977de32 mtmd : refactor llava-uhd preprocessing logic (#14247)
* mtmd : refactor llava-uhd preprocessing logic

* fix editorconfig
2025-06-18 10:43:57 +02:00
Xuan-Son Nguyen 95402553a5 llama-chat : fix multiple system message for gemma, orion (#14246) 2025-06-18 09:58:43 +02:00
Sigbjørn Skjæret 3865cff4f5 convert : fix null head_dim AutoConfig regression (#14248) 2025-06-18 09:52:07 +02:00
Georgi Gerganov d03172cc79 sync : ggml
ggml-ci
2025-06-18 09:59:21 +03:00
Daniel Bevenius dd8e59f443 ggml : disable warnings for tests when using MSVC (ggml/1273)
* ggml : disable warnings for tests when using MSVC

This commit disables warnings for tests on windows when using MSVC.

The motivation for this is that this brings the build output more
inline with what Linux/MacOS systems produce.

There is still one warning generated for the tests which is:
```console
  Building Custom Rule C:/ggml/tests/CMakeLists.txt
cl : command line  warning D9025: overriding '/DNDEBUG' with '/UNDEBUG'
[C:\ggml\build\tests\test-arange.vcxproj]
  test-arange.cpp
  test-arange.vcxproj -> C:\ggml\build\bin\Release\test-arange.exe
```

* ggml : fix typo in tests disable list
2025-06-18 09:59:21 +03:00
Daniel Bevenius bbe98d2784 ggml : remove unused ggml_context_container (ggml/1272)
This commit removes the unused `ggml_context_container` structure from
the ggml library. It looks like the usage of this struct was removed in
Commit 4757fe18d56ec11bf9c07feaca6e9d5b5357e7f4 ("ggml : alloc
ggml_contexts on the heap (whisper/2525)").

The motivation for this changes is to improve code clarity/readability.
2025-06-18 09:59:21 +03:00
Daniel Bevenius c2056ed6d4 examples : include examples in msvc disable warn (ggml/1270)
This commit adds the examples in the "list" of targets to ignore MSVC
warnings.

The motivation for this is that currently the examples generate a number
of warnings that are ignore/disabled for the core ggml project. This
makes for a cleaner output when building.
2025-06-18 09:59:21 +03:00
bandoti c46503014d cmake: remove shader-gen step-targets from ggml-vulkan (#14226)
* Remove step-targets from vulkan-shaders-gen

* Unset DESTDIR when building vulkan-shaders-gen
2025-06-17 22:33:25 +02:00
xctan 860a9e4eef ggml-cpu : remove the weak alias trick (#14221) 2025-06-17 12:58:32 +03:00
R0CKSTAR fe9d60e74a musa: fix build warning (unused variable) (#14231)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-06-17 17:48:08 +08:00
Sigbjørn Skjæret e434e69183 common : suggest --jinja when autodetection fails (#14222) 2025-06-16 21:58:42 +02:00
Georgi Gerganov 89fea80d29 server : fix incorrect usage of llama_get_embeddings() (#14225)
* server : fix incorrect usage of llama_get_embeddings()

ggml-ci

* cont : fix the fix

ggml-ci
2025-06-16 22:33:27 +03:00
Diego Devesa 6adc3c3ebc llama : add thread safety test (#14035)
* llama : add thread safety test

* llamafile : remove global state

* llama : better LLAMA_SPLIT_MODE_NONE logic

when main_gpu < 0 GPU devices are not used

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-16 08:11:43 -07:00
bandoti 0dbcabde8c cmake: clean up external project logic for vulkan-shaders-gen (#14179)
* Remove install step for vulkan-shaders-gen

* Add install step to normalize msvc with make

* Regenerate modified shaders at build-time
2025-06-16 10:32:13 -03:00
Đinh Trọng Huy ad590be98c model : add NeoBERT (#14164)
* convert neobert model to gguf

* add inference graph

* fix flake8 lint

* followed reviewer suggestions

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* follow reviewers suggestions

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* override NeoBERT feed-forward length

---------

Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-16 14:53:41 +02:00
uvos 7d6d91babf HIP: disable rocwmma on gfx12 by default until rocm 7.0 (#14202) 2025-06-16 13:47:38 +02:00
Georgi Gerganov d3e64b9f49 llama : rework embeddings logic (#14208)
* llama : rework embeddings logic

ggml-ci

* cont : fix rerank

ggml-ci

* cont : engrish [no ci]

* cont : fix rerank

ggml-ci

* server : support both embeddings and completions with single model

ggml-ci

* cont : avoid embeddings_org

ggml-ci
2025-06-16 14:14:00 +03:00
Charles Xu 3ba0d843c6 ggml: Add Android support for GGML_CPU_ALL_VARIANTS (#14206) 2025-06-16 11:47:57 +02:00
Bartowski 0bf49eb668 convert : remove arcee change in convert_hf_to_gguf_update.py (#14207) 2025-06-16 10:16:06 +02:00
Đinh Trọng Huy 4ad243677b gguf-py : allow key override when adding value to GGUFWriter (#14194)
Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
2025-06-16 09:20:59 +02:00
Jeff Bolz c89c2d1ab9 vulkan: mutex around vkQueueSubmit (#14127)
This fixes the remaining crash in test-thread-safety on my system.
2025-06-16 08:21:08 +02:00
xctan 3555b3004b ggml-cpu : rework weak alias on apple targets (#14146)
* ggml-cpu : rework weak alias on apple targets

* fix powerpc detection

* fix ppc detection

* fix powerpc detection on darwin
2025-06-16 13:54:15 +08:00
Bartowski d7da8dc83a model : Add support for Arcee AI's upcoming AFM model (#14185)
* Add Arcee AFM support

* Add draft update code

* Fix linter and update URL, may still not be final

* Update src/llama-model.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Remote accidental blank line

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-06-16 01:04:06 +02:00
Eric Curtin cd355eda7d server : When listening on a unix domain socket don't print http:// and port (#14180)
Instead show something like this:

main: server is listening on file.sock - starting the main loop

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-06-15 23:36:22 +02:00
Ed Addario 30e5b01de2 quantize : change int to unsigned int for KV overrides (#14197) 2025-06-15 18:53:45 +02:00
uvos e54b394082 CUDA/HIP: fix ssm_scan on devices where warp size is not 32 (#14196) 2025-06-15 17:30:13 +02:00
uvos 2c2caa4443 HIP: Replace usage of depricated preprocessor macro __AMDGCN_WAVEFRONT_SIZE__ (#14183) 2025-06-15 15:45:27 +02:00
Georgi Gerganov 5fce5f948d kv-cache : fix use-after-move of defrag info (#14189)
ggml-ci
2025-06-15 10:52:11 +03:00
Mikko Juola 9ae4143bc6 model : add dots.llm1 architecture support (#14044) (#14118)
Adds:

* Dots1Model to convert_hf_to_gguf.py

* Computation graph code to llama-model.cpp

* Chat template to llama-chat.cpp to detect this model's template.

---

The model is called "dots.llm1" (I decided to shorten it to dots1 or
DOTS1 in the code generally) architecture.

The only models that exist as of writing of this commit that follow this
architecture are "dots.llm1.inst" and "dots.llm1.base" from here:

* https://huggingface.co/rednote-hilab/dots.llm1.inst

* https://huggingface.co/rednote-hilab/dots.llm1.base

The model architecture is a combination of Qwen and Deepseek parts, as
seen here:

https://github.com/huggingface/transformers/blob/ffe12627b4e84489d2ab91dd0ec00614855edc79/src/transformers/models/dots1/modular_dots1.py
2025-06-15 09:52:06 +02:00
Georgi Gerganov c311ac664d cparams : rename LLAMA_MAX_PARALLEL_SEQUENCES to LLAMA_MAX_SEQ (#14188)
ggml-ci
2025-06-15 10:08:58 +03:00
Georgi Gerganov b9912ac570 batch : auto-gen positions + verify multi-sequence input (#14177)
* batch : verify multi-sequence input batches

ggml-ci

* cont : auto-gen positions + verify multi-seq input

ggml-ci

* cont : first print debug info, then perform validation

ggml-ci

* cont : fix position auto-gen + add comments

ggml-ci
2025-06-15 09:18:37 +03:00
Pepijn de Vos 00ba772610 docs : remove WIP since PR has been merged (#13912) 2025-06-15 08:06:37 +02:00
Piotr 3cb203c89f llama-chat : Do not throw when tool parsing fails (#14012)
Currently when a model generates output which looks like a tool call,
but is invalid an exception is thrown and not handled, causing the cli
or llama-server to bail. Instead, handle the chat parser exception and
simply return the generated text in such cases.

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-06-14 17:25:15 +01:00
Aman Gupta 2e42be42bd compare-llama-bench: add option to plot (#14169)
* compare llama-bench: add option to plot

* Address review comments: convert case + add type hints

* Add matplotlib to requirements

* fix tests

* Improve comment and fix assert condition for test

* Add back default test_name, add --plot_log_scale

* use log_scale regardless of x_values
2025-06-14 10:34:20 +02:00
Georgi Gerganov fb85a288d7 vocab : fix build (#14175)
ggml-ci
2025-06-13 20:03:05 +03:00
Svetlozar Georgiev 40643edb86 sycl: fix docker image (#14144) 2025-06-13 18:32:56 +02:00
Guy Goldenberg 3cfbbdb44e Merge commit from fork
* vocab : prevent integer overflow during load

* Add static cast and GGML_ABORT

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-13 19:20:25 +03:00
Georgi Gerganov 80709b70a2 batch : add LLAMA_BATCH_DEBUG environment variable (#14172)
* batch : add LLAMA_BATCH_DEBUG environment variable

ggml-ci

* cont : improve seq_id display
2025-06-13 18:35:00 +03:00
ddpasa 26ff3685bf docs : Update multimodal.md (#14122)
* Update multimodal.md

* Update multimodal.md
2025-06-13 15:17:53 +02:00
Georgi Gerganov 60c666347b batch : rework llama_batch_allocr (#14153)
* batch : rework llama_batch_allocr

ggml-ci

* cont : move validation inside class

ggml-ci

* cont : move output counting to class

ggml-ci

* cont : minor

ggml-ci

* batch : add TODOs

ggml-ci
2025-06-13 13:47:55 +03:00
Georgi Gerganov b7cc7745e3 readme : remove survey link (#14168) 2025-06-13 11:55:44 +03:00
Christian Kastner cc8d081879 cmake: Add ability to pass in LLAMA_BUILD_NUMBER/COMMIT (#14167)
* cmake: Add ability to pass in LLAMA_BUILD_NUMBER/COMMIT

* cmake: Pass on LLAMA_BUILD_* to GGML_BUILD_*
2025-06-13 10:38:52 +02:00
Đinh Trọng Huy d714dadb57 pooling : make cls_b and cls_out_b optional (#14165)
Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
2025-06-13 11:34:08 +03:00
Georgi Gerganov ffad043973 server : fix SWA condition for full context reprocess (#14163)
ggml-ci
2025-06-13 11:18:25 +03:00
Anton Mitkov 0889eba570 sycl: Adding additional cpy dbg print output (#14034) 2025-06-13 08:51:39 +01:00
Ewan Crawford c61285e739 SYCL: Bump oneMath commit (#14152)
Update oneMath commit to merged PR https://github.com/uxlfoundation/oneMath/pull/669
which adds SYCL-Graph support for recording CUDA BLAS commands.

With this change the `MUL_MAT` tests now pass on DPC++ CUDA backends with SYCL-Graph
enabled. Prior to this change, an error would be thrown.

```
$ GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0 -o MUL_MAT -p type_a=f16,type_b=f32,m=16,n=1,k=256,bs=\\[1,1\\],nr=\\[2

UR CUDA ERROR:
        Value:           700
        Name:            CUDA_ERROR_ILLEGAL_ADDRESS
        Description:     an illegal memory access was encountered
        Function:        operator()
        Source Location: $HOME/dpcpp/unified-runtime/source/adapters/cuda/queue.cpp:154

Native API failed. Native API returns: 2147483646 (UR_RESULT_ERROR_UNKNOWN)
Exception caught at file:$HOME/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp, line:3598, func:operator()
SYCL error: CHECK_TRY_ERROR((stream)->wait()): Meet error in this line code!
  in function ggml_backend_sycl_synchronize at $HOME/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:3598
$HOME/llama.cpp/ggml/src/ggml-sycl/../ggml-sycl/common.hpp:118: SYCL error
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
```
2025-06-13 08:45:37 +01:00
Christian Kastner 09cf2c7c65 cmake : Improve build-info.cpp generation (#14156)
* cmake: Simplify build-info.cpp generation

The rebuild of build-info.cpp still gets triggered when .git/index gets
changes.

* cmake: generate build-info.cpp in build dir
2025-06-13 09:51:34 +03:00
Georgi Gerganov c33fe8b8c4 vocab : prevent heap overflow when vocab is too small (#14145)
ggml-ci
2025-06-13 08:03:54 +03:00
Anton Mitkov ed52f3668e sycl: Remove not needed copy f16->f32 for dnnl mul mat (#14125) 2025-06-12 15:15:11 +02:00
Georgi Gerganov a681b4ba83 readme : remove project status link (#14149) 2025-06-12 14:43:09 +03:00
Georgi Gerganov 7d516443dd server : re-enable SWA speculative decoding (#14131)
ggml-ci
2025-06-12 11:51:38 +03:00
Georgi Gerganov f6e1a7aa87 context : simplify output counting logic during decode (#14142)
* batch : remove logits_all flag

ggml-ci

* context : simplify output counting logic during decode

ggml-ci

* cont : fix comments
2025-06-12 11:50:01 +03:00
Georgi Gerganov c3ee46fab4 batch : remove logits_all flag (#14141)
ggml-ci
2025-06-12 11:49:26 +03:00
Georgi Gerganov e2c0b6e46a cmake : handle whitepsaces in path during metal build (#14126)
* cmake : handle whitepsaces in path during metal build

ggml-ci

* cont : proper fix

ggml-ci

---------

Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2025-06-12 10:14:24 +03:00
Georgi Gerganov 9596506965 kv-cache : fix split_equal handling in unified implementation (#14130)
ggml-ci
2025-06-12 10:02:15 +03:00
compilade a20b2b05bc context : round n_tokens to next multiple of n_seqs when reserving (#14140)
This fixes RWKV inference which otherwise failed
when the worst case ubatch.n_seq_tokens rounded to 0.
2025-06-12 02:56:04 -04:00
bandoti 2e89f76b7a common: fix issue with regex_escape routine on windows (#14133) 2025-06-11 17:19:44 -03:00
Christian Kastner 532802f938 Implement GGML_CPU_ALL_VARIANTS for ARM (#14080)
* ggml-cpu: Factor out feature detection build from x86

* ggml-cpu: Add ARM feature detection and scoring

This is analogous to cpu-feats-x86.cpp. However, to detect compile-time
activation of features, we rely on GGML_USE_<FEAT> which need to be set
in cmake, instead of GGML_<FEAT> that users would set for x86.

This is because on ARM, users specify features with GGML_CPU_ARM_ARCH,
rather than with individual flags.

* ggml-cpu: Implement GGML_CPU_ALL_VARIANTS for ARM

Like x86, however to pass around arch flags within cmake, we use
GGML_INTERNAL_<FEAT> as we don't have GGML_<FEAT>.

Some features are optional, so we may need to build multiple backends
per arch version (armv8.2_1, armv8.2_2, ...), and let the scoring
function sort out which one can be used.

* ggml-cpu: Limit ARM GGML_CPU_ALL_VARIANTS to Linux for now

The other platforms will need their own specific variants.

This also fixes the bug that the the variant-building branch was always
being executed as the else-branch of GGML_NATIVE=OFF. The branch is
moved to an elseif-branch which restores the previous behavior.
2025-06-11 21:07:44 +02:00
Sigbjørn Skjæret d4e0d95cf5 chore : clean up relative source dir paths (#14128) 2025-06-11 19:04:23 +02:00
Sigbjørn Skjæret cc66a7f78f tests : add test-tokenizers-repo (#14017) 2025-06-11 17:16:32 +02:00
Jeff Bolz bd248d4dc7 vulkan: Better thread-safety for command pools/buffers (#14116)
This change moves the command pool/buffer tracking into a vk_command_pool
structure. There are two instances per context (for compute+transfer) and
two instances per device for operations that don't go through a context.
This should prevent separate contexts from stomping on each other.
2025-06-11 09:48:52 -05:00
Aman 7781e5fe99 webui: Wrap long numbers instead of infinite horizontal scroll (#14062)
* webui: Wrap long numbers instead of infinite horizontal scroll

* Use tailwind class

* update index.html.gz
2025-06-11 16:42:25 +02:00
Georgi Gerganov 89a184fa71 kv-cache : relax SWA masking condition (#14119)
ggml-ci
2025-06-11 16:48:45 +03:00
Taylor 2baf07727f server : pass default --keep argument (#14120) 2025-06-11 13:43:43 +03:00
Georgi Gerganov 7ae2932116 kv-cache : add LLAMA_KV_CACHE_DEBUG environment variable (#14121) 2025-06-11 12:52:45 +03:00
Jeff Bolz 1f7d50b293 vulkan: Track descriptor pools/sets per-context (#14109)
Use the same descriptor set layout for all pipelines (MAX_PARAMETER_COUNT == 8)
and move it to the vk_device. Move all the descriptor pool and set tracking to
the context - none of it is specific to pipelines anymore. It has a single vector
of pools and vector of sets, and a single counter to track requests and a single
counter to track use.
2025-06-11 07:19:25 +02:00
lhez 4c763c8d1b opencl: add mul_mv_id_q4_0_f32_8x_flat (#14003) 2025-06-10 16:55:58 -07:00
compilade dad5c44398 kv-cache : avoid modifying recurrent cells when setting inputs (#13834)
* kv-cache : avoid modifying recurrent cells when setting inputs

* kv-cache : remove inp_s_mask

It was replaced with equivalent and simpler functionality
with rs_z (the first zeroed state) and the already-existing inp_s_copy.

* kv-cache : fix non-consecutive token pos warning for recurrent models

The problem was apparently caused by how the tail cells were swapped.

* graph : simplify logic for recurrent state copies

* kv-cache : use cell without src refs for rs_z in recurrent cache

* llama-graph : fix recurrent state copy

The `state_copy` shuffle assumes everything is moved at once,
which is not true when `states_extra` is copied back to the cache
before copying the range of states between `head` and `head + n_seqs`.
This is only a problem if any of the cells in [`head`, `head + n_seqs`)
have an `src` in [`head + n_seqs`, `head + n_kv`),
which does happen when `n_ubatch > 1` in the `llama-parallel` example.

Changing the order of the operations avoids the potential overwrite
before use, although when copies are avoided (like with Mamba2),
this will require further changes.

* llama-graph : rename n_state to state_size in build_recurrent_state

This naming should reduce confusion between the state size
and the number of states.
2025-06-10 18:20:14 -04:00
Sigbjørn Skjæret 55f6b9fa65 convert : fix duplicate key DeepSeek-R1 conversion error (#14103) 2025-06-10 23:29:52 +02:00
Sigbjørn Skjæret 3678b838bb llama : support GEGLU for jina-bert-v2 (#14090) 2025-06-10 18:02:08 +02:00
Jeff Bolz 652b70e667 vulkan: force device 0 in CI (#14106) 2025-06-10 10:53:47 -05:00
Juk Armstrong 3a12db23b6 Fixed spec timings to: accepted/tested instead of accepted/drafted (#14104) 2025-06-10 16:48:07 +01:00
Georgi Gerganov ae92c1855b sync : ggml
ggml-ci
2025-06-10 18:39:33 +03:00
Georgi Gerganov b7ce1ad1e3 ggml : fix weak alias win32 (whisper/0)
ggml-ci
2025-06-10 18:39:33 +03:00
0cc4m 97340b4c99 Vulkan: Don't default to CPU device (like llvmpipe), even if no other device is available, to allow fallback to CPU backend (#14099) 2025-06-10 13:01:33 +01:00
Isaac McFadyen 2bb0467043 rpc : nicer error messages for RPC server crash (#14076) 2025-06-10 09:41:01 +03:00
Georgi Gerganov b8e2194efc sync : ggml
ggml-ci
2025-06-10 09:21:56 +03:00
Kai Pastor 1a3b5e80f7 Add in-build ggml::ggml ALIAS library (ggml/1260)
Enable uniform linking with subproject and with find_package.
2025-06-10 09:21:56 +03:00
Georgi Gerganov 1f63e75f3b metal : use less stack memory in FA kernel (#14088)
* metal : use less stack memory in FA kernel

ggml-ci

* cont : fix BF16 variant
2025-06-09 23:05:02 +03:00
Georgi Gerganov 40cbf571c9 kv-cache : fix shift and defrag logic (#14081)
* kv-cache : fix shift

ggml-ci

* cont : reset shift[i]

ggml-ci

* cont : fix defrag erasing cells that didn't move

ggml-ci
2025-06-09 23:04:35 +03:00
Diego Devesa 7f4fbe5183 llama : allow building all tests on windows when not using shared libs (#13980)
* llama : allow building all tests on windows when not using shared libraries

* add static windows build to ci

* tests : enable debug logs for test-chat

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-09 20:03:09 +02:00
xctan f470bc36be ggml-cpu : split arch-specific implementations (#13892)
* move ggml-cpu-aarch64 to repack

* split quantize_row_q8_0/1

* split helper functions

* split ggml_vec_dot_q4_0_q8_0

* split ggml_vec_dot_q4_1_q8_1

* split ggml_vec_dot_q5_0_q8_0

* split ggml_vec_dot_q5_1_q8_1

* split ggml_vec_dot_q8_0_q8_0

* split ggml_vec_dot_tq1_0_q8_K

* split ggml_vec_dot_tq2_0_q8_K

* split ggml_vec_dot_q2_K_q8_K

* split ggml_vec_dot_q3_K_q8_K

* split ggml_vec_dot_q4_K_q8_K

* split ggml_vec_dot_q5_K_q8_K

* split ggml_vec_dot_q6_K_q8_K

* split ggml_vec_dot_iq2_xxs_q8_K

* split ggml_vec_dot_iq2_xs_q8_K

* split ggml_vec_dot_iq2_s_q8_K

* split ggml_vec_dot_iq3_xxs_q8_K

* split ggml_vec_dot_iq3_s_q8_K

* split ggml_vec_dot_iq1_s_q8_K

* split ggml_vec_dot_iq1_m_q8_K

* split ggml_vec_dot_iq4_nl_q8_0

* split ggml_vec_dot_iq4_xs_q8_K

* fix typos

* fix missing prototypes

* rename ggml-cpu-quants.c

* rename ggml-cpu-traits

* rename arm folder

* move cpu-feats-x86.cpp

* rename ggml-cpu-hbm

* update arm detection macro in quants.c

* move iq quant tables

* split ggml_quantize_mat_q8_0/K

* split ggml_gemv_*

* split ggml_gemm_*

* rename namespace aarch64 to repack

* use weak aliases to replace test macros

* rename GGML_CPU_AARCH64 to GGML_CPU_REPACK

* rename more aarch64 to repack

* clean up rebase leftover

* fix compilation errors

* remove trailing spaces

* try to fix clang compilation errors

* try to fix clang compilation errors again

* try to fix clang compilation errors, 3rd attempt

* try to fix clang compilation errors, 4th attempt

* try to fix clang compilation errors, 5th attempt

* try to fix clang compilation errors, 6th attempt

* try to fix clang compilation errors, 7th attempt

* try to fix clang compilation errors, 8th attempt

* try to fix clang compilation errors, 9th attempt

* more cleanup

* fix compilation errors

* fix apple targets

* fix a typo in arm version of ggml_vec_dot_q4_K_q8_K

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-09 16:47:13 +02:00
Diego Devesa 8f47e25f56 cuda : fix device sync on buffer clear (#14033) 2025-06-09 16:36:26 +02:00
Georgi Gerganov 201b31dc2e graph : fix geglu (#14077)
ggml-ci
2025-06-09 17:17:31 +03:00
Xinpeng Dou e21d2d4ae2 CANN: Simplify the environment variable setting(#13104)
* Simplify the environment variable setting to specify the memory pool type.

* Adjust the GGML_CANN_ASYNC_MODE setting to accept yes, enable, 1, or on (case-insensitive) as valid options.

* update

* fix CI

* update

* delete whitespace

* fix according to review

* update CANN.md

* update CANN.md
2025-06-09 19:47:39 +08:00
R0CKSTAR dc0623fddb webui: fix sidebar being covered by main content (#14082)
* webui: fix sidebar being covered by main content

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* webui: update index.html.gz

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-06-09 12:01:17 +02:00
Georgi Gerganov 87d34b381d server : fix LRU check (#14079)
ggml-ci
2025-06-09 12:57:58 +03:00
Nicolò Scipione b460d16ae8 sycl: Add reorder to Q6_K mmvq implementation (#13885)
* Add Reorder to Q6_K mmvq implementation

* Address PR comments: clean up comments

* Remove unused parameter after refactoring q4_k

* Adding inline to function and removing unnecessary reference to int

---------

Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
2025-06-09 11:47:07 +02:00
Đinh Trọng Huy 91a8ee6a6f add geglu activation function (#14074)
Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
2025-06-09 05:15:31 +01:00
Yuanhao Ji 056eb74534 CANN: Enable labeler for Ascend NPU (#13914) 2025-06-09 11:20:06 +08:00
Diego Devesa 247e5c6e44 cuda : fix buffer type check with integrated GPUs (#14069) 2025-06-08 11:39:56 -07:00
吴小白 5787b5da57 ci: add LoongArch cross-compile build (#13944) 2025-06-07 10:39:11 -03:00
Akarshan Biswas 228f34c9ce SYCL: Implement few same quantized type copy kernels (#13739)
* SYCL: Implement few same quantized type copy kernels

* Use memcpy for copying contiguous tensors

ggml-ci

* feat(sycl): add contiguous tensor copy support and device checks

Adds a memcpy path for contiguous tensors of the same type to optimize data transfer. Updates device support checks to recognize contiguous tensor operations, improving compatibility and performance.

* refactor: replace specific block copy functions with template

The changes replace multiple redundant block copy functions (e.g., cpy_block_q8_0_q8_0, cpy_block_q5_0_q5_0) with a single templated function cpy_blck_q_q. This reduces code duplication by using a generic template that works for any block type, improving maintainability while preserving the same functionality. The template is instantiated with specific block types (e.g., block_q8_0) where needed.

* Exclude BF16 support for COPY tensors for now
ggml-ci

* perf: adjust SYCL copy kernel block sizes for efficiency

Use ceil_div to ensure full element coverage and update nd_range parameters to better align with SYCL block sizes, improving parallelism and device utilization in copy operations.
2025-06-07 18:58:20 +05:30
Sigbjørn Skjæret 0974ad7a7c llama : fix llama_model_chat_template with template name (LLM_KV with suffix) (#14050) 2025-06-07 14:13:12 +02:00
Georgi Gerganov 745aa5319b llama : deprecate llama_kv_self_ API (#14030)
* llama : deprecate llama_kv_self_ API

ggml-ci

* llama : allow llama_memory_(nullptr)

ggml-ci

* memory : add flag for optional data clear in llama_memory_clear

ggml-ci
2025-06-06 14:11:15 +03:00
Georgi Gerganov 487a5e0401 context : fix SWA-related warning for multiple sequences (#14045) 2025-06-06 13:29:18 +03:00
Sigbjørn Skjæret d17a809ef0 llama : support multiple classifier outputs and labels (#13940) 2025-06-06 09:03:25 +02:00
Sigbjørn Skjæret 1caae7fc6c gguf-py : add add_classifier_output_labels method to writer (#14031)
* add add_classifier_output_labels

* use add_classifier_output_labels
2025-06-05 17:42:31 +02:00
Masato Nakasaka 669c13e0f6 vulkan: Enable VK_KHR_cooperative_matrix extension for Intel Xe2 GPUs (#14001)
* allowing B580 and U9-288V

* experimenting code to detect Xe2

* allowing coopmat only for Xe2 GPUs

* fixed comment wording

* fixed comment wording

* removed unnecessary driver check
2025-06-05 16:00:29 +02:00
pockers21 146b88e8b3 ci: fix CUDA build failure on autodl cloud machines (#14005)
Replace CMAKE_CUDA_ARCHITECTURES=native with nvidia-smi detection
as 'native' fails on autodl cloud environments.

Co-authored-by: pockers21 <liyang2@uniontech.com>
2025-06-05 16:25:29 +03:00
Georgi Gerganov 7f37b6cf1e memory : migrate from llama_kv_cache to more generic llama_memory (#14006)
* memory : merge llama_kv_cache into llama_memory + new `llama_memory` API

ggml-ci

* context : fix casts

ggml-ci
2025-06-05 15:29:22 +03:00
Diego Devesa 3a077146a4 llama : allow using mmap without PrefetchVirtualMemory, apply GGML_WIN_VER to llama.cpp sources (#14013) 2025-06-05 11:57:42 +02:00
Olexandr88 d01d112abb readme : add badge (#13938) 2025-06-05 10:50:55 +03:00
Sigbjørn Skjæret 9f47fa5792 vocab : warn about missing mask token (#14022) 2025-06-05 09:29:18 +02:00
Georgi Gerganov 9e31bec4fd context : fix pos_min initialization upon error decode (#14008)
ggml-ci
2025-06-05 09:06:29 +03:00
Jeff Bolz 5a8ae3053c vulkan: automatically deduce size of push constants (#13936) 2025-06-05 07:17:58 +02:00
Ervin Áron Tasnádi 0d3984424f ggml-vulkan: adds support for op CONV_TRANSPOSE_1D (#13813)
* * ggml-vulkan: adds op CONV_TRANSPOSE_1D

* test-backend-ops: adds more spohisticated tests for CONV_TRANSPOSE_1D

* Missing barrier added to shader.
Number of additional tests reduced to 108.

* * Fixes typo in variable name.

* Removes extra whitespaces.

* Adds int64->int32 casts to prevent possible warnings.

* Problem size reduced in tests to pass tests with llvmpipe.

* supports_op condition moved from unintended position
2025-06-04 22:02:00 +02:00
Georgi Gerganov 3e63a58ef7 kv-cache : refactor the update/defrag mechanism (#13988)
* kv-cache : refactor update mechanism

ggml-ci

* memory : improve status handling

* defrag : reset head + add comments

ggml-ci

* cont : minor fixes

ggml-ci
2025-06-04 18:58:20 +03:00
Diego Devesa 2589ad3704 ci : remove cuda 11.7 releases, switch runner to windows 2022 (#13997) 2025-06-04 15:37:40 +02:00
Diego Devesa 482548716f releases : use dl backend for linux release, remove arm64 linux release (#13996) 2025-06-04 13:15:54 +02:00
Xuan-Son Nguyen 3ac67535c8 llama-graph : use ggml_repeat_4d (#13998) 2025-06-04 10:11:26 +02:00
Johannes Gäßler 0b4be4c435 CUDA: fix FTZ in FA for Gemma 3 (#13991) 2025-06-04 08:57:05 +02:00
Georgi Gerganov e0e806f52e kv-cache : fix unified::seq_rm to work with seq_id < 0 (#13985)
ggml-ci
2025-06-04 09:50:32 +03:00
Jeff Bolz 7e00e60ef8 vulkan: fix warnings in perf logger querypool code (#13937) 2025-06-03 20:30:22 +02:00
Xuan-Son Nguyen ea1431b0fa docs : add "Quick start" section for new users (#13862)
* docs : add "Quick start" section for non-technical users

* rm flox

* Update README.md
2025-06-03 13:09:36 +02:00
lhez 71e74a3ac9 opencl: add backend_synchronize (#13939)
* This is not needed by the normal use where the result is read
  using `tensor_get`, but it allows perf mode of `test-backend-ops`
  to properly measure performance.
2025-06-02 16:54:58 -07:00
rmatif bfb1e012a0 OpenCL: Add concat, tsembd, upscale, tanh, pad and repeat (#13840)
* add concat, pad, repeat, tsembd, tanh, upscale

* small fixes
2025-06-02 16:53:36 -07:00
Georgi Gerganov 3637576288 server : disable speculative decoding for SWA models (#13970)
* server : use swa-full fo draft context

ggml-ci

* server : disable speculative decoding for SWA models
2025-06-02 21:34:40 +03:00
Georgi Gerganov ea394d7ab1 metal : use F32 accumulators in FA kernels (#13975)
ggml-ci
2025-06-02 21:33:40 +03:00
Georgi Gerganov 5582c49c39 gemma : more consistent attention scaling for v2 and v3 (#13951)
* gemma : fix attn scale for 27B

* cont : apply scale before attn

* cont : consistent attention scaling
2025-06-02 20:54:26 +03:00
Olivier Chafik c9bbc77931 server: update deepseek reasoning format (pass reasoning_content as diffs) (#13933)
* server: update deepseek reasoning format (now in reasoning_content diffs), add legacy option for compat
* update unit/test_tool_call.py::test_thoughts
2025-06-02 10:15:44 -07:00
Xuan-Son Nguyen bfd322796c mtmd : fix memory leak in mtmd_helper_eval_chunk_single (#13961)
* mtmd : fix memory in mtmd_helper_eval_chunk_single

* mtmd-cli : fix mem leak

* Update tools/mtmd/mtmd-cli.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-02 16:29:28 +02:00
shalinib-ibm 093e3f1feb cmake : Handle mixed-case 'Power' strings in POWER CPU detection (#13966)
Some systems report the CPU implementation as "Power11" instead of "POWER11".
The existing CMake logic uses a case-sensitive regular expression to extract
the CPU generation, which fails when the casing doesn't exactly match "POWER".

This patch provides a fix by first converting the string to uppercase before applying the regex.

Signed-off-by: root <root@rheldb2v.pperf.tadn.ibm.com>
Co-authored-by: root <root@rheldb2v.pperf.tadn.ibm.com>
2025-06-02 15:18:36 +03:00
Atharva Dubey 663445b0de sycl: quantize and reorder the input to q8_1 when reorder is enabled (#13826)
* [WIP]: fuse q8 quantization and reorder

* wip2: fuse q8 quantization and reorder

* working q8 reorder commit

* restored common.hpp

* remove debug prints

* remove unnecessary headers and remove trailing whitespace

* Update ggml/src/ggml-sycl/ggml-sycl.cpp

Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@intel.com>

---------

Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@intel.com>
2025-06-02 10:12:20 +01:00
Johannes Gäßler 7675c555a1 gguf: fix failure on version == 0 (#13956) 2025-06-01 18:08:05 +02:00
Sigbjørn Skjæret 5e1c3aed40 convert : fix nomic-bert-moe mask token (#13757) 2025-06-01 18:07:21 +02:00
Sigbjørn Skjæret c496fe0b1d convert : fix vocab padding code for bert models (#13954) 2025-06-01 17:23:11 +02:00
Aaron Teo e57bb87ced ggml: check if non-native endian model is being loaded (#13943)
* gguf: prevent non-native endian models from being loaded

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* gguf: update error message

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* gguf: make the non-native endian check more verbose

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: move ggml_assert location

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: reword the endianness check error message

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-06-01 16:53:57 +02:00
Georgi Gerganov f3a4b1659c sync : ggml
ggml-ci
2025-06-01 13:43:57 +03:00
Kai Pastor 108009f5c7 vulkan : Remove unexpected ; (ggml/1253) 2025-06-01 13:43:57 +03:00
Kai Pastor d337252acf cmake : Fix broken CMake error messages (ggml/1252) 2025-06-01 13:43:57 +03:00
Radoslav Gerganov af6f91db47 ggml : remove ggml_graph_import and ggml_graph_export declarations (ggml/1247)
The implementation is already deleted with commit 9d0762e.

closes: #1235
2025-06-01 13:43:57 +03:00
Georgi Gerganov a7b8d35f78 sync : whisper.cpp (ggml/1250)
* ggml : Fix backtrace breaking Windows build (whisper/3203)

* sync : whisper.cpp

ggml-ci

---------

Co-authored-by: Daniel Tang <danielzgtg.opensource@gmail.com>
2025-06-01 13:43:57 +03:00
Radoslav Gerganov 6eba72b71c ggml : install dynamic backends (ggml/1240)
* ggml : install dynamic backends

Make sure dynamic backends are installed in $CMAKE_INSTALL_BINDIR
2025-06-01 13:43:57 +03:00
Daniel Tang fedf034a98 ggml : Print backtrace on uncaught C++ exceptions (ggml/1232)
The goal is to have what users call "full logs" contain the backtrace.

This is registered upon ggml_init. Also fixes a minor fd leak on Linux.
2025-06-01 13:43:57 +03:00
ddh0 8726392d3d readme : update bindings (#13950) 2025-06-01 11:44:30 +03:00
Georgi Gerganov c04621711a parallel : fix n_junk == 0 (#13952) 2025-06-01 11:42:16 +03:00
Georgi Gerganov 0fc16b42e8 kv-cache : split implementation in separate sources (#13920)
ggml-ci
2025-06-01 11:39:27 +03:00
Max Krasnyansky 053b1539c0 threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling (#12995)
* threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling

We talked about adding LOW priority for GGML threads in the original threadpool PR.
It might be useful for some cases to avoid contention.

Latest Windows ARM64 releases started parking (offlining) the CPU cores
more aggresively which results in suboptimal performance with n_threads > 4.
To deal with that we now disable Power Throttling for our threads for the NORMAL
and higher priorities.

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* threading: disable SetThreadInfo() calls for older Windows versions

* Update tools/llama-bench/llama-bench.cpp

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-05-31 15:39:19 -07:00
Jiří Podivín b3a89c3d9e docs : Note about necessity of having libcurl installed for standard build. (#13945)
Signed-off-by: Jiri Podivin <jpodivin@gmail.com>
2025-05-31 18:58:35 +02:00
Olivier Chafik e15898d1c7 server: allow unclosed thinking tags (#13931) 2025-05-31 08:26:10 -07:00
Georgi Gerganov 803f8baf4f llama : deprecate explicit kv_self defrag/update calls (#13921)
ggml-ci
2025-05-31 15:58:33 +03:00
Georgi Gerganov 3600cc2886 llama : use n_swa + n_ubatch cells for SWA cache (#13833)
* llama : use n_swa + n_ubatch cells for SWA cache

ggml-ci

* llama : add warning about multi-sqeuence SWA contexts
2025-05-31 15:57:44 +03:00
igardev c7e0a2054b webui : Replace alert and confirm with custom modals. (#13711)
* Replace alert and confirm with custom modals. This is needed as Webview in VS Code doesn't permit alert and confirm for security reasons.

* use Modal Provider to simplify the use of confirm and alert modals.

* Increase the z index of the modal dialogs.

* Update index.html.gz

* also add showPrompt

* rebuild

---------

Co-authored-by: igardev <ivailo.gardev@akros.ch>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-05-31 11:56:08 +02:00
Georgi Gerganov 3f55f781f1 llama : auto-batch preparation (#13845)
* llama : auto-batch

ggml-ci

* context : simplify if branching
2025-05-31 12:55:57 +03:00
Xuan-Son Nguyen 51fa76f172 mtmd : drop _shared from libmtmd name, merge helpers into libmtmd (⚠️ breaking change) (#13917)
* mtmd : fix missing public header

* no object

* apply suggestion from Georgi

* rm mtmd-helper, merge it to mtmd

* missing vendor include dir
2025-05-31 10:14:29 +02:00
Georgi Gerganov 12d0188c0d kv-cache : refactor + add llama_memory_state_i (#13746)
* kv-cache : simplify the "struct llama_kv_cache" interface

ggml-ci

* kv-cache : revert the (n_swa + n_ubatch) change (for next PR)

ggml-ci

* kv-cache : some comments

ggml-ci

* context : fix graph reserve for multiple sequences

ggml-ci

* kv-cache : fix typo [no ci]

* kv-cache : fix find_slot() logic for free slots

ggml-ci

* llama : add TODO for deprecating the defrag API in the future

* kv-cache : improve find_slot() using min/max seq pos info

ggml-ci

* llama : handle aborts and compute errors

ggml-ci

* memory : extract state into llama_memory_state

ggml-ci

* kv-cache : add comments

ggml-ci

* server : update batching logic to reset n_batch on successful decode

* server : upon full re-processing, remove the sequence from the cache

* kv-cache : add TODO for doing split_equal when split_simple fails

ggml-ci
2025-05-31 10:24:04 +03:00
Shawn yang eb3949938e CUDA: add a prop in ggml_cuda_device_infor for distinguish iGPU or dGPU in cuda (#13856) (#13895)
* 1.  add "integrated" in ggml_cuda_device_info for distinguish whether it is Intergrate_gpu or discrete_gpu
2. Adjust the func:"ggml_backend_cuda_device_supports_buft" for this new feature

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Adjusted code indentation

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Fixed incorrect setting of variable types

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Adjusted the judgment logic

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* add a host_buft assert in case of integrated_cuda_device with func:'evaluate_and_capture_cuda_graph()'

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Add a defensive security assert

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Adjusted the support judgment logic.

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* revoke the suggest commit changes due to it's not applicable in jetson_device

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Add parentheses to enforce operator precedence​

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Fix ci bug: add a spaces

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: yangxiao <yang_xl@tju.edu.cn>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: yangxiao <yangxl_zz@qq.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-05-31 08:48:04 +02:00
Johannes Gäßler e562eece7c CUDA: fix typo in FlashAttention code (#13926) 2025-05-30 21:22:03 +02:00
Diego Devesa b47ab7b8e9 sched : avoid changing cur_copy when a graph is already allocated (#13922) 2025-05-30 18:56:19 +02:00
Georgi Gerganov dd665cc9d4 parallel : increase the variability of the prompt lengths (#13927)
ggml-ci
2025-05-30 19:38:07 +03:00
Diego Devesa df0c0c7d02 cuda : prevent using split buffers with 3d/4d matrices (#13919) 2025-05-30 16:37:18 +02:00
Akarshan Biswas b49a8ff96b SYCL: Add mrope kernel (#13755)
* SYCL: Add mrope kernel

* feat: Optimize rope operations with vectorization

Uses `sycl::vec` to load and store two elements at a time,
significantly improving performance in `rope_norm`,
`rope_neox`, and `rope_multi`. This reduces the number of memory
accesses and leverages SIMD instructions for faster execution.

* Use ceil_div
2025-05-30 19:40:57 +05:30
Georgi Gerganov 53f925074d sync : vendor (#13901)
* sync : vendor

ggml-ci

* cont : fix httplib version

ggml-ci

* cont : fix lint

* cont : fix lint

* vendor : move to common folder /vendor

ggml-ci

* cont : fix lint

* cont : move httplib to /vendor + use json_fwd.hpp

ggml-ci

* cont : fix server build

ggml-ci

* cont : add missing headers

ggml-ci

* cont : header clean-up

ggml-ci
2025-05-30 16:25:45 +03:00
Sigbjørn Skjæret db38704f01 convert : fix rwkv bos/eos token (#13844) 2025-05-30 14:50:43 +02:00
Xuan-Son Nguyen 07e4351ce6 convert : allow partial update to the chkhsh pre-tokenizer list (#13847)
* convert : allow partial update to the chkhsh pre-tokenizer list

* code style

* update tokenizer out

* rm inp/out files for models not having gguf

* fixed hash for glm

* skip nomic-bert-moe test

* Update convert_hf_to_gguf_update.py

* fix minerva-7b hash

* rm redundant import
2025-05-30 12:24:37 +02:00
Đinh Trọng Huy 291f2b6913 llama : add support for DistilBert (#13907)
* add distilbert

* small fixes

* add note for LLM_ARCH_DISTIL_BERT

* Use MODEL_ARCH.BERT for DistilBert

---------

Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
2025-05-30 11:56:02 +02:00
zhangkaihuo 2c90da4c7e llama : use llm_build_granite for minicpm (#13911) 2025-05-30 10:31:48 +02:00
Christian Kastner ec9e0301fe cmake: Guard GGML_CPU_ALL_VARIANTS by architecture (#13890) 2025-05-30 01:28:54 +02:00
Sigbjørn Skjæret e83ba3e460 llama : add support for jina-reranker-v2 (#13900) 2025-05-29 21:42:31 +02:00
Sigbjørn Skjæret 2b131621e6 gguf-py : add support for sub_type (in arrays) in GGUFWriter add_key_value method (#13561) 2025-05-29 15:36:05 +02:00
Yibo Cai 54a2c7a8cd arm64: optimize q4_k_q8_k kernel with i8mm (#13886)
This PR improves q4_k_q8_k gemm kernel with arm64 i8mm instruction.

Tested on neoverse-n2 with llama3 8b q4_k_m quantization model.
- 34% ~ 50% S_PP uplift for all batch sizes
- 12% ~ 37% S_TG uplift for batch size 4 and above

Perplexity doesn't change with this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |   110.12 |   147.83 |    24.36 |    24.28 |
|   128 |    128 |    2 |   121.16 |   172.42 |    46.36 |    47.93 |
|   128 |    128 |    4 |   120.15 |   169.75 |    74.68 |    84.00 |
|   128 |    128 |    8 |   130.97 |   196.81 |    91.04 |   114.74 |
|   128 |    128 |   16 |   131.01 |   196.88 |   101.43 |   135.79 |
|   128 |    128 |   32 |   130.85 |   196.51 |   106.97 |   147.29 |
---------------------------------------------------------------------
```
2025-05-29 14:39:20 +03:00
Christian Kastner 21fcc21ad5 cmake: Factor out CPU architecture detection (#13883)
* cmake: Define function for querying architecture

The tests and results match exactly those of ggml/src/CMakeLists.txt

* Switch arch detection over to new function
2025-05-29 12:50:25 +02:00
Vineel Abhinav dd8ba93416 ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Algorithm (#13882)
* F32-Mamba-Seq_Scan-SVE

* Fix formatting

* ggml : missing space

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-05-29 12:18:43 +03:00
Georgi Gerganov 66c92061f5 tests : remove json.hpp from a test (#13880)
ggml-ci
2025-05-29 12:17:16 +03:00
Sigbjørn Skjæret 5ca82fc1d7 convert : workaround for AutoConfig dummy labels (#13881) 2025-05-29 10:00:57 +02:00
Sigbjørn Skjæret 6385b843a8 llama : add RobertaForSequenceClassification reranker support (#13875) 2025-05-29 08:15:01 +02:00
Vineel Abhinav 1b8fb8152d ggml: aarch64: Implement SVE F32 kernels for vector functions (#13843)
* F32-Mamba-SVE

* F32-Mamba-SVE

* Resolve test errors-1

* Resolve test errors-2

* F32-vec-SVE

* F32-vec-SVE

* F32-vec-SVE
2025-05-29 09:01:33 +03:00
Beinsezii 53ae30640e gguf-py : fix SafetensorRemote return on undefined size (< 0) (#13841) 2025-05-28 23:50:20 +02:00
Xuan-Son Nguyen 763d06edb7 llama : fix KV shift for qwen2vl (#13870)
* llama : fix KV shift for qwen2vl

* add ref to the PR
2025-05-28 22:35:31 +02:00
Xuan-Son Nguyen 10961339b2 mtmd : move helpers to dedicated library (⚠️ breaking change) (#13866)
* mtmd : move helpers to dedicated library

* fix server build

* rm leftover cmakelist code
2025-05-28 22:35:22 +02:00
bandoti d98f2a35fc ci: disable LLAMA_CURL for Linux cross-builds (#13871) 2025-05-28 15:46:47 -03:00
Đinh Trọng Huy e0e3aa231d llama : add support for BertForSequenceClassification reranker (#13858)
* convert: add support for BertForSequenceClassification

* add support for reranking using BertForSequenceClassification

* merge checks of eos and sep

* fix lint

---------

Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
2025-05-28 19:01:58 +02:00
Đinh Trọng Huy aa6dff05be convert: small addition to support LlamaModel (#13838)
Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
2025-05-28 16:34:18 +02:00
Sky c962ae3382 server: fix remove 'image_url'/'input_audio' json-object effectlly for 'llama_params' in multimodal-model-mode (#13853)
[fix]: remove 'image_url'/'input_audio' effectlly for 'llama_params' in multimodal-model-mode
2025-05-28 16:33:54 +02:00
Xuan-Son Nguyen a3938fb53d convert : fix qwen omni conversion (#13859)
* convert : fix qwen omni conversion

* fix typo
2025-05-28 16:12:35 +02:00
Alex Fanthome f7873fc698 tests : change umlaut test (#11600) 2025-05-28 15:49:28 +02:00
Johannes Gäßler a68247439b CUDA: fix FA tg at long context for CC >= 8.9 (#13852) 2025-05-28 13:33:37 +02:00
Xuan-Son Nguyen 26b79b6cb3 convert : fix tensor naming conflict for llama 4 vision (#13836)
* convert : fix tensor naming conflict for llama 4 vision

* add comment
2025-05-28 10:05:54 +02:00
leo-pony 1e8659e65a CANN: Add SOC TYPE printing in cmake configuration (#13837) 2025-05-28 11:54:20 +08:00
lhez a3c30846e4 opencl: add new ops - argsort, div, sub, addrows, sigmoid, group_norm (#13787)
* opencl: add `argsort`

* opencl: add `div`

* opencl: add `add_rows`

* opencl: add `sub`

* opencl: add `sigmoid`, both `f16` and `f32`

* opencl: add `group_norm`
2025-05-27 12:56:08 -07:00
lhez 1701d4c54f opencl: mark mul_mat f32f32 as supporting non-contiguous tensors (#13790) 2025-05-27 12:53:14 -07:00
Jeff Bolz bef8176387 vulkan: use timestamp queries for GGML_VULKAN_PERF (#13817)
Also change it to be controlled by an env var rather than cmake flag
2025-05-27 18:39:07 +02:00
Georgi Gerganov 34b7c0439e cmake : add llama-cparams.cpp to build (#13832) 2025-05-27 19:08:44 +03:00
Akarshan Biswas f3101a8cc6 SYCL: add gelu_erf kernel (#13749)
* SYCL: add gelu_erf kernel

* refactor code

Co-authored-by: Atharva Dubey <atharva.dubey@codeplay.com>

* Use scope_op_debug_print

---------

Co-authored-by: Atharva Dubey <atharva.dubey@codeplay.com>
2025-05-27 20:52:59 +05:30
Georgi Gerganov 1c49c70d07 sync : ggml 2025-05-27 18:05:33 +03:00
Xuan-Son Nguyen a8ea03d8ad ggml : add ggml_repeat_4d (#13824) 2025-05-27 15:53:55 +02:00
xctan 05f6ac6283 ggml : riscv: add xtheadvector support (#13720)
* ggml : riscv: add xtheadvector support

* ggml : clean up some macro usage
2025-05-27 16:21:36 +03:00
Xuan-Son Nguyen bc583e3c63 mtmd : support Qwen 2.5 Omni (input audio+vision, no audio output) (#13784)
* mtmd : allow multiple modalities at the same time

* refactor mtmd tokenizer

* fix compile

* ok, missing SinusoidsPositionEmbedding

* first working version

* fix style

* more strict validate of n_embd

* refactor if..else to switch

* fix regression

* add test for 3B

* update docs

* fix tokenizing with add_special

* add more tests

* fix test case "huge"

* rm redundant code

* set_position_mrope_1d rm n_tokens
2025-05-27 14:06:10 +02:00
bandoti 72b090da2c docs: remove link for llama-cli function calling (#13810) 2025-05-27 08:52:40 -03:00
Christian Kastner 7fe03e7446 ggml-cpu: x86 feature detection is specific to x86 (#13811) 2025-05-27 13:18:39 +02:00
Diego Devesa 952f3953c1 ggml : allow CUDA graphs when using pipeline parallelism (#13814) 2025-05-27 13:05:18 +02:00
Georgi Gerganov 81713121ee kv-cells : track min/max used cells and per-sequence positions (#13808)
* kv-cells : track min/max used cells and per-sequence positions

ggml-ci

* kv-cells : fix pos-modification updates for seq_pos

ggml-ci

* kv-cells : add comments

ggml-ci
2025-05-27 13:49:41 +03:00
Georgi Gerganov f9cd68398b sampling : make sure samplers return at least 1 token (#13822)
* sampling : min-p should always return at least one token

ggml-ci

* sampling : same for typical sampling

* tests : sampling tests use min_keep == 0

ggml-ci
2025-05-27 12:07:52 +03:00
Georgi Gerganov 4f81b33e32 llama : validate seq id batch input (#13809)
* llama : validate seq id batch input

ggml-ci

* cont : fix the fix

ggml-ci
2025-05-27 09:40:59 +03:00
Olivier Chafik cdf94a1802 server: --offline mode (#13804)
* server: --offline mode (env: LLAMA_OFFLINE)

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-05-26 22:34:27 +01:00
Georgi Gerganov a26c4cc11e scripts : add option to compare commits in Debug (#13806)
* scripts : add option to compare commits in Debug

* cont : reuse existing CMAKE_OPTS
2025-05-26 22:24:01 +03:00
Georgi Gerganov 4265a87b59 cuda : avoid cuGetErrorString (#13791)
ggml-ci
2025-05-26 22:14:52 +03:00
Akarshan Biswas 6f180b915c SYCL: Add non contiguous support in RMS_NORM and NORM kernels (#13611)
* SYCL: Add non contiguous input support to norm kernel

* refactor and add RMS_NORM non contiguous input support

ggml-ci

* restore subgroup reduction for multi-subgroup thread blocks in norm kernels

* Swap grid dims of nsamples and nrows

ggml-ci

* Revert "Swap grid dims of nsamples and nrows"

This reverts commit 43be2d657fec7f7fba54e2cd154106bc0fc45adf.

* restore not required changes
ggml-ci

* address review comments: change it to more like SYCL

* Use a common function to calculate offset

* remove wrap around logic for handling broadcasts

* remove static from calculate_offset fn and use ceil_div
2025-05-26 21:10:36 +05:30
Olivier Chafik 03f582ae8f server: fix streaming crashes (#13786)
* add preludes to content on partial regex match

* allow all parsers to parse non-tool-call content.

* tweak order of <|python_tag|> vs <function= parsing for functionary v3.1 format. still not ideal but hopefully less prone to crash
2025-05-26 16:03:57 +01:00
standby24x7 88c125f2ac examples/training: Fix file name in README (#13803)
This patch fixes binary file names in README.md.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
2025-05-26 16:55:24 +02:00
Olivier Chafik d74e94c1b3 server: fix format of streamed tool call deltas (diff name, fix id location) (#13800)
* fix deltas of tool_call.function.name

* fix tool_call.id (was in tool_call.function.id!) + add function type

* add tool_call.type

* populate empty tool_call.function.arguments on first delta
2025-05-26 14:56:49 +01:00
Olivier Chafik f13847cfb5 server: fix regression on streamed non-chat completion w/ stops (#13785)
* more forgiving message diffs: partial stop words aren't erased, full stops are

* Add (slow) server test for completion + stream + stop
2025-05-26 14:16:37 +01:00
Georgi Gerganov 79c137f776 examples : allow extracting embeddings from decoder contexts (#13797)
ggml-ci
2025-05-26 14:03:54 +03:00
Georgi Gerganov 22229314fc llama : clarify deprecation message (#13794) 2025-05-26 12:57:50 +03:00
Romain Biessy 9012eb9b45 sycl: Add more debug prints (#13640) 2025-05-26 10:28:53 +02:00
Jeff Bolz fef693dc6b vulkan: mark IM2COL as supporting non-contig (#13783) 2025-05-26 06:02:07 +02:00
Bizhao Shi 2d38b6e400 CANN: Add the basic supports of Flash Attention kernel (#13627)
* cann: add the basic FA support

* cann: update the readme

* cann: update the FlashAttention with PSEShift

* cann: update the input parameters in FA

* cann: update the alibi with max_bias

* cann: add the constrints of softcap

* cann: update the docs CANN.md

* cann: update the docs CANN.md

* cann: fix typo of CANN.md

* cann: add some comments and update the CANN.md

* cann: update the CANN.md

* cann: update the inner precise for fusedInferAttention

* cann: update the constraints of flash_attn_ext on ggml-cann.cpp

* cann: clean the whitespace

* cann: clean the whitespace

* cann: add a new endline
2025-05-26 10:20:18 +08:00
Olivier Chafik e121edc432 server: add --reasoning-budget 0 to disable thinking (incl. qwen3 w/ enable_thinking:false) (#13771)
---------

Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-05-26 00:30:51 +01:00
Xuan-Son Nguyen 2f099b510f webui : bump max upload file size to 500MB (#13779) 2025-05-25 18:02:18 +01:00
Sigbjørn Skjæret aa50ba462f tests : improve UGM tokenizer test coverage (#13773) 2025-05-25 16:22:29 +02:00
Georgi Gerganov de2ef53a4b kv-cache : rework kv_cell (#13706)
* kv-cache : rework kv_cell

ggml-ci

* kv-cells : use "shift" instead of "delta" consistently

ggml-ci

* llama : add llama_max_parallel_sequences()

ggml-ci

* kv-cells : update comments [no ci]

* context : fail upon construction if sequences exceed max value

ggml-ci

* kv-cells : get_pos() -> pos_get() + comments

ggml-ci

* kv-cells : fix tracking of "used" cells

ggml-ci
2025-05-25 16:34:36 +03:00
Percy Piper c508256db2 rpc : Fix build on OpenBSD (#13541) 2025-05-25 15:35:53 +03:00
Xuan-Son Nguyen 40aaa8a403 mtmd : add support for Qwen2-Audio and SeaLLM-Audio (#13760)
* mtmd : add Qwen2-Audio support

* small clean up

* update discussion link

* clarify mtmd_get_output_embd

* clarification in multimodal.md

* fix ultravox bug

* ggml_cont
2025-05-25 14:06:32 +02:00
ddpasa a08c1d2845 docs : add Moondream2 pre-quantized link (#13745)
* Multimodal: Added Moondream2 model and fixed ggml.org link

* Apply suggestions from code review

---------

Co-authored-by: name <none@none.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-05-25 14:04:49 +02:00
Olivier Chafik d785f9c1fd server: fix/test add_generation_prompt (#13770)
Co-authored-by: ochafik <ochafik@google.com>
2025-05-25 10:45:49 +01:00
Piotr Jasiukajtis 4032ca4066 llama : add support for Qwen3 MoE tied word embeddings (#13768) 2025-05-25 10:29:43 +02:00
Akarshan Biswas 515fdbf7ed SYCL: revert "sycl: simplify bin_bcast_kernel (#13383)" (#13752)
Temporarily reverted due to failing fp16 DIV operation

This reverts commit 02cdd2d8b0.

ggml-ci
2025-05-25 10:08:37 +03:00
Olivier Chafik f5cd27b71d server: streaming of tool calls and thoughts when --jinja is on (#12379)
* add common_json w/ support for truncated json healing

* add common_chat_msg_diff

* partial common_chat_parse

* refactor parser w/ optionals

* server: wire chat diffs in stream mode

* fix trigger of thinking models (must happen after thoughts are closed)

* fix functionary v3.2 raw python!

* rename: common_chat_syntax (now contains format)

* rm common_regex.at_start

* don't return empty <think></think>

* accommodate yet another deepseek r1 distill fantasy syntax (`<|tool▁calls|>`)

* fix QwQ 32B tool call parsing after thoughts (hermes2)

* better logs for grammar triggers

* consume spaces after parse_json_tool_calls

* fix required tool calls w/ thinking models that have pre-opened thinking tags

* fix thinking model's initial trigger + test qwq's template

* run most test_tool_call tests in stream + non-stream modes

* make functionary v3.2 parsing more strict (differentiate first match from others)

* send final diff from server, to close off raw python arguments

* support partial content streaming in Generic mode

* tool-call: allow content prelude before hermes2 tool calls (for Qwen2.5)

* Update function-calling.md

* Update tool_bench.py

* chat-parser: remove input from exception (llm output may contain PII)

---------

Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Olivier Chafik <ochafik@users.noreply.github.com>
2025-05-25 01:48:08 +01:00
Diego Devesa a2d02d5793 releases : bundle llvm omp library in windows release (#13763) 2025-05-25 00:55:16 +02:00
Diego Devesa 17fc817b58 releases : enable openmp in windows cpu backend build (#13756) 2025-05-24 22:27:03 +02:00
Diego Devesa 2bd1b30f69 ggml-cpu : set openmp wait time if not set (#13758) 2025-05-24 22:26:47 +02:00
0cc4m 259469c4b5 Move GLM4 f32 attention fix to the correct function (#13750) 2025-05-24 16:49:12 +02:00
Xuan-Son Nguyen 4c32832c59 ggml : add ggml_gelu_erf() CUDA kernel (#13719)
* ggml : add ggml_gelu_erf() CUDA kernel

* missing semicolon
2025-05-24 13:06:47 +02:00
Sigbjørn Skjæret c3a2624339 vocab : fix ugm tokenizer precision (#13743) 2025-05-24 12:29:09 +02:00
Johannes Gäßler ffd0eae60b CUDA: fix race condition in FA vector kernels (#13742) 2025-05-24 11:46:19 +02:00
Diego Devesa b775345d78 ci : enable winget package updates (#13734) 2025-05-23 23:14:00 +03:00
Diego Devesa a70a8a69c2 ci : add winget package updater (#13732) 2025-05-23 22:09:38 +02:00
Georgi Gerganov d13d0f6135 hparams : initialize arrays (#13728)
ggml-ci
2025-05-23 20:16:13 +03:00
Xuan-Son Nguyen 8a2afb7520 llama : allow custom list of swa_layers (#13726) 2025-05-23 17:07:04 +02:00
Xuan-Son Nguyen 9ecf3e66a3 server : support audio input (#13714)
* server : support audio input

* add audio support on webui
2025-05-23 11:03:47 +02:00
Chenguang Li faaaff5f94 CANN: Support MUL_MAT_ID for q8_0 and q4_0 (#13705)
* [CANN]Support MUL_MAT_ID Q8 && Q4

Signed-off-by: noemotiovon <757486878@qq.com>

* codestyle adjustment

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
2025-05-23 16:47:53 +08:00
Xuan-Son Nguyen e16c4731c7 ggml : fix the order of ggml_unary_op (#13718) 2025-05-23 08:12:48 +02:00
Jeff Bolz 1dcd01960c vulkan: support CPY from any type to itself (#13695)
Reuse the f16/f32 copy shaders, and just scale the number of elements
according to the type size.
2025-05-23 06:45:02 +02:00
Jeff Bolz c10ed6cbcc vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (#13696) 2025-05-23 06:33:45 +02:00
Judd a127ff1780 use LOG_WARN to replace std::cerr (#13657) 2025-05-23 06:33:08 +02:00
Diego Devesa 3079e9ac8e release : fix windows hip release (#13707)
* release : fix windows hip release

* make single hip release with multiple targets
2025-05-23 00:21:37 +02:00
Georgi Gerganov 8a1d206f1d tts : fix n_ubatch + make WavTokenizer cache-less (#13713)
ggml-ci
2025-05-22 22:21:07 +03:00
Xuan-Son Nguyen 797990c4bc mtmd : add ultravox audio input (#13623)
* convert ok, load ok

* warmup ok

* test

* still does not work?

* fix padding

* temporary give up

* fix merge conflict

* build_ultravox()

* rm test

* fix merge conflict

* add necessary mtmd APIs

* first working version (only 4s of audio)

* will this monster compile?

* fix compile

* please compile

* fPIC

* fix windows

* various fixes

* clean up audio_helpers

* fix conversion

* add some debug stuff

* long audio input ok

* adapt the api

* add --audio arg

* final touch UX

* add miniaudio to readme

* fix typo

* refactor kv metadata

* mtmd_default_marker()
2025-05-22 20:42:48 +02:00
Aaron Teo ab86335760 common: Include torch package for s390x (#13699)
* common: update requirements.txt to include pytorch nightly for s390x

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* common: fix torch installation via pip for s390x

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-05-22 21:31:29 +03:00
Georgi Gerganov cc74d5be99 server : pad small embedding batches (#13692)
ggml-ci
2025-05-22 16:33:39 +03:00
Sigbjørn Skjæret 5be24af73d gguf-py : correct charsmap parameter typing (#13701) 2025-05-22 14:25:05 +02:00
Nicolò Scipione d394a9aedc sycl : Remove waits from function calls (#13702)
* removes the waits in async memcpy functions
2025-05-22 12:54:43 +01:00
Ewan Crawford 6b56a64690 SYCL: Avoid using with SYCL-Graph for unsupported nodes (#13587)
Currently on a CUDA backend to SYCL when running
`GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0` there
are two operations that throw an exception from the blocking
waits during queue recording.

* `-o CONCAT` : Use of blocking waits on a queue that's being recorded https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/concat.cpp#L185-L187
* `-o MUL_MAT_ID`: Blocking wait on a recording queue for a copy to host memory https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/ggml-sycl.cpp#L3072-L3074

We've noticed that `ggml-cuda.cu` has the
[check_node_graph_compatibility_and_refresh_copy_ops](https://github.com/ggml-org/llama.cpp/blob/39e73ae0d69f882d7e29cecc6dd8f5052fca6731/ggml/src/ggml-cuda/ggml-cuda.cu#L2458-L2458)
method for checking if a graph can be used, even if enabled. I've taken a
similar approach in this PR by adding a method to `ggml-sycl.cpp` for checking
if a graph can be used for the operations even if a user has asked for it to be
enabled.
2025-05-22 16:24:09 +08:00
Henry Linjamäki a4e8912dfd opencl: Add support for multiple devices (#12622)
* opencl: Add support for multiple devices

... but limited to one platform. A platform with a GPU will be preferred.

Additionally:

* Filter out devices that lack capabilities needed by the backend
  implementation (half support, OpenCL 2.0+, etc).

* Make ggml_backend_opencl_reg() thread-safe.

* fixup: fix an error in sync_with_other_backends

... when there is only one OpenCL device available.
2025-05-21 16:21:45 -07:00
Henry Linjamäki edbf42edfd opencl: fix couple crashes (#12795)
* opencl: fix couple crashes

* fix kernel launches failed on devices which do not support
  non-uniform work-groups. When non-uniform work-groups are not
  supported, set `local_work_size` to NULL (= let driver choose the
  work-group sizes). This patch does not cover everything - just the
  cases tested by test-backend-ops.

* fix sub-buffer creation failed due to `cl_buffer_region::origin` not
  being aligned to `CL_DEVICE_MEM_BASE_ADDR_ALIGN`.

* OpenCL: query non-uniform WG sizes only on OpenCL 3.0+
2025-05-21 13:21:17 -07:00
Diego Devesa d643bb2c79 releases : build CPU backend separately (windows) (#13642) 2025-05-21 22:09:57 +02:00
Georgi Gerganov 8e186ef0e7 hparams : support models for which all layers use SWA (#13682)
ggml-ci
2025-05-21 20:00:49 +03:00
Georgi Gerganov 5fbfe384d4 server : improve error reporting (#13680) 2025-05-21 19:46:56 +03:00
antichristHater c76532e7ba convert : add qwen2vl support for unsloth merges (#13686) 2025-05-21 18:40:35 +02:00
Sigbjørn Skjæret 2aa777d86d examples : switch retrieval to llama_encode (#13685)
* switch retrieval to llama_encode

* enable --no-warmup for retrieval
2025-05-21 16:57:38 +02:00
Emmanuel Ferdman eb0f5c28d3 gguf-py : display the invalid gguf type (#13687)
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
2025-05-21 16:33:54 +02:00
Xuan-Son Nguyen cf4cb59e64 ggml : add ggml_gelu_erf() (#13667)
* ggml : add ggml_gelu_na (not approximated)

* fix naming order

* rename na --> erf

* apply review suggesions

* revert naming order
2025-05-21 16:26:33 +02:00
Robin Davidsson 0d5c742161 server : Add the endpoints /api/tags and /api/chat (#13659)
* Add the endpoints /api/tags and /api/chat

Add the endpoints /api/tags and /api/chat, and improved the model metadata response

* Remove trailing whitespaces

* Removed code that is not needed for copilot to work.
2025-05-21 15:15:27 +02:00
Dorin-Andrei Geman 42158ae2e8 server : fix first message identification (#13634)
* server : fix first message identification

When using the OpenAI SDK (https://github.com/openai/openai-node/blob/master/src/lib/ChatCompletionStream.ts#L623-L626) we noticed that the expected assistant role is missing in the first streaming message. Fix this by correctly checking for the first message.

Co-authored-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
Signed-off-by: Dorin Geman <dorin.geman@docker.com>

* server : Fix checks for first role message for stream=True

Co-authored-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
Signed-off-by: Dorin Geman <dorin.geman@docker.com>

---------

Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Co-authored-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-05-21 15:07:57 +02:00
Georgi Gerganov 797f2ac062 kv-cache : simplify the interface (#13660)
* kv-cache : simplify the interface

ggml-ci

* context : revert llama_batch_allocr position change

ggml-ci
2025-05-21 15:11:13 +03:00
Georgi Gerganov b44890df2e model : disable SWA for Phi models (#13676)
* model : disable SWA for Phi models

ggml-ci

* model : update warning message

* model : print warning only if n_swa > 0

* model : fix typo
2025-05-21 13:09:21 +03:00
R0CKSTAR 33983057d0 musa: Upgrade MUSA SDK version to rc4.0.1 and use mudnn::Unary::IDENTITY op to accelerate D2D memory copy (#13647)
* musa: fix build warning (unused parameter)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: upgrade MUSA SDK version to rc4.0.1

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: use mudnn::Unary::IDENTITY op to accelerate D2D memory copy

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Update ggml/src/ggml-cuda/cpy.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* musa: remove MUDNN_CHECK_GEN and use CUDA_CHECK_GEN instead in MUDNN_CHECK

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-05-21 09:58:49 +08:00
Eve fb1cab201c vulkan: fix warnings (#13626)
* small fixes

* remove ifdef
2025-05-20 21:35:16 +00:00
l3utterfly b7a17463ec mtmd-helper : bug fix to token batching in mtmd (#13650)
* Update mtmd-helper.cpp

* Update tools/mtmd/mtmd-helper.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-05-20 18:55:30 +02:00
Georgi Gerganov be0239693c model : fix llama4 graph (#13663)
ggml-ci
2025-05-20 19:21:04 +03:00
Georgi Gerganov a4090d1174 llama : remove llama_kv_cache_view API + remove deprecated (#13653)
ggml-ci
2025-05-20 16:13:16 +03:00
Johannes Gäßler b69f1647f9 CUDA: skip fully masked-out KV in FA vec kernel (#13584)
* CUDA: skip fully masked-out KV in FA vec kernel
2025-05-20 14:45:07 +02:00
Sigbjørn Skjæret 759e37b0d8 tests : avoid github urls due to throttling (#13654) 2025-05-20 12:03:17 +02:00
Svetlozar Georgiev 4245e622e0 sycl: disable reorder for sycl mulmat (#13536) 2025-05-20 11:34:15 +02:00
0cc4m c9c64dee57 Set GLM4 blk.*.attn_output.weight, kqv_out-* matmul to GGML_PREC_F32 to fix infinity values in output (#13639) 2025-05-20 10:11:56 +02:00
Georgi Gerganov c00a2634be metal : fix typo in FA kernel comments (#13651) 2025-05-20 10:41:40 +03:00
Georgi Gerganov e298d2fbd0 kv-cache : add SWA support (#13194)
* kv-cache : prepare for SWA

ggml-ci

* kv-cache : initial iSWA implementation

ggml-ci

* kv-cache : rework error recovery logic

ggml-ci

* models : fix Phi-3 SWA parameters

ggml-ci

* model : adjust Granite to rope factor changes

ggml-ci

* server : check if context can do shifts

ggml-ci

* iswa : for now, always enable shifts (experiment)

ggml-ci

* kv-cache : simplify SWA logic

ggml-ci

* kv-cache : apply defrag when we fail to find slots for the batch

ggml-ci

* llama : update docs about llama_decode

ggml-ci

* kv-cache : update warning logs when no space for the batch is available

ggml-ci

* llama : add llama_kv_self_seq_pos_min()

* kv-cache : keep track of partial SWA computes and print warnings

* server : disallow use cases involving partial SWA context

ggml-ci

* llama : add param to control SWA cache size

ggml-ci

* minor : clean-up

ggml-ci
2025-05-20 08:05:46 +03:00
Xinpeng Dou f0adb80bf7 CANN: Update CANN model support (#13162)
* Update CANN model support status

* Update of model support

* update

* update

* update

* fix format of CANN.md

* fix format of CANN.md

* fix format of CANN.md
2025-05-20 11:43:43 +08:00
Nicolò Scipione f7c9429c85 sycl : Overcoming workaround for mmap() allocation on Windows (#13482)
* Remove mmap workaround on windows

After some testing I found that mmap is supported on windows and for
many GPUs on Linux. Therefore I remove the workaround for windows since
it is not necessary.

* Update llama-bench README

SYCL backend introduced a workaround that allows execution of
llama-bench also without specifying `--mmp 0` flag
2025-05-20 08:54:43 +08:00
psocolovsky 1dfbf2cf3a common : add load_progress_callback (#13617) 2025-05-19 21:17:36 +02:00
0cc4m 8960efd0a6 Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence (#13607) 2025-05-19 17:54:08 +02:00
Alberto Cabrera Pérez 725f23f1f3 sycl : backend documentation review (#13544)
* sycl: reviewing and updating docs

* Updates Runtime error codes

* Improves OOM troubleshooting entry

* Added a llama 3 sample

* Updated supported models

* Updated releases table
2025-05-19 14:38:20 +01:00
Xuan-Son Nguyen 92ecdcc06a mtmd : add vision support for llama 4 (#13282)
* wip llama 4 conversion

* rm redundant __init__

* fix conversion

* fix conversion

* test impl

* try this

* reshape patch_embeddings_0

* fix view

* rm ffn_post_norm

* cgraph ok

* f32 for pos embd

* add image marker tokens

* Llama4UnfoldConvolution

* correct pixel shuffle

* fix merge conflicts

* correct

* add debug_graph

* logits matched, but it still preceives the image incorrectly

* fix style

* add image_grid_pinpoints

* handle llama 4 preprocessing

* rm load_image_size

* rm unused line

* fix

* small fix 2

* add test & docs

* fix llava-1.6 test

* test: add notion of huge models

* add comment

* add warn about degraded quality
2025-05-19 13:04:14 +02:00
Alberto Cabrera Pérez f71f40a284 ci : upgraded oneAPI version in SYCL workflows and dockerfile (#13532) 2025-05-19 11:46:09 +01:00
Georgi Gerganov d30cb5a7fa sync : ggml
ggml-ci
2025-05-19 13:29:56 +03:00
Johannes Gäßler 6c35981a64 mnist: fix segmentation fault (ggml/1227) 2025-05-19 13:29:56 +03:00
Diego Devesa 8b5e19aea6 ggml : fix apple OS check in ggml_print_backtrace (ggml/1229) 2025-05-19 13:29:56 +03:00
Daniel Tang 60aea028b5 ggml : Fix missing backtrace on Linux (ggml/1228)
* Modern Linux defaults /proc/sys/kernel/yama/ptrace_scope to 1
* Fixed lldb attach
* Simplify by having the child do ggml_print_backtrace_symbols
2025-05-19 13:29:56 +03:00
Nick 9c55e5c5c2 fix: check model pointer validity before use (#13631) 2025-05-19 13:25:41 +03:00
Chenguang Li 33d7aed4a8 CANN: Support MOE Model MUL_MAT_ID (#13042)
Signed-off-by: noemotiovon <757486878@qq.com>
2025-05-19 14:21:17 +08:00
Isaac McFadyen 6a2bc8bfb7 server : added --no-prefill-assistant flag (#13608)
* added no-prefill-assistant flag

* reworded documentation comment

* updated server README.md
2025-05-17 23:59:48 +02:00
Gilad S. e3a7cf6c5b cmake: use the current build config for vulkan-shaders-gen (#13595)
* fix: use the current build config for `vulkan-shaders-gen`

* fix: only pass a valid build type to `--config`
2025-05-17 15:26:43 -03:00
Georgi Gerganov 518329b2d4 parallel : add option for non-shared and larger prompts (#13598)
* parallel : add option for non-shared and larger prompts

* parallel : update readme [no ci]

* cont : add note about base models [no ci]

* parallel : better var name

ggml-ci
2025-05-17 12:58:55 +03:00
Jeff Bolz 2f5a4e1e09 vulkan: move common FA code to flash_attn_base.comp (#13556)
* vulkan: move common FA code to flash_attn_base.comp

* vulkan: move common FA index/stride setup code to flash_attn_base.comp

* build fix
2025-05-17 09:14:55 +02:00
Jeff Bolz 4f41ee11d6 vulkan: use scalar FA rather than coopmat2 when N==1 (#13554) 2025-05-17 08:35:47 +02:00
Z 3e0be1cace llguidance : official v0.7.20 release (no actual changes) [noci] (#13594) 2025-05-16 22:56:28 +02:00
Xuan-Son Nguyen 6aa892ec2a server : do not return error out of context (with ctx shift disabled) (#13577) 2025-05-16 21:50:00 +02:00
Xuan-Son Nguyen aea9f8b4e7 webui : improve accessibility for visually impaired people (#13551)
* webui : improve accessibility for visually impaired people

* add a11y for extra contents

* fix some labels being read twice

* add skip to main content
2025-05-16 21:49:01 +02:00
Xuan-Son Nguyen 06c1e4abc1 readme : add list of dependencies and their license (#13591) 2025-05-16 20:04:18 +02:00
Diego Devesa 415e40a357 releases : use arm version of curl for arm releases (#13592) 2025-05-16 19:36:51 +02:00
Georgi Gerganov 654a67794f metal : add FA-vec kernel for head size 64 (#13583)
ggml-ci
2025-05-16 20:32:58 +03:00
Diego Devesa 5364ae4ba5 llama : print hint when loading a model when no backends are loaded (#13589) 2025-05-16 16:38:07 +02:00
Sigbjørn Skjæret 7c07ac244d ci : add ppc64el to build-linux-cross (#13575) 2025-05-16 14:54:23 +02:00
Łukasz Ślusarczyk 0a338ed013 sycl : fixed compilation warnings (#13582) 2025-05-16 18:15:29 +08:00
Olivier Chafik bc098c3cf0 minja: sync (qwen3) (#13573)
* minja: sync https://github.com/google/minja/commit/f06140fa52fd140fe38e531ec373d8dc9c86aa06

- https://github.com/google/minja/pull/67 (@grf53)
- https://github.com/google/minja/pull/66 (@taha-yassine)
- https://github.com/google/minja/pull/63 (@grf53)
- https://github.com/google/minja/pull/58

---------

Co-authored-by: ochafik <ochafik@google.com>
2025-05-15 23:29:10 +01:00
Diego Devesa c6a2c9e741 gguf : use ggml log system (#13571)
* gguf : use ggml log system

* llama : remove unnecessary new lines in exception messages
2025-05-15 19:13:11 +02:00
Daniel Tang 07ad2b6db3 gguf-py : fix disconnect-before-connect in editor-gui (#13569)
The bug caused a crash upon load with venvs created with
--system-site-packages to use
python3-pyside6.qtwidgets=python3-pyside6.qtwidgets=6.6.2-4
from Kubuntu 24.10.
2025-05-15 18:47:10 +02:00
Xuan-Son Nguyen c531edfa34 convert : fix conversion for llama 4 (#13567) 2025-05-15 17:40:07 +02:00
Atharva Dubey 02cdd2d8b0 sycl: simplify bin_bcast_kernel (#13383) 2025-05-15 17:39:52 +02:00
Svetlozar Georgiev 64bb51cf90 sycl: reordered Q4_K MMVQ (#13109) 2025-05-15 17:35:44 +02:00
Łukasz Ślusarczyk 9c404ed54c sycl: use oneDNN for matrices multiplication (#12972) 2025-05-15 16:53:41 +02:00
Diego Devesa 6c8b91500e llama-bench : fix -ot with dl backends (#13563) 2025-05-15 15:46:55 +02:00
Xuan-Son Nguyen 3cc1f1f1d2 webui : handle PDF input (as text or image) + convert pasted long content to file (#13562)
* webui : handle PDF input (as text or image)

* handle the case where pdf image + server without mtmd

* fix bug missing pages
2025-05-15 14:24:50 +02:00
Piotr Wilkin (ilintar) c753d7bed0 server : proper error handling for missing elements in messages array (OpenAI compatible backend) (#13540) 2025-05-15 08:40:58 +02:00
Georgi Gerganov b2838049cc bench : handle decode errors (#13548)
ggml-ci
2025-05-15 05:57:02 +03:00
Olivier Chafik aa48e373f2 server: inject date_string in llama 3.x template + fix date for firefunction v2 (#12802)
* Inject date_string in llama 3.x + fix for functionary v2

https://github.com/ggml-org/llama.cpp/issues/12729

* move/fix detection of functionary v3.1 before llama 3.x, fix & test their non-tool mode

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* generate more tokens in test_completion_with_required_tool_tiny_fast to avoid truncation

---------

Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-05-15 02:39:51 +01:00
Georgi Gerganov e3a9421b78 kv-cache : fix out-of-bounds view during reserve graph (#13547)
* kv-cache : fix reserve graph out-of-bounds access

ggml-ci

* cont : add comment

* cont : fix comments [no ci]

* cont : more correct comment [no ci]
2025-05-14 23:15:15 +03:00
Yibo Cai 5ab5d5fb25 arm64: optimize q6_k_q8_k kernel with i8mm (#13519)
This PR improves q6_k_q8_k gemm kernel with arm64 i8mm instruction.

Tested on neoverse-n2 with llama3 8b q6_k quantization model.
- 40% ~ 54% S_PP uplift for all batch sizes
- 16% ~ 47% S_TG uplift for batch size 4 and above

Perplexity doesn't change with this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q6_K.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |    78.52 |   109.18 |    18.63 |    18.88 |
|   128 |    128 |    2 |    84.62 |   123.94 |    34.54 |    36.92 |
|   128 |    128 |    4 |    84.36 |   122.49 |    52.65 |    61.32 |
|   128 |    128 |    8 |    90.52 |   138.87 |    63.46 |    84.41 |
|   128 |    128 |   16 |    90.11 |   138.56 |    71.04 |   101.33 |
|   128 |    128 |   32 |    89.81 |   137.79 |    75.14 |   110.47 |
---------------------------------------------------------------------
```
2025-05-14 21:53:52 +02:00
Olivier Chafik 3198405e98 common: add partial regex support (#12808)
* move string_find_partial_stop & string_ends_with to common

* add common_regex (supports partial matches)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/regex-partial.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/regex-partial.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/regex-partial.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* partial regex: add missing iterator end checks

* string utils: use string_views

* direct throw to avoid ggml.h include

* regex-partial: replace missed ggml_asserts

---------

Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-05-14 19:50:57 +01:00
Sigbjørn Skjæret f5170c1d7a editorconfig : fix trailing whitespace from #13542 (#13546) 2025-05-14 21:22:49 +03:00
Gilad S. 017f10b5fa fix: crash when calling llama_state_get_size on a context without a KV cache (#13542) 2025-05-14 19:18:18 +03:00
Johannes Gäßler 4696d56749 CUDA: fix crash on large batch size for quant. MoE (#13537) 2025-05-14 16:41:02 +02:00
Diego Devesa b7d2672082 llama : fix quantize with dl backends (#13539) 2025-05-14 16:12:36 +02:00
Johannes Gäßler 6da34fa276 CUDA: faster Deepseek FA, add Turing support (#13435) 2025-05-14 16:08:20 +02:00
Gabe Goodhart 5e7d95e22e fix: Move build_inp_pos to the top of the graph section for build_granite (#13538)
This matches how others do it, but will still avoid the extra
initialization when rope is disabled.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-05-14 15:53:59 +03:00
Georgi Gerganov 053174436f server : passthrough the /models endpoint during loading (#13535)
* server : passthrough the /models endpoint during loading

* server : update readme + return json for "meta" field
2025-05-14 15:42:10 +03:00
Xuan-Son Nguyen 360a9c98e1 server : fix cache_tokens bug with no cache_prompt (#13533) 2025-05-14 13:35:07 +02:00
bandoti 09d13d94fb cmake: simplify vulkan shader test logic (#13263) 2025-05-14 07:53:57 -03:00
Jeff Bolz 24e86cae72 vulkan: KHR_coopmat flash attention (#13506)
This shader uses coopmat1 to do the Q*K^T multiply. The P*V multiply is more
difficult for various reasons so I haven't done it. Performance for this
shader is around 2.5x better than for the scalar shader when doing prompt
processing. Some of the benefit may be from other optimizations like staging
through shared memory, or splitting by rows.
2025-05-14 11:55:26 +02:00
Xuan-Son Nguyen bb1681fbd5 webui : use fflate for more deterministic gzip compress (#13525)
* webui : use pako for more deterministic gzip compress

* simpler code

* use fflate instead of pako
2025-05-14 10:26:12 +02:00
Luca Stefani d486dd3e8e webui: Allow pasting file from clipboard (#13526)
* server: Allow pasting file from clipboard

* server: Prevent default action on file paste

* update build

* format then build combined

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-05-14 10:07:31 +02:00
ddpasa 21ca987fba docs: Update link to ggml-org in multimodal.md (#13513)
* Update multimodal.md

Minor change to include the huggingface link

* Update docs/multimodal.md

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-05-14 09:59:12 +02:00
Sigbjørn Skjæret be1d4a13db scripts : fix compare-llama-bench.py show parameter (#13514) 2025-05-14 08:41:01 +02:00
Jeff Bolz ab3971f2a0 vulkan: workaround FA compile failures on macos (#13517) 2025-05-14 06:15:50 +02:00
Ed Addario e5c834f718 quantize : improve tensor-type pattern matching (#13033) 2025-05-13 19:12:31 +02:00
Xuan-Son Nguyen 71bdbdb587 clip : clip.h become private API (⚠️ breaking change) (#13510) 2025-05-13 17:07:21 +02:00
Georgi Gerganov f0995d28ce metal : use FA-vec kernel up to batch size 20 (#13496)
* batched-bench : fix pp batch contents

* metal : optimize multi-sequence FA vec kernel

ggml-ci

* metal : use FA-vec kernel up to batch size 20

ggml-ci
2025-05-13 18:04:39 +03:00
Georgi Gerganov c252e0c409 metal : optimize multi-sequence FA vec kernel (#13493)
* batched-bench : fix pp batch contents

* metal : optimize multi-sequence FA vec kernel

ggml-ci
2025-05-13 18:04:00 +03:00
Dan Johansson 4f711afed5 ggml-cpu: Update KleidiAI to v1.6 and fix include directives (#13509)
Signed-off-by: Dan Johansson <dan.johansson@arm.com>
2025-05-13 18:02:28 +03:00
Georgi Gerganov b89d605a91 batched-bench : fix pp batch contents (#13492) 2025-05-13 18:01:53 +03:00
Xuan-Son Nguyen b4726345ac mtmd : remove libllava, remove clip-quantize-cli (⚠️ breaking change) (#13460)
* mtmd : remove libllava, remove clip-quantize-cli

* rm clip_model_quantize
2025-05-13 15:33:58 +02:00
Sigbjørn Skjæret bf79371120 scripts : support arbitrary input file formats in compare-llama-bench.py (#13455) 2025-05-13 15:31:12 +02:00
Gabe Goodhart d590cd4c24 model : Granite MoE shared (#13269)
* feat: Add GGUF conversion for granitemoeshared

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: hparam and arch plumbing for granitemoeshared

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Split MoE fused tensors for shared experts in conversion

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: First WIP cut at model arch in cpp

The hparam and architecture plumbing should be correct, but the
implementation of the shared experts seems to still be broken.

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Cleaner (maybe more correct?) splitting for gate/up

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix the input to the shared experts

I had misread that the shared experts take the inputs _before_ the standard
MoE layer and was feeding the output of the MoE to the shared experts.

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Avoid architecture-specific checks for Granite MoE Shared

This is a cleaner way that will allow more flexibility in architecture
strings going forward.

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Split granite architectures out of llm_build_llama

This helps de-clutter the llama-family graph construction and allows
granite to diverge further (in preparation for Granite 4).

NOTE: I removed the granite scale factors from llm_build_deci because they
appear to only be there as copy-paste from llm_build_llama. The HF config
does not seem to set those values:
https://huggingface.co/Deci/DeciLM-7B/blob/main/config.json

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix compiler warning about uninitialized inp_pos

This should not have been reachable, but it warns on some compliers

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Consoladate GraniteMoEShared into GraniteMoE for conversion

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Consolidate GraniteMoEShared into GraniteMoE on the c++ side

Branch: GraniteMoEShared

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-05-13 15:12:01 +02:00
Georgi Gerganov 1e2809bc4b sync : ggml 2025-05-13 14:02:28 +03:00
Diego Devesa cf0a43bb64 llama-bench : add defrag-thold, check for invalid ranges (#13487) 2025-05-13 00:31:37 +02:00
lhez f0d46ef157 opencl: remove unnecessary assert for add (#13257) 2025-05-12 13:13:49 -07:00
Xuan-Son Nguyen de4c07f937 clip : cap max image size 1024 for qwen vl model (#13478) 2025-05-12 15:06:51 +02:00
Johannes Gäßler 10d2af0eaa llama/ggml: add LLM training support (#10544)
* llama/ggml: add LLM training support

more compact progress bar

llama_save_model_to_file

llama_opt_param_filter

ggml_graph_dup force_grads

refactor ggml_opt, fix test-opt

* remove logits_all

* refactor CUDA implementation for ACC

* reset graph at beginning of opt period
2025-05-12 14:44:49 +02:00
Georgi Gerganov 064cc596ac context : fix state io for memory-less contexts (#13470)
ggml-ci
2025-05-12 15:12:27 +03:00
Anudit Nagar 91159ee9df server : allow content to be null in oaicompat_completion_params_parse (#13477) 2025-05-12 13:56:42 +02:00
Diego Devesa 22cdab343b llama-bench : accept ranges for integer parameters (#13410) 2025-05-12 13:08:22 +02:00
Dan Johansson a71a4075cd ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel (#13053)
* ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel

Signed-off-by: Dan Johansson <dan.johansson@arm.com>

* * code review fixes

Signed-off-by: Dan Johansson <dan.johansson@arm.com>

* * adds a comment that clarifies barrier usage

Signed-off-by: Dan Johansson <dan.johansson@arm.com>

---------

Signed-off-by: Dan Johansson <dan.johansson@arm.com>
Co-authored-by: Charles Xu <charles.xu@arm.com>
2025-05-12 13:06:19 +02:00
Johannes Gäßler 95e18884fc CUDA: fix misaligned synchronization in FA (#13469) 2025-05-12 10:51:21 +02:00
Xuan-Son Nguyen df8491922f ggml : add mrope kernel for metal (#13457) 2025-05-12 10:29:13 +02:00
Atharva Dubey 14492144c2 enable dpcpp nightly builds with libraries (#13406) 2025-05-12 13:15:32 +08:00
City c104023994 mtmd : Use RMS norm for InternVL 3 38B and 78B mmproj (#13459) 2025-05-12 00:39:06 +02:00
Anthony Umfer 9a390c4829 tools : fix uninitialized llama_batch in server (#13436)
* add constructor to initialize server_context::batch, preventing destructor's call to llama_batch_free from causing an invalid free()

* Update tools/server/server.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* use C++11 initializer syntax

* switch from Copy-list-initialization to Direct-list-initialization

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-05-11 17:08:26 +02:00
Sigbjørn Skjæret 09232370fc scripts : exit compare-llama-bench.py gracefully when there's nothing to compare (#13451) 2025-05-11 16:20:39 +02:00
Johannes Gäßler 7474e00b34 CUDA: fix crash with partial offloading of MoE (#13439) 2025-05-11 16:09:33 +02:00
David Huang 7f323a589f Add --no-op-offload to improve -ot pp perf in MoE models like llama4 400B (#13386) 2025-05-11 14:18:39 +02:00
City 3eac209319 mtmd : support InternVL 3 38B and 78B mmproj (#13443)
* Support InternVL 3 38B and 78B mmproj

* Swap norms in clip.cpp

* Group variables together
2025-05-11 11:35:52 +02:00
Xuan-Son Nguyen a634d75d1b mtmd : move helpers to dedicated file (#13442)
* mtmd : move helpers to dedicated file

* fix windows build

* rm redundant include
2025-05-11 11:34:23 +02:00
Thomas Germer 62d4250e52 docs : Fix typo in InternVL3 model name (#13440) 2025-05-10 22:26:46 +02:00
Johannes Gäßler 0208355f42 CUDA: fix race conditions FlashAttention kernels (#13438) 2025-05-10 22:22:48 +02:00
Sigbjørn Skjæret d2a4ef05c6 vocab : add ByteDance-Seed/Seed-Coder (#13423) 2025-05-10 22:08:07 +02:00
Xuan-Son Nguyen 15e6125a39 mtmd : add hard limit on image resolution for qwen2vl / qwen2.5vl (#13434)
* mtmd : add hard limit on image resolution for qwen2vl / qwen2.5vl

* fix typo
2025-05-10 19:57:54 +02:00
Xuan-Son Nguyen 3b24d26c22 server : update docs (#13432) 2025-05-10 18:44:49 +02:00
Sigbjørn Skjæret 43dfd741a5 llguidance : set tokenizer slices to default (#13424) 2025-05-10 17:19:52 +02:00
Thammachart Chinvarapon b064a51a4e ci: free_disk_space flag enabled for intel variant (#13426)
before cleanup: 20G
after cleanup: 44G
after all built and pushed: 24G

https://github.com/Thammachart/llama.cpp/actions/runs/14945093573/job/41987371245
2025-05-10 16:34:48 +02:00
Xuan-Son Nguyen 053367d149 mtmd : support InternVL 2.5 and 3 (#13422)
* convert : internvl support

* InternVL3-1B working

* fix regression

* rm mobilevlm from test

* fix conversion

* add test for internvl

* add to list of pre-quant

* restore boi/eoi check

* add clarify comment for norm eps
2025-05-10 16:26:42 +02:00
Johannes Gäßler d8919424f1 CUDA: fix FlashAttention on Turing (#13415) 2025-05-10 09:16:52 +02:00
Xuan-Son Nguyen 7fef11766c arg : add env var to control mmproj (#13416)
* arg : add env var to control mmproj

* small note about -hf --mmproj
2025-05-10 08:16:29 +02:00
Jeff Bolz dc1d2adfc0 vulkan: scalar flash attention implementation (#13324)
* vulkan: scalar flash attention implementation

* vulkan: always use fp32 for scalar flash attention

* vulkan: use vector loads in scalar flash attention shader

* vulkan: remove PV matrix, helps with register usage

* vulkan: reduce register usage in scalar FA, but perf may be slightly worse

* vulkan: load each Q value once. optimize O reduction. more tuning

* vulkan: support q4_0/q8_0 KV in scalar FA

* CI: increase timeout to accommodate newly-supported tests

* vulkan: for scalar FA, select between 1 and 8 rows

* vulkan: avoid using Float16 capability in scalar FA
2025-05-10 08:07:07 +02:00
Helton Reis 7c28a74e07 chore(llguidance): use tagged version that does not break the build (#13413) 2025-05-09 23:15:39 +03:00
Xuan-Son Nguyen 33eff40240 server : vision support via libmtmd (#12898)
* server : (experimental) vision support via libmtmd

* mtmd : add more api around mtmd_image_tokens

* mtmd : add more api around mtmd_image_tokens

* mtmd : ability to calc image hash

* shared_ptr for mtmd_image_tokens

* move hash to user-define ID (fixed)

* abstract out the batch management

* small fix

* refactor logic adding tokens to batch

* implement hashing image

* use FNV hash, now hash bitmap instead of file data

* allow decoding image embedding to be split into batches

* rm whitespace

* disable some features when mtmd is on

* fix --no-mmproj-offload

* mtmd_context_params no timings

* refactor server_inp to server_tokens

* fix the failing test case

* init

* wip

* working version

* add mtmd::bitmaps

* add test target

* rm redundant define

* test: mtmd_input_chunks_free

* rm outdated comment

* fix merging issue

* explicitly create mtmd::input_chunks

* mtmd_input_chunk_copy

* add clone()

* improve server_input struct

* clip :  fix confused naming ffn_up and ffn_down

* rm ffn_i/o/g naming

* rename n_embd, n_ff

* small fix

* no check n_ff

* fix detokenize

* add const to various places

* add warning about breaking changes

* add c api

* helper: use mtmd_image_tokens_get_n_pos

* fix ctx_shift

* fix name shadowing

* more strict condition

* support remote image_url

* remote image_url log

* add CI test

* do not log base64

* add "has_multimodal" to /props

* remove dangling image

* speculative: use slot.cache_tokens.insert

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* rm can_be_detokenized

* on prmpt processing done, assert cache_tokens.size

* handle_completions_impl returns void

* adapt the new web ui

* update docs and hot topics

* rm assert

* small fix (2)

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-05-09 19:29:37 +02:00
Alberto Cabrera Pérez 17512a94d6 sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (#12858)
* sycl : Implemented reorder Q4_0 mmvq

Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>

* sycl : Fixed mmvq being called when reorder is disabled

* sycl : Improved comments in the quants header

Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>

* Use static_assert

* safe_div -> ceil_div

* Clarify qi comment

* change the reorder tensor from init to execute OP

* dbg

* Undo changes to test-backend-ops

* Refactor changes on top of q4_0 reorder fix

* Missing Reverts

* Refactored opt_for_reorder logic to simplify code path

* Explicit inlining and unroll

* Renamed mul_mat_algo enum for consistency

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>
Co-authored-by: romain.biessy <romain.biessy@codeplay.com>
2025-05-09 16:34:08 +01:00
Georgi Gerganov 611aa914ef metal : optimize MoE for large batches (#13388)
ggml-ci
2025-05-09 15:14:56 +03:00
Johannes Gäßler 0cf6725e9f CUDA: FA support for Deepseek (Ampere or newer) (#13306)
* CUDA: FA support for Deepseek (Ampere or newer)

* do loop unrolling via C++ template
2025-05-09 13:34:58 +02:00
Diego Devesa 27ebfcacba llama : do not crash if there is no CPU backend (#13395)
* llama : do not crash if there is no CPU backend

* add checks to examples
2025-05-09 13:02:07 +02:00
Johannes Gäßler 5c86c9ed3e CUDA: fix crash on large batch size for MoE models (#13384) 2025-05-09 12:14:04 +02:00
Bartowski efb8b47eda imatrix : Add --parse-special for enabling parsing of special tokens in imatrix calculation (#13389)
* Add --parse-special for enabling parsing of special tokens in imatrix calculation

* whitespace
2025-05-09 11:53:58 +02:00
R0CKSTAR 0527771dd8 llama-run: add support for downloading models from ModelScope (#13370)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-05-09 10:25:50 +01:00
Xuan-Son Nguyen 2189fd3b63 mtmd : fix batch_view for m-rope (#13397)
* mtmd : fix batch_view for m-rope

* nits : fix comment
2025-05-09 11:18:02 +02:00
Xuan-Son Nguyen 3f96aeff39 llama : one-off chat template fix for Mistral-Small-2503 (#13398)
* llama : one-off chat template fix for Mistral-Small-2503

* update readme

* add mistral-v7-tekken
2025-05-09 11:17:51 +02:00
Radoslav Gerganov b486ba05bf rpc : add rpc_msg_set_tensor_hash_req (#13353)
* rpc : add rpc_msg_set_tensor_hash_req

Use a dedicated struct for the request of RPC_CMD_SET_TENSOR_HASH which
makes the code cleaner.

* fix
2025-05-09 10:31:07 +03:00
Jeff Bolz 02115dcd9a vulkan: Allow up to 4096 elements for mul_mat_id row_ids (#13326)
This assert fired running Qwen_Qwen3-30B-A3B-Q2_K.gguf:

GGML_ASSERT(nei0 * nei1 <= 3072);

The tensor is 8 x 512. Increase this array size to accommodate.
2025-05-09 09:23:41 +02:00
Xuan-Son Nguyen d9c4accaff server : (webui) rename has_multimodal --> modalities (#13393)
* server : (webui) rename has_multimodal --> modalities

* allow converting SVG to PNG

* less complicated code
2025-05-09 09:06:37 +02:00
Diego Devesa 15e03282bb ci : limit write permission to only the release step + fixes (#13392)
* ci : limit write permission to only the release step

* fix win cuda file name

* fix license file copy on multi-config generators
2025-05-08 23:45:22 +02:00
Matt Clayton f05a6d71a0 mtmd : Expose helper_decode_image_chunk (#13366)
* mtmd: Expose helper_decode_image, output_embd_copy, image_tokens_copy/free

* Slim down

* Cleanups
2025-05-08 20:25:39 +02:00
Xuan-Son Nguyen ee01d71e58 server : (webui) fix a very small misalignment (#13387)
* server : (webui) fix a very small misalignment

* restore font-bold
2025-05-08 18:51:45 +02:00
Xuan-Son Nguyen 8c83449cb7 server : (webui) revamp the input area, plus many small UI improvements (#13365)
* rework the input area

* process selected file

* change all icons to heroicons

* fix thought process collapse

* move conversation more menu to sidebar

* sun icon --> moon icon

* rm default system message

* stricter upload file check, only allow image if server has mtmd

* build it

* add renaming

* better autoscroll

* build

* add conversation group

* fix scroll

* extra context first, then user input in the end

* fix <hr> tag

* clean up a bit

* build

* add mb-3 for <pre>

* throttle adjustTextareaHeight to make it less laggy

* (nits) missing padding in sidebar

* rm stray console log
2025-05-08 15:37:29 +02:00
Sigbjørn Skjæret 1a844be132 convert : support rope_scaling type and rope_type (#13349) 2025-05-08 15:34:29 +02:00
welix 0ccc121354 mtmd : fix the calculation of n_tokens for smolvlm (#13381)
Co-authored-by: Taichi Nishimura <Taichi.A.Nishimura@sony.com>
2025-05-08 15:03:53 +02:00
Georgi Gerganov 6562e5a4d6 context : allow cache-less context for embeddings (#13108)
* context : allow cache-less context for embeddings

ggml-ci

* context : enable reranking with encode()

ggml-ci

* context : encode() clears embd_seq

ggml-ci

* examples : use llama_encode() when appropriate

ggml-ci

* models : nomic bert moe does not require KV cache

* llama : update comments for llama_decode/llama_encode

ggml-ci

* context : update warning log [no ci]
2025-05-08 14:28:33 +03:00
Georgi Gerganov 51fb96b1ff context : remove logits_all flag (#13284)
* context : remove logits_all flag

ggml-ci

* llama : remove logits_all flag + reorder llama_context_params

ggml-ci
2025-05-08 14:26:50 +03:00
Diego Devesa 70a6991edf ci : move release workflow to a separate file (#13362) 2025-05-08 13:15:28 +02:00
Diego Devesa f061021206 llama : print size and type of overridden tensors (#13364) 2025-05-08 13:15:15 +02:00
Alberto Cabrera Pérez 8733e0cf6e sycl: addressing non-contiguous src1 mul_mats (nc and batched) (#13343)
* sycl: fixed non-contiguous src1 mul_mats (nc and batched)

* Fixed wrong static_cast inside kernel
2025-05-08 10:08:01 +01:00
Diego Devesa 814f795e06 docker : disable arm64 and intel images (#13356) 2025-05-07 16:36:33 +02:00
Georgi Gerganov d879433824 sync : ggml
ggml-ci
2025-05-07 17:28:36 +03:00
Daniel Bevenius 13b0a04597 whisper: remove MSVC warnings pragmas (whisper/3090)
* ggml : remove MSVC warnings pragmas

This commit removes the MSVC-specific pragmas as these are now handled
in ggml/CMakeLists.txt.

* whisper : remove MSVC warning pragmas

This commit removes the MSVC-specific pragmas. These are now handled in
the ggml/CMakeLists.txt file.
2025-05-07 17:28:36 +03:00
Jared Tweed bba9d945c1 cmake : removed stdc++fs (whisper/3097)
* removed stdc++fs

* kept line, but removed stdc++fs
2025-05-07 17:28:36 +03:00
Sigbjørn Skjæret bc4e1128f7 llama : deci : support ffn-free with attention (#13296) 2025-05-07 12:49:27 +02:00
Ycros 39e73ae0d6 common : Add a warning when we can't match samplers from a string or char. (#13330) 2025-05-07 11:23:28 +03:00
R0CKSTAR 1f73301b63 cuda : remove nrows_x in mul_mat_q_process_tile (#13325)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-05-07 09:48:23 +02:00
Georgi Gerganov 4773d7a02f examples : remove infill (#13283)
ggml-ci
2025-05-07 10:28:02 +03:00
piDack 6c7fd67b64 llama : support tie embedding for chatglm models (#13328) 2025-05-07 09:23:11 +02:00
Johannes Gäßler 141a908a59 CUDA: mix virt/real CUDA archs for GGML_NATIVE=OFF (#13135) 2025-05-06 23:35:51 +02:00
Xuan-Son Nguyen 32916a4907 clip : refactor graph builder (#13321)
* mtmd : refactor graph builder

* fix qwen2vl

* clean up siglip cgraph

* pixtral migrated

* move minicpmv to a dedicated build function

* move max_feature_layer to build_llava

* use build_attn for minicpm resampler

* fix windows build

* add comment for batch_size

* also support tinygemma3 test model

* qwen2vl does not use RMS norm

* fix qwen2vl norm (2)
2025-05-06 22:40:24 +02:00
DocShotgun ffc727203a sampling : make top_n_sigma no-op at <=0 or a single candidate (#13345) 2025-05-06 22:36:24 +02:00
oobabooga 91a86a6f35 sampling : don't consider -infinity values in top_n_sigma (#13344) 2025-05-06 20:24:15 +02:00
Diego Devesa f4ed10b69c cmake : remove arm64 msvc presets (#13342) 2025-05-06 20:15:31 +02:00
Akarshan Biswas 1e333d5bba SYCL: Disable reorder optimize by default and stop setting tensor extras when optimize is disabled (#13254)
* SYCL: Do not set tensor extras when reorder optimize is disabled

* SYCL: Disable reorder optimize by default
2025-05-06 20:27:06 +05:30
Xuan-Son Nguyen 2f54e348ad llama : fix build_ffn without gate (#13336)
* llama : fix build_ffn without gate

* fix build on windows

* Revert "fix build on windows"

This reverts commit fc420d3c7e.
2025-05-06 14:25:40 +02:00
Johannes Gäßler 2356fb1d53 CUDA: fix bad asserts for partial offload (#13337) 2025-05-06 13:58:51 +02:00
Sigbjørn Skjæret 764b85627b convert : qwen2/3moe : set yarn metadata if present (#13331)
* set yarn metadata if present

* add comment about enabling YaRN

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>

---------

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
2025-05-06 11:12:06 +02:00
Johannes Gäßler 15a28ec8c7 CUDA: fix --split-mode row for MMQ (#13323) 2025-05-06 08:36:46 +02:00
compilade a7366faa5b gguf-py : avoid requiring pyside6 for other scripts (#13036)
- gguf-py : remove gguf-py/gguf/scripts/__init__.py because it's not needed

Implicit namespaces are supported since Python 3.3 (https://peps.python.org/pep-0420/),
and the entrypoints in pyproject.toml can directly refer to the main functions.
2025-05-05 22:27:31 -04:00
Johannes Gäßler 9070365020 CUDA: fix logic for clearing padding with -ngl 0 (#13320) 2025-05-05 22:32:13 +02:00
oobabooga 233461f812 sampling : Integrate Top-nσ into main sampling chain (and add it to the server) (#13264)
* sampling: add Top-nσ sampler to `llama-server` and sampler ordering

* revert: sampler ordering

* revert: VS' crappy auto-formatting

* revert: VS' crappy auto-formatting pt.2

* revert: my crappy eye sight...

* sampling: add XTC to Top-nσ sampler chain

* sampling: add Dyna. Temp. to Top-nσ sampler chain

* sampling: actually remove Top-nσ from sampler(oops)

* Integrate top_n_sigma into main sampler chain

* Define COMMON_SAMPLER_TYPE_TOP_N_SIGMA

* Formatting

* Lint

* Exit early in the sampler if nsigma < 0

---------

Co-authored-by: CasualAutopsy <casual_autopsy@outlook.com>
2025-05-05 22:12:19 +02:00
igardev b34c859146 server : Webui - change setText command from parent window to also send the message. (#13309)
* setText command from parent window for llama-vscode now sends the message automatically.

* Upgrade packages versions to fix vulnerabilities with "npm audit fix" command.

* Fix code formatting.

* Add index.html.gz changes.

* Revert "Upgrade packages versions to fix vulnerabilities with "npm audit fix" command."

This reverts commit 67687b7fda.

* easier approach

* add setTimeout

---------

Co-authored-by: igardev <ivailo.gardev@akros.ch>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-05-05 16:03:31 +02:00
Xuan-Son Nguyen 9b61acf060 mtmd : rename llava directory to mtmd (#13311)
* mv llava to mtmd

* change ref everywhere
2025-05-05 16:02:55 +02:00
Xuan-Son Nguyen 5215b91e93 clip : fix confused naming ffn_up and ffn_down (#13290)
* clip :  fix confused naming ffn_up and ffn_down

* rm ffn_i/o/g naming

* rename n_embd, n_ff

* small fix

* no check n_ff
2025-05-05 12:54:44 +02:00
Sigbjørn Skjæret ae803bfc3d convert : bailingmoe : set yarn metadata if present (#13312) 2025-05-05 12:34:26 +02:00
Akarshan Biswas 66645a5285 SYCL: Disable mul_mat kernels for noncontiguous tensor b (#13308)
ggml-ci
2025-05-05 13:39:10 +05:30
Xuan-Son Nguyen 27aa259532 mtmd : add C public API (#13184)
* init

* wip

* working version

* add mtmd::bitmaps

* add test target

* rm redundant define

* test: mtmd_input_chunks_free

* rm outdated comment

* fix merging issue

* explicitly create mtmd::input_chunks

* mtmd_input_chunk_copy

* add clone()

* add const to various places

* add warning about breaking changes

* helper: use mtmd_image_tokens_get_n_pos
2025-05-04 23:43:42 +02:00
Diego Devesa 9fdfcdaedd rpc : use backend registry, support dl backends (#13304) 2025-05-04 21:25:43 +02:00
Aaron Teo 6eb7d25c70 ggml : activate s390x simd for Q3_K (#13301)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-05-04 19:49:12 +02:00
Diego Devesa 86bd60d3fe llava/mtmd : fixes to fully support dl backends (#13303) 2025-05-04 17:05:20 +02:00
Diego Devesa 9f2da5871f llama : build windows releases with dl backends (#13220) 2025-05-04 14:20:49 +02:00
Johannes Gäßler 93c4e23905 CUDA: fix race condition in MMQ stream-k fixup (#13299) 2025-05-04 14:16:39 +02:00
Johannes Gäßler 8afbd96818 CUDA: fix race condition in MMQ ids_dst (#13294) 2025-05-04 13:58:38 +02:00
Jeff Bolz 8ae5ebcf85 vulkan: Additional type support for unary, binary, and copy (#13266)
Support f16->f32 copy.
Support f16->f16 and f32->f32 unary ops.
Support all combinations of f16/f32 for src0/src1/dst for add/sub/mul/div.
2025-05-04 07:17:16 +02:00
Johannes Gäßler 3e959f0976 imatrix: fix oob writes if src1 is not contiguous (#13286) 2025-05-04 00:50:37 +02:00
Xuan-Son Nguyen 36667c8edc clip : revert the change of BOI/EOI token for GLM-edge (⚠️ breaking change) (#13259) 2025-05-03 20:07:54 +02:00
ymcki 3bf785f3ef llama : Llama-3_1-Nemotron-Ultra-253B-v1 support (#12843) 2025-05-03 17:39:51 +02:00
Diego Devesa 1d36b3670b llama : move end-user examples to tools directory (#13249)
* llama : move end-user examples to tools directory

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-05-02 20:27:13 +02:00
Georgi Gerganov b34443923c sync : ggml (#13268)
* vulkan : kernels for depthwise 2D convolution (CONV_2D_DW) (ggml/1204)

* vulkan : add kernels for depthwise 2d convolution (OP_CONV_2D_DW)

* review: remove src_x/y < 0 checks; add performance tests

* sync : ggml

ggml-ci

* vulkan : fix lint (#0)

---------

Co-authored-by: Acly <aclysia@gmail.com>
2025-05-02 20:54:30 +03:00
Georgi Gerganov a75cb30dc9 context : fix reorder logic (#13267)
ggml-ci
2025-05-02 20:54:13 +03:00
shalinib-ibm 3f3769ba76 ggml : Enable MMA for BF16 in llamafile_sgemm (#13148)
This patch upstreams llamafile's cpu matrix multiplication kernels for ppc64le using MMA builtins for BF16 data type.

This change results in 9x - 40x gains
in total speed S t/s (ie all tokens/total time), across various batch sizes tested using llama-batched-bench benchmark.

The patch is tested with Meta-Lllama-3-8B,
and Mistral-7B models (BF16 models generated by using llama-quantize from corresponding FP32 models) on an IBM POWER10 machine.

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
2025-05-02 19:53:12 +03:00
Jared Van Bortel 2f567611c0 llama-model : support Qwen2 embedding models and pooling_mode_lasttoken (#13245) 2025-05-02 11:42:30 -04:00
Jared Van Bortel 7d2123484e convert : use correct context length for nomic-embed-text-v2 (#13216) 2025-05-02 11:41:54 -04:00
Xuan-Son Nguyen 074e42ab31 convert : converting mmproj for Qwen2/2.5VL from convert_hf_to_gguf (#13209)
* wip

* qwen2.5vl ok

* vision: fix models missing "text_config"

* add test

* fix test repo name

* fix 32B model

* Revert "fix 32B model"

This reverts commit 651752f1ae.

* clarify about 32B

* rm qwen surgery script

* update llava/readme

* move V_ENC_EMBD_PATCH handling to Qwen2VLVisionModel
2025-05-02 17:17:15 +02:00
Georgi Gerganov c642bc014c kv-cache : separate recurrent vs non-recurrent impl (#12799)
* kv-cache : serparate recurrent vs non-recurrent impl (wip)

ggml-ci

* kv-cache : init -> contructor + add llama_memory_params

ggml-ci

* kv-cache : fix callback reference

ggml-ci

* context : llama_kv_cache -> llama_memory_i

ggml-ci

* context : move memory creation logic to model

ggml-ci

* llama : remove reference of memory during encode

ggml-ci

* kv-cache : hide padding details in the implementation

ggml-ci

* kv-cache : add ubatch_next()

ggml-ci

* context : simplify sbatch logic

ggml-ci

* kv-cache : hide defrag logic in the implementation

ggml-ci

* context : hide kv cache details in implementation

ggml-ci

* build : fix

ggml-ci

* cont : another fix

ggml-ci

* kv-cache : simplify interface (wip)

ggml-ci

* kv-cache : use separate KV cell structs for unified/recurrent

ggml-ci

* kv-cache : clean-up

ggml-ci

* model : better llama_model::create_model() signature

ggml-ci

* kv-cache : fix recurrent seq_rm()

ggml-ci

* kv-cache : replace `struct callbacks` with `llama_model &`

ggml-ci

* kv-cache : replace `struct graph_params` with `llama_context &`

ggml-ci

* kv-cache : fix offload check

ggml-ci

* context : avoid passing unique_ptr

ggml-ci

* kv-cache : avoid using the backends from the llama_context

ref #13113

ggml-ci

* kv-cache : more consistent debug logs [no ci]

* kv-cache : do not pass the full llama_context for kv graphs

ggml-ci

* kv-cache : remove comment

* kv-cache : ggml_rope_ext_inplace -> ggml_rope_ext

ggml-ci

* kv-cache : fix recurrent multi-user case

ggml-ci

* memory : remove comments [no ci]
2025-05-02 17:48:36 +03:00
Sigbjørn Skjæret cb06a3c363 llama : orion rope type is neox (#13261) 2025-05-02 12:44:24 +02:00
Sigbjørn Skjæret 626083faf7 llama : plamo rope type is neox (#13260) 2025-05-02 12:40:56 +02:00
piDack 2af6880178 llama-chat : reset glmedge chat template (#13253)
* reset glmedge chat template

* fix glmedge chat template
2025-05-02 11:06:09 +02:00
Shakil Ahmed e84773ab60 mtmd-cli : fix out_of_range when input image path is empty (#13244)
* fix out_of_range error  to keep the chat loop running

* Update examples/llava/mtmd-cli.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* mtmd-cli : load image right away

* add a new line for readability

* rm printf

* Update examples/llava/mtmd-cli.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update examples/llava/mtmd-cli.cpp

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-05-02 10:20:27 +02:00
Georgi Gerganov fab647e884 server : add cache reuse card link to help (#13230)
* server : add cache reuse card link to help

* args : use short url
2025-05-02 09:48:31 +03:00
Xuan-Son Nguyen dcf886007d convert : explicitly disable trust_remote_code for AutoConfig (#13246) 2025-05-02 08:45:10 +02:00
bandoti d24d592808 ci: fix cross-compile sync issues (#12804) 2025-05-01 19:06:39 -03:00
Justin Santa Barbara 8efbdadc61 rpc : avoid uninitialized memory in serialize_tensor (#13210)
Zero out the name and padding buffers.
2025-05-01 23:32:11 +02:00
Jesse Gross f057808ffa ggml: Don't assert fail when tensor data changes (#13222)
The following scenario will cause an assertion failure in the graph
allocator:
 - Build and allocate a graph containing a tensor with a non-NULL data
   pointer
 - Build and allocate a new graph where that data is NULL

Result:
ggml-alloc.c:819: GGML_ASSERT(talloc->buffer_id >= 0) failed

This happens during revalidation because we think that memory should
have been previously allocated based on the current graph but in
reality the previous graph was different. In this situation, we
should do a full reallocation pass.
2025-05-01 22:46:10 +02:00
Diego Devesa d7a14c42a1 build : fix build info on windows (#13239)
* build : fix build info on windows

* fix cuda host compiler msg
2025-05-01 21:48:08 +02:00
Loïc Carrère b6e4ff69b8 clip : (minicpmv) Re-enable upscaling of images smaller than the CLIP image size (#13237) 2025-05-01 21:32:21 +02:00
matteo e0f572c846 llama-chat : update GLM4 chat template (#13238)
* update GLM4 chat template

* Update chat template

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-05-01 21:16:38 +02:00
Jeff Bolz 79f26e9e12 vulkan: Add bfloat16 support (#12554)
* vulkan: Add bfloat16 support

This adds bfloat16 matrix multiply support based on VK_KHR_shader_bfloat16.
The extension is required for coopmat multiply support, but matrix-vector
multiply trivially promotes bf16 to fp32 and doesn't require the extension.
The copy/get_rows shaders also don't require the extension.

It's probably possible to fall back to non-coopmat and promote to fp32 when
the extension isn't supported, but this change doesn't do that.

The coopmat support also requires a glslc that supports the extension, which
currently requires a custom build.

* vulkan: Support bf16 tensors without the bf16 extension or coopmat support

Compile a variant of the scalar mul_mm shader that will promote the bf16
values to float, and use that when either the bf16 extension or the coopmat
extensions aren't available.

* vulkan: bfloat16 fixes (really works without bfloat16 support now)

* vulkan: fix spirv-val failure and reenable -O
2025-05-01 20:49:39 +02:00
Jeff Bolz fc727bcdd5 vulkan: Handle src1 batch dimension in non-contiguous mat-vec-mul shader (#13191)
* vulkan: Handle src1 batch dimension in non-contiguous mat-vec-mul shader
2025-05-01 20:19:31 +02:00
Johannes Gäßler b0ecbd434b test: non-cont. b in test-backend-ops -o MUL_MAT (#13187) 2025-05-01 20:18:56 +02:00
Georgi Gerganov b1dd4d08e8 sync : ggml
ggml-ci
2025-05-01 20:15:34 +03:00
Daniel Bevenius 99881f77d8 whisper : add check that target name exists (whisper/3103)
This commit adds a check to makes sure that the target exists before
trying to add compile options to ignore warnings when using MSVC.

The motivation for this is currently the build is broken depending on
the cmake options provided. With this fix it should be possible to build
even if the targets are not actually available.

Refs: https://github.com/ggml-org/whisper.cpp/pull/3090#issuecomment-2842760104
2025-05-01 20:15:34 +03:00
Daniel Bevenius b5769d92b4 ggml : suppress Windows compiler warnings (whisper/3075)
* whisper: suppress Windows compiler warnings

This commit disables compiler warnings on window using MSVC.

The motivation for these changes is that some compilers generate
warnings for these conversion, for example Windows MSVC, and
there are quite a few of them. This makes it a little difficult to
spot new warnings that may be introduced and also can be difficult
for users/embedders of ggml where these warnings are hard to separate
from their own warnings.

* squash! whisper: suppress Windows compiler warnings

Move ggml related warnings into ggml. This commit also fixes the
indentation and adds a missing whitespace to the if statement.
2025-05-01 20:15:34 +03:00
Xuan-Son Nguyen 8936784f7a mtmd : add **vision** support for Mistral Small 3.1 (#13231)
* convert ok

* load ok, missing patch merger

* ah sheet it works

* update llava/readme

* add test

* fix test
2025-05-01 17:05:42 +02:00
Xuan-Son Nguyen 13c9a3319b arg : remove CURLINFO_EFFECTIVE_METHOD (#13228) 2025-05-01 10:23:25 +02:00
Jared Van Bortel a70183eb00 llama-model : fix the reported size class for nomic-embed-text-v2-moe (#13223) 2025-05-01 10:09:41 +03:00
Georgi Gerganov 8d33d740c3 sync : ggml 2025-05-01 10:00:39 +03:00
Diego Devesa 4254bb4951 ggml : fix ggml_gallocr_ptr type (ggml/1205) 2025-05-01 09:58:44 +03:00
Georgi Gerganov 9998540149 cuda : fix unused variable compile warning (whisper/0)
ggml-ci
2025-05-01 09:58:44 +03:00
Johannes Gäßler e1e8e0991f CUDA: batched+noncont MMQ, refactor bs>1 MoE code (#13199) 2025-04-30 23:12:59 +02:00
Xuan-Son Nguyen 6f67cf1f48 arg : -hf do not fail if url mismatch (#13219)
* arg : -hf do not fail if url mismatch

* do not return if cannot parse metadata json
2025-04-30 21:29:15 +01:00
ddh0 16a457facd fix typo: n_ctx_pre_seq -> n_ctx_per_seq (#13221) 2025-04-30 21:28:43 +01:00
Xuan-Son Nguyen 3e168bede4 convert : improve model arch handling (#13122)
* convert : improve model arch handling

* use AutoConfig

* rm trust_remote_code

* Update convert_hf_to_gguf.py

* fix self.block_count for vision

* fix NomicBertModel
2025-04-30 16:56:24 +02:00
Tatsuya Tanaka ceda28ef8e llava : remove duplicate include (#13207) 2025-04-30 15:25:20 +02:00
Olivier Chafik 3b127c7385 common : add -jf / --json-schema-file flag (#12011) 2025-04-30 14:52:35 +02:00
Jeff Bolz e5007a5edf vulkan: use uint array index to avoid glslang bug (#13193) 2025-04-30 14:38:37 +02:00
shalinib-ibm 416313773b ggml : fix ppc64le build (#13176)
Build fails with compilation error on power pc.
This patch fixes the same.

Tested with unit tests run via
 --build <build_dir> && cd <build_dir> && make test

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
2025-04-30 13:17:08 +02:00
Xuan-Son Nguyen 07c2e2f76c convert : correct typo image_mean --> image_std (#13208) 2025-04-30 13:06:15 +02:00
Aaron Teo 44cd8d91ff feat(ggml-cpu): enable z17 compile (#13182)
z17 compilation requires GCC 15.1.0 and onwards

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-04-30 10:47:35 +01:00
Xuan-Son Nguyen 5933e6fdc9 arg : allow using -hf offline (#13202)
* arg : allow using -hf offline

* add more comments in code [no ci]
2025-04-30 10:46:32 +02:00
Xuan-Son Nguyen da84c04d8f docker : do not build tests (#13204)
* docker : do not build tests

* include "ggml-cpu.h"
2025-04-30 10:44:07 +02:00
xiaofei a0f7016d17 rpc : fix cache directory initialization (#13188)
Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
2025-04-30 09:29:22 +03:00
Johannes Gäßler 19e899ce21 scripts: n_depth for compare-llama-bench [no ci] (#13201) 2025-04-29 23:32:04 +02:00
matteo e2e1ddb93a server : Prefilling assistant message in openai compatible API (#13174)
* Prefilling assistant message in openai compatible API

* fixed indentation

* fixed code convention

* simplify method usage

* no more than one assistant message at end of messages

* merge checks into prefill code

* Update examples/server/utils.hpp

---------

Co-authored-by: matteo <matteo@naspc.lan>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-04-29 20:33:10 +02:00
Georgi Gerganov d9d398f84f sampling : when top-k <= 0 -> noop (#13173)
ggml-ci
2025-04-29 20:22:57 +03:00
Alberto Cabrera Pérez 5a63980117 llama-bench: fixed size of fields to correctly map to values (#13183) 2025-04-29 17:24:36 +02:00
Johannes Gäßler cdf76586b2 CUDA: fix non-cont. inputs for batched mat mul (#13155) 2025-04-29 16:00:27 +02:00
Sigbjørn Skjæret 7d3af70b08 llama : llm_type order by size (#13177) 2025-04-29 13:25:53 +02:00
Xuan-Son Nguyen 00e3e5a194 mtmd : add qwen2vl and qwen2.5vl (#13141)
* llava : add clip_n_output_tokens, deprecate clip_n_patches

* mtmd : add qwen2vl and qwen2.5vl

* decode_embd_batch::set_position_...

* working version

* deprecate llama-qwen2vl-cli

* correct order W, H of clip_embd_nbytes_by_img

* edit existing line in hot topics
2025-04-29 11:47:04 +02:00
Sigbjørn Skjæret e98b3692be llama : set qwen3 model type sizes (#13175) 2025-04-29 11:00:31 +02:00
Xuan-Son Nguyen b6ce7430b7 llama-graph : fix text position for mrope (#13159)
* llama-graph : fix text position for mrope

* fix typo

* explicitly set 4th dim in the loop
2025-04-29 09:45:49 +03:00
AT 5f5e39e1ba model : Nomic Embed Text V2 with Mixture-of-Experts (MoE) architecture (#12466)
* Nomic Embed Text V2 with Mixture-of-Experts (MoE) architecture

- Adds MoE-based embedding model supporting multilingual embeddings.
- Selects architecture variant based on hyperparameter detection (MoE layers).
- Removes unnecessary subclass initialization checks for clarity.

https://www.nomic.ai/blog/posts/nomic-embed-text-v2

Co-authored-by: Jared Van Bortel <jared@nomic.ai>

* fix tokenizer

* don't rename this tensor

---------

Co-authored-by: Jared Van Bortel <jared@nomic.ai>
2025-04-28 22:52:15 +03:00
Xuan-Son Nguyen eaea325324 clip : fix model size display (#13153) 2025-04-28 21:23:19 +02:00
Ville Vesilehto 43ddab6eee fix(rpc): Improve input validation and error handling (#13069)
* fix(rpc): Improve input validation and error handling

The `rpc-server` was vulnerable to Denial of Service attacks via
several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed
messages could trigger failed assertions (e.g., invalid `ggml_type`)
or out-of-bounds reads/writes leading to `GGML_ABORT` calls,
crashing the server process.

This PR introduces robust input validation and replaces `abort()`
calls with graceful error handling:

- **Type Validation:** `deserialize_tensor` now checks if the
  `tensor->type` is within the valid `GGML_TYPE_COUNT` range
  *before* calling `ggml_new_tensor_4d`. Returns `nullptr` on
  invalid type.
- **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`,
  `set_tensor_hash`, and `get_tensor` handlers with error
  logging and returning `false` when data/offset parameters
  are out of buffer bounds.
- **Size Checks:** Added safe arithmetic checks (for overflow) in
  `graph_compute` when calculating required message sizes based
  on client-provided `n_nodes` and `n_tensors`. Returns early
  if the reported sizes conflict with the actual message size or
  would lead to overflow.
- **Error Propagation:**
    - `create_node` now checks for `nullptr` return values from
      `deserialize_tensor` and its recursive calls, propagating
      `nullptr` upwards on failure. Uses `find` instead of `at`
      for safer map access.
    - `copy_tensor` now checks for `nullptr` from `deserialize_tensor`
      and sets the response status to failure if deserialization
      or bounds checks fail.
    - `graph_compute` now checks for `nullptr` return from
      `create_node` and returns failure status correctly. The final
      return value now reflects the actual computation status.

These changes improve the RPC server's resilience
against malformed client requests, preventing crashes and ensuring
errors are handled more gracefully.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): address pr comments

removed comments and unnecessary returns

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): ambiguous nullptr from create_node

rpc_server::create_node could previously return nullptr if the input ID
was 0 (valid) or if an internal error (deserialization, recursion
failure) occurred (invalid). This ambiguity made error handling
difficult for the caller (`graph_compute`).

This commit clarifies the meaning of nullptr:
- `graph_compute` now checks if the input 'id' was non-zero when
  `create_node` returns nullptr, correctly identifying failures
  versus intentional null links.
- `create_node` avoids recursive calls for zero IDs and propagates
  nullptr unambiguously on failure during recursion.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): initial zero check in create_node

The caller (`graph_compute`) already checks `id != 0` when handling
a `nullptr` return from `create_node`, correctly distinguishing
intentional null links from actual errors. This makes the initial
`if (id == 0)` check redundant.

Also removes the log message when a tensor ID is not found in the
provided map which was added in this branch.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* fix(rpc): Handle get_alloc_size failure in server

Check the return value of `server.get_alloc_size` in the RPC server
loop. If the call fails, return early to close the connection.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): input size validation in graph_compute

Removes detailed, step-by-step size calculations and overflow
checks in favor of simpler direct comparisons, assuming 64-bit
overflow is unlikely.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): remove extra status code setting

Removes the explicit setting of `response.result = GGML_STATUS_FAILED`
when `create_node` returns `nullptr` within `graph_compute`.
Primary signal is the `false` return value in case of failure.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): remove redundant check for tensor->type

Breaks CI on ubuntu-cpu-make. Tensor type is uint32_t, thus
the check is not needed.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

---------

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
2025-04-28 21:00:20 +03:00
Vishal Agarwal 1831f538f7 llama-bench: add -d depth arg (#13096)
* add depth param

* update llama-bench README and add depth param

* llama-bench: default params for depth arg for faster execution

* Update examples/llama-bench/README.md

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* fix buffer print ub

* use user provided args

* remove extra whitespaces

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-04-28 16:50:39 +02:00
Xuan-Son Nguyen 4e87962e34 mtmd : fix glm-edge redundant token count (#13139)
* mtmd : fix glm-edge redundant token count

* fix chat template

* temporary disable GLMEdge test chat tmpl
2025-04-28 16:12:56 +02:00
pockers21 fb0471d175 context : do not clear output buffer on reserve (#13152)
Co-authored-by: pockers21 <liyang2@uniontech.com>
2025-04-28 16:45:40 +03:00
Xuan-Son Nguyen d2b2031e5f llama : (mrope) allow using normal 1D position for text token (#13138)
* llama : (mrope) use normal position for text token

* rm n_pos_per_embd from llm_graph_input_attn_temp
2025-04-28 14:20:56 +02:00
Xuan-Son Nguyen 5fa9e63be8 clip : refactor set input for cgraph + fix qwen2.5vl input (#13136)
* clip : refactor set input for cgraph

* more strict assert

* minicpmv : use clip_n_mmproj_embd instead of copying the same code everywhere

* split qwen2 and qwen2.5 code blocks

* minor style fix
2025-04-28 12:18:59 +02:00
Akarshan Biswas a4c340f974 SYCL: Add all missing unary kernels (#13074)
* SYCL: Add all missing unary kernels

ggml-ci

* decouple kernel launch range from data size using strided loop

* use ciel_div helper for num_blocks
ggml-ci

* clean auto imported header files
2025-04-28 11:33:25 +02:00
Georgi Gerganov d0a417f3c7 readme : update hot topics (#13150) 2025-04-28 12:10:18 +03:00
Georgi Gerganov 43f2b07193 common : fix noreturn compile warning (#13151)
ggml-ci
2025-04-28 11:57:19 +03:00
Xuan-Son Nguyen e5d6c2554e llama-chat : fix typo GML --> GLM (#13143) 2025-04-28 10:11:58 +02:00
R0CKSTAR f0dd6a1926 musa: fix typo in cc control (#13144)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-04-28 09:33:28 +02:00
Johannes Gäßler 69699be48a CUDA: fix q_nope_absorbed prec for DS 2 Lite f16 (#13137) 2025-04-28 09:29:26 +02:00
Xuan-Son Nguyen 85f36e5e71 arg : fix unused variable (#13142) 2025-04-28 08:16:59 +03:00
4onen c0a97b762e llama-bench : Add --override-tensors arg (#12922)
* Add --override-tensors option to llama-bench

* Correct llama-bench --override-tensors to --override-tensor

* llama-bench: Update --override-tensors parsing to match --tensor-split, appear in test matrix.

* Make new llama-bench util functions static to fix Ubuntu CI

* llama-bench: Correct -ot corner cases (No -ot calls, leading and trailing empty -ot spans, etc.)
2025-04-27 23:48:26 +02:00
matteo ced44be342 llama-chat : fix wrong template in GLM4-0414 (#13140)
* fix wrong template in GLM4-0414

* fix spaces

* no bos token since it is already in the template

* moved the chatgml4 check to higher priority

* restored template for old GLM models

* moved the GLM4 template check in the correct place with correct check
2025-04-27 21:57:32 +02:00
R0CKSTAR e291450b76 musa: fix build warning (#13129)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-04-27 13:22:49 +02:00
LostRuins Concedo 59e991c23c Fixes Qwen2.5VL segfault during inference with https://github.com/ggml-org/llama.cpp/pull/12402 as has_qwen2vl_merger migration was incomplete (#13133) 2025-04-27 12:43:37 +02:00
HimariO ca2bb89eac clip : Add Qwen2.5VL support (#12402)
* implment vision model architecture, gguf convertor

* handle window attention inputs

* add debug utils

* fix few incorrect tensor memory layout

* move position id remap out of ggml to avoid int32 cuda operations

* cleaning up

* ignore transformers Qwen2_5_xxx type check

* remove not so often use `qwen2vl-cli` debug functions

* remove commented-out code blocks

* fix attn weight scaling after rebase

* add `PROJECTOR_TYPE_QWEN2_5_VL`

* remove `KEY_USE_GLU_MLP`, `KEY_USE_RMS_NORM`

* replace `KEY_FULLATTN_BLK_IDX` with `KEY_WIN_ATTN_PATTERN`

* remove `attn_window_size` from gguf

* fix model conversion

* clean up

* fix merging problem

* add test

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-04-27 10:10:34 +02:00
Xuan-Son Nguyen 2d451c8059 common : add common_remote_get_content (#13123)
* common : add common_remote_get_content

* support max size and timeout

* add tests
2025-04-26 22:58:12 +02:00
Xuan-Son Nguyen 4753791e70 clip : improve projector naming (#13118)
* clip : improve projector naming

* no more kv has_llava_projector

* rm unused kv

* rm more unused
2025-04-26 22:39:47 +02:00
SXX 77d5e9a76a ggml: move fp16/bf16 conversion optimizations to CPU backend + export conversion APIs (#13107)
* ggml: dynamic x86_64 feature detection for FP32 <-> FP16/BF16 conversion

* move fp converter to ggml-cpu

* Switch ggml_compute_forward_get_rows_f16/bf16 to new ggml_cpu_fp16/bf16_to_fp32
2025-04-26 16:05:31 +02:00
frob d5fe4e81bd grammar : handle maxItems == 0 in JSON schema (#13117)
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
2025-04-26 10:10:20 +02:00
Diego Devesa 295354ea68 llama : fix K-shift with quantized K and BLAS backend (#13113) 2025-04-25 19:40:11 +02:00
City 558a764713 Force FP32 compute in GLM4 FFN Down (#13101)
* Force FP32 compute in cuBLAS GEMM

* Revert "Force FP32 compute in cuBLAS GEMM"

This reverts commit 6efd872732.

* Force F32 compute in GLM4 ffn down

* Edit comment to clarify issue

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-04-25 14:38:34 +02:00
Xuan-Son Nguyen edb18b6e8f clip : fix pixtral on some GPU backends (#13097)
* clip : fix pixtral on some GPU backends

* refactor inp_raw set

* rm outdated comment

* fix dynamic size

* add TODO
2025-04-25 14:31:42 +02:00
Neo Zhang Jianyu 514c45608f change the reorder tensor from init to execute OP (#13003) 2025-04-25 17:37:51 +08:00
Radoslav Gerganov 553a5c3a9f rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (#12943)
RPC_CMD_SET_TENSOR always returns an empty response and we send this 4
times per token. We can improve TG speed if we don't wait for this empty
response.

The performance impact of this change depends on the network latency.
2025-04-25 10:08:08 +03:00
Xuan-Son Nguyen 13be08daf9 clip : remove boi/eoi embeddings for GLM-edge model (#13081) 2025-04-24 22:17:04 +02:00
Georgi Gerganov 226251ed56 embeddings : fix batch sizes (#13076)
ggml-ci
2025-04-24 22:29:22 +03:00
Georgi Gerganov 87616f0680 ggml : fix trailing whitespaces (#0) 2025-04-24 17:32:47 +03:00
Georgi Gerganov 63b4911494 sync : ggml
ggml-ci
2025-04-24 17:32:47 +03:00
Acly c6e8cc28c1 ggml : Depthwise 2D convolution (ggml/1152)
* ggml-cpu : kernels for faster depthwise 2D convolution

* fix compile: remove static after moving to ops.cpp

* add dilation for depthwise_conv_2d

* review: rename to ggml_conv_2d_dw_direct, remove redundant struct keywords, pass by ref, whitespace

* review: rename depthwise_conv_2d -> conv_2d_dw everywhere
2025-04-24 17:32:47 +03:00
Johannes Gäßler b10d8bfdb1 CUDA: use switch statements in constexpr functions (#13095) 2025-04-24 15:57:10 +02:00
Georgi Gerganov 13b4548877 cmake : do not include ./src as public for libllama (#13062)
* cmake : do not include ./src as public for libllama

ggml-ci

* cmake : rework tests

ggml-ci

* llguidance : remove unicode include

ggml-ci

* cmake : make c++17 private

ggml-ci
2025-04-24 16:00:10 +03:00
Georgi Gerganov 572b3141d3 clang-tidy : disable warning about missing math parenthesis (#13091) 2025-04-24 15:44:05 +03:00
Xuan-Son Nguyen 7c727fbe39 arg : add --no-mmproj-offload (#13093)
* arg : add --no-mmproj-offload

* Update common/arg.cpp
2025-04-24 14:04:14 +02:00
Xuan-Son Nguyen 80982e815e arg : clean up handling --mmproj with -hf (#13082)
* arg : clean up handling --mmproj with -hf

* rm change about no_mmproj

* Revert "rm change about no_mmproj"

This reverts commit 2cac8e0efb.

* handle no_mmproj explicitly

* skip download mmproj on examples not using it
2025-04-24 12:14:13 +02:00
Georgi Gerganov 7604a7d6b8 metal : fix floating-point range of attention scores in FA kernels (#13090)
ggml-ci
2025-04-24 10:38:30 +03:00
Eve b3b6d862cf vulkan: matmul gcn tuning (#13016)
* tune matmul for gcn

* this one is more power efficient

* Update ggml/src/ggml-vulkan/ggml-vulkan.cpp

Co-authored-by: 0cc4m <picard12@live.de>

* disable this tune for the proprietary driver

---------

Co-authored-by: 0cc4m <picard12@live.de>
2025-04-24 09:18:33 +02:00
pl752 5630406959 llama-mtmd-cli: Sigint rework in mtmd vision example (#13080)
* Sigint rework in mtmd vision example

* Applied suggestions on mtmd-cli PR

* Forgot to invert one of the conditions

* Update examples/llava/mtmd-cli.cpp

* Removed redundant exit check

---------

Co-authored-by: pl752 <maximpl752@gmail.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-04-23 23:32:35 +02:00
Xuan-Son Nguyen ecda2ec4b3 mtmd : Support Pixtral 12B (#13065)
* add pixtral text model (vision is wip)

* cgraph ok, just missing 2D RoPE

* fix bad rebase

* first working version

* fix problem with img_break token

* support dynamic image size

* update docs

* update test script
2025-04-23 20:21:59 +02:00
piDack eb1776b15a convert : Append mult-eos,half-rope,bos to GLM4-0414 and Z (#13021)
* append mult-eos,half-rope,bos to GLM4-0414

* remove unset var
2025-04-23 16:59:14 +02:00
Radoslav Gerganov 2cca6c01e4 rpc : add command line option for number of threads for the CPU backend (#13060)
closes #13051
2025-04-23 10:32:49 +03:00
Johannes Gäßler 658987cfc9 CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID (#13014)
* CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID

* fix logic for RoPE support, CUDA graphs
2025-04-22 21:27:40 +02:00
Xuan-Son Nguyen dc39a5e7a8 mtmd : support SmolVLM (version 1 and 2) (#13050)
* mtmd : support SmolVLM (version 1 and 2)

* correct chat template

* fix n_patches

* scale_factor is an int

* add more models to test
2025-04-22 16:24:54 +02:00
Georgi Gerganov ab47dec3d3 security : add note about RPC and server functionality (#13061)
* security : add note about RPC functionality

* security : add note about llama-server
2025-04-22 16:16:10 +03:00
Georgi Gerganov 7b53389c24 metal : add memory pool for temp allocs (#12850)
* metal : add memory pool for temp allocs (wip) [no ci]

* cont : free buffers from the heap

* cont : resize heap [no ci]

* cont : refactor heap [no ci]

* cont : heap for each cmd buffer [no ci]

* cont : fix free

* wip

* cont : fix alignment [no ci]

* cont : not working .. [no ci]

* cont : heap allocation now works [no ci]

* cont : use MTLHeapTypePlacement

ggml-ci

* metal : use dynamic MTLHeap allocations

ggml-ci

* metal : add comments

* metal : disable softmax use of mem_pool

ggml-ci

* metal : final touches
2025-04-22 16:15:51 +03:00
Xuan-Son Nguyen 243453533e llava : update documentations (#13055)
* llava : update documentations

* fix typo
2025-04-22 10:37:00 +02:00
Diego Devesa 1d735c0b4f ggml : add SSE 4.2 and x64 base variant for CPUs without AVX (#12871)
* ggml : add SSE 4.2 variant for CPUs without AVX

* ggml : add x64 base ABI variant
2025-04-21 18:13:51 +02:00
Akarshan Biswas 5368ddda7a SYCL: Add non-contiguous support in ROPE (#12993)
ggml-ci
2025-04-21 19:13:30 +05:30
Xuan-Son Nguyen 84a9bf2fc2 mtmd : merge llava, gemma3 and minicpmv CLI into single llama-mtmd-cli (#13012)
* mtmd : merge `llava-cli` and `gemma3-cli` into single `mtmd-cli`

* support for minicpmv

* remove cpp files of llava and minicpmv

* update hot topics

* mtmd : add not supported msg for qwen2vl

* Update examples/llava/mtmd.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-04-21 15:32:58 +02:00
Xuan-Son Nguyen 2016f07bd1 convert : experimental support for --mmproj flag (#13023)
* convert : experimental support for `--mmproj` flag

* fix bad ctrl+f replace

* fix style

* split into subclasses TextModel and VisionModel

* rename Mode --> ModelBase

* small fix

* correct CLIP_VISION arch name (because existing GGUF already use it)

* Apply suggestions from code review

Co-authored-by: compilade <git@compilade.net>

* fix Mistral3Model

* fix typo

Co-authored-by: compilade <git@compilade.net>

---------

Co-authored-by: compilade <git@compilade.net>
2025-04-20 23:29:36 +02:00
Jeffrey Morgan 6602304814 llava: fix errors in clip.h on certain compilers (#13030) 2025-04-20 12:15:41 +02:00
Jeff Bolz 66168204be vulkan: support noncontiguous rms_norm (#13031) 2025-04-20 10:50:02 +02:00
Jeffrey Morgan 4ba9d711ba metal: add neg operator (#13029) 2025-04-20 08:28:40 +03:00
bandoti 00137157fc Disable CI cross-compile builds (#13022) 2025-04-19 18:05:03 +02:00
Sigbjørn Skjæret fb28f4f80e gguf-py : fix upload python package workflow (#13020) 2025-04-19 16:26:38 +02:00
Xuan-Son Nguyen 37b9f0d29d clip : refactor, add image_manipulation and llava_uhd classes (#13011)
* clip : refactor, add `image_manipulation` and `llava_uhd`

* refactor llava-1.6 preprocessing

* simplify logic for llava-1.5

* missing include
2025-04-19 09:15:45 +02:00
Daniel Tang 6408210082 main : Fix Ctrl+D/newline handling (#12951)
This restores the behavior from #491. This does not affect Ctrl+D's ability to
terminate --multiline-input lines (#1040).

This also actually implements #587: "If the user wants the text to end in a
newline, this should be accomplished by explicitly adding a newline by using
\ followed by return, then returning control by pressing return again."

Fixes #12949
2025-04-18 22:02:55 +02:00
Chris Thompson aff9d107b0 gguf-py : GGUF Editor GUI - Python + Qt6 (#12930) 2025-04-18 20:30:41 +02:00
Xuan-Son Nguyen 35370ba945 server : use std::move whenever possible (#12936)
* server : use std::move whenever possible

* use r-value ref

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* make task creation scoped

* restore std::move

* fix task_id not set correctly

* apply changes from suggestion

Co-authored-by: ggerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-04-18 19:58:12 +02:00
Akarshan Biswas 8d66005763 SYCL: Refactor and enable FP16 in binary broadcast OPs (#12975)
* SYCL: refactor move to a separate file

* Fix binbcast

* Remove duplicates

* fix include formatting

* fix typo
2025-04-18 15:57:56 +02:00
Xuan-Son Nguyen b9154ecff9 mtmd : add methods to access mtmd_image_tokens (#12906)
* mtmd : add more api around mtmd_image_tokens

* mtmd : ability to calc image hash

* shared_ptr for mtmd_image_tokens

* move hash to user-define ID (fixed)

* fix prompt_modified

* rm redundant data member
2025-04-18 10:04:51 +02:00
Radoslav Gerganov 2db9ba1464 rpc : add RPC_CMD_HELLO (#12955)
Add RPC_CMD_HELLO for getting the version of the protocol implemend by
the server. Follow the semantic versioning rules at https://semver.org

Hopefully this bring better user experience when we make breaking
changes at the protocol level and avoid issues like #12465
2025-04-18 10:13:42 +03:00
Georgi Gerganov 2f74c354c0 graph : make FA compatible with MLA + add initial Metal kernels (#12953)
* graph : make mla compatible with FA

* metal : add exp FA kernels for DeepSeek models

ggml-ci

* llama : minor naming updates

ggml-ci

* ggml : disable FA for DS head sizes

* tests : add FA tests for MLA shapes

ggml-ci
2025-04-17 18:16:36 +03:00
Alan Gray 207c22ec2d ggml: Re-enable CUDA graphs in presence of CONT and DUP nodes (#12970) 2025-04-17 15:19:42 +02:00
hipudding 7a395f67a7 CANN: Add support for async operator submission (#12864)
Submit operators using asynchronous threads to improve performance.

Use the environment variable GGML_CANN_ASYNC_MODE to control whether
asynchronous submission is enabled. It is disabled by default.

Testing shows a 10%–20% performance improvement in scenarios with
small parameter sizes, especially in quantized models.
2025-04-17 20:34:16 +08:00
Mikko Juola 971f245b3b llama : recognize IBM Granite 3.3 FIM tokens (#12988)
The Granite's FIM tokens are very similar to Qwen's; it's just that
they use underscore instead of a dash. So <fim_middle> for example
instead of <fim-middle>.

Opening up tokenizer_config.json in ibm-granite/granite-3.3-8b-base
shows:

```
    "<fim_prefix>",
    "<fim_middle>",
    "<fim_suffix>",
    "<fim_pad>",
    ...
    "<reponame>",
```
2025-04-17 11:37:05 +03:00
kimminsu 12b17501e6 opencl: fix incorrect local_size index in profiling log (#12868) 2025-04-16 14:25:57 -07:00
Jeff Bolz 015022bb53 vulkan: enable coopmat2 FA gqa and split_k optimizations more often (#12931)
The grouped query attention optmization doesn't require a power of two ratio,
the only thing relying on it was the modulo operation written as bitwise &.

split_k need not depend on gqa_ratio - enable it any time there's only one
workgroup in the X dimension. The shader gets the split index from the x coord,
and multiple workgroups in the X dimension (pre-split) indicates a larger
FA operation that wouldn't need splitting.
2025-04-16 20:37:25 +02:00
Chenguang Li b43d89e311 CANN: Add 310P operator support check (#12962) 2025-04-16 16:21:05 +08:00
lhez 80f19b4186 opencl: split ggml-opencl.cl into multiple files and cleanup (#12886)
* opencl: refactor - split the kernel files

---------

Co-authored-by: Shangqing Gu <quic_shawngu@quicinc.com>

* opencl: split more kernels into separate files

* opencl: specify subgroup size instead of querying it

* opencl: refine Adreno cl compiler version parsing

* opencl: skip some kernels not used by Adreno on old compilers

* opencl: refine logic for selecting Adreno kernels

* opencl: refine Adreno cl compiler version

* opencl: cleanup preprocessor for kernels

* opencl: consider Adreno CL compiler on Windows

* opencl: add final newline for `mul_mv_f16_f16.cl`

---------

Co-authored-by: Shangqing Gu <quic_shawngu@quicinc.com>
2025-04-15 12:26:00 -07:00
Georgi Gerganov f8f820cc4d metal : add FA-vec kernels for head size 96 (#12952)
ggml-ci
2025-04-15 14:45:05 +03:00
hipudding 54a7272043 CANN: Add x86 build ci (#12950)
* CANN: Add x86 build ci

* CANN: fix code format
2025-04-15 12:08:55 +01:00
David Huang 84778e9770 CUDA/HIP: Share the same unified memory allocation logic. (#12934)
Replace compile-time `GGML_HIP_UMA` with environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY`. This unifies the usage on NVIDIA and AMD GPUs, and allows a single binary to be shared between integrated and dedicated GPUs.
2025-04-15 11:20:38 +02:00
Akarshan Biswas 510676475f SYCL: Add ROPE vision kernel (#12887)
* SYCL: Add ROPE vision kernel

* Add comment about rope mode
2025-04-15 10:37:42 +02:00
Juk Armstrong daa422881a llama : DeepSeek V2/V3 MLA implementation (#12801)
* Merged using squash to remove all noise commit messages

* Force flash attention off for `LLM_ARCH_DEEPSEEK2` - embedding too large

* Removed 3 conts (2x RoPE and 1x RMS-norm)

* Changed to use `<cmath>` instead of `<math.h>`

* Reverted removal of the 3 conts

* Used `reshape` in `llm_graph_context::build_attn_mha()`

* Use `k_pe = ggml_reshape`

* Removed the 3 conts again

* Removed the 3D views of `wk_b` and `wv_b`, and just save and 3D in GGUF

* Removed MQA optimisation from `build_attn_mha()` as no gains now

* Simplified `is_mla` branch in `llm_build_deepseek2()`

* Removed `build_attn_mla` and added `nullptr` to all `build_atnn` calls

* Fixed call to `build_attn` in `llm_build_t5_enc`
2025-04-15 09:49:57 +03:00
Srihari-mcw eccc7a1602 ggml : Add AVX512 implementation of GEMM - Q4_Kx8 (#12829)
* Add AVX512 implementation of GEMM - q4kx8

* Update changes to remove unnecessary whitespaces
2025-04-15 09:22:36 +03:00
Chenguang Li 0019279bb5 CANN: Opt ROPE optimization (#12865)
* [CANN]Opt ROPE optimization

* [CANN]Codestyle adjustment

* [CANN]Fix the ROPE precision issue

* [CANN]codestyle fix

* [CANN]add rope unsupport case

Signed-off-by: noemotiovon <noemotiovon@gmail.com>
2025-04-15 10:09:35 +08:00
Xinpeng Dou b0c75ac9f9 CANN: Optimize CANN buffer pool memory management (#12875)
Multiple optional memory pools are provided for CANN, including VMM, 
priority queue-based, and traditional memory pools.
1.When the memory pool is available and GGML_CANN_DISABLE_VMM_POOL 
   is not defined, the VMM pool is selected by default.
2.Otherwise, if GGML_CANN_ENABLE_BUF_PRIO_POOL is defined, 
   the priority queue-based memory pool is used.
3.If neither condition is met, the default memory pool is used.
2025-04-15 10:04:24 +08:00
Russyyds d6d2c2ab8c Add performance print for gemma3 in example (#12929) 2025-04-14 19:18:20 +02:00
Akarshan Biswas 75afa0ae31 SYCL: Fix im2col (#12910)
* SYCL: Fix im2col

* restore local workgroup size adjustments for large inputs

* restore format
2025-04-14 14:23:53 +02:00
Radoslav Gerganov c772d54926 rpc : use ggml_context_ptr (#12938) 2025-04-14 13:59:34 +03:00
Neo Zhang Jianyu 81c7e64fc2 dsiable curl lib check, this action is missed by commit bd3f59f812 (#12761) (#12937) 2025-04-14 18:19:07 +08:00
Georgi Gerganov 526739b879 sync : ggml
ggml-ci
2025-04-14 09:26:15 +03:00
cmdr2 a25355e264 cpu: fix cpu backend's supports-op for GET_ROWS_BACK. fixes a fatal when running test-backend-ops with only the CPU backend (ggml/1190) 2025-04-14 09:26:15 +03:00
SXX e959d32b1c ggml: use _mm[512/256]_dpbusd[_avx]_epi32 to directly accumulate into the result register (#12773)
* ggml: use _mm[512/256]_dpbusd[_avx]_epi32 to directly accumulate into the result register

* simplifies the codebase by removing redundant functions
2025-04-14 08:47:55 +03:00
Alan Gray 307bfa253d ggml: disable CUDA graphs for unsupported DUP and CONT node types (#12891)
Fixes #12798
2025-04-13 23:12:21 +02:00
Ed Addario 71e90e8813 quantize: Handle user-defined quantization levels for additional tensors (#12511)
* Add llama_model_quantize_params parameters

* Add new quantize parameters parsing and validation

* Update usage

* Add new parameters defaults

* Add new quantization parameters logic

* Add llama_model_quantize_params parameters

* Add new quantize parameters parsing and validation

* Update usage

* Add new parameters defaults

* Add new quantization parameters logic

* Minor refactoring as per the contributors' coding guidelines

* Update descriptions to match existing style

* Add llama_model_quantize_params parameters

* Add new quantize parameters parsing and validation

* Update usage

* Add new parameters defaults

* Add new quantization parameters logic

* Minor refactoring as per the contributors' guidelines

* Implement general --tensor-type instead of tensor-specific command option

* Fix implied type bug

* Restore missing #includes

* Add regex capability for tensor selection

* Refactor function name and update ALLOWED_TENSOR_TYPE

* Add missing #include

* Handle edge case when tensor name is cls.output

* Minor logging improvement
2025-04-13 21:29:28 +03:00
Prajwal B Mehendarkar bc091a4dc5 common : Define cache directory on AIX (#12915) 2025-04-12 17:33:39 +02:00
Jeff Bolz a4837577aa vulkan: use aligned loads for flash attention mask (#12853)
Rewrite the stride logic for the mask tensor in the FA shader to force the
stride to be aligned, to allow using more efficient loads.
2025-04-12 10:44:48 +02:00
Matt Clayton e59ea539b8 llava: Fix cpu-only clip image encoding sefault (#12907)
* llava: Fix cpu-only clip image encoding

* clip : no smart ptr for ggml_backend_t

* Fix for backend_ptr push_back

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-04-12 07:29:03 +02:00
Georgi Gerganov c94085df28 server : add VSCode's Github Copilot Chat support (#12896)
* server : add VSCode's Github Copilot Chat support

* cont : update handler name
2025-04-11 23:37:41 +03:00
yuri@FreeBSD e8a62631b3 rpc : Set cache directory in rpc-server.cpp on FreeBSD (#12903) 2025-04-11 22:04:14 +02:00
Olivier Chafik b6930ebc42 tool-call: fix non-tool-calling grammar crashes w/ Qwen / Hermes 2 templates (#12900)
* `tool-call`: don't call common_chat_params_init_hermes_2_pro when there aren't tools (or when there's a schema)

* test all chat formats w/o tools
2025-04-11 21:47:52 +02:00
yuri@FreeBSD 68b08f36d0 common : Define cache directory on FreeBSD (#12892) 2025-04-11 21:45:44 +02:00
Ewan Crawford 578754b315 sycl: Support sycl_ext_oneapi_limited_graph (#12873)
The current usage of the SYCL-Graph extension checks for
the `sycl_ext_oneapi_graph` device aspect. However, it is also
possible to support `sycl_ext_oneapi_limied_graph` devices that
don't support update
2025-04-11 15:32:14 +02:00
tastelikefeet b2034c2b55 contrib: support modelscope community (#12664)
* support download from modelscope

* support login

* remove comments

* add arguments

* fix code

* fix win32

* test passed

* fix readme

* revert readme

* change to MODEL_ENDPOINT

* revert tail line

* fix readme

* refactor model endpoint

* remove blank line

* fix header

* fix as comments

* update comment

* update readme

---------

Co-authored-by: tastelikefeet <yuze.zyz@alibaba-inc/com>
2025-04-11 14:01:56 +02:00
Yuxuan Zhang 06bb53ad9b llama-model : add Glm4Model implementation for GLM-4-0414 (#12867)
* GLM-4-0414

* use original one

* Using with tensor map

* fix bug

* change order

* change order

* format with flask8
2025-04-11 12:10:10 +02:00
Xuan-Son Nguyen 0c50923944 clip : use smart pointer (⚠️ breaking change) (#12869)
* clip : use smart pointers

* fix warmup

* add forward declaration

* misisng include

* fix include (2)

* composite

* simplify batch ptr

* fix conflict
2025-04-11 12:09:39 +02:00
Akarshan Biswas fccf9cae83 SYCL: Add fp16 type support to unary op kernels (#12788)
* SYCL: Add fp16 support to some elementwise OP kernels

* remove comment

ggml-ci

* Use static_cast directly

* remove not needed cast from tanh

* Use static cast and remove unneeded castings

* Adjust device_support_op for unary OPs

* Use cast_data and typed_data struct to deduplicate casting code
2025-04-11 16:03:50 +08:00
Daniel Han ec6c09d0fa convert : Llama4 RoPE fix (#12889) 2025-04-11 09:49:09 +02:00
R0CKSTAR 8ac9f5d765 ci : Replace freediskspace to free_disk_space in docker.yml (#12861)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-04-11 09:26:17 +02:00
Daniel Bevenius 12e9158f25 xcf : add check for visionos build version (#12854)
This commit adds a check for the visionos build version used with vtool
in build-xcframework.sh. The script now checks the Xcode version and
determines whether to use "xros" or "visionos" for the build version.

This commit also uses xcrun for the vtool so that the version of vtool
in xcode command line tools is used instead of the one in the system
path.

Refs: https://github.com/ggml-org/whisper.cpp/pull/2994#issuecomment-2773292223
2025-04-11 09:24:34 +02:00
Xuan-Son Nguyen 5b1f13cb64 convert : proper tensor name mapping for llama4 (#12870)
* Llama-4 mapping

* remove hacky renaming

---------

Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2025-04-11 09:23:37 +02:00
Xuan-Son Nguyen 8b91d5355a llama : correct rms norm for llama 4 (#12882) 2025-04-11 08:49:50 +02:00
Aaron Teo 0fed24c347 ggml: fix compilation error s390x (#12848)
* ggml: fixes #12846 compilation error

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

Co-authored-by: Aleksei Nikiforov <aleksei.nikiforov@ibm.com>

* ggml: add documentation for code change

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

Co-authored-by: Aleksei Nikiforov <aleksei.nikiforov@ibm.com>

* ggml: refactor to type-cast and update documentation

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

Co-authored-by: Aleksei Nikiforov <aleksei.nikiforov@ibm.com>

* ggml: update documentation to provide full issue link

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

Co-authored-by: Aleksei Nikiforov <aleksei.nikiforov@ibm.com>

---------

Co-authored-by: Aleksei Nikiforov <aleksei.nikiforov@ibm.com>
2025-04-11 08:20:07 +03:00
Georgi Gerganov 47ba87d0a4 sync : ggml 2025-04-11 00:17:47 +03:00
Georgi Gerganov 1d2b613445 tests : fix init order (#0)
ggml-ci
2025-04-11 00:17:47 +03:00
Georgi Gerganov eb420e1148 sync : ggml
ggml-ci
2025-04-11 00:17:47 +03:00
cmdr2 cb79c2e7fa ggml: don't include arm_neon.h when using CUDA 12 with ARM Neon (ggml/1187)
fix #1186
2025-04-11 00:17:47 +03:00
Diego Devesa fe92821ea9 ggml : add bilinear upscale support (ggml/1185) 2025-04-11 00:17:47 +03:00
Diego Devesa 459895c326 ggml : add more generic custom op, remove deprecated custom ops (ggml/1183)
* ggml : add more generic ggml_custom op

* ggml : remove deprecated custom ops
2025-04-11 00:17:47 +03:00
Georgi Gerganov e4bf72d631 scripts : fix sync-ggml-am.sh 2025-04-11 00:17:47 +03:00
Xuan-Son Nguyen 8b9cc7cdd8 llava : introduce libmtmd (#12849)
* wip llava2

* migrated gemma3 to llava2

* add timings

* correct pre/postfix

* fix missing include

* fix compilation unused var warn

* update llava2_tokenize

* change name llava2 --> mtmd

* improve api

* refine helpers

* Update examples/llava/mtmd.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-04-10 22:57:16 +02:00
Xuan-Son Nguyen 64eda5deb9 convert : ability to lazy-load safetensors remotely without downloading to disk (#12820)
* gguf util : add SafetensorRemote

* fix style

* convert: add --remote option

* convert : allow using lazy remote tensors

It's a bit slow for now since everything is blocking and single-threaded.

* correct metadata.name

* small style fix

* support HF_TOKEN

* convert : use writeable buffer for remote lazy tensors

* convert : fix flake8 lint regarding lamdba assigment

* multithreaded download

* multithread: print debug

* fix style

* Revert "multithreaded download"

This reverts commit 42fc895ace.

* bring back _get_request_headers

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
2025-04-10 17:24:44 +02:00
Chenguang Li fe5b78c896 CANN: Support more ops (#12841)
* [CANN]Support Opt LOG && MEAN && PAD_REFLECT_1D

* [CANN]Support COUNT_EQUAL && STEP && SGN

* [CANN]codestyle adjustment

* [CANN]codestyle adjustment

---------

Signed-off-by: noemotiovon <noemotiovon@gmail.com>
2025-04-10 08:51:52 +08:00
Prajwal B Mehendarkar 11d07e1e69 Fixes #12823 (#12830)
* Including limits file on AIX

* Fixes #12823
2025-04-10 01:18:01 +02:00
Rudi Servo b0091ecc1e docker : added all CPU to GPU images (#12749) 2025-04-10 01:17:12 +02:00
Piotr Kubaj 31f7803bc4 ggml-cpu-impl.h: do not redefine bool on POWER9 (#12856)
error: unknown type name '_Bool'
2025-04-10 01:00:34 +02:00
Piotr Kubaj 2391506ace ggml-impl.h: fix build on POWER9 (#12855)
error: ISO C++17 does not allow 'register' storage class specifier
2025-04-10 01:00:25 +02:00
Bo Zheng d3bd7193ba llama : Support Qwen3 and Qwen3MoE (#12828)
* add qwen3 & qwen3moe support.

* fix

---------

Co-authored-by: bozheng-hit <dsoul0621@gmail.com>
2025-04-09 11:47:36 +02:00
R0CKSTAR d9a63b2f2e musa: enable freediskspace for docker image build (#12839)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-04-09 11:22:30 +02:00
Romain Biessy 8ed71242f4 sycl: update documentation to use -no-cnv (#12845) 2025-04-09 11:22:04 +02:00
Plamen Minev 381603a775 ci: detach common from the library (#12827)
* fix: detach common from the library

* fix: building chat test template
2025-04-09 10:11:11 +02:00
Xuan-Son Nguyen 65a69e6e1b clip : do not print ftype (#12832) 2025-04-09 10:09:53 +02:00
Georgi Gerganov 47277d6d1d readme : add rpc backend (#12842) 2025-04-09 10:54:42 +03:00
Chenguang Li 6e1c4cebdb CANN: Support Opt CONV_TRANSPOSE_1D and ELU (#12786)
* [CANN] Support ELU and CONV_TRANSPOSE_1D

* [CANN]Modification review comments

* [CANN]Modification review comments

* [CANN]name adjustment

* [CANN]remove lambda used in template

* [CANN]Use std::func instead of template

* [CANN]Modify the code according to the review comments

---------

Signed-off-by: noemotiovon <noemotiovon@gmail.com>
2025-04-09 14:04:14 +08:00
Jeff Bolz 0090950f67 vulkan: In coopmat2 mmq, load q4_k/q5_k scales through shared memory (#12833)
q4_k and q5_k had a lot of redundant global loads where the same 16B of
scale information is repeatedly loaded and decoded during each loop iteration.
This change restructures the loops to more explicitly iterate over whole
blocks in the outer loop (with unrolled inner loop) and to copy/decode the
scale data into shared memory once at the start of each outer loop. The copy
is pipelined so the scale load from global memory is relatively cheap.

This improves q4_k/q5_k model prompt processing performance by around 5-7%.
I briefly tried applying this to q6_k and q4_0, and it didn't help for q6_k
and hurt for q4_0.

The big "else" path in mul_mm_cm2.comp that had all the clamped/unclamped
variants isn't used as often as it originally was (e.g. due to the padded_N
change), so I trimmed it down to offset some of the new complexity of the
semi-manual loop unrolling.
2025-04-09 07:25:08 +02:00
Jeff Bolz 7ecd780b1a vulkan: Use fp16 for the flash attention P*V multiplication (#12783)
This is consistent with the ggml-cuda behavior and the mul_mat fallback.
2025-04-09 07:12:57 +02:00
Sigbjørn Skjæret 7538246e7c cuda : add f32 to bf16 copy op (#12806)
This allows BF16 KV-cache on CUDA.
2025-04-08 23:21:31 +02:00
Matt Clayton b32efad2bc llava: improve clip_ctx destructor to not memleak load_image_size (#12834) 2025-04-08 22:01:58 +02:00
Georgi Gerganov a19b5cef16 llama : fix FA when KV cache is not used (i.e. embeddings) (#12825)
* ggml : FA supports F32 V

* graph : cast KV to F16 when the KV cache is not used

ggml-ci

* server : add test that exercises embeddings with FA enabled

ggml-ci
2025-04-08 19:54:51 +03:00
Xuan-Son Nguyen 78a1ba0a4f server : fix thread.join() on exit (#12831) 2025-04-08 18:37:06 +02:00
dm4 2dabf759e7 llava: add more helper functions to check projector types in clip context (#12824)
Signed-off-by: dm4 <sunrisedm4@gmail.com>
2025-04-08 15:49:13 +02:00
Prajwal B Mehendarkar 1d343b4069 arg : Including limits file on AIX (#12822) 2025-04-08 14:30:59 +02:00
characharm 8ca6e1c3a4 server : webui : Improve Chat Input with Auto-Sizing Textarea (#12785)
* Update ChatScreen.tsx

* useAutosizeTextarea.ts

useAutosizeTextarea to encapsulate the logic.

* Implement responsive auto-sizing chat textarea

Replaces the manual textarea resizing with an automatic height adjustment based on content.

- `useChatTextarea` hook to manage textarea state and auto-sizing logic via refs, preserving the optimization
- Textarea now grows vertically up to a maximum height (`lg:max-h-48`) on large screens (lg breakpoint and up).
- Disables auto-sizing and enables manual vertical resizing (`resize-vertical`) on smaller screens for better mobile usability.
- Aligns the "Send" button to the bottom of the textarea (`items-end`) for consistent positioning during resize.

* -update compressed index.html.gz after npm run build
-refactor: replace OptimizedTextareaValue with AutosizeTextareaApi in VSCode context hook

* chore: normalize line endings to LF
refactor: AutosizeTextareaApi -> chatTextareaApi

* refactor: Rename interface to PascalCase

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-04-08 11:14:59 +02:00
Neo Zhang Jianyu 656babd6c2 Revert "sycl:remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor" (#12812)
* Revert "sycl: remove redundant memcopy in function ggml_backend_sycl_buffer_s…"

This reverts commit 518a01480e.

* Update ggml/src/ggml-sycl/ggml-sycl.cpp

* Update ggml/src/ggml-sycl/ggml-sycl.cpp

* rm tail space
2025-04-08 15:03:21 +08:00
compilade a226bc7a9a gguf-py : support lazy tensor splitting (#12809)
* gguf-py : support lazy tensor splitting

Splitting usually involves returning tuples of tensors,
which need to be handled properly to avoid early eager evaluation.

* gguf-py : fix flake8 lint
2025-04-08 09:03:07 +02:00
Xuan-Son Nguyen 1466621e73 llama : Support llama 4 text-only (#12791)
* llama4 conversion

* initial support, no chat template

* clean up a bit

* fix tokenizer conversion

* correct hparams

* try this

* fix shexp

* ffn_inp_normed

* chat template

* clean up model conversion

* add_bos

* add scale_before_ffn

* fix order

* weight_before_ffn

* llm_graph_input_attn_temp

* add chunk attn mask

* build_inp_attn_scale()

* add comment about ggml_repeat

* clarify comments

* fix build
2025-04-07 23:06:44 +02:00
lhez 82974011f3 opencl: better identify Adreno GPU (#12760) 2025-04-07 13:22:54 -07:00
stduhpf 4ccea213bc hellaswag: display estimated score confidence interval (#12797) 2025-04-07 18:47:08 +03:00
Georgi Gerganov 1a1ab7e7a4 cuda : fix HIP and MUSA BF16 (#0)
ggml-ci
2025-04-07 18:44:17 +03:00
Georgi Gerganov a4e46e28f9 sync : ggml
ggml-ci
2025-04-07 18:44:17 +03:00
Georgi Gerganov ff067dbcb9 ggml : simplify Arm fp16 CPU logic (ggml/1177)
* ggml : simlpify Arm fp16 CPU logic

ggml-ci

* cont : bring back CUDA/MUSA checks

ggml-ci
2025-04-07 18:44:17 +03:00
Sigbjørn Skjæret 36ca8b3628 CUDA: don't convert BF16 weights to FP32 (ggml/1174)
* add bf16 support

* use convert_from_bf16_cuda instead of convert_unary_cuda for f32

* revert 7ec5085

* move functionality into convert_unary with constexpr
2025-04-07 18:44:17 +03:00
cmdr2 995083e4ed cpu: move all the operators into a separate c++ file (except mul_mat) (ggml/1167)
* cpu: refactor SIMD mappings and vectorized op functions into separate files

* Fix warning for ggml_float to float

* Fix warnings

* cpu: move all the operations (except mul_mat) to a separate c++ file

* fix whitespace

* Update ggml/src/ggml-cpu/vec.h

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* Fix PR comments - use GGML_UNUSED, use cassert in ops.cpp

* Reverse the order of import for ops.h and vec.h, to match what was present in ggml-cpu.c previously

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-04-07 18:44:17 +03:00
zhouwg 518a01480e sycl: remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor (#12734) 2025-04-07 17:22:57 +02:00
Xuan-Son Nguyen e391d3ee8d ci : no curl on ggml-ci (#12796) 2025-04-07 15:37:28 +03:00
Xuan-Son Nguyen bd3f59f812 cmake : enable curl by default (#12761)
* cmake : enable curl by default

* no curl if no examples

* fix build

* fix build-linux-cross

* add windows-setup-curl

* fix

* shell

* fix path

* fix windows-latest-cmake*

* run: include_directories

* LLAMA_RUN_EXTRA_LIBS

* sycl: no llama_curl

* no test-arg-parser on windows

* clarification

* try riscv64 / arm64

* windows: include libcurl inside release binary

* add msg

* fix mac / ios / android build

* will this fix xcode?

* try clearing the cache

* add bunch of licenses

* revert clear cache

* fix xcode

* fix xcode (2)

* fix typo
2025-04-07 13:35:19 +02:00
zhouwg 52b3d71f12 CANN: fix typo in ggml-cann (#12733) 2025-04-07 19:34:14 +08:00
hipudding d0d5b2232b CANN: Refactor to reduce duplicate code (#12731)
* CANN: Refactor to reduce duplicate code

* CANN: fix review comment
2025-04-07 17:10:36 +08:00
R0CKSTAR 916c83bfe7 musa: fix compilation warnings in mp_22/31 (#12780)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-04-06 15:23:54 +02:00
Jeff Bolz 0c74b04376 vulkan: fix NaN issue in flash attention shader (#12776)
Use -FLT_MAX/2 rather than -inf as the initial value for computing the maximum.
2025-04-06 11:03:47 +02:00
Jeff Bolz 80b717d493 vulkan: Use unclamped loads for flash attention mask (#12720)
nem1 must be a multiple of GGML_KQ_MASK_PAD, and GGML_KQ_MASK_PAD is a multiple
of the number of rows in the matrix. The KV dim is a multiple of the number of
columns for the aligned shader.
2025-04-06 10:47:13 +02:00
0cc4m 6bf28f0111 Vulkan: Tune Vulkan mmq int dot shader for performance (#12767) 2025-04-05 18:04:03 +02:00
Sergey Fedorov f1e3eb4249 common : fix includes in arg.cpp and gemma3-cli.cpp (#12766)
* arg.cpp: add a missing include

* gemma3-cli.cpp: fix cinttypes include
2025-04-05 17:46:00 +02:00
Xuan-Son Nguyen 0364178ca2 clip : refactor clip_init, add tests (#12757)
* refactor clip_init

* fix loading file

* fix style

* test ok

* better test with report

* add missing headers

* clarify

* add KEY_MM_PATCH_MERGE_TYPE

* remove bool has_* pattern

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update examples/llava/clip.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* use ggml_soft_max_ext

* refactor logging system

* add minicpm-v-o 2.6 for testing

* use nullptr everywhere

* fix Yi-VL model

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-04-05 17:17:40 +02:00
エシュナヴァリシア c6ff5d2a8d common: custom hf endpoint support (#12769)
* common: custom hf endpoint support

Add support for custom huggingface endpoints via HF_ENDPOINT environment variable

You can now specify a custom huggingface endpoint using the HF_ENDPOINT environment variable when using the --hf-repo flag, which works similarly to huggingface-cli's endpoint configuration.

Example usage:
HF_ENDPOINT=https://hf-mirror.com/ ./bin/llama-cli --hf-repo Qwen/Qwen1.5-0.5B-Chat-GGUF --hf-file qwen1_5-0_5b-chat-q2_k.gguf -p "The meaning to life and the universe is"

The trailing slash in the URL is optional:
HF_ENDPOINT=https://hf-mirror.com ./bin/llama-cli --hf-repo Qwen/Qwen1.5-0.5B-Chat-GGUF --hf-file qwen1_5-0_5b-chat-q2_k.gguf -p "The meaning to life and the universe is"

* Update common/arg.cpp

readability Improvement

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Apply suggestions from code review

---------

Co-authored-by: ベアトリーチェ <148695646+MakiSonomura@users.noreply.github.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-04-05 15:31:42 +02:00
Olivier Chafik 7a84777f42 sync: minja (#12739)
* sync: minja

https://github.com/google/minja/pull/57

* fix json include
2025-04-04 21:16:39 +01:00
Georgi Gerganov 3e1d29348b kv-cache : simplify + fix warning for recurrent models (#12756)
ggml-ci
2025-04-04 21:48:10 +03:00
bandoti 1be76e4620 ci: add Linux cross-compile build (#12428) 2025-04-04 14:05:12 -03:00
Nauful Shaikh b772394297 server : webui : Upgrade daisyui, tailwindcss. (#12735)
* Upgrade daisyui, tailwindcss.

* Switch to all themes.

* Revert a change.

* Update formatting.

* Install packages before npm build.

* Revert "Install packages before npm build."

This reverts commit 336c5147e6.

* Add index.html.gz

* run build

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-04-04 16:09:52 +02:00
nick huang 23106f94ea gguf-split : --merge now respects --dry-run option (#12681)
* gguf-split now respects dry-run option

* removing trailing space
2025-04-04 16:09:12 +02:00
Nicolò Scipione 94148ba330 sycl: allow ggml-sycl configuration and compilation using Visual Studio project/solution (#12625) 2025-04-04 16:00:46 +02:00
Ronny Brendel 9ac4d611d0 cmake: fix ggml-shaders-gen compiler paths containing spaces (#12747)
fixes error for compiler paths with spaces
2025-04-04 10:12:40 -03:00
Daniel Bevenius 348888e0dc docs : add XCFramework section to README.md [no ci] (#12746)
This commit adds a new section to the README.md file, detailing the
usage of the XCFramework.

The motivation for this is that it might not be immediately clear to
users how to use the XCFramework in their projects and hopefully this
will help.
2025-04-04 10:24:12 +02:00
Jeff Bolz 74d4f5b041 vulkan: Hybrid waitForFences/getFenceStatus to reduce fence latency (#12630)
There seems to be a bubble waking up from waitForFences, which costs a few
percent performance and also increased variance in performance. This change
inserts an "almost_ready" fence when the graph is about 80% complete and we
waitForFences for the almost_ready fence and then spin (with _mm_pauses) waiting
for the final fence to be signaled.
2025-04-04 07:54:35 +02:00
Jeff Bolz 35e592eb30 vulkan: set cmake minimum and project name in vulkan-shaders (#12744) 2025-04-04 07:53:20 +02:00
lhez 7d7b1bafa7 opencl: update doc for OpenCL (#12702)
* opencl: add OpenCL to build.md

* opencl: remove fixed issue/TODO

* opencl: add link to OPENCL.md

* opencl: update doc - refine tools requirement for Windows 11 arm64
2025-04-03 22:18:17 -07:00
Gaurav Garg c262beddf2 CUDA: Prefer vector flash decoding kernel for Gemma models (#12738)
* Prefer vector flash decoding kernel for Gemma models

Vector flash decoding kernel was not being picked for models with head dimension 256. Gemma models are in this category.
Removing this limit improves e2e performance by upto 12% in gen phase throughput for Gemm models.

* Update ggml/src/ggml-cuda/fattn.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-04-03 18:20:29 +02:00
yumeyao 5dd5d1ab00 vocab : use string_view::find() to avoid unnecessary looking up beyond the fragment range (#12706) 2025-04-03 18:32:54 +03:00
Jeff Bolz 1c059995e0 vulkan: Fix missing cmake logic for dot product extension (#12721) 2025-04-03 10:08:26 -05:00
Atharva Dubey 2004644b7a ci : add env variable in ggml-ci and document the same in SYCL.md (#12736) 2025-04-03 15:12:39 +03:00
R0CKSTAR 5f696e88e0 sync : minja (inclusionAI/Ling) and update tests (#12699)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-04-03 13:51:35 +02:00
a3sh 193c3e03a6 fix MUSA compiler warning (#12704)
* fix MUSA compiler warning

* replace (void) with GGML_UNUSED
2025-04-03 09:32:55 +02:00
Chenguang Li 65cfe136a0 CANN: Support operator SIN COS ARGMAX (#12709)
* [CANN]support sin cos argmax

Signed-off-by: noemotiovon <noemotiovon@gmail.com>

* [CANN]codestyle adjustment

Signed-off-by: noemotiovon <noemotiovon@gmail.com>

* [CANN]Remove redundant code

Signed-off-by: noemotiovon <noemotiovon@gmail.com>

---------

Signed-off-by: noemotiovon <noemotiovon@gmail.com>
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2025-04-03 15:18:08 +08:00
Alan Gray 3f9da22c2b Simplify and improve CUDA graphs through use of indirect copy pointers (#9017)
* CUDA: Simplify and improve CUDA graphs through use of indirect copy pointers

Previously there was complexity in the CUDA graphs implementation due
frequently changing parameters to copy kernels associated with K and V
cache pointers. This patch simplifies by using indirection to avoid
such parameters frequently changing, avoiding the need for frequent
graph updates.

Fixes #12152

* Addressed comments

* fix HIP builds

* properly sync to stream

* removed ggml_cuda_cpy_fn_ptrs

* move stream sync before free

* guard to only use indirection with graphs

* style fixes

* check for errors

---------

Co-authored-by: slaren <slarengh@gmail.com>
2025-04-03 03:31:15 +02:00
hipudding 2a0dc97e56 CANN: Fix failed test cases (#12708)
* CANN: Fix memory waste in aclnn_tensor

* CANN: fix backend ops fail

* CANN: fix acl_tensor memory alloc.

* CANN: format

* CANN: remove trailing whitespace
2025-04-03 08:49:51 +08:00
lhez 97a20c012b opencl: use max_alloc_size in backend ctx instead of querying again (#12705) 2025-04-02 17:01:42 -07:00
Jeff Bolz f01bd02376 vulkan: Implement split_k for coopmat2 flash attention. (#12627)
When using group query attention, we have one workgroup per KV batch and this
can be very few workgroups (e.g. just 8 in some models). Enable split_k to
spread the work across SMs. This helps a lot when the KV cache is large.
2025-04-02 14:25:08 -05:00
bandoti 6f3bd38640 cmake: remove caching from vulkan coopmat checks (#12719) 2025-04-02 14:56:26 -03:00
Jeff Bolz be0a0f8cae vulkan: Implement grouped query attention in the coopmat2 FA shader (#12559)
When adjacent batches of Q share the same batches of K/V, batch them into
the same workgroup. For example, when:

dst(128,32,1,1) = FA(q(128,1,32,1), k(128,16640,8,1), v(128,16640,8,1))

previously we would run 32 workgroups computing 1 result each, now we will
run 8 workgroups computing 4 results each.

This doesn't directly translate to better performance (at least when you have
>=32 SMs), but in a subsequent change I'll enable split_k which will scale much
better with 4x fewer workgroups.
2025-04-02 19:40:32 +02:00
0cc4m 92e3006bb6 Vulkan: Fix mmq int dot float cache size (#12722) 2025-04-02 19:12:30 +02:00
Georgi Gerganov 833e2b7409 model : print tensor size during load (#12711)
* model : print tensor size during load

* cont : fix units MB -> MiB

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-04-02 16:38:54 +03:00
Diego Devesa e0e912f49b llama : add option to override model tensor buffers (#11397)
* llama : add option to override tensor buffers

* ggml : fix possible underflow in ggml_nbytes
2025-04-02 14:52:01 +02:00
Georgi Gerganov a10b36c91a llama : refactor kv cache guard (#12695)
* llama : refactor kv cache guard

ggml-ci

* cont : fix comment [no ci]

* llama : fix kv_cache restore logic

ggml-ci

* context : simplify kv cache updates

ggml-ci

* cont : better name [no ci]

* llama : fix llama_decode return code when could not find KV slot

ggml-ci

* context : change log err -> warn [no ci]

* kv-cache : add comment + warning
2025-04-02 14:32:59 +03:00
Sigbjørn Skjæret 83a88bd6af vocab : BailingMoE : change possessive quantifiers to greedy (#12677) 2025-04-02 11:21:48 +02:00
Xuan-Son Nguyen 42eb248f46 common : remove json.hpp from common.cpp (#12697)
* common : remove json.hpp from common.cpp

* fix comment
2025-04-02 09:58:34 +02:00
Chenguang Li 9bacd6b374 [CANN] get_rows and dup optimization (#12671)
* [CANN]get_rows and dup optimization.

Co-authored-by: hipudding <huafengchun@gmail.com>
Signed-off-by: noemotiovon <noemotiovon@gmail.com>

* [CANN]GET_ROWS and CPY/DUP optimization

Co-authored-by: hipudding <huafengchun@gmail.com>
Signed-off-by: noemotiovon <noemotiovon@gmail.com>

* [CANN]code style adjustment

Signed-off-by: noemotiovon <noemotiovon@gmail.com>

* [CANN]code style adjustment

Signed-off-by: noemotiovon <noemotiovon@gmail.com>

* [CANN]code style adjustment

Signed-off-by: noemotiovon <noemotiovon@gmail.com>

* [CANN]code style adjustment

Signed-off-by: noemotiovon <noemotiovon@gmail.com>

---------

Signed-off-by: noemotiovon <noemotiovon@gmail.com>
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
Co-authored-by: hipudding <huafengchun@gmail.com>
2025-04-02 15:22:13 +08:00
Xuan-Son Nguyen 267c1399f1 common : refactor downloading system, handle mmproj with -hf option (#12694)
* (wip) refactor downloading system [no ci]

* fix all examples

* fix mmproj with -hf

* gemma3: update readme

* only handle mmproj in llava example

* fix multi-shard download

* windows: fix problem with std::min and std::max

* fix 2
2025-04-01 23:44:05 +02:00
Junil Kim f423981ac8 opencl : fix memory allocation size (#12649)
issue:
https://github.com/CodeLinaro/llama.cpp/pull/17#issuecomment-2760611283

This patch fixes the memory allocation size
not exceeding the maximum size of the OpenCL device.
2025-04-01 09:54:34 -07:00
jklincn e39e727e9a llama : use LLM_KV_GENERAL_FILE_TYPE instead of gguf_find_key (#12672) 2025-04-01 14:54:28 +02:00
Sigbjørn Skjæret 5936a616e4 convert : BailingMoE : fix qkv split when head_dim is 0 (#12687)
NOTE: Ling-lite-base is broken, see https://huggingface.co/inclusionAI/Ling-lite-base/discussions/2
2025-04-01 14:37:13 +02:00
Georgi Gerganov 3fd072a540 metal : use F32 prec in FA kernels (#12688)
* metal : use F32 prec in FA kernels

ggml-ci

* cont : fix FA vec kernel

ggml-ci
2025-04-01 14:57:19 +03:00
R0CKSTAR a6f32f0b34 Fix clang warning in gguf_check_reserved_keys (#12686)
* Fix clang warning in gguf_check_reserved_keys

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Fix typo

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-04-01 13:12:53 +02:00
Wagner Bruna 2bb3597e42 vulkan: fix build when glslc doesn't support coopmat (#12683) 2025-04-01 11:38:07 +02:00
Romain Biessy 8293970542 SYCL: Rename oneMKL to oneMath (#12192)
* Rename oneMKL Interface to oneMath

* Use oneMath for Intel vendor

* Rename occurences to mkl

* clang-format

* Silence verbose warnings

* Set oneMath HIP_TARGETS

* Fix silence warnings

* Remove step to build oneMath from build instructions

* Use fixed oneMath version

* Remove INTEL_CPU

* Fold CMake oneDNN conditions

* Use Intel oneMKL for Intel devices

* Improve CMake message

* Link against MKL::MKL_SYCL::BLAS only

* Move oneMath documentation to Nvidia and AMD sections
2025-04-01 16:24:29 +08:00
Akarshan Biswas 8bbf26083d SYCL: switch to SYCL namespace (#12674) 2025-04-01 10:11:39 +02:00
Sigbjørn Skjæret 35782aeedb convert : BailingMoE : avoid setting rope_dim to 0 (#12678) 2025-03-31 23:09:48 +02:00
Daniel Bevenius c80a7759da vocab : add special infill tokens for CodeLlama (#11850)
* vocab : add special infill tokens for CodeLlama

The commit adds the following special tokens for CodeLlama infill:
- `▁<PRE>`
- `▁<SUF>`
- `▁<MID>`

The motivation for this is that currently the infill example uses
CodeLlama as a suggested model. But when using this model the following
error is generated:
```console
/llama.cpp-debug/examples/infill/infill.cpp:165: GGML_ASSERT(llama_vocab_fim_pre(vocab) >= 0) failed

Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
305251 Aborted                 (core dumped)
./build/bin/llama-infill -t 10 -ngl 0 -m models/codellama-13b.Q5_K_S.gguf \
  -c 4096 --temp 0.7 --repeat_penalty 1.1 -n 20 \
  --in-prefix "def helloworld():\n    print(\"hell" \
  --in-suffix "\n   print(\"goodbye world\")\n    "
```

* squash! vocab : add special infill tokens for CodeLlama

Add _<EOT> as well.
2025-03-31 18:40:56 +02:00
a3sh 250d7953e8 ggml : faster ssm scan (#10558)
* faster ssm_scan

* delete unused commnet

* clang format

* add space

* modify unnecessary calculations

* faster ssm conv implementatioin

* modify file name with dash
2025-03-31 18:05:13 +02:00
Sigbjørn Skjæret 403fbacbbc convert : Qwerky : use lora_rank_tokenshift and lora_rank_decay if present (#12667) 2025-03-31 16:36:25 +02:00
0cc4m a8a1f33567 Vulkan: Add DP4A MMQ and Q8_1 quantization shader (#12135)
* Vulkan: Add DP4A MMQ and Q8_1 quantization shader

* Add q4_0 x q8_1 matrix matrix multiplication support

* Vulkan: Add int8 coopmat MMQ support

* Vulkan: Add q4_1, q5_0 and q5_1 quants, improve integer dot code

* Add GL_EXT_integer_dot_product check

* Remove ggml changes, fix mmq pipeline picker

* Remove ggml changes, restore Intel coopmat behaviour

* Fix glsl compile attempt when integer vec dot is not supported

* Remove redundant code, use non-saturating integer dot, enable all matmul sizes for mmq

* Remove redundant comment

* Fix integer dot check

* Fix compile issue with unsupported int dot glslc

* Update Windows build Vulkan SDK version
2025-03-31 14:37:01 +02:00
Georgi Gerganov 1790e73157 cmake : fix whitespace (#0) 2025-03-31 15:07:32 +03:00
Georgi Gerganov 0114a32da0 sync : ggml
ggml-ci
2025-03-31 15:07:32 +03:00
Sandro Hanea a7724480fd cmake: improve Vulkan cooperative matrix support checks (whisper/2966)
Co-authored-by: Sandro Hanea <me@sandro.rocks>
2025-03-31 15:07:32 +03:00
Sigbjørn Skjæret 1a85949067 llava : proper description fix (#12668) 2025-03-31 11:28:30 +02:00
Akarshan Biswas 6c02a032fa SYCL: Remove misleading ggml_sycl_op_flatten function (#12387)
* SYCL: Remove misleading ggml_sycl_op_flatten function

* remove trailing whitespace

* Fix L2 norm from rebase

* remove try catch block from element_wise.cpp

* remove comment from common.hp

* ggml-sycl.cpp: Add try catch sycl::exception block in compute_forward

* norm.cpp: remove try catch exception block
2025-03-31 11:25:24 +02:00
Sigbjørn Skjæret f52d59d771 llava : fix clip loading GGUFs with missing description (#12660) 2025-03-31 11:07:07 +02:00
marcoStocchi 52de2e5949 tts : remove printfs (#12640)
* tts.cpp : llama tokens console output is done using LOG_INF instead of printf(). Therefore the options '--log-disable' and '--log-file' have now uniform impact on all output.
2025-03-31 11:20:30 +03:00
Sigbjørn Skjæret 2c3f8b850a llama : support BailingMoE (Ling) (#12634) 2025-03-30 22:21:03 +02:00
Georgi Gerganov 4663bd353c metal : use constexpr in FA kernels + fix typedef (#12659)
* metal : use constexpr in FA kernels

ggml-ci

* cont

ggml-ci

* cont : fix typedef

ggml-ci
2025-03-30 22:04:04 +03:00
Juyoung Suk b3de7cac73 llama : add Trillion 7B model support (#12556)
* Support Trillion 7B

* Update llama.h

* Update llama.h

* Update llama-vocab.cpp for Trillion

* Update llama-vocab.cpp
2025-03-30 20:38:33 +02:00
Sergei Vorobyov 7242dd9675 llama-chat : Add Yandex instruct model template support (#12621)
* add yandex template

* update yandex chat template

* fix tests

* adjust chat template

* fix style

* fix tool macro in template

* add clarify comment

---------

Co-authored-by: Sergei Vorobev <serv01@yandex-team.ru>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-03-30 20:12:03 +02:00
R0CKSTAR 492d7f1ff7 musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611)
* musa: fix all warnings

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: enable -DLLAMA_FATAL_WARNINGS=ON in run.sh

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: update ci doc (install ccache)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* fix Windows build issue

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Address review comments

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Address review comments

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-30 10:59:38 +02:00
Georgi Gerganov d3f1f0acfb sync : ggml
ggml-ci
2025-03-30 08:33:31 +03:00
Xuan-Son Nguyen 360dc22c00 cpu : rm unused variable (ggml/1166) 2025-03-30 08:33:31 +03:00
cmdr2 a62d7fa7a9 cpu: de-duplicate some of the operators and refactor (ggml/1144)
* cpu: de-duplicate some of the operators and refactor

* Fix PR comments

* Fix PR comments
2025-03-30 08:33:31 +03:00
Daniel Bevenius e408d4351a ggml : add logging for native build options/vars (whisper/2935)
This commit adds debug level logging for the native build options and
variables to ggml/CMakeLists.txt.

The motivation for this is that it can be useful to see the effective
result of `GGML_NATIVE`, `GGML_NATIVE_DEFAULT`, and `INS_ENB` for a
cmake build. I've found myself adding similar logging a few times now,
so I thought it might be a good idea to add this.

Example output, specifying `-DCMAKE_MESSAGE_LOG_LEVEL=DEBUG` when
running cmake produces the following output:
```console
-- GGML_NATIVE         : OFF
-- GGML_NATIVE_DEFAULT : OFF
-- INS_ENB             : OFF
```
2025-03-30 08:33:31 +03:00
Daniel Bevenius 3891e183c6 examples : command.wasm updates (whisper/2904)
This commit updates the command.wasm example by adding a server.py script to make it easy to start a local http server to try out the example, updates the build instructions, and also addresses some of the compiler warnings that were being generated.

* emscripten : fix TOTAL_STACK for wasm

This commit moves the TOTAL_STACK setting from the compile flags to the
linker flags. This is because the TOTAL_STACK setting is a linker
setting.

The motivation for this change is that currently the following warnings
are generated when building:
```console
em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument]
em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument]
em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument]
em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument]
em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument]
em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument]
```

* examples : suppress C++17 deprecation warning for std::codecvt_utf8

This commit suppresses the C++17 deprecation warning for
std::codecvt_utf8 similar to what is done in
examples/talk-llama/unicode.cpp.

The motivation for this change is to suppress these warnings:
```console
/Users/danbev/work/ai/whisper-work/examples/common.cpp:251:31: warning: 'codecvt_utf8<wchar_t>' is deprecated [-Wdeprecated-declarations]
  251 |     std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
      |                               ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/codecvt:193:28: note: 'codecvt_utf8<wchar_t>' has been explicitly marked deprecated here
  193 | class _LIBCPP_TEMPLATE_VIS _LIBCPP_DEPRECATED_IN_CXX17 codecvt_utf8 : public __codecvt_utf8<_Elem> {
      |                            ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:723:41: note: expanded from macro '_LIBCPP_DEPRECATED_IN_CXX17'
  723 | #    define _LIBCPP_DEPRECATED_IN_CXX17 _LIBCPP_DEPRECATED
      |                                         ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:688:49: note: expanded from macro '_LIBCPP_DEPRECATED'
  688 | #      define _LIBCPP_DEPRECATED __attribute__((__deprecated__))
      |                                                 ^
/Users/danbev/work/ai/whisper-work/examples/common.cpp:251:10: warning: 'wstring_convert<std::codecvt_utf8<wchar_t>>' is deprecated [-Wdeprecated-declarations]
  251 |     std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
      |          ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/locale:3145:28: note: 'wstring_convert<std::codecvt_utf8<wchar_t>>' has been explicitly marked deprecated here
 3145 | class _LIBCPP_TEMPLATE_VIS _LIBCPP_DEPRECATED_IN_CXX17 wstring_convert {
      |                            ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:723:41: note: expanded from macro '_LIBCPP_DEPRECATED_IN_CXX17'
  723 | #    define _LIBCPP_DEPRECATED_IN_CXX17 _LIBCPP_DEPRECATED
      |                                         ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:688:49: note: expanded from macro '_LIBCPP_DEPRECATED'
  688 | #      define _LIBCPP_DEPRECATED __attribute__((__deprecated__))
      |                                                 ^
/Users/danbev/work/ai/whisper-work/examples/common.cpp:257:31: warning: 'codecvt_utf8<wchar_t>' is deprecated [-Wdeprecated-declarations]
  257 |     std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
      |                               ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/codecvt:193:28: note: 'codecvt_utf8<wchar_t>' has been explicitly marked deprecated here
  193 | class _LIBCPP_TEMPLATE_VIS _LIBCPP_DEPRECATED_IN_CXX17 codecvt_utf8 : public __codecvt_utf8<_Elem> {
      |                            ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:723:41: note: expanded from macro '_LIBCPP_DEPRECATED_IN_CXX17'
  723 | #    define _LIBCPP_DEPRECATED_IN_CXX17 _LIBCPP_DEPRECATED
      |                                         ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:688:49: note: expanded from macro '_LIBCPP_DEPRECATED'
  688 | #      define _LIBCPP_DEPRECATED __attribute__((__deprecated__))
      |                                                 ^
/Users/danbev/work/ai/whisper-work/examples/common.cpp:257:10: warning: 'wstring_convert<std::codecvt_utf8<wchar_t>>' is deprecated [-Wdeprecated-declarations]
  257 |     std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
      |          ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/locale:3145:28: note: 'wstring_convert<std::codecvt_utf8<wchar_t>>' has been explicitly marked deprecated here
 3145 | class _LIBCPP_TEMPLATE_VIS _LIBCPP_DEPRECATED_IN_CXX17 wstring_convert {
      |                            ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:723:41: note: expanded from macro '_LIBCPP_DEPRECATED_IN_CXX17'
  723 | #    define _LIBCPP_DEPRECATED_IN_CXX17 _LIBCPP_DEPRECATED
      |                                         ^
/Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:688:49: note: expanded from macro '_LIBCPP_DEPRECATED'
  688 | #      define _LIBCPP_DEPRECATED __attribute__((__deprecated__))
      |                                                 ^
4 warnings generated.
```

* ggml : suppress double-promotion warning in GGML_F16x4_REDUCE

This commit adds a cast to `ggml_float` in the `GGML_F16x4_REDUCE` macro
to suppress a double-promotion warning.

Currently the following warning is generated when compiling the
command.wasm example:
```console
/whisper-work/src/ggml-cpu/ggml-cpu.c:1592:5: warning: implicit conversion increases floating-point precision: 'float' to 'ggml_float' (aka 'double') [-Wdouble-promotion]
 1592 |     GGML_F16_VEC_REDUCE(sumf, sum);
      |     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Users/danbev/work/ai/whisper-work/src/ggml-cpu/ggml-cpu.c:932:37: note: expanded from macro 'GGML_F16_VEC_REDUCE'
  932 | #define GGML_F16_VEC_REDUCE         GGML_F16x4_REDUCE
      |                                     ^
/Users/danbev/work/ai/whisper-work/src/ggml-cpu/ggml-cpu.c:920:44: note: expanded from macro 'GGML_F16x4_REDUCE'
  918 |     res = wasm_f32x4_extract_lane(x[0], 0) +       \
      |         ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  919 |           wasm_f32x4_extract_lane(x[0], 1) +       \
      |           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  920 |           wasm_f32x4_extract_lane(x[0], 2) +       \
      |           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~
  921 |           wasm_f32x4_extract_lane(x[0], 3);        \
      |           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/whisper-work/src/ggml-cpu/ggml-cpu.c:1640:9: warning: implicit conversion increases floating-point precision: 'float' to 'ggml_float' (aka 'double') [-Wdouble-promotion]
 1640 |         GGML_F16_VEC_REDUCE(sumf[k], sum[k]);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Users/danbev/work/ai/whisper-work/src/ggml-cpu/ggml-cpu.c:932:37: note: expanded from macro 'GGML_F16_VEC_REDUCE'
  932 | #define GGML_F16_VEC_REDUCE         GGML_F16x4_REDUCE
      |                                     ^
/Users/danbev/work/ai/whisper-work/src/ggml-cpu/ggml-cpu.c:920:44: note: expanded from macro 'GGML_F16x4_REDUCE'
  918 |     res = wasm_f32x4_extract_lane(x[0], 0) +       \
      |         ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  919 |           wasm_f32x4_extract_lane(x[0], 1) +       \
      |           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  920 |           wasm_f32x4_extract_lane(x[0], 2) +       \
      |           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~
  921 |           wasm_f32x4_extract_lane(x[0], 3);        \
      |           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2 warnings generated.
```
wasm_f32x4_extract_lane returns a 32-bit float and this is what the
addition is performed on. But there is an implicit conversion from
32-bit float to 64-bit double when the result is assigned to `res`,
which is of type `ggml_float`. My understanding here is that this is
intentional and adding a cast to `ggml_float` should suppress the
warning.

* emscripten : add -Wno-deprecated to for emscripten

This commit adds -Wno-deprecated to the CMAKE_CXX_FLAGS for emscripten
builds.

The motivation for this is that currently there a number of warnings
generated like the following:
```console
warning: JS library symbol '$print' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated]
warning: JS library symbol '$printErr' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated]
em++: warning: warnings in JS library compilation [-Wjs-compiler]
em++: warning: linker setting ignored during compilation: 'ENVIRONMENT' [-Wunused-command-line-argument]
warning: JS library symbol '$print' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated]
warning: JS library symbol '$printErr' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated]
em++: warning: warnings in JS library compilation [-Wjs-compiler]
warning: JS library symbol '$print' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated]
warning: JS library symbol '$printErr' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated]
em++: warning: warnings in JS library compilation [-Wjs-compiler]
em++: warning: linker setting ignored during compilation: 'ENVIRONMENT' [-Wunused-command-line-argument]
em++: warning: linker setting ignored during compilation: 'ENVIRONMENT' [-Wunused-command-line-argument]
```

The downside of this is that we might miss other deprecation warnings
in the future so I'm not sure if this is acceptable. But it make the
wasm examples cleaner without the warnings.

* examples : fix tautological-compare warning in stb_vorbis.c [no ci]

This commit applies a fix to address a tautological-compare warning
in stb_vorbis.c.

The motivation for this is that currently the following warning is
generated when compiling the commmand-wasm example:
```console
/Users/danbev/work/ai/whisper-work/examples/stb_vorbis.c:1404:75: warning: pointer comparison always evaluates to false [-Wtautological-compare]
 1404 |       if (f->stream_start + loc >= f->stream_end || f->stream_start + loc < f->stream_start) {
      |                                                                           ^
1 warning generated.
```

This fix was taken from an open pull request on the stb repository
that addreses this issue:
https://github.com/nothings/stb/pull/1746

* squash! examples : update command.wasm instructions [no ci]

This commit adds a Python script to serve the the wasm examples build
in the `build-em` directory. Initially I thought that it would be enough
to start a simple python server but I did not notice that there was an
error in the browser console when I did that:
```console
command.js:1 Uncaught (in promise) DataCloneError: Failed to execute 'postMessage' on 'Worker': SharedArrayBuffer transfer requires self.crossOriginIsolated.
    at command.js:1:1206224
    at new Promise (<anonymous>)
    at loadWasmModuleToWorker (command.js:1:1204981)
    at Array.map (<anonymous>)
    at Object.loadWasmModuleToAllWorkers (command.js:1:1206428)
    at command.js:1:1204318
    at callRuntimeCallbacks (command.js:1:1202062)
    at preRun (command.js:1:6136)
    at run (command.js:1:1294094)
    at removeRunDependency (command.js:1:7046)
```
We need a few CORS headers to be set and in order hopefully make this
easy for users a Python script is added to the examples directory.
This should be able to server all the wasm examples provided they have
been built. command.wasm's README.md is updated to reflect this change.

* examples : remove unused functions

This commit removed the unused functions convert_to_utf8 and
convert_to_wstring from examples/common.cpp.

* Revert "examples : fix tautological-compare warning in stb_vorbis.c [no ci]"

This reverts commit 8e3c47d96141c7675c985562ebdc705e839e338a.

We should not make this change here and instead when the upstream PR is
merged we can sync with it.

Refs: https://github.com/ggerganov/whisper.cpp/issues/2784
2025-03-30 08:33:31 +03:00
Xuan-Son Nguyen af6ae1efb2 llama : fix non-causal mask for gemma 3 (#12615) 2025-03-30 00:07:37 +01:00
Djip007 0bb2919335 llama : change cpu_buft_list order: ACCEL -> GPU host -> CPU extra -> CPU (#12632)
this allow to use GPU host when possible over CPU repack.
this have the same effect to resolve this issues (#12498) without
completely disable CPU extra buffer.

Co-authored-by: philou <philou@framework>
2025-03-29 14:07:37 +01:00
Jay a69f846351 cmake : fix ccache conflict (#12522)
If users already set CMAKE_C_COMPILER_LAUNCHER globally, setting it in
cmake again will lead to conflict and compile fail.

Signed-off-by: Jay <BusyJay@users.noreply.github.com>
2025-03-29 11:04:58 +01:00
hipudding d07a0d7a79 CANN : remove clang-format in ggml-cann (#12607) 2025-03-29 11:03:28 +01:00
Sigbjørn Skjæret 3714c3ee1a llama : fix incorrect Qwen2Moe ffn_moe_out graph callback (#12631) 2025-03-28 22:13:02 +01:00
Georgi Gerganov b4ae50810e metal : improve FA + improve MoE (#12612)
* ggml : FA with different K, V head sizes (CPU)

ggml-ci

* metal : add FA with HS=192

* metal : extend FA to support different K and V head sizes

ggml-ci

* metal : add FA vector kernels for heads K 192 and V 128

ggml-ci

* ggml : restrict op on other backends to equal head sizes

ggml-ci

* metal : optimize FA-vec kernel

ggml-ci

* metal : FA remove mq registers

* metal : improve MoE mul_mat_id condition

ggml-ci

* metal : fix comments + remove unnecessary addition

ggml-ci

* metal : avoid too much shared memory usage with mul_mat_id

ggml-ci
2025-03-28 20:21:59 +02:00
Icenowy Zheng b86f600723 vulkan: fix coopmat shader generation when cross-compiling (#12272)
* vulkan: fix coopmat shader generation when cross-compiling

Previously the status of coopmat{,2} support isn't passed to the
vulkan-shaders-gen project building on the host, which leads to build
failure because of the cross-compiling code expecting coopmat{,2}
shaders that didn't get generated.

Fix this by passing the coopmat{,2} support status to vulkan-shaders
subproject.

Signed-off-by: Icenowy Zheng <uwu@icenowy.me>

* Only call coop-mat shaders once

* Fix whitespace

---------

Signed-off-by: Icenowy Zheng <uwu@icenowy.me>
Co-authored-by: bandoti <141645996+bandoti@users.noreply.github.com>
2025-03-28 14:51:06 -03:00
Johannes Gäßler dd373dd3bf llama: fix error on bad grammar (#12628) 2025-03-28 18:08:52 +01:00
Benson Wong 5d01670266 server : include speculative decoding stats when timings_per_token is enabled (#12603)
* Include speculative decoding stats when timings_per_token is true

New fields added to the `timings` object:

  - draft_n           : number of draft tokens generated
  - draft_accepted_n  : number of draft tokens accepted
  - draft_accept_ratio: ratio of accepted/generated

* Remove redundant draft_accept_ratio var

* add draft acceptance rate to server console output
2025-03-28 10:05:44 +02:00
Radoslav Gerganov ef03229ff4 rpc : update README for cache usage (#12620) 2025-03-28 09:44:13 +02:00
amritahs-ibm 13731766db llamafile : ppc64le GEMV forwarding for FP32. (#12594)
This patch enables usage of MMA when one of the
dimensions of the matrix(ie either M or N) is 1. This
is useful in case of token generation where N < 2.

The concept of 'GEMV Forwarding' is used where when one
of the matrix has a single row/column, the elements are
broadcasted, instead of using packing routine to prepack
the matrix elements.

This change results in 5% - 15% improvement in total
speed(ie all tokens/total time), across various batch
sizes. This is in comparision with the corresponding
dot product implementation.

The patch is tested with FP32 models of Meta-Lllama-3-8B,
Mistral-7B, Llama-2-7B-chat-hf on a IBM POWER10 machine.

Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
2025-03-28 09:43:22 +02:00
Radoslav Gerganov ab6ab8f809 rpc : send hash when tensor data is above some fixed threshold (#12496)
* rpc : send hash when tensor data is above some fixed threshold

ref #10095

* rpc : put cache under $HOME/.cache/llama.cpp

* try to fix win32 build

* another try to fix win32 build

* remove llama as dependency
2025-03-28 08:18:04 +02:00
Piotr 2099a9d5db server : Support listening on a unix socket (#12613)
* server : Bump cpp-httplib to include AF_UNIX windows support

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>

* server : Allow running the server example on a unix socket

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>

---------

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-03-27 23:41:04 +01:00
Georgi Gerganov 2969019837 media : add SVG logo [no ci] (#12616) 2025-03-27 23:09:05 +02:00
lhez 5dec47dcd4 opencl: add multi and vision rope, gelu_quick and im2col (#12600)
* opencl: add `im2col`

* opencl: add `gelu_quick`

* opencl: add mrope

* opencl: add vision rope
2025-03-27 08:08:08 -07:00
Si1w f125b8dccf llama : add PLM GGUF Conversion & Inference Support (#12457)
* add edgellm model arch[conversation feature doesn't work]

* remove output.weight layer for edgellm arch

* [Model] update the name of the model

* update the name of model arch in convert gguf

* [Model] Refarctor the model arch into llama-model

* [Bug] Fix the bug in create attn kv

* [Code] Fix editorconfig erros

* [Code] Remove Trailing whitespace

* [Code] Remove Trailing whitespace

* [Code] Change the order of model arch in list

* [Code] Fix flake8 Lint errors

* Remove trailing white space

* [Code] Remove  call in model arch
2025-03-27 12:49:15 +02:00
HighDoping 953c2a62cf model : restore support for T5Encoder (#12590) 2025-03-27 11:43:33 +01:00
Csaba Kecskemeti d5c6309d91 convert : Support Qwen2_5_VLForConditionalGeneration (#12595) 2025-03-27 11:11:23 +01:00
Georgi Gerganov 029c693fdc sync : ggml
ggml-ci
2025-03-27 10:09:29 +02:00
Georgi Gerganov 771d84371c scripts : update sync + fix cmake merge
ggml-ci
2025-03-27 10:09:29 +02:00
Georgi Gerganov df0665a483 sync : ggml
ggml-ci
2025-03-27 09:04:38 +02:00
Georgi Gerganov 0306aad1ca cmake : sync/merge PowerPC build commands (#0) 2025-03-27 09:04:38 +02:00
amritahs-ibm c7b43ab608 llamafile : ppc64le MMA implementation for Q4_0. (#12489)
This change upstreams llamafile's cpu matrix
multiplication kernels for ppc64le ISA using MMA
builtins. This patch handles matrix multiplication
between quantised datatypes, block_q4_0 and
block_q8_0.

This change results in 5% - 50% improvement
in total speed(ie all tokens/total time), across
various batch sizes.

The patch is tested with Meta-Lllama-3-8B,
Mistral-7B, Llama-2-7B-chat-hf models on a
IBM POWER10 machine.

Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
2025-03-27 08:51:47 +02:00
xctan 24feaec057 ggml : riscv: add 128-bit RVV support (#12530)
* ggml : add 128-bit RVV support

* ggml : revert to old RVV 256+ q2_K, q3_K, q4_K, q6_K impl

* remove trailing whitespaces

* restructure vector length selection code
2025-03-27 08:38:34 +02:00
Georgi Gerganov f28bc4c286 llama : make loras compatible with repacking (#12593)
* llama : make loras compatible with repacking

ggml-ci

* cont : simplify

ggml-ci

* cont : add TODO [no ci]
2025-03-27 08:24:10 +02:00
Akarshan Biswas f17a3bb4e8 SYCL: implement memset ggml backend buffer interface (#12580)
* SYCL: implement memset ggml backend buffer interface

* use GGML_ABORT macro

* Do not wait for all queues to finish for memset operation
2025-03-27 09:46:00 +08:00
Slobodan Josic bd40678df7 HIP: Add support for RDNA4 targets (#12372) 2025-03-26 23:46:30 +01:00
Georgi Gerganov b3298fa47a metal : refactor mat-vec code (#12569)
* metal : refactor mat-vec code

ggml-ci

* metal : rename all_sum -> sum_all

ggml-ci

* metal : fix comments [no ci]

* metal : fix nr constant [no ci]

* metal : mv q6_K support nr0 > 1

ggml-ci

* metal : reduce register pressure

ggml-ci

* metal : fix typo [no ci]

* metal : reduce register pressure

ggml-ci
2025-03-26 21:38:38 +02:00
Michał Moskal 2447ad8a98 upgrade to llguidance 0.7.10 (#12576) 2025-03-26 11:06:09 -07:00
Ivy233 02082f1519 clip: Fix llama-llava-clip-quantize-cli quantization error under CUDA backend (#12566)
* [Fix] Compiling clip-quantize-cli and running it in a CUDA environment will cause ggml_fp16_to_fp32 to report an error when trying to access video memory. You need to switch to the CPU backend to run quantize.
After the fix, it will automatically run in the CPU backend and will no longer be bound to CUDA.

* [Fix]Roll back the signature and implementation of clip_model_load, and change the call in clip_model_quantize to clip_init.
2025-03-26 15:06:04 +01:00
Georgi Gerganov df4d20cd53 convert : fix squeeze for ssm_conv tensors (#12573)
* convert : fix squeeze for ssm_conv tensors

* convert : match ssm_conv tensors by type

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
2025-03-26 08:21:05 -04:00
Georgi Gerganov 5ed38b6852 ggml : fix MUL_MAT_ID repack with Q8_K (#12544)
* ggml : fix MUL_MAT_ID repack with Q8_K

ggml-ci

* ggml : improve repack templates

ggml-ci
2025-03-26 13:02:00 +02:00
R0CKSTAR fd7855f8f5 doc: [MUSA] minor changes (#12583)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-26 09:09:48 +02:00
Sigbjørn Skjæret 53af4dba42 convert: fix Mistral3/Gemma3 model hparams init (#12571)
* Fix Mistral3/Gemma3 model hparams init

* set positional args correctly

* use existing hparams if passed
2025-03-25 23:03:10 +01:00
Eric Curtin ef19c71769 run: de-duplicate fmt and format functions and optimize (#11596) 2025-03-25 18:46:11 +01:00
Dan Johansson 053b3f9aae ggml-cpu : update KleidiAI to v1.5.0 (#12568)
ggml-cpu : bug fix related to KleidiAI LHS packing

Signed-off-by: Dan Johansson <dan.johansson@arm.com>
2025-03-25 13:10:18 +02:00
Akarshan Biswas e2f560175a SYCL: disable Q4_0 reorder optimization (#12560)
ggml-ci
2025-03-25 18:40:18 +08:00
Dan Johansson 36ee06dd2d docs : add build instructions for KleidiAI (#12563)
Signed-off-by: Dan Johansson <dan.johansson@arm.com>
2025-03-25 11:35:20 +02:00
R0CKSTAR 3cd3a39532 ci: [MUSA] add CI and update doc (#12562)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-25 09:45:08 +02:00
Georgi Gerganov 2d77d88e70 context : fix worst-case reserve outputs (#12545)
ggml-ci
2025-03-25 09:19:23 +02:00
Akarshan Biswas c95fa362b3 ci: [SYCL] ggml-ci Use main GPU and enable sysman (#12547) 2025-03-24 19:35:38 +02:00
lhez 2b65ae3029 opencl: simplify kernel embedding logic in cmakefile (#12503)
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
2025-03-24 09:20:47 -07:00
Akarshan Biswas 48d7021c61 CI: fix SYCL build (#12546) 2025-03-24 14:58:32 +02:00
Tei Home 3361e2deba docs: update: improve the Fedoa CUDA guide (#12536)
* docs: update fedora-cuda guide

- Rename and place into Backend Folder.
- Update Host-Supplied Packages.
- Expand Recommended Users Section.

* docs: improve the flow of CUDA-FEDORA.md
2025-03-24 11:02:26 +00:00
compilade 00d53800e0 llama-vocab : add SuperBPE pre-tokenizer (#12532) 2025-03-24 11:47:24 +01:00
R0CKSTAR 7ea75035b6 CUDA: Fix clang warnings (#12540)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-24 11:28:34 +01:00
Prajwal B Mehendarkar c54f6b7988 mmap : skip resource limit checks on AIX (#12541) 2025-03-24 12:17:10 +02:00
Jeff Bolz 9b169a4d4e vulkan: fix mul_mat_vec failure in backend tests (#12529)
The OOB calculation could be wrong if the last iteration was during one of
the unrolled loops. Adjust the unrolling counts to avoid this. Add a couple
new backend tests that hit this failure on NVIDIA GPUs.
2025-03-24 07:56:17 +01:00
Marius Gerdes 77f9c6bbe5 server : Add verbose output to OAI compatible chat endpoint. (#12246)
Add verbose output to server_task_result_cmpl_final::to_json_oaicompat_chat_stream, making it conform with server_task_result_cmpl_final::to_json_oaicompat_chat, as well as the other to_json methods.
2025-03-23 19:30:26 +01:00
Lars Sonchocky-Helldorf 18b663d8e4 install : add macports (#12518)
MacPorts section added
2025-03-23 10:21:48 +02:00
Xuan-Son Nguyen fbdfefe74e llama : gemma3 : use output tensor if it exists in model weight (#12506)
* llama : gemma3 : use output tensor if it exists in model weight

* also add to the llm_tensor_names
2025-03-22 23:28:19 +01:00
Georgi Gerganov ba932dfb50 ggml : fix quantized cpy op (#12310)
* ggml : fix quantized cpy op

ggml-ci

* tests : add cpy tests for all types

ggml-ci

* tests : add BF16 copy tests

ggml-ci

* tests : fix loop for same-type copy

ggml-ci

* tests : add option to permute the dst tensor

ggml-ci
2025-03-22 16:23:26 +02:00
R0CKSTAR fac63a3d78 musa: refine compute capability (#12493)
* musa: refine compute capability

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Address review comments

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-22 10:11:37 +01:00
Jeff Bolz eddfb43850 vulkan: Optimize mul_mat_vec p021 and nc shaders (#12505)
* tests: add mul_mat perf/functional tests for p021/nc vulkan shaders

* vulkan: Optimize mul_mat_vec p021 and nc shaders.

These shaders are used in attention calculations, and when the KV cache grows
large they start to dominate the run time. For the nc shader (which is called
with large 'k' dimension), use unrolling and vector loads. For the p021 shader
(which is called with large 'm' and small 'k' dimensions), take advantage of
grouped query attention to reuse loads from the A matrix for the whole group,
and reduce the number of workgroups (too much overhead from tiny dispatches).

Using subgroupAdd in the p021 shader also helps, use that conditionally.
2025-03-22 09:40:11 +01:00
stduhpf 4375415b4a Vulkan: RTE rounding for cpy to quant (#12480)
* Vulkan: RTE rounding for cpy to quant

Co-Authored-By: Jeff Bolz <jbolz@nvidia.com>

* remove trailing whitespace

* avoid duplicating pipeline_cpy_f32_quant

* fix copypasting issue

* remove duplicated code

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-03-21 20:34:50 +01:00
Eve 30c42ef5cb vulkan: workaround for AMD Windows driver 16 bit unpack8 bug (#12472) 2025-03-21 20:27:47 +01:00
Georgi Gerganov af04481e6b model : do not repack if a GPU device is present (#12498)
ggml-ci
2025-03-21 16:14:29 +02:00
Sigbjørn Skjæret 960e726077 chore : cleanup llama_model_loader::TENSOR_ usage (#12492) 2025-03-21 10:21:36 +01:00
marcoStocchi ea1518e839 llama-tts : avoid crashes related to bad model file paths (#12482) 2025-03-21 11:12:45 +02:00
蕭澧邦 1aa87ee53d [SYCL] Fix build on Windows when ccache enabled (#9954) (#9976)
* [SYCL] Fix build on Windows when ccache enabled (#9954)

* take effect only on windows and force it to icl

---------

Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>
2025-03-21 14:58:47 +08:00
Svetlozar Georgiev 9ffcc9e374 sycl: cleanup oneDNN related code (#12097) 2025-03-21 10:15:56 +08:00
Woof Dog e04643063b webui : Prevent rerendering on textarea input (#12299)
* webui: Make textarea uncontrolled to eliminate devastating lag

* Update index.html.gz

* use signal-style implementation

* rm console log

* no duplicated savedInitValue set

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-03-20 15:57:43 +01:00
Sigbjørn Skjæret dbb3a4739e llama : make Qwen2MoE QKV bias optional (#12477) 2025-03-20 12:49:59 +01:00
Srihari-mcw 3d82dbcbce ggml : block interleaving support for Q4_K quantization for x86 AVX2 architecture (#12332)
* Add block interleaving support for Q4_K quantization

* Remove whitespaces and fix CI/CD issues

* Update pointer of bsums from int16_t to const int16_t

* Add vector version of quantize_q8_K_4x8 function

* Update code formatting based on review comments
2025-03-20 13:35:34 +02:00
Bartowski 732b5fbf5e convert : avoid calls to tokenizer.added_tokens_decoder (#12473)
tokenizer.added_tokens_decoder returns a fresh dict every time relatively slowly (~0.04s on average) which results in massive slowdowns when we have a huge number of added tokens
2025-03-20 08:36:37 +02:00
fairydreaming 568013d0cd context : clear sets containing encoder output sequence ids before storing new values (#12470)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-03-19 21:01:57 +01:00
Gaurav Garg 517b5ddbf0 CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (#12183)
- Find out active blocks per SM using cudaOccupancyMaxActiveBlocksPerMultiprocessor API. Use this value to determine the optimal parallel_blocks value.
- Prefer vector flash attention kernels over MMA kernel for BS=1

Fixes Issue: #12182
---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-03-19 20:52:06 +01:00
Jeff Bolz a9b59288e2 vulkan: optimize iq1 coopmat2 dequant functions (#12427) 2025-03-19 19:56:23 +01:00
Guus Waals 0fd8487b14 Fix visionOS build and add CI (#12415)
* ci: add visionOS build workflow

Add a new GitHub Actions workflow for building on visionOS with CMake and Xcode.

* ggml: Define _DARWIN_C_SOURCE for visionOS to fix missing u_xxx typedefs

* ci: remove define hacks for u_xxx system types

---------

Co-authored-by: Giovanni Petrantoni <7008900+sinkingsugar@users.noreply.github.com>
2025-03-19 11:15:23 +01:00
Sigbjørn Skjæret 108e53c2f1 llama : add support for GPT2, Bloom and CodeShell tied word embeddings (#12456)
* Add support for GPT2, Bloom and CodeShell tied word embeddings

* Deduplicate tied word embeddings weights

* Workaround for incorrect weight map

It appears transformer.wte.weight is in the weight map even though the weights are not there, remove it if output weights are encountered first.

* check++

* fatfingers--
2025-03-19 09:08:49 +01:00
Sigbjørn Skjæret a686171ea7 convert : Support chat_template.json (#12460) 2025-03-19 08:58:13 +01:00
Jeff Bolz c446b2edd2 vulkan: Submit once enough matmul work has been recorded (#12406)
I've been seeing significantly worse performance for tg with flash attention
enabled vs disabled, and it seems to be related to the submit heuristic.
Change the heuristic to check how many bytes worth of weight matrix are
used and flush every 100MB, and ramp up after the first few submits.
This seems to resolve the issue, and also increases perf for non-FA a bit.
2025-03-19 08:26:26 +01:00
lhez d84635b1b0 opencl: improve profiling (#12442)
* opencl: more profiling timing

* opencl: generate trace for profiling

* opencl: reduce profiling overhead

* Populate profiling timing info at the end rather than after each
  kernel run

* opencl: fix for chrome tracing
2025-03-18 12:54:55 -07:00
Georgi Gerganov 75422e8bc4 graph : normalize Q, K, V shapes + sync cross attention (#12449)
* graph : normalize Q, K, V shapes and add comments

ggml-ci

* context : synchronize before getting cross attention data

* model : fix command-r attention norm check
2025-03-18 21:35:19 +02:00
R0CKSTAR bb115d2bf7 musa: override warp_size of musa device to 32 (#12445)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-18 19:28:26 +01:00
Xuan-Son Nguyen 29fff308c7 llama : support converting Mistral Small text-only (#12450) 2025-03-18 19:16:19 +01:00
Georgi Gerganov c6af2161b2 speculative : fix seg fault in certain cases (#12454) 2025-03-18 19:35:11 +02:00
Xuan-Son Nguyen 99aa304fb9 llama : add support for EXAONE tied word embeddings (#12451) 2025-03-18 17:24:33 +01:00
Georgi Gerganov 8551c44d84 context : always use non-causal attention for encoder graphs (#12447)
* context : always use non-causal attention for encoder graphs

ggml-ci

* context : move the change to llama_context::encode()

ggml-ci
2025-03-18 13:05:49 +02:00
Łukasz Ślusarczyk 35cae5ba05 SYCL: using graphs is configurable by environment variable and compile option (#12371)
* alberto changes

* enable sycl graphs by env variable

* fixed compilation warnings in ggml-sycl.cpp

* renamed graph variables

* fix markdown in docs/backend/SYCL.md

Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>

* fix markdown in docs/backend/SYCL.md again

* compiling graphs by default, renamed graph_enable to graph_disable

---------

Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>
2025-03-18 11:16:31 +01:00
Georgi Gerganov 810e0af3f5 server : fix warmup draft cache type (#12446)
ggml-ci
2025-03-18 12:05:42 +02:00
Prajwal B Mehendarkar eba92d64c3 cmake : fix PowerPC build (#12241)
Closes #12240
2025-03-18 11:37:33 +02:00
fj-y-saito d9a14523bb ggml : add SVE support for q6_K_q8_K (#12361) 2025-03-18 10:14:39 +02:00
0cc4m fd123cfead Vulkan: Default to 1GB allocations instead of 4GB to avoid fragmentation and driver issues (#12434) 2025-03-18 07:21:40 +01:00
Łukasz Ślusarczyk a53f7f7b88 fixed compilation warnings in ggml-sycl (#12424) 2025-03-18 08:51:25 +08:00
Molly Sophia 7dfad387e3 llama: Add support for RWKV v7 architecture (#12412)
* ggml: Add op l2_norm

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* ggml: Add op rwkv_wkv7

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: Add support for RWKV7 and ARWKV7 models

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: fix inference with RWKV6Qwen2

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: add more (a)rwkv7 variants in size

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Apply code-format changes

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* fix MUSA build

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: fix shape error with rwkv using llama-parallel

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2025-03-18 07:27:50 +08:00
Sigbjørn Skjæret 60c902926c docs : bring llama-cli conversation/template docs up-to-date (#12426) 2025-03-17 21:14:32 +01:00
Gaurav Garg b1b132efcb cuda : enable CUDA Graph on CUDA Toolkit < 12.x (#12394)
* Enable CUDA Graph on CTK < 12.x

`cudaGraphExecUpdate` API was changed on 12.x. For this reason CUDA graph support was disabled on older CUDA toolkit. This change enables CUDA support in CTK version < 12.x by using older API if CTK < 12.x.

* Fix compilation errors with MUSA

* Disable CUDA Graph for MUSA
2025-03-17 20:25:13 +02:00
Guus Waals 01e8f2138b ggml-vulkan: remove unused find_program(glslc) (#12416)
It's already found by FindVulkan.cmake in the parent CMakeLists
2025-03-17 13:35:43 -03:00
Jeff Bolz 484a8ab513 vulkan: Add N/2 and N/4 optimized paths in coopmat2 shader (#12312) 2025-03-17 09:26:18 -05:00
Daniele cf2270e4d3 vulkan: subgroup size tuning (#12087)
* vulkan: subgroup size test

* Vulkan: Add device architecture enum and logic to recognize AMD generations

* vulkan: use new architecture logic to specify subgroup size

* Initial vulkan subgroup size tuning for RDNA3

* vulkan: commonize RDNA subgroup tuning

* vulkan: override subgroup size if required_subgroup_size = 0

* vulkan: disable warp 32 for RDNA3

* vulkan: fine tuned RDNA1 subgroup sizes

* vulkan: adjusted subgroup size map

* vulkan: fixed RDNA2 subgroup map

---------

Co-authored-by: 0cc4m <picard12@live.de>
2025-03-17 12:42:33 +01:00
Jeff Bolz f07690c930 vulkan: use fp32 in coopmat2 q4_k dequant function (#12309) 2025-03-17 10:43:35 +01:00
Jeff Bolz 891c63956d vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking (#12273)
* vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking
2025-03-17 10:41:59 +01:00
Jeff Bolz 2f21123c1d vulkan: Adjust coopmat2 tile sizes and selection heuristic (#12258) 2025-03-17 10:35:00 +01:00
Christian Kastner 374101fd74 cmake : enable building llama.cpp using system libggml (#12321)
* cmake: Factor out compiler flag function from ggml

llama.cpps's build requires it, too, and we may want to make use of it
without add_subdirectory(ggml).

* cmake: Enable building against system ggml

This facilitates package maintenance for Linux distributions, where the
libggml library most likely will be shipped as an individual package
upon which a llama.cpp package depends.
2025-03-17 11:05:23 +02:00
Akarshan Biswas b3c9a65673 SYCL: set extras only on GGML_TYPE_Q4_0 (#12366)
* SYCL: set extras only on GGML_TYPE_Q4_0

* release tensor_extras in reset buffer interface
2025-03-17 09:45:12 +08:00
Sigbjørn Skjæret 8ba95dca20 llama : fix OLMo-2-0325-32B-Instruct K-norm size (#12400) 2025-03-16 19:46:36 +02:00
Georgi Gerganov dc079cfdff context : fix init of n_outputs (#12397)
ggml-ci
2025-03-16 19:29:36 +02:00
Daniel Bevenius 7b61bcc87c ci : add --symlinks to xcframework zip command (#12409)
This commit adds the --symlinks option to the zip command used to create
the xcframework zip file. This is necessary to create symlinks in the
zip file. Without this option,  the Versions symlink is stored as a
regular directory entry in the zip file, rather than as a symlink in the
zip which causes the followig error in xcode:
```console
Couldn't resolve framework symlink for '/Users/danbev/work/ai/llama.cpp/tmp_1/build-apple/llama.xcframework/macos-arm64_x86_64/llama.framework/Versions/Current': readlink(/Users/danbev/work/ai/llama.cpp/tmp_1/build-apple/llama.xcframework/macos-arm64_x86_64/llama.framework/Versions/Current): Invalid argument (22)
```

Refs: https://github.com/ggml-org/llama.cpp/pull/11996#issuecomment-2727026377
2025-03-16 18:22:05 +01:00
marcoStocchi f4c3dd5daa llama-tts : add '-o' option (#12398)
* added -o option to specify an output file name

* llama-tts returns ENOENT in case of file write error

note : PR #12042 is closed as superseded with this one.
2025-03-15 17:23:11 +01:00
aubreyli 3d35d87b41 SYCL: Delete redundant plus sign and space (#12391) 2025-03-15 15:49:03 +01:00
fairydreaming b19bd064c0 SYCL : support non-contiguous tensors in binary ops (add, sub, etc) (#12399)
* sycl : support non-contiguous tensors in binary ops

* sycl : silence unused variable warning

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-03-15 22:19:30 +08:00
Chenguang Li 92a391327e [CANN]MUL_MAT optimization (#12382) 2025-03-15 09:31:08 +08:00
Eric Curtin 9f2250ba72 Add CLI arg to llama-run to adjust the number of threads used (#12370)
We default to 4, sometimes we want to manually adjust this

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-03-14 16:41:20 +00:00
Sigbjørn Skjæret 774973b8f3 main : add -sysf / --system-prompt-file (#12249) (#12250)
* add system_prompt_file

* add -sysf / --system-prompt-file

* remove system_prompt_file
2025-03-14 16:57:05 +01:00
fairydreaming 8fcb563613 Load all MoE experts during warmup (#11571)
* llama : introduce llama_set_warmup() API call that controls warmup mode; use all MoE experts during warmup

* common : use new API to enable warmup mode during model warmup

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-03-14 13:47:05 +01:00
Victor add2a3aa5a server: fix "--grammar-file" parameter (#12285) 2025-03-14 11:21:17 +01:00
Georgi Gerganov c522ce4143 graph : simplify attn input build for unified KV cache (#12381)
ggml-ci
2025-03-14 10:47:44 +02:00
Georgi Gerganov 081bee8c64 hparams : add SWA rope parameters (#12374)
ggml-ci
2025-03-14 09:03:24 +02:00
Georgi Gerganov 84d5475541 llama : fix Gemma3 SWA KV cache shift (#12373)
* llama : fix Gemma3 SWA KV cache shift

ggml-ci

* hparams : add comment [no ci]
2025-03-13 19:08:07 +02:00
Xuan-Son Nguyen be7c303410 arg : no n_predict = -2 for examples except for main and infill (#12364) 2025-03-13 12:34:54 +01:00
Georgi Gerganov e0dbec0bc6 llama : refactor llama_context, llama_kv_cache, llm_build_context (#12181)
* llama : refactor llama_context, llama_kv_cache, llm_build_context

ggml-ci

* graph : don't mutate the KV cache during defrag

ggml-ci

* context : reduce virtuals + remove test function

ggml-ci

* context : move interface implementation to source file + factory

ggml-ci

* graph : move KV cache build functions to llama_context impl

ggml-ci

* graph : remove model reference from build_pooling

ggml-ci

* graph : remove llama_model reference

ggml-ci

* kv_cache : provide rope factors

ggml-ci

* graph : rework inputs to use only unique_ptr, remove attn input abstraction

ggml-ci

* context : remove llama_context_i abstraction

ggml-ci

* context : clean-up

ggml-ci

* graph : clean-up

ggml-ci

* llama : remove redundant keywords (struct, enum)

ggml-ci

* model : adapt gemma3

ggml-ci

* graph : restore same attention ops as on master

ggml-ci

* llama : remove TODO + fix indent

ggml-ci
2025-03-13 12:35:44 +02:00
Ishaan Gandhi 2048b5913d server : fix crash when using verbose output with input tokens that are not in printable range (#12178) (#12338)
* Fix DOS index bug

* Remove new APIs

* remove extra line

* Remove from API

* Add extra newline

* Update examples/server/server.cpp

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-03-13 11:10:05 +01:00
Oscar Barenys f08f4b3187 Update build.yml for Windows Vulkan builder to use Vulkan 1.4.304 SDK for VK_NV_cooperative_matrix2 support (#12301) 2025-03-12 20:06:58 +01:00
Daniel Bevenius 80a02aa858 llama.swiftui : fix xcframework dir in README [no ci] (#12353)
This commit fixes the path to the xcframework in the README file which I
had forgotten to change after renaming the build directory.
2025-03-12 13:45:32 +01:00
Alberto Cabrera Pérez 363f8c5d67 sycl : variable sg_size support for mmvq kernels (#12336) 2025-03-12 09:57:32 +00:00
uvos 34c961b181 CUDA/HIP: Fix fattn-vec-* when device warp size is not 32 (#12315)
When fattn-wmma was ported over to warp64 various bits that also touch fattn-vec where converted to
selectable warp size, however the fattn-vec kernels dont work with 64 wide warps for now, so we need
to avoid launching them with parameters for warp64
2025-03-12 10:14:11 +01:00
Xuan-Son Nguyen 7841fc723e llama : Add Gemma 3 support (+ experimental vision capability) (#12343)
* llama : Add Gemma 3 text-only support

* fix python coding style

* fix compile on ubuntu

* python: fix style

* fix ubuntu compile

* fix build on ubuntu (again)

* fix ubuntu build, finally

* clip : Experimental support for Gemma 3 vision (#12344)

* clip : Experimental support for Gemma 3 vision

* fix build

* PRId64
2025-03-12 09:30:24 +01:00
Jeff Bolz bf69cfe62f vulkan: fix bug in coopmat1 mul_mat_id (#12316)
* tests: run mul_mat_id with a larger N

* vulkan: fix bug in coopmat1 mul_mat_id
2025-03-12 06:59:19 +01:00
uvos 10f2e81809 CUDA/HIP: refractor mmqv to unify the calculation of nwarps and rows per block between host and device code. (#12177)
refactor mmqv to unify the calculation of nwarps and rows per block between host and device code.

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-03-11 20:16:03 +01:00
jklincn ba7654380a ggml-backend : fix backend search path (#12330)
* Fix backend search path

* replace .native() with '/'

* reverted .native()
2025-03-11 14:25:17 +01:00
BB-fat 6ab2e4765a metal : Cache the Metal library at the device context level (#12265) 2025-03-11 13:45:02 +02:00
Xuan-Son Nguyen 96e1280839 clip : bring back GPU support (#12322)
* clip : bring back GPU support

* use n_gpu_layers param

* fix double free

* ggml_backend_init_by_type

* clean up
2025-03-11 09:20:16 +01:00
Eve 2c9f833d17 mat vec double buffer (#12188) 2025-03-10 19:28:11 +00:00
R0CKSTAR 251364549f musa: support new arch mp_31 and update doc (#12296)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-10 18:18:25 +01:00
Henry Linjamäki 8acdacb3ea opencl: use OpenCL C standard supported by the device (#12221)
This patch nudges the llama.cpp a bit to be supported on PoCL which
doesn't support OpenCL C CL2.0. The issue is solved by querying the
device for the supported OpenCL C versions and using the highest one
available.
2025-03-10 09:57:00 -07:00
John Bean 89b2b56e86 readme: added Sidekick to available UIs (#12311) 2025-03-10 16:13:09 +02:00
Georgi Gerganov e128a1bf5b tests : fix test-quantize-fns to init the CPU backend (#12306)
ggml-ci
2025-03-10 14:07:15 +02:00
marcoStocchi 6ef79a67ca common : refactor '-o' option (#12278)
As discussed in PR 'llama-tts : add -o option' (#12042):

* common_params : 'out_file' string is the only output file name parameter left in common_params. It's intended to be used in all example programs implementing an '-o' option.

* cvector-generator, export-lora, imatrix : default output filenames moved from 'common_params' to the 'main()' of each example program.
2025-03-10 13:34:13 +02:00
Olivier Chafik 4e39a3c332 server: extract <think> tags from qwq outputs (#12297)
* extract <think> tags from qwq outputs

* const for all static regexes in chat.cpp
2025-03-10 10:59:03 +00:00
Olivier Chafik be421fc429 tool-call: ensure there's always a non-empty tool call id (#12292) 2025-03-10 09:45:29 +00:00
Olivier Chafik 87c2630546 allow missing content in message if tool_calls provided (#12293) 2025-03-10 09:45:07 +00:00
Olivier Chafik 2b3a25c212 sampler: fixes trigger tokens + lazy grammars (fix typo cast from token to string) (#12291)
* Fix typo in lazy grammar handling (fixes trigger tokens)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-03-10 09:44:42 +00:00
tc-mb 8352cdc87b llava : fix bug in minicpm-v code (#11513)
* fix bug in minicpm-v code

* update readme of minicpm-v
2025-03-10 10:33:24 +02:00
Georgi Gerganov 1e2f78a004 server : add speculative decoding presets for FIM (#12287) 2025-03-09 19:08:20 +02:00
Georgi Gerganov 0fd7ca7a21 authors : update (#12271) 2025-03-08 18:26:00 +02:00
Jason C.H 6fefc05a7a ggml-backend : make path_str compatible with C++20 (#12269) 2025-03-08 17:02:39 +01:00
Georgi Gerganov 7ab364390f server : infill gen ends on new line (#12254) 2025-03-07 20:54:30 +02:00
Daniel Bevenius 7c7f3b7f43 ggml : skip intermediate .air file when compiling .metallib (#12247)
This commit updates the compilation of default.metallib to skip the
intermediate .air (Apple Intermediate Representation) file.

The motivation for this change is to simplify the custom command a
little and avoid generating and then removing the .air file.
2025-03-07 14:15:27 +01:00
Georgi Gerganov 102ac1891d sync : ggml
ggml-ci
2025-03-07 14:49:44 +02:00
vmobilis d6ae2fa061 ggml : ggml_compute_forward_concat() for arbitrary tensor type (ggml/1118)
* ggml_compute_forward_concat() for arbitrary tensor type

* Check that tensors' type match

* ggml-cpu.c: check type of source tensors

* ggml-cpu.c: move tensor type check to ggml_compute_forward_concat()

* ggml.c: check concatenated tensor type

* Remove tensor type check from ggml_compute_forward_concat() in ggml-cpu.c

..., as it was moved to ggml.c.
2025-03-07 14:49:44 +02:00
Rémy O 68d0027f3d ggml-cpu: faster AVX2 variant for IQ1_M (#12216) 2025-03-07 13:54:22 +02:00
Georgi Gerganov ea002810a2 ci : fix save-load test invocations (#12245) 2025-03-07 12:19:31 +02:00
Sigbjørn Skjæret 8fad3c7a7c server : Log original chat template parsing error (#12233) 2025-03-07 11:15:33 +01:00
Olivier Chafik 7cf64f6bee sync: minja - support QwQ-32B (#12235)
https://github.com/google/minja/commit/8a76f7815e8a3ae00bd233c2b5a8b7d4e86564ec
2025-03-07 09:33:37 +00:00
BB-fat 5e2d57b2b2 metal : simplify kernel arguments using a struct (#3229) (#12194)
* metal : refactor im2col parameters into a struct

* metal: Change im2col offset types from int32_t to uint64_t to support larger memory offsets

* metal : refactor sum_rows parameters into a struct

* metal : refactor soft_max parameters into a struct

* metal : refactor diag_mask_inf parameters into a struct

* metal : refactor ssm_conv parameters into a struct

* metal : refactor ssm_scan parameters into a struct

* metal : refactor get_rows parameters into a struct

* metal : refactor group_norm parameters into a struct

* metal : refactor conv_transpose_1d parameters into a struct

* metal : refactor upscale parameters into a struct

* metal : refactor pad parameters into a struct

* metal : refactor pad_reflect_1d parameters into a struct

* metal : refactor arange parameters into a struct

* metal : refactor timestep_embedding parameters into a struct

* metal : refactor argsort parameters into a struct

* metal : refactor leaky_relu parameters into a struct

* metal : refactor pool_2d parameters into a struct

* metal : fix trailing whitespace

---------

Co-authored-by: alexju <alexju@tencent.com>
2025-03-07 08:35:57 +01:00
David Huang f1648e91cf HIP: fix rocWMMA build flags under Windows (#12230) 2025-03-07 08:06:08 +01:00
Daniel Bevenius d6c95b0740 metal : fix default.metallib build (#12224)
This commit updates the custom command to build the default.metallib
file to use the correct path to ../ggml-common.h by using the variable
METALLIB_COMMON.

The motivation for this change is that currently when building and
specifying GGML_METAL_EMBED_LIBRARY=OFF the following error is
generated:
```console
[ 11%] Linking CXX shared library ../../bin/libggml.dylib
[ 11%] Built target ggml
make[2]: *** No rule to make target `ggml/src/ggml-metal/ggml-common.h', needed by `bin/default.metallib'.  Stop.
make[1]: *** [ggml/src/ggml-metal/CMakeFiles/ggml-metal-lib.dir/all] Error 2
```

With the above change the build could progress but there was a follow
on error about not being able to find the ggml-common.h file in
ggml-metal.metal where is was included as a relative path:
```console
[ 11%] Compiling Metal kernels
/Users/danbev/work/llama.cpp/build/bin/ggml-metal.metal:6:10: error: '../ggml-common.h' file not found, did you mean 'ggml-common.h'?
         ^~~~~~~~~~~~~~~~~~
         "ggml-common.h"
1 error generated.
```
Removing the relative path then allowed the build to complete
successfully.
2025-03-07 06:23:16 +01:00
lhez d76a86d967 opencl: Noncontiguous norm, rms_norm, disable fp16 for some ops (#12217)
* opencl: support noncontiguous `norm`

* opencl: support noncontiguous `rms_norm`

* opencl: disable fp16 for `ADD`, `MUL`, `SCALE`, `RELU`, `GELU`, `SILU`, `CLAMP`
2025-03-07 00:20:35 +00:00
xiaofei 776f9e59cc cmake : fix undefined reference errors for std::filesystem in ggml (#12092) (#12094)
Signed-off-by: Ray Lee <hburaylee@gmail.com>
Co-authored-by: Ray Lee <hburaylee@gmail.com>
2025-03-06 22:58:25 +00:00
Lucas Moura Belo 3d652bfddf readme : update bindings (#12229) 2025-03-06 21:15:13 +02:00
Johannes Gäßler 5220a16d18 CUDA: fix FA logic for PTX 7.0 and CC >= 7.5 (#12222) 2025-03-06 18:45:09 +01:00
David Huang 3ffbbd5ce1 HIP: rocWMMA documentation and enabling in workflow builds (#12179)
* Enable rocWMMA for Windows CI build

* Enable for Ubuntu

* GGML_HIP_ROCWMMA_FATTN documentation work
2025-03-06 14:14:11 +01:00
Olivier Chafik 42994048a3 update function-calling.md w/ template override for functionary-small-v3.2 (#12214) 2025-03-06 09:03:31 +00:00
Aaron Teo e9b2f84f14 llava: add big-endian conversion for image encoder (#12218)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-03-06 09:33:21 +01:00
uvos e721c05c93 HIP/CUDA: set the paramerter value in maintain_cuda_graph instead of replaceing it. (#12209)
This avoids conflict with internal cuda/hip runtimes memory managment behavior.
2025-03-06 08:20:52 +01:00
Han Yin 57b6abf85a android : fix KV cache log message condition (#12212) 2025-03-06 08:22:49 +02:00
Henry Linjamäki 94bb63e4f0 opencl : fix buffer alignment (#12197)
Fix the following error:

```
ggml-alloc.c:99: not enough space in the buffer
ggml_tallocr_alloc: not enough space in the buffer to allocate blk.17.ffn_down.weight (needed 27525120, available 27521024)
```

which occurs when `ggml_backend_opencl_context::alignment` is larger
than `cl_ptr_base` (hard-coded to `0x1000`).

Also, fix `ggml_backend_opencl_context::alignment` was set to
`CL_DEVICE_MEM_BASE_ADDR_ALIGN` which was treated as bytes but the
value is reported in bits.
2025-03-06 02:33:40 +01:00
Henry Linjamäki f79243992c opencl : fix ulong kernel args were set from int variables (#12174)
... which left garbage bits in the upper half of the kernel args. This
caused segmentation faults when running PoCL.
2025-03-06 02:31:14 +01:00
simon886212 ed4ce0dda2 opencl : fix profile-related errors (#12095)
Co-authored-by: ubuntu <ubuntu@localhost.localdomain>
2025-03-06 02:30:05 +01:00
Rémy O 07d1572347 ggml-cpu: Faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions (#12154)
* ggml-cpu: Faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions

* cmake: Add GGML_BMI2 build option

* ggml: enable BMI2 on relevant CPU variants

* ggml-cpu: include BMI2 in backend score

* ggml-cpu: register BMI2 in ggml_backend_cpu_get_features

* ggml-cpu: add __BMI2__ define when using MSVC
2025-03-06 02:26:10 +01:00
Akarshan Biswas 5e43f104cc SYCL: Disable f16 Unary OPs as not supported by the kernels (#12201) 2025-03-05 16:58:23 +01:00
Plamen Minev 16e4b22c5e ggml : fix GGMLMetalClass ODR (#12200)
-- it might happen if ggml is loaded from 2 separate libraries since each one of them will expose the class. This is more of a guard since we want to use only Metal as embedded library and don't care about the other case.
2025-03-05 17:16:01 +02:00
Daniel Bevenius 074c4fd39d ci : add fetch-depth to xcframework upload (#12195)
This commit adds the fetch-depth: 0 option to the checkout action in the
build.yml workflow file (0 meaning that it fetches the complete
history). The default value is 1 when not specified which only fetches
the latest commit.

This is necessary to ensure that `git rev-list --count HEAD` counts the
total number of commits in the history. Currently because the default is
being used the name of the xcframework artifact is always
llama-b1-xcframework.
2025-03-05 14:16:40 +01:00
Olivier Chafik 669912d9a5 tool-call: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars (#12034)
* sampler: turn lazy grammar trigger words to regexes

* add scripts/tool_bench.sh & .py

* constrain llama json output regardless of function name if matches at beginning

* update relaxed newline space rule in grammar tests

* support add_generation_prompt query parameter (useful for /apply_template)

* Update src/llama-grammar.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-03-05 13:05:13 +00:00
Daniel Bevenius fa31c438e0 ci : fix xcframework artifact tag (#12191)
The commit add the name parameter to the upload-artifact action to
ensure that the artifact is uploaded with the correct name.

The motivation for this is that currently the uploaded xcframework
is named as llama-b1-xcframework.zip. With this change the name of this
artifact should contain the build number like the other artifacts.
2025-03-05 10:22:29 +01:00
Daniel Bevenius 3ccbfe5a71 ci : remove xframework upload (#12190)
* ci : remove xframework upload

This commit removes the upload of the xframework zip file as an
artifact.

The motivation for this change is that the xframework zip file is
currently being uploaded as part of strategy and will therefore be
attempted to be uploaded multiple times and will fail the build.

The uploading should be moved to somewhere else in the build to avoid
this.

* ci : add xcframework upload to macos-latest job
2025-03-05 08:34:02 +01:00
Clauszy 06a92a193a server : fix cache reuse logic (#12161)
The first kv shift offsets the positions of all tokens after head_c.
When using llama_kv_cache_seq_rm next, using head_c will remove the valid tokens because their positions have already been offset.
2025-03-05 09:25:45 +02:00
Daniel Bevenius a057897ad4 llama : add xcframework build script (#11996)
* llama : add xcframework build script

This commit adds a script to build an XCFramework for Apple
ios, macos, visionos, and tvos platforms.

The generated XCFramework can then be added to a project and used in
the same way as a regular framework. The llama.swiftui example project
has been updated to use the XCFramework and can be started using the
following command:
```console
$ open examples/llama.swiftui/llama.swiftui.xcodeproj/
```

Refs: https://github.com/ggml-org/llama.cpp/issues/10747

* examples : remove llama.cpp (source dir ref) from project.pbxproj

This commit removes the reference to llama.cpp from the project.pbxproj
file since Package.swift has been removed.

* ci : updated build.yml to use build-xcframework.sh

* ci : add xcframework build to github releases

This commit adds the ability to create a GitHub release with the
xcframework build artifact.

* scripts : add apple app validation scripts

This commit adds scripts that can validate the iOS, macOS, tvOS, and
VisionOS applications. The scripts create a simple test app project,
copy the llama.xcframework to the test project, build and archive the
app, create an IPA from the archive, and validate the IPA using altool.

The motivation for this is to provide some basic validation and
hopefully avoid having to manually validate apps in Xcode.

* llama : remove Package.swift

This commit removes the Package.swift file, as we are now building an
XCFramework for the project.

* llama : remove Sources and spm-headers directories

* llama : use TargetConditionals.h for visionOS/tvOS
2025-03-05 06:30:31 +01:00
mgroeber9110 5bbe6a9fe9 ggml : portability fixes for VS 2017 (#12150)
* Add include files for std::min/max and std::toupper/tolower

* win32: move _USE_MATH_DEFINES before includes to ensure M_PI is defined

* Use GGML_RESTRICT instead of "restrict" keyword everywhere, and use "__restrict" in MSVC plain C mode

* win32: only use __restrict in MSVC if C11/C17 support is not enabled

---------

Co-authored-by: Marcus Groeber <Marcus.Groeber@cerence.com>
2025-03-04 18:53:26 +02:00
Georgi Gerganov 20a9b8f5e1 readme : fix roadmap link (#12185) 2025-03-04 18:42:44 +02:00
Sigbjørn Skjæret 56d7a9f812 main: allow preloading conversation with -p and add -st / --single-turn (#12145)
* Add chat template formatting to -no-cnv

* only enable prompt formatting if explicitly enabled

* add -st / --single-turn

* add --single-turn and -p in conversation mode

* fix -sys + -p

* reword warning

* small readability change and fix (long) outdated example usage

* only activate single turn in conversation mode
2025-03-04 12:19:39 -04:00
Olivier Chafik 1a24c4621f server: fix deadly typo in response_format.json_schema.schema handling (#12168) 2025-03-04 08:24:07 +02:00
David Huang becade5de7 HIP: implement FlashAttention via rocWMMA for CDNA and RDNA3+ (#12032)
Adds GGML_HIP_ROCWMMA_FATTN and rocwmma header check
Adds rocWMMA support to fattn-wmma-f16

---

Signed-off-by: Carl Klemm <carl@uvos.xyz>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Ben Jackson <ben@ben.com>
2025-03-03 22:10:54 +01:00
Georgi Gerganov dfd6b2c0be sync : ggml
ggml-ci
2025-03-03 18:18:11 +02:00
cmdr2 b64d7cc272 cuda: unary ops as float + de-duplicate (ggml/1130) 2025-03-03 18:18:11 +02:00
Georgi Gerganov 3d1cf3cf33 sync : ggml
ggml-ci
2025-03-03 18:18:11 +02:00
cmdr2 0cbee131ad cuda/vulkan: specify fp32-only support for some operations in supports_op (ggml/1129)
ggml-ci
2025-03-03 18:18:11 +02:00
Georgi Gerganov 8371d44595 sync : ggml
ggml-ci
2025-03-03 18:18:11 +02:00
cmdr2 87abb7e903 cuda/cpu: Increase support for fp16 unary operations (ggml/1125)
* Support fp16 unary operations in the CUDA backend

* cpu: increase fp16 support for unary operators in the CPU backend

* cuda: increase fp16 support for unary operators in the CUDA backend

* Add test cases for fp16 unary operators

* metal: update supports_op for unary operators that don't support fp16, to prevent test-backend-ops from failing

* metal: fix PR comments for unary op support after fp16 unary tests
2025-03-03 18:18:11 +02:00
Diego Devesa 6d4c23b81b whisper : support GGML_BACKEND_DL (whisper/2843)
* whisper : support GGML_BACKEND_DL

* fix DTW crash

* whisper.objc : fix build - add ggml-cpp.h

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-03-03 18:18:11 +02:00
midnight 6512a90037 cmake : fix compile assumptions for power9/etc (whisper/2777)
* Add small comment re: VSX to readme

Co-authored-by: midnight <midnight@example.com>
2025-03-03 18:18:11 +02:00
petterreinholdtsen 4512055792 Told cmake to install ggml-cpp.h as a public header file. (ggml/1126)
It is used by Whisper talk-llama example.

Co-authored-by: Petter Reinholdtsen <pere@debian.org>
2025-03-03 18:18:11 +02:00
cmdr2 f54a4ba11e Support pure float16 add/sub/mul/div operations in the CUDA (and CPU) backend (ggml/1121)
* Support float16-to-float16 add/sub/mul/div operations in the CUDA backend

* Add fp16 support for add/sub/mul/div on the CPU backend

* Add test cases for fp16 add/sub/mul/div
2025-03-03 18:18:11 +02:00
Georgi Gerganov aede2074f6 scripts : sync-ggml-am.sh fix 2025-03-03 18:18:11 +02:00
Daniel Bevenius 2679c3b55d ci : set GITHUB_ACTION env var for server tests (#12162)
This commit tries to address/improve an issue with the server tests
which are failing with a timeout. Looking at the logs it seems like
they are timing out after 12 seconds:
```
FAILED unit/test_chat_completion.py::test_completion_with_json_schema[False-json_schema0-6-"42"] - TimeoutError: Server did not start within 12 seconds
```

This is somewhat strange as in utils.py we have the following values:
```python
DEFAULT_HTTP_TIMEOUT = 12

if "LLAMA_SANITIZE" in os.environ or "GITHUB_ACTION" in os.environ:
    DEFAULT_HTTP_TIMEOUT = 30

    def start(self, timeout_seconds: int | None = DEFAULT_HTTP_TIMEOUT) -> None:
```
It should be the case that a test running in a github action should have
a timeout of 30 seconds. However, it seems like this is not the case.
Inspecting the logs from the CI job we can see the following environment
variables:
```console
Run cd examples/server/tests
2 cd examples/server/tests
3 ./tests.sh
4 shell: /usr/bin/bash -e {0}
5 env:
6 LLAMA_LOG_COLORS: 1
7 LLAMA_LOG_PREFIX: 1
8 LLAMA_LOG_TIMESTAMPS: 1
9 LLAMA_LOG_VERBOSITY: 10
10 pythonLocation: /opt/hostedtoolcache/Python/3.11.11/x64
```

This probably does not address the underlying issue that the servers
that are providing the models to be downloaded occasionally take a
longer time to response but might improve these situations in some
cases.
2025-03-03 16:17:36 +01:00
dm4 c43af9276b tts: add speaker file support (#12048)
* tts: add speaker file support

Signed-off-by: dm4 <sunrisedm4@gmail.com>

* tts: handle outetts-0.3

* tts : add new line in error message

---------

Signed-off-by: dm4 <sunrisedm4@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-03-03 15:09:29 +02:00
Diego Devesa d5c63cd7f9 test-backend-ops : add option -p to filter by op params (#12155) 2025-03-03 14:00:46 +01:00
ag2s20150909 9660ffef58 ggml : fix kleidiai build (#12159)
The libggml API has changed, but this has not been updated.
2025-03-03 13:54:08 +01:00
Eric Curtin c950a1f692 Adding UTF-8 support to llama.cpp (#12111)
For emojis, non-alpha characters, etc.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-03-03 12:44:56 +00:00
Xuan-Son Nguyen 7b69003af7 webui : add ?m=... and ?q=... params (#12148)
* webui : add ?m=... and ?q=... params

* also clear prefilledMessage variable

* better approach

* fix comment

* test: bump timeout on GITHUB_ACTION
2025-03-03 11:42:45 +01:00
Akarshan Biswas ece9745bb8 SYCL: Move CPY kernels to a separate file and add few missing kernels (#12133)
* SYCL: refactor and move cpy kernels to a separate file

* Add few missing cpy kernels

* refactor and add debug logs
2025-03-03 11:07:22 +01:00
Diego Devesa cc473cac7c ggml-backend : keep paths in native string type when possible (#12144) 2025-03-02 22:11:00 +01:00
Sigbjørn Skjæret 14dec0c2f2 main: use jinja chat template system prompt by default (#12118)
* Use jinja chat template system prompt by default

* faster conditional order

* remove nested ternary

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-03-02 14:53:48 +01:00
Sigbjørn Skjæret 1782cdfed6 main: update outdated system prompt message (followup to #12131) (#12132)
* Update outdated message

* wording

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-03-01 15:22:27 +01:00
Sigbjørn Skjæret 45a8e76745 common : add --system-prompt parameter, replace behavior of -p in conversation mode (#12131)
* Add --system-prompt parameter

* use user defined system prompt

* clarify

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* add warning

* clarify

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-03-01 13:56:45 +01:00
Erik Scholz 80c41ddd8f CUDA: compress mode option and default to size (#12029)
cuda 12.8 added the option to specify stronger compression for binaries, so we now default to "size".
2025-03-01 12:57:22 +01:00
Vivian 2cc4a5e44a webui : minor typo fixes (#12116)
* fix typos and improve menu text clarity

* rename variable trimedValue to trimmedValue

* add updated index.html.gz

* rebuild

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-03-01 11:15:09 +01:00
Xuan-Son Nguyen 06c2b1561d convert : fix Norway problem when parsing YAML (#12114)
* convert : fix Norway problem when parsing YAML

* Update gguf-py/gguf/metadata.py

* add newline at correct place
2025-02-28 17:44:46 +01:00
William Tambellini 70680c48e5 ggml : upgrade init_tensor API to return a ggml_status (#11854)
* Upgrade init_tensor API to return a ggml_status

To prepare for an 'abort-free' ggml
(ggml not to abort on OOMs but return a OOM status),
as agreeed with Diego in the ggml repo,
upgrade the init_tensor() and view_init() APIs
to return a ggml_status.

* misc fixes

---------

Co-authored-by: slaren <slarengh@gmail.com>
2025-02-28 14:41:47 +01:00
Xuan-Son Nguyen c43a3e7996 llama : add Phi-4-mini support (supersede #12099) (#12108)
* Added Phi-4-mini-instruct support

* Update regex per ngxson

* Change the vocab base to Xenova/gpt-4o

* fix conversion update script

* no need to check longrope

* minor style fix

* fix python style

---------

Co-authored-by: Nicholas Sparks <nisparks@microsoft.com>
2025-02-28 12:44:11 +01:00
Alex Brooks 84d5f4bc19 Update granite vision docs for 3.2 model (#12105)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-28 11:31:47 +00:00
Rémy O 438a83926a vulkan: add specific MMV kernels for IQ2 and IQ3 quants + optimizations (#11595)
* vulkan: implement specialized MMV kernels for IQ2 quantizations

* vulkan: add MMV kernels for IQ3 quants

* vulkan: Increase MMV batch size and unroll IQ LUT setup

* vulkan: fix init_iq_shmem for WG sizes larger than tables

* vulkan: common batch size for all I-quants
2025-02-28 09:42:52 +01:00
Johannes Gäßler 9c42b1718c CUDA: fix logic for V100 + GGML_CUDA_FORCE_MMQ (#12098) 2025-02-28 09:26:43 +01:00
Prashant Vithule 05e6f5aad0 ggml: aarch64: implement SVE kernels for q2_k_q8_k vector dot (#12064)
* Added SVE Support for Q2_K Quantized Models

* Use 4-space indentation in the switch cases

* removed comments lines

* Remove the loop Retain the curly bracess for better understanding of code

* Remove the comment like added for q3_k_q8_k kernel

---------

Co-authored-by: vithulep <p.m.vithule1517@gmail.com>
2025-02-28 09:36:12 +02:00
hipudding 673cfef9aa CANN: Fix build error with GCC 13 (#11990)
Remove unused header file that causes compilation failure on ARM
platform with GCC 13.
2025-02-28 15:23:47 +08:00
Eve fbeda9002d vulkan: matmul dequantization improvements (#12015)
* faster dequant for old quants

* dont use unpack for iq4_nl

* vec2 unpack for q8
2025-02-28 08:20:08 +01:00
Daniele 581650b7ca vulkan: improve im2col (#11826)
* vulkan: improve im2col performance
2025-02-28 07:52:51 +01:00
Vladimir Vuksanovic b95c8af37c cmake: Fix ggml backend dependencies and installation (#11818)
* Fix dependencies between ggml and backends

ggml backends link only to ggml-base and ggml links to all backends.

* Fix installation of ggml backends

Set up GNUInstallDirs before setting the installation directory of ggml backends
2025-02-27 09:42:48 +02:00
Ting Lou a800ae46da llava : add struct for FFI bindgen (#12079)
* add struct for FFI bindgen

* Apply suggestions from code review

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-02-26 15:26:52 +01:00
Sigbjørn Skjæret 69050a11be Refactor gguf scripts to improve metadata handling (#11909)
* Refactor gguf scripts to improve metadata handling

Added contents method to ReaderField class
Added endianess property to GGUFReader class

* update scripts

* fix import

* remove unused import

* attempt to work around flake and pyright errors

* second attempt

* give up, ignore type

* bump version

* apply newbyteorder fixes
2025-02-26 08:04:48 -05:00
Aleksei Nikiforov 3567ee3a94 gguf-py: enable reading non-native endian files (#12081)
Currently self.byte_order is never used.
Actually use it to byteswap read data to
allow reading big endian files on little endian systems
and vice versa.

Now it's possible to convert little-endian model
into a big-endian model and back
on a little-endian system.
2025-02-26 11:39:27 +00:00
Kante Yin 53e4db1012 readme : update infra list (#9096)
Signed-off-by: kerthcet <kerthcet@gmail.com>
2025-02-26 09:49:36 +02:00
Olivier Chafik d7cfe1ffe0 docs: add docs/function-calling.md to lighten server/README.md's plight (#12069) 2025-02-25 18:52:56 +00:00
Jeff Bolz a82c9e7c23 vulkan: fix assertion when qy_needs_dequant (#12068)
Looks like a copy/paste bug from qx_needs_dequant.
2025-02-25 16:30:21 +01:00
rhjdvsgsgks 401af80b54 server: handle echo=false on /v1/completions (#12060) 2025-02-25 12:52:52 +01:00
Judd c132239bfb add OP sigmoid (#12056)
Co-authored-by: Judd <foldl@boxvest.com>
2025-02-25 12:32:20 +01:00
Molly Sophia 393fca629e ggml-cpu: Fix build with sve (#12059)
* ggml-cpu: Fix build with sve

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* ggml-cpu: Remove unused variable in sve q3_k vec dot

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2025-02-25 19:28:22 +08:00
Rémy O 61d4f39dfe vulkan: implement more backpropagation operators (#11914)
* vulkan: implement GGML_OP_ROPE_BACK

* vulkan: implement GGML_OP_RMS_NORM_BACK

* vulkan: implement GGML_OP_SILU_BACK

* vulkan: implement GGML_OP_SOFTMAX_BACK
2025-02-25 12:04:45 +01:00
Olivier Chafik 0b52745649 server: support add_generation_prompt query param (#12062) 2025-02-25 10:40:22 +00:00
Alex Brooks 4d1051a40f Add Doc for Converting Granite Vision -> GGUF (#12006)
* Add example docs for granite vision

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-25 10:46:05 +01:00
Vitali Lovich 3e9a2860e9 llama : expose llama_model_n_head_kv in the API (#11997)
It's useful to be able to have this from the library layer as it's a key
parameter of the model (e.g. to figure out how much KV cache memory is
needed).
2025-02-25 11:29:33 +02:00
Gian-Carlo Pascutto 58d07a8043 metal : copy kernels for quant to F32/F16 conversions (#12017)
metal: use dequantize_q templates

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-25 11:27:58 +02:00
lhez 34a846b584 opencl: fix for small models (#11950)
* opencl: fix small shape gemv, remove unused extensions

* opencl: fix `transpose_16`, `dump_tensor`, enforce subgroup size

* opencl: fix for token length < 4

* opencl: use wave size of 64 for all Adreno GPUs

---------

Co-authored-by: Shawn Gu <quic_shawngu@quicinc.com>
Co-authored-by: Skyler Szot <quic_sszot@quicinc.com>
2025-02-24 14:47:07 -07:00
Alex Brooks 7a2c913e66 llava : Add Granite Vision Support (#11794)
* Add super wip scripts for multimodal granite gguf

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Add example for converting mmgranite to gguf

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* remove hardcoded path

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Add vision feature layer to gguf params

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Clean up llava surgery and remove name substitution hacks

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Add transformers llava next tensor name mapping

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Make siglip / openclip mutuall exclusive

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Fix projector linear substitution

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Fix linear 2 substitution index

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Increase max flattened gridpoints to 64

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Fix hardcoded concat for multiple feature layers

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Pull vision feature layers out of gguf keys

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* fix num gridpoints and use all layers

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Avoid dropping last image encoder layer in llava models

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Use 10 for max number of patches

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Standardize vision feature layers

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Cleanup logs

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Update comment for vision feature layer init

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Update notes for alternative to legacy llm conversion script

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Fix notes rendering

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Add v prefix to vision feature layer log

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Use current defaults for feature layer

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Use constant for max gridpoints / feat layers, style fixes

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* clarify non-negative feature layers

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Remove CLIP_API from func signature

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* USE MAX_IMAGE_FEATURE_LAYERS const in layer calc

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Clarify feature layers are non negative ints and not uint

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Fix condition for reading feature layers

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* pop last llava layer when feature layers are unset

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Fix unset vision layer 0

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Update examples/llava/clip.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Reenable assertion for out of bounds get_rows

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Use std vector for gridpoints and feature layers

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Caculate max feature layer at load time

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Include base patch for granite vision allocation

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Fix trailing whitespace

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Add max num patches = 10 back for minicpmv

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Use unordered set to store feature layers

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Use max feature layer for postnorm

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Apply suggestions from code review

---------

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-02-24 17:09:51 +01:00
Neo Zhang Jianyu 08d5986290 [SYCL] Optimize mul_mat for Q4_0 on Intel GPU (#12035)
* opt performance by reorder for Intel GPU

* detect hw type and save opt feature, and print opt feature

* correct name

* support optimize graph once when compute graph, record the opt status in tensor->extra, make CI passed

* add env variable GGML_SYCL_DISABLE_OPT for debug

* use syclex::architecture replace the custom hw define, update the guide for GGML_SYCL_DISABLE_OPT

* add performance data

* mv getrows functions to separeted files

* fix global variables

---------

Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>
2025-02-24 22:33:23 +08:00
Aleksei Nikiforov 651adf4b66 gguf_convert_endian.py: implement byteswapping for q4_k and q6_k (#11349) 2025-02-24 11:27:01 +00:00
Akarshan Biswas 8303e8b0fb SYCL: Fix GGML_SYCL_DEBUG macro (#11995) 2025-02-24 10:18:25 +00:00
Florent BENOIT 7ad0779f5d run: allow to customize prompt by env var LLAMA_PROMPT_PREFIX (#12041)
Signed-off-by: Florent Benoit <fbenoit@redhat.com>
2025-02-23 17:15:51 +00:00
Eric Curtin f777a73e18 Some llama-run cleanups (#11973)
Use consolidated open function call from File class. Change
read_all to to_string(). Remove exclusive locking, the intent for
that lock is to avoid multiple processes writing to the same file,
it's not an issue for readers, although we may want to consider
adding a shared lock. Remove passing nullptr as reference,
references are never supposed to be null. clang-format the code
for consistent styling.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-02-23 13:14:32 +00:00
Aaron Teo af7747c95a ggml-cpu: Support s390x SIMD Instruction Set (#12019)
* ggml: add s390x ARCH_FLAGS for compilation

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add SIMD for s390x using vector intrinsics

SIMD is activated for:
* ggml_vec_dot_f32
* ggml_vec_dot_f16
* ggml_vec_mad_f32
* ggml_vec_mad_f16
* ggml_vec_mad_f32_unroll
* ggml_vec_scale_f32
* ggml_vec_scale_f16

SIMD is NOT activated for:
* ggml_vec_dot_f16_unroll (pending bugfix)

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix missing escape character in GGML_F32x4_REDUCE

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add temporary patch for GGML_F32_ARR and GGML_F16_ARR

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix s390x GGML_F32x4_REDUCE

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: full SIMD activation for F32,F16 s390x

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add option to disable s390x VXE/VXE2

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: change vecintrin.h include to ggml-cpu-impl

* add __VXE__ and __VXE2__ macros

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* cmake: add s390x target detection for VX/VXE/VXE2

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: move s390x vector intrinsics to ggml-cpu-impl.h

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x Q8_0 SIMD

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: correct documentation for Q8_0

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x reduce code complexity Q8_0

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x bugfix typo Q8_0

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activated for Q4_1

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x inline vec_reve

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for Q4_0

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add VXE backend feature

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: remove test.py

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for quantize_row_q8_0

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for quantize_row_q8_1

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for iq4_xs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: bugfix iq4_xs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for iq4_nl

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add float, double, and long vector data type

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: clean up iq4_xs SIMD

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix improper use of restrict keyword

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: update warning message for ggml_vec_tbl

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: untested implementation of ggml_vec_dot_iq2_xxs_q8_K

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: update ggml_vec_dot_q4_1_q8_1 to use typedefs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: switch to restrict for iq4_nl

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: slight dot product speed improvement for q4_1_q8_1

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for q6_K

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add missing `_t` to ggml_int8x16x4_t

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix missing `_t` for ggml_vec_xl_s8x4

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix more missing `_t`

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add unroll and prefetch to Q8_0

increase of 3.86% for prompt processing and 32.22% for token generation

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: patch Q8_0 to use proper vector sizes

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: optimise Q8_0 dot prod compute kernel further

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: add unroll and prefetch to Q4_1

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: refactor Q6_K variable naming for readability

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix Q6_K typos

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for Q5_K

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix wrong char*x16_t naming

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: Q5_K y0 wrong signness

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix Q5_K invalid uchar type

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix Q5_K invalid uchar type

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: s390x SIMD activation for Q4_K

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: fix Q4_K invalid vector intrinsics

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: simplify ggml_padd_s16 compute kernel

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: correct ggml-cpu vxe wording

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: change ggml_aligned_malloc alignment to 256

256 is the cache line size for s390x platforms

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: resolve pr merge via cherry-pick 225bbbf

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml : fix LoongArch compile error with 128-bit SIMD (#11701)

* ggml: resolve pr merge via cherry-pick 4571953

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: cmake remove fork when determining s390x machine type

thank you @ericcurtin

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Jinyang He <hejinyang@loongson.cn>
Co-authored-by: junchao-zhao <68935141+junchao-loongson@users.noreply.github.com>
2025-02-22 21:39:24 +00:00
Johannes Gäßler a28e0d5eb1 CUDA: app option to compile without FlashAttention (#12025) 2025-02-22 20:44:34 +01:00
Ting Lou 36c258ee92 llava: build clip image from pixels (#11999)
* llava: export function `clip_build_img_from_pixels` to build image from pixels decoded by other libraries instead of stb_image.h for better performance

* Apply suggestions from code review

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-02-22 15:28:28 +01:00
Georgi Gerganov f3e64859ed ci : fix arm upload artifacts (#12024)
* ci : fix arm upload artifacts

* cont : fix archive name to use matrix
2025-02-22 15:03:00 +02:00
Johannes Gäßler 5fa07c2f93 CUDA: optimize FA for GQA + large batches (#12014) 2025-02-22 12:20:17 +01:00
Rohanjames1997 335eb04a91 ci : Build on Github-hosted arm64 runners (#12009) 2025-02-22 11:48:57 +01:00
Georgi Gerganov cf756d6e0a server : disable Nagle's algorithm (#12020) 2025-02-22 11:46:31 +01:00
Gian-Carlo Pascutto d70908421f cuda: Add Q5_1, Q5_0, Q4_1 and Q4_0 to F32 conversion support. (#12000) 2025-02-22 09:43:24 +01:00
Daniel Bevenius de8b5a3624 llama.swiftui : add "Done" dismiss button to help view (#11998)
The commit updates the help view in the llama.swiftui example to use a
NavigationView and a Done button to dismiss the help view.

The motivation for this is that without this change there is now way to
dimiss the help view.
2025-02-22 06:33:29 +01:00
Georgi Gerganov 51f311e057 llama : skip loading unused tensors (#12004)
* llama : assign unknown/unused tensors to host buffer type

ggml-ci

* llama : skip unused tensors

ggml-ci
2025-02-21 18:33:18 +02:00
Johannes Gäßler 586d5fe6eb doc: update contributing guidelines [no ci] (#11969) 2025-02-21 12:51:25 +01:00
PureJourney ecc8e3aeff CUDA: correct the lowest Maxwell supported by CUDA 12 (#11984)
* CUDA: correct the lowest Maxwell supported by CUDA 12

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-02-21 12:21:05 +01:00
Bodhi 0b3863ff95 MUSA: support ARM64 and enable dp4a .etc (#11843)
* MUSA:  support ARM64 and enable __dp4a .etc

* fix cross entropy loss op for musa

* update

* add cc info log for musa

* add comment for the MUSA .cc calculation block

---------

Co-authored-by: Bodhi Hu <huaishun.hu@mthreads.com>
2025-02-21 09:46:23 +02:00
Alex Brooks ee02ad02c5 clip : fix visual encoders with no CLS (#11982)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-02-21 08:11:03 +02:00
momonga c392e5094d server (webui): Fix Premature Submission During IME Conversion (#11971)
* fix skip ime composing

* fix npm rebuild

* fix warn

---------

Co-authored-by: momonga <115213907+mmnga@users.noreply.github.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-02-20 19:43:22 +01:00
Charles Xu c5d91a7400 ggml-cpu: Add CPU backend support for KleidiAI library (#11390)
* ggml-cpu: Add CPU backend support for KleidiAI library

* Add environmental variable GGML_KLEIDIAI_SME

* Add support for multithread LHS conversion

* Switch kernel selection order to dotprod and i8mm

* updates for review comments

* More updates for review comments

* Reorganize and rename KleidiAI files

* Move ggml-cpu-traits.h to source file

* Update cmake for SME build and add alignment for SME

* Remove append GGML_USE_CPU_KLEIDIAI to the GGML_CDEF_PUBLIC list
2025-02-20 15:06:51 +02:00
Prashant Vithule 4806498bf1 ggml: aarch64: implement SVE kernels for q3_K_q8_K vector dot (#11917)
* Added SVE Implementation for Q3_K Kernel in ggml-cpu-quants.c file

* Improved Formating of code in  ggml-cpu-quants.c file

* style : minor fixes

* style : less whitespaces

* style : ptr spaceing

---------

Co-authored-by: vithulep <p.m.vithule1517@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-20 12:08:32 +02:00
Michael Engel 0d559580a0 run : add --chat-template-file (#11961)
Relates to: https://github.com/ggml-org/llama.cpp/issues/11178

Added --chat-template-file CLI option to llama-run. If specified, the file
will be read and the content passed for overwriting the chat template of
the model to common_chat_templates_from_model.

Signed-off-by: Michael Engel <mengel@redhat.com>
2025-02-20 10:35:11 +02:00
Johannes Gäßler d04e7163c8 doc: add links to ggml examples [no ci] (#11958) 2025-02-19 20:45:17 +01:00
Daniel Bevenius d07c621393 common : add llama.vim preset for Qwen2.5 Coder (#11945)
This commit adds a preset for llama.vim to use the default Qwen 2.5
Coder models.

The motivation for this change is to make it easier to start a server
suitable to be used with the llama.vim plugin. For example, the server
can be started with a command like the following:
```console
$ llama.vim --fim-qwen-1.5b-default
```

Refs: https://github.com/ggml-org/llama.cpp/issues/10932
2025-02-19 12:29:52 +01:00
Georgi Gerganov abd4d0bc4f speculative : update default params (#11954)
* speculative : update default params

* speculative : do not discard the last drafted token
2025-02-19 13:29:42 +02:00
Daniel Bevenius 9626d9351a llama : fix indentation in llama-grammar [no ci] (#11943)
This commit adjusts the indentation for the functions `parse_sequence`
and `parse_rule` in src/llama-grammar.cpp.

The motivation is consistency and improve readability.
2025-02-19 06:16:23 +01:00
igardev b58934c183 server : (webui) Enable communication with parent html (if webui is in iframe) (#11940)
* Webui: Enable communication with parent html (if webui is in iframe):
- Listens for "setText" command from parent with "text" and "context" fields. "text" is set in inputMsg, "context" is used as hidden context on the following requests to the llama.cpp server
- On pressing na Escape button sends command "escapePressed" to the parent

Example handling from the parent html side:
- Send command "setText" from parent html to webui in iframe:
const iframe = document.getElementById('askAiIframe');
if (iframe) {
	iframe.contentWindow.postMessage({ command: 'setText', text: text, context: context }, '*');
}

- Listen for Escape key from webui on parent html:
// Listen for escape key event in the iframe
window.addEventListener('keydown', (event) => {
	if (event.key === 'Escape') {
		// Process case when Escape is pressed inside webui
	}
});

* Move the extraContext from storage to app.context.

* Fix formatting.

* add Message.extra

* format + build

* MessageExtraContext

* build

* fix display

* rm console.log

---------

Co-authored-by: igardev <ivailo.gardev@akros.ch>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-02-18 23:01:44 +01:00
Olivier Chafik 63e489c025 tool-call: refactor common chat / tool-call api (+ tests / fixes) (#11900)
* tool-call refactoring: moved common_chat_* to chat.h, common_chat_templates_init return a unique_ptr to opaque type

* addressed clang-tidy lints in [test-]chat.*

* rm minja deps from util & common & move it to common/minja/

* add name & tool_call_id to common_chat_msg

* add common_chat_tool

* added json <-> tools, msgs conversions to chat.h

* fix double bos/eos jinja avoidance hack (was preventing inner bos/eos tokens)

* fix deepseek r1 slow test (no longer <think> opening w/ new template)

* allow empty tools w/ auto + grammar

* fix & test server grammar & json_schema params w/ & w/o --jinja
2025-02-18 18:03:23 +00:00
Xuan-Son Nguyen 63ac128563 server : add TEI API format for /rerank endpoint (#11942)
* server : add TEI API format for /rerank endpoint

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix

* also gitignore examples/server/*.gz.hpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-18 14:21:41 +01:00
MoonRide303 5137da7b8c scripts: corrected encoding when getting chat template (#11866) (#11907)
Signed-off-by: MoonRide303 <moonride303@gmail.com>
2025-02-18 10:30:16 +01:00
xiaobing318 09aaf4f1f5 docs : Fix duplicated file extension in test command (#11935)
This commit fixes an issue in the llama.cpp project where the command for testing the llama-server object contained a duplicated file extension. The original command was:
./tests.sh unit/test_chat_completion.py.py -v -x
It has been corrected to:
./tests.sh unit/test_chat_completion.py -v -x
This change ensures that the test script correctly locates and executes the intended test file, preventing test failures due to an incorrect file name.
2025-02-18 10:12:49 +01:00
Johannes Gäßler 73e2ed3ce3 CUDA: use async data loading for FlashAttention (#11894)
* CUDA: use async data loading for FlashAttention

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-17 14:03:24 +01:00
Eve f7b1116af1 update release requirements (#11897) 2025-02-17 12:20:23 +01:00
Antoine Viallon c4d29baf32 server : fix divide-by-zero in metrics reporting (#11915) 2025-02-17 11:25:12 +01:00
Rémy O 2eea03d86a vulkan: implement several ops relevant for ggml_opt (#11769)
* vulkan: support memset_tensor

* vulkan: support GGML_OP_SUM

* vulkan: implement GGML_OP_ARGMAX

* vulkan: implement GGML_OP_SUB

* vulkan: implement GGML_OP_COUNT_EQUAL

* vulkan: implement GGML_OP_OPT_STEP_ADAMW

* vulkan: fix check_results RWKV_WKV6 crash and memory leaks

* vulkan: implement GGML_OP_REPEAT_BACK

* tests: remove invalid test-backend-ops REPEAT_BACK tests

* vulkan: fix COUNT_EQUAL memset using a fillBuffer command
2025-02-17 07:55:57 +01:00
Xuan-Son Nguyen 0f2bbe6564 server : bump httplib to 0.19.0 (#11908) 2025-02-16 17:11:22 +00:00
standby24x7 fe163d5bf3 common : Fix a typo in help (#11899)
This patch fixes a typo in command help.
prefx -> prefix

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
2025-02-16 10:51:13 +01:00
Xuan-Son Nguyen 818a340ea8 ci : fix (again) arm64 build fails (#11895)
* docker : attempt fixing arm64 build on ci

* qemu v7.0.0-28
2025-02-16 10:36:39 +01:00
Jeff Bolz bf42a23d0a vulkan: support multi/vision rope, and noncontiguous rope (#11902) 2025-02-16 08:52:23 +01:00
Hale Chan c2ea16f260 metal : fix the crash caused by the lack of residency set support on Intel Macs. (#11904) 2025-02-16 08:50:26 +02:00
Johannes Gäßler 6dde178248 scripts: fix compare-llama-bench commit hash logic (#11891) 2025-02-15 20:23:22 +01:00
708-145 fc10c38ded examples: fix typo in imatrix/README.md (#11884)
* simple typo fixed

* Update examples/imatrix/README.md

---------

Co-authored-by: Tobias Bergmann <tobias.bergmann@gmx.de>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-15 21:03:30 +02:00
Adrian Kretz 22885105a6 metal : optimize dequant q6_K kernel (#11892) 2025-02-15 20:39:20 +02:00
Georgi Gerganov c2cd24fbfd readme : add notice about new package registry (#11890)
* readme : add notice about new package registry

* cont : fix whitespace
2025-02-15 20:29:56 +02:00
Georgi Gerganov 68ff663a04 repo : update links to new url (#11886)
* repo : update links to new url

ggml-ci

* cont : more urls

ggml-ci
2025-02-15 16:40:57 +02:00
Olivier Chafik f355229692 server: fix type promotion typo causing crashes w/ --jinja w/o tools (#11880) 2025-02-15 10:11:36 +00:00
Rémy O fc1b0d0936 vulkan: initial support for IQ1_S and IQ1_M quantizations (#11528)
* vulkan: initial support for IQ1_S and IQ1_M quantizations

* vulkan: define MMV kernels for IQ1 quantizations

* devops: increase timeout of Vulkan tests again

* vulkan: simplify ifdef for init_iq_shmem
2025-02-15 09:01:40 +01:00
Michał Moskal 89daa2564f llguidance build fixes for Windows (#11664)
* setup windows linking for llguidance; thanks @phil-scott-78

* add build instructions for windows and update script link

* change VS Community link from DE to EN

* whitespace fix
2025-02-14 12:46:08 -08:00
lhez 300907b211 opencl: Fix rope and softmax (#11833)
* opencl: fix `ROPE`

* opencl: fix `SOFT_MAX`

* Add fp16 variant

* opencl: enforce subgroup size for `soft_max`
2025-02-14 12:12:23 -07:00
Diego Devesa 94b87f87b5 cuda : add ampere to the list of default architectures (#11870) 2025-02-14 15:33:52 +01:00
Georgi Gerganov dbc2ec59b5 docker : drop to CUDA 12.4 (#11869)
* docker : drop to CUDA 12.4

* docker : update readme [no ci]
2025-02-14 14:48:40 +02:00
Daniel Bevenius 3d68f034da llama : add completion for --chat-template-file (#11860)
This commit adds completion for `--chat-template-file`, enabling only
`.jinja` files to be displayed as completions.

Example usage:
```console
$ ./build/bin/llama-cli --chat-template-file models/templates/<TAB>
models/templates/CohereForAI-c4ai-command-r7b-12-2024-tool_use.jinja
models/templates/CohereForAI-c4ai-command-r-plus-tool_use.jinja
models/templates/deepseek-ai-DeepSeek-R1-Distill-Llama-8B.jinja
models/templates/deepseek-ai-DeepSeek-R1-Distill-Qwen-32B.jinja
models/templates/fireworks-ai-llama-3-firefunction-v2.jinja
models/templates/google-gemma-2-2b-it.jinja
models/templates/llama-cpp-deepseek-r1.jinja
models/templates/meetkai-functionary-medium-v3.1.jinja
models/templates/meetkai-functionary-medium-v3.2.jinja
models/templates/meta-llama-Llama-3.1-8B-Instruct.jinja
models/templates/meta-llama-Llama-3.2-3B-Instruct.jinja
models/templates/meta-llama-Llama-3.3-70B-Instruct.jinja
models/templates/microsoft-Phi-3.5-mini-instruct.jinja
models/templates/mistralai-Mistral-Nemo-Instruct-2407.jinja
models/templates/NousResearch-Hermes-2-Pro-Llama-3-8B-tool_use.jinja
models/templates/NousResearch-Hermes-3-Llama-3.1-8B-tool_use.jinja
models/templates/Qwen-Qwen2.5-7B-Instruct.jinja
```
This is not limited to the models/templates directory, it can be used
anywhere in the filesystem, the above is just an example.
2025-02-14 11:16:56 +01:00
Jinyang He 38e32eb6a0 ggml: optimize some vec dot functions for LoongArch ASX (#11842)
* Optimize ggml_vec_dot_q3_K_q8_K for LoongArch ASX

* Optimize ggml_vec_dot_q4_K_q8_K for LoongArch ASX

* Optimize ggml_vec_dot_q6_K_q8_K for LoongArch ASX

* Optimize ggml_vec_dot_q5_K_q8_K for LoongArch ASX

* Optimize ggml_vec_dot_q2_K_q8_K for LoongArch ASX

* Optimize mul_sum_i8_pairs_float for LoongArch ASX

* Optimize ggml_vec_dot_iq4_xs_q8_K for LoongArch ASX
2025-02-14 10:54:27 +02:00
Eve a4f011e8d0 vulkan: linux builds + small subgroup size fixes (#11767)
* mm subgroup size

* upload vulkan x86 builds
2025-02-14 02:59:40 +00:00
theraininsky a7b8ce2260 llama-bench : fix unexpected global variable initialize sequence issue (#11832)
* llama-bench : fix unexpected global variable initialize sequence issue

* Update examples/llama-bench/llama-bench.cpp

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-14 02:13:43 +01:00
Georgi Gerganov 04045bb842 readme : minor 2025-02-14 00:16:56 +02:00
Jeffrey Morgan 8a8c4ceb60 llamafile: use member variable instead of constant for iq4nlt (#11780) 2025-02-13 18:05:04 +01:00
Reza Rahemtola c1f958c038 server : (docs) Update wrong tool calling example (#11809)
Call updated to match the tool used in the output just below, following the example in https://github.com/ggerganov/llama.cpp/pull/9639
2025-02-13 17:22:44 +01:00
Daniel Bevenius c48f630d1c llama : add --completion-bash option (#11846)
This commit adds a new option `--completion-bash` to the llama.cpp which
outputs a source-able bash completion script.

The motivation for this change is to provide a more user-friendly
experience for users who use the command-line interface of llama.cpp.

This is currently only basic and all options are displayed for all llama
executables but this can be improved in the future if needed.

Example usage:
```console
$ build/bin/llama-cli --completion-bash > ~/.llama-completion.bash
$ source ~/.llama-completion.bash

$ ./build/bin/llama-server --m<TAB>
--main-gpu         --mirostat         --mirostat-lr      --model            --multiline-input
--min-p            --mirostat-ent     --mlock            --model-url
```
2025-02-13 14:46:59 +01:00
R0CKSTAR bd6e55bfd3 musa: bump MUSA SDK version to rc3.1.1 (#11822)
* musa: Update MUSA SDK version to rc3.1.1

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: Remove workaround in PR #10042

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-02-13 13:28:18 +01:00
Olivier Chafik c7f460ab88 server: fix tool-call of DeepSeek R1 Qwen, return reasoning_content (Command 7RB & DeepSeek R1) unless --reasoning-format none (#11607)
* extract & return thoughts in reasoning_content field (unless --reasoning-format) for DeepSeek R1 & Command R7B

* tool-calls: add deepseek r1 template (models/templates/llama-cpp-deepseek-r1.jinja) + hackommodate broken official template

* tool-calls: accommodate variety of wrong tool call opening tags both R1 Qwen 32B and 7B distills like to spit out

* server/oai: ensure content is null when there are tool calls, and reasoning_content appears before content for readability

* tool-calls: add DeepSeek R1 Qwen distills to server/README.md & server tests

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-13 10:05:16 +00:00
Vinesh Janarthanan 27e8a23300 sampling: add Top-nσ sampler (#11223)
* initial sampling changes:

* completed top nsigma sampler implementation

* apply parameter to only llama-cli

* updated readme

* added tests and fixed nsigma impl

* cleaned up pr

* format

* format

* format

* removed commented tests

* cleanup pr and remove explicit floats

* added top-k sampler to improve performance

* changed sigma to float

* fixed string format to float

* Update src/llama-sampling.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/sampling.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update src/llama-sampling.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update src/llama-sampling.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update src/llama-sampling.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update src/llama-sampling.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* added llama_sampler_init

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-13 08:45:57 +02:00
Oleksandr Kuvshynov e4376270d9 llama.cpp: fix warning message (#11839)
There was a typo-like error, which would print the same number twice if
request is received with n_predict > server-side config.

Before the fix:
```
slot launch_slot_: id  0 | task 0 | n_predict = 4096 exceeds server configuration, setting to 4096
```

After the fix:
```
slot launch_slot_: id  0 | task 0 | n_predict = 8192 exceeds server configuration, setting to 4096
```
2025-02-13 08:25:34 +02:00
Daniel Bevenius 3e69319772 llama : update llama_decode_internal ref [no ci] (#11840)
This commit updates the comment in llama_kv_cache.h to reflect the
change of the function name from llama_decode_internal to
llama_decode_impl.
2025-02-13 08:07:51 +02:00
Diego Devesa a394039db0 ggml-cpu : add chunking support to mul_mat_id (#11666)
* ggml-cpu : add chunking support to mul_mat_id

* allocate chunk counter in wdata
parallelize src1 quantization by column to allows parallelization even when there is only one row

* disable for arm

* cleanup

* better way to disable for arm

* fix uninitialized counter when using 1 thread only

* revert test-backend-ops changes
2025-02-13 01:02:38 +01:00
Xuan-Son Nguyen be3bbd6215 ggml : x2 speed for WASM by optimizing SIMD (#11453)
* ggml : x2 speed for WASM by optimizing SIMD

* fix bad merging

* rm trailing spaces

* rm redundant clamp

* better quantize_row_q8_K

Co-authored-by: camel-cdr <camel-cdr@protonmail.com>

* remove memset that causes buffer overflow
Co-authored-by: camel-cdr <camel-cdr@protonmail.com>

---------

Co-authored-by: camel-cdr <camel-cdr@protonmail.com>
2025-02-13 00:33:45 +01:00
Woof Dog 31afcbee0e server : (webui) Give copy button back to all message bubbles (#11814)
* All messages get the copy button

* Update index.html.gz
2025-02-12 23:47:11 +01:00
uvos 5c4284d57b HIP: Remove GCN from list of devices that avoid MMQ (#11831) 2025-02-12 22:25:28 +01:00
JC bfd11a2344 Fix: Compile failure due to Microsoft STL breaking change (#11836) 2025-02-12 21:36:11 +01:00
Georgi Gerganov 0fb77f821f sync : ggml 2025-02-12 21:46:02 +02:00
uvos e598697d63 HIP: Switch to std::vector in rocblas version check (#11820) 2025-02-12 17:25:03 +01:00
bandoti fef0cbeadf cleanup: fix compile warnings associated with gnu_printf (#11811) 2025-02-12 10:06:53 -04:00
Richard 748ee9fe93 ggml : fix multi-threaded clamp_f32 (#11824)
* Bug fix for clamp_f32

When using tensors larger than 1d clamp operation does not work due to the restriction of returning if ith is not 0.

* Bug fix for clamp_f32

* Bug fix for clamp_f32
2025-02-12 15:57:33 +02:00
Weizhao Ouyang 198b1ec611 ggml-cpu: Fix duplicate MATMUL_INT8 (#11817)
Signed-off-by: Weizhao Ouyang <o451686892@gmail.com>
2025-02-12 13:22:58 +01:00
Johannes Gäßler c3d6af7cd2 CUDA: fix CUDART_VERSION checks (#11821) 2025-02-12 13:16:39 +01:00
Daniel Bevenius 369be5598a llama : fix typo in llama-grammar.h [no ci] (#11816) 2025-02-12 09:40:01 +02:00
lhez 4078c77f98 docs: add OpenCL (#11697) 2025-02-11 15:04:13 -07:00
Sheldon Robinson 90e4dba461 Fix #11802: Compile bug - RegQueryValueExA changed to RegQueryValueEx (#11803)
* Fix #11802: Compile bug - RegQueryValueExA changed to RegQueryValueEx

* Fix #11802: PR #11803 - keep RegQueryValueExA, remove TEXT macro, description needs to be ANSI string
2025-02-11 16:55:45 +01:00
Daniel Bevenius a18f481f99 server : use common_token_to_piece instead of common_detokenize (#11740)
* server : use common_token_to_piece instead of common_detokenize

This commit replaces the call to common_detokenize with
common_token_to_piece in the populate_token_probs.

The motivation for this change is to avoid an issue where
common_detokenize would remove the word boundary character for tokens,
which caused a regression in the server generated token probabilities.

Resolves: https://github.com/ggerganov/llama.cpp/issues/11728

* squash! server : use common_token_to_piece instead of common_detokenize

Use common_token_to_piece for post_sampling_probs as well.
2025-02-11 14:06:45 +01:00
Johannes Gäßler b9ab0a4d0b CUDA: use arch list for compatibility check (#11775)
* CUDA: use arch list for feature availability check

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-11 00:17:22 +01:00
Maxim Evtush 7b891bdc86 fix: typos in documentation files (#11791)
* Update ggml.c

* Update arg.cpp

* Update speculative.h
2025-02-10 23:21:31 +01:00
jason_w 81732619fd docs: utilize the forward slash (/) as the path separator for Unix-like systems (#11770) 2025-02-10 23:17:48 +01:00
Xuan-Son Nguyen 507f9174fe server : (webui) introduce conversation branching + idb storage (#11792)
* server : (webui) introduce conversation branching + idb storage

* mark old conv as "migrated" instead deleting them

* improve migration

* add more comments

* more clarification
2025-02-10 21:23:17 +01:00
Wilken Gottwalt 19b392d58d llama-mmap: fix missing include (#11796)
Technically the fixed width types come only from iostream and
cstdint/stdint.h headers. memory and vector headers should not provide
these. In GCC 15 the headers are cleaned up and you require the proper
header cstdint.

src/llama-mmap.h:26:5: error: ‘uint32_t’ does not name a type
   26 |     uint32_t read_u32() const;
      |     ^~~~~~~~
2025-02-10 20:58:18 +02:00
Xuan-Son Nguyen 0893e0114e server : correct signal handler (#11795) 2025-02-10 18:03:28 +01:00
Olivier Chafik d7b31a9d84 sync: minja (https://github.com/google/minja/commit/a72057e5190de2c612d4598bb10b4bfd0f53011f) (#11774) 2025-02-10 09:34:09 +00:00
pascal-lc 9ac3457b39 Update README.md [no ci] (#11781)
typo: `\` -> `/`
Change the UNIX path separator to` \`.
2025-02-10 09:05:57 +01:00
Danny Milosavljevic c2a67efe38 vulkan: Make Vulkan optional at runtime (#11493). (#11494)
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-02-10 07:17:21 +01:00
Wagner Bruna b044a0fe3c vulkan: add environment variable GGML_VK_PREFER_HOST_MEMORY to avoid VRAM allocation (#11592) 2025-02-10 07:08:22 +01:00
Eric Curtin 19d3c8293b There's a better way of clearing lines (#11756)
Use the ANSI escape code for clearing a line.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-02-09 10:34:49 +00:00
Jeff Bolz 98f6b0fd1e vulkan: account for lookup tables when checking shared memory size (#11502) 2025-02-09 08:43:51 +01:00
Xuan-Son Nguyen 55ac8c7791 server : (webui) revamp Settings dialog, add Pyodide interpreter (#11759)
* redo Settings modal UI

* add python code interpreter

* fix auto scroll

* build

* fix overflow for long output lines

* bring back sticky copy button

* adapt layout on mobile view

* fix multiple lines output and color scheme

* handle python exception

* better state management

* add webworker

* add headers

* format code

* speed up by loading pyodide on page load

* (small tweak) add small animation to make it feels like claude
2025-02-08 21:54:50 +01:00
Woof Dog e6e6583199 server : (webui) increase edit textarea size (#11763) 2025-02-08 20:09:55 +01:00
Georgi Gerganov aaa5505307 server : minor log updates (#11760)
ggml-ci
2025-02-08 18:08:43 +02:00
Georgi Gerganov bdcf8b6a56 cont : fix mmap flag print (#11699) 2025-02-08 16:49:38 +02:00
Karol Kontny 4d3465c5ae ggml: Fix data race in ggml threadpool (#11736)
After the barrier in last iteration is executed, still the loop termination
condition will be executed. However main thread can destroy the cgraph object
and its nodes already, then another thread will access it, but the thing is already gone.
Also trouble can happen when n_nodes == 0 or abort is called, but I'm not sure if the
prior situation is possible.

Last syncronization should be done after the loop to ensure the cgraph/cplan won't be
accessed after the main thread exits from the function.
2025-02-08 15:30:53 +01:00
Johannes Gäßler d80be897ac CUDA: fix min. version for movmatrix (#11751) 2025-02-08 10:46:07 +01:00
Nikolaos Pothitos 3ab410f55f readme : update front-end framework (#11753)
After the migration to React with #11688
2025-02-08 10:43:04 +01:00
Xuan-Son Nguyen 0cf867160c server : (webui) fix numeric settings being saved as string (#11739)
* server : (webui) fix numeric settings being saved as string

* add some more comments
2025-02-08 10:42:34 +01:00
Eric Curtin d2fe216fb2 Make logging more verbose (#11714)
Debugged an issue with a user who was on a read-only filesystem.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-02-07 14:42:46 +00:00
Georgi Gerganov ed926d8833 llama : fix defrag logic (#11707)
* llama : fix defrag logic

ggml-ci

* cont : better logic

ggml-ci

* cont : clamp fragmentation to 0.0

ggml-ci
2025-02-07 16:05:34 +02:00
Christian Fillion 2d219b389e vocab : ignore invalid UTF-8 input in the BPE tokenizer (#11729)
Silently insert U+FFFD(s) (Unicode replacement character) instead until the
next valid codepoint can be found.

This fixes `llama_tokenize` throwing an exception across the C API boundary
or libllama's module boundary (the caller's runtime might be incompatible!)

Returing a proper error code might be desirable, however the signature
of `llama_tokenize` doesn't allow it as all return values already have
existing meaning.
2025-02-07 15:55:47 +02:00
magicse 333820d749 llama : fix progress dots (#11730)
* Update llama.cpp

For display progress dots in terminal.
Without this it didn't display dots progress during loading model from file.

* Update llama.cpp

removed trailing spaces
2025-02-07 15:48:47 +02:00
Jeff Bolz c026ba3c23 vulkan: print shared memory size (#11719) 2025-02-07 11:26:03 +01:00
Christian Fillion 7ee953a64a llama : add llama_sampler_init for safe usage of llama_sampler_free (#11727)
The C API in llama.h claims users can implement `llama_sampler_i` to
create custom `llama_sampler`. The sampler chain takes ownership and
calls `llama_sampler_free` on them. However, `llama_sampler_free` is
hard-coded to use `delete`. This is undefined behavior if the object
wasn't also allocated via `new` from libllama's C++ runtime. Callers
in C and C-compatible languages do not use C++'s `new` operator. C++
callers may not be sharing the same heap as libllama.
2025-02-07 11:33:27 +02:00
Akarshan Biswas ec3bc8270b SYCL: remove XMX info from print devices (#11712) 2025-02-07 09:27:53 +00:00
Daniel Bevenius b7552cfcbc common : add default embeddings presets (#11677)
* common : add default embeddings presets

This commit adds default embeddings presets for the following models:
- bge-small-en-v1.5
- e5-small-v2
- gte-small

These can be used with llama-embedding and llama-server.

For example, with llama-embedding:
```console
./build/bin/llama-embedding --embd-gte-small-default -p "Hello, how are you?"
```

And with llama-server:
```console
./build/bin/llama-server --embd-gte-small-default
```
And the embeddings endpoint can then be called with a POST request:
```console
curl --request POST \
    --url http://localhost:8080/embeddings \
    --header "Content-Type: application/json" \
    --data '{"input": "Hello, how are you?"}'
```

I'm not sure if these are the most common embedding models but hopefully
this can be a good starting point for discussion and further
improvements.

Refs: https://github.com/ggerganov/llama.cpp/issues/10932
2025-02-07 09:15:22 +01:00
Jinyang He 225bbbfa39 ggml : optimize and build warning fix for LoongArch (#11709)
* ggml : optimize convert f32<->f16 for loongarch_asx

* ggml : optimize loongarch_asx extend i16,i8,u8 to i32,i16

* ggml : Fix warnings when run cpu CI locally on LoongArch
2025-02-07 09:38:31 +02:00
tv1wnd 855cd0734a llama : fix old glm4 models (#11670) 2025-02-06 22:48:51 +01:00
Georgi Gerganov 8a59053f63 sync : ggml 2025-02-06 21:23:03 +02:00
Patrick Peng 1d20e53c40 rpc: fix known RCE in rpc-server (ggml/1103)
Add bounds checking in `rpc_server::copy_tensor` to prevent out-of-bounds writes
+ Check if  `(uint8_t *)dst->data + ggml_nbytes(src)` remains within the destination buffer’s allocated region.
2025-02-06 21:22:54 +02:00
Xuan-Son Nguyen 2fb3c32a16 server : (webui) migrate project to ReactJS with typescript (#11688)
* init version

* fix auto scroll

* bring back copy btn

* bring back thought process

* add lint and format check on CI

* remove lang from html tag

* allow multiple generations at the same time

* lint and format combined

* fix unused var

* improve MarkdownDisplay

* fix more latex

* fix code block cannot be selected while generating
2025-02-06 17:32:29 +01:00
Tei Home 9ab42dc722 docs: update fedora cuda guide for 12.8 release (#11393)
* docs: update fedora cuda guide for 12.8 release

* docs: build cuda update
2025-02-06 12:16:15 +00:00
Akarshan Biswas 194b2e69f8 SYCL: Adjust support condition for norm operators (#11674)
SYCL does not support non contiguous tensors for norm operations
2025-02-06 11:42:35 +00:00
Georgi Gerganov 9dd7a0390f llama : add log about loading model tensors (#11699) 2025-02-06 13:41:37 +02:00
Adrien Gallouët c0d4843225 build : fix llama.pc (#11658)
Signed-off-by: Adrien Gallouët <adrien@gallouet.fr>
2025-02-06 13:08:13 +02:00
junchao-zhao 8d4d2be143 ggml : fix LoongArch compile error with 128-bit SIMD (#11701) 2025-02-06 11:20:00 +02:00
Jeff Bolz 2c6c8df56d vulkan: optimize coopmat2 iq2/iq3 callbacks (#11521)
* vulkan: optimize coopmat2 iq2/iq3 callbacks

* build: trigger CI on GLSL compute shader changes
2025-02-06 07:15:30 +01:00
Rémy O 8a7e3bf17a vulkan: initial support for IQ4_XS quantization (#11501) 2025-02-06 07:09:59 +01:00
Jeff Bolz 1b598b3058 vulkan: use smaller combined allocations to avoid fragmentation (#11551) 2025-02-06 07:02:18 +01:00
Charles Duffy 902368a06b metal : avoid breaking build when metal API predates TARGET_OS_VISION (#11690)
Avoids breakage in nix flake build introduced by b0569130c5
2025-02-06 09:52:31 +08:00
Matvey Soloviev c3db0480bb readme : add link to Autopen under UIs (#11684)
Autopen (https://github.com/blackhole89/autopen) is a graphical text editor that uses llama.cpp to tokenize the buffer on the fly, score the buffer, visualise token logits and allow you to switch back and forth between different possible completions at any point. It hopefully meets the criteria for inclusion, as the dependency on llama.cpp is stated prominently.
2025-02-06 01:55:25 +01:00
Georgi Gerganov d774ab3acc metal : adjust support conditions for norm operators (#11671)
cont #11659

ggml-ci
2025-02-05 10:57:42 +02:00
Johannes Gäßler fa62da9b2d CUDA: support for mat. mul. with ne03 != ne13 (#11656) 2025-02-05 08:58:31 +01:00
SAMI 1ec208083c llava: add quantization for the visual projector LLAVA, Qwen2VL (#11644)
* Added quantization for visual projector
* Added README
* Fixed the clip quantize implementation in the file

* Fixed the gcc warning regarding minor linting

* Removed trailing whitespace
2025-02-05 10:45:40 +03:00
Olivier Chafik 9f4cc8f8d3 sync: minja (#11641)
* `sync`: minja

https://github.com/google/minja/commit/182de30cdaee78ba86179122f8047b3bdbab7f7f

https://github.com/google/minja/pull/46

https://github.com/google/minja/pull/45
2025-02-05 01:00:12 +00:00
Johannes Gäßler fd08255d0d CUDA: non-contiguous (RMS) norm support (#11659)
* CUDA: non-contiguous (RMS) norm support

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-04 22:21:42 +01:00
fxzjshm 3ec9fd4b77 HIP: force max threads per block to be 1024 (#11621)
Some old/vendor forked version of llvm still use 256. Explicitly set it to 1024 to align with upstream llvm.

Signed-off-by: fxzjshm <fxzjshm@163.com>
2025-02-04 19:18:38 +01:00
Xuan-Son Nguyen 3962fc1a79 server : add try..catch to places not covered by set_exception_handler (#11620)
* server : add try..catch to places not covered by set_exception_handler

* log_server_request: rm try catch, add reminder
2025-02-04 18:25:42 +01:00
Radoslav Gerganov 1bef571f6a arg : list RPC devices first when using --list-devices (#11655)
List devices in the same order as they appear when evaluating the model
and splitting tensors across devices, i.e. RPC devices come first in the
list.

ref #11435
2025-02-04 18:16:20 +02:00
Olivier Chafik db288b60cb tool-call: command r7b fix for normal responses (#11608)
* fix command r7b normal response regex + add to server test

* test multiline non-tool-call responses in test-chat
2025-02-04 15:48:53 +00:00
Shelby Jenkins 106045e7bb readme : add llm_client Rust crate to readme bindings (#11628)
[This crate](https://github.com/ShelbyJenkins/llm_client) has been in a usable state for quite awhile, so I figured now is fair to add it.

It installs from crates.io, and automatically downloads the llama.cpp repo and builds it for the target platform - with the goal being the easiest user experience possible.

It also integrates model presets and choosing the largest quant given the target's available VRAM. So a user just has to specify one of the presets (I manually add the most popular models), and it will download from hugging face.

So, it's like a Rust Ollama, but it's not really for chatting. It makes heavy use of llama.cpp's grammar system to do structured output for decision making and control flow tasks.
2025-02-04 13:20:55 +02:00
Jhen-Jie Hong f117d84b48 swift : fix llama-vocab api usage (#11645)
* swiftui : fix vocab api usage

* batched.swift : fix vocab api usage
2025-02-04 13:15:24 +02:00
Jhen-Jie Hong 534c46b53c metal : use residency set for other platforms (#11648) 2025-02-04 13:07:18 +02:00
Georgi Gerganov 387a1598ca authors : update 2025-02-04 13:04:10 +02:00
Georgi Gerganov 7c9e0ca520 sync : ggml 2025-02-04 12:59:21 +02:00
Christian Kastner 8f8290ada9 cmake: Add ability to pass in GGML_BUILD_NUMBER (ggml/1096)
This makes git as a dependency optional, and is useful in the case where
ggml is built not from git, but from a tarball, or a distribution source
package.

This conditional also affects GGML_BUILD_COMMIT. Nothing seems to be
using it, though, so there doesn't seem much value factor it out, or
even require it.
2025-02-04 12:59:15 +02:00
Georgi Gerganov b34aedd558 ci : do not stale-close roadmap issues 2025-02-04 09:31:01 +02:00
Olivier Chafik cde3833239 tool-call: allow --chat-template chatml w/ --jinja, default to chatml upon parsing issue, avoid double bos (#11616)
* tool-call: allow `--jinja --chat-template chatml`

* fix double bos issue (drop bos/eos tokens from jinja template)

* add missing try catch around jinja parsing to default to chatml

* Simplify default chatml logic
2025-02-03 23:49:27 +00:00
Xuan-Son Nguyen b3451785ac server : (webui) revert hacky solution from #11626 (#11634) 2025-02-04 00:10:52 +01:00
Woof Dog 1d1e6a90bc server : (webui) allow typing and submitting during llm response (#11626) 2025-02-03 23:16:27 +01:00
Daniel Bevenius 5598f475be server : remove CPPHTTPLIB_NO_EXCEPTIONS define (#11622)
This commit removes the CPPHTTPLIB_NO_EXCEPTIONS define from the server
code.

The motivation for this is that when using a debug build the server
would crash when an exception was throws and terminate the server
process, as it was unhandled. When CPPHTTPLIB_NO_EXCEPTIONS is set
cpp_httplib will not call the exception handler, which would normally
return a 500 error to the client. This caused tests to fail when using
a debug build.

Fixes: https://github.com/ggerganov/llama.cpp/issues/11613
2025-02-03 16:45:38 +01:00
Georgi Gerganov 8ec05832fa sync : ggml 2025-02-03 14:57:08 +02:00
Johannes Gäßler 21c84b5d2d CUDA: fix Volta FlashAttention logic (#11615) 2025-02-03 14:25:56 +02:00
mashdragon d92cb67e37 server : (webui) Fix Shift+Enter handling (#11609)
* Fix Shift+Enter handling

`exact` on the Enter handler means the message is not sent when Shift+Enter is pressed anyway

* build index.html.gz

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-02-03 10:42:55 +01:00
Johannes Gäßler 6eecde3cc8 HIP: fix flash_attn_stream_k_fixup warning (#11604) 2025-02-02 23:48:29 +01:00
uvos 396856b400 CUDA/HIP: add support for selectable warp size to mmv (#11519)
CUDA/HIP: add support for selectable warp size to mmv
2025-02-02 22:40:09 +01:00
uvos 4d0598e144 HIP: add GGML_CUDA_CC_IS_* for amd familys as increasing cc archtectures for amd gpus are not supersets of eatch other (#11601)
This fixes a bug where RDNA1 gpus other than gfx1010 where not handled correctly
2025-02-02 22:08:05 +01:00
Olivier Chafik 90f9b88afb nit: more informative crash when grammar sampler fails (#11593) 2025-02-02 19:58:34 +00:00
Johannes Gäßler 864a0b67a6 CUDA: use mma PTX instructions for FlashAttention (#11583)
* CUDA: use mma PTX instructions for FlashAttention

* __shfl_sync workaround for movmatrix

* add __shfl_sync to HIP

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-02 19:31:09 +01:00
Eric Curtin 84ec8a58f7 Name colors (#11573)
It's more descriptive, use #define's so we can use compile-time
concatenations.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-02-02 15:14:48 +00:00
Olivier Chafik bfcce4d693 tool-call: support Command R7B (+ return tool_plan "thoughts" in API) (#11585)
* `tool-call`: support Command R7B (w/ tool_plan return)

* `tool-call`: cleaner preservation of tokens + warn when likely bad chat template override

* `tool-call`: test cleanup / handle lazy grammar triggers
2025-02-02 09:25:38 +00:00
Olivier Chafik 69804487e0 Fix exotic ci env that lacks ostringstream::str (#11581) 2025-02-02 09:10:15 +00:00
Michał Moskal ff227703d6 sampling : support for llguidance grammars (#10224)
* initial porting of previous LLG patch

* update for new APIs

* build: integrate llguidance as an external project

* use '%llguidance' as marker to enable llg lark syntax

* add some docs

* clarify docs

* code style fixes

* remove llguidance.h from .gitignore

* fix tests when llg is enabled

* pass vocab not model to llama_sampler_init_llg()

* copy test-grammar-integration.cpp to test-llguidance.cpp

* clang fmt

* fix ref-count bug

* build and run test

* gbnf -> lark syntax

* conditionally include llguidance test based on LLAMA_LLGUIDANCE flag

* rename llguidance test file to test-grammar-llguidance.cpp

* add gh action for llg test

* align tests with LLG grammar syntax and JSON Schema spec

* llama_tokenizer() in fact requires valid utf8

* update llg

* format file

* add $LLGUIDANCE_LOG_LEVEL support

* fix whitespace

* fix warning

* include <cmath> for INFINITY

* add final newline

* fail llama_sampler_init_llg() at runtime

* Link gbnf_to_lark.py script; fix links; refer to llg docs for lexemes

* simplify #includes

* improve doc string for LLAMA_LLGUIDANCE

* typo in merge

* bump llguidance to 0.6.12
2025-02-02 09:55:32 +02:00
piDack 0cec062a63 llama : add support for GLM-Edge and GLM-Edge-V series models (#10573)
* add glm edge chat model

* use config partial_rotary_factor as rope ratio

* support for glm edge model

* vision model support

* remove debug info

* fix format

* llava.cpp trailing whitespace

* remove unused AutoTokenizer

* Update src/llama.cpp for not contain <|end|> or </s>

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* add edge template

* fix chat template

* fix confict

* fix confict

* fix ci err

* fix format err

* fix template err

* 9b hf chat support

* format

* format clip.cpp

* fix format

* Apply suggestions from code review

* Apply suggestions from code review

* Update examples/llava/clip.cpp

* fix format

* minor : style

---------

Co-authored-by: liyuhang <yuhang.li@zhipuai.cn>
Co-authored-by: piDack <pcdack@hotmail.co>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: liyuhang <yuhang.li@aminer.cn>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-02 09:48:46 +02:00
Olivier Chafik 53debe6f3c ci: use sccache on windows HIP jobs (#11553) 2025-02-01 18:22:38 +00:00
Olivier Chafik cfd74c86db sync: minja (https://github.com/google/minja/commit/418a2364b56dc9be4ed9a1a2b0fb16fb53a7a22e) (#11574) 2025-02-01 12:24:51 +00:00
Eric Curtin ecef206ccb Implement s3:// protocol (#11511)
For those that want to pull from s3

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-02-01 10:30:54 +00:00
Olivier Chafik 5bbc7362cb ci: simplify cmake build commands (#11548) 2025-02-01 00:01:20 +00:00
Olivier Chafik aa6fb13213 ci: use sccache on windows instead of ccache (#11545)
* Use sccache on ci for windows

* Detect sccache in cmake
2025-01-31 17:12:40 +00:00
Olivier Chafik a83f528688 tool-call: fix llama 3.x and functionary 3.2, play nice w/ pydantic_ai package, update readme (#11539)
* An empty tool_call_id is better than none!

* sync: minja (tool call name optional https://github.com/google/minja/pull/36)

* Force-disable parallel_tool_calls if template doesn't support it

* More debug logs

* Llama 3.x tools: accept / trigger on more varied spaced outputs

* Fix empty content for functionary v3.2 tool call

* Add proper tool call docs to server README

* readme: function calling *is* supported now

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-01-31 14:15:25 +00:00
Olivier Chafik b1bcd309fc fix stop regression (#11543) 2025-01-31 13:48:31 +00:00
Olivier Chafik 5783575c9d Fix chatml fallback for unsupported builtin templates (when --jinja not enabled) (#11533) 2025-01-31 08:24:29 +00:00
Olivier Chafik 4a2b196d03 server : fix --jinja when there's no tools or schema (typo was forcing JSON) (#11531) 2025-01-31 10:12:40 +02:00
Steve Grubb 1bd3047a93 common: Add missing va_end (#11529)
The va_copy man page states that va_end must be called to revert
whatever the copy did. For some implementaions, not calling va_end
has no consequences. For others it could leak memory.
2025-01-31 07:58:55 +02:00
Daniel Bevenius a2df2787b3 server : update help metrics processing/deferred (#11512)
This commit updates the help text for the metrics `requests_processing`
and `requests_deferred` to be more grammatically correct.

Currently the returned metrics look like this:
```console
\# HELP llamacpp:requests_processing Number of request processing.
\# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
\# HELP llamacpp:requests_deferred Number of request deferred.
\# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 0
```

With this commit, the metrics will look like this:
```console
\# HELP llamacpp:requests_processing Number of requests processing.
\# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
\# HELP llamacpp:requests_deferred Number of requests deferred.
\# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 0
```
This is also consistent with the description of the metrics in the
server examples [README.md](https://github.com/ggerganov/llama.cpp/tree/master/examples/server#get-metrics-prometheus-compatible-metrics-exporter).
2025-01-31 06:04:53 +01:00
Olivier Chafik 553f1e46e9 ci: ccache for all github worfklows (#11516) 2025-01-30 22:01:06 +00:00
Olivier Chafik 8b576b6c55 Tool call support (generic + native for Llama, Functionary, Hermes, Mistral, Firefunction, DeepSeek) w/ lazy grammars (#9639)
---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-01-30 19:13:58 +00:00
uvos 27d135c970 HIP: require at least HIP 5.5 2025-01-30 16:25:44 +01:00
uvos 6af1ca48cb HIP: Prepare reduction operators for wave 64 2025-01-30 16:25:44 +01:00
uvos c300e68ef4 CUDA/HIP: add warp_size to cuda_device_info 2025-01-30 16:25:44 +01:00
Olivier Chafik 3d804dec76 sync: minja (#11499) 2025-01-30 10:30:27 +00:00
mgroeber9110 ffd0821c57 vocab : correctly identify LF token for GPT-2 style BPE tokenizer (#11496) 2025-01-30 12:10:59 +02:00
Daniel Bevenius 4314e56c4f server : use lambda instead of std::bind (#11507)
This commit replaces the two usages of `std::bind` in favor of lambdas for
the callback functions for `callback_new_task` and
`callback_update_slots`.

The motivation for this changes is consistency with the rest of the code
in server.cpp (lambdas are used for all other callbacks/handlers). Also
lambdas are more readable (perhaps this is subjective) but also they are
recommended over `std::bind` in modern C++.

Ref: https://github.com/LithoCoders/dailycpp/blob/master/EffectiveModernC%2B%2B/chapter6/Item34_Prefer_lambdas_to_std::bind.md
2025-01-30 11:05:00 +01:00
Isaac McFadyen 496e5bf46b server : (docs) added response format for /apply-template [no ci] (#11503) 2025-01-30 10:11:53 +01:00
Guspan Tanadi 7919256c57 readme : reference examples relative links (#11505) 2025-01-30 06:58:02 +01:00
Daniel Bevenius e0449763a4 server : update json snippets in README.md [no ci] (#11492)
This commit updates some of JSON snippets in README.md file and
removes the `json` language tag from the code blocks.

The motivation for this changes is that if there is invalid json in a
code snippet these are highlighted in red which can make it somewhat
difficult to read and can be a little distracting.
2025-01-30 05:48:14 +01:00
Nigel Bosch eb7cf15a80 server : add /apply-template endpoint for additional use cases of Minja functionality (#11489)
* add /apply-template endpoint to server

* remove unnecessary line

* add /apply-template documentation

* return only "prompt" field in /apply-template

* use suggested idea instead of my overly verbose way
2025-01-29 19:45:44 +01:00
Rémy Oudompheng 66ee4f297c vulkan: implement initial support for IQ2 and IQ3 quantizations (#11360)
* vulkan: initial support for IQ3_S

* vulkan: initial support for IQ3_XXS

* vulkan: initial support for IQ2_XXS

* vulkan: initial support for IQ2_XS

* vulkan: optimize Q3_K by removing branches

* vulkan: implement dequantize variants for coopmat2

* vulkan: initial support for IQ2_S

* vulkan: vertically realign code

* port failing dequant callbacks from mul_mm

* Fix array length mismatches

* vulkan: avoid using workgroup size before it is referenced

* tests: increase timeout for Vulkan llvmpipe backend

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-01-29 18:29:39 +01:00
Daniel Bevenius e51c47b401 server : update auto gen files comments [no ci] (#11484)
* server : update auto gen files comments

This commit updates the 'auto generated files' comments in server.cpp
and removes `deps.sh` from the comment.

The motivation for this change is that `deps.sh` was removed in
Commit 91c36c269b ("server : (web ui)
Various improvements, now use vite as bundler (#10599)").

* squash! server : update auto gen files comments [no ci]

Move comments about file generation to README.md.

* squash! server : update auto gen files comments [no ci]

Remove the comments in server.cpp that mention that information
can be found in the README.md file.
2025-01-29 16:34:18 +01:00
Jeff Bolz 2711d0215f vulkan: Catch pipeline creation failure and print an error message (#11436)
* vulkan: Catch pipeline creation failure and print an error message

Also, fix some warnings from my on-demand compile change.

* vulkan: fix pipeline creation logging
2025-01-29 09:26:50 -06:00
Eric Curtin f0d4b29edf Parse https://ollama.com/library/ syntax (#11480)
People search for ollama models using the web ui, this change
allows one to copy the url from the browser and for it to be
compatible with llama-run.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-01-29 11:23:10 +00:00
Georgi Gerganov 815857791d sync : ggml 2025-01-29 11:25:29 +02:00
William Tambellini 1a0e87d291 ggml : add option to not print stack on abort (ggml/1081)
* Add option to not print stack on abort

Add option/envvar to disable stack printing on abort.
Also link some unittests with Threads to fix link errors on
ubuntu/g++11.

* Update ggml/src/ggml.c

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-01-29 11:24:53 +02:00
issixx d2e518e9b4 ggml-cpu : fix ggml_graph_compute_thread did not terminate on abort. (ggml/1065)
some threads kept looping and failed to terminate properly after an abort during CPU execution.

Co-authored-by: issi <issi@gmail.com>
2025-01-29 11:24:51 +02:00
Daniel Bevenius b636228c0a embedding : enable --no-warmup option (#11475)
This commit enables the `--no-warmup` option for the llama-embeddings.

The motivation for this change is to allow the user to disable the
warmup when running the the program.
2025-01-29 10:38:54 +02:00
Molly Sophia 325afb370a llama: fix missing k_cache store for rwkv6qwen2 (#11445)
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2025-01-29 12:07:21 +08:00
Emreerdog 794fe23f29 cmake: add hints for locating ggml on Windows using Llama find-package (#11466) 2025-01-28 19:22:06 -04:00
peidaqi cf8cc856d7 server : Fixed wrong function name in llamacpp server unit test (#11473)
The test_completion_stream_with_openai_library() function is actually with stream=False by default, and test_completion_with_openai_library() with stream=True
2025-01-29 00:03:42 +01:00
Xuan-Son Nguyen d0c08040b6 ci : fix build CPU arm64 (#11472)
* ci : fix build CPU arm64

* failed, trying ubuntu 22

* vulkan: ubuntu 24

* vulkan : jammy --> noble
2025-01-29 00:02:56 +01:00
uvos be5ef7963f HIP: Supress transformation warning in softmax.cu
loops with bounds not known at compile time can not be unrolled.
when ncols_template == 0, the bounds of the loop are not constexpr, thus llvm cant unroll the loops here.
2025-01-28 23:06:32 +01:00
Nikita Sarychev cae9fb4361 HIP: Only call rocblas_initialize on rocblas versions with the multiple instantation bug (#11080)
This disables the workaround on rocblas fixed versions (>=4.0.0) to eliminate the runtime cost and unnecessary VRAM allocation of loading all tensile objects.
2025-01-28 16:42:20 +01:00
Eric Curtin 7fee2889e6 Add github protocol pulling and http:// (#11465)
As pulling protocols to llama-run

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-01-28 14:45:41 +00:00
Nuno d7d1eccacc docker: allow installing pip packages system-wide (#11437)
Signed-off-by: rare-magma <rare-magma@posteo.eu>
2025-01-28 14:17:25 +00:00
someone13574 4bf3119d61 cmake : don't fail on GGML_CPU=OFF (#11457) 2025-01-28 15:15:34 +01:00
Nuno f643120bad docker: add perplexity and bench commands to full image (#11438)
Signed-off-by: rare-magma <rare-magma@posteo.eu>
2025-01-28 10:42:32 +00:00
Akarshan Biswas 6e84b0ab8e SYCL : SOFTMAX F16 mask support and other fixes (#11261)
Implemented ggml_sycl_op_soft_max() F16 src1(mask) support for which a pragma deprecation warning was added during #5021.
To do this, had to decouple it from ggml_sycl_op_flatten which always considered src1 to be of fp32 type(many OP functions are dependent on it).

* SYCL: SOFTMAX F16 mask support and other fixes

* test-backend-ops: Add F16 mask test cases
2025-01-28 09:56:58 +00:00
Michael Engel 2b8525d5c8 Handle missing model in CLI parameters for llama-run (#11399)
The HTTP client in llama-run only prints an error in case the download of
a resource failed. If the model name in the CLI parameter list is missing,
this causes the application to crash.
In order to prevent this, a check for the required model parameter has been
added and errors for resource downloads get propagated to the caller.

Signed-off-by: Michael Engel <mengel@redhat.com>
2025-01-28 08:32:40 +00:00
Eric Curtin a4417ddda9 Add new hf protocol for ollama (#11449)
https://huggingface.co/docs/hub/en/ollama

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-01-27 19:36:10 +01:00
Haus1 d6d24cd9ed AMD: parse the architecture as supplied by gcnArchName (#11244)
The value provided by minor doesn't include stepping for AMD, parse the value returned by gcnArchName instead to retrieve an accurate ID.
2025-01-27 14:58:17 +01:00
lexasub a5203b4465 llama : minor fixes for up llama load model speed (#11448)
* impl::load change map bpe_ranks to onordered map for reduce time of impl::load on 30%

* llama_model_loader::init_mapping - replace new llama_mmap to std::make_unique<llama_mmap> for clean code & reduce (/2) time of running init_mappings

* Update src/llama-vocab.cpp

---------

Co-authored-by: lexasub <empty@empty.ru>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-01-27 14:42:09 +01:00
Johannes Gäßler df984e0147 llama: refactor llama_decode_impl (#11381) 2025-01-27 12:07:12 +01:00
Ihar Hrachyshka acd38efee3 metal: Handle null returned from MTLCreateSystemDefaultDevice() (#11441)
This fixes segmentation fault error when running tests when no metal
devices are available (for example, when not linked with Core Graphics
framework or otherwise).
2025-01-27 09:41:59 +02:00
Xuan Son Nguyen caf773f249 docker : fix ARM build and Vulkan build (#11434)
* ci : do not fail-fast for docker

* build arm64/amd64 separatedly

* fix pip

* no fast fail

* vulkan: try jammy
2025-01-26 22:45:32 +01:00
Georgi Gerganov 178a7eb952 metal : use residency sets (#11427)
* metal : use residency sets

ggml-ci

* metal : restore commandBufferWithUnretainedReferences calls [no ci]

* metal : release descriptors

ggml-ci

* metal : check env GGML_METAL_NO_RESIDENCY

ggml-ci

* metal : fix build + clean-up

ggml-ci
2025-01-26 20:06:16 +02:00
Nuno 6f53d8a6b4 docker: add missing vulkan library to base layer and update to 24.04 (#11422)
Signed-off-by: rare-magma <rare-magma@posteo.eu>
2025-01-26 18:22:43 +01:00
bandoti 19f65187cb cmake: add ggml find package (#11369)
* Add initial ggml cmake package

* Add build numbers to ggml find-package

* Expand variables with GGML_ prefix

* Guard against adding to cache variable twice

* Add git to msys2 workflow

* Handle ggml-cpu-* variants

* Link ggml/ggml-base libraries to their targets

* Replace main-cmake-pkg with simple-cmake-pkg

* Interface features require c_std_90

* Fix typo

* Removed unnecessary bracket from status message

* Update examples/simple-cmake-pkg/README.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update examples/simple-cmake-pkg/README.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-01-26 12:07:48 -04:00
Frank Mai 1d8ee06000 rpc: fix register position (#11424)
Signed-off-by: thxCode <thxcode0824@gmail.com>
2025-01-26 16:20:34 +01:00
Georgi Gerganov 2cc9b8c32c readme : update hot topics 2025-01-26 14:30:15 +02:00
Jeff Bolz f35726c2fb build: apply MSVC /bigobj option to c/cpp files only (#11423) 2025-01-26 03:10:03 +01:00
Jeff Bolz 4a75d19376 vulkan: compile shaders on-demand (#11406)
Reduce first-run startup time and memory consumption.

Should fix #11339.
2025-01-25 22:29:57 +01:00
uvos 26771a1491 Hip: disable VMM on hip as it seams that it dosent work in some configurations (#11420) 2025-01-25 21:01:12 +01:00
Jeff Bolz ca6baf76c1 build: add /bigobj to MSVC build (#11407) 2025-01-25 11:26:37 -06:00
Diego Devesa 6e264a905b docker : add GGML_CPU_ARM_ARCH arg to select ARM architecture to build for (#11419) 2025-01-25 17:22:41 +01:00
Xuan Son Nguyen 49b0e3cec4 server : fix cleaning up stream task (#11418)
* server : fix cleaning up stream task

* one more spot
2025-01-25 16:36:44 +01:00
Diego Devesa 20a758155b docker : fix CPU ARM build (#11403)
* docker : fix CPU ARM build

* add CURL to other builds
2025-01-25 15:22:29 +01:00
Georgi Gerganov 00c24acb2a ci : fix line breaks on windows builds (#11409)
* ci : fix line breaks on windows builds

* cont : another try

* ci : fix powershell line breaks
2025-01-25 13:36:48 +02:00
jiahao su 466ea66f33 CANN: Add Ascend CANN build ci (#10217)
* CANN: Add Ascend CANN build ci

* Update build.yml

* Modify cann image version

* Update build.yml

* Change to run on x86 system

* Update build.yml

* Update build.yml

* Modify format error

* Update build.yml

* Add 'Ascend NPU' label restrictions

* Exclude non PR event

Co-authored-by: Yuanhao Ji <jiyuanhao@apache.org>

* Update build.yml

---------

Co-authored-by: Yuanhao Ji <jiyuanhao@apache.org>
2025-01-25 00:26:01 +01:00
uvos 5f0db9522f hip : Add hipGraph and VMM support to ROCM (#11362)
* Add hipGraph support

* Enable VMM on rocm
2025-01-25 00:02:23 +01:00
Johannes Gäßler c5d9effb49 CUDA: fix FP16 cuBLAS GEMM (#11396) 2025-01-24 21:02:43 +01:00
uvos 9fbadaef4f rocBLAS: Avoid fp32->fp16->fp32 conversion on cdna (#11356) 2025-01-24 17:50:49 +01:00
Georgi Gerganov 9755129c27 release : pack /lib in the packages (#11392)
* release : pack /lib and /include in the packages

* cmake : put libs in /bin

* TMP : push artifacts

* Revert "TMP : push artifacts"

This reverts commit 4decf2c4df.

* ci : fix HIP cmake compiler options to be on first line

* ci : restore the original HIP commands

* ci : change ubuntu build from latest to 20.04

* ci : try to fix macos build rpaths

* ci : remove obsolete MacOS build

* TMP : push artifacts

* ci : change back to ubuntu latest

* ci : macos set build rpath to "@loader_path"

* ci : fix typo

* ci : change ubuntu package to 22.04

* Revert "TMP : push artifacts"

This reverts commit 537b09e70f.
2025-01-24 18:41:30 +02:00
Jafar Uruç a07c2c8a52 docs : Update readme to build targets for local docker build (#11368) 2025-01-24 14:30:13 +01:00
Johannes Gäßler 8137b4bb2b CPU/CUDA: fix (GQA) mul mat back, add CUDA support (#11380) 2025-01-24 12:38:31 +01:00
Bernhard M. Wiedemann 1af6945eb0 cmake : avoid -march=native when reproducible build is wanted (#11366)
See https://reproducible-builds.org/ for why this is good
and https://reproducible-builds.org/specs/source-date-epoch/
for the definition of this variable.

Without this patch, compiling on different machines produced different binaries, which made verification of results difficult.

Fixes: #11317

This patch was done while working on reproducible builds for openSUSE.
2025-01-24 13:21:35 +02:00
Eric Curtin 01f37edf1a Update llama-run README.md (#11386)
For consistency

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-01-24 09:39:24 +00:00
stduhpf c07e87f38b server : (webui) put DeepSeek R1 CoT in a collapsible <details> element (#11364)
* webui : put DeepSeek R1 CoT in a collapsible <details> element

* webui: refactor split

* webui: don't use regex to split cot and response

* webui: format+qol

* webui: no loading icon if the model isn't generating

* ui fix, add configs

* add jsdoc types

* only filter </think> for assistant msg

* build

* update build

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-01-24 09:02:38 +01:00
Jeff Bolz 564804b79b tests: fix some mul_mat test gaps (#11375)
Now that we have batched mat-vec mul Vulkan shaders for up to n==8,
these tests weren't actually exercising the mat-mat mul path. Test
n==9 as well. Also, change to use all_types.
2025-01-23 14:51:24 -06:00
Eric Curtin 05f63cc9ee Update documentation (#11373)
To show -n, -ngl, --ngl is acceptable.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-01-23 20:04:31 +00:00
Eric Curtin f7fb43cd0b Add -ngl (#11372)
Most other llama.cpp cli tools accept -ngl with a single dash.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-01-23 16:16:18 +00:00
Xuan Son Nguyen 5845661640 server : add more clean up when cancel_tasks is called (#11340)
* server : add more clean up when cancel_tasks is called

* fix recv_with_timeout

* std::remove_if

* fix std::remove_if
2025-01-23 13:56:05 +01:00
Eric Curtin f211d1dc10 Treat hf.co/ prefix the same as hf:// (#11350)
ollama uses hf.co/ to specify huggingface prefix, like RamaLama
uses hf://

Treat them similarly.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-01-23 10:38:20 +00:00
amd-dwang 955a6c2d91 Vulkan-run-test: fix mmq_wg_denoms (#11343)
There should be a copy-and-paste error here.

*mmq_wg_denoms should be used together with *warptile_mmq, instead of
wg_denoms.
2025-01-23 08:14:28 +01:00
Jeff Bolz 1971adf55e vulkan: sort shaders for more deterministic binary (#11315)
Fixes #11306.
2025-01-23 08:07:50 +01:00
Jeff Bolz 5245729e33 vulkan: fix diag_mask_inf (#11323)
With robustbufferaccess disabled, this shader was showing OOB stores. There
is a bounds check in the code, but the workgrouop dimensions were reversed vs
CUDA and it was running the wrong number of threads. So fix the workgroup
dimensions and disable robustness for this pipeline.
2025-01-23 08:01:17 +01:00
Diego Devesa 6152129d05 main : update README documentation for batch size (#11353)
* main : update README documentation for batch size

* fix formatting

* minor
2025-01-22 19:22:20 +01:00
Georgi Gerganov 16d3df7ab0 readme : add plugin links (#11355) 2025-01-22 19:44:26 +02:00
Diego Devesa 12c2bdf2de server : fix draft context not being released (#11354) 2025-01-22 17:44:40 +01:00
Olivier Chafik c64d2becb1 minja: sync at https://github.com/google/minja/commit/0f5f7f2b3770eb682fbc11763266d45204173686 (#11352) 2025-01-22 16:16:27 +00:00
Jiří Podivín 96f4053934 Adding logprobs to /v1/completions (#11344)
Signed-off-by: Jiri Podivin <jpodivin@redhat.com>
2025-01-22 12:51:32 +01:00
Olivier Chafik a94f3b2727 common: utils to split / join / repeat strings (from json converter) (#11342)
* Factor string_join, string_split, string_repeat into common

* json: refactor to surface a versatile builder

* Update common.cpp
2025-01-22 09:51:44 +00:00
tc-mb 3e3357fd77 llava : support Minicpm-omni (#11289)
* init

* add readme

* update readme

* no use make

* update readme

* update fix code

* fix editorconfig-checker

* no change convert py

* use clip_image_u8_free
2025-01-22 09:35:48 +02:00
Olivier Chafik 6171c9d258 Add Jinja template support (#11016)
* Copy minja from https://github.com/google/minja/commit/58f0ca6dd74bcbfbd4e71229736640322b31c7f9

* Add --jinja and --chat-template-file flags

* Add missing <optional> include

* Avoid print in get_hf_chat_template.py

* No designated initializers yet

* Try and work around msvc++ non-macro max resolution quirk

* Update test_chat_completion.py

* Wire LLM_KV_TOKENIZER_CHAT_TEMPLATE_N in llama_model_chat_template

* Refactor test-chat-template

* Test templates w/ minja

* Fix deprecation

* Add --jinja to llama-run

* Update common_chat_format_example to use minja template wrapper

* Test chat_template in e2e test

* Update utils.py

* Update test_chat_completion.py

* Update run.cpp

* Update arg.cpp

* Refactor common_chat_* functions to accept minja template + use_jinja option

* Attempt to fix linkage of LLAMA_CHATML_TEMPLATE

* Revert LLAMA_CHATML_TEMPLATE refactor

* Normalize newlines in test-chat-templates for windows tests

* Forward decl minja::chat_template to avoid eager json dep

* Flush stdout in chat template before potential crash

* Fix copy elision warning

* Rm unused optional include

* Add missing optional include to server.cpp

* Disable jinja test that has a cryptic windows failure

* minja: fix vigogne (https://github.com/google/minja/pull/22)

* Apply suggestions from code review

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Finish suggested renamings

* Move chat_templates inside server_context + remove mutex

* Update --chat-template-file w/ recent change to --chat-template

* Refactor chat template validation

* Guard against missing eos/bos tokens (null token otherwise throws in llama_vocab::impl::token_get_attr)

* Warn against missing eos / bos tokens when jinja template references them

* rename: common_chat_template[s]

* reinstate assert on chat_templates.template_default

* Update minja to https://github.com/google/minja/commit/b8437df626ac6cd0ce3b333b3c74ed1129c19f25

* Update minja to https://github.com/google/minja/pull/25

* Update minja from https://github.com/google/minja/pull/27

* rm unused optional header

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-01-21 13:18:51 +00:00
Xuan Son Nguyen e28245f35f export-lora : fix tok_embd tensor (#11330) 2025-01-21 14:07:12 +01:00
Radoslav Gerganov 6da5bec81c rpc : better caching of the base buffer pointer (#11331)
There is no need to use map, just store the base pointer in the buffer
context.
2025-01-21 15:06:41 +02:00
Eric Curtin 2e2f8f093c linenoise.cpp refactoring (#11301)
More RAII mainly

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-01-21 09:32:35 +00:00
Georgi Gerganov 2139667ec4 metal : fix out-of-bounds write (#11314)
ggml-ci
2025-01-21 08:48:13 +02:00
Georgi Gerganov 80d0d6b4b7 common : add -hfd option for the draft model (#11318)
* common : add -hfd option for the draft model

* cont : fix env var

* cont : more fixes
2025-01-20 22:29:43 +02:00
Jeff Bolz aea8ddd516 vulkan: fix coopmat2 validation failures (#11284)
mul mat and flash attention shaders were loading f32 types directly into
A/B matrices, which happens to work but is technically invalid usage.
For FA, we can load it as an Accumulator matrix and convert and this
is not in the inner loop and is cheap enough. For mul mat, it's more
efficient to do this conversion in a separate pass and have the input(s)
be f16.

coopmat2 requires SPIR-V 1.6 (related using to LocalSizeId). LocalSizeId
requires maintenance4 be enabled, and SPIR-V 1.6 requires Vulkan 1.3.
2025-01-20 10:38:32 -06:00
Georgi Gerganov 9f7add1cde examples : fix add_special conditions (#11311) 2025-01-20 16:36:08 +02:00
Christopher Nielsen 90d987b105 mmap: add include for cerrno (#11296)
ggml-ci

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-01-20 16:02:43 +02:00
Michael Podvitskiy a4251edd6f cmake: fix shell command quoting in build-info script (#11309) 2025-01-20 16:02:15 +02:00
Xuan Son Nguyen ec7f3ac9ab llama : add support for Deepseek-R1-Qwen distill model (#11310)
* llama : add support for Deepseek-R1-Qwen distill model

* coding style
2025-01-20 14:35:07 +01:00
Georgi Gerganov ef6dada60c cont : fix whitespaces (#11305) 2025-01-20 09:29:32 +02:00
Kyle Bruene ae3c1db2f9 llama : re-add LLM_ARCH_PHIMOE (#11305)
Phi 3.5 MoE was partially removed during a refactor. The code was originally in llama.cpp and should be in llama-model.cpp after the refactor.
2025-01-20 09:21:01 +02:00
Georgi Gerganov 92bc493917 tests : increase timeout when sanitizers are enabled (#11300)
* tests : increase timeout when sanitizers are enabled

* tests : add DEFAULT_HTTP_TIMEOUT
2025-01-19 20:22:30 +02:00
Georgi Gerganov b9daaffe02 simple-chat : fix BOS being added to each message (#11278) 2025-01-19 18:12:09 +02:00
Nicolò Scipione 99487b57d4 SYCL: Introducing memory host pool (#11251)
* Implement host pool for matrix_info

Creating a new memory pool on the host to store memory location for
matrix_info needed to launch gemm_batch from oneMKL/oneMath.
Removing complex support in gemm_batch since it is not used in llama.cpp

* Remove unnecessary headers and cast

* Reorder member variable to avoid warning on initialization

* Formatting

* Remove unused variable

* Address PR review feedback - remove warning

---------

Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
2025-01-19 21:33:34 +08:00
Eric Curtin a1649cc13f Adding linenoise.cpp to llama-run (#11252)
This is a fork of linenoise that is C++17 compatible. I intend on
adding it to llama-run so we can do things like traverse prompt
history via the up and down arrows:

https://github.com/ericcurtin/linenoise.cpp

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-01-18 14:42:31 +00:00
Georgi Gerganov 4dd34ff831 cmake : add sanitizer flags for llama.cpp (#11279)
* cmake : add sanitizer flags for llama.cpp

ggml-ci

* tests : fix compile warnings

ggml-ci

* cmake : move sanitizer flags to llama_add_compile_flags

ggml-ci

* cmake : move llama.cpp compile flags to top level lists

ggml-ci

* cmake : apply only sanitizer flags at top level

ggml-ci

* tests : fix gguf context use in same_tensor_data

* gguf-test: tensor data comparison

* dummy : trigger ggml-ci

* unicode : silence gcc warnings

ggml-ci

* ci : use sanitizer builds only in Debug mode

ggml-ci

* cmake : add status messages [no ci]

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-01-18 16:18:15 +02:00
Xuan Son Nguyen f30f099228 server : implement cancellable request (#11285)
* server : implement cancellable request

* fix typo

* httplib 0.18.5

* fix i underflow
2025-01-18 14:12:05 +01:00
Georgi Gerganov f26c874179 scripts : restore hf.sh (#11288)
ggml-ci
2025-01-18 13:18:32 +02:00
LostRuins Concedo 6390a998bf tts : add guide tokens support (#11186)
* Added the ability to use guide tokens for OuteTTS, greatly improving TTS recitation accuracy over long input sequences.

* applied linting suggestions, updated to latest llama_vocab changes, added a safety check, added newline to guide token start
2025-01-18 12:20:57 +02:00
Jeff Bolz 44e18ef939 vulkan: fix coopmat2 flash attention for non-contiguous inputs (#11281)
Add code similar to mul_mm_cm2 to force alignment of strides, to avoid
a performance regression.

Add noncontiguous FA tests in test-backend-ops.

Fixes #11268.
2025-01-18 09:26:50 +01:00
codezjx 3edfa7d375 llama.android: add field formatChat to control whether to parse special tokens when send message (#11270) 2025-01-17 14:57:56 +02:00
Radoslav Gerganov 667d72846c rpc : early register backend devices (#11262)
Early register RPC devices and do not propagate RPC specifics in the
llama model structures.

ref: #10609
2025-01-17 10:57:09 +02:00
Georgi Gerganov a133566d34 vocab : fix double-eos check (#11273)
ggml-ci
2025-01-17 09:28:00 +02:00
David Renshaw 960ec65273 llama : fix deprecation message: vocabable -> vocab (#11269) 2025-01-17 08:12:01 +01:00
musoles 7a689c415e README : added kalavai to infrastructure list (#11216) 2025-01-17 01:10:49 +01:00
Jeff Bolz bd38ddea01 vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl (#11166)
* vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl

Shaders are based on cpy.cu.

* vulkan: support copy from q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl to f32

* ggml: copy q->f32 assumes some contiguity in the destination
2025-01-16 22:47:10 +01:00
Jeff Bolz 466300fe14 vulkan: optimize coopmat2 q4_k/q5_k dequant functions. (#11206)
Do masking on whole dwords, fetch all scales at once.
2025-01-16 22:23:49 +01:00
Jeff Bolz 206bc53422 vulkan: optimize coopmat2 q2_k dequant function (#11130) 2025-01-16 22:16:39 +01:00
RunningLeon 4dbc8b9cb7 llama : add internlm3 support (#11233)
* support internlm3

* fix lint
2025-01-16 20:10:38 +02:00
Johannes Gäßler 9c8dcefe17 CUDA: backwards pass for misc. ops, add tests (#11257)
* CUDA: backwards pass for misc. ops, add tests

* remove restrict from pointers
2025-01-16 16:43:38 +01:00
Xuan Son Nguyen 681149ced2 llama : add llama_model_load_from_splits (#11255)
* llama : add `llama_model_load_from_splits`

* update
2025-01-16 13:54:08 +01:00
fj-y-saito c67cc9837d ggml: aarch64: implement SVE kernels for q4_K_q8_K vector dot (#11227)
* Add SVE support for q4_K_q8_K

* Update ggml/src/ggml-cpu/ggml-cpu-quants.c

change to use K_SCALE_SIZE

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-01-16 11:11:49 +02:00
Eve adc5dd92e8 vulkan: scale caching for k quants + misc fixes (#11081)
* q6_k scale caching

* 16 bit unpack

* q4_k test (slow)

* revert it

* q3_k

* q2_k

* little stuff

* try precalculating products of a and q2_k scales

* Revert "try precalculating products of a and q2_k scales"

This reverts commit 65110b81f23f66331a50c6e889a7c1ab9470a86b.

* unpack should be u16, add vim swap to gitignore (about time)

* better q4_k scales

* q5_k

* better q6_k with separate paths for all threads and partial threads in use, plus some more optimizations

* q2_k better dequant

* q3_k optimizations

* q3_k use hmask simd from cpu avx version

* make the caches happy

* q3_k separate out calculation

* q2_k separate out

* little stuff

* use calc_superblock everywhere

* q2_k optimize scale calculation

* more barriers
2025-01-15 19:50:13 +00:00
Georgi Gerganov f11cfdfd7f ci : use -no-cnv in gguf-split tests (#11254)
* ci : use -no-cnv in gguf-split tests

ggml-ci

* ci : use -no-cnv in requantize tests

ggml-ci

* scripts : fix [no ci]
2025-01-15 18:28:35 +02:00
Junil Kim 1d8504338e fix: ggml: fix vulkan-shaders-gen build (#10448)
* fix: ggml: fix vulkan-shaders-gen build

The vulkan-shaders-gen target was not being built correctly
in case of cross-compilation.
Other outputs need to be built for the cross compile target,
but vulkan-shaders-gen needs to be built for the host.

* refactor: ggml: Improve vulkan-shaders-gen toolchain setup

- Add GGML_SHADERS_GEN_TOOLCHAIN CMake option.
- Auto-detect host toolchain if not set.

* refactor: ggml: Improve vulkan-shaders-gen toolchain setup

Use configure_file to generate host_toolchain.cmake from template

* fix: ggml: Fix compile error

Fix compile error not finding vulkan-shaders-gen

* fix: vulkan-shaders-gen build and path handling

Fix build issues with vulkan-shaders-gen:
- Add target dependency for correct build order
- Use CMAKE_HOST_SYSTEM_NAME for executable suffix
- Fix MSVC output directory in host toolchain
- Normalize path handling for cross-compilation

* fix: improve host compiler detection in vulkan shader build

Improve host compiler detection for vulkan shader generation:
- Add NO_CMAKE_FIND_ROOT_PATH to all compiler searches
- Consolidate compiler detection logic
- Fix Windows-specific MSVC detection
- Ensure correct compiler search in cross-compilation

* refactor: Simplify CMake function for detecting host compiler

Simplified the CMake function to improve the process of detecting the host compiler.

* fix: Remove unnecessary Vulkan library linkage in CMakeLists.txt

Since `vulkan-shader-gen.cpp` only requires the `glslc` executable
and not the Vulkan headers or libraries, CMakeLists.txt needs to
be corrected.
(See: ecc93d0558)

* refactor: Rename host_toolchain.cmake.in

- Rename host_toolchain.cmake.in to cmake/host-toolchain.cmake.in

* refactor: GGML_VULKAN_SHADERS_GEN_TOOLCHAIN

Rename the macro GGML_SHADERS_GEN_TOOLCHAIN to GGML_VULKAN_SHADERS_GEN_TOOLCHAIN
2025-01-15 14:17:42 +01:00
Johannes Gäßler 432df2d5f9 RoPE: fix back, CUDA support for back + noncont. (#11240)
* RoPE: fix back, CUDA support for back + noncont.

* fix comments reg. non-cont. RoPE support [no-ci]
2025-01-15 12:51:37 +01:00
Daniel Bevenius 0ccd7f3eb2 examples : add embd_to_audio to tts-outetts.py [no ci] (#11235)
This commit contains a suggestion for adding the missing embd_to_audio
function from tts.cpp to tts-outetts.py. This introduces a depencency
numpy which I was not sure if that is acceptable or not (only PyTorch
was mentioned in referened PR).

Also the README has been updated with instructions to run the example
with llama-server and the python script.

Refs: https://github.com/ggerganov/llama.cpp/pull/10784#issuecomment-2548377734
2025-01-15 05:44:38 +01:00
Akarshan Biswas f446c2cf6a SYCL: Add gated linear attention kernel (#11175)
* SYCL: Add Gated Linear attention kernel

* glahpp: add a space at the end of file

* gla: Put the barrier inside the main logic loop
2025-01-15 11:20:17 +08:00
Xuan Son Nguyen b4d92a59a2 ci : add -no-cnv for tests (#11238) 2025-01-14 16:42:23 +02:00
Georgi Gerganov bbf3e55e35 vocab : add dummy tokens for "no_vocab" type (#11231)
* vocab : add dummy tokens for "no_vocab" type

ggml-ci

* vocab : minor [no ci]
2025-01-14 11:54:58 +01:00
ebraminio c5bf0d1bd7 server : Improve code snippets direction between RTL text (#11221) 2025-01-14 11:39:33 +01:00
Olivier Chafik 091592d758 Refactor test-chat-template.cpp (#11224)
* Refactor test-chat-template

* Update test-chat-template.cpp
2025-01-14 10:16:41 +00:00
Georgi Gerganov 44d1e796d0 sync : ggml 2025-01-14 10:39:42 +02:00
Georgi Gerganov a4f3f5d8e6 scripts : sync gguf (cont) 2025-01-14 09:40:52 +02:00
Georgi Gerganov 48e1ae0e61 scripts : sync gguf 2025-01-14 09:36:58 +02:00
Georgi Gerganov d00a80e89d scripts : sync opencl 2025-01-14 09:19:58 +02:00
ebraminio 504af20ee4 server : (UI) Improve messages bubble shape in RTL (#11220)
I simply have overlooked message bubble's tail placement for RTL
text as I use the dark mode and that isn't visible there and this
fixes it.
2025-01-13 20:23:31 +01:00
Xuan Son Nguyen 84a44815f7 cli : auto activate conversation mode if chat template is available (#11214)
* cli : auto activate conversation mode if chat template is detected

* add warn on bad template

* update readme (writing with the help of chatgpt)

* update readme (2)

* do not activate -cnv for non-instruct models
2025-01-13 20:18:12 +01:00
Andreas Kieslinger 39509fb082 cuda : CUDA Graph Compute Function Refactor (precursor for performance improvements) (#11042)
* Refactor: Moves cuda graph executable update step to separate function.

* Refactor: Moves cuda graph update check to separate function.

* Refactor: Moves cuda graph maintenance (update or adjusting copy parameters) to separate function for improved readability.

* Fix: Adds missing reference to maintain_cuda_graph() definition.

* Refactor: Improves structure and abstractions by moving CUDA graph evaluation and capture to its own function.

* Refactor: Moves node graph checks and copy ops into individual function for improved readability.

* Refactor: Removes code permanently excluded from compilation to increase readability.

* Style: Adds missing newline

* Style: Consolidates several neighboring '#ifdef USE_CUDA_GRAPH' into a single one

* Refactor: Makes 'cuda_graph_update_required' a local variable

* remove double lines between functions

---------

Co-authored-by: slaren <slarengh@gmail.com>
2025-01-13 16:45:53 +01:00
Georgi Gerganov a29f0870d4 contrib : add naming guidelines (cont) (#11177) 2025-01-13 15:59:26 +02:00
ebraminio 437e05f714 server : (UI) Support for RTL text as models input or output (#11208) 2025-01-13 14:46:39 +01:00
Georgi Gerganov ca001f6656 contrib : add naming guidelines (cont) (#11177) 2025-01-13 15:08:44 +02:00
Xuan Son Nguyen 00b4c3da62 common : support tag-based --hf-repo like on ollama (#11195)
* common : support tag-based hf_repo like on ollama

* fix build

* various fixes

* small fixes

* fix style

* fix windows build?

* move common_get_hf_file to common.cpp

* fix complain with noreturn
2025-01-13 13:56:23 +01:00
Georgi Gerganov 7426a26b24 contrib : add naming guidelines (#11177)
* contrib : add naming guidelines

* contrib : expand naming guidelines [no ci]

* contrib : cont [no ci]

* contrib : add `_t` suffix guideline [no ci]

* contrib : cont [no ci]

* minor [no ci]

* contrib : move coding guidelines to correct section [no ci]

* contrib : minor reword coding guidelines [no ci]

* contrib : add TODO for preprocessor directives [no ci]

* contrib : expand [no ci]

* minor [no ci]

* contrib : clarify `_context` suffix usage [no ci]

* contrib : filename guidelines [no ci]

* contrib : fix notes [no ci]
2025-01-13 14:46:36 +02:00
Daniel Bevenius 8f70fc3d1b llama : remove 'd' from bad special token log (#11212)
This commit removes the 'd' from the log message in llama-vocab.cpp
when logging a bad special token.

The motivation for this is that currently the output can look something
like the following:
```console
load: bad special token:
    'tokenizer.ggml.image_token_id' = 128256d, using default id -1
```
2025-01-13 13:38:20 +01:00
Radoslav Gerganov 1244cdcf14 ggml : do not define GGML_USE_CUDA when building with GGML_BACKEND_DL (#11211)
Build fails when using HIP and GGML_BACKEND_DL:
```
/usr/bin/ld: ../ggml/src/libggml.so: undefined reference to `ggml_backend_cuda_reg'
collect2: error: ld returned 1 exit status
```
This patch fixes this.
2025-01-13 13:31:41 +02:00
Eric Curtin 924518e2e5 Reset color before we exit (#11205)
We don't want colors to leak post termination of llama-run.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-01-12 18:23:10 +00:00
Xuan Son Nguyen 9a483999a6 llama : fix chat template gguf key (#11201) 2025-01-12 13:45:14 +01:00
Georgi Gerganov 08f10f69c3 llama : remove notion of CLS token (#11064)
ggml-ci
2025-01-12 12:15:53 +02:00
Georgi Gerganov afa8a9ec9b llama : add llama_vocab, functions -> methods, naming (#11110)
* llama : functions -> methods (#11110)

* llama : add struct llama_vocab to the API (#11156)

ggml-ci

* hparams : move vocab params to llama_vocab (#11159)

ggml-ci

* vocab : more pimpl (#11165)

ggml-ci

* vocab : minor tokenization optimizations (#11160)

ggml-ci

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* lora : update API names (#11167)

ggml-ci

* llama : update API names to use correct prefix (#11174)

* llama : update API names to use correct prefix

ggml-ci

* cont

ggml-ci

* cont

ggml-ci

* minor [no ci]

* vocab : llama_vocab_add_[be]os -> llama_vocab_get_add_[be]os (#11174)

ggml-ci

* vocab : llama_vocab_n_vocab -> llama_vocab_n_tokens (#11174)

ggml-ci

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-01-12 11:32:42 +02:00
Vinesh Janarthanan c05e8c9934 gguf-py: fixed local detection of gguf package (#11180)
* updated path to gguf package for non-installed setups

* added reader.py to readme

* Bumped gguf version to 0.15.0
2025-01-11 11:42:31 +02:00
Daniel Bevenius 2739a71e4b convert : sort print supported models [no ci] (#11179)
This commit sorts the list of supported models when printing them out.

The motivation for this change is to make it easier to find a specific
model in the list of supported models. For example:
```console
$ ./convert_hf_to_gguf.py --print-supported-models
Supported models:
- ArcticForCausalLM
- BaiChuanForCausalLM
- BaichuanForCausalLM
- BertForMaskedLM
- BertModel
- BitnetForCausalLM
- BloomForCausalLM
- BloomModel
- CamembertModel
- ChameleonForCausalLM
- ChameleonForConditionalGeneration
- ChatGLMForConditionalGeneration
- ChatGLMModel
- CodeShellForCausalLM
- Cohere2ForCausalLM
- CohereForCausalLM
- DbrxForCausalLM
- DeciLMForCausalLM
- DeepseekForCausalLM
- DeepseekV2ForCausalLM
- DeepseekV3ForCausalLM
- ExaoneForCausalLM
- FalconForCausalLM
- FalconMambaForCausalLM
- GPT2LMHeadModel
- GPTBigCodeForCausalLM
- GPTNeoXForCausalLM
- GPTRefactForCausalLM
- Gemma2ForCausalLM
- GemmaForCausalLM
- GraniteForCausalLM
- GraniteMoeForCausalLM
- GrokForCausalLM
- InternLM2ForCausalLM
- JAISLMHeadModel
- JinaBertForMaskedLM
- JinaBertModel
- LLaMAForCausalLM
- LlamaForCausalLM
- LlavaStableLMEpochForCausalLM
- MPTForCausalLM
- MT5ForConditionalGeneration
- MambaForCausalLM
- MambaLMHeadModel
- MiniCPM3ForCausalLM
- MiniCPMForCausalLM
- MistralForCausalLM
- MixtralForCausalLM
- NemotronForCausalLM
- NomicBertModel
- OLMoForCausalLM
- Olmo2ForCausalLM
- OlmoForCausalLM
- OlmoeForCausalLM
- OpenELMForCausalLM
- OrionForCausalLM
- Phi3ForCausalLM
- PhiForCausalLM
- PhiMoEForCausalLM
- PlamoForCausalLM
- QWenLMHeadModel
- Qwen2ForCausalLM
- Qwen2MoeForCausalLM
- Qwen2VLForConditionalGeneration
- RWForCausalLM
- RWKV6Qwen2ForCausalLM
- RobertaModel
- Rwkv6ForCausalLM
- StableLMEpochForCausalLM
- StableLmForCausalLM
- Starcoder2ForCausalLM
- T5EncoderModel
- T5ForConditionalGeneration
- T5WithLMHeadModel
- UMT5ForConditionalGeneration
- WavTokenizerDec
- XLMRobertaForSequenceClassification
- XLMRobertaModel
- XverseForCausalLM
```
2025-01-11 05:50:33 +01:00
Daniel Bevenius ba8a1f9c5b examples : add README.md to tts example [no ci] (#11155)
* examples : add README.md to tts example [no ci]

* squash! examples : add README.md to tts example [no ci]

Fix heading to be consistent with other examples, and add a quickstart
section to README.md.

* squash! examples : add README.md to tts example [no ci]

Fix spelling mistake.
2025-01-10 13:16:16 +01:00
Daniel Bevenius ff3fcabc72 convert : add --print-supported-models option (#11172)
* convert : add --print-supported-models option

This commit adds a new option to the convert_hf_to_gguf.py script to
print the supported models.

The motivation for this is that it can be useful to know which models
are supported by the script without having to look at the code.

Example usage:
```console
$ ./convert_hf_to_gguf.py --print-supported-models
Supported models:
- GPTNeoXForCausalLM
- BloomForCausalLM
- BloomModel
- MPTForCausalLM
- OrionForCausalLM
- BaichuanForCausalLM
- BaiChuanForCausalLM
- XverseForCausalLM
- FalconForCausalLM
- RWForCausalLM
- GPTBigCodeForCausalLM
- GPTRefactForCausalLM
- StableLmForCausalLM
- StableLMEpochForCausalLM
- LlavaStableLMEpochForCausalLM
- LLaMAForCausalLM
- LlamaForCausalLM
- MistralForCausalLM
- MixtralForCausalLM
- DeciLMForCausalLM
- BitnetForCausalLM
- GrokForCausalLM
- DbrxForCausalLM
- MiniCPMForCausalLM
- MiniCPM3ForCausalLM
- QWenLMHeadModel
- Qwen2ForCausalLM
- Qwen2VLForConditionalGeneration
- WavTokenizerDec
- Qwen2MoeForCausalLM
- GPT2LMHeadModel
- PhiForCausalLM
- Phi3ForCausalLM
- PhiMoEForCausalLM
- PlamoForCausalLM
- CodeShellForCausalLM
- InternLM2ForCausalLM
- BertModel
- BertForMaskedLM
- CamembertModel
- RobertaModel
- NomicBertModel
- XLMRobertaModel
- XLMRobertaForSequenceClassification
- GemmaForCausalLM
- Gemma2ForCausalLM
- Starcoder2ForCausalLM
- Rwkv6ForCausalLM
- RWKV6Qwen2ForCausalLM
- MambaForCausalLM
- MambaLMHeadModel
- FalconMambaForCausalLM
- CohereForCausalLM
- Cohere2ForCausalLM
- OLMoForCausalLM
- OlmoForCausalLM
- Olmo2ForCausalLM
- OlmoeForCausalLM
- JinaBertModel
- JinaBertForMaskedLM
- OpenELMForCausalLM
- ArcticForCausalLM
- DeepseekForCausalLM
- DeepseekV3ForCausalLM
- DeepseekV2ForCausalLM
- UMT5ForConditionalGeneration
- MT5ForConditionalGeneration
- T5ForConditionalGeneration
- T5WithLMHeadModel
- T5EncoderModel
- JAISLMHeadModel
- ChatGLMModel
- ChatGLMForConditionalGeneration
- NemotronForCausalLM
- ExaoneForCausalLM
- GraniteForCausalLM
- GraniteMoeForCausalLM
- ChameleonForCausalLM
- ChameleonForConditionalGeneration
```

* squash! convert : add --print-supported-models option

Fix flake8 error.
2025-01-10 11:30:53 +01:00
0cc4m c3f9d25706 Vulkan: Fix float16 use on devices without float16 support + fix subgroup_size_control validation error (#11161)
* Vulkan: Remove float16 use in shaders

* Fix validation error about subgroup_size_control extension
2025-01-10 06:39:33 +01:00
Molly Sophia ee7136c6d1 llama: add support for QRWKV6 model architecture (#11001)
llama: add support for QRWKV6 model architecture (#11001)

* WIP: Add support for RWKV6Qwen2

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* RWKV: Some graph simplification

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Add support for RWKV6Qwen2 with cpu and cuda GLA

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* RWKV6[QWEN2]: Concat lerp weights together to reduce cpu overhead

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Fix some typos

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* code format changes

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Fix wkv test & add gla test

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Fix cuda warning

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Update README.md

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Update ggml/src/ggml-cuda/gla.cu

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Fix fused lerp weights loading with RWKV6

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* better sanity check skipping for QRWKV6 in llama-quant

thanks @compilade

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: compilade <git@compilade.net>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: compilade <git@compilade.net>
2025-01-10 09:58:08 +08:00
Akarshan Biswas c6860cc734 SYCL: Refactor ggml_sycl_compute_forward (#11121)
* SYCL: refactor ggml_sycl_compute_forward

* SYCL: add back GGML_USED(dst) to ggml_sycl_cpy

* SYCL: add function name to noop debug

* SYCL: Some device info print refactoring and add details of XMX availability
2025-01-10 08:13:03 +08:00
Tei Home 1204f97270 doc: add cuda guide for fedora (#11135)
Since NVIDIA does not release CUDA for in-maintenance versions of Fedora, the process of setting up the CUDA toolkit on Fedora has become quite involved. This guide should help mere mortals install CUDA for development in a Fedora 39 toolbox environment, without affecting the host system.
2025-01-09 11:32:06 +00:00
Daniel Bevenius 8eceb888d7 server : add tooltips to settings and themes btn (#11154)
* server : add tooltips to settings and themes btn

This commit adds tooltips to the settings and themes buttons in the
webui. The tooltip will be displayed below the actual buttons when
hovered over.

The motivation for this change is to clarify the purpose of the themes
button.

* squash! server : add tooltips to settings and themes btn

This commit adds a tooltip to the '...' button when a chat has been
started. The tooltip is "Chat options" which think could be a good
description as the dropdown contains options to delete or download the
current chat.

* rm tooltip for 3 dots button

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-01-09 11:28:29 +01:00
Pierrick Hymbert f8feb4b01a model: Add support for PhiMoE arch (#11003)
* model: support phimoe

* python linter

* doc: minor

Co-authored-by: ThiloteE <73715071+ThiloteE@users.noreply.github.com>

* doc: minor

Co-authored-by: ThiloteE <73715071+ThiloteE@users.noreply.github.com>

* doc: add phimoe as supported model

ggml-ci

---------

Co-authored-by: ThiloteE <73715071+ThiloteE@users.noreply.github.com>
2025-01-09 11:21:41 +01:00
Georgi Gerganov be0e950c91 media : remove old img [no ci] 2025-01-09 11:15:15 +02:00
Xuan Son Nguyen d9feae1c06 llama-chat : add phi 4 template (#11148) 2025-01-09 10:07:33 +01:00
hydai 8d59d91171 fix: add missing msg in static_assert (#11143)
Signed-off-by: hydai <z54981220@gmail.com>
2025-01-08 20:03:28 +00:00
Vinesh Janarthanan 8a1d9c25fa gguf-py : move scripts directory (#11116)
* Moved scripts dir and fixed pyproject.toml

* updated readme

* fixed README urls

* bump pypi gguf to v0.14.0

* retrigger ci

* empty commit - trigger ci
2025-01-08 20:54:58 +02:00
Eric Curtin 1bf839b1e8 Enhance user input handling for llama-run (#11138)
The main motivation for this change is it was not handing
ctrl-c/ctrl-d correctly. Modify `read_user_input` to handle EOF,
"/bye" command, and empty input cases. Introduce `get_user_input`
function to manage user input loop and handle different return
cases.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-01-08 18:47:05 +00:00
Xuan Son Nguyen f7cd13301c ci : use actions from ggml-org (#11140) 2025-01-08 16:09:20 +01:00
Xuan Son Nguyen 4d2b3d8804 lora : improve compat with mergekit-extract-lora (#11131)
* (wip) support mergekit-extracted lora

* support mergekit-extract-lora

* use lora->get_scale

* correct comment

* correct norm name & condition

* add some hints
2025-01-08 15:59:53 +01:00
Georgi Gerganov c07d437bbd llama : avoid hardcoded QK_K (#11061)
ggml-ci
2025-01-08 16:19:36 +02:00
Georgi Gerganov 99a3755a3c sync : ggml 2025-01-08 13:40:30 +02:00
Radoslav Gerganov c792dcf488 ggml : allow loading backend with env variable (ggml/1059)
ref: #1058
2025-01-08 13:40:18 +02:00
Xuan Son Nguyen 80ccf5d725 ci : pin dependency to specific version (#11137)
* ci : pin dependency to specific version

* will this fix ec?
2025-01-08 12:07:20 +01:00
Georgi Gerganov a3c1232c3f arg : option to exclude arguments from specific examples (#11136)
* arg : option to exclude arguments from specific examples

ggml-ci

* readme : remove old args [no ci]
2025-01-08 12:55:36 +02:00
amritahs-ibm 8cef75c743 llamafile : ppc64le MMA INT8 implementation (#10912)
This change upstreams llamafile's cpu matrix
multiplication kernels for ppc64le using MMA
builtins for quantised int8 datatype.

This change results in 10% - 70% improvement
in total speed(ie all tokens/total time), across
various batch sizes.

The patch is tested with Meta-Lllama-3-8B,
Mistral-7B, Llama-2-7B-chat-hf models on a
IBM POWER10 machine.

Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
2025-01-08 12:54:19 +02:00
Georgi Gerganov 0d52a69e4b ci : fix cmake option (#11125) 2025-01-08 11:29:34 +02:00
Mathieu Baudier 02f0430141 Disable GL_KHR_cooperative_matrix Vulkan extension if not available. (#11117)
* Disable GL_KHR_cooperative_matrix Vulkan extension if not available.

* Perform Vulkan extensions checks in a more sensible order

* Remove unnecessary #ifdef directive
2025-01-08 09:18:13 +01:00
ag2s20150909 bec2183f2c fix: Vulkan shader gen binary path when Cross-compiling (#11096)
* fix: Vulkan shader gen binary path when cross compiling
2025-01-08 09:17:29 +01:00
Johannes Gäßler 53ff6b9b9f GGUF: C++ refactor, backend support, misc fixes (#11030)
* GGUF: C++ refactor, backend support, misc fixes

remove ggml_tensor.backend

update CODEOWNERS [no ci]

remove gguf_get_data from API

revise GGUF API data types
2025-01-07 18:01:58 +01:00
Diego Devesa 017cc5f446 ggml-backend : only offload from host buffers (fix) (#11124) 2025-01-07 16:11:57 +01:00
Diego Devesa a3d50bc022 ggml-backend : only offload from host buffers (#11120) 2025-01-07 12:38:05 +01:00
Radoslav Gerganov a4dd490069 rpc : code cleanup (#11107)
Remove duplicated macros, use GGML_LOG_ERROR for errors
2025-01-07 08:37:02 +02:00
Akarshan Biswas c0d6f790d0 SYCL: Use get_multi_ptr instead of deprecated get_pointer in wkv6 (#11087)
* SYCL: Use get_multi_ptr instead of deprecated get_pointer in wkv6

* Revert "SYCL: Use get_multi_ptr instead of deprecated get_pointer in wkv6"

This reverts commit f62dc45f31.

* Reland: Use get_multi_ptr instead of deprecated get_pointer in wkv6
2025-01-07 14:26:07 +08:00
Eric Curtin dc7cef9f37 llama-run : fix context size (#11094)
Set `n_ctx` equal to `n_batch` in `Opt` class. Now context size is
a more reasonable 2048.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-01-06 23:45:28 +01:00
Georgi Gerganov ecebbd292d llama : remove unused headers (#11109)
ggml-ci
2025-01-06 17:52:35 +02:00
Xuan Son Nguyen 96be8c3264 github : add cmd line field to bug report (#11090)
* github : cmd line to bug report

* codeowners : (@ngxson) only watch dockerfile

* Apply suggestions from code review [no ci]

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* rm cmd in log output [no ci]

* rm 2 [no ci]

* no need backticks [no ci]

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-01-06 16:34:49 +01:00
Georgi Gerganov e6e7c75d94 server : fix extra BOS in infill endpoint (#11106)
* server : fix extra BOS in infill endpoing

ggml-ci

* server : update infill tests
2025-01-06 15:36:08 +02:00
Xuan Son Nguyen 09186fabbe llama : remove check flash_attn with lora (#11104) 2025-01-06 13:41:12 +01:00
Asghar Ghorbani 96a1dc27c3 llama : prevent system info string accumulation across calls (#11101) 2025-01-06 13:21:46 +02:00
Daniel Bevenius 6369f867a4 llama : rename missed batch params/vars to ubatch (#10059)
This commit renames the `batch` parameter to `ubatch` in the
`llama_kv_cache_find_slot`, `llm_build_inp_embd`, and
`llm_build_mamba` functions.

The motivation for this is that this should have been done as part of
Commit 19d900a756 ("llama : rename batch
to ubatch (#9950)") but for some reason I missed these functions in
that commit and only noticed them now (sorry).
2025-01-06 11:28:17 +02:00
Georgi Gerganov 47182dd03f llama : update llama_model API names (#11063)
* llama : deprecate llama_free_model, add llama_model_free

ggml-ci

* llama : change `llama_load_model_from_file` -> `llama_model_load_from_file`

ggml-ci
2025-01-06 10:55:18 +02:00
Georgi Gerganov 3e6e7a6bc2 tokenize : escape the prompt (#11058)
* tokenize : escape the prompt

* tokenize : update help
2025-01-06 10:54:25 +02:00
Georgi Gerganov ae2f606bb5 mmap : fix fileno macro clash (#11076)
* mmap : fix fileno macro clash

ggml-ci

* cont

ggml-ci
2025-01-06 10:52:38 +02:00
Georgi Gerganov 727368c60f llama : use LLAMA_TOKEN_NULL (#11062)
ggml-ci
2025-01-06 10:52:15 +02:00
Georgi Gerganov 5047dd3546 llama : use _impl suffix instead of _internal (#11060)
ggml-ci
2025-01-06 10:52:01 +02:00
Johannes Gäßler 46e3556e01 CUDA: add BF16 support (#11093)
* CUDA: add BF16 support
2025-01-06 02:33:52 +01:00
0cc4m b56f079e28 Vulkan: Add device-specific blacklist for coopmat for the AMD proprietary driver (#11074)
* Vulkan: Add device-specific blacklist for coopmat for the AMD proprietary driver

* Add (TM) to AMD name check
2025-01-04 21:09:59 +01:00
fairydreaming 9394bbd484 llama : Add support for DeepSeek V3 (#11049)
* convert : extend DEEPSEEK2 model architecture to support DeepseekV3ForCausalLM by adding EXPERT_WEIGHTS_NORM and EXPERT_GATING_FUNC model parameters and FFN_EXP_PROBS_B tensor type

* vocab : add DeepSeek V3 pre-tokenizer regexes

* unicode : handle ACCENT_MARK and SYMBOL categories in regex

* llama : add DeepSeek V3 chat template, handle new model parameters and tensor types

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-01-04 21:06:11 +01:00
matt23654 f922a9c542 [GGML][RPC] Support for models with non-512-aligned tensors over RPC. (#11047)
* Added init tensor calling code

* Added get_alloc_size forwarding

* Cleaned up and improved type/error handling.

* fix: remove trailing whitespaces.

* Cleanup and use GGML error logging functions.

* Handle potentially dangerous edge cases.

* Apply suggestions from code review

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-01-04 17:10:30 +01:00
DAN™ 46be942214 llama : add support for the cohere2 model architecture (#10900) 2025-01-04 16:33:31 +02:00
Georgi Gerganov 78c6785175 sync : ggml 2025-01-04 16:09:53 +02:00
Georgi Gerganov 5e3b08d606 ggml : do not install metal source when embed library (ggml/1054) 2025-01-04 16:09:53 +02:00
Daniel Bevenius db68c93b57 ggml : improve inputs log sched_print_assignments (ggml/1053)
This commit attempts to improve the log message for the inputs of the
splits in the sched_print_assignments function.

The motivation for this change is that currently even if there are no
inputs a colon is displayed at the end of the line, which can make it a
little confusing when reading the output as it could be interpreted as
the line below are inputs when they are in fact nodes. With this change
the colon will only be printed if there actually are inputs.
2025-01-04 16:09:53 +02:00
Gilad S. c31fc8b966 fix: Vulkan shader gen binary path (#11037) 2025-01-04 09:17:31 +01:00
Molly Sophia 4b0c638b9a common : disable KV cache shifting automatically for unsupported models (#11053)
* Disable KV cache shifting automatically for unsupported models

instead of exiting directly

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Update common/common.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-01-03 14:13:18 +02:00
Georgi Gerganov e7da954ecc metal : avoid uint (#11019) 2025-01-03 11:26:14 +02:00
Georgi Gerganov f66f582927 llama : refactor src/llama.cpp (#10902)
* llama : scatter llama.cpp into multiple modules (wip)

* llama : control-vector -> adapter

* llama : arch

* llama : mmap

ggml-ci

* ci : remove BUILD_SHARED_LIBS=OFF

ggml-ci

* llama : arch (cont)

ggml-ci

* llama : chat

ggml-ci

* llama : model

ggml-ci

* llama : hparams

ggml-ci

* llama : adapter

ggml-ci

* examples : fix

ggml-ci

* rebase

ggml-ci

* minor

* llama : kv cache

ggml-ci

* llama : impl

ggml-ci

* llama : batch

ggml-ci

* cont

ggml-ci

* llama : context

ggml-ci

* minor

* llama : context (cont)

ggml-ci

* llama : model loader

ggml-ci

* common : update lora

ggml-ci

* llama : quant

ggml-ci

* llama : quant (cont)

ggml-ci

* minor [no ci]
2025-01-03 10:18:53 +02:00
Pierrick Hymbert 2f0ee84b9b server: bench: minor fixes (#10765)
* server/bench:
- support openAI streaming standard output with [DONE]\n\n
- export k6 raw results in csv
- fix too many tcp idle connection in tcp_wait
- add metric time to emit first token

* server/bench:
- fix when prometheus not started
- wait for server to be ready before starting bench
2025-01-02 18:06:12 +01:00
Xuan Son Nguyen 0da5d86026 server : allow using LoRA adapters per-request (#10994)
* slot.can_batch_with

* lora per request

* test: force disable cache prompt

* move can_batch_with check

* fix condition

* add slow test with llama 8b

* update docs

* move lora change task to queue

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* lora_base

* remove redundant check

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-01-02 15:05:18 +01:00
Benson Wong a45433ba20 readme : add llama-swap to infrastructure section (#11032)
* list llama-swap under tools in README

* readme: add llama-swap to Infrastructure
2025-01-02 09:14:54 +02:00
Srihari-mcw 0827b2c1da ggml : fixes for AVXVNNI instruction set with MSVC and Clang (#11027)
* Fixes for clang AVX VNNI

* enable AVX VNNI and alder lake build for MSVC

* Apply suggestions from code review

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-12-31 15:23:33 +01:00
Xuan Son Nguyen 45095a61bf server : clean up built-in template detection (#11026)
* server : clean up built-in template detection

* fix compilation

* add chat template test

* fix condition
2024-12-31 15:22:01 +01:00
Xuan Son Nguyen 5896c65232 server : add OAI compat for /v1/completions (#10974)
* server : add OAI compat for /v1/completions

* add test

* add docs

* better docs
2024-12-31 12:34:13 +01:00
ymcki bc7b1f8632 convert : fix Llama-3_1-Nemotron-51B rope settings (#11008)
* conflict resolution

* move comments after bracket to its own line

* DeciLMCausalModel now reads rope_theta from config.json properly
2024-12-31 13:04:48 +02:00
Peter 6e1531aca5 common, examples, ggml : fix MSYS2 GCC compiler errors and warnings when building with LLAMA_CURL=ON and GGML_OPENCL=ON (#11013)
In common/common.cpp:
* Convert usage of stat() function call to check if file exists to standard library function std::filesystem::exists (error unable to match to correct function signature)
* Additional conditions to check if PATH_MAX is already defined in WIN32 environment (warning it is already defined in MSYS2)

In examples/run/run.cpp:
* Add io.h header inclusion (error cannot find function _get_osfhandle)
* Change initialisers for OVERLAPPED to empty struct (warning about uninitialised members)
* Add initialiser for hFile (warning it may be uninitialised)
* Add cast for curl_off_t percentage value to long int in generate_progress_prefix function (warning that curl_off_t is long long int)

In ggml/src/ggml-opencl/ggml-opencl.cpp:
* Initialise certain declared cl_mem variables to nullptr for greater safety (warning about B_d variable possibly used unassigned)
2024-12-31 01:46:06 +01:00
Jeff Bolz 716bd6dec3 vulkan: optimize mul_mat for small values of N (#10991)
Make the mul_mat_vec shaders support N>1 (as a spec constant, NUM_COLS) where
the batch_strides are overloaded to hold the row strides. Put the loads from the
B matrix in the innermost loop because it should cache better.

Share some code for reducing the result values to memory in mul_mat_vec_base.
2024-12-30 18:27:11 +01:00
ag2s20150909 c250ecb315 android : fix llama_batch free (#11014) 2024-12-30 14:35:13 +02:00
Jeff Bolz a813badbbd vulkan: im2col and matmul optimizations for stable diffusion (#10942)
* tests: Add im2col perf tests

* vulkan: optimize im2col, more elements per thread

* vulkan: increase small tile size for NV_coopmat2

* vulkan: change im2col to 512 elements per workgroup
2024-12-29 10:16:34 +01:00
Jeff Bolz fdd2188912 vulkan: Use push constant offset to handle misaligned descriptors (#10987) 2024-12-29 09:35:11 +01:00
Isaac McFadyen f865ea149d server: added more docs for response_fields field (#10995) 2024-12-28 16:09:19 +01:00
Alexey Parfenov 16cdce7b68 server : fix token duplication when streaming with stop strings (#10997) 2024-12-28 16:08:54 +01:00
Eve d79d8f39b4 vulkan: multi-row k quants (#10846)
* multi row k quant shaders!

* better row selection

* more row choices

* readjust row selection

* rm_kq=2 by default
2024-12-26 16:54:44 +01:00
Peter d283d02bf2 examples, ggml : fix GCC compiler warnings (#10983)
Warning types fixed (observed under MSYS2 GCC 14.2.0):
* format '%ld' expects argument of type 'long int', but argument has type 'size_t'
* llama.cpp/ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp:81:46: warning: missing initializer for member '_STARTUPINFOA::lpDesktop' [-Wmissing-field-initializers]  (emitted for all struct field except first)
2024-12-26 14:59:11 +01:00
Reza Kakhki 9ba399dfa7 server : add support for "encoding_format": "base64" to the */embeddings endpoints (#10967)
* add support for base64

* fix base64 test

* improve test

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-12-24 21:33:04 +01:00
Djip007 2cd43f4900 ggml : more perfo with llamafile tinyblas on x86_64 (#10714)
* more perfo with llamafile tinyblas on x86_64.

- add bf16 suport
- change dispache strategie (thanks:
https://github.com/ikawrakow/ik_llama.cpp/pull/71 )
- reduce memory bandwidth

simple tinyblas dispache and more cache freindly

* tinyblas dynamic dispaching

* sgemm: add M blocs.

* - git 2.47 use short id of len 9.
- show-progress is not part of GNU Wget2

* remove not stable test
2024-12-24 18:54:49 +01:00
NeverLucky 09fe2e7613 server: allow filtering llama server response fields (#10940)
* llama_server_response_fields

* llama_server_response_fields_fix_issues

* params fixes

* fix

* clarify docs

* change to "response_fields"

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-12-24 17:39:49 +01:00
Georgi Gerganov 30caac3a68 llama : the WPM vocabs use the CLS token as BOS (#10930)
* llama : the WPM vocabs use the CLS token as BOS

ggml-ci

* llama : add comment
2024-12-24 09:44:20 +02:00
Diego Devesa 60cfa728e2 ggml : use wstring for backend search paths (#10960)
ggml-ci
2024-12-24 04:05:27 +01:00
Diego Devesa 3327bb0f8d ggml : fix arm enabled features check (#10961) 2024-12-24 04:05:17 +01:00
Diego Devesa 32d6ee6385 ggml : fix const usage in SSE path (#10962) 2024-12-23 20:25:52 +01:00
Xuan Son Nguyen 14b699ecde server : fix missing model id in /model endpoint (#10957)
* server : fix missing model id in /model endpoint

* fix ci
2024-12-23 12:52:25 +01:00
Xuan Son Nguyen 485dc01214 server : add system_fingerprint to chat/completion (#10917)
* server : add system_fingerprint to chat/completion

* update README
2024-12-23 12:02:44 +01:00
Radoslav Gerganov 86bf31cfe6 rpc-server : add support for the SYCL backend (#10934) 2024-12-23 10:39:30 +02:00
Yun Dou b92a14a841 llama : support InfiniAI Megrez 3b (#10893)
* Support InfiniAI Megrez 3b

* Fix tokenizer_clean_spaces for megrez
2024-12-23 01:35:44 +01:00
ymcki 6f0c9e034b llama : support for Llama-3_1-Nemotron-51B (#10669)
* conflict resolution

* move comments after bracket to its own line
2024-12-23 01:22:33 +01:00
Eric Curtin dab76c92cc llama-run : include temperature option (#10899)
This commit updates the `examples/run/README.md` file to include a new
option for setting the temperature and updates the `run.cpp` file to
parse this option.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2024-12-23 01:21:40 +01:00
yuri@FreeBSD 7024d59e6a ggml : fix run-time on FreeBSD in get_executable_path() (#10948) 2024-12-23 01:20:11 +01:00
Rudi Servo 7c0e285858 devops : add docker-multi-stage builds (#10832) 2024-12-22 23:22:58 +01:00
Billel Mokeddem 7ae33a616f llama : add Falcon3 support (#10883)
* Add Falcon3 model support

* Add fix for adding bos to added special tokens

* Add comment explaining the logic behind the if statement

* Add a log message to better track the when the following line of code is triggered

* Update log to only print when input and output characters are different

* Fix handling pre-normalized tokens

* Refactoring
2024-12-23 00:09:58 +02:00
Jeff Bolz ebdee9478c vulkan: build fixes for 32b (#10927)
* vulkan: build fixes for 32b

Should fix #10923

* vulkan: initialize some buffer/offset variables
2024-12-22 10:44:01 +01:00
Georgi Gerganov 5cd85b5e00 convert : add BertForMaskedLM (#10919) 2024-12-21 10:10:18 +02:00
Jeff Bolz a91a41364b vulkan: optimize coopmat2 dequant functions (#10855)
Change the code to do 16b loads when possible and extract the appropriate
component late, so the code is effectively decoding a pair of elements and
then selecting one. This can allow more commoning to happen in the compiler
when neighboring elements are loaded.
2024-12-21 08:04:45 +01:00
Adrien Gallouët e34c5af43f ggml-cpu: replace NEON asm with intrinsics in ggml_gemv_q4_0_4x8_q8_0() (#10874)
* ggml-cpu: replace NEON asm with intrinsics in ggml_gemv_q4_0_4x8_q8_0()

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* ggml-cpu: format code

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2024-12-21 00:33:37 +01:00
Akarshan Biswas eb5c3dc64b SYCL: Migrate away from deprecated ggml_tensor->backend (#10840)
* Migrate to tensor->buffer for checking backend buffer type: 1

* SYCL: common.cpp try to migrate away from tensor->backend

* SYCL: fix assertions and add proper comments

* SYCL: remove extra space

* SYCL: Add back static to ggml_backend_buffer_is_sycl_split function

* SYCL: Add pragma directive to suppress warning spam

* SYCL: Integrate debug logs with GGML_LOG and other fixes

* Revert "SYCL: Integrate debug logs with GGML_LOG and other fixes"

This reverts commit 2607b7de0f.
Let's keep the current SYCL specific logging mechanism for now

* SYCL: Use GGML_SYCL_DEBUG after reverting

* SYCL: reg_get_proc_address func, update to the current func signature

* SYCL: Refactor SYCL buffer checks in ggml_sycl_cpy_tensor_2d
2024-12-20 23:31:28 +08:00
Xuan Son Nguyen 0ca416c91a server : (UI) fix copy to clipboard function (#10916) 2024-12-20 14:12:06 +01:00
Diego Devesa 21ae3b9be8 ggml : add test for SVE and disable when it fails (#10906) 2024-12-20 13:31:28 +01:00
Molly Sophia 0a11f8b7b5 convert : fix RWKV v6 model conversion (#10913)
* Enable --no-context-shift for llama-perplexity example

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* RWKV 6: Fix error in ggml_cuda_op_bin_bcast

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2024-12-20 11:44:58 +02:00
Georgi Gerganov d408bb9268 clip : disable GPU support (#10896)
ggml-ci
2024-12-19 18:47:15 +02:00
Georgi Gerganov 5cab3e4aaa llama : minor grammar refactor (#10897)
ggml-ci
2024-12-19 17:42:13 +02:00
Georgi Gerganov 36319dec5d tts : small QoL for easy model fetch (#10903) 2024-12-19 17:35:15 +02:00
Xuan Son Nguyen 57bb2c40cd server : fix logprobs, make it OAI-compatible (#10783)
* server : fix logprobs, make it openai-compatible

* update docs

* add std::log

* return pre-sampling p

* sort before apply softmax

* add comment

* fix test

* set p for sampled token

* update docs

* add --multi-token-probs

* update docs

* add `post_sampling_probs` option

* update docs [no ci]

* remove --multi-token-probs

* "top_probs" with "post_sampling_probs"

* resolve review comments

* rename struct token_prob to prob_info

* correct comment placement

* fix setting prob for sampled token
2024-12-19 15:40:08 +01:00
Adrien Gallouët a3c33b1dce ggml: fix arm build with gcc (#10895)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2024-12-19 14:20:41 +01:00
Sukriti Sharma 2fffc52b50 llama : fix Roberta embeddings (#10856)
* fix: Use gpt2 tokenizer for roberta and add eos/bos tokens

Branch: RobertaTokenizer

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fixes to position embeddings

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* map roberta-bpe to gpt-2

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* fix linting

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
2024-12-19 15:04:51 +02:00
fairydreaming 7585edbdeb convert : Add support for Microsoft Phi-4 model (#10817)
* convert : use GPT2 vocab for Phi-4 model

* convert : use null value of sliding_window to distinguish Phi-4 from other PHI3-based models

* llama : do not use sliding window attention mask for Phi-4 model

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-12-19 10:37:12 +01:00
Johannes Gäßler cd920d0ac3 tests: disable GGUF test for bad value size (#10886) 2024-12-19 08:53:58 +01:00
Eric Curtin 7909e8588d llama-run : improve progress bar (#10821)
Set default width to whatever the terminal is. Also fixed a small bug around
default n_gpu_layers value.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2024-12-19 03:58:00 +01:00
Diego Devesa 9177484f58 ggml : fix arm build (#10890)
* ggml: GGML_NATIVE uses -mcpu=native on ARM

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* ggml: Show detected features with GGML_NATIVE

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* remove msvc support, add GGML_CPU_ARM_ARCH option

* disable llamafile in android example

* march -> mcpu, skip adding feature macros

ggml-ci

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Co-authored-by: Adrien Gallouët <angt@huggingface.co>
2024-12-18 23:21:42 +01:00
Georgi Gerganov 0bf2d10c55 tts : add OuteTTS support (#10784)
* server : add "tokens" output

ggml-ci

* server : output embeddings for all tokens when pooling = none

ggml-ci

* server : be explicit about the pooling type in the tests

ggml-ci

* server : do not normalize embeddings when there is no pooling

ggml-ci

* llama : add OuteTTS support (wip)

* wip

* extract features

* first conv

* group norm

* resnet conv

* resnet

* attn

* pos net

* layer norm

* convnext

* head

* hann window

* fix n_embd + remove llama.cpp hacks

* compute hann window

* fft

* spectrum processing

* clean-up

* tts : receive input text and generate codes

* clip : fix new conv name

* tts : minor fix

* tts : add header + minor fixes

ggml-ci

* tts : add matchematical constant

ggml-ci

* tts : fix sampling + cut initial noise

* tts : fixes

* tts : update default samplers

ggml-ci

* tts : text pre-processing

* tts : outetts-voc -> wavtokenizer-dec

* tts : remove hardcoded constants

ggml-ci

* tts : fix tensor shapes

* llama : refactor wavtokenizer tensors

ggml-ci

* cont

ggml-ci

* cont [no ci]

* llama : update WavTokenizer to non-causal attn

* llama : handle no-vocab detokenization

* tts : add Python example for OuteTTS (wip)

* tts : extend python example to generate spectrogram

ggml-ci

* server : fix rebase artifacts

* tts : enable "return_tokens" in Python example

ggml-ci

* tts : minor fixes

* common : support HF download for vocoder
2024-12-18 19:27:21 +02:00
Gaetan Bisson 7bbb5acf12 server: avoid overwriting Authorization header (#10878)
* server: avoid overwriting Authorization header

If no API key is set, leave the Authorization header as is. It may be
used by another part of the Web stack, such as an authenticating proxy.

Fixes https://github.com/ggerganov/llama.cpp/issues/10854

* rebuild

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-12-18 15:00:07 +01:00
Georgi Gerganov 152610eda9 server : output embeddings for all tokens when pooling = none (#10861)
* server : add "tokens" output

ggml-ci

* server : output embeddings for all tokens when pooling = none

ggml-ci

* server : update readme [no ci]

* server : fix spacing [no ci]

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* server : be explicit about the pooling type in the tests

ggml-ci

* server : update /embeddings and /v1/embeddings endpoints

ggml-ci

* server : do not normalize embeddings when there is no pooling

ggml-ci

* server : update readme

ggml-ci

* server : fixes

* tests : update server tests

ggml-ci

* server : update readme [no ci]

* server : remove rebase artifact

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-12-18 13:01:41 +02:00
Georgi Gerganov 0e70ba686e server : add "tokens" output (#10853)
* server : add "tokens" output

ggml-ci

* server : update readme

ggml-ci

* server : return tokens ids only if requested

ggml-ci

* tests : improve "tokens" type check

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* server : remove "tokens" from the OAI endpoint

ggml-ci

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-12-18 11:05:29 +02:00
Xuan Son Nguyen 46828872c3 server : (embeddings) using same format for "input" and "content" (#10872)
* server : (embeddings) using same format for "input" and "content"

* fix test case

* handle empty input case

* fix test
2024-12-18 10:55:09 +02:00
redbeard 6b064c92b4 docs: Fix HIP (née hipBLAS) in README (#10880)
Related to #10524 / be0e350c references to hipBLAS have been removed
across the repository.  This fixes the link from the repositories
`README.md`.

Signed-off-by: Brian 'redbeard' Harrington <redbeard@dead-city.org>
2024-12-18 10:35:00 +02:00
Diego Devesa 4da69d1abd Revert "llama : add Falcon3 support (#10864)" (#10876)
This reverts commit 382bc7f2e8.
2024-12-18 01:36:46 +01:00
DAN™ d62b532c52 Use model->gguf_kv for loading the template instead of using the C API. (#10868)
* Bump model_template to 16384 bytes to support larger chat templates.

* Use `model->gguf_kv` for efficiency.
2024-12-17 23:24:22 +01:00
Johannes Gäßler 081b29bd2a tests: add tests for GGUF (#10830) 2024-12-17 19:09:35 +01:00
Georgi Gerganov 5437d4aaf5 sync : ggml 2024-12-17 18:36:02 +02:00
Georgi Gerganov 78f766768d cmake : fix "amd64" processor string (whisper/2638) 2024-12-17 18:35:49 +02:00
gn64 8dd19a4812 vulkan : fix soft_max.comp division by zero (whisper/2633)
This change prevents a division by zero error when p.KY is 0.
2024-12-17 18:35:49 +02:00
Daniel Bevenius 130d0c90bd ggml : remove return from ggml_gallocr_allocate_node (ggml/1048)
This commit removes the return statement from ggml_gallocr_allocate_node
function.

The motivation behind this change is to make the code more readable and
consistent.
2024-12-17 18:35:49 +02:00
Daniel Bevenius 3919da8e33 ggml : add check for grad_accs (ggml/1046)
* ggml : add check for grad_accs

This commit adds a check for grad_accs in ggml_graph_get_grad and
ggml_graph_get_grad_acc functions. This is necessary to avoid segfaults
when grad_accs is not initialized.

The motivation for this change is that I find it nice to be able to
print out a computation graph using ggml_graph_print but this function
segfaults when grad_accs is not initialized:
```console
(gdb) p g1
$2 = (ggml_cgraph *) 0x7ffff66004b0
(gdb) p *g1
$3 = {size = 2048, n_nodes = 1, n_leafs = 2, nodes = 0x7ffff6600500,
grads = 0x0, grad_accs = 0x0, leafs = 0x7ffff6604500,
visited_hash_set = {size = 4099, used = 0x7ffff6610518,
keys = 0x7ffff6608500}, order = GGML_CGRAPH_EVAL_ORDER_LEFT_TO_RIGHT}
(gdb) p ggml_graph_print(g1)
=== GRAPH ===
n_nodes = 1

Program received signal SIGSEGV, Segmentation fault.
0x0000555555579775 in ggml_graph_get_grad
(cgraph=0x7ffff66004b0,node=0x7ffff6600340)
    at /ggml/ggml/src/ggml.c:5990
5990  return igrad != GGML_HASHSET_FULL &&
          ggml_bitset_get(cgraph->visited_hash_set.used, igrad) ?
          cgraph->grads[igrad] : NULL;
```

* squash! ggml : add check for grad_accs

Fix the check in ggml_graph_get_grad. The check was incorrectly using
cgraph->grad_accs instead of cgraph->grads.
2024-12-17 18:35:48 +02:00
Georgi Gerganov 0006f5a74a ggml : update ggml_backend_cpu_device_supports_op (#10867)
* ggml : fix cpy op for IQ-quants to use reference impl

ggml-ci

* ggml : disable tests involving i-matrix quantization

* ggml : update ggml_backend_cpu_device_supports_op

ggml-ci
2024-12-17 18:35:42 +02:00
krystiancha 05c3a444b8 server : fill usage info in embeddings and rerank responses (#10852)
* server : fill usage info in embeddings response

* server : fill usage info in reranking response
2024-12-17 18:00:24 +02:00
Billel Mokeddem 382bc7f2e8 llama : add Falcon3 support (#10864) 2024-12-17 17:24:56 +02:00
Ruan 4f51968aca readme : update typos (#10863) 2024-12-17 11:47:20 +02:00
Xuan Son Nguyen 227d7c5a7f server : (UI) fix missing async generator on safari (#10857)
* server : (UI) fix missing async generator on safari

* fix
2024-12-17 09:52:09 +01:00
Eve 7b1ec53f56 vulkan: bugfixes for small subgroup size systems + llvmpipe test (#10809)
* ensure mul mat shaders work on systems with subgroup size less than 32

more fixes

add test

* only s_warptile_mmq needs to be run with 32 threads or more
2024-12-17 06:52:55 +01:00
Zhiyuan Li 160bc039c8 rwkv6: add wkv6 support for Vulkan backend (#10829)
* rwkv_wkv6 vulkan shader

* RWKV_WKV6 Vulkan op tests passed

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Apply code format changes

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* add [[unroll]] and remove unnecessary conditions

* add uma support

* fix erros in EditorConfig Checker

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Molly Sophia <mollysophia379@gmail.com>
2024-12-16 22:00:46 +01:00
Georgi Gerganov 08ea539df2 unicode : improve naming style (#10838)
* unicode : improve naming style

ggml-ci

* cont [no ci]
2024-12-16 12:31:45 +02:00
Georgi Gerganov 644fd71b44 sampling : refactor + optimize penalties sampler (#10803)
* sampling : refactor + optimize penalties sampler

ggml-ci

* common : apply ignore_eos as logit bias

ggml-ci

* batched : remove penalties sampler

* params : allow penalty_last_n == -1 to be equal to context size

ggml-ci

* common : by default, move the penalties at the end of the sampling chain

ggml-ci

* common : ignore all EOG tokens

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* common : move back the penalties at the front of the sampling chain

ggml-ci

* readme : restore hint about --ignore-eos flag [no ci]

* llama : minor

ggml-ci

* webui : update

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2024-12-16 12:31:14 +02:00
Bartowski 4ddd199f6f llava : Allow locally downloaded models for QwenVL (#10833)
* Allow locally downloaded models for QwenVL

* Define model_path

* rm trailing space

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-12-15 21:43:25 +01:00
Valentin Mamedov a0974156f3 llama : add Deepseek MoE v1 & GigaChat models (#10827)
* Add deepseek v1 arch & gigachat template

* improve template code

* add readme

* delete comments

* remove comment

* fix format

* lint llama.cpp

* fix order of deepseek and deepseek2, move gigachat temlate to the end of func

* fix order of deepseek and deepseek2 in constants; mark shared exp as deepseek arch need

* remove comments

* move deepseek above deepseek2

* change placement of gigachat chat template
2024-12-15 19:02:46 +02:00
Georgi Gerganov 87cf323cef scripts : change build path to "build-bench" for compare-commits.sh (#10836) 2024-12-15 18:44:47 +02:00
Vinesh Janarthanan 5478bbcd17 server: (UI) add syntax highlighting and latex math rendering (#10808)
* add code highlighting and math formatting

* code cleanup

* build public/index.html

* rebuild public/index.html

* fixed coding style

* fixed coding style

* style fixes

* highlight: smaller bundle size, fix light & dark theme

* remove katex

* add bundle size check

* add more languages

* add php

* reuse some langs

* use gzip

* Revert "remove katex"

This reverts commit c0e5046acc.

* use better maintained @vscode/markdown-it-katex

* fix gzip non deterministic

* ability to add a demo conversation for dev

* fix latex rendering

* add comment

* latex codeblock as code

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-12-15 12:55:54 +01:00
Georgi Gerganov b5ae1ddff9 gguf-py : bump to v0.13.0 2024-12-15 13:16:42 +02:00
Michelle Tan 89d604f2c8 server: Fix has_next_line in JSON response (#10818)
* Update server JSON response.

* Add unit test to check `has_new_line` JSON response

* Remove `has_new_line` unit test changes.

* Address code review comment: type check for `has_new_line` in unit test
2024-12-14 23:29:45 +01:00
Evgeny Kurnevsky e52aba537a nix: allow to override rocm gpu targets (#10794)
This allows to reduce compile time when you are building for a single GPU.
2024-12-14 10:17:36 -08:00
HimariO ba1cb19cdd llama : add Qwen2VL support + multimodal RoPE (#10361)
* Barebone Qwen2VL LLM convertor

* Add Qwen2VL cli entrypoint

* [WIP] add qwen2vl arch

* Verify m-rope output

* Add vl-rope/2d-rope support for qwen2vl ViT

* update qwen2vl cli tool

* update 5D tensor op workaround

* [WIP] qwen2vl vision model

* make batch and clip utils compatible with qwen2vl

* [WIP] create inference workflow, gguf convert script but fix

* correcting vision-rope behavior, add the missing last layer back to ViT

* add arg parser to qwen2vl_surgery

* replace variable size array with vector

* cuda-gdb cmake preset

* add fp32 mrope, vision rope kernel

* add fp16 support for qwen2vl and m-rope

* add `GGML_ROPE_TYPE_MROPE`, `GGML_ROPE_TYPE_VISION`

* fix rope op mode switching, out dated func args

* update `llama_hparams`

* update to keep up stream changes

* resolve linter, test errors

* add makefile entry, update speical image padding token

* add mrope unit test, fix few compiler warnings

* rename `mrope` related function, params

* minor updates on debug util, bug fixs

* add `m-rope` testcase to `test-backend-ops`

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix traililng whitespce

* store `llama_hparams.rope_sections` with fixed size array

* update position id tensor size check in GGML_OP_ROPE

* minor updates

* update `ggml_backend_*_supports_op` of unsupported backends

* remote old `rope_section` compare operator

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-12-14 14:43:46 +02:00
cduk 56eea0781c Removes spurious \r in output that causes logging in journalctl to treat lines as binary and therefore hidden by default (#10771)
Signed-off-by: Charles Darke <s.cduk@toodevious.com>
Co-authored-by: Charles Darke <s.cduk@toodevious.com>
2024-12-13 23:21:49 +01:00
lhez a76c56fa1a Introducing experimental OpenCL backend with support for Qualcomm Adreno GPUs (#10693)
* [cl][adreno] Add Adreno GPU support

Add new OpenCL backend to support Adreno GPUs

---------

Co-authored-by: Skyler Szot <quic_sszot@quicinc.com>
Co-authored-by: Shangqing Gu <quic_shawngu@quicinc.com>
Co-authored-by: Alexander Angus <quic_aangus@quicinc.com>
Co-authored-by: Hongqiang Wang <quic_wangh@quicinc.com>
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>

* [cl][ci] Add workflow for CL

* [cl][adreno] Fix memory leak for non SMALL_ALLOC path

* opencl: integrate backend dyn.load interface and fix compiler and format warnings

* opencl: remove small-alloc support and fix build errors for non-opencl platforms

* opencl: fixed merge conflict (MUSA added twice in cmake)

* opencl-ci: use RUNNER_TEMP instead of github.workspace

* opencl: fix embed tool invocation with python3

* opencl: CI workflow fixes

* opencl: Clean up small-alloc in CMake files

* opencl: cleanup ggml-opencl2 header file

* opencl: use ulong for offsets and strides in ADD kernel

* opencl: use cl_ulong for all offsets

* opencl: use cl_ulong for sizes and strides

* opencl: use `GGML_LOG_xxx` instead of `fprintf(stderr, ...)`

* opencl: rename backend `opencl2` -> `opencl`

* opencl: rename kernel files `ggml-opencl2` -> `ggml-opencl`

* opencl: make OpenCL required, remove redundant lib and inc directories

* `ggml-base`, `..` and `.` are added by `ggml_add_backend_library`

* opencl: rename backend - funcs, structs, etc `opencl2` -> `opencl`

* opencl: remove copyright marker since main license already covers

* opencl: replace some more OPENCL2 leftovers

* opencl: remove limits on `tensor_extra`

* opencl: use pools for `tensor_extra`

* opencl: fix compiler warnings with GCC and Clang

Still getting the warning about clCreateCmdQueue being obsolete.
Will fix that separately.

* opencl: fail gracefully if opencl devices are not available

Also for unsupported GPUs.

* opencl: fix MSVC builds (string length error)

* opencl: check for various requirements, allow deprecated API

* opencl: update log message for unsupported GPUs

---------

Co-authored-by: Skyler Szot <quic_sszot@quicinc.com>
Co-authored-by: Shangqing Gu <quic_shawngu@quicinc.com>
Co-authored-by: Alexander Angus <quic_aangus@quicinc.com>
Co-authored-by: Hongqiang Wang <quic_wangh@quicinc.com>
Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
2024-12-13 12:23:52 -08:00
Eric Curtin c27ac678dd Opt class for positional argument handling (#10508)
Added support for positional arguments `model` and `prompt`. Added
functionality to download via strings like:

  llama-run llama3
  llama-run ollama://granite-code
  llama-run ollama://granite-code:8b
  llama-run hf://QuantFactory/SmolLM-135M-GGUF/SmolLM-135M.Q2_K.gguf
  llama-run huggingface://bartowski/SmolLM-1.7B-Instruct-v0.2-GGUF/SmolLM-1.7B-Instruct-v0.2-IQ3_M.gguf
  llama-run https://example.com/some-file1.gguf
  llama-run some-file2.gguf
  llama-run file://some-file3.gguf

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2024-12-13 19:34:25 +01:00
Corentin REGAL 11e07fd63b fix: graceful shutdown for Docker images (#10815) 2024-12-13 18:23:50 +01:00
Jett Janiak 4601a8bb67 gguf-py : numpy 2 newbyteorder fix (#9772) 2024-12-13 16:48:44 +02:00
谢乃闻 9f35e44592 Fix crash caused by ggml_backend_load_all when launching on Android Activity (#10812)
* Fix crash caused by ggml_backend_load_all when launching on AndroidActivity.

Details:
Calling ggml_backend_load_all during initialization in the AndroidActivity project leads to a crash with the error:
terminating with uncaught exception of type std::__ndk1::__fs::filesystem::filesystem_error: filesystem error: in directory_iterator::directory_iterator(...): Permission denied [./].
This issue occurs because AndroidActivity restricts file access due to sandboxing.

Reproduction:
In the example folder, the LlamaAndroid project can reproduce the crash by calling ggml_backend_load_all first in Java_android_llama_cpp_LLamaAndroid_backend_1init.

* Update ggml/src/ggml-backend-reg.cpp

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2024-12-13 13:56:07 +01:00
Eve 64ae065511 vulkan: small mul_mat_vec optimizations (#10665)
* double the number of rows per workgroup

* Update ggml-vulkan.cpp

* Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats

* only increase the number of rows for amd and subgroup size 64

* fix missing NUM_ROWS for mul_mat_vec_iq4_nl_f16_f32, untested

* use subgroup min and max to check for gcn (requires https://github.com/ggerganov/llama.cpp/pull/10721)

* manual merge ggml-vulkan.cpp

* set min and max subgroup size in any case

* Also double the number of rows for Intel GPUs
2024-12-13 09:42:04 +01:00
Akarshan Biswas 83ed24a97b SYCL: Reduce most of the compiler warnings (#10748)
* Try to reduce some unused and typecast warnings

* Reduce compiler warnings step 2

* add a newline at the end of the file

* Initialize nreduce as size_t

* [SYCL] Remove pragma directives from mmq.cpp

* SYCL: mmq add condition to prevent blocks_per_tile_x_row variable from becoming 0

* SYCL softmax: Initialize nreduce as size_t

* ggml-sycl.cpp: fix some trailing whitespaces

* SYCL: remove the unused variables instead of commenting it out

* SYCL poo2d kernel: set NAN for invalid pooling op

* SYCL gemm.hpp: remove pragma directives

* SYCL gemm.hpp: use const cast to properly support dnnl::memory

* SYCL: wkv6 remove a comment

* SYCL: clean comments step 2

* SYCL: clean comments and variables step 3

* SYCL: Use GGML_UNUSED for unused variables

* SYCL: remove extra empty lines and a comment

* Remove TODO

* cleanup spaces

* add a stdout for unsupported op

* use sycl printf over fprintf

* remove prints for CI

* SYCL ggml-sycl: pool2D use sycl::nan and remove if-else block

---------

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-12-13 12:12:15 +05:30
Karol Kontny d583cd03f6 ggml : Fix compilation issues on ARM platform when building without fp16 (#10811) 2024-12-13 01:04:19 +01:00
Xuan Son Nguyen adffa6ffd5 common : improve -ctv -ctk CLI arguments (#10806)
* common : improve ctv ctk cli argument

* regenerate docs

* even better approach

* use std::vector
2024-12-12 22:53:05 +01:00
Xuan Son Nguyen 274ec65af6 contrib : add ngxson as codeowner (#10804) 2024-12-12 20:52:28 +01:00
a3sh 8faa1d4dd4 CUDA: faster non-contiguous concat (#10760)
* faster uncontiguous concat

* Use a lambda to avoid code duplication

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* Update ggml/src/ggml-cuda/concat.cu

* add constexpr  and static assert

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2024-12-12 19:09:50 +01:00
Diego Devesa cb13ef85a4 remove CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS (#10797)
other windows build fixes
2024-12-12 19:02:49 +01:00
0cc4m 4064c0e3b6 Vulkan: Use improved q4_k and q5_k dequant code in dequant shaders (#10798) 2024-12-12 18:36:00 +01:00
0cc4m dc5301d565 Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats (#10721)
* Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats

* Fix subgroup size control extension support check

Add accf32 and accf16 checks for coopmats

* Also disable coopmats on amdvlk
2024-12-12 18:35:37 +01:00
Xuan Son Nguyen 9fdb124304 common : add missing env var for speculative (#10801) 2024-12-12 16:57:32 +01:00
CentricStorm 5555c0c1f6 docs: update server streaming mode documentation (#9519)
Provide more documentation for streaming mode.
2024-12-11 23:40:40 +01:00
Georgi Gerganov 973f328b1e Merge pull request #10788 from ggerganov/gg/gguf-py-0.11.0 2024-12-11 23:14:46 +02:00
Georgi Gerganov fb18934a97 gguf-py : bump version to 0.11.0 2024-12-11 23:13:31 +02:00
Xuan Son Nguyen 235f6e14bf server : (UI) add tok/s, get rid of completion.js (#10786)
* get rid of completion.js

* extract chat bubble to a component

* add tok/s info

* sync

* fix BASE_URL

* only extract timings when it's enabled

* fix auto scroll
2024-12-11 20:52:14 +01:00
qingy1337 1a31d0dc00 Update README.md (#10772) 2024-12-11 16:16:32 +01:00
Xuan Son Nguyen 92f77a640f ci : pin nodejs to 22.11.0 (#10779) 2024-12-11 14:59:41 +01:00
kallewoof 484d2f31ae bug-fix: snprintf prints NULL in place of the last character (#10419)
* bug-fix: snprintf prints NULL in place of the last character

We need to give snprintf enough space to print the last character and the null character, thus we allocate one extra byte and then ignore it when converting to std::string.

* add comment about extra null-term byte requirement
2024-12-11 14:48:04 +01:00
CentricStorm 4b4d92b098 docs: fix server documentation formatting (#10776) 2024-12-11 11:47:43 +01:00
Gilad S. 43041d2eb3 ggml: load all backends from a user-provided search path (#10699)
* feat: load all backends from a user-provided search path

* fix: Windows search path

* refactor: rename `ggml_backend_load_all_in_search_path` to `ggml_backend_load_all_from_path`

* refactor: rename `search_path` to `dir_path`

* fix: change `NULL` to `nullptr`

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* fix: change `NULL` to `nullptr`

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2024-12-11 01:47:21 +01:00
Jeff Bolz b685daf386 vulkan: request round-to-even for fp16 in im2col/rope_head (#10767)
Vulkan doesn't mandate a specific rounding mode, but the shader_float_controls
feature allows rounding mode to be requested if the implementation supports it.
2024-12-10 21:23:17 +01:00
Eve dafae66cc2 vulkan: dynamic subgroup size for the remaining k quants (#10745)
* q5_k

q4_k

q3_k

q2_k

q6_k multi row example

* revert as multi row isnt faster for k quants
2024-12-10 20:33:23 +01:00
Bartowski ae4b922614 imatrix : Add imatrix to --no-context-shift (#10766)
This allows for setting the --no-context-shift value in llama-imatrix which is required for models like DeepSeek
2024-12-10 18:23:50 +01:00
Andreas Kieslinger 750cb3e246 CUDA: rename macros to avoid conflicts with WinAPI (#10736)
* Renames NVIDIA GPU-architecture flags to avoid name clashes with WinAPI. (e.g. CC_PASCAL, GPU architecture or WinAPI pascal compiler flag?)

* Reverts erroneous rename in SYCL-code.

* Renames GGML_CUDA_MIN_CC_DP4A to GGML_CUDA_CC_DP4A.

* Renames the rest of the compute capability macros for consistency.
2024-12-10 18:23:24 +01:00
Yüg a86ad841f1 server : add flag to disable the web-ui (#10762) (#10751)
Co-authored-by: eugenio.segala <esegala@deloitte.co.uk>
2024-12-10 18:22:34 +01:00
Jeff Bolz a05e2afcc2 vulkan: disable spirv-opt for coopmat shaders (#10763)
There are some bugs in the 1.3.296 SDK, so disable this. It isn't strictly
necessary anyway.

Add missing dependency on vulkan-shaders-gen, so shaders get recompiled when it
changes.

Fix coopmat support reporting when glslc doesn't support NV_coopmat2.
2024-12-10 18:22:20 +01:00
Johannes Gäßler 26a8406ba9 CUDA: fix shared memory access condition for mmv (#10740) 2024-12-09 20:07:12 +01:00
Srihari-mcw c37fb4cf62 Changes to CMakePresets.json to add ninja clang target on windows (#10668)
* Update cmakepreset.json to use clang with ninja by default

* Update cmakepreset.json to add clang and ninja based configs

* Updates to build.md file

* Make updates to rename preset targets

* Update with .cmake file

* Remove additional whitespaces

* Add .cmake file for x64-windows-llvm

* Update docs/build.md

* Update docs/build.md

---------

Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
2024-12-09 09:40:19 -08:00
Jeff Bolz 3d98b4cb22 vulkan: fix compile warnings (#10731) 2024-12-09 08:24:01 +01:00
Borislav Stanimirov 1a05004743 cmake : simplify msvc charsets (#10672) 2024-12-09 09:15:13 +02:00
Xuan Son Nguyen ce8784bdb1 server : fix format_infill (#10724)
* server : fix format_infill

* fix

* rename

* update test

* use another model

* update test

* update test

* test_invalid_input_extra_req
2024-12-08 23:04:29 +01:00
Xuan Son Nguyen e52522b869 server : bring back info of final chunk in stream mode (#10722)
* server : bring back into to final chunk in stream mode

* clarify a bit

* traling space
2024-12-08 20:38:51 +01:00
stduhpf 06d70147e6 Vulkan: fix NaN in tanh.comp with AMD proprietary driver on Windows (#10723)
* Vulkan: fix NaN in tanh.comp

* Faster NaN-free tanh
2024-12-08 19:19:19 +01:00
Diego Devesa 43ed389a3f llama : use cmake for swift build (#10525)
* llama : use cmake for swift build

* swift : <> -> ""

* ci : remove make

* ci : disable ios build

* Revert "swift : <> -> """

This reverts commit d39ffd9556.

* ci : try fix ios build

* ci : cont

* ci : cont

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-12-08 13:14:54 +02:00
Jeff Bolz ecc93d0558 vulkan: compile a test shader in cmake to check for coopmat2 support (#10713) 2024-12-08 09:05:55 +01:00
Robert Collins 62e84d9848 llama : add 128k yarn context for Qwen (#10698)
* add 128k yarn context for Qwen

* added property for model tensors

* removing useless line
2024-12-07 23:12:27 +02:00
Xuan Son Nguyen 3573fa8e7b server : (refactor) no more json in server_task input (#10691)
* server : (refactor) no more json in server_task input

* add test for slots endpoint

* add tests for /props and /slots

* remove task inf_type

* fix CI by adding safe_json_to_str

* add "model_path" to /props

* update readme
2024-12-07 20:21:09 +01:00
Georgi Gerganov d9c3ba2b77 ggml : disable iq4_nl interleave size 8 (#10709)
ggml-ci
2024-12-07 18:38:15 +02:00
Georgi Gerganov ce4a7b8493 server : various fixes (#10704)
* server : various fixes

ggml-ci

* server : show curent seed in slot_params

ggml-ci

* fix /slots endpoint

* Update examples/server/server.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server : reflect endpoint response changes in the readme

ggml-ci

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-12-07 18:02:05 +02:00
Djip007 19d8762ab6 ggml : refactor online repacking (#10446)
* rename ggml-cpu-aarch64.c to .cpp

* reformat extra cpu backend.

- clean Q4_0_N_M and IQ4_0_N_M
  - remove from "file" tensor type
  - allow only with dynamic repack

- extract cpu extra bufts and convert to C++
  - hbm
  - "aarch64"

- more generic use of extra buffer
  - generalise extra_supports_op
  - new API for "cpu-accel":
     - amx
     - aarch64

* clang-format

* Clean Q4_0_N_M ref

Enable restrict on C++

* add op GGML_OP_MUL_MAT_ID for Q4_0_N_M with runtime repack

* added/corrected control on tensor size for Q4 repacking.

* Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* add debug logs on repacks.

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-12-07 14:37:50 +02:00
Georgi Gerganov c2a16c0bdb server : fix free of spec context and batch (#10651)
ggml-ci
2024-12-07 11:52:44 +02:00
0cc4m 3df784b305 Vulkan: VK_KHR_cooperative_matrix support to speed up prompt processing (#10597)
* Vulkan: Implement VK_KHR_cooperative_matrix support in the matrix matrix multiplication shader

* Improve performance with better q4_k and q5_k dequant and store unrolling

* Add Vulkan MUL_MAT and MUL_MAT_ID accumulator precision selection

* Rework mulmat shader selection and compilation logic, avoid compiling shaders that won't get used by device

* Vulkan: Implement accumulator switch for specific mul mat mat shaders

* Vulkan: Unroll more loops for more mul mat mat performance

* Vulkan: Add VK_AMD_shader_core_properties2 support to read Compute Unit count for split_k logic

* Disable coopmat support on AMD proprietary driver

* Remove redundant checks

* Add environment variable GGML_VK_DISABLE_COOPMAT to disable VK_KHR_cooperative_matrix support

* Fix rebase typo

* Fix coopmat2 MUL_MAT_ID pipeline selection
2024-12-07 10:24:15 +01:00
Robert Ormandi 86a1934978 metal : Extend how Llama.cpp locates metal resources (#10676)
* metal : Extend how Llama.cpp locates metal resources (#10675)

  * It searches the resource file in the directory where the current
    binary is located as well.
  * Resolves symbolic links.

Rationale:

When we plug this dependency into a Bazel build and run it in the
context of Bazel (e.g. testing):

  * the execution directory is often very different from where the files
    are located and no direct control over this (Bazel sandboxing),
  * the Bazel sandbox often use symbolic links to make files available.

With this patch, we can have the resource file added to the target,
can build and run tests in the context of Bazel.

* Update ggml/src/ggml-metal/ggml-metal.m

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml-metal/ggml-metal.m

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-12-07 09:55:01 +02:00
Sukriti Sharma 784a14aa49 convert : add support for Roberta embeddings (#10695) 2024-12-07 09:02:14 +02:00
Georgi Gerganov c5ede3849f convert : add custom attention mapping 2024-12-06 21:33:49 +02:00
Xuan Son Nguyen f162d45a21 common : bring back --no-warmup to server (#10686) 2024-12-06 13:29:05 +01:00
Xuan Son Nguyen 6c5bc0625f server : (refactoring) do not rely on JSON internally (#10643)
* server : (refactoring) reduce usage of json internally

* move all response types to struct

* wip [no ci]

* many fixes

* add virtual function

* fix index

* minor style fix

* add std::move

* refactor handle_completions_generic

* add virtual functions

* remove server.hpp

* clarify server_sent_event RFC specs

* apply review comments

* fix model_alias and completion_probabilities

* small clean up

* remove virtual for to_json_oai_compat()

* naming oai_compat --> oaicompat

* fix unwanted recursive call

* update docs
2024-12-06 11:14:32 +01:00
Plamen Minev 7736837d62 fix(server) : not show alert when DONE is received (#10674) 2024-12-05 22:36:41 +01:00
Jeff Bolz c9c6e01dae vulkan: Add VK_NV_cooperative_matrix2 support for mul_mat and flash attention (#10206) 2024-12-05 20:15:05 +01:00
Riccardo Orlando 6fe6247831 llama : add Minerva 7B model support (#10673)
* Support for Minerva 7B

* Update convert_hf_to_gguf_update.py
2024-12-05 20:30:59 +02:00
Georgi Gerganov 0cd182ebcc sync : ggml 2024-12-05 13:27:42 +02:00
PAB a8cbab201d ggml: add GGML_SET Metal kernel + i32 CPU kernel (ggml/1037)
* implemented cpu kernel

* add i32 test cases in test-backend-ops

* typedef `ggml_metal_kargs_set`

* implemented `kernel_set`

* memcpy
2024-12-05 13:27:33 +02:00
PAB c2082d93a8 ggml : add GGML_PAD_REFLECT_1D operation (ggml/1034)
* ggml_pad_reflect_1d defined in header

* implemented on CPU

* called the forward pass

* impl Metal kernel

* added Metal kernel

* added OP_PAD_REFLECT_1D in test-backend-ops.cpp

* add test-pad-reflect-1d test case

* test case support multiple backend
2024-12-05 13:27:31 +02:00
Daniel Bevenius d405804be8 py : update outdated copy-paste instructions [no ci] (#10667)
This commit updates the copy-paste instruction in
convert_hf_to_gguf_update.py to reflect that convert_hf_to_gguf.py
will have already been updated with the new get_vocab_base_pre()
function when this script completes.
2024-12-05 09:47:55 +02:00
aryantandon01 f112d198cd Update deprecation-warning.cpp (#10619)
Fixed Path Separator Handling for Cross-Platform Support (Windows File Systems)
2024-12-04 23:19:20 +01:00
Georgi Gerganov 1da7b76569 server : fix speculative decoding with context shift (#10641)
* server : fix speculative decoding with context shift

ggml-ci

* server : take into account speculative limits

ggml-ci

* server : add tests
2024-12-04 22:38:20 +02:00
Diego Devesa 59f4db1088 ggml : add predefined list of CPU backend variants to build (#10626)
* ggml : add predefined list of CPU backend variants to build

* update CPU dockerfiles
2024-12-04 14:45:40 +01:00
Diego Devesa 2803540814 ggml-cpu : fix HWCAP2_I8MM value (#10646) 2024-12-04 14:40:44 +01:00
ltoniazzi 253b7fde91 Fix HF repo commit to clone lora test models (#10649) 2024-12-04 10:45:48 +01:00
JFLFY2255 8d0cfd554a llama: Support MiniCPM-1B (with & w/o longrope) (#10559) 2024-12-04 11:42:50 +02:00
Jeff Bolz 2759916d86 vulkan: Implement "fast divide" (mul+shift) for unary ops like copy (#10642) 2024-12-04 08:28:59 +01:00
Nicolò Scipione 40c6d79fb5 SYCL : Move to compile time oneMKL interface backend selection for NVIDIA backend (#10584)
* [SYCL] Move to Compile Time backend selection on oneMKL Interface for NVIDIA backend

Move to compile time selection to backend to avoid latency at run time.
Add it to all mkl gemm calls and only for NVIDIA backend.

Signed-off-by: nscipione <nicolo.scipione@codeplay.com>

* Formatting

* Address PR comments to increase readibility

---------

Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
2024-12-04 09:29:20 +08:00
Wang Ran (汪然) 98036d5670 fix typo of README.md (#10605) 2024-12-04 02:22:50 +01:00
Frankie Robertson cd2f37b304 Avoid using __fp16 on ARM with old nvcc (#10616) 2024-12-04 01:41:37 +01:00
Benson Wong da6aac91f1 Add docs for creating a static build (#10268) (#10630)
* Add notes for a static build

* Update docs/build.md

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2024-12-04 01:40:36 +01:00
piDack 01e6d9bb71 clip : add sycl support (#10574)
Co-authored-by: piDack <pcdack@hotmail.co>
2024-12-04 01:26:37 +01:00
Jeff Bolz cc98896db8 vulkan: optimize and reenable split_k (#10637)
Use vector loads when possible in mul_mat_split_k_reduce. Use split_k
when there aren't enough workgroups to fill the shaders.
2024-12-03 20:29:54 +01:00
Xuan Son Nguyen 91c36c269b server : (web ui) Various improvements, now use vite as bundler (#10599)
* hide buttons in dropdown menu

* use npm as deps manager and vite as bundler

* fix build

* fix build (2)

* fix responsive on mobile

* fix more problems on mobile

* sync build

* (test) add CI step for verifying build

* fix ci

* force rebuild .hpp files

* cmake: clean up generated files pre build
2024-12-03 19:38:44 +01:00
Georgi Gerganov 1cd3df46bd scripts : remove amx sync
ggml-ci
2024-12-03 20:04:49 +02:00
Georgi Gerganov c505471857 sync : ggml 2024-12-03 20:04:49 +02:00
mahorozte e9e661bd59 CUDA: remove unnecessary warp reduce in FA (ggml/1032)
* kqmax_new_j in every thread within warp is same after operate at line 199,this reduce can be omit

* same problem in vec32

---------

Co-authored-by: ZhaoXiaoYu <zhao.xiaoyu@zte.com.cn>
2024-12-03 20:04:49 +02:00
PAB efb6ae9630 feat: add GGML_UNARY_OP_ARGMAX Metal kernel (ggml/1019)
* implemented argmax kernel

* tpig -> tgpig

* change to strides

* contiguous assertions

* kernel working and tested

* argmax simd parallel implementation

* added 2 new tests for argmax in test-backend-ops

* cosmit

* added 3 tests cases for perf eval

* add test_argmax in make_test_cases_perf

* Update test-backend-ops.cpp

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2024-12-03 20:04:49 +02:00
PAB 667d70d170 metal : add GGML_OP_CONV_TRANSPOSE_1D kernels (ggml/1026)
* wip

* wip implementation f32

* kernel conv transpose 1d f32 working

* initial commit
2024-12-03 20:04:49 +02:00
Xuan Son Nguyen 3b4f2e33e2 llama : add missing LLAMA_API for llama_chat_builtin_templates (#10636) 2024-12-03 12:54:30 +01:00
Nikolaos Pothitos 82bca2257b readme : add option, update default value, fix formatting (#10271)
* readme : document --no-display-prompt

* readme : update default prompt context size

* readme : remove unnecessary indentation

Indenting a line with four spaces makes Markdown treat that section as
plain text.

* readme : indent commands under bullets

* readme : indent commands in lettered list
2024-12-03 12:50:08 +02:00
Georgi Gerganov 0115df2f65 metal : small-batch mat-mul kernels (#10581)
* metal : small-batch mat-mul kernels

ggml-ci

* metal : add rest of types

ggml-ci

* metal : final adjustments

ggml-ci

* metal : add comments

ggml-ci
2024-12-03 11:52:33 +02:00
Georgi Gerganov 515d4e5372 github : minify link [no ci] (revert)
this doesn't work as expected
2024-12-03 11:21:43 +02:00
Georgi Gerganov 844e2e1fee github : minify link [no ci] 2024-12-03 11:20:35 +02:00
Georgi Gerganov 70b98fadbc server : fix default draft model parameters (#10586)
* server : force F16 KV cache for the draft model

ggml-ci

* server : fix draft params

ggml-ci

* server : various params fixes

ggml-ci
2024-12-03 11:20:00 +02:00
Xuan Son Nguyen 642330ac7c llama : add enum for built-in chat templates (#10623)
* llama : add enum for supported chat templates

* use "built-in" instead of "supported"

* arg: print list of built-in templates

* fix test

* update server README
2024-12-02 22:10:19 +01:00
Georgi Gerganov 8648c52101 make : deprecate (#10514)
* make : deprecate

ggml-ci

* ci : disable Makefile builds

ggml-ci

* docs : remove make references [no ci]

* ci : disable swift build

ggml-ci

* docs : remove obsolete make references, scripts, examples

ggml-ci

* basic fix for compare-commits.sh

* update build.md

* more build.md updates

* more build.md updates

* more build.md updates

* Update Makefile

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-12-02 21:22:53 +02:00
haopeng 64ed2091b2 server: Add "tokens per second" information in the backend (#10548)
* add cmake rvv support

* add timings

* remove space

* update readme

* fix

* fix code

* remove empty line

* add test

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-12-02 14:45:54 +01:00
Akarshan Biswas 991f8aabee SYCL: Fix and switch to GGML_LOG system instead of fprintf (#10579)
* Switched to GGML_LOG

* Fix missing semicolon
2024-12-02 15:04:11 +08:00
Georgi Gerganov 4cb003dd8d contrib : refresh (#10593)
* contrib : refresh

* contrib : expand [no ci]

* contrib : expand test-backend-ops instructions

* contrib : add CODEOWNERS

* prs : update template to not have checkbox [no ci]
2024-12-02 08:53:27 +02:00
Juk Armstrong 917786f43d Add mistral-v1, mistral-v3, mistral-v3-tekken and mistral-v7 chat template types (#10572)
* Templates: `mistral-v1`, `mistral-v2`, `mistral-v3`, `mistral-v3-tekken`

* Changed system message logic and added tests for all 4

* Invalid `system_message` instead of `content` fixed

* Removed tab-indented lines

* Added template code and test for `mistral-v7`

* Added all tests. Fixed bug with `tmpl == "llama2"` test.

* Replaced tabs with spaces.

* Removed `'mistral-v2'` option as no (open) models ever used it

* Removed all references to 'v2' template from comments

* Update llama.cpp

Fixed `trim_assistant_message` bug
2024-12-01 23:09:49 +01:00
Georgi Gerganov 5e1ed95583 grammars : add English-only grammar (#10612) 2024-12-01 21:37:54 +02:00
Wang Qin 5c7a5aa0c3 ci: add error handling for Python venv creation in run.sh (#10608) 2024-12-01 20:11:42 +02:00
Diego Devesa 3420909dff ggml : automatic selection of best CPU backend (#10606)
* ggml : automatic selection of best CPU backend

* amx : minor opt

* add GGML_AVX_VNNI to enable avx-vnni, fix checks
2024-12-01 16:12:41 +01:00
alek3y 86dc11c5bc server : bind to any port when specified (#10590) 2024-12-01 13:33:12 +02:00
Georgi Gerganov 6acce39710 readme : update the usage section with examples (#10596)
* readme : update the usage section with examples

* readme : more examples
2024-12-01 11:25:17 +02:00
Wang Qin 43957ef203 build: update Makefile comments for C++ version change (#10598) 2024-12-01 04:19:44 +01:00
Adrien Gallouët 0c39f44d70 ggml-cpu: replace AArch64 NEON assembly with intrinsics in ggml_gemv_q4_0_4x4_q8_0() (#10567)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2024-11-30 09:13:18 -08:00
Georgi Gerganov 3e0ba0e604 readme : remove old badge 2024-11-30 10:09:21 +02:00
Georgi Gerganov abadba05be readme : refresh (#10587)
* readme : refresh

* readme : move section [no ci]

* readme : clarify [no ci]

* readme : fixes [no ci]

* readme : more fixes [no ci]

* readme : simplify [no ci]

* readme : clarify GGUF
2024-11-30 09:47:07 +02:00
Eve 0533e7fb38 vulkan: Dynamic subgroup size support for Q6_K mat_vec (#10536)
* subgroup 64 version with subgroup add. 15% faster

scalable version

tested for subgroup sizes 16-128

* check for subgroup multiple of 16 and greater than 16

* subgroup sizes are always a power of 2 (https://github.com/KhronosGroup/GLSL/issues/45)

* force 16 sequential threads per block

* make 16 subgroup size a constant
2024-11-30 08:00:02 +01:00
Diego Devesa 7cc2d2c889 ggml : move AMX to the CPU backend (#10570)
* ggml : move AMX to the CPU backend

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-11-29 21:54:58 +01:00
Xuan Son Nguyen b782e5c7d4 server : add more test cases (#10569)
* server : add split model test

* add test speculative

* add invalid cases
2024-11-29 21:48:56 +01:00
Robert Collins 3a8e9af402 imatrix : support combine-only (#10492)
* imatrix-combine-only idea

* ensured that behavior consistent with log
2024-11-29 19:21:37 +02:00
Diego Devesa a3a3048e7a cleanup UI link list (#10577)
* cleanup UI link list

* sort list alphabetically

* add missing licenses
2024-11-29 17:45:08 +01:00
Georgi Gerganov f0678c5ff4 ggml : fix I8MM Q4_1 scaling factor conversion (#10562)
ggml-ci
2024-11-29 16:25:39 +02:00
Shupei Fan 4b3242bbea ggml-cpu: fix typo in gemv/gemm iq4_nl_4_4 (#10580) 2024-11-29 14:49:02 +01:00
Alberto Cabrera Pérez 0f77aae560 sycl : offload of get_rows set to 0 (#10432) 2024-11-29 20:38:45 +08:00
Alberto Cabrera Pérez 266b8519ee sycl : Reroute permuted mul_mats through oneMKL (#10408)
This PR fixes the failing MUL_MAT tests for the sycl backend.
2024-11-29 09:49:43 +00:00
Chenguang Li 938f608742 CANN: RoPE operator optimization (#10563)
* [cann] RoPE operator optimization

* [CANN]Code Formatting

---------

Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2024-11-29 14:46:55 +08:00
Jeff Bolz f095a649ec vulkan: get the first command buffer submitted sooner (#10499)
This is an incremental improvement over #9118 to get work to the GPU a bit
sooner. The first part is to start with a smaller number of nodes before
the first submit, and ramp it up to the current 100 nodes/submit. The
second part is to reduce the dryrun overhead for all the nodes that just
need to request descriptor space.

With these changes I get around 1-2% speedup on RTX 4070 combined with my
old Haswell-era CPU.
2024-11-29 07:18:02 +01:00
Ting Lou 678d7994f4 llava: return false instead of exit (#10546) 2024-11-29 01:09:46 +01:00
Georgi Gerganov dc22344088 ggml : remove redundant copyright notice + update authors 2024-11-28 20:46:40 +02:00
Georgi Gerganov 4c0a95b107 llama : add missing model types 2024-11-28 20:45:07 +02:00
Xuan Son Nguyen 6c59567689 server : (tests) don't use thread for capturing stdout/stderr, bump openai client library (#10568)
* server : (tests) don't use thread for capturing stdout/stderr

* test: bump openai to 1.55.2

* bump openai to 1.55.3
2024-11-28 19:17:49 +01:00
Johannes Gäßler 890719311b common: fix warning message when no GPU found (#10564) 2024-11-28 18:15:25 +01:00
Random Fly 7281cf13ad docs: fix outdated usage of llama-simple (#10565) 2024-11-28 16:03:11 +01:00
Diego Devesa e90688edd0 ci : fix tag name in cuda and hip releases (#10566) 2024-11-28 15:58:54 +01:00
Georgi Gerganov 76b27d29c2 ggml : fix row condition for i8mm kernels (#10561)
ggml-ci
2024-11-28 14:56:37 +02:00
Georgi Gerganov eea986f215 cmake : fix ARM feature detection (#10543)
ggml-ci
2024-11-28 14:56:23 +02:00
Shupei Fan c202cef168 ggml-cpu: support IQ4_NL_4_4 by runtime repack (#10541)
* ggml-cpu: support IQ4_NL_4_4 by runtime repack

* ggml-cpu: add __ARM_FEATURE_DOTPROD guard
2024-11-28 13:52:03 +01:00
Sergio López 2025fa67e9 kompute : improve backend to pass test_backend_ops (#10542)
* kompute: op_unary: reject unsupported parameters

Signed-off-by: Sergio Lopez <slp@redhat.com>

* kompute: softmax: implement ALiBi support

Signed-off-by: Sergio Lopez <slp@redhat.com>

* kompute: rope: implement neox and phi3 support

Signed-off-by: Sergio Lopez <slp@redhat.com>

* kompute: op_mul_mat_q4_k permutted support

Signed-off-by: Sergio Lopez <slp@redhat.com>

* kompute: op_mul_mat_[q4_0|q4_1|q8_0] permutted support

Signed-off-by: Sergio Lopez <slp@redhat.com>

* kompute: op_mul_mat_f16 permutted support

Signed-off-by: Sergio Lopez <slp@redhat.com>

* kompute: op_mul_mat_q6_k permutted support

Signed-off-by: Sergio Lopez <slp@redhat.com>

---------

Signed-off-by: Sergio Lopez <slp@redhat.com>
2024-11-28 12:51:38 +01:00
Ruixin Huang c6bc73951e CANN: Update cann.md to display correctly in CLion (#10538) 2024-11-28 15:27:11 +08:00
leo-pony 605fa66c50 CANN: Fix SOC_TYPE compile bug (#10519)
* CANN: Fix the bug build fail on Ascend310P under two cases:
1) Manual specify SOC_TYPE
2) Under some unusual compile environment

* Update the cann backend News content: Support F16 and F32 data type model for Ascend 310P NPU.

* fix CANN  compile fail bug: the assert in ascend kernel function doesn't supportted on some CANN version
2024-11-28 15:25:24 +08:00
Chenguang Li b7420131bf CANN: ROPE operator optimization (#10540)
* [cann] ROPE operator optimization

Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2024-11-28 14:24:46 +08:00
Xuan Son Nguyen 9f912511bc common : fix duplicated file name with hf_repo and hf_file (#10550) 2024-11-27 22:30:52 +01:00
uvos 3ad5451f3b Add some minimal optimizations for CDNA (#10498)
* Add some minimal optimizations for CDNA

* ggml_cuda: set launch bounds also for GCN as it helps there too
2024-11-27 17:10:08 +01:00
Diego Devesa 46c69e0e75 ci : faster CUDA toolkit installation method and use ccache (#10537)
* ci : faster CUDA toolkit installation method and use ccache

* remove fetch-depth

* only pack CUDA runtime on master
2024-11-27 11:03:25 +01:00
Georgi Gerganov 9e2301f4a4 metal : fix group_norm support condition (#0) 2024-11-27 11:22:14 +02:00
Georgi Gerganov fee824a1a1 sync : ggml 2024-11-27 11:10:42 +02:00
Frankie Robertson 9150f8fef9 Do not include arm_neon.h when compiling CUDA code (ggml/1028) 2024-11-27 11:10:27 +02:00
Jeff Bolz c31ed2abfc vulkan: define all quant data structures in types.comp (#10440) 2024-11-27 08:32:54 +01:00
Jeff Bolz 5b3466bedf vulkan: Handle GPUs with less shared memory (#10468)
There have been reports of failure to compile on systems with <= 32KB
of shared memory (e.g. #10037). This change makes the large tile size
fall back to a smaller size if necessary, and makes mul_mat_id fall
back to CPU if there's only 16KB of shared memory.
2024-11-27 08:30:27 +01:00
Jeff Bolz 249a7902ec vulkan: further optimize q5_k mul_mat_vec (#10479) 2024-11-27 08:21:59 +01:00
Jeff Bolz 71a64989a5 vulkan: skip integer div/mod in get_offsets for batch_idx==0 (#10506) 2024-11-27 08:08:54 +01:00
Jeff Bolz 4a57d362e1 vulkan: optimize Q2_K and Q3_K mul_mat_vec (#10459) 2024-11-27 08:00:50 +01:00
Diego Devesa c9b00a70b0 ci : fix cuda releases (#10532) 2024-11-26 22:12:10 +01:00
Shane A de5097351c Add OLMo 2 model in docs (#10530)
* Add link to OLMo 2 model in docs

* Change link to landing page
2024-11-26 21:55:29 +01:00
Diego Devesa 5a349f2809 ci : remove nix workflows (#10526) 2024-11-26 21:13:54 +01:00
Diego Devesa 30ec398321 llama : disable warnings for 3rd party sha1 dependency (#10527) 2024-11-26 21:01:47 +01:00
Tristan Druyen be0e350c8b Fix HIP flag inconsistency & build docs (#10524)
* Fix inconsistency of HIP flags in cmake & make

* Fix docs regarding GGML_HIP
2024-11-26 19:27:28 +01:00
R0CKSTAR 249cd93da3 mtgpu: Add MUSA_DOCKER_ARCH in Dockerfiles && update cmake and make (#10516)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-11-26 17:00:41 +01:00
Jeff Bolz 904109ed0d vulkan: fix group_norm (#10496)
Fix bad calculation of the end of the range. Add a backend test that
covers the bad case (taken from stable diffusion).

Fixes https://github.com/leejet/stable-diffusion.cpp/issues/439.
2024-11-26 16:45:05 +01:00
Xuan Son Nguyen 45abe0f74e server : replace behave with pytest (#10416)
* server : replace behave with pytest

* fix test on windows

* misc

* add more tests

* more tests

* styling

* log less, fix embd test

* added all sequential tests

* fix coding style

* fix save slot test

* add parallel completion test

* fix parallel test

* remove feature files

* update test docs

* no cache_prompt for some tests

* add test_cache_vs_nocache_prompt
2024-11-26 16:20:18 +01:00
Neo Zhang Jianyu 0bbd2262a3 restore the condistion to build & update pacakge when merge (#10507)
Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>
2024-11-26 21:43:47 +08:00
Georgi Gerganov ab96610b1e cmake : enable warnings in llama (#10474)
* cmake : enable warnings in llama

ggml-ci

* cmake : add llama_get_flags and respect LLAMA_FATAL_WARNINGS

* cmake : get_flags -> ggml_get_flags

* speculative-simple : fix warnings

* cmake : reuse ggml_get_flags

ggml-ci

* speculative-simple : fix compile warning

ggml-ci
2024-11-26 14:18:08 +02:00
Diego Devesa 7db3846a94 ci : publish the docker images created during scheduled runs (#10515) 2024-11-26 13:05:20 +01:00
Diego Devesa c6807b3f28 ci : add ubuntu cuda build, build with one arch on windows (#10456) 2024-11-26 13:05:07 +01:00
Charles Xu 25669aa92c ggml-cpu: cmake add arm64 cpu feature check for macos (#10487)
* ggml-cpu: cmake add arm64 cpu feature check for macos

* use vmmlaq_s32 for compile option i8mm check
2024-11-26 13:37:05 +02:00
Georgi Gerganov 84e1c33cde server : fix parallel speculative decoding (#10513)
ggml-ci
2024-11-26 13:36:40 +02:00
Georgi Gerganov 811872a59d speculative : simplify the implementation (#10504)
ggml-ci
2024-11-26 12:29:38 +02:00
Shanshan Shen 9a4b79bcfa CANN: Improve the Inferencing Performance for Ascend NPU Device (#10454)
* improve inferencing performance for ascend npu.

Co-authored-by: Frank Mai <thxCode@thxcode0824@gmail.com>

* some modification after review

* some modifications after review

* restore some modifications

* restore some modifications

---------

Co-authored-by: shanshan shen <shanshanshen333@gmail.com>
Co-authored-by: Frank Mai <thxCode@thxcode0824@gmail.com>
2024-11-26 18:08:37 +08:00
Chenguang Li 7066b4cce2 CANN: RoPE and CANCAT operator optimization (#10488)
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2024-11-26 17:31:05 +08:00
Junil Kim 0eb4e12bee vulkan: Fix a vulkan-shaders-gen arugment parsing error (#10484)
The vulkan-shaders-gen was not parsing the --no-clean argument correctly.
Because the previous code was parsing the arguments which have a value only
and the --no-clean argument does not have a value, it was not being parsed
correctly. This commit can now correctly parse arguments that don't have values.
2024-11-26 01:47:20 +00:00
Eric Curtin 0cc63754b8 Introduce llama-run (#10291)
It's like simple-chat but it uses smart pointers to avoid manual
memory cleanups. Less memory leaks in the code now. Avoid printing
multiple dots. Split code into smaller functions. Uses no exception
handling.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2024-11-25 22:56:24 +01:00
Diego Devesa 50d5cecbda ci : build docker images only once daily (#10503) 2024-11-25 22:05:39 +01:00
Georgi Gerganov 9fd8c2687f server : add more information about error (#10455) 2024-11-25 22:28:59 +02:00
Georgi Gerganov 47f931c8f9 server : enable cache_prompt by default (#10501)
ggml-ci
2024-11-25 21:50:07 +02:00
Georgi Gerganov 106964e3d2 metal : enable mat-vec kernels for bs <= 4 (#10491) 2024-11-25 21:49:31 +02:00
Shane A 80acb7b430 Rename Olmo1124 to Olmo2 (#10500) 2024-11-25 19:36:09 +01:00
Diego Devesa 10bce0450f llama : accept a list of devices to use to offload a model (#10497)
* llama : accept a list of devices to use to offload a model

* accept `--dev none` to completely disable offloading

* fix dev list with dl backends

* rename env parameter to LLAMA_ARG_DEVICE for consistency
2024-11-25 19:30:06 +01:00
Johannes Gäßler 1f922254f0 Github: update issue templates [no ci] (#10489) 2024-11-25 19:18:37 +01:00
brucepro a9a678a6b2 Add download chat feature to server chat (#10481)
* Add download chat feature to server chat

Add a download feature next to the delete chat feature in the server vue chat interface.

* code style

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-11-25 17:11:55 +01:00
Georgi Gerganov 9ca2e67762 server : add speculative decoding support (#10455)
* server : add speculative decoding support

ggml-ci

* server : add helper function slot.can_speculate()

ggml-ci
2024-11-25 16:31:38 +02:00
Diego Devesa 5931c1f233 ggml : add support for dynamic loading of backends (#10469)
* ggml : add support for dynamic loading of backends

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-11-25 15:13:39 +01:00
Georgi Gerganov f6d12e7df8 tests : fix compile warning 2024-11-25 15:17:32 +02:00
Georgi Gerganov b756441104 metal : minor code formatting 2024-11-25 15:08:04 +02:00
Neo Zhang Jianyu 5a8987793f [SYCL] Fix building Win package for oneAPI 2025.0 update (#10483)
* fix build package for 2025.0

* debug

* debug

* fix

* rm debug

---------

Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>
2024-11-25 17:31:10 +08:00
Georgi Gerganov d9d54e498d speculative : refactor and add a simpler example (#10362)
* speculative : refactor and add a simpler example

ggml-ci

* speculative : clean-up and add comments and TODOs [no ci]

* speculative : manage context in common_speculative

ggml-ci

* speculative : simplify

ggml-ci

* speculative : simplify (cont)

ggml-ci

* speculative : add --draft-min CLI arg

* speculative : minor fixup

* make : build fixes

* speculative : do not redraft previous drafts

ggml-ci

* speculative : fix the draft sampling

ggml-ci

* speculative : fix compile warning

* common : refactor args

ggml-ci

* common : change defaults [no ci]

* common : final touches

ggml-ci
2024-11-25 09:58:41 +02:00
Georgi Gerganov cce5a90075 flake.lock: Update (#10470)
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/5e4fbfb6b3de1aa2872b76d49fafc942626e2add?narHash=sha256-OZiZ3m8SCMfh3B6bfGC/Bm4x3qc1m2SVEAlkV6iY7Yg%3D' (2024-11-15)
  → 'github:NixOS/nixpkgs/23e89b7da85c3640bbc2173fe04f4bd114342367?narHash=sha256-y/MEyuJ5oBWrWAic/14LaIr/u5E0wRVzyYsouYY3W6w%3D' (2024-11-19)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-11-24 08:03:25 -08:00
Diego Devesa dc39012cba llama : fix op mul check with command-r-plus (#10476) 2024-11-24 16:10:26 +01:00
Gabe Goodhart 9336db462c convert : XLMRoberta Type Vocab Size (#10458)
This matches the key in common bert-based embedding models and may have a
value other than 1 in it.

Branch: XLMRobertaTypeVocabSize

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-11-24 11:02:34 +02:00
momonga 96fa2c5e2d fix gguf-py: Conversion error when multiple licenses are configured (#9807)
* fix general.license list to str

* fix join license list

---------

Co-authored-by: momonga <115213907+mmnga@users.noreply.github.com>
2024-11-24 01:09:22 +01:00
Diego Devesa 55ed008b2d ggml : do not use ARM features not included in the build (#10457) 2024-11-23 14:41:12 +01:00
蕭澧邦 6dfcfef078 ci: Update oneAPI runtime dll packaging (#10428)
This is the minimum runtime dll dependencies for oneAPI 2025.0
2024-11-22 10:44:08 +01:00
Johannes Gäßler 599b3e0cd4 GitHub: ask for more info in issue templates (#10426)
* GitHub: ask for more info in issues [no ci]

* refactor issue templates to be component-specific

* more understandable issue description

* add dropdown for llama.cpp module
2024-11-22 08:32:40 +01:00
leo-pony c18610b4ee CANN: Support Ascend310P to accelerate F32 and F16 Model (#10216)
* CANN Support Ascend310P to accelerate F32 and F16 Model

* Add compile option soc type macro ASCEND_310P to ggml-cann lib

* Remove unused code

* Remove the ascend soc_type hard code compile option in CMakelist.txt
2024-11-22 14:07:20 +08:00
Diego Devesa a5e47592b6 cuda : optimize argmax (#10441)
* cuda : optimize argmax

* remove unused parameter

ggml-ci

* fixup : use full warps

ggml-ci

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* fix ub

* ggml : check ne00 <= INT32_MAX in argmax and argsort

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-11-21 18:18:50 +01:00
Georgi Gerganov 1bb30bf28c llama : handle KV shift for recurrent models (#10402)
ggml-ci
2024-11-21 10:22:47 +02:00
Georgi Gerganov 87a533be57 sync : ggml 2024-11-21 09:22:11 +02:00
slaren 59b9172822 ggml/sched : do not skip views in pre-assignments 2024-11-21 09:22:05 +02:00
Johannes Gäßler 02e4eaf22f ggml-opt: fix data corruption (ggml/1022) 2024-11-21 09:22:02 +02:00
Jeff Bolz 9abe9eeae9 vulkan: predicate max operation in soft_max shaders/soft_max (#10437)
Fixes #10434
2024-11-20 20:47:36 +01:00
bandoti f95caa7954 cmake: add link dependencies to cmake find pkg (#10433)
* cmake pkg: find accelerate, openmp, memkind libs

* cmake pkg: find BLAS libs

* try BLAS_LIBRARIES instead

* Add BLAS link opts

* Add more link deps. and set GGML_ vars
2024-11-20 17:22:19 +01:00
Diego Devesa fab5d30ff6 llama : add .clang-format file (#10415) 2024-11-20 12:57:53 +01:00
Jeff Bolz 8fd4b7fa29 vulkan: copy iq4_nl LUT into shared memory (#10409) 2024-11-20 08:40:18 +01:00
Jeff Bolz 1bacb9f625 vulkan: further optimize mul_mat_vec using larger loads (#10387)
* vulkan: Use pipeline_robustness to disable robustness in mul_mat_vec.

Add some early returns for nonexistent rows in mul_mat_vec shaders. These
can only be hit when dispatching a 2D grid of workgroups. Fix the logic
for the 2D grid of workgroups to round up.

Enable the pipeline robustness extension if it's available, and use it to
disable robustness for these pipelines. The instructions to do the bounds
checking contend for the same ALU resources as the bit twiddling dequant
instructions.

* vulkan: Add GLSL structure aliases for quant types to allow larger loads

In Vulkan it's not possible to cast pointer types, so instead you have to
declare an aliased binding for the memory with a different type. This
commit adds aliases for the quant formats using 16b ints, and in a few
places where the struct size is a multiple of 4 also using 32b ints.
Currently only q4_k's aliases are used, but others will be used in
subsequent commits.

* vulkan: use larger loads in q5_k and q6_k shaders.

Similar to the optimization I did in q4_k recently, this vectorizes some loads
and reduces the number of bit twiddling instructions.

* vulkan: use larger K step per iteration in mul_mat_vec.

Add vec4 dequantization functions, and use them to do K=8 per iteration in
mul_mat_vec. This uses 16b loads for the quant values and 128b loads for B
which helps reduce the load on the memory system.

The K_PER_ITER==2 logic is still there, just for F16/F32, and really only
because they support unaligned sizes.

Tweak the num_iters/unrolling logic to be simpler and catch a couple missed
unrolling opportunities.
2024-11-20 08:11:00 +01:00
Neo Zhang Jianyu ad21c9e1f1 update rel to 4040 (#10395)
Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>
2024-11-20 13:54:25 +08:00
Anthony Van de Gejuchte 3952a221af Fix missing file renames in Makefile due to changes in commit ae8de6d50a (#10413) 2024-11-19 23:18:17 +01:00
haopeng 42ae10bbcd add cmake rvv support (#10411) 2024-11-19 21:10:31 +01:00
Georgi Gerganov 9fe0fb0626 sync : ggml 2024-11-19 20:03:21 +02:00
Plamen Minev 611fabd792 metal : fox offset integer overflows in im2col (ggml/1015)
-- While running StableDiffusion.cpp locally with Metal some offsets overflow and results in incorrect calculations
2024-11-19 20:03:21 +02:00
PAB 12b0ad953a metal : add GGML_UNARY_OP_ELU kernel (ggml/1018) 2024-11-19 20:03:21 +02:00
蕭澧邦 342397dc7e cmake: force MSVC compiler charset to utf-8 (#9989) 2024-11-19 18:42:00 +01:00
bandoti 2a11b6b094 Add required ggml-base and backend libs to cmake pkg (#10407) 2024-11-19 17:10:30 +01:00
Diego Devesa 3ee6382d48 cuda : fix CUDA_FLAGS not being applied (#10403) 2024-11-19 14:29:38 +01:00
Georgi Gerganov 8e752a777b llama : add check for KV cache shifts (#10401)
ggml-ci
2024-11-19 13:29:26 +02:00
Shane A a88ad007de llama : add OLMo November 2024 support (#10394)
* Add OLMo November 2024 constants

* Add OLMo November 2024 converter

* Add loading of OLMo November 2024 tensors and hyper parameters

* Add building of OLMo November 2024 model
2024-11-19 11:04:08 +02:00
Romain Biessy 2a1507c162 sycl : Add option to set the SYCL architecture for all targets (#10266)
* Add option to set the SYCL architecture for all targets
* Convert GGML_SYCL_HIP_TARGET to the more generic GGML_SYCL_ARCH option
* Document that setting GGML_SYCL_ARCH can improve the performance
2024-11-19 08:02:23 +00:00
Jeff Bolz b3e585988f vulkan: Optimize soft_max (#10301)
* vulkan: Optimize soft_max

Large soft_max could already saturate memory, but small/medium sizes were
pretty slow. The bulk of the gains for them comes from using a smaller
workgroup size, and making the workgroup size match the subgroup size also
makes the barriers much cheaper.

Cache some values in locals to avoid refetching/recomputing. And stamp
out a few "template instantiations" so smaller cases will fully unroll.

Add a missing early return for OOB rows. This happens when there are more
than 512 rows and the dispatch is 512 x H.

* vulkan: Further soft_max optimizations

Restore the workgroup size of 512 case, use it for >1024.

Use unrollable loops for more iteration counts.
2024-11-19 08:25:17 +01:00
Alberto Cabrera Pérez 557924f222 sycl: Revert MUL_MAT_OP support changes (#10385) 2024-11-19 08:50:04 +08:00
Diego Devesa d3481e6316 cuda : only use native when supported by cmake (#10389) 2024-11-18 18:43:40 +01:00
bandoti 531cb1c233 Skip searching root path for cross-compile builds (#10383) 2024-11-18 16:23:58 +01:00
Jeff Bolz f139d2ea61 vulkan: remove use of null initializer (#10372)
Seems like this isn't working for vulkan-over-metal when the array is sized
by a spec constant. Maybe a spirv-cross limitation?
2024-11-18 08:28:42 -06:00
Georgi Gerganov 2eb76b2a5e flake.lock: Update (#10346)
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/4aa36568d413aca0ea84a1684d2d46f55dbabad7?narHash=sha256-Zwl8YgTVJTEum%2BL%2B0zVAWvXAGbWAuXHax3KzuejaDyo%3D' (2024-11-05)
  → 'github:NixOS/nixpkgs/5e4fbfb6b3de1aa2872b76d49fafc942626e2add?narHash=sha256-OZiZ3m8SCMfh3B6bfGC/Bm4x3qc1m2SVEAlkV6iY7Yg%3D' (2024-11-15)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-11-18 06:08:20 -08:00
0cc4m 9b75f03cd2 Vulkan: Fix device info output format specifiers (#10366)
* Vulkan: Fix device info output format specifiers

* Vulkan: Use zu printf specifier for size_t instead of ld
2024-11-18 11:02:43 +01:00
Johannes Gäßler 75207b3a88 docker: use GGML_NATIVE=OFF (#10368) 2024-11-18 00:21:53 +01:00
Johannes Gäßler 76e9e58b78 CUDA: fix MMV kernel being used for FP16 src1 (#10357) 2024-11-17 23:20:42 +01:00
Johannes Gäßler ce2e59ba10 CMake: fix typo in comment [no ci] (#10360) 2024-11-17 12:59:38 +01:00
Diego Devesa be5caccef9 llama : only use default buffer types for the KV cache (#10358) 2024-11-17 12:25:45 +01:00
Georgi Gerganov 20a780c7b6 gitignore : ignore local run scripts [no ci] 2024-11-17 13:12:22 +02:00
Georgi Gerganov cf32a9b93a metal : refactor kernel args into structs (#10238)
* metal : add kernel arg structs (wip)

* metal : fattn args

ggml-ci

* metal : cont + avoid potential int overflow [no ci]

* metal : mul mat struct (wip)

* cont : mul mat vec

* cont : pass by reference

* cont : args is first argument

* cont : use char ptr

* cont : shmem style

* cont : thread counters style

* cont : mul mm id

ggml-ci

* cont : int safety + register optimizations

ggml-ci

* metal : GGML_OP_CONCAT

ggml-ci

* metal : GGML_OP_ADD, GGML_OP_SUB, GGML_OP_MUL, GGML_OP_DIV

* metal : GGML_OP_REPEAT

* metal : GGML_OP_CPY

* metal : GGML_OP_RMS_NORM

* metal : GGML_OP_NORM

* metal : add TODOs for rest of ops

* ggml : add ggml-metal-impl.h

ggml-ci
2024-11-17 11:23:01 +02:00
FirstTimeEZ a43178299c ggml : fix undefined reference to 'getcpu' (#10354)
https://github.com/ggerganov/llama.cpp/issues/10352
2024-11-17 10:39:22 +02:00
Johannes Gäßler c3ea58aca4 CUDA: remove DMMV, consolidate F16 mult mat vec (#10318) 2024-11-17 09:09:55 +01:00
Johannes Gäßler 467576b6cc CMake: default to -arch=native for CUDA build (#10320) 2024-11-17 09:06:34 +01:00
Diego Devesa eda7e1d4f5 ggml : fix possible buffer use after free in sched reserve (#9930) 2024-11-17 08:31:17 +02:00
Georgi Gerganov 24203e9dd7 ggml : inttypes.h -> cinttypes (#0)
ggml-ci
2024-11-17 08:30:29 +02:00
Georgi Gerganov 5d9e59979c ggml : adapt AMX to tensor->grad removal (#0)
ggml-ci
2024-11-17 08:30:29 +02:00
Georgi Gerganov a4200cafad make : add ggml-opt (#0)
ggml-ci
2024-11-17 08:30:29 +02:00
Georgi Gerganov 84274a10c3 tests : remove test-grad0 2024-11-17 08:30:29 +02:00
Georgi Gerganov 68fcb4759c ggml : fix compile warnings (#0)
ggml-ci
2024-11-17 08:30:29 +02:00
Johannes Gäßler 8a43e940ab ggml: new optimization interface (ggml/988) 2024-11-17 08:30:29 +02:00
Georgi Gerganov 5c9a8b22b1 scripts : update sync 2024-11-17 08:30:29 +02:00
FirstTimeEZ 0fff7fd798 docs : vulkan build instructions to use git bash mingw64 (#10303) 2024-11-17 00:29:18 +01:00
Johannes Gäßler 4e54be0ec6 llama/ex: remove --logdir argument (#10339) 2024-11-16 23:00:41 +01:00
Georgi Gerganov db4cfd5dbc llamafile : fix include path (#0)
ggml-ci
2024-11-16 20:36:26 +02:00
Georgi Gerganov 8ee0d09ae6 make : auto-determine dependencies (#0) 2024-11-16 20:36:26 +02:00
MaggotHATE bcdb7a2386 server: (web UI) Add samplers sequence customization (#10255)
* Samplers sequence: simplified and input field.

* Removed unused function

* Modify and use `settings-modal-short-input`

* rename "name" --> "label"

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-11-16 14:26:54 +01:00
Georgi Gerganov f245cc28d4 scripts : fix missing key in compare-llama-bench.py (#10332) 2024-11-16 10:32:50 +02:00
Jeff Bolz 772703c8ff vulkan: Optimize some mat-vec mul quant shaders (#10296)
Compute two result elements per workgroup (for Q{4,5}_{0,1}). This reuses
the B loads across the rows and also reuses some addressing calculations.
This required manually partially unrolling the loop, since the compiler
is less willing to unroll outer loops.

Add bounds-checking on the last iteration of the loop. I think this was at
least partly broken before.

Optimize the Q4_K shader to vectorize most loads and reduce the number of
bit twiddling instructions.
2024-11-16 07:26:57 +01:00
FirstTimeEZ dd3a6ce9f8 vulkan : add cmake preset debug/release (#10306) 2024-11-16 02:59:33 +01:00
Dan Johansson 1e58ee1318 ggml : optimize Q4_0 into Q4_0_X_Y repack (#10324) 2024-11-16 01:53:37 +01:00
FirstTimeEZ 89e4caaaf0 llama : save number of parameters and the size in llama_model (#10286)
fixes #10285
2024-11-16 01:42:13 +01:00
Srihari-mcw 74d73dc85c Make updates to fix issues with clang-cl builds while using AVX512 flags (#10314) 2024-11-15 22:27:00 +01:00
Johannes Gäßler 4047be74da scripts: update compare-llama-bench.py (#10319) 2024-11-15 21:19:03 +01:00
slaren 883d206fbd ggml : fix some build issues 2024-11-15 21:45:32 +02:00
Georgi Gerganov 09ecbcb596 cmake : fix ppc64 check (whisper/0)
ggml-ci
2024-11-15 15:44:06 +02:00
thewh1teagle 3225008973 ggml : vulkan logs (whisper/2547) 2024-11-15 15:44:06 +02:00
Georgi Gerganov cbf5541a82 sync : ggml 2024-11-15 15:44:06 +02:00
Eve 18429220bd AVX BF16 and single scale quant optimizations (#10212)
* use 128 bit loads (i've tried 256->128 to death and its slower)

* double accumulator

* avx bf16 vec dot

* +3% q4_0 inference

* +7% tg +5% pp compared to master

* slower f16c version, kep for reference

* 256b version, also slow. i tried :)

* revert f16

* faster with madd

* split to functions

* Q8_0 and IQ4_NL, 5-7% faster

* fix potential overflow (performance reduced)

* 16 bit add for q4_0 only

* merge
2024-11-15 12:47:58 +01:00
R0CKSTAR f0204a0ec7 ci: build test musa with cmake (#10298)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-11-15 12:47:25 +01:00
Romain Biessy 57f8355b29 sycl: Update Intel docker images to use DPC++ 2025.0 (#10305) 2024-11-15 13:10:45 +02:00
Xuan Son Nguyen 9901068ac7 server : (web UI) add copy button for code block, fix api key (#10242)
* server : (web ui) add copy btn for code blocks

* fix problem with api key

* use settings-modal-short-input component

* always show copy btn for code snippet
2024-11-15 10:48:49 +01:00
Chenguang Li 231f9360d9 cann: dockerfile and doc adjustment (#10302)
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2024-11-15 15:09:35 +08:00
Georgi Gerganov 4802ad350b scripts : fix regex in sync [no ci] 2024-11-15 08:38:43 +02:00
Romain Biessy 5a54af4d4f sycl: Use syclcompat::dp4a (#10267)
* sycl: Use syclcompat::dp4a

* Using the syclcompat version allow the compiler to optimize the
  operation with native function

* Update news section

* Update CI Windows oneAPI version to 2025.0

* Reword doc

* Call syclcompat::dp4a inside dpct::dp4a

This reverts commit 90cb61d692.
2024-11-15 11:09:12 +08:00
Charles Xu 1607a5e5b0 backend cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels (#9921)
* backend-cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2024-11-15 01:28:50 +01:00
Diego Devesa ae8de6d50a ggml : build backends as libraries (#10256)
* ggml : build backends as libraries

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>
2024-11-14 18:04:35 +01:00
Johannes Gäßler 4a8ccb37ad CUDA: no -sm row for very small matrices (#10185) 2024-11-14 13:00:15 +01:00
Georgi Gerganov 2a82891a85 speculative : fix out-of-bounds access (#10289) 2024-11-14 11:44:15 +02:00
Jeff Bolz af148c9386 vulkan: Optimize binary ops (#10270)
Reuse the index calculations across all of src0/src1/dst. Add a shader
variant for when src0/src1 are the same dimensions and additional modulus
for src1 aren't needed. Div/mod are slow, so add "fast" div/mod that
have a fast path when the calculation isn't needed or can be done more
cheaply.
2024-11-14 06:22:55 +01:00
Jeff Bolz 66798e42fb vulkan: Use macros to make the mat mul pipeline creation more concise (#10259)
Also add vk_matmul_pipeline2 to hold f16/f32 accumulator versions of a
pipeline. This isn't really used yet.
2024-11-13 21:59:47 +01:00
Michael Podvitskiy fb4a0ec083 llama : propagate the results of graph_compute (#9525)
* llama: propagating the results of `graph_compute` to the user interface

* llama: reverting kv_cache in case of failed compute

* llama: `llama_kv_cache_state` was removed, only the result of `llama_graph_compute` is returned

* llama: restore a kv_cache in case of failed computation

* llama: correct reverting of the entire batch.
also updates `llama_kv_cache_find_slot`, will correctly count the number of `used` cells for recurrent models

* llama: updated comments

* llama : add comments about KV cache state after error

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-11-13 20:00:35 +02:00
Georgi Gerganov 5ea926dad7 sync : ggml 2024-11-13 18:11:54 +02:00
Small Grass Forest 1ee9eea094 docs : update bindings list (#10261)
Signed-off-by: tianzixuan <tianzixuan335@hellobike.com>
2024-11-13 13:17:10 +02:00
Alexey Parfenov ff7fb670d0 server : add missing docs (#10269) 2024-11-13 13:16:30 +02:00
Jhen-Jie Hong 0e712a5acb server : fix incorrect res in validate_model_chat_template (#10272)
* server : fix validate_model_chat_template

* server : fix chat res
2024-11-13 13:15:23 +02:00
Brian a0ec17b32e metadata: Detailed Dataset Authorship Metadata (#8875)
Converter script can now read these two fields as a detailed base model and dataset source.
This was done so that it will be easier for Hugging Face to integrate detailed metadata as needed.

 -  base_model_sources (List[dict], optional)
 -  dataset_sources (List[dict], optional)

Dataset now represented as:

 - general.dataset.count
 - general.dataset.{id}.name
 - general.dataset.{id}.author
 - general.dataset.{id}.version
 - general.dataset.{id}.organization
 - general.dataset.{id}.description
 - general.dataset.{id}.url
 - general.dataset.{id}.doi
 - general.dataset.{id}.uuid
 - general.dataset.{id}.repo_url

This also adds to base model these metadata:

 - general.base_model.{id}.description
2024-11-13 21:10:38 +11:00
Alberto Cabrera Pérez 2e82ffa4af sycl : Fixes to broken builds and test-backend-ops (#10257)
* Fixes broken build for the SYCL CUDA backend caused by non-explicit gemm call in outprod (merged in with RWKV6 in
Optimize RWKV6 Operator Naming and Implement Multi-core CPU/ SYCL Acceleration #10133)

* Marks permuted MUL_MAT as unsupported to be able to run test-backend-ops

* Fixes asserts in norm to fix debug builds.
2024-11-13 09:40:57 +00:00
Jeff Bolz 80dd7ff22f vulkan: Optimize contiguous copies (#10254)
* tests: Fix memory bandwidth calculation for perf tests

Add a flops calculation for flash attention.

Add one GGML_OP_CPY perf test.

* vulkan: Optimize contiguous copies

Add a variant of the copy shader for when the tensors are contiguous. Avoid
the complex addressing calculations, and do four elements per invocation
to hide some other overhead.

Apply similar changes to the scale shader, since scale is always contiguous.

Add a "progress bar" for shader compiles.
2024-11-13 07:58:57 +01:00
Jeff Bolz 54ef9cfc72 vulkan: Throttle the number of shader compiles during the build step. (#10222)
Fixes #9582

Spawning too many concurrent copies of glslc leads to "Failed to create pipes"
errors on Linux. This change applies the same throttling we use for
multithreaded pipeline creation.
2024-11-11 18:13:51 +01:00
Georgi Gerganov b0cefea58a metal : more precise Q*K in FA vec kernel (#10247) 2024-11-11 08:39:13 +02:00
Georgi Gerganov b141e5f6ef server : enable KV cache defrag by default (#10233)
ggml-ci
2024-11-11 08:38:43 +02:00
Georgi Gerganov 4b3a9212b6 flake.lock: Update (#10243)
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/807e9154dcb16384b1b765ebe9cd2bba2ac287fd?narHash=sha256-l253w0XMT8nWHGXuXqyiIC/bMvh1VRszGXgdpQlfhvU%3D' (2024-10-29)
  → 'github:NixOS/nixpkgs/4aa36568d413aca0ea84a1684d2d46f55dbabad7?narHash=sha256-Zwl8YgTVJTEum%2BL%2B0zVAWvXAGbWAuXHax3KzuejaDyo%3D' (2024-11-05)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-11-10 11:45:25 -08:00
MaggotHATE 505f33274d server : (web UI) Add back sampler settings (#10239)
* Add back samplers to server

* Added tooltips with basic information

* Fixed stretching of input fields.

* use component for settings input, move help msg to tooltips

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-11-10 15:42:25 -04:00
Jeff Bolz 160687b3ed vulkan: Fix newly added tests for permuted mul_mat and 1D im2col (#10226) 2024-11-10 12:37:56 +01:00
Georgi Gerganov 6423c65aa8 metal : reorder write loop in mul mat kernel + style (#10231)
* metal : reorder write loop

* metal : int -> short, style

ggml-ci
2024-11-09 11:53:13 +02:00
Georgi Gerganov 39a334a9aa metal : fix build and some more comments (#10229) 2024-11-09 11:53:02 +02:00
Georgi Gerganov bb38cdd8ba metal : fix F32 accumulation in FA vec kernel (#10232) 2024-11-09 11:52:45 +02:00
Georgi Gerganov f018acba22 llama : fix Qwen model type strings 2024-11-09 11:26:34 +02:00
Georgi Gerganov 46323fa9ef metal : hide debug messages from normal log 2024-11-09 11:21:49 +02:00
SXX 5b359bb1e3 ggml: fix zero division in ‘dne’ calculation in CUDA COUNT_EQUAL operator when ‘ne’ is small (#10213) 2024-11-09 08:35:46 +01:00
amritahs-ibm e89213492d ggml : optimize llamafile cpu matrix multiplication for ppc64le (#10156)
This change upstreams llamafile's cpu matrix
multiplication kernels for ppc64le using MMA
builtins for FP32 datatype.

This change results in a consistent 90%
improvement in input processing time, and 20%
to 80% improvement in output processing time,
across various batch sizes.

The patch is tested with Meta-Lllama-3-8B,
Mistral-7B, Llama-2-7B-chat-hf models on a
IBM POWER10 machine.

Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
2024-11-09 09:17:50 +02:00
haopeng 8fc393f246 scripts : fix pattern and get n_tokens in one go (#10221) 2024-11-09 09:06:54 +02:00
Georgi Gerganov ec450d3bbf metal : opt-in compile flag for BF16 (#10218)
* metal : opt-in compile flag for BF16

ggml-ci

* ci : use BF16

ggml-ci

* swift : switch back to v12

* metal : has_float -> use_float

ggml-ci

* metal : fix BF16 check in MSL

ggml-ci
2024-11-08 21:59:46 +02:00
Georgi Gerganov 695ad752b2 metal : improve clarity (minor) (#10171) 2024-11-08 18:37:41 +02:00
Georgi Gerganov 841f27abdb metal : optimize FA kernels (#10171)
* ggml : add ggml_flash_attn_ext_get_prec

* metal : use F16 precision in FA kernels

ggml-ci

* metal : minor clean-up

* metal : compile-guard bf16 FA kernels

ggml-ci

* build : remove obsolete compile flag [no ci]

* metal : prevent int overflows [no ci]

* cuda : disable BF16 FA

ggml-ci

* metal : fix BF16 requirement for FA kernels

ggml-ci

* make : clean-up [no ci]
2024-11-08 13:47:22 +02:00
Jhen-Jie Hong d05b3127bd swift : exclude ggml-metal-embed.metal (#10211)
* llama.swift : exclude ggml-metal-embed.metal

* swift : exclude build/
2024-11-08 11:34:06 +02:00
Xuan Son Nguyen 76c6e7f105 server : minor UI fix (#10207) 2024-11-07 18:44:38 -04:00
Xuan Son Nguyen a71d81cf8c server : revamp chat UI with vuejs and daisyui (#10175)
* server : simple chat UI with vuejs and daisyui

* move old files to legacy folder

* embed deps into binary

* basic markdown support

* add conversation history, save to localStorage

* fix bg-base classes

* save theme preferences

* fix tests

* regenerate, edit, copy buttons

* small fixes

* docs: how to use legacy ui

* better error handling

* make CORS preflight more explicit

* add GET method for CORS

* fix tests

* clean up a bit

* better auto scroll

* small fixes

* use collapse-arrow

* fix closeAndSaveConfigDialog

* small fix

* remove console.log

* fix style for <pre> element

* lighter bubble color (less distract when reading)
2024-11-07 17:31:10 -04:00
Georgi Gerganov eec4d71737 scripts : add amx to sync-ggml.sh [no ci] 2024-11-07 23:11:36 +02:00
Georgi Gerganov 3b08828674 sync : ggml 2024-11-07 23:08:24 +02:00
Georgi Gerganov a2c6fd747c scripts : sync update 2024-11-07 23:07:55 +02:00
Diego Devesa 97404c4a03 ggml : add ggml-cpu.h to the public headers (#10204) 2024-11-07 18:16:08 +01:00
Faisal Zaghloul 60e17ce23c Remove identical wte/etw logic for jais (#10203) 2024-11-07 08:46:12 -08:00
wwoodsTM 5107e8cea3 DRY: Fixes clone functionality (#10192) 2024-11-07 16:20:25 +01:00
snadampal 2319126a70 fix q4_0_8_8 format for corrupted tokens issue (#10198)
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-62-167.us-west-2.compute.internal>
2024-11-07 09:02:08 +01:00
Zhiyuan Li 3bcd40b3c5 Optimize RWKV6 Operator Naming and Implement Multi-core CPU/ SYCL Acceleration (#10133)
* rwkv6: rename to wkv6

* rwkv6: support avx2 avx512 armv8 armv9

* rwkv6: update cuda file name

* rwkv6: rename params

* wkv on sycl

* sycl: add some ops

* sycl: Enhance OP support judgment

* wkv6: drop armv9 and tranfer to GGML style

ggml-ci

* sync : ggml

* update the function to use appropriate types

* fix define error

* Update ggml/src/ggml-cpu.c

* add appropriate asserts

* move element-wise functions outside

* put the declaration outside the loop

* rewrite to be more inline with the common pattern for distributing threads

* use recommended way GGML_TENSOR_LOCALS

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
Co-authored-by: Plamen Minev <pacominev@gmail.com>
Co-authored-by: Yuri Khrustalev <ykhrustalev@users.noreply.github.com>
Co-authored-by: Meng, Hengyu <airdldl@163.com>
2024-11-07 15:19:10 +08:00
Georgi Gerganov 5c333e0140 metal : add BF16 support (#8439)
* ggml : add initial BF16 support

ggml-ci

* metal : add mul_mat_id BF16 support

ggml-ci

* metal : check for bfloat support on the Metal device

ggml-ci

* metal : better var names [no ci]

* metal : do not build bfloat kernels when not supported

ggml-ci

* metal : try to fix BF16 support check

ggml-ci

* metal : this should correctly check bfloat support
2024-11-06 19:53:51 +02:00
Georgi Gerganov b11f9ba9b8 server : remove hack for extra parallel slot (#10187)
ggml-ci
2024-11-06 13:29:01 +02:00
Diego Devesa 94d8cb8be1 metal : fix from ptr buffer name (#10189) 2024-11-06 12:10:07 +01:00
Georgi Gerganov 1dc04b2dee ggml : adjust is_first_call init value (#10193)
ggml-ci
2024-11-06 11:20:10 +02:00
Georgi Gerganov a1eaf6a960 metal : add quantized FA support (#10149)
* metal : add quantized FA (vec) support

ggml-ci

* metal : add quantized FA (non-vec) support

* metal : fix support check

ggml-ci

* metal : clean-up

* metal : clean-up (cont)

* metal : fix shared memory calc + reduce smem + comments

* metal : float-correctness

* metal : minor [no ci]
2024-11-06 10:24:23 +02:00
Gabe Goodhart b8deef0ec0 llama : add <|tool_call|> formatting to Granite template (#10177)
Branch: GraniteToolCallTemplate

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-11-05 14:23:04 +02:00
Diego Devesa a9e8a9a030 ggml : fix arch check in bf16_to_fp32 (#10164) 2024-11-04 23:17:01 +01:00
Eve 3407364776 Q6_K AVX improvements (#10118)
* q6_k instruction reordering attempt

* better subtract method

* should be theoretically faster

small improvement with shuffle lut, likely because all loads are already done at that stage

* optimize bit fiddling

* handle -32 offset separately. bsums exists for a reason!

* use shift

* Update ggml-quants.c

* have to update ci macos version to 13 as 12 doesnt work now. 13 is still x86
2024-11-04 23:06:31 +01:00
Diego Devesa d5a409e57f ggml : fix gelu tables initialization (#10172) 2024-11-04 20:06:58 +01:00
Diego Devesa 401558b7ba ggml : fix q4xx mat mul, increase ggml_aligned_malloc alignment (#10167) 2024-11-04 17:34:08 +01:00
Xuan Son Nguyen 9e0ecfb697 server : clarify /slots endpoint, add is_processing (#10162)
* server : clarify /slots endpoint, add is_processing

* fix tests
2024-11-04 16:33:29 +01:00
snadampal 6a066b9978 fix build break on arm64 linux (#10166)
This fixes the build break from the recent changes
to move the CPU backend to separate files
https://github.com/ggerganov/llama.cpp/pull/10144
2024-11-04 16:08:33 +01:00
Diego Devesa ea02c753eb cuda : clear error after changing peer access (#10153) 2024-11-04 13:10:23 +01:00
Georgi Gerganov 05697f670b metal : simplify f16 and f32 dequant kernels (#0) 2024-11-04 13:49:34 +02:00
Georgi Gerganov f8e58135cf metal : move dequantize templates to beginning of MSL source (#0) 2024-11-04 13:44:06 +02:00
leo-pony 329ed914c9 CANN: adjust backend registry refactor. (#10158)
remove buffer->iface.get_name that used in cann as it was removed in backend registry refactor PR.
2024-11-04 19:08:22 +08:00
Georgi Gerganov ce027adfb3 sync : ggml 2024-11-04 10:33:37 +02:00
Yuri Khrustalev 284e5b0275 cmake : make it possible linking ggml as external lib (ggml/1003) 2024-11-04 10:33:11 +02:00
Plamen Minev e2292aaa17 metal : fix minor string leaks (ggml/1004) 2024-11-04 10:33:10 +02:00
Diego Devesa 9f40989351 ggml : move CPU backend to a separate file (#10144) 2024-11-03 19:34:08 +01:00
Georgi Gerganov 08828a6d7d metal : minor fixup in FA kernel (#10143)
* metal : minor fixup in FA kernel

ggml-ci

* metal : use the unrolled loop variable

* metal : remove unused var
2024-11-03 15:18:40 +02:00
Georgi Gerganov 1839f69130 flake.lock: Update (#10146) 2024-11-03 05:14:15 -08:00
Christian Köhnenkamp 9830b6923b Add apple arm to presets (#10134)
* Add apple arm to presets

* Add final new line
2024-11-02 15:35:31 -07:00
sasha0552 42cadc74bd server : fix slot selection by lru (#10126)
* server : fix slot selection by lru, migrate lcs to `size_t`

* minor debug log fix
2024-11-02 18:34:56 +02:00
Georgi Gerganov 45950415ed server : fix endpoint checks (#10135)
ggml-ci
2024-11-02 18:34:00 +02:00
Georgi Gerganov 1926d6e39d llama : adjust default context size + print warnings (#10136)
* llama : adjust default context size + print warnings

ggml-ci

* ggml-ci : add missing gpu-layers + adjust context sizes
2024-11-02 15:18:56 +02:00
Diego Devesa b634f8a26f simple-chat : only add bos on first prompt (#10129) 2024-11-02 13:08:53 +01:00
Xuan Son Nguyen 7554aa4655 convert-lora : make --base optional (#10110)
* convert-lora : make `--base` optional

* lint

* handle case where base_model_name_or_path is invalid

* do not include metadata from base model

* clarify unspecified --base

* add small comment [no ci]

* trigger ci
2024-11-02 12:53:17 +01:00
Diego Devesa a6744e43e8 llama : add simple-chat example (#10124)
* llama : add simple-chat example

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-11-01 23:50:59 +01:00
Diego Devesa e991e3127f llama : use smart pointers for ggml resources (#10117) 2024-11-01 23:48:26 +01:00
Shupei Fan 418f5eef26 vulkan : improve ggml_vk_create_buffer error handling (#9898) 2024-11-01 19:33:14 +01:00
Georgi Gerganov ba6f62eb79 readme : update hot topics 2024-11-01 17:31:51 +02:00
sasha0552 d865d1478c server : fix smart selection of available slot (#10120)
* Fix smart selection of available slot

* minor fix

* replace vectors of tokens with shorthands
2024-11-01 14:33:14 +01:00
Georgi Gerganov 1804adb0cf ggml : remove ggml_scratch (#10121)
ggml-ci
2024-11-01 12:58:45 +02:00
Georgi Gerganov 815fe72adc sync : ggml 2024-11-01 10:28:24 +02:00
Georgi Gerganov f221d56220 ggml : alloc ggml_contexts on the heap (whisper/2525) 2024-11-01 10:24:50 +02:00
Zhenwei Jin e597e50794 build: fix build error in Windows env with OneAPI setup (#10107) 2024-11-01 11:09:59 +08:00
Diego Devesa 85679d37f3 llama : improve output buffer type selection (#10098) 2024-11-01 00:49:53 +01:00
Diego Devesa 1e9f94994e quantize : fix --keep-split (#10114) 2024-11-01 00:45:34 +01:00
Diego Devesa c02e5ab2a6 llama : fix buffer checks for mamba and rwk (#10111)
* llama : fix buffer checks for mamba and rwk

* llama : fix missing worst case flag during reserve

* cuda : fix supports_op for norm

* disable sched SET_CAUSE
2024-10-31 22:54:23 +01:00
Zhenwei Jin ab3d71f97f loader: refactor tensor weights storage (#9935)
* loader: refactor tensor weights storage

* use sorted map, sort weights by layer

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-10-31 19:50:39 +01:00
Kevin Gibbons 0a683e8088 server : include scheme when printing URL (#10106) 2024-10-31 14:02:35 +01:00
Diego Devesa dea5e86051 ggml : check tensor name lengths in gguf files (#10100) 2024-10-31 11:40:59 +01:00
Sergio López 1329c0a75e kompute: add mul_mat_q4_k shader (#10097)
This is a more or less direct translation from the Metal implementation
to GLSL.

Signed-off-by: Sergio Lopez <slp@redhat.com>
2024-10-31 11:09:52 +02:00
Sergio López 61408e7fad kompute: add backend registry / device interfaces (#10045)
Get in line with the other backends by supporting the newer
backend/device registry interfaces.

Signed-off-by: Sergio Lopez <slp@redhat.com>
2024-10-30 17:01:52 +01:00
Diego Devesa b9e02e8184 ggml : fix memory leaks when loading invalid gguf files (#10094)
* ggml : fix gguf string leak when reading kv pairs fails

* ggml : avoid crashing with GGML_ABORT when the KV has an invalid type

* ggml : avoid crashing on failed memory allocations when loading a gguf file
2024-10-30 14:51:21 +01:00
Rich Dougherty 6763f713bb readme : more lora detail in main example readme (#10064) 2024-10-30 13:22:39 +01:00
Rich Dougherty 79a2bc042d convert : more detailed convert lora usage docs (#10065) 2024-10-30 13:22:21 +01:00
xctan fc83a9e584 ggml : add Q4_0_8_8 RISC-V GEMV and GEMM kernels (#10029)
* ggml : RISC-V vector gemv for q4_0_8x8

* ggml : Added WIP rvv q4_0_8x8 gemm

* ggml : Added initial implementation of rvv gemm

* ggml : optimize gemm to avoid register spillover

* ggml : Fix GCC rvv load alignment issue

* ggml : Format gemm rvv code

* ggml : Fix a typo in RVV q4_0_8_8 GEMM
2024-10-30 09:00:40 +02:00
Diego Devesa c5b0f4b5d9 llama : refactor model loader with backend registry (#10026) 2024-10-30 02:01:23 +01:00
Changyeon Kim 8f275a7c45 ggml: Add POOL2D OP for GPU acceleration to the Vulkan backend in the MobileVLM model. (#9763)
* ggml: Add POOL2D OP for GPU ACC to the Vulkan.

- The MobileVLM model now supports inference acceleration through GPU by utilizing the Vulkan backend.
- A GGML_OP_POOL_2D shader has been added. (Pooling)
- The encoding performance of the CLIP model improved from 2.8s on the CPU to 0.7s on the GPU.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

* [fix] Correct the incorrect order of the parameters.

fix casting to int.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

---------

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
2024-10-29 09:52:56 +01:00
Georgi Gerganov 8d8ff71536 llama : remove Tail-Free sampling (#10071)
ggml-ci
2024-10-29 10:42:05 +02:00
arch-btw 61715d5cc8 llama : Add IBM granite template (#10013)
* Add granite template to llama.cpp

* Add granite template to test-chat-template.cpp

* Update src/llama.cpp

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* Update tests/test-chat-template.cpp

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* Added proper template and expected output

* Small change to \n

Small change to \n

* Add code space &

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* Fix spacing

* Apply suggestions from code review

* Update src/llama.cpp

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-10-28 18:45:33 +01:00
Georgi Gerganov 07028f9d74 flake.lock: Update (#10063)
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/4c2fcb090b1f3e5b47eaa7bd33913b574a11e0a0?narHash=sha256-/uilDXvCIEs3C9l73JTACm4quuHUsIHcns1c%2BcHUJwA%3D' (2024-10-18)
  → 'github:NixOS/nixpkgs/2768c7d042a37de65bb1b5b3268fc987e534c49d?narHash=sha256-AlcmCXJZPIlO5dmFzV3V2XF6x/OpNWUV8Y/FMPGd8Z4%3D' (2024-10-23)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-10-28 08:41:24 -07:00
R0CKSTAR 524afeec9d musa: workaround for Guilty Lockup in cleaning src0 (#10042)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-10-28 10:02:48 +01:00
Georgi Gerganov 8125e6cbfc server : don't overfill the batch during infill (#10018)
ggml-ci
2024-10-28 08:49:32 +02:00
Georgi Gerganov 8841ce3f43 llama : switch KQ multiplication to F32 precision by default (#10015)
ggml-ci
2024-10-27 20:59:58 +02:00
Georgi Gerganov cc2983d375 sync : ggml 2024-10-26 10:34:08 +03:00
bssrdf 8c60a8a462 increase cuda_cpy block size (ggml/996)
Co-authored-by: bssrdf <bssrdf@gmail.com>
2024-10-26 10:33:56 +03:00
Georgi Gerganov 9e4a2563ea scripts : fix amx sync [no ci] 2024-10-26 10:33:31 +03:00
Georgi Gerganov 668750357e metal : support permuted matrix multiplicaions (#10033)
* metal : support permuted matrix multiplicaions

ggml-ci

* cont : use nb01 directly for row steps

ggml-ci

* cont : add comments [no ci]

* metal : minor refactor

* metal : minor
2024-10-25 22:26:15 +03:00
wwoodsTM ff252ea48e llama : add DRY sampler (#9702)
* sampling : add DRY sampler (post-refactor)

* DRY: Trying to fix coauthors, removed unneeded line

* DRY: Fixed redundant code

* DRY: Fixed crash issue due to DRY being in chain but uninitialized

---------

Co-authored-by: l3utterfly <gc.pthzfoldr@gmail.com>
Co-authored-by: pi6am <34464159+pi6am@users.noreply.github.com>
2024-10-25 19:07:34 +03:00
Michael Podvitskiy d80fb71f8b llama: string_split fix (#10022)
* llama: Refactor string_split to use template specialization,  fixes parsing strings with spaces

* llama: Add static_assert in the string_split template to ensure the correct template specialization is used for std::string
2024-10-25 17:57:54 +02:00
Srihari-mcw 2f8bd2b901 llamafile : extend sgemm.cpp support for Q5_0 models (#10010) 2024-10-25 10:27:41 +03:00
Georgi Gerganov bc5ba007b2 server : check that the prompt fits in the slot's context (#10030)
ggml-ci
2024-10-25 10:13:46 +03:00
Xuan Son Nguyen 958367bf53 server : refactor slot input data, move tokenizer to HTTP thread (#10023)
* server : refactor slot input data, move tokenizer to HTTP thread

* move prompt_tokens.empty() check

* fix incorrect if branch

* fix infinite generation loop

* bring back infill validation

* add infill test

* try fixing format_infill

* fix test

* remove redundant code

* rename completion to inference

* update docs

* use llama_tokens everywhere
2024-10-24 21:51:22 +02:00
Georgi Gerganov 40f2555797 ci : fix cmake flags for SYCL 2024-10-24 21:23:33 +03:00
Johannes Gäßler 167a515651 CUDA: fix insufficient buffer clearing for MMQ (#10032) 2024-10-24 14:40:23 +02:00
Johannes Gäßler c39665f589 CUDA: fix MMQ for non-contiguous src0, add tests (#10021)
* CUDA: fix MMQ for non-contiguous src0, add tests

* revise test code
2024-10-24 11:09:36 +02:00
wwoodsTM 0a1c750c80 server : samplers accept the prompt correctly (#10019) 2024-10-23 22:27:51 +03:00
Georgi Gerganov 190a37d797 sync : ggml 2024-10-23 17:23:55 +03:00
Georgi Gerganov 2d3aba9ee8 llama.vim : bump generation time limit to 3s [no ci] 2024-10-23 17:16:56 +03:00
Johannes Gäßler 80273a306d CUDA: fix 1D im2col, add tests (ggml/993) 2024-10-23 16:50:02 +03:00
Daniel Bevenius c19af0acb1 ggml : remove redundant set of contexts used field (ggml/978)
This commit removes the setting of the `used` field of the contexts in
the global state (g_state) in `ggml_init`.

The motivation for this change is that I believe that this additional
initialization might not be required after the changes in Commit
45fc4fed0b9fb5b1af4a8525cbebb95e11208732 ("sync : latest changes from
whisper.cpp"), which changed the initialization of the contexts field
from `{ 0 }` to `{ { 0 } }`:

```console
             g_state = (struct ggml_state) {
-                /*.contexts =*/ { 0 },
+                /*.contexts =*/ { { 0 } },
             };
```
My understanding is that the `{0}` initialization might not have
zero-initialized all the nested fields in every array element because of
compiler differences, and might have been the reason for having the
explicit setting of the `used` fields to false.
2024-10-23 16:50:02 +03:00
Michael Coppola ac113a0fee llama.vim : add classic vim support (#9995)
* added classic vim support

* fixed ring update, removed blank line

* minor

* minor

* minor doc update

* removed uneeded var

* minor

* minor

* fixed job_start creating new scratch buffers

* fixed job_start creating new scratch buffers

* fixed ghost text indenting when expandtab is on

* removed unused code

* minor

* unified fim_on_exit

* minor

* vim ghost text rendering now uses pos_x and pos_y parameters

* renamed *_hlgroup to hlgroup_*

* renamed *_ghost_text to ghost_text_*, moved nvim/vim detection to llama#init()

* minor

---------

Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
2024-10-23 14:09:26 +03:00
Jun Hee Yoo 4c9388fb96 metal : add POOL2D and fix IM2COL (#9943)
* add pool_2d

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* fix im2col and add unittest for N>=1024

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* add tests for N % 1024 != 0

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* remove trailing whitespaces

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* apply suggestions

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* apply more optimization

- original IM2COL kernel + _ext with MIN()

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* apply review: change kernel name of pool_2d

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* apply review

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* fix more formatting and enhance readability

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

---------

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>
2024-10-23 13:33:45 +03:00
github-actions[bot] 873279b159 flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/5633bcff0c6162b9e4b5f1264264611e950c8ec7?narHash=sha256-9UTxR8eukdg%2BXZeHgxW5hQA9fIKHsKCdOIUycTryeVw%3D' (2024-10-09)
  → 'github:NixOS/nixpkgs/4c2fcb090b1f3e5b47eaa7bd33913b574a11e0a0?narHash=sha256-/uilDXvCIEs3C9l73JTACm4quuHUsIHcns1c%2BcHUJwA%3D' (2024-10-18)
2024-10-23 01:28:07 +00:00
Xuan Son Nguyen c8c07d658a llama : fix empty batch causing llama_batch_allocr to crash (#9966)
* llama : fix empty batch cause llama_batch_allocr to crash

* move batch_allocr inside decode/encode_internal

* fix build

* add GGML_ASSERT

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-10-22 16:59:02 +02:00
Daniel Bevenius 19d900a756 llama : rename batch to ubatch (#9950)
This commit renames the member field batch in llm_build_context to
ubatch, and also the parameter batch in llama_build_graph, and
llama_set_inputs to ubatch.

The motivation for this change is to make the code more readable
(considering there are the structs llama_batch and llama_sbatch), and
consistent with other parts of the code base where parameters/fields of
type llama_ubatch are named ubatch.
2024-10-22 16:31:06 +03:00
Molly Sophia 11d47057a5 Rwkv chat template fix (#10001)
* llama: remove useless template matching for rwkv-world

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* converter: Add comment about the hack for rwkv models

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Update src/llama.cpp

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-10-22 15:22:26 +02:00
Xuan Son Nguyen c421ac072d lora : warn user if new token is added in the adapter (#9948) 2024-10-22 13:08:41 +02:00
Molly Sophia 4ff7fe1fb3 llama : add chat template for RWKV-World + fix EOT (#9968)
* Add chat template for RWKV-World

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* RWKV: Fix the chat template not being used

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* RWKV v6: Set EOT token to ``\n\n``

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* readme: add rwkv into supported model list

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2024-10-22 13:33:37 +03:00
leo-pony 6b8447352d [CANN] Adapt to dynamically loadable backends mechanism (#9970)
* [CANN] Adapt to dynamically loadable backends mechanism

* Fix the Bug: inference running result is garbled in debug running model for LM models who's type is Q4_0 class

* Handle the review comments of this pull request
2024-10-22 16:16:01 +08:00
Daniel Bevenius 674804a996 arg : fix typo in embeddings argument help [no ci] (#9994)
This commit fixes two typos in the help text for the `--embd-normalize`
and `--embd-separator` arguments. It also updates common.h which contain
the same typo in two comments.
2024-10-22 10:40:02 +03:00
Georgi Gerganov e94a138d64 llama.vim : fix info text display [no ci] (#9787) 2024-10-22 00:37:55 +03:00
Georgi Gerganov e01c67affe llama.vim : move info to the right of screen [no ci] (#9787)
'eol' messes up the rendering with nvim v0.10.2 for some reason
2024-10-21 22:53:18 +03:00
Asghar Ghorbani 994cfb1acb readme : update UI list (#9972)
add PocketPal AI app
2024-10-21 21:20:59 +03:00
Daniel Bevenius 94008cc760 arg : fix attention non-causal arg value hint (#9985)
This commit updates the argument value hint for the `--attention`
argument to `non-causal`.

The motivation for this change is that the only values for this argument
are `causal` and `non-causal`.
2024-10-21 21:12:52 +03:00
Georgi Gerganov dbd5f2f573 llama.vim : plugin for Neovim (#9787) 2024-10-21 20:25:02 +03:00
Georgi Gerganov f594bc80ba ggml : add asserts for type conversion in fattn kernels (#9971)
ggml-ci
2024-10-21 16:20:46 +03:00
Radoslav Gerganov d5ebd79c76 rpc : pack only RPC structs (#9959) 2024-10-21 13:35:40 +03:00
Georgi Gerganov 55e47786e3 llama : default sampling changes + greedy update (#9897)
* llama : deprecate softmax sampler + fix dist sampler

ggml-ci

* tests : replace macros with functions

ggml-ci

* sampling : change temperature sampler logic

For t <= 0.0f, keep the max logit intact and set the rest to -inf

* cont : no need for special "greedy" logic

top-k == 1 is the same

* tests : init prob correctly

* llama : handle temp <= 0.0 in the temp_ext sampler too

ggml-ci

* cont : avoid extra loop in temperature sampler for sub-zero temp

ggml-ci
2024-10-21 09:46:40 +03:00
Georgi Gerganov bc21975084 speculative : fix handling of some input params (#9963)
* speculative : fix batch sizes at initialization

ggml-ci

* speculative : handle params.n_predict == -1

* speculative : limit batch size to llama_n_batch
2024-10-21 09:37:12 +03:00
Neo Zhang Jianyu 1db8c84fc6 fix mul_mat_vec_q and *_vec_q error (#9939)
Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>
2024-10-21 14:26:09 +08:00
Loïc Carrère 45f097645e readme : update bindings list (#9951)
Update the binding list by adding LM-Kit.NET (C# & VB.NET)
2024-10-20 19:25:41 +03:00
icppWorld 7cab2083c7 readme : update infra list (#9942)
llama_cpp_canister allows you to run llama.cpp as a Smart Contract on the Internet Computer. The smart contract runs as WebAssembly in a so-called 'canister'.
2024-10-20 19:01:34 +03:00
Xuan Son Nguyen cda0e4b648 llama : remove all_pos_0, all_pos_1, all_seq_id from llama_batch (#9745)
* refactor llama_batch_get_one

* adapt all examples

* fix simple.cpp

* fix llama_bench

* fix

* fix context shifting

* free batch before return

* use common_batch_add, reuse llama_batch in loop

* null terminated seq_id list

* fix save-load-state example

* fix perplexity

* correct token pos in llama_batch_allocr
2024-10-18 23:18:01 +02:00
Radoslav Gerganov afd9909a64 rpc : backend refactoring (#9912)
* rpc : refactor backend

Use structs for RPC request/response messages

* rpc : refactor server
2024-10-18 14:33:58 +03:00
Ouadie EL FAROUKI 87421a23e8 [SYCL] Add SYCL Backend registry, device and Event Interfaces (#9705)
* implemented missing SYCL event APIs

* sycl : Added device and backend reg interfaces

* Restructured ggml-sycl.cpp
2024-10-18 06:46:16 +01:00
Ma Mingfei 60ce97c9d8 add amx kernel for gemm (#8998)
add intel amx isa detection

add vnni kernel for gemv cases

add vnni and amx kernel support for block_q8_0

code cleanup

fix packing B issue

enable openmp

fine tune amx kernel

switch to aten parallel pattern

add error message for nested parallelism

code cleanup

add f16 support in ggml-amx

add amx kernels for QK_K quant formats: Q4_K, Q5_K, Q6_K and IQ4_XS

update CMakeList

update README

fix some compilation warning

fix compiler warning when amx is not enabled

minor change

ggml-ci

move ggml_amx_init from ggml.c to ggml-amx/mmq.cpp

ggml-ci

update CMakeLists with -mamx-tile, -mamx-int8 and -mamx-bf16

ggml-ci

add amx as an ggml-backend

update header file, the old path for immintrin.h has changed to ggml-cpu-impl.h

minor change

update CMakeLists.txt

minor change

apply weight prepacking in set_tensor method in ggml-backend

fix compile error

ggml-ci

minor change

ggml-ci

update CMakeLists.txt

ggml-ci

add march dependency

minor change

ggml-ci

change ggml_backend_buffer_is_host to return false for amx backend

ggml-ci

fix supports_op

use device reg for AMX backend

ggml-ci

minor change

ggml-ci

minor change

fix rebase

set .buffer_from_host_ptr to be false for AMX backend
2024-10-18 13:34:36 +08:00
Georgi Gerganov 8901755ba3 server : add n_indent parameter for line indentation requirement (#9929)
ggml-ci
2024-10-18 07:32:19 +03:00
Daniel Bevenius 6f55bccbb8 llama : rename batch_all to batch (#8881)
This commit addresses the TODO in the code to rename the `batch_all`
parameter to `batch` in `llama_decode_internal`.
2024-10-18 01:41:51 +02:00
Georgi Gerganov 17bb928080 readme : remove --memory-f32 references (#9925) 2024-10-17 23:43:05 +03:00
Georgi Gerganov 9f45fc1e99 llama : change warning to debug log 2024-10-17 23:27:42 +03:00
Georgi Gerganov 99bd4ac28c llama : infill sampling handle very long tokens (#9924)
* llama : infill sampling handle very long tokens

ggml-ci

* cont : better indices

ggml-ci
2024-10-17 22:32:47 +03:00
Tim Wang 3752217ed5 readme : update bindings list (#9918)
Co-authored-by: Tim Wang <tim.wang@ing.com>
2024-10-17 09:57:14 +03:00
Diego Devesa f010b77a37 vulkan : add backend registry / device interfaces (#9721)
* vulkan : add backend registry / device interfaces

* llama : print devices used on model load
2024-10-17 02:46:58 +02:00
Gilad S. 2194200278 fix: allocating CPU buffer with size 0 (#9917) 2024-10-17 01:34:22 +02:00
Gilad S. 73afe681aa fix: use vm_allocate to allocate CPU backend buffer on macOS (#9875)
* fix: use `vm_allocate` to allocate CPU backend buffer on macOS

* fix: switch to `posix_memalign` to keep existing `free()` usages work

* feat: move `GGML_ALIGNED_MALLOC` to `ggml-backend-impl.h`, add support for `vm_allocate` on macOS

* style: formatting

* fix: move const outside of `#ifndef`

* style: formatting

* fix: unused var

* fix: transform `GGML_ALIGNED_MALLOC` and `GGML_ALIGNED_FREE` into functions and add them to `ggml-impl.h`

* fix: unused var

* fix: page align to `GGUF_DEFAULT_ALIGNMENT`

* fix: page align to `TENSOR_ALIGNMENT`

* fix: convert `TENSOR_ALIGNMENT` to a macro

* fix: increase page size to `32` on iOS

* fix: iOS page size

* fix: `hbw_posix_memalign` alignment
2024-10-17 00:36:51 +02:00
Daniel Bevenius 9e04102448 llama : suppress conversion from 'size_t' to 'int' (#9046)
* llama : suppress conversion from 'size_t' to 'int'

This commit updates llm_tokenizer_spm.tokenize to suppress/remove the
following warnings that are generated on Windows when using MSVC:

```console
src\llama-vocab.cpp(211,1): warning C4267: 'argument':
    conversion from 'size_t' to 'int', possible loss of data
src\llama-vocab.cpp(517,1): warning C4267: 'argument':
    conversion from 'size_t' to 'int', possible loss of data
```

This is done by adding a cast for the size_t returned from
symbols.size(). I believe this is safe as it seems unlikely that
symbols, which stores an entry for each UTF8 character, would become
larger than INT_MAX.

The motivation for this change is to reduce the number of warnings that
are currently generated when building on Windows.

* squash! llama : suppress conversion from 'size_t' to 'int'

Move cast into for loop.
2024-10-16 20:34:28 +03:00
Daniel Bevenius dbf18e4de9 llava : fix typo in error message [no ci] (#9884) 2024-10-16 20:24:05 +03:00
Joe Eli McIlvain 66c2c93082 grammar : fix JSON Schema for string regex with top-level alt. (#9903)
Prior to this commit, using a JSON Schema containing a string
with `pattern` regular expression that uses top-level alternation
(e.g. `"pattern": "^A|B|C|D$"`) would result in invalid JSON
output from the constrained sampling grammar, because it
ended up creating a grammar rule like this for the string:

```
thing ::= "\"" "A" | "B" | "C" | "D" "\"" space
```

Note that this rule will only match a starting quote for the "A" case,
and will only match an ending quote for the "D" case,
so this rule will always produce invalid JSON when used for sampling
(that is, the JSON will always be lacking the starting quote,
the ending quote, or both).

This was fixed in a simple way by adding parentheses to the
generated rule (for all string pattern rules, to keep it simple),
such that the new generated rule looks like this (correct):

```
thing ::= "\"" ("A" | "B" | "C" | "D") "\"" space
```
2024-10-16 19:03:24 +03:00
Molly Sophia 10433e8b45 llama : add tensor name for "result_norm" (#9907)
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2024-10-16 13:10:21 +03:00
Alexey Parfenov 1f66b699c4 server : fix the disappearance of the end of the text (#9867)
* server: fix the disappearance of the end of the text when streaming with stop strings

* simplify "send text" checks
2024-10-16 11:35:53 +03:00
Georgi Gerganov 0e41b300ed sync : ggml 2024-10-16 11:28:14 +03:00
Daniel Bevenius cd60b88bf7 ggml-alloc : remove buffer_id from leaf_alloc (ggml/987)
This commit removes the buffer_id field from the leaf_alloc struct.

The motivation for is that this field is only written to and never
read/used as far as I can tell. Each tensor_alloc has a buffer_id field
and this is what caused me to look into this more closely, to
understand what the buffer_id in leaf_alloc was used for.
2024-10-16 11:28:01 +03:00
leo-pony becfd387f6 [CANN] Fix cann compilation error (#9891)
Fix cann compilation error after merging llama.cpp supports dynamically loadable backends.
2024-10-16 08:51:46 +08:00
Georgi Gerganov 755a9b2bf0 llama : add infill sampler (#9896)
ggml-ci
2024-10-15 16:35:33 +03:00
Georgi Gerganov 223c25a72f server : improve infill context reuse (#9894)
ggml-ci
2024-10-15 16:28:55 +03:00
MaggotHATE fbc98b748e sampling : add XTC sampler (#9742)
* Initial XTC commit

Adds XTC sampler, not activated by default, but recommended settings by default.

* Cleanup

* Simplified chances calculation

To be more inline with the original implementation, chance is calculated once at the beginning.

* First fixes by comments

Still need to look into sorting

* Fixed trailing backspaces

* Fixed RNG to be reproduceable 

Thanks to @slaren for directions

* Fixed forgotten header

* Moved `min_keep` 

Moved from conditions to a simple check at the end.

* Fixed broken randomization

Thanks to @slaren for explanation

* Swapped sorting for a custom algorithm

Shifts tokens to remove the penalized ones, then puts the penalized at the back. Should make `min_keep` still viable.

* Algorithm rework

1. Scan token from top till the first non-penalizable
2. Remove the last captured token (the least probable above threshold)
3. Shift all tokens to override the remaining penalizable
4. Penalize and put them at the the bottom.

* Added XTC to `test-sampling`

* Simplified algorithm and more tests

* Updated info in common and args

* Merged back lost commits in common and arg

* Update dump info in common

* Fixed incorrect min_keep check

* Added XTC to README

* Renamed parameters, fixed info and defaults

* probability is at 0 by default, but XTC is included in sampling queue
* threshold higher than 0.5 switches XTC off

* Initial server support

* Added XTC to server UIs

* Fixed labels in old server UI

* Made algorithm safer and more readable

* Removed xtc_threshold_max

* Fixed arg after update

* Quick fixes by comments

* Simplified algorithm since threshold_max is removed

* Renamed random distribution

* Fixed tests and outdated README

* Small fixes
2024-10-15 12:54:55 +02:00
Georgi Gerganov dcdd535302 server : update preact (#9895) 2024-10-15 12:48:44 +03:00
Michał Tuszyński 4c42f93b22 readme : update bindings list (#9889) 2024-10-15 11:20:34 +03:00
VoidIsVoid a89f75e1b7 server : handle "logprobs" field with false value (#9871)
Co-authored-by: Gimling <huangjl@ruyi.ai>
2024-10-14 10:04:36 +03:00
agray3 13dca2a54a Vectorize load instructions in dmmv f16 CUDA kernel (#9816)
* Vectorize load instructions in dmmv f16 CUDA kernel

Replaces scalar with vector load instructions, which substantially
improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall
speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on
H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup.

* addressed comment

* Update ggml/src/ggml-cuda/dmmv.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-10-14 02:49:08 +02:00
Georgi Gerganov d4c19c0f5c server : accept extra_context for the infill endpoint (#9874)
* server : accept extra_context for the infill endpoint

ggml-ci

* server : update readme [no ci]

* server : use repo-level FIM pattern if possible

ggml-ci
2024-10-13 21:31:35 +03:00
Georgi Gerganov c7181bd294 server : reuse cached context chunks (#9866)
ggml-ci
2024-10-13 18:52:48 +03:00
Georgi Gerganov 92be9f1216 flake.lock: Update (#9870)
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/bc947f541ae55e999ffdb4013441347d83b00feb?narHash=sha256-NOiTvBbRLIOe5F6RbHaAh6%2B%2BBNjsb149fGZd1T4%2BKBg%3D' (2024-10-04)
  → 'github:NixOS/nixpkgs/5633bcff0c6162b9e4b5f1264264611e950c8ec7?narHash=sha256-9UTxR8eukdg%2BXZeHgxW5hQA9fIKHsKCdOIUycTryeVw%3D' (2024-10-09)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-10-12 20:11:26 -07:00
Georgi Gerganov edc265661c server : add option to time limit the generation phase (#9865)
ggml-ci
2024-10-12 16:14:27 +03:00
Georgi Gerganov 1bde94dd02 server : remove self-extend features (#9860)
* server : remove self-extend

ggml-ci

* server : fix context limit check to use slot.n_past

ggml-ci
2024-10-12 16:06:31 +03:00
Georgi Gerganov 95c76e8e92 server : remove legacy system_prompt feature (#9857)
* server : remove legacy system_prompt feature

ggml-ci

* readme : update [no ci]

* server : fix non-transformer logic + remove response from /props
2024-10-12 14:51:54 +03:00
Georgi Gerganov 11ac9800af llama : improve infill support and special token detection (#9798)
* llama : improve infill support

ggml-ci

* llama : add more FIM token strings

ggml-ci

* server : update prompt on slot restore (#9800)

* gguf : deprecate old FIM token KVs
2024-10-12 08:21:51 +03:00
R0CKSTAR 943d20b411 musa : update doc (#9856)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-10-12 08:09:53 +03:00
Diego Devesa 96776405a1 ggml : move more prints to the ggml log system (#9839)
* ggml : move more prints to the ggml log system

* show BLAS OpenMP warnings in all builds using debug print
2024-10-11 15:34:45 +02:00
Diego Devesa 7eee341bee common : use common_ prefix for common library functions (#9805)
* common : use common_ prefix for common library functions

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-10-10 22:57:42 +02:00
Diego Devesa 0e9f760eb1 rpc : add backend registry / device interfaces (#9812)
* rpc : add backend registry / device interfaces

* llama : add llama_supports_rpc API

* ggml_backend_rpc_start_rpc_server -> ggml_backend_rpc_start_server
2024-10-10 20:14:55 +02:00
R0CKSTAR cf8e0a3bb9 musa: add docker image support (#9685)
* mtgpu: add docker image support

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* mtgpu: enable docker workflow

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-10-10 20:10:37 +02:00
Diego Devesa c7499c557c examples : do not use common library in simple example (#9803)
* examples : do not use common library in simple example

* add command line parser, simplify code
2024-10-10 19:50:49 +02:00
Diego Devesa c81f3bbb05 cmake : do not build common library by default when standalone (#9804) 2024-10-09 18:49:52 +02:00
Georgi Gerganov e7022064ab perplexity : fix integer overflow (#9783)
* perplexity : fix integer overflow

ggml-ci

* perplexity : keep n_vocab as int and make appropriate casts

ggml-ci
2024-10-09 17:00:18 +03:00
Georgi Gerganov 3dc48fe75a examples : remove llama.vim
An updated version will be added in #9787
2024-10-09 10:55:42 +03:00
Diego Devesa dca1d4b58a ggml : fix BLAS with unsupported types (#9775)
* ggml : do not use BLAS with types without to_float

* ggml : return pointer from ggml_internal_get_type_traits to avoid unnecessary copies

* ggml : rename ggml_internal_get_type_traits -> ggml_get_type_traits

it's not really internal if everybody uses it
2024-10-08 14:21:43 +02:00
Xuan Son Nguyen 458367a906 server : better security control for public deployments (#9776)
* server : more explicit endpoint access settings

* protect /props endpoint

* fix tests

* update server docs

* fix typo

* fix tests
2024-10-08 13:27:04 +02:00
standby24x7 fa42aa6d89 scripts : fix spelling typo in messages and comments (#9782)
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
2024-10-08 09:19:53 +03:00
Diego Devesa 6374743747 ggml : add backend registry / device interfaces to BLAS backend (#9752)
* ggml : add backend registry / device interfaces to BLAS backend

* fix mmap usage when using host buffers
2024-10-07 21:55:08 +02:00
Andrew Minh Nguyen f1af42fa8c Update building for Android (#9672)
* docs : clarify building Android on Termux

* docs : update building Android on Termux

* docs : add cross-compiling for Android

* cmake : link dl explicitly for Android
2024-10-07 09:37:31 -07:00
Georgi Gerganov 6279dac039 flake.lock: Update (#9753)
Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/bcef6817a8b2aa20a5a6dbb19b43e63c5bf8619a?narHash=sha256-HO4zgY0ekfwO5bX0QH/3kJ/h4KvUDFZg8YpkNwIbg1U%3D' (2024-09-12)
  → 'github:hercules-ci/flake-parts/3d04084d54bedc3d6b8b736c70ef449225c361b1?narHash=sha256-K5ZLCyfO/Zj9mPFldf3iwS6oZStJcU4tSpiXTMYaaL0%3D' (2024-10-01)
• Updated input 'flake-parts/nixpkgs-lib':
    'https://github.com/NixOS/nixpkgs/archive/356624c12086a18f2ea2825fed34523d60ccc4e3.tar.gz?narHash=sha256-Ss8QWLXdr2JCBPcYChJhz4xJm%2Bh/xjl4G0c0XlP6a74%3D' (2024-09-01)
  → 'https://github.com/NixOS/nixpkgs/archive/fb192fec7cc7a4c26d51779e9bab07ce6fa5597a.tar.gz?narHash=sha256-0xHYkMkeLVQAMa7gvkddbPqpxph%2BhDzdu1XdGPJR%2BOs%3D' (2024-10-01)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/1925c603f17fc89f4c8f6bf6f631a802ad85d784?narHash=sha256-J%2BPeFKSDV%2BpHL7ukkfpVzCOO7mBSrrpJ3svwBFABbhI%3D' (2024-09-26)
  → 'github:NixOS/nixpkgs/bc947f541ae55e999ffdb4013441347d83b00feb?narHash=sha256-NOiTvBbRLIOe5F6RbHaAh6%2B%2BBNjsb149fGZd1T4%2BKBg%3D' (2024-10-04)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-10-07 09:35:42 -07:00
Georgi Gerganov d5ac8cf2f2 ggml : add metal backend registry / device (#9713)
* ggml : add metal backend registry / device

ggml-ci

* metal : fix names [no ci]

* metal : global registry and device instances

ggml-ci

* cont : alternative initialization of global objects

ggml-ci

* llama : adapt to backend changes

ggml-ci

* fixes

* metal : fix indent

* metal : fix build when MTLGPUFamilyApple3 is not available

ggml-ci

* fix merge

* metal : avoid unnecessary singleton accesses

ggml-ci

* metal : minor fix [no ci]

* metal : g_state -> g_ggml_ctx_dev_main [no ci]

* metal : avoid reference of device context in the backend context

ggml-ci

* metal : minor [no ci]

* metal : fix maxTransferRate check

* metal : remove transfer rate stuff

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-10-07 18:27:51 +03:00
Paul Tsochantaris 96b6912103 metal : single allocation of encode_async block (#9747)
* Single allocation of encode_async block with non-ARC capture in ggml-metal.m

* Moving Block_release to the deallocation code

* Release encode block when re-setting encoding buffer count if needed

* Update ggml/src/ggml-metal.m

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-10-07 15:26:31 +03:00
Georgi Gerganov d5cb86844f contrib : simplify + minor edits [no ci] 2024-10-06 14:15:27 +03:00
Georgi Gerganov f4b2dcdf49 readme : fix typo [no ci] 2024-10-06 13:49:41 +03:00
Georgi Gerganov b6d6c5289f sync : llama.cpp 2024-10-06 12:53:28 +03:00
SRHMorris b0915d5b51 vulkan : retry allocation with fallback flags (whisper/2451)
Co-authored-by: Samuel Morris <samuel.morris@artlist.io>
2024-10-06 12:52:11 +03:00
Georgi Gerganov 8c475b97b8 rerank : use [SEP] token instead of [BOS] (#9737)
* rerank : use [SEP] token instead of [BOS]

ggml-ci

* common : sanity check for non-NULL tokens

ggml-ci

* ci : adjust rank score interval

ggml-ci

* ci : add shebang to run.sh

ggml-ci
2024-10-05 15:55:04 +03:00
Georgi Gerganov 58b16695e1 sync : ggml 2024-10-05 15:53:49 +03:00
Georgi Gerganov 905f5485b2 metal : zero-init buffer contexts (whisper/0) 2024-10-05 15:53:00 +03:00
Viet-Anh NGUYEN (Andrew) 71967c2a6d Add Llama Assistant (#9744) 2024-10-04 20:29:35 +02:00
Georgi Gerganov 17880771ad sync : ggml 2024-10-04 18:50:25 +03:00
Daniel Bevenius 55951c018d ggml : fix typo in example usage ggml_gallocr_new (ggml/984) 2024-10-04 18:50:05 +03:00
Diego Devesa ff565769f2 ggml : fixes after sync (ggml/983)
ggml : remove test-backend-buffer

ggml : fix CUDA build warnings
2024-10-04 18:50:04 +03:00
Xuan Son Nguyen f3fdcfaa79 ci : fine-grant permission (#9710) 2024-10-04 11:47:19 +02:00
Daniel Kleine 133c7b46b3 Fixed RNG seed docs (#9723)
* Update README.md

fixed RNG seed info

* changed print format to unsigned
2024-10-04 10:54:44 +02:00
Georgi Gerganov d5ed2b929d metal : remove abort (skip) (ggml/0) 2024-10-03 21:18:19 +03:00
Georgi Gerganov 1bb8a64ebf sync : ggml 2024-10-03 21:17:49 +03:00
Johannes Gäßler fabdc3bda3 ggml/ex: calculate accuracy in graph, adapt MNIST (ggml/980) 2024-10-03 21:17:26 +03:00
Johannes Gäßler eee39bdc96 ggml: refactor cross entropy loss CPU impl. (ggml/976) 2024-10-03 21:17:26 +03:00
Jack Mousseau 5d5ab1e5cc metal : fix compute pass descriptor autorelease crash (#9718) 2024-10-03 21:01:46 +03:00
Diego Devesa a7ad553513 ggml-backend : add device description to CPU backend (#9720) 2024-10-03 17:39:18 +02:00
bandoti d6fe7abf04 ggml: unify backend logging mechanism (#9709)
* Add scaffolding for ggml logging macros

* Metal backend now uses GGML logging

* Cuda backend now uses GGML logging

* Cann backend now uses GGML logging

* Add enum tag to parameters

* Use C memory allocation funcs

* Fix compile error

* Use GGML_LOG instead of GGML_PRINT

* Rename llama_state to llama_logger_state

* Prevent null format string

* Fix whitespace

* Remove log callbacks from ggml backends

* Remove cuda log statement
2024-10-03 17:39:03 +02:00
compilade e3c355ba65 convert : handle tokenizer merges format from transformers 4.45 (#9696) 2024-10-03 17:22:15 +03:00
Radoslav Gerganov 841713e1e4 rpc : enable vulkan (#9714)
closes #8536
2024-10-03 13:00:52 +03:00
Ouadie EL FAROUKI 5639971466 Fixed dequant precision issues in Q4_1 and Q5_1 (#9711) 2024-10-03 07:50:44 +01:00
Diego Devesa c83ad6d01e ggml-backend : add device and backend reg interfaces (#9707)
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-10-03 01:49:47 +02:00
Xuan Son Nguyen a39ab216aa llama : reduce compile time and binary size (#9712)
* llama : speed up compile time

* fix build

* fix build (2)
2024-10-02 15:49:55 +02:00
Alberto Cabrera Pérez f536f4c439 [SYCL] Initial cmake support of SYCL for AMD GPUs (#9658)
sycl: initial cmake support of SYCL for AMD GPUs
2024-10-02 13:57:18 +01:00
Radoslav Gerganov 00b7317e63 vulkan : do not use tensor->extra (#9407)
* vulkan : do not use tensor->extra

This patch allows using the Vulkan backend with the RPC backend as
tensor->extra is no longer used.

Ref: #8536

* Adapt GGML_VULKAN_CHECK_RESULTS to extra removal (#2)

---------

Co-authored-by: 0cc4m <picard12@live.de>
2024-10-02 13:49:16 +03:00
Zhenwei Jin 76b37d1541 gguf-split : improve --split and --merge logic (#9619)
* make sure params --split and --merge are not specified at same time

* update gguf-split params parse logic

* Update examples/gguf-split/gguf-split.cpp

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-10-02 10:21:57 +03:00
Georgi Gerganov 148844fe97 examples : remove benchmark (#9704)
ggml-ci
2024-10-02 10:14:44 +03:00
Paweł Wodnicki 3f1ae2e32c Update README.md (#9591)
Add Bielik model.
2024-10-01 19:18:46 +02:00
Georgi Gerganov f1b8c42711 sync : ggml 2024-10-01 16:09:42 +03:00
Johannes Gäßler e98c1c188e test: fix OPT_STEP_ADAMW for test-backend-ops (ggml/974) 2024-10-01 16:07:40 +03:00
Salvatore Mesoraca cb00020504 vulkan : mul_mat: fix UB with small warps (ggml/952)
When the device's warp size is less than 16,
it is possible for loadstride_a (mul_mm.comp:114)
and loadstride_b (mul_mm.comp:115) to be set to 0.
Because they are calculated as: the workgroup size,
multiplied by LOAD_VEC_* (which can be 1) and divided by 16.
And the workgroup size is set to be the same as the
warp/subgroup size.

The loadstride_* variables are used as increments in the
loops that populate the buffers used for the multiplication.

When they are 0 they cause an infinite loop.
But infinite loops without side-effects are UB and the
values of loadstride_* are known at compile time.
So, the compiler quietly optimizes all the loops away.
As a consequence, the buffers are not populated and
the multiplication result is just a matrix with all elements
set to 0.

We prevent the UB by making sure that the workgroup size
will never be less than 16, even if our device has a
smaller warp size (e.g. 8).

Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
2024-10-01 16:07:39 +03:00
Borislav Stanimirov 6c5322481a ggml : fix ggml_cast (ggml/973) 2024-10-01 16:07:39 +03:00
Johannes Gäßler 7254cdf7e8 ggml: fix gradient allocation logic (ggml/966)
* ggml: fix gradient allocation logic

* gradient allocation in ggml_build_backward_expand

* fixup

* fix test-backend-ops grad

* suggestions by slaren

* fix test1.c

* fix legacy opt API

* fix test-grad0

* remove keep arg
2024-10-01 16:07:38 +03:00
Georgi Gerganov cad341d889 metal : reduce command encoding overhead (#9698)
* metal : reduce command encoding overhead

ggml-ci

* metal : add comments
2024-10-01 16:00:25 +03:00
Georgi Gerganov a90484c6d9 llama : print correct model type for Llama 3.2 1B and 3B 2024-10-01 11:42:01 +03:00
compilade 1927378bcc convert : refactor rope_freqs generation (#9396)
* convert : refactor rope_freqs generation

This should also fix vocab-only conversion for Phi-3.

* convert : adapt MiniCPM3 to separate rope_freqs insertion

MiniCPM3's tokenizer is treated as a SentencePiece tokenizer to avoid
having to run its custom Python code which mixes tokenization
in the same file as tool calls.

gguf-py : add long and short RoPE factors to tensor mappings

Empty, but the key names are used to populate the mappings.
2024-10-01 09:31:36 +03:00
serhii-nakon 6f1d9d71f4 Fix Docker ROCM builds, use AMDGPU_TARGETS instead of GPU_TARGETS (#9641)
* Fix Docker ROCM builds, use AMDGPU_TARGETS instead of GPU_TARGETS

* Set ROCM_DOCKER_ARCH as string due it incorrectly build and cause OOM exit code
2024-09-30 20:57:12 +02:00
compilade 511636df0c ci : reduce severity of unused Pyright ignore comments (#9697) 2024-09-30 14:13:16 -04:00
vb 08a43d05b6 py : update transfomers version (#9694)
* update transfomers version.

* update hfh version.
2024-09-30 18:03:47 +03:00
Georgi Gerganov ace4f4be37 flake.lock: Update (#9680)
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/c04d5652cfa9742b1d519688f65d1bbccea9eb7e?narHash=sha256-PmUr/2GQGvFTIJ6/Tvsins7Q43KTMvMFhvG6oaYK%2BWk%3D' (2024-09-19)
  → 'github:NixOS/nixpkgs/1925c603f17fc89f4c8f6bf6f631a802ad85d784?narHash=sha256-J%2BPeFKSDV%2BpHL7ukkfpVzCOO7mBSrrpJ3svwBFABbhI%3D' (2024-09-26)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-09-30 07:48:49 -07:00
Ruchira Hasaranga 8277a817f1 console : utf-8 fix for windows stdin (#9690)
* utf-8 fix for windows stdin

* Update common/console.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-30 11:23:42 +03:00
Georgi Gerganov c919d5db39 ggml : define missing HWCAP flags (#9684)
ggml-ci

Co-authored-by: Willy Tarreau <w@1wt.eu>
2024-09-29 21:18:23 +03:00
Georgi Gerganov d0b1d663e4 sync : ggml 2024-09-29 21:16:07 +03:00
Johannes Gäßler aaa4099925 CUDA: remove bad assert (ggml/972) 2024-09-29 21:15:37 +03:00
Jeff Bolz 641002fba8 vulkan : multithread pipeline creation (ggml/963) 2024-09-29 21:15:37 +03:00
Jeff Bolz 0de8b203f1 vulkan : fix build for GGML_VULKAN_RUN_TESTS, add TFLOPS to log (ggml/961) 2024-09-29 21:15:37 +03:00
Salvatore Mesoraca 544f409b4b vulkan : argsort barriers must be under uniform control flow (ggml/951)
a return before a barrier (that happens only in some threads in
a workgroup) leads to UB.
While the old code actually works on some devices,
it fails on some others (i.e. "smaller" GPUs).

BTW, I think it would be better to set specialization constants
when the graph is built, in that way the local workgroup
could be sized appropriately.
But it would take a lot of work.

Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
2024-09-29 21:15:37 +03:00
Georgi Gerganov 6084bfb261 ggml : fix GGML_MAX_N_THREADS + improve formatting (ggml/969) 2024-09-29 21:15:35 +03:00
matiaslin faac0bae26 common : ensure llama_batch size does not exceed max size (#9668)
A crash was observed when the number of tokens added to a batch exceeds
llama_batch size. An assertion in llama_batch_add was added to protect
against llama_batch size overflow.
2024-09-29 15:25:00 +03:00
nopperl f99d3f8367 py : add model class for Chameleon conversion (#9683) 2024-09-29 15:02:06 +03:00
Georgi Gerganov 589b48d41e contrib : add Resources section (#9675) 2024-09-29 14:38:18 +03:00
Georgi Gerganov f4d2b8846a llama : add reranking support (#9510)
* py : add XLMRobertaForSequenceClassification [no ci]

* py : fix scalar-tensor conversion [no ci]

* py : fix position embeddings chop [no ci]

* llama : read new cls tensors [no ci]

* llama : add classigication head (wip) [no ci]

* llama : add "rank" pooling type

ggml-ci

* server : add rerank endpoint

ggml-ci

* llama : aboud ggml_repeat during classification

* rerank : cleanup + comments

* server : accept /rerank endpoint in addition to /v1/rerank [no ci]

* embedding : parse special tokens

* jina : support v1 reranker

* vocab : minor style

ggml-ci

* server : initiate tests for later

ggml-ci

* server : add docs

* llama : add comment [no ci]

* llama : fix uninitialized tensors

* ci : add rerank tests

ggml-ci

* add reranking test

* change test data

* Update examples/server/server.cpp

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* add `--reranking` argument

* update server docs

* llama : fix comment [no ci]

ggml-ci

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-09-28 17:42:03 +03:00
slaren 1b2f992cd2 test-backend-ops : use flops for some performance tests (#9657)
* test-backend-ops : use flops for some performance tests

- parallelize tensor quantization

- use a different set of cases for performance and correctness tests

- run each test for at least one second
2024-09-28 14:32:46 +02:00
Georgi Gerganov 739842703e llama : add comment about thread-safety [no ci] (#9449) 2024-09-28 15:13:42 +03:00
Zhenwei Jin 6102037bbb vocab : refactor tokenizer to reduce init overhead (#9449)
* refactor tokenizer

* llama : make llm_tokenizer more private

ggml-ci

* refactor tokenizer

* refactor tokenizer

* llama : make llm_tokenizer more private

ggml-ci

* remove unused files

* remove unused fileds to avoid unused filed build error

* avoid symbol link error

* Update src/llama.cpp

* Update src/llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-28 15:10:58 +03:00
nopperl 9a913110cf llama : add support for Chameleon (#8543)
* convert chameleon hf to gguf

* add chameleon tokenizer tests

* fix lint

* implement chameleon graph

* add swin norm param

* return qk norm weights and biases to original format

* implement swin norm

* suppress image token output

* rem tabs

* add comment to conversion

* fix ci

* check for k norm separately

* adapt to new lora implementation

* fix layer input for swin norm

* move swin_norm in gguf writer

* add comment regarding special token regex in chameleon pre-tokenizer

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* fix punctuation regex in chameleon pre-tokenizer (@compilade)

Co-authored-by: compilade <git@compilade.net>

* fix lint

* trigger ci

---------

Co-authored-by: compilade <git@compilade.net>
2024-09-28 15:08:43 +03:00
Aarni Koskela 43bcdd9703 readme : add tool (#9655) 2024-09-28 15:07:14 +03:00
Dan Johansson 6a0f779484 ggml : add run-time detection of neon, i8mm and sve (#9331)
* ggml: Added run-time detection of neon, i8mm and sve

Adds run-time detection of the Arm instructions set features
neon, i8mm and sve for Linux and Apple build targets.

* ggml: Extend feature detection to include non aarch64 Arm arch

* ggml: Move definition of ggml_arm_arch_features to the global data section
2024-09-28 15:06:16 +03:00
Markus Tavenrath 89f9944981 Enable use to the rebar feature to upload buffers to the device. (#9251) 2024-09-28 12:05:05 +02:00
Georgi Gerganov b5de3b74a5 readme : update hot topics 2024-09-27 20:57:51 +03:00
Borislav Stanimirov 44f59b4301 cmake : add option for common library (#9661) 2024-09-27 10:42:06 +03:00
Neo Zhang Jianyu 95bc82fbc0 [SYCL] add missed dll file in package (#9577)
* update oneapi to 2024.2

* use 2024.1

---------

Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>
2024-09-26 17:38:31 +08:00
R0CKSTAR 7691654c68 mtgpu: enable VMM (#9597)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-09-26 03:27:40 +02:00
Xuan Son Nguyen ea9c32be71 ci : fix docker build number and tag name (#9638)
* ci : fix docker build number and tag name

* fine-grant permissions
2024-09-25 17:26:01 +02:00
Charles Xu 1e43630218 ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels (#9217)
* ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels

* added fallback mechanism when the offline re-quantized model is not
optimized for the underlying target.

* fix for build errors

* remove prints from the low-level code

* Rebase to the latest upstream
2024-09-25 16:12:20 +03:00
Xuan Son Nguyen afbbfaa537 server : add more env vars, improve gen-docs (#9635)
* server : add more env vars, improve gen-docs

* update server docs

* LLAMA_ARG_NO_CONTEXT_SHIFT
2024-09-25 14:05:13 +02:00
Gabe Goodhart 3d6bf6919f llama : add IBM Granite MoE architecture (#9438)
* feat(gguf-py): Add granitemoe architecture

This includes the addition of new tensor names for the new moe layers.
These may not be correct at this point due to the need for the hack in
gguf_writer.py to double-check the length of the shape for these layers.

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(convert_hf_to_gguf): Add GraniteMoeModel

GraniteMoe has the same configuration deltas as Granite

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(granitemoe convert): Split the double-sized input layer into gate and up

After a lot of staring and squinting, it's clear that the standard mixtral
expert implementation is equivalent to the vectorized parallel experts in
granite. The difference is that in granite, the w1 and w3 are concatenated
into a single tensor "input_linear." Rather than reimplementing all of the
math on the llama.cpp side, the much simpler route is to just split this
tensor during conversion and follow the standard mixtral route.

Branch: GraniteMoE

Co-Authored-By: alex.brooks@ibm.com

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(granitemoe): Implement granitemoe

GraniteMoE follows the mixtral architecture (once the input_linear layers
are split into gate_exps/up_exps). The main delta is the addition of the
same four multipliers used in Granite.

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* Typo fix in docstring

Co-Authored-By: ggerganov@gmail.com

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(conversion): Simplify tensor name mapping in conversion

Branch: GraniteMoE

Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(convert): Remove unused tensor name mappings

Branch: GraniteMoE

Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(convert): Sanity check on merged FFN tensor sizes

Branch: GraniteMoE

Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Allow "output" layer in granite moe architecture (convert and cpp)

Branch: GraniteMoE

Co-Authored-By: git@compilade.net
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(granite): Add missing 'output' tensor for Granite

This is a fix for the previous `granite` architecture PR. Recent snapshots
have included this (`lm_head.weights`) as part of the architecture

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-25 10:06:52 +03:00
Dou Xinpeng 904837e0cb cann: fix crash when llama-bench is running on multiple cann devices (#9627) 2024-09-25 11:30:38 +08:00
Eric Zhang 70392f1f81 ggml : add AVX512DQ requirement for AVX512 builds (#9622) 2024-09-24 11:03:21 +03:00
Georgi Gerganov bb5f819975 sync : ggml 2024-09-24 11:01:18 +03:00
Georgi Gerganov c038931615 examples : adapt to ggml.h changes (ggml/0)
ggml-ci
2024-09-24 11:00:52 +03:00
Georgi Gerganov 31ac5834fe llama : keep track of all EOG tokens in the vocab (#9609)
ggml-ci
2024-09-24 10:16:06 +03:00
Georgi Gerganov cea1486ecf log : add CONT level for continuing previous log entry (#9610) 2024-09-24 10:15:35 +03:00
StrangeBytesDev 0aa15011e3 server : add newline after chat example (#9616) 2024-09-24 09:04:39 +03:00
Georgi Gerganov b0f27361f3 sampling : avoid expensive softmax during greedy sampling (#9605)
* sampling : avoid expensive softmax during greedy sampling

ggml-ci

* speculative : fix default RNG seed + set sparams.n_probs

* Update tests/test-sampling.cpp

Co-authored-by: slaren <slarengh@gmail.com>

* sampling : add clarifying comment [no ci]

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-09-24 09:03:17 +03:00
Max Krasnyansky c087b6f11d threads: fix msvc build without openmp (#9615)
We're missing atomic_thread_fence() in MSVC builds when openmp is disabled.
2024-09-23 21:18:48 -07:00
Ivan 116efee0ee cuda: add q8_0->f32 cpy operation (#9571)
llama: enable K-shift for quantized KV cache
It will fail on unsupported backends or quant types.
2024-09-24 02:14:24 +02:00
Xuan Son Nguyen 0b3bf966f4 server : add --no-context-shift option (#9607)
* server : add --no-context-shift option

* small fix

* Update examples/server/tests/features/embeddings.feature

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* tests : minor fix

* revert usage of GGML_ASSERT

* update server documentation

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-23 22:23:54 +02:00
Max Krasnyansky f0c7b5edf8 threads: improve ggml_barrier scaling with large number of threads (#9598)
Make sure n_barrier and n_barrier_passed do not share the cache line to avoid cache line bouncing.
This optimization shows performance improvements even for n_threads <= 8 cases.

Resurect TSAN (Thread Sanitizer) check so that we can avoid doing expensive read-modify-write
in the normal case and just use thread-fence as originally intended.

---
Here is the original description and suggestions from Willy Tarreau :

There's currently some false sharing between n_barrier and
n_barrier_passed that is amplified in ggml_barrier() by the fact that
all threads need to increment n_barrier when entering, while all
previous threads continue to read n_barrier_passed, waiting for the last
one to release them all. The side effect is that all these readers are
slowing down all new threads by making the cache line bounce back and
forth between readers and writers.

Just placing them in two distinct cache lines is sufficient to boost
the performance by 21% on a 80-core ARM server compared to the
no-openmp version, and by 3% compared to the openmp version.

Note that the variables could have been spread apart in the structure
as well, but it doesn't seem that the size of this threadpool struct is
critical so here we're simply aligning them.

Finally, the same issue was present when leaving the barrier since all
threads had to update the n_barrier_passed counter, though only one
would add a non-zero value. This alone is responsible for half of the
cost due to undesired serialization.

It might be possible that using a small array of n_barrier counters
could make things even faster on many-core systems, but it would likely
complicate the logic needed to detect the last thread.

Co-authored-by: Willy Tarreau <w@1wt.eu>
2024-09-23 11:42:43 -07:00
Riceball LEE 1d48e98e4f readme : add programmable prompt engine language CLI (#9599) 2024-09-23 18:58:17 +03:00
Georgi Gerganov f3979df762 flake.lock: Update (#9586)
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/4f807e8940284ad7925ebd0a0993d2a1791acb2f?narHash=sha256-IiA3jfbR7K/B5%2B9byVi9BZGWTD4VSbWe8VLpp9B/iYk%3D' (2024-09-11)
  → 'github:NixOS/nixpkgs/c04d5652cfa9742b1d519688f65d1bbccea9eb7e?narHash=sha256-PmUr/2GQGvFTIJ6/Tvsins7Q43KTMvMFhvG6oaYK%2BWk%3D' (2024-09-19)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-09-23 08:43:40 -07:00
Srihari-mcw 1e7b9299c6 ggml : AVX512 gemm for Q4_0_8_8 (#9532)
* AVX512 version of ggml_gemm_q4_0_8x8_q8_0

* Remove zero vector parameter passing

* Rename functions and rearrange order of macros

* Edit commments

* style : minor adjustments

* Update x to start from 0

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-23 17:06:38 +03:00
Georgi Gerganov 37f8c7b4c9 perplexity : remove extra new lines after chunks (#9596) 2024-09-23 11:28:02 +03:00
Georgi Gerganov bf9c1013ac metal : use F32 prec for K*Q in vec FA (#9595)
ggml-ci
2024-09-23 11:27:47 +03:00
Akarshan Biswas e62e9789cd Revert "[SYCL] fallback mmvq (#9088)" (#9579)
This reverts commit 50addec9a5.
2024-09-23 11:28:06 +08:00
R0CKSTAR c35e586ea5 musa: enable building fat binaries, enable unified memory, and disable Flash Attention on QY1 (MTT S80) (#9526)
* mtgpu: add mp_21 support

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* mtgpu: disable flash attention on qy1 (MTT S80); disable q3_k and mul_mat_batched_cublas

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* mtgpu: enable unified memory

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* mtgpu: map cublasOperation_t to mublasOperation_t (sync code to latest)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-09-22 16:55:49 +02:00
Molly Sophia 912c331d3d Fix merge error in #9454 (#9589)
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2024-09-22 15:26:50 +02:00
Johannes Gäßler a5b57b08ce CUDA: enable Gemma FA for HIP/Pascal (#9581) 2024-09-22 09:34:52 +02:00
Shankar ecd5d6b65b llama: remove redundant loop when constructing ubatch (#9574) 2024-09-22 04:30:34 +02:00
Molly Sophia 2a63caaa69 RWKV v6: RWKV_WKV op CUDA implementation (#9454)
* ggml: CUDA unary op EXP

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* ggml: rwkv_wkv op CUDA impl

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2024-09-22 04:29:12 +02:00
slaren d09770cae7 ggml-alloc : fix list of allocated tensors with GGML_ALLOCATOR_DEBUG (#9573) 2024-09-21 14:24:23 +02:00
agray3 41f477879f Update CUDA graph on scale change plus clear nodes/params (#9550)
* Avoid using saved CUDA graph if scale changes and reset nodes/params on update

Fixes https://github.com/ggerganov/llama.cpp/issues/9451

* clear before resize
2024-09-21 02:41:07 +02:00
Huang Qi e948a7da7a CI: Provide prebuilt windows binary for hip (#9467) 2024-09-21 02:39:41 +02:00
slaren 63351143b2 quantize : improve type name parsing (#9570)
quantize : do not ignore invalid types in arg parsing

quantize : ignore case of type and ftype arguments
2024-09-20 20:55:36 +02:00
Georgi Gerganov d13edb17ed ggml : fix builds (#0)
ggml-ci
2024-09-20 21:15:05 +03:00
Georgi Gerganov 27609c49b9 ggml : fix trailing whitespace (#0)
ggml-ci
2024-09-20 21:15:05 +03:00
Georgi Gerganov 4301535326 sync : ggml
ggml-ci
2024-09-20 21:15:05 +03:00
Johannes Gäßler 424c5d00a9 ggml/examples: add backend support for numerical optimization (ggml/949)
* CUDA eval works

* stochastic gradient descent op

* Adam except decay

* CUDA CROSS_ENTROPY_LOSS_BACK

* CUDA mnist-fc training works

* backend CLI arg

* refactor gguf load

* remove sched from opt_step_adam

* implement l1 regularization (weight decay)

* extra call to add optimizer

* initialize gradients with ggml_graph_reset

* gradient accumulation

* increment iter per eval instead of epoch

* adjust backend interfaces

* fix ggml_graph_reset without backend

* fix ggml graph export/import

* fixup

* rename

* revert ggml_opt changes

* more general CUDA repeat_back

* update documentation, fix CNN

* validation split

* add clarifying comment

* optimize PyTorch training

* adjust buffer size, thread count

* fix 0.0f validation split

* Update examples/mnist/mnist-common.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix gradient accumulation

* tensor flag for accumulators -> tensor hash set

* Update include/ggml.h

Co-authored-by: slaren <slarengh@gmail.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: slaren <slarengh@gmail.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: slaren <slarengh@gmail.com>

* fix test prints

* Update src/ggml-backend.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* better CUDA support for noncontiguous out_prod

* add comment

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-09-20 21:15:05 +03:00
Georgi Gerganov a6809c6a2e examples : add null threadpool args where needed (ggml/0)
ggml-ci
2024-09-20 21:15:05 +03:00
Johannes Gäßler 5cb12f6839 CUDA: fix sum.cu compilation for CUDA < 11.7 (#9562) 2024-09-20 18:35:35 +02:00
Georgi Gerganov d39e26741f examples : flush log upon ctrl+c (#9559) 2024-09-20 11:46:56 +03:00
Sigbjørn Skjæret 722ec1eb51 perplexity : do not escape input data by default (#9548) 2024-09-20 09:38:10 +03:00
Georgi Gerganov 6026da52d6 server : clean-up completed tasks from waiting list (#9531)
ggml-ci
2024-09-19 12:44:53 +03:00
Sigbjørn Skjæret eca0fab44e imatrix : disable prompt escape by default (#9543) 2024-09-19 10:58:14 +03:00
slaren 64c6af3195 ggml : fix n_threads_cur initialization with one thread (#9538)
* ggml : fix n_threads_cur initialization with one thread

* Update ggml/src/ggml.c

---------

Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
2024-09-18 10:13:08 -07:00
Georgi Gerganov 0d2f22e45c scripts : verify py deps at the start of compare (#9520) 2024-09-18 18:34:32 +03:00
Daniel Bevenius 6443ddd985 llama : use reserve/emplace_back in sampler_sample (#9534)
This commit updates the llama_sampler_sample function to use reserve and
emplace_back for the vector of llama_token_data structs.

The motivation for this change is to avoid the creation of n_vocab
default-constructed llama_token_data structs which are then
immediately overwritten.
2024-09-18 14:42:36 +03:00
Vinesh Janarthanan 8a308354f6 server : match OAI structured output response (#9527) 2024-09-18 09:50:34 +03:00
Eric Zhang f799155ab8 server : fix OpenSSL build (remove obsolete LOG_INFO) (#9529) 2024-09-18 09:28:20 +03:00
Neo Zhang Jianyu faf67b3de4 [SYCL]set context default value to avoid memory issue, update guide (#9476)
* set context default to avoid memory issue, update guide

* Update docs/backend/SYCL.md

Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>

---------

Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>
Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>
2024-09-18 08:30:31 +08:00
Michael Podvitskiy 7be099fa81 llama-bench: correct argument parsing error message (#9524) 2024-09-17 22:41:38 +02:00
Bert Wagner 8b836ae731 arg : add env variable for parallel (#9513)
* add env variable for parallel

* Update README.md with env:  LLAMA_ARG_N_PARALLEL
2024-09-17 16:35:38 +03:00
Michael Podvitskiy 8344ef58f8 llama : fix n_vocab init for 'no_vocab' case (#9511)
* llama: fixed n_vocab for `no_vocab` models

* llama: updated error output for `llama_decode_internal` and `llama_encode_internal`

* llama: log warning if there's no vocab_size in metadata

* llama: correct vocab size for logging

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-17 13:18:22 +03:00
Max Krasnyansky 0226613853 threadpool : skip polling for unused threads (#9461)
* threadpool: skip polling for unused threads

Currently all threads do N polling rounds even if only 1 thread is active (n_threads_cur == 1).
This commit adds a check to skip the polling for unused threads (ith >= n_threads_cur).

n_threads_cur is now an atomic_int to explicitly tell thread sanitizer that it is written
from one thread and read from other threads (not a race conditions).

* threadpool: further simplify and improve ggml_barrier

Avoid using strict memory order while polling, yet make sure that all threads go through
full memory barrier (memory fence) on ggml_barrier entrace and exit.

* threads: add simple barrier test

This test does lots of small, parallel matmul ops where the barriers in between dominate the overhead.

* threadpool: improve thread sync for new-graphs

Using the same tricks as ggml_barrier. All the polling is done with relaxed memory order
to keep it efficient, once the new graph is detected we do full fence using read-modify-write
with strict memory order.

* threadpool: improve abort handling

Do not use threadpool->ec (exit code) to decide whether to exit the compute loop.
threadpool->ec is not atomic which makes thread-sanitizer rightfully unhappy about it.

Instead introduce atomic threadpool->abort flag used for this. This is consistent with
how we handle threadpool->stop or pause.

While at it add an explicit atomic_load for n_threads_cur for consistency.

* test-barrier: release threadpool before releasing the context

fixes use-after-free detected by gcc thread-sanitizer on x86-64
for some reason llvm sanitizer is not detecting this issue.
2024-09-17 11:19:46 +03:00
Yuri Khrustalev 503147a9f9 unicode : add <algorithm> (#9508) 2024-09-17 09:51:15 +03:00
Gabe Goodhart 0d2ec43833 llama : support IBM Granite architecture (#9412)
* feat(gguf-py): Add Granite model and params to gguf-py

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(convert_hf_to_gguf): Add registration and param setup for Granite

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama.cpp): Add config parsing for Granite multiplier params

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama.cpp): First pass at full port of granite deviations from llama

Something is still not working right since the results are mostly terrible,
but on occasion it's producing relevant results at this point, so
_something_ is working.

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama.cpp): Determine granite language 3b instruct by vocab size

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(convert_hf_to_gguf): Use LlamaModel as base for GraniteModel

The defaults in LlamaModel are needed for Granite as well

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama.cpp): Switch Granite param names to use _scale for consistency

Other scalar multipliers are called *_scale, so this provides a more
consistent naming convention.

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(convert_hf_to_gguf/gguf-py): _multiplier -> _scale

The transformers names with _multiplier will now be converted to the _scale
equivalent during conversion.

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama.cpp): Use separate switch clause for granite in llm_load_hparams

Branch: GraniteLM

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-09-17 09:44:58 +03:00
Michael Podvitskiy 37f3a3810e llama : add llama_n_head() (#9512) 2024-09-17 09:23:30 +03:00
slaren 23e0d70bac ggml : move common CPU backend impl to new header (#9509) 2024-09-16 16:22:07 +02:00
Daniel Bevenius acb2c32c33 llama : rename n_embed to n_embd in rwkv6_time_mix (#9504)
This commit renames n_embed to n_embd in llm_build_rwkv6_time_mix.

The motivation for this change is consistency with the other rwkv6
functions like build_rwkv6 (and other parts of the code base).
2024-09-16 14:07:13 +03:00
Michael Podvitskiy a6a3a5c531 ggml : link MATH_LIBRARY not by its full path (#9339) 2024-09-16 14:06:50 +03:00
compilade d54c21df7e convert : identify missing model files (#9397) 2024-09-16 10:30:22 +03:00
Georgi Gerganov 19514d632e cmake : do not hide GGML options + rename option (#9465)
* cmake : do not hide GGML options

ggml-ci

* build : rename flag GGML_CUDA_USE_GRAPHS -> GGML_CUDA_GRAPHS

for consistency

ggml-ci
2024-09-16 10:27:50 +03:00
Eve 5c3d0f1824 ggml : IQ4_NL sgemm + Q4_0 AVX optimization (#9422)
* squashed

readd my iq4_nl sgemm PR https://github.com/ggerganov/llama.cpp/pull/8049

have ggml_vec_dot_q4_0 do two blocks per loop for avx

try out f16c ggml_vec_dot_iq4_nl, but it's not really faster. as per https://github.com/ggerganov/llama.cpp/pull/8549 we can calculate several blocks at a time with no issue

* shuffle

* remove f16c iq4_nl as i cant make it faster than before
2024-09-16 09:48:24 +03:00
Shane A 0aadac10c7 llama : support OLMoE (#9462) 2024-09-16 09:47:37 +03:00
CarryFun 95ca85168b llama : support MiniCPM3 (#9322)
Co-authored-by: 范睿凯 <fanruikai@modelbest.cn>
2024-09-16 09:45:20 +03:00
Vinesh Janarthanan 441b72b91f main : option to disable context shift (#9484)
* added cli arg to disable context shift

* reverted precommit

* updated README.md for main

* white space

* allow disabling context shift in the server

* Update common/arg.cpp

no-context-shift only works for main example

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* added server example to --no-context-shift args

* removed server changes

* white space

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-16 09:20:01 +03:00
Georgi Gerganov c4965a64f7 metal : handle zero-sized allocs (#9466) 2024-09-16 09:05:56 +03:00
Georgi Gerganov 90a2fff0e7 flake.lock: Update (#9488) 2024-09-15 19:14:23 -07:00
Georgi Gerganov 6262d13e0b common : reimplement logging (#9418)
https://github.com/ggerganov/llama.cpp/pull/9418
2024-09-15 20:46:12 +03:00
slaren e6deac31f7 gguf-split : add basic checks (#9499)
* gguf-split : do not overwrite existing files when merging

* gguf-split : error when too many arguments are passed
2024-09-15 19:02:27 +02:00
Michael Podvitskiy 6988da94a2 cmake : correct order of sycl flags (#9497) 2024-09-15 19:55:52 +03:00
Csaba Kecskemeti 3c7989fd29 py : add "LLaMAForCausalLM" conversion support (#9485)
Co-authored-by: Csaba Kecskemeti <csabakecskemeti@Csabas-Mac-Pro.local>
2024-09-15 10:48:25 +03:00
OSecret d6b37c881f readme : update tools list (#9475)
* Added link to proprietary wrapper for Unity3d into README.md

Wrapper has prebuild library and was tested on iOS, Android, WebGL, PC, Mac platforms, has online demos like [this](https://d23myu0xfn2ttc.cloudfront.net/rich/index.html) and [that](https://d23myu0xfn2ttc.cloudfront.net/).

* Update README.md

Fixes upon review
2024-09-15 10:36:53 +03:00
Michael Podvitskiy 7596487beb cmake : try to fix sycl+intel build (#9487) 2024-09-15 10:06:38 +03:00
Yuri Khrustalev 822b6322de ggml : ggml_type_name return "NONE" for invalid values (#9458)
When running on Windows, the quantization utility attempts to print the types that are not set which leads to a crash.
2024-09-14 12:54:37 +03:00
VoidIsVoid dcdcee3a74 server: add data: [DONE] to /chat/completions stream response (#9459) 2024-09-14 11:36:44 +02:00
Georgi Gerganov 1f4111e540 cmake : use list(APPEND ...) instead of set() + dedup linker (#9463)
* cmake : use list(APPEND ...) instead of set() + dedup linker

ggml-ci

* cmake : try fix sycl

* cmake : try to fix sycl 2

* cmake : fix sycl build (#9469)

* try fix sycl build

* use CMAKE_CXX_FLAGS as a string variable

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* one more CMAKE_CXX_FLAGS fix (#9471)

---------

Co-authored-by: Michael Podvitskiy <podvitskiymichael@gmail.com>
2024-09-14 10:55:05 +03:00
Daniel Bevenius befaf1197f llama : make cell_id const in inp_s_mask block (#9470)
This commit makes the cell_id variable const in the inp_s_mask block.

The motivation for this change is consistency with the code in the
inp_s_copy block.
2024-09-14 10:50:12 +03:00
Xuan Son Nguyen feff4aa846 server : add loading html page while model is loading (#9468)
* Adding loading page for '/' server requests

* set content when model is loading

* removed loading html file

* updated cmakelist

* updated makefile

* cleaned up whitespace

* cleanup for PR removed error

* updated server test to handle 503 HTML

* updated server test to handle 503 HTML

* ca†ch 503 before parsing json

* revert test

* account for both api and web browser requests

* precommit corrections

* eol fix

* revert changes to pre-commit

* removed print statement

* made loading message more descriptive

* also support .html files

---------

Co-authored-by: VJHack <flymyplane21@gmail.com>
Co-authored-by: Vinesh Janarthanan <36610342+VJHack@users.noreply.github.com>
2024-09-13 14:23:11 +02:00
Georgi Gerganov 0abc6a2c25 llama : llama_perf + option to disable timings during decode (#9355)
* llama : llama_perf + option to disable timings during decode

ggml-ci

* common : add llama_arg

* Update src/llama.cpp

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* perf : separate functions in the API

ggml-ci

* perf : safer pointer handling + naming update

ggml-ci

* minor : better local var name

* perf : abort on invalid sampler pointer

ggml-ci

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-09-13 09:53:38 +03:00
Gilad S. bd35cb0ae3 feat: remove a sampler from a chain (#9445)
* feat: remove a sampler from a chain

* fix: return removed sampler

* fix: safer casting
2024-09-13 03:54:49 +02:00
Mathijs Henquet 78203641fe server : Add option to return token pieces in /tokenize endpoint (#9108)
* server : added with_pieces functionality to /tokenize endpoint

* server : Add tokenize with pieces tests to server.feature

* Handle case if tokenizer splits along utf8 continuation bytes

* Add example of token splitting

* Remove trailing ws

* Fix trailing ws

* Maybe fix ci

* maybe this fix windows ci?

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-09-12 22:30:11 +02:00
Dou Xinpeng e6b7801bd1 cann: Add host buffer type for Ascend NPU (#9406)
* feat: Add host buffer type for Ascend NPU(CANN backend)

* fix some checking errors

* Add a few comments
2024-09-12 19:46:43 +08:00
fengerhu1 e665744317 llava : fix the script error in MobileVLM README (#9054)
Signed-off-by: Erhu Feng <2748250768@qq.com>
2024-09-12 14:34:22 +03:00
Xuan Son Nguyen d4c3c10fad lora : raise error if lm_head is ignored (#9103)
* lora : raise error if lm_head is ignored

* fix style

* clarify comment
2024-09-12 14:33:57 +03:00
Michael Podvitskiy 2a825116b6 cmake : fix for builds without GGML_CDEF_PUBLIC (#9338)
* `GGML_TARGET_DEFINES-NOTFOUND` fix for builds without `GGML_CDEF_PUBLIC`

* Update CMakeLists.txt, spaces fix
2024-09-12 14:30:01 +03:00
Huang Qi 4dc4f5f14a ci : update HIP SDK to 24.Q3 (ROCm 6.1) (#9329) 2024-09-12 14:28:43 +03:00
daminho c837981bba py : add Phi-1.5/Phi-2 tokenizer (#9361)
* add phi2 tokenizer

* add phi name to convert_hf_to_gguf_update.py

* make tokenizer_pre consistent; llama.cpp work
2024-09-12 14:28:20 +03:00
Trivikram Kamat 3c26a1644d ci : bump actions/checkout to v4 (#9377) 2024-09-12 14:27:45 +03:00
Michael Podvitskiy ff76e18516 cmake : fixed the order of linking libraries for llama-quantize (#9450) 2024-09-12 14:27:14 +03:00
Molly Sophia 39f852f440 py : add special tokens in hf_converter for RWKV v6 (#9428)
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2024-09-12 14:25:16 +03:00
Ahmad Tameem 2b00fa7997 riscv : modify Makefile and add a RISCV_VECT to print log info (#9442)
- Added ggml_cpu_has_riscv_v() in GGML to print system info in log
- Modified Makefile to only use flag when cross compiling for RISC-V
2024-09-12 14:24:31 +03:00
Georgi Gerganov d6a04f872d ggml : hide ggml_object, ggml_cgraph, ggml_hash_set (#9408)
* ggml : hide ggml_object, ggml_cgraph, ggml_hash_set

ggml-ci

* ggml : add ggml-impl.h to backends

* ggml : fix compiler warnings

ggml-ci

* ggml : add assert upon adding nodes
2024-09-12 14:23:49 +03:00
Neo Zhang Jianyu c9c8575a1a enhance run script to be easy to change the parameters (#9448)
Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>
2024-09-12 17:44:17 +08:00
Xinpeng Dou df4b7945ae cann: Fix error when running a non-exist op (#9424) 2024-09-12 09:02:35 +08:00
Faisal Zaghloul 449ccfb6f5 Add Jais to list of supported models (#9439)
Co-authored-by: fmz <quic_fzaghlou@quic.com>
2024-09-12 02:29:53 +02:00
slaren 1b28061400 llama : skip token bounds check when evaluating embeddings (#9437) 2024-09-11 17:52:13 +02:00
Pavel Zloi 8db003a19d py : support converting local models (#7547)
* Support of converting local models added to convert-hf-to-gguf-update.py

* Description fixed

* shutil added to imports
2024-09-11 15:29:51 +03:00
Xuan Son Nguyen 0996c5597f llava : correct args for minicpmv-cli (#9429) 2024-09-11 12:59:13 +02:00
Xuan Son Nguyen 5bb2c5dbd2 files : remove accidentally added lora_test submodule (#9430) 2024-09-11 13:02:09 +03:00
Farbod Bijary 67155ab7f5 feat: Implements retrying logic for downloading models using --model-url flag (#9255)
* feat: Implements retrying logic for downloading models using --model-url flag

* Update common/common.cpp

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* Update common/common.cpp

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* apply comments

* implements a retry function to avoid duplication

* fix editorconfig

* change function name

---------

Co-authored-by: farbod <farbod.bjary82@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-09-11 11:22:37 +02:00
Johannes Gäßler 5af118efda CUDA: fix --split-mode row race condition (#9413) 2024-09-11 10:22:40 +02:00
Georgi Gerganov d2b496bff4 batched-bench : remove unused code (#9305) 2024-09-11 10:03:54 +03:00
R0CKSTAR b34e023480 musa: remove Clang builtins mapping (#9421)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-09-11 03:46:55 +02:00
Alberto Cabrera Pérez 51b6038636 sycl : update support conditions (#9394)
* sycl : update support condition to im2col

Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>

* Added TODO to remind supporting FP32 im2col

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>
2024-09-11 08:53:42 +08:00
Georgi Gerganov cb9c933eb2 flake.lock: Update (#9360)
Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/af510d4a62d071ea13925ce41c95e3dec816c01d?narHash=sha256-ODYRm8zHfLTH3soTFWE452ydPYz2iTvr9T8ftDMUQ3E%3D' (2024-08-30)
  → 'github:hercules-ci/flake-parts/567b938d64d4b4112ee253b9274472dc3a346eb6?narHash=sha256-%2Bebgonl3NbiKD2UD0x4BszCZQ6sTfL4xioaM49o5B3Y%3D' (2024-09-01)
• Updated input 'flake-parts/nixpkgs-lib':
    'https://github.com/NixOS/nixpkgs/archive/a5d394176e64ab29c852d03346c1fc9b0b7d33eb.tar.gz?narHash=sha256-uFf2QeW7eAHlYXuDktm9c25OxOyCoUOQmh5SZ9amE5Q%3D' (2024-08-01)
  → 'https://github.com/NixOS/nixpkgs/archive/356624c12086a18f2ea2825fed34523d60ccc4e3.tar.gz?narHash=sha256-Ss8QWLXdr2JCBPcYChJhz4xJm%2Bh/xjl4G0c0XlP6a74%3D' (2024-09-01)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/71e91c409d1e654808b2621f28a327acfdad8dc2?narHash=sha256-GnR7/ibgIH1vhoy8cYdmXE6iyZqKqFxQSVkFgosBh6w%3D' (2024-08-28)
  → 'github:NixOS/nixpkgs/574d1eac1c200690e27b8eb4e24887f8df7ac27c?narHash=sha256-v3rIhsJBOMLR8e/RNWxr828tB%2BWywYIoajrZKFM%2B0Gg%3D' (2024-09-06)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-09-10 15:46:59 -07:00
Xuan Son Nguyen 6cd4e03444 arg : bring back missing ifdef (#9411)
* arg : bring back missing ifdef

* replace with llama_supports_gpu_offload
2024-09-10 22:41:29 +02:00
matteo 8d300bd35f enable --special arg for llama-server (#9419)
Co-authored-by: matteo serva <matteo.serva@gmail.com>
2024-09-10 22:40:59 +02:00
slaren 49006c67b4 llama : move random seed generation to the samplers (#9398)
* llama_sampler_penalties : clamp penalty_last_n to zero
2024-09-10 18:04:25 +02:00
Georgi Gerganov 00ba2ff781 metal : fix compile warning with GGML_METAL_NDEBUG (#0) 2024-09-10 10:17:43 +03:00
Daniel Bevenius 83008b7cfe llama : update llm_build_copy_mask_state comment [no ci] (#9385)
This commit updates the comment, which seems to contain a typo or be an
outdated comment, in the copy_mask_state function changing the variable
n_rs to n_kv.

I believe this change is correct and what the comment wants to
convey is to copy the states that are not going to be used in the
upcoming processing, which are the tokens states from n_seqs up to
the number of possible token states n_kv.
2024-09-10 10:03:21 +03:00
Molly Sophia 0b4ac75772 RWKV v6: Add time_mix_decay_w1/w2 in quant exclusion list (#9387)
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2024-09-10 10:02:30 +03:00
slaren fb3f249815 make : do not run llama-gen-docs when building (#9399) 2024-09-10 09:23:33 +03:00
Xuan Son Nguyen bfe76d4a17 common : move arg parser code to arg.cpp (#9388)
* common : move arg parser to arg.cpp

* better categorize args

* add cmake

* missing climits

* missing cstdarg

* common : more explicit includes

* fix build

* refactor gpt_params_parse

* update server readme

* fix test

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-09 23:36:09 +02:00
Radoslav Gerganov 293bebe077 rpc : fix segfault with nkvo (#9389)
* rpc : fix nkvo

* rpc : buf_size must not be static

ref: #9337

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-09-09 18:40:10 +03:00
Prashant Vithule 5fac4d5764 ggml : vector length agnostic SVE support (#9290)
* Implemented vector length agnostic SVE using switch case for 512-bit, 256-bit, 128-bit vector lengths

* Implemented vector length agnostic SVE using switch case for 512-bit, 256-bit, 128-bit vector lengths

* Removed WhiteSpaces

* ggml : style changes + fix 512-bit nb loop check

- fix local scope in switch cases
- consistent predicate names
- empty lines when necessary
- opening braces, spaces
- const-correctness
- add asserts

* Update ggml/src/ggml-quants.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-09 18:37:18 +03:00
slaren 5fb5e24811 llama : minor sampling refactor (2) (#9386) 2024-09-09 17:10:46 +02:00
Georgi Gerganov 38ca6f644b readme : update hot topics 2024-09-09 15:51:37 +03:00
Johannes Gäßler 8e6e2fbe14 CUDA: fix variable name conflict for Windows build (#9382) 2024-09-09 14:22:53 +02:00
Antonis Makropoulos 5ed087573e readme : add LLMUnity to UI projects (#9381)
* add LLMUnity to UI projects

* add newline to examples/rpc/README.md to fix editorconfig-checker unit test
2024-09-09 14:21:38 +03:00
Radoslav Gerganov 54f376d0b9 rpc : update README [no ci] (#9320)
Update README with instructions how to offload model layers to both
local and remote devices
2024-09-09 11:04:39 +03:00
Dan Johansson b2e89a3274 Arm AArch64: Documentation updates (#9321)
* Arm AArch64: Documentation updates

* Update docs/build.md to include information on how to enable the Arm optimized gemm/gemv kernels

* Update examples/quantize/README.md with information on the Q4_0_4_4, Q4_0_4_8 and Q4_0_8_8 formats

* Add newline to the end of docs/build.md
2024-09-09 10:02:45 +03:00
Markus Tavenrath daa9623ab0 Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early. (#9118)
* Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early.

* fix compile issues

* Fix issues where the last submit wasn't executed or handled properly.

* remove trailing whitespace

* Repair GGML_VULKAN_CHECK_RESULTS

* Increase submit counter only if actual work has been submitted and increase submit count to 100.

* Fix some nodes are not checked with GGML_VULKAN_CHECK_RESULTS enabled.
2024-09-08 21:43:48 +02:00
Georgi Gerganov e079bffb66 cuda : fix FA Q src index (1 -> 0) (#9374) 2024-09-08 22:01:02 +03:00
Xuan Son Nguyen 3f7ccfd649 common : bring back missing args, add env var duplication check (#9375)
* common : bring back missing args

* move duplication check to test-arg-parser

* add check for duplicated env var

* correct default values
2024-09-08 18:08:55 +02:00
slaren a249843d89 common : restore --n-gpu-layers (#9371) 2024-09-08 16:44:42 +02:00
slaren 19f4a7b296 llama : refactor samplers internal implementation (#9370) 2024-09-08 15:52:07 +02:00
Neo Zhang Jianyu 2a358fb0c4 [SYCL] add check malloc result on device (#9346)
* add check malloc result on device

* update for review comments, check all malloc_device() result

---------

Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>
2024-09-08 19:05:29 +08:00
slaren eae597182c llama : sanitize tokens in the upper bound (#9359) 2024-09-08 12:41:51 +02:00
Xuan Son Nguyen 00b02bb249 imatrix : fix arg parser for imatrix (#9366)
* imatrix : fix arg parser

* beautify printing first arg
2024-09-08 12:12:17 +02:00
Georgi Gerganov a876861455 metal : update support condition for im2col + fix warning (#0) 2024-09-08 11:05:55 +03:00
Georgi Gerganov 385decbd63 sync : ggml 2024-09-08 11:05:55 +03:00
Georgi Gerganov 60a3107ccd scripts : option to increase git patch context 2024-09-08 11:05:55 +03:00
Salvatore Mesoraca 406c1a32a1 vulkan: add dryrun support to sin and cos ops (ggml/947)
sin and cos failed test-backend-ops because they
tried to dereference a context pointer that is null
on dry runs.

This commit prevents that segfault.

Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
2024-09-08 11:05:55 +03:00
Salvatore Mesoraca 9cb9260861 vulkan: correctly report support for OP_CONT (ggml/946)
test-backend-ops fails because ggml_cont aborts
when invoked passing an unsupported type.

This commit makes ggml_cont tests pass

Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
2024-09-08 11:05:55 +03:00
Johannes Gäßler 202084d31d tests: add gradient tests for all backends (ggml/932)
* tests: add gradient checking to test-backend-ops

* remove old comment

* reorder includes

* adjust SIN/COS parameters

* add documentation, use supports_op if possible
2024-09-08 11:05:55 +03:00
Johannes Gäßler dbbebcab33 ggml: fix ggml_graph_cpy undefined behavior (ggml/943) 2024-09-08 11:05:55 +03:00
Georgi Gerganov ba1cf846ed cann : fix doxy (ggml/0) 2024-09-08 11:05:55 +03:00
Mengqing Cao d2d3200b38 cann : add Ascend NPU support (whisper/2336)
* enable Ascend NPU in src/whisper.cpp
  * sync test-backend-ops with llama.cpp
2024-09-08 11:05:55 +03:00
Georgi Gerganov 51d964a4ef cuda : mark BF16 CONT as unsupported 2024-09-08 11:05:55 +03:00
Salvatore Mesoraca efe6a83e30 ggml : fix cont with transposed tensors when one dimension is 1 (ggml/934)
* ggml_cont: fix issue with transposed tensors when one dimension is 1

when using multiple threads, it is not enough
to check for the tensors to be contiguous for
ggml_compute_forward_dup_same_cont to work correctly.
The tensors strides also need to match.

Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>

* Add ggml_cont tests

Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>

* Remove dead code

it isn't possible to reach this code because
all these functions are invoked by ggml_compute_forward_dup
if and only if src0->type != dst->type

Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>

* Make ggml_compute_forward_dup_same_cont work with contiguous tensors

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>

---------

Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-08 11:05:55 +03:00
Kevin Gibbons fbb7fcffbc llama : set attrs of mislabelled EOT/EOM tokens (#9348) 2024-09-08 08:51:00 +03:00
Georgi Gerganov a5b5d9a101 llama.android : fix build (#9350) 2024-09-08 00:33:50 +03:00
Georgi Gerganov f12295b8a9 llama : fix empty ring buffer push (#9358) 2024-09-08 00:33:33 +03:00
Georgi Gerganov faf69d4237 llama : sanitize invalid tokens (#9357)
* common : do not add null tokens during warmup

ggml-ci

* llama : check that the input tokens are valid

ggml-ci

* tests : fix batch size of bert model

ggml-ci
2024-09-08 00:33:13 +03:00
Eve e536426ded llamafile : disable sgemm for batch-size 1 (#9330) 2024-09-07 22:02:26 +03:00
Xuan Son Nguyen 1b9ae5189c common : refactor arg parser (#9308)
* (wip) argparser v3

* migrated

* add test

* handle env

* fix linux build

* add export-docs example

* fix build (2)

* skip build test-arg-parser on windows

* update server docs

* bring back missing --alias

* bring back --n-predict

* clarify test-arg-parser

* small correction

* add comments

* fix args with 2 values

* refine example-specific args

* no more lamba capture

Co-authored-by: slaren@users.noreply.github.com

* params.sparams

* optimize more

* export-docs --> gen-docs
2024-09-07 20:43:51 +02:00
slaren e32d0816ed ggml : always check bounds on get_rows operations (#9354) 2024-09-07 20:23:07 +02:00
Georgi Gerganov df270ef745 llama : refactor sampling v2 (#9294)
- Add `struct llama_sampler` and `struct llama_sampler_i`
- Add `llama_sampler_` API
- Add `llama_sampler_chain_` API for chaining multiple samplers
- Remove `LLAMA_API_INTERNAL`
- Add `llama_perf_` API and remove old `llama_print_timings` and `llama_reset_timings`
2024-09-07 15:16:19 +03:00
Xuan Son Nguyen 947538acb8 ggml : fix missing cpu_set_t on emscripten (#9336)
* ggml : fix missing cpu_set_t on emscripten

* better version

* bring back android part
2024-09-07 12:01:34 +02:00
slaren 6c89eb0b47 ci : disable rocm image creation (#9340) 2024-09-07 10:48:54 +03:00
Xuan Son Nguyen 9b2c24c099 server : simplify state machine for slot (#9283)
* server : simplify state machine for slot

* add SLOT_STATE_DONE_PROMPT

* pop_deferred_task

* add missing notify_one

* fix passkey test

* metrics : add n_busy_slots_per_decode

* fix test step

* add test

* maybe fix AddressSanitizer?

* fix deque ?

* missing lock

* pop_deferred_task: also notify

* Update examples/server/server.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-06 23:21:29 +02:00
Aarni Koskela 134bc38ecf llama-bench : log benchmark progress (#9287)
* llama-bench : add optional progress messages
2024-09-06 23:03:01 +02:00
Aarni Koskela 815b1fb20a batched-bench : add --output-format jsonl option (#9293)
`--output-format` is modeled after `llama-bench`'s options
2024-09-06 17:59:58 +02:00
Changyeon Kim 409dc4f8bb ggml : fix build break for the vulkan-debug (#9265)
- windows build : Ok.
- linux build : Ok.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
2024-09-06 15:54:50 +03:00
Xuan Son Nguyen 4a1411b4f1 server : fix missing lock (#9334) 2024-09-06 14:06:04 +02:00
Markus Tavenrath 8ebe8ddebd Improve Vulkan shader build system (#9239)
* Improve Vulkan shader builds system

- Add dependency to vulkan-shaders-gen to rebuild shaders when changing the shader compilation utility.
- Add option to generate debug info for Vulkan shaders to provide shader source to Vulkan shader profiling tools

* remove not required self dependency
2024-09-06 08:56:17 +02:00
compilade 9bc6db28d0 ggml-quants : ternary packing for TriLMs and BitNet b1.58 (#8151)
* ggml-quants : 1.625 bpw ternary packing for BitNet 1.58b

* ggml-quants : faster 1.625 bpw AVX2 vec_dot

Not using a lookup table anymore makes it match q4_0 speed.

* gguf-py : fix formatting

* llama : remove spaces on empty line

* ggml-quants : subtract 1 when back in epi8

This makes the 1.625 bpw type go faster than q4_0. Still not the fastest.

* ggml-quants : Q2_2 now faster than Q4_K on with AVX2

* ggml-quants : cleanup Q1_3 code formatting

* ggml-quants : ARM NEON vec_dot for q2_2 and q1_3

* ggml-quants : use ceiling division when quantizing q1_3

* convert-hf : simplify BitNet pre-quantization

This still results in the exact same tensor weights and scales,
but it reveals some weirdness in the current algorithm.

* convert-hf : allow converting the weird BitNet 1.3B

Its FFN size is 5460 which is not convenient.
The offending tensors are kept in F16,
which makes the final model 5.01 bpw.

* bitnet : replace 1.58b with b1.58, as in the paper

* ggml-quants : fix build failure on Windows

* ggml-quants : attempt to fix Arm 32-bit support

* ggml : add some informative comments in q1_3 vec_dot

* ggml : add TQ1_0 and TQ2_0 ternary quantization types

* ggml : even faster TQ2_0

* ggml : also faster TQ1_0

Same optimization as for TQ2_0 by offsetting the sum instead of the weights.
This makes TQ1_0 almost as fast as Q8_0 on AVX2.

* ggml : fix build issues in certain environments

* ggml : add NEON vec_dot implementation for TQ1_0 and TQ2_0

* ggml : avoid directly using vmlal_high_s8, for 32-bit ARM compat

The compiler seems smart enough to use the same instruction
even when using vget_high_s8 instead.

* ggml : remove q1_3 and q2_2

No more 1.625 bpw and 2.000 bpw,
now instead using 1.6875 bpw and 2.0625 bpw
with TQ1_0 and TQ2_0, respectively.

* llama : remove the separate scale tensors of BitNet b1.58

They won't be needed, since the remaining ternary quant types have
built-in scales.

* ggml-quants : rename fields of TQ1_0 and TQ2_0 structs for consistency

* ggml-quants : allow using vdotq_s32 in TQ2_0 vec_dot

Not yet tested on hardware which supports it,
might not work or might not even compile. But also it might.
It should make the performance better on recent ARM CPUs.

* ggml-quants : remove comment about possible format change of TQ2_0

Making it slightly more convenient for AVX512
but less convenient for everything else is not worth the trouble.

* gguf-py : Numpy (de)quantization for TQ1_0 and TQ2_0

* ggml-quants : use roundf instead of nearest_int for TQ1_0 and TQ2_0

This does not change anything for ternary models,
since their values should never end up being in halfway cases anyway.

* convert : allow direct conversion to TQ1_0 and TQ2_0

The token embeddings and output tensors are kept in F16
to allow quantizing them to Q4_K and Q6_K with llama-quantize.

* llama : handle fallback for TQ1_0 and TQ2_0 with Q4_0

Q4_0 is not completely symmetric (so not lossless for ternary models),
but it should be good enough.

* ggml-quants : allow using ARM dot product instructions for TQ1_0

* ggml-quants : deduplicate TQ1_0 and TQ2_0 __ARM_FEATURE_DOTPROD support

* ggml : remove unused ggml_mul special case

It would otherwise conflict with the more general
optimization coming with Mamba-2.

* ggml : handle TQ1_0 and TQ2_0 in dequantization-based operators

* test-backend-ops : add TQ1_0 and TQ2_0 comments for later

Not yet adding uncommented, because some backends like SYCL and Metal
do not properly handle unknown types in supports_op for GGML_OP_MUL_MAT.
(and Metal also doesn't handle it with GGML_OP_GET_ROWS)
Support for TQ1_0 and TQ2_0 for other backends than CPU
will be added in follow-up pull requests.
2024-09-05 21:48:47 -04:00
awatuna 32b2ec88bc Update build.yml (#9184)
build rpc-server for windows cuda
2024-09-06 00:34:36 +02:00
Michael Podvitskiy 1031771faa CMake fix: host for msvc compiler can only be x86 or x64 (#8624) 2024-09-06 00:14:12 +02:00
slaren 4db04784f9 cuda : fix defrag with quantized KV (#9319) 2024-09-05 11:13:11 +02:00
slaren bdf314f38a llama-bench : fix NUL terminators in CPU name (#9313) 2024-09-05 02:19:39 +02:00
Srihari-mcw 581c305186 ggml : AVX2 support for Q4_0_8_8 (#8713)
* Add AVX2 based implementations for quantize_q8_0_4x8, ggml_gemv_q4_0_8x8_q8_0 and ggml_gemm_q4_0_8x8_q8_0 functions

* Update code to fix issues occuring due to non alignment of elements to be processed as multiple of 16 in MSVC

* Update comments and indentation

* Make updates to reduce number of load instructions
2024-09-04 19:51:22 +03:00
Ouadie EL FAROUKI 5910ea9427 [SYCL] Fix DMMV dequantization (#9279)
Fixed dmmv dequant for ncols== GGML_SYCL_DMMV_X
2024-09-04 16:26:33 +01:00
杨朱 · Kiki c8671ae282 Fix broken links in docker.md (#9306) 2024-09-04 13:45:28 +02:00
Radoslav Gerganov 82e3b03c11 rpc : make RPC servers come first in the device list (#9296)
* rpc : make RPC servers come first in the device list

* rpc : disable options for non-RPC builds

* rpc : rpc_count always zero for non-RPC builds
2024-09-04 11:08:32 +03:00
Pascal Patry 9379d3cc17 readme : rename result_format to response_format (#9300) 2024-09-04 09:45:40 +03:00
Georgi Gerganov 7605ae7daf flake.lock: Update (#9261)
Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/8471fe90ad337a8074e957b69ca4d0089218391d?narHash=sha256-XOQkdLafnb/p9ij77byFQjDf5m5QYl9b2REiVClC%2Bx4%3D' (2024-08-01)
  → 'github:hercules-ci/flake-parts/af510d4a62d071ea13925ce41c95e3dec816c01d?narHash=sha256-ODYRm8zHfLTH3soTFWE452ydPYz2iTvr9T8ftDMUQ3E%3D' (2024-08-30)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/c374d94f1536013ca8e92341b540eba4c22f9c62?narHash=sha256-Z/ELQhrSd7bMzTO8r7NZgi9g5emh%2BaRKoCdaAv5fiO0%3D' (2024-08-21)
  → 'github:NixOS/nixpkgs/71e91c409d1e654808b2621f28a327acfdad8dc2?narHash=sha256-GnR7/ibgIH1vhoy8cYdmXE6iyZqKqFxQSVkFgosBh6w%3D' (2024-08-28)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-09-03 16:36:43 -07:00
Aarni Koskela 8962422b1c llama-bench : add JSONL (NDJSON) output mode (#9288)
* llama-bench : add JSONL (NDJSON) output mode

* llama-bench : update usage docs
2024-09-03 19:58:54 +02:00
Georgi Gerganov b69a480af4 readme : refactor API section + remove old hot topics 2024-09-03 10:00:36 +03:00
Xuan Son Nguyen 48baa61ecc server : test script : add timeout for all requests (#9282) 2024-09-02 22:08:38 +02:00
Zhenwei Jin f1485161e5 src: make tail invalid when kv cell is intersection for mamba (#9249) 2024-09-02 13:53:23 -04:00
slaren 048de848ee docker : fix missing binaries in full-cuda image (#9278) 2024-09-02 18:11:13 +02:00
yuri@FreeBSD f771d064a9 ggml : add pthread includes on FreeBSD (#9258) 2024-09-02 18:25:30 +03:00
Xuan Son Nguyen 6e7d133a5f server : refactor multitask handling (#9274)
* server : remove multitask from server_task

* refactor completions handler

* fix embeddings

* use res_ok everywhere

* small change for handle_slots_action

* use unordered_set everywhere

* (try) fix test

* no more "mutable" lambda

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* use deque

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-02 17:11:51 +02:00
Guoliang Hua b60074f1c2 llama-cli : remove duplicated log message (#9275) 2024-09-02 15:36:43 +03:00
Tushar 9c1ba55733 build(nix): Package gguf-py (#5664)
* style: format with nixfmt/rfc101-style

* build(nix): Package gguf-py

* build(nix): Refactor to new scope for gguf-py

* build(nix): Exclude gguf-py from devShells

* build(nix): Refactor gguf-py derivation to take in exact deps

* build(nix): Enable pytestCheckHook and pythonImportsCheck for gguf-py

* build(python): Package python scripts with pyproject.toml

* chore: Cleanup

* dev(nix): Break up python/C devShells

* build(python): Relax pytorch version constraint

Nix has an older version

* chore: Move cmake to nativeBuildInputs for devShell

* fmt: Reconcile formatting with rebase

* style: nix fmt

* cleanup: Remove unncessary __init__.py

* chore: Suggestions from review

- Filter out non-source files from llama-scripts flake derivation
- Clean up unused closure
- Remove scripts devShell

* revert: Bad changes

* dev: Simplify devShells, restore the -extra devShell

* build(nix): Add pyyaml for gguf-py

* chore: Remove some unused bindings

* dev: Add tiktoken to -extra devShells
2024-09-02 14:21:01 +03:00
Georgi Gerganov c6d4cb4655 llama : minor style 2024-09-02 11:52:37 +03:00
Molly Sophia 8f1d81a0b6 llama : support RWKV v6 models (#8980)
* convert_hf_to_gguf: Add support for RWKV v6

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Add RWKV tokenization

* Fix build

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Do not use special tokens when matching in RWKV tokenizer

* Fix model loading

* Add (broken) placeholder graph builder for RWKV

* Add workaround for kv cache

* Add logits conversion to rwkv5

* Add rwkv5 layer norms

* Add time mix KVRG & correct merge mistake

* Add remaining time mix parameters

* Add time mix output loading

* Add placeholder llm_build_time_mix

* Fix build

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Load more tensors for rwkv v6

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Fix rwkv tokenizer

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* ggml: Add unary operator Exp

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* RWKV v6 graph building

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Add ``rescale_every_n_layers`` parameter

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Add ``wkv.head_size`` key for RWKV

so it doesn't reuse Mamba ssm parameters

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Fix offloading layers to CUDA

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Fix parallel inferencing for RWKV

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Remove trailing whitespaces

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* build_rwkv: Avoid using inplace operations

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* convert_hf_to_gguf: rwkv: Avoid using ``eval``

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* convert_hf_to_gguf: rwkv tokenizer: Don't escape sequences manually

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <git@compilade.net>

* ggml: Add backward computation for unary op ``exp``

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <git@compilade.net>

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <git@compilade.net>

* Use MODEL_ARCH.RWKV6 instead of MODEL_ARCH.RWKV

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* build_rwkv6: Simplify graph

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: rwkv6: Detect model.type

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: rwkv6: Fix tensor loading for 7B/14B models

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: rwkv6: Fix group_norm assertion failure with Metal

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: rwkv6: Clean up

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: rwkv6: Add quantization tensor exclusion

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: rwkv6: Use the new advanced batch splits

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* llama: rwkv6: Use ``ggml_norm`` instead of ``ggml_group_norm``

Co-authored-by: compilade <git@compilade.net>

* llama: rwkv6: Apply code style and misc changes

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* converter: Use class name ``Rwkv6Model``

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: rwkv6: Make use of key ``feed_forward_length``

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: rwkv6: Add kv ``time_mix_extra_dim`` and ``time_decay_extra_dim``

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* converter: Match ``new_name`` instead of ``name`` for float32 explicit tensors

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: rwkv6: Keep ``time_mix_w1/w2`` as F32

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: rwkv6: Remove unused nodes

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: rwkv6: Apply code format changes

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: rwkv6: Add lora for some supported tensors

Currently att.key/receptance/value/gate/output, ffn.receptance/key/value, as well as head.weight

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* rwkv : speed-up tokenization using trie

* minor : style + indentation

* llama: rwkv6: Avoid division by zero

Co-authored-by: compilade <git@compilade.net>

* ggml: rwkv_wkv: Avoid copying the state

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Layl Bongers <3094382+LaylBongers@users.noreply.github.com>
Co-authored-by: compilade <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-01 17:38:17 +03:00
Echo Nolan a47667cff4 nix: fix CUDA build - replace deprecated autoAddOpenGLRunpathHook
The CUDA nix build broke when we updated nixpkgs in
8cd1bcfd3f. As far as I can tell all
that happened is cudaPackages.autoAddOpenGLRunpathHook got moved to
pkgs.autoAddDriverRunpath. This commit fixes it.
2024-08-31 08:44:21 +00:00
Srihari-mcw ea5d7478b1 sgemm : improved Q4_0 and Q8_0 performance via 4xN and Mx4 gemm (#8908) 2024-08-31 11:20:35 +03:00
Daniel Bevenius 49271efbaf llama : fix typo in xcda_array_view comment [no ci] (#9132) 2024-08-31 10:50:22 +03:00
Sutou Kouhei 0ab30f8d82 llama : fix llama_split_mode enum values in main_gpu document (#9057)
LLAMA_SPLIT_* were renamed to LLAMA_SPLIT_MODE_* in #5697.
2024-08-30 20:08:10 +02:00
蕭澧邦 cddae4884c Correct typo run_llama2.sh > run-llama2.sh (#9149) 2024-08-30 22:10:01 +10:00
tc-mb 7ea8d80d53 llava : the function "clip" should be int (#9237) 2024-08-30 07:21:57 +02:00
Faisal Zaghloul 42c76d1358 Threadpool: take 2 (#8672)
* Introduce ggml_compute_threadpool

- OpenMP functional: check
- Vanilla ggml functional: Check
- ggml w/threadpool functional: Check
- OpenMP no regression: No glaring problems
- Vanilla ggml no regression: No glaring problems
- ggml w/threadpool no regression: No glaring problems

* Minor fixes

* fixed use after release bug

* fixed a harmless race condition

* Fix Android bulid issue

* fix more race conditions

* fix deadlock for cases where cgraph.n_nodes == 1

and fix --poll case

* threadpool: use cpu_get_num_math to set the default number of threadpool threads

This way we avoid using E-Cores and Hyperthreaded siblings.

* bench: create fresh threadpool for each test

For benchmarking it's better to start a fresh pool for each test with the exact number of threads
needed for that test. Having larger pools is suboptimal (causes more load, etc).

* atomics: always use stdatomics with clang and use relaxed memory order when polling in ggml_barrier

This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.

* threadpool: make polling the default to match openmp behavior

All command line args now allow for setting poll to 0 (false).

* threadpool: do not wakeup threads in already paused threadpool

* fix potential race condition in check_for_work

* threadpool: do not create two threadpools if their params are identical

* threadpool: reduce pause/resume/wakeup overhead in common cases

We now start threadpool in paused state only if we have two.
The resume is now implicit (ie new work) which allows for reduced locking and context-switch overhead.

* threadpool: add support for hybrid polling

poll params (--poll, ...) now specify "polling level", i.e. how aggresively we poll before waiting on cond.var.
poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, ...

The default value of 50 (ie 50x128K rounds) seems like a decent default across modern platforms.
We can tune this further as things evolve.

* threadpool: reduce the number of barrier required

New work is now indicated with an atomic counter that is incremented for
each new graph that needs to be computed.
This removes the need for extra barrier for clearing the "new_work" and
removes the special case for trivial graphs.

* threadpool: remove special-casing for disposable threadpools

With the efficient hybrid polling there is no need to make disposable pools any different.
This simplifies the overall logic and reduces branching.

Include n_threads in debug print for disposable threadpool.

Declare pause and stop flags as atomic_bool
This doesn't actually generate any memory barriers and simply informs
the thread sanitizer that these flags can be written & read by different
threads without locking.

* threadpool: do not clear barrier counters between graphs computes (fixes race with small graphs)

This fixes the race condition with very small graphs where the main thread happens to
start a new graph while the workers are just about to exit from barriers.

* threadpool: use relaxed order for chunk sync

Full memory barrier is an overkill for this since each thread works on different chunk

* threadpool: remove abort_callback from threadpool state

* threadpool: better naming for thread/cpumask releated functions

* threadpool: consistent use of int type for n_threads params

* threadpool: add support for ggml_threadpool_params_default/init

Also removes the need for explicit mask_specified param.
all-zero cpumask means use default (usually inherited) cpu affinity mask.

* threadpool: move typedef into ggml.h

* threadpool: fix apply_priority() function name

* threadpool: fix swift wrapper errors due to n_threads int type cleanup

* threadpool: enable --cpu-mask and other threadpool related options only if threadpool is enabled

* threadpool: replace checks for compute_thread ret code with proper status check

* threadpool: simplify threadpool init logic and fix main thread affinity application

Most of the init code is now exactly the same between threadpool and openmp.

* threadpool: update threadpool resume/pause function names

* threadpool: enable openmp by default for now

* threadpool: don't forget to free workers state when omp is enabled

* threadpool: avoid updating process priority on the platforms that do not require it

On Windows we need to change overall process priority class in order to set thread priorities,
but on Linux, Mac, etc we do not need to touch the overall process settings.

* threadpool: update calling thread prio and affinity only at start/resume

This avoids extra syscalls for each graph_compute()

* llama-bench: turn threadpool params into vectors, add output headers, etc

* llama-bench: add support for cool off between tests --delay

This helps for long running tests on platforms that are thermally limited (phones, laptops, etc).
--delay (disabled by default) introduces the sleep for N seconds before starting each test.

* threadpool: move process priority setting into the apps (bench and cli)

This avoids changing the overall process priority on Windows for the apps
that use ggml/llama.cpp directy.

* threadpool: move all pause/resume logic into ggml

* threadpool: futher api cleanup and prep for future refactoring

All threadpool related functions and structs use ggml_threadpool prefix.

* threadpool: minor indent fixes

* threadpool: improve setprioty error message

* Update examples/llama-bench/llama-bench.cpp

Co-authored-by: slaren <slarengh@gmail.com>

* threadpool: fix indent in set_threadpool call

* use int32_t for n_thread type in public llama.cpp API

* threadpool: use _new and _free instead of _create and _release

* fix two more public APIs to use int32_t for n_threads

* build: set _GNU_SOURCE for Adroid

---------

Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>
Co-authored-by: fmz <quic_fzaghlou@quic.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-08-30 01:20:53 +02:00
Jan Boon 9f7d4bcf5c server : fix crash when error handler dumps invalid utf-8 json (#9195) 2024-08-30 07:15:26 +08:00
Georgi Gerganov 1d1ccce676 flake.lock: Update (#9162)
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/c3aa7b8938b17aebd2deecf7be0636000d62a2b9?narHash=sha256-med8%2B5DSWa2UnOqtdICndjDAEjxr5D7zaIiK4pn0Q7c%3D' (2024-08-14)
  → 'github:NixOS/nixpkgs/c374d94f1536013ca8e92341b540eba4c22f9c62?narHash=sha256-Z/ELQhrSd7bMzTO8r7NZgi9g5emh%2BaRKoCdaAv5fiO0%3D' (2024-08-21)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-08-28 21:28:14 -07:00
slaren 9fe94ccac9 docker : build images only once (#9225) 2024-08-28 17:28:00 +02:00
slaren 66b039a501 docker : update CUDA images (#9213) 2024-08-28 13:20:36 +02:00
Georgi Gerganov 20f1789dfb vulkan : fix build (#0)
ggml-ci
2024-08-27 22:41:27 +03:00
Georgi Gerganov 231cff5f6f sync : ggml 2024-08-27 22:41:27 +03:00
Xie Yanbo 3246fe84d7 Fix minicpm example directory (#9111) 2024-08-27 14:33:08 +02:00
compilade 78eb487bb0 llama : fix qs.n_attention_wv for DeepSeek-V2 (#9156) 2024-08-27 13:09:23 +03:00
Xuan Son Nguyen a77feb5d71 server : add some missing env variables (#9116)
* server : add some missing env variables

* add LLAMA_ARG_HOST to server dockerfile

* also add LLAMA_ARG_CONT_BATCHING
2024-08-27 11:07:01 +02:00
CausalLM 2e59d61c1b llama : fix ChatGLM4 wrong shape (#9194)
This should fix THUDM/glm-4-9b-chat-1m and CausalLM/miniG
2024-08-27 09:58:22 +03:00
Carsten Kragelund Jørgensen 75e1dbbaab llama : fix llama3.1 rope_freqs not respecting custom head_dim (#9141)
* fix: llama3.1 rope_freqs not respecting custom head_dim

* fix: use potential head_dim for Exaone
2024-08-27 09:53:40 +03:00
arch-btw ad76569f8e common : Update stb_image.h to latest version (#9161)
* Update stb_image.h to latest version

Fixes https://github.com/ggerganov/llama.cpp/issues/7431

* Update .ecrc
2024-08-27 08:58:50 +03:00
slaren 7d787ed96c ggml : do not crash when quantizing q4_x_x with an imatrix (#9192) 2024-08-26 19:44:43 +02:00
Georgi Gerganov 06658ad7c3 metal : separate scale and mask from QKT in FA kernel (#9189)
* metal : separate scale and mask from QKT in FA kernel

* metal : ne01 check no longer necessary

* metal : keep data in local memory
2024-08-26 18:31:02 +03:00
Georgi Gerganov fc18425b6a ggml : add SSM Metal kernels (#8546)
* ggml : add ggml_ssm_conv metal impl

* ggml : add ssm_scan metal impl

ggml-ci
2024-08-26 17:55:36 +03:00
Georgi Gerganov 879275ac98 tests : fix compile warnings for unreachable code (#9185)
ggml-ci
2024-08-26 16:30:25 +03:00
Georgi Gerganov 7a3df798fc ci : add VULKAN support to ggml-ci (#9055) 2024-08-26 12:19:39 +03:00
Georgi Gerganov e5edb210cd server : update deps (#9183) 2024-08-26 12:16:57 +03:00
slaren 0c41e03ceb metal : gemma2 flash attention support (#9159) 2024-08-26 11:08:59 +02:00
slaren f12ceaca0c ggml-ci : try to improve build time (#9160) 2024-08-26 11:03:30 +02:00
Justine Tunney 436787f170 llama : fix time complexity of string replacement (#9163)
This change fixes a bug where replacing text in a very long string could
cause llama.cpp to hang indefinitely. This is because the algorithm used
was quadratic, due to memmove() when s.replace() is called in a loop. It
seems most search results and LLM responses actually provide the O(n**2)
algorithm, which is a great tragedy. Using a builder string fixes things
2024-08-26 09:09:53 +03:00
Herman Semenov 93bc3839f9 common: fixed not working find argument --n-gpu-layers-draft (#9175) 2024-08-26 00:54:37 +02:00
Johannes Gäßler f91fc5639b CUDA: fix Gemma 2 numerical issues for FA (#9166) 2024-08-25 22:11:48 +02:00
Johannes Gäßler e11bd856d5 CPU/CUDA: Gemma 2 FlashAttention support (#8542)
* CPU/CUDA: Gemma 2 FlashAttention support

* apply logit_softcap to scale in kernel

* disable logit softcapping tests on Metal

* remove metal check
2024-08-24 21:34:59 +02:00
João Dinis Ferreira 8f824ffe8e quantize : fix typo in usage help of quantize.cpp (#9145) 2024-08-24 09:22:45 +03:00
Xuan Son Nguyen 3ba780e2a8 lora : fix llama conversion script with ROPE_FREQS (#9117) 2024-08-23 12:58:53 +02:00
piDack a07c32ea54 llama : use F32 precision in GLM4 attention and no FA (#9130) 2024-08-23 10:27:17 +03:00
Akarshan Biswas 11b84eb457 [SYCL] Add a space to supress a cmake warning (#9133) 2024-08-22 22:09:47 +08:00
luoyu-intel 1731d4238f [SYCL] Add oneDNN primitive support (#9091)
* add onednn

* add sycl_f16

* add dnnl stream

* add engine map

* use dnnl for intel only

* use fp16fp16fp16

* update doc
2024-08-22 12:50:10 +08:00
compilade a1631e53f6 llama : simplify Mamba with advanced batch splits (#8526)
* llama : advanced batch splits

This includes equal-sequence-length batch splits which are useful
to simplify recurrent model operators.

* llama : always make recurrent state slots contiguous

* ggml : simplify mamba operators

* llama : fix integer signedness mixing

* llama : logits_all has priority over batch->logits

Otherwise, the server embeddings tests failed.
This was likely an existing problem but was only detected here
because of an additional assertion.

* llama : apply suggestions

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* llama : fix t5 segfault

* llama : fix Mamba session save and restore

* llama : minor cosmetic changes

* llama : rename llama_reorder_outputs to llama_output_reorder

Also move it closer to llama_output_reserve.

* llama : fix pooled embeddings when using batches with equal_seqs

* minor : add struct members for clarity

ggml-ci

* llama : fix T5 segfault again

* llama : fix Mamba pooled embeddings with multiple sequences

Until the pooled embeddings are refactored to allow splitting
across ubatches for causal embeddings,
recurrent models can only process a single sequence per ubatch
when calculating pooled embeddings.

* llama : add llama_model_is_recurrent to simplify figuring that out

This will make it easier to more cleanly support RWKV-v6 and Mamba-2.

* llama : fix simple splits when the batch contains embeddings

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-08-21 17:58:11 -04:00
Xuan Son Nguyen fc54ef0d1c server : support reading arguments from environment variables (#9105)
* server : support reading arguments from environment variables

* add -fa and -dt

* readme : specify non-arg env var
2024-08-21 11:04:34 +02:00
Younes Belkada b40eb84895 llama : support for falcon-mamba architecture (#9074)
* feat: initial support for llama.cpp

* fix: lint

* refactor: better refactor

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* fix: address comments

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <git@compilade.net>

* fix: add more cleanup and harmonization

* fix: lint

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <git@compilade.net>

* fix: change name

* Apply suggestions from code review

Co-authored-by: compilade <git@compilade.net>

* add in operator

* fix: add `dt_b_c_rms` in `llm_load_print_meta`

* fix: correct printf format for bool

* fix: correct print format

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* llama : quantize more Mamba tensors

* llama : use f16 as the fallback of fallback quant types

---------

Co-authored-by: compilade <git@compilade.net>
2024-08-21 11:06:36 +03:00
fairydreaming f63f603c87 llava : zero-initialize clip_ctx structure fields with aggregate initialization 908)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-08-21 09:45:49 +02:00
Daniel Bevenius 8455340b87 llama : std::move llm_bigram_bpe from work_queue (#9062)
* llama : std::move llm_bigram_bpe from work_queue

This commit updates the retrieval of llm_bigram_bpe objects from
work_queue.top() by using std::move.

The motivation for this is to avoid the copying of the std::string
`text` member of the llm_bigram_bpe struct.

* squash! llama : std::move llm_bigram_bpe from work_queue

Introduced a MovablePriorityQueue class to allow moving elements
out of the priority queue for llm_bigram_bpe.

* squash! llama : std::move llm_bigram_bpe from work_queue

Rename MovablePriorityQueue to lama_priority_queue.

* squash! llama : std::move llm_bigram_bpe from work_queue

Rename lama_priority_queue -> llama_priority_queue.
2024-08-21 10:32:58 +03:00
Changyeon Kim 2f3c1466ff llava: Add ACC OP for GPU acceleration to the Vulkan backend in the LLAVA CLIP model. (#8984)
* llava: Add ACC OP for GPU acceleration to the Vulkan backend in the LLAVA CLIP model.

- The CLIP model now prioritizes the Vulkan backend over the CPU when vulkan available.
- A GGML_OP_ACC shader has been added.
- The encoding performance of the CLIP model improved from 4.2s on the CPU to 0.9s on the GPU.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

* fix-up coding style.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

* Fix-up the missing initial parameter to resolve the compilation warning.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

* [fix] Add missing parameters.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

* [fix] Use nb1 and nb2 for dst.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

* Fix check results ggml_acc call

---------

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
Co-authored-by: 0cc4m <picard12@live.de>
2024-08-20 21:00:00 +02:00
Meng, Hengyu 50addec9a5 [SYCL] fallback mmvq (#9088)
* fallback mmvq to mul_mat

* mmvq in cuda path

* Update ggml/src/ggml-sycl.cpp

Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@codeplay.com>

---------

Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@codeplay.com>
2024-08-20 23:50:17 +08:00
zhentaoyu 4f8d19ff17 [SYCL] Fix SYCL im2col and convert Overflow with Large Dims (#9052)
* sycl: fix im2col overflow and sync with cuda

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* sycl: fix convert overflow

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* sycl: fix convert and dequantize

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* sycl: fix ib in dmmv

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* sycl:refine convert

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* sycl: move downsample global_range into common

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* test: add im2col and convert test cases

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* test: make new cases only in sycl

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* test: comment new test_cases for only local testing

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

---------

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
2024-08-20 23:06:51 +08:00
fairydreaming 90db8146d5 tests : add missing comma in grammar integration tests (#9099)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-08-20 12:09:55 +03:00
wangshuai09 cfac111e2b cann: add doc for cann backend (#8867)
Co-authored-by: xuedinge233 <damow890@gmail.com>
Co-authored-by: hipudding <huafengchun@gmail.com>
2024-08-19 16:46:38 +08:00
Radoslav Gerganov 1b6ff90ff8 rpc : print error message when failed to connect endpoint (#9042) 2024-08-19 10:11:45 +03:00
Radoslav Gerganov 18eaf29f4c rpc : prevent crashes on invalid input (#9040)
Add more checks which prevent RPC server from crashing if invalid input
is received from client
2024-08-19 10:10:21 +03:00
Georgi Gerganov 554b049068 flake.lock: Update (#9068) 2024-08-18 07:43:32 -07:00
ltoniazzi 2339a0be1c tests : add integration test for lora adapters (#8957)
* Add printing to check weights match torch version

* minor code style changes

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2024-08-18 11:58:04 +02:00
Yoshi Suhara 2fb9267887 Fix incorrect use of ctx_split for bias tensors (#9063) 2024-08-17 15:34:21 +02:00
Xuan Son Nguyen 8b3befc0e2 server : refactor middleware and /health endpoint (#9056)
* server : refactor middleware and /health endpoint

* move "fail_on_no_slot" to /slots

* Update examples/server/server.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix server tests

* fix CI

* update server docs

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-08-16 17:19:05 +02:00
tc-mb d565bb2fd5 llava : support MiniCPM-V-2.6 (#8967)
* init

* rename

* add run android for termux in readme

* add android readme

* add instructions in readme

* change name in readme

* Update README.md

* fixed line

* add result in readme

* random pos_embed

* add positions index

* change for ollama

* change for ollama

* better pos_embed in clip

* support ollama

* updata cmakelist

* updata cmakelist

* rename wrapper

* clear code

* replace and organize code

* add link

* sync master

* fix warnings

* fix warnings

* fix bug in bicubic resize when need resize iamge smaller

* receive review comments and modify

* receive review comments and modify

* put all code into llava dir

* fix quality problem in pr code

* change n_layer

* add space in "-1"

* imitate reshape bug of python code

* fix bug in clip

* fix issues for merging

* fix llama-minicpmv-cli in cmake file

* change pr readme

* fix code review

* remove in line 33 directory in the /cmakelists.txt (not in example, in the main dir

* fix cmakefile

* add warn

* fix KEY_HAS_MINICPMV_PROJ

* remove load_image_size into clip_ctx

* remove the extern "C", MINICPMV_API

* fix uhd code for review comment

* delete minicpmv-wrapper in pr

* remove uhd_image_embed

* Modify 2 notes

* support minicpmv2.6

* modify convert script of minicpmv

* modify convert

* modify convert

* add readme

* add resampler of v2.6

* modify clip

* modify readme

* fix type-check

* fix type-check

* fix type-check

* fix type-check

* modify convert script and readme

* fix convert script and readme

* fix convert

* fix num in convert

* fix type-check

---------

Co-authored-by: Hongji Zhu <fireyoucan@gmail.com>
Co-authored-by: harvestingmoon <leewenyeong@gmail.com>
2024-08-16 16:34:41 +03:00
Farbod Bijary ee2984bdaf py : fix wrong input type for raw_dtype in ggml to gguf scripts (#8928)
Co-authored-by: farbod <farbod.bjary82@gmail.com>
2024-08-16 13:36:30 +03:00
Aisuko c8ddce8560 Fix inference example lacks required parameters (#9035)
Signed-off-by: Aisuko <urakiny@gmail.com>
2024-08-16 11:08:59 +02:00
compilade 23fd453544 gguf-py : bump version from 0.9.1 to 0.10.0 (#9051) 2024-08-16 09:36:11 +03:00
Minsoo Cheong c679e0cb5c llama : add EXAONE model support (#9025)
* add exaone model support

* add chat template

* fix whitespace

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* add ftype

* add exaone pre-tokenizer in `llama-vocab.cpp`

Co-Authored-By: compilade <113953597+compilade@users.noreply.github.com>

* fix lint

Co-Authored-By: compilade <113953597+compilade@users.noreply.github.com>

* add `EXAONE` to supported models in `README.md`

* fix space

Co-authored-by: compilade <git@compilade.net>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: compilade <113953597+compilade@users.noreply.github.com>
Co-authored-by: compilade <git@compilade.net>
2024-08-16 09:35:18 +03:00
Liu Jia fb487bb567 common : add support for cpu_get_num_physical_cores() on Windows (#8771)
* Add support for cpu_get_num_phsical_cores() on Windows

* fix build bug on msys2-clang64 and ucrt64

* avoid adding new function

* add new macros to avoid windows+mingw64

* Add error checking to return default value
2024-08-16 09:23:12 +03:00
Yoshi Suhara 2a24c8caa6 Add Nemotron/Minitron GGUF Conversion & Inference Support (#8922)
* Add nemotron GGUF conversion & inference support

* Fix formatting issues

* Remove unnecessary write_tensors()

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <git@compilade.net>

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* Address comments by @compilade

* Replace ggml_mul_mat()->llm_build_lora_mm()

* Remove mutable variable

* Use  for bias tensors

* Cover corner case for role_scaling not in config.json

---------

Co-authored-by: compilade <git@compilade.net>
2024-08-16 04:23:33 +02:00
Nico Bosshard e3f6fd56b1 ggml : dynamic ggml_sched_max_splits based on graph_size (#9047)
* ggml : Dynamic ggml_sched_max_splits based on graph_size

* Fixed and readded debug code for causes
2024-08-16 04:22:55 +02:00
gtygo 4b9afbbe90 retrieval : fix memory leak in retrieval query handling (#8955)
* retrieval

* Reuse querybatch to reduce frequent memory allocation

* delete unused white space
2024-08-15 10:40:12 +03:00
Riceball LEE 37501d9c79 server : fix duplicated n_predict key in the generation_settings (#8994) 2024-08-15 10:28:05 +03:00
Zhenwei Jin 4af8420afb common : remove duplicate function llama_should_add_bos_token (#8778) 2024-08-15 10:23:23 +03:00
Esko Toivonen 6bda7ce6c3 llama : add pre-tokenizer regexes for BLOOM and gpt3-finnish (#8850) 2024-08-15 10:17:12 +03:00
Georgi Gerganov d5492f0525 ci : disable bench workflow (#9010) 2024-08-15 10:11:11 +03:00
Jiří Podivín 234b30676a server : init stop and error fields of the result struct (#9026)
Signed-off-by: Jiri Podivin <jpodivin@redhat.com>
2024-08-15 09:21:57 +03:00
0cc4m 5fd89a70ea Vulkan Optimizations and Fixes (#8959)
* Optimize Vulkan REPEAT performance

* Use Vulkan GLSL fused multiply-add instruction where possible

* Add GGML_VULKAN_PERF option to output performance data per operator

* Rework and fix Vulkan descriptor set and descriptor pool handling

* Fix float32 concat f16 shader validation error

* Add Vulkan GROUP_NORM eps parameter

* Fix validation error with transfer queue memory barrier flags

* Remove trailing whitespaces
2024-08-14 18:32:53 +02:00
compilade 98a532d474 server : fix segfault on long system prompt (#8987)
* server : fix segfault on long system prompt

* server : fix parallel generation with very small batch sizes

* server : fix typo in comment
2024-08-14 09:51:02 +03:00
Georgi Gerganov 43bdd3ce18 cmake : remove unused option GGML_CURL (#9011) 2024-08-14 09:14:49 +03:00
Daniel Bevenius 06943a69f6 ggml : move rope type enum to ggml.h (#8949)
* ggml : move rope type enum to ggml.h

This commit moves the `llama_rope_type` enum from `llama.h` to
`ggml.h` and changes its name to `ggml_rope_type`.

The motivation for this change is to address the TODO in `llama.h` and
use the enum in ggml.

Note: This commit does not change the `mode` parameter to be of type
`enum ggml_rope_type`. The name `mode` and its usage suggest that it
might be more generic and possibly used as a bit field for multiple
flags. Further investigation/discussion may be needed to determine
if `mode` should be restricted to RoPE types.

* squash! ggml : move rope type enum to ggml.h

This commit removes GGML_ROPE_TYPE_NONE and GGML_ROPE_TYPE_GLM from
ggml.h, and back the llama_rope_type enum.

I've kept the assert for GGML_ROPE_TYPE_GLM as I'm not sure if it is
safe to remove it yet.

* squash! ggml : move rope type enum to ggml.h

This commit removes the enum ggml_rope_type from ggml.h and replaces it
with a define (GGML_ROPE_TYPE_NEOX). This define is used in the code to
check if the mode is set to GPT-NeoX. Also the enum llama_rope_type has
been updated to reflect this change.

* squash! ggml : move rope type enum to ggml.h

This commit contains a suggestion enable the GGML_ROPE_TYPE_NEOX
macro/define to be passed to the shader compiler.

* squash! ggml : move rope type enum to ggml.h

This commit fixes the editorconfig-checker warnings.

* squash! ggml : move rope type enum to ggml.h

Update comment for ggml_rope function.

* Revert "squash! ggml : move rope type enum to ggml.h"

This reverts commit 6261222bd0.

* squash! ggml : move rope type enum to ggml.h

Add GGML_ROPE_TYPE_NEOX to rope_common.comp.

* remove extra line

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-08-13 21:13:15 +02:00
Xuan Son Nguyen 828d6ff7d7 export-lora : throw error if lora is quantized (#9002) 2024-08-13 11:41:14 +02:00
Diogo Teles Sant'Anna fc4ca27b25 ci : fix github workflow vulnerable to script injection (#9008)
Signed-off-by: Diogo Teles Sant'Anna <diogoteles@google.com>
2024-08-12 19:28:23 +03:00
Radoslav Gerganov 1f67436c5e ci : enable RPC in all of the released builds (#9006)
ref: #8912
2024-08-12 19:17:03 +03:00
Nico Bosshard 0fd93cdef5 llama : model-based max number of graph nodes calculation (#8970)
* llama : model-based max number of graph nodes calculation

* Update src/llama.cpp

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-08-12 17:13:59 +02:00
Frank Mai 84eb2f4fad docs: introduce gpustack and gguf-parser (#8873)
* readme: introduce gpustack

GPUStack is an open-source GPU cluster manager for running large
language models, which uses llama.cpp as the backend.

Signed-off-by: thxCode <thxcode0824@gmail.com>

* readme: introduce gguf-parser

GGUF Parser is a tool to review/check the GGUF file and estimate the
memory usage without downloading the whole model.

Signed-off-by: thxCode <thxcode0824@gmail.com>

---------

Signed-off-by: thxCode <thxcode0824@gmail.com>
2024-08-12 14:45:50 +02:00
DavidKorczynski 1262e7ed13 grammar-parser : fix possible null-deref (#9004)
Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70680

Signed-off-by: David Korczynski <david@adalogics.com>
2024-08-12 15:36:41 +03:00
DavidKorczynski df5478fbea ggml: fix div-by-zero (#9003)
Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70724

In order to access the above bug you need to login using one of the
emails in
https://github.com/google/oss-fuzz/blob/master/projects/llamacpp/project.yaml#L3-L5

Signed-off-by: David Korczynski <david@adalogics.com>
2024-08-12 14:21:41 +02:00
Liu Jia 2589292cde Fix a spelling mistake (#9001) 2024-08-12 11:46:03 +02:00
Georgi Gerganov d3ae0ee8d7 py : fix requirements check '==' -> '~=' (#8982)
* py : fix requirements check '==' -> '~='

* cont : fix the fix

* ci : run on all requirements.txt
2024-08-12 11:02:01 +03:00
Georgi Gerganov 5ef07e25ac server : handle models with missing EOS token (#8997)
ggml-ci
2024-08-12 10:21:50 +03:00
compilade 4134999e01 gguf-py : Numpy dequantization for most types (#8939)
* gguf-py : Numpy dequantization for most types

* gguf-py : Numpy dequantization for grid-based i-quants
2024-08-11 14:45:41 -04:00
Georgi Gerganov 8cd1bcfd3f flake.lock: Update (#8979) 2024-08-11 06:58:58 -07:00
Neo Zhang a21c6fd450 update guide (#8909)
Co-authored-by: Neo Zhang <>
2024-08-11 14:07:43 +05:30
fairydreaming 33309f661a llama : check all graph nodes when searching for result_embd_pooled (#8956)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-08-11 10:35:26 +02:00
Markus Tavenrath 7c5bfd57f8 Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead. (#8943)
* Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead.

- Allocation overhead for the temporary std::vectors was easily detectable with a sampling profiler and simple to remove.
- ggml_vk_sync_buffer introduce a full pipeline sync which has a significant cost on the GPU side, sometimes larger than the actual kernel execution. Adding only barriers for shader read/writes and transfers seems to be sufficient looking at the code which either launches compute kernels or copies tensors.

* Fix small typo

---------

Co-authored-by: 0cc4m <picard12@live.de>
2024-08-11 10:09:09 +02:00
slaren 6e02327e8b metal : fix uninitialized abort_callback (#8968) 2024-08-10 15:42:10 +02:00
Xuan Son Nguyen 7eb23840ed llama : default n_swa for phi-3 (#8931)
* default n_swa for phi-3

* fix

* double check swa
2024-08-10 13:04:40 +02:00
fairydreaming 7c3f55c100 Add support for encoder-only T5 models (#8900)
* gguf-py : add T5ENCODER model architecture

* common : call llama_decode() during warmup only if the model has decoder

* convert-hf : add T5EncoderModel

* llama : add llama_model_has_decoder() API function

* llama : split build_t5() into build_t5_encoder() and build_t5_decoder()

* llama : add support for LLM_ARCH_T5ENCODER

* llama-embedding : add support for LLAMA_POOLING_TYPE_NONE

* llama-embedding : add support for encoder-only models

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-08-10 11:43:26 +02:00
Matteo Mortari 911b437f22 gguf-py : fix double call to add_architecture() (#8952)
Signed-off-by: tarilabs <matteo.mortari@gmail.com>
2024-08-10 08:58:49 +03:00
Georgi Gerganov b72942fac9 Merge commit from fork 2024-08-09 23:03:21 +03:00
fairydreaming 6afd1a99dc llama : add support for lora adapters in T5 model (#8938)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-08-09 18:53:09 +02:00
Georgi Gerganov 272e3bd95e make : fix llava obj file race (#8946)
ggml-ci
2024-08-09 18:24:30 +03:00
Georgi Gerganov 45a55b91aa llama : better replace_all (cont) (#8926)
* llama : better replace_all (cont)

ggml-ci

* code : deduplicate replace_all

ggml-ci
2024-08-09 18:23:52 +03:00
tc-mb 3071c0a5f2 llava : support MiniCPM-V-2.5 (#7599)
* init

* rename

* add run android for termux in readme

* add android readme

* add instructions in readme

* change name in readme

* Update README.md

* fixed line

* add result in readme

* random pos_embed

* add positions index

* change for ollama

* change for ollama

* better pos_embed in clip

* support ollama

* updata cmakelist

* updata cmakelist

* rename wrapper

* clear code

* replace and organize code

* add link

* sync master

* fix warnings

* fix warnings

* fix bug in bicubic resize when need resize iamge smaller

* receive review comments and modify

* receive review comments and modify

* put all code into llava dir

* fix quality problem in pr code

* change n_layer

* add space in "-1"

* imitate reshape bug of python code

* fix bug in clip

* fix issues for merging

* fix llama-minicpmv-cli in cmake file

* change pr readme

* fix code review

* remove in line 33 directory in the /cmakelists.txt (not in example, in the main dir

* fix cmakefile

* add warn

* fix KEY_HAS_MINICPMV_PROJ

* remove load_image_size into clip_ctx

* remove the extern "C", MINICPMV_API

* fix uhd code for review comment

* delete minicpmv-wrapper in pr

* remove uhd_image_embed

* Modify 2 notes

* clip : style changes

* del common.h in clip

* fix Type-Check error

* fix Type-Check error

* fix Type-Check error

* fix Type-Check error

* fix makefile error

* fix ubuntu-make error

* try fix clip

* try fix 1

---------

Co-authored-by: Hongji Zhu <fireyoucan@gmail.com>
Co-authored-by: harvestingmoon <leewenyeong@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-08-09 13:33:53 +03:00
Georgi Gerganov 4305b57c80 sync : ggml 2024-08-09 10:03:48 +03:00
Matt Stephenson 70c0ea3560 whisper : use vulkan as gpu backend when available (whisper/2302)
* ggml: use vulkan as gpu backend when available

Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com>

* whisper: enable using vk as default buffer type

Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com>

---------

Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com>
2024-08-09 10:03:44 +03:00
Daniel Bevenius 5b2c04f492 embedding : add --pooling option to README.md [no ci] (#8934)
This commit adds the `--pooling` option to the README.md file in the
`examples/embedding` directory.

The motivation for adding this options is that currently if the model
used does not specify a pooling type the embedding example will fail
with the following error message:
```console
main: error: pooling type NONE not supported
```

This commit also updates the name of the executable in the examples
section.
2024-08-09 09:33:30 +03:00
Daniel Bevenius 6f6496bb09 llama : fix typo in llama_tensor_get_type comment [no ci] (#8937) 2024-08-09 09:32:23 +03:00
Mathieu Geli daef3ab233 server : add one level list nesting for embeddings (#8936) 2024-08-09 09:32:02 +03:00
compilade 345a686d82 llama : reduce useless copies when saving session (#8916)
* llama : avoid useless copies in dummy session writer

* llama : avoid double tensor copy when saving session to buffer
2024-08-08 23:54:00 -04:00
compilade 3a14e00366 gguf-py : simplify support for quant types (#8838)
* gguf-py : use classes for quants

* convert_hf : simplify internal quantization type selection

* gguf-py : fix flake8 lint

* gguf-py : fix BF16 numpy view type

* gguf-py : remove LlamaFileTypeMap

Too specific to 'llama.cpp', and would be a maintenance burden
to keep up to date.

* gguf-py : add generic quantize and dequantize functions

The quant classes no longer need to be known,
only the target or the source type,
for 'quantize' and 'dequantize', respectively.
2024-08-08 13:33:09 -04:00
Georgi Gerganov afd27f01fe scripts : sync cann files (#0) 2024-08-08 14:56:52 +03:00
Georgi Gerganov 366d486c16 scripts : fix sync filenames (#0) 2024-08-08 14:40:12 +03:00
Georgi Gerganov e44a561ab0 sync : ggml 2024-08-08 13:19:47 +03:00
Borislav Stanimirov f93d49ab1e ggml : ignore more msvc warnings (ggml/906) 2024-08-08 13:19:31 +03:00
Georgi Gerganov 5b33ea1ee7 metal : fix struct name (ggml/912)
ggml-ci
2024-08-08 13:19:31 +03:00
Conrad Kramer 85fca8deb6 metal : add abort callback (ggml/905) 2024-08-08 13:19:30 +03:00
Pablo Duboue ebd541a570 make : clean llamafile objects (#8923)
`ggml/src/llamafile/sgemm.o` was not deleted on `make clean`
2024-08-08 11:44:51 +03:00
slaren 15fa07a5c5 make : use C compiler to build metal embed object (#8899)
* make : use C compiler to build metal embed object

* use rm + rmdir to avoid -r flag in rm
2024-08-07 18:24:05 +02:00
slaren be55695eff ggml-backend : fix async copy from CPU (#8897)
* ggml-backend : fix async copy from CPU

* cuda : more reliable async copy, fix stream used when the devices are the same
2024-08-07 13:29:02 +02:00
Ouadie EL FAROUKI 0478174d59 [SYCL] Updated SYCL device filtering (#8901)
* Updated device filter to depend on default_selector (fixes non-intel device issues)
* Small related update to example/sycl Readme
2024-08-07 11:25:36 +01:00
Johannes Gäßler a8dbc6f753 CUDA/HIP: fix tests/test-backend-ops (#8896) 2024-08-07 09:07:52 +02:00
Zhenwei Jin 506122d854 llama-bench : add support for getting cpu info on Windows (#8824)
* Add support for getting cpu info on Windows for llama_bench

* refactor

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-08-07 03:01:06 +02:00
Daniel Bevenius 725e3d9437 quantize : update usage comment in quantize.cpp (#8889)
This commit updates the usage comment in quantize.cpp to reflect the
new name of the executable, which is llama-quantize.
2024-08-07 01:43:00 +02:00
Nexes the Old 31958546c3 typo correction (#8891) 2024-08-07 01:41:54 +02:00
Xuan Son Nguyen 1e6f6554aa server : add lora hotswap endpoint (WIP) (#8857)
* server : add lora hotswap endpoint

* handle lora_no_apply

* fix build

* updae docs

* clean up struct def

* fix build

* add LoRA test

* fix style
2024-08-06 17:33:39 +02:00
Johannes Gäßler 641f5dd2a6 CUDA: fix padding logic for FP16/FP32 (#8884) 2024-08-06 17:13:55 +02:00
Daniel Bevenius 5f4dcb1e60 simple : update name of executable to llama-simple (#8885)
This commit updates the name of the executable in README.md from
`simple` to `llama-simple`.
2024-08-06 16:44:35 +02:00
Jaeden Amero db20f50cf4 cmake : Link vulkan-shaders-gen with pthreads (#8835)
When using CMake to build with Vulkan support, compiling
vulkan-shaders-gen fails due to missing a CMakeLists.txt specification
to link vulkan-shaders-gen with the threading library, resulting in the
following error.

    [5/172] Linking CXX executable bin/vulkan-shaders-gen
    FAILED: bin/vulkan-shaders-gen
    : && /usr/bin/c++ ggml/src/vulkan-shaders/CMakeFiles/vulkan-shaders-gen.dir/vulkan-shaders-gen.cpp.o -o bin/vulkan-shaders-gen   && :
    ld: error: undefined symbol: pthread_create
    >>> referenced by vulkan-shaders-gen.cpp
    >>>               ggml/src/vulkan-shaders/CMakeFiles/vulkan-shaders-gen.dir/vulkan-shaders-gen.cpp.o:(std::__1::__libcpp_thread_create[abi:se180100](pthread**,
    >>>               void* (*)(void*), void*))
    c++: error: linker command failed with exit code 1 (use -v to see invocation)
    [6/172] Generating build details from Git
    -- Found Git: /usr/local/bin/git (found version "2.45.2")
    ninja: build stopped: subcommand failed.

Add the CMakeLists.txt specification to link vulkan-shaders-gen with the
threading library and fix the above error.

Fixes #8834
2024-08-06 15:21:47 +02:00
MaggotHATE efda90c93a [Vulkan] Fix compilation of vulkan-shaders-gen on w64devkit after e31a4f6 (#8880)
* Fix compilation issue in `vulkan-shaders-gen`

https://github.com/ggerganov/llama.cpp/commit/e31a4f679779220312c165b0f5994c680a610e38 broke compilation on w64devkit. Including `algorithm` seems to fix that.

* Guard it under `#ifdef _WIN32`
2024-08-06 13:32:03 +02:00
Georgi Gerganov 0bf16de07b contributing : add note about write access 2024-08-06 11:48:01 +03:00
Molly Sophia 2d5dd7bb3f ggml : add epsilon as a parameter for group_norm (#8818)
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2024-08-06 10:26:46 +03:00
Douglas Hanley cdd1889de6 convert : add support for XLMRoberta embedding models (#8658)
* add conversion for bge-m3; small fix in unigram tokenizer

* clean up and simplify XLMRoberta conversion
2024-08-06 10:20:54 +03:00
Mengqing Cao c21a896405 [CANN]: Fix ggml_backend_cann_buffer_get_tensor (#8871)
* cann: fix ggml_backend_cann_buffer_get_tensor

 1. fix data ptr offset
 2. enable the acquisition of incomplete tensors

* fix backend cann set_tensor
2024-08-06 12:42:42 +08:00
Neo Zhang d4ff847153 [SYCL] correct cmd name (#8877) 2024-08-06 09:09:12 +08:00
Liu Jia 0a4ce78681 common : Changed tuple to struct (TODO fix) (#8823)
* common : Changed tuple to struct (TODO fix)

Use struct `llama_init_result` to replace the previous
std::tuple<struct llama_model *, struct llama_context *>

* delete llama_init_default_params()

* delete the extra whitespace
2024-08-05 18:14:10 +02:00
wangshuai09 bc0f887e15 cann: fix buffer_num and runtime speed slowly error (#8865) 2024-08-05 21:10:37 +08:00
Eric Curtin b42978e7e4 readme : add ramalama to the availables UI (#8811)
ramalama is a repo agnostic boring CLI tool that supports pulling from
ollama, huggingface and oci registries.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2024-08-05 15:45:01 +03:00
Justine Tunney b9dfc25ca3 ggml : fix overflows in elu function (#8866)
It's helpful to use expm1f(x), because expf(x)-1 will result in overflow
for 25% of single-precision floating point numbers.
2024-08-05 15:43:40 +03:00
Brian 1ef14b3007 py: Add more authorship metadata from model card (#8810)
* py: add more authorship metadata from model card

* fixup! py: add more authorship metadata from model card
2024-08-05 21:15:28 +10:00
fairydreaming d3f0c7166a Stop the generation when <|eom_id|> token is encountered - needed for Llama 3.1 tool call support (#8858)
* gguf-py, llama : add constants and methods related to Llama-3.1 <|eom_id|> token

* llama : find Llama-3.1 <|eom_id|> token id during vocab loading

* llama-vocab : add Llama-3.1 <|eom_id|> token to the set of tokens stopping the generation

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-08-05 09:38:01 +02:00
stduhpf e31a4f6797 cmake: fix paths for vulkan shaders compilation on Windows (#8573)
* Vulkan-shaders: attempt fix compilation on windows

* fix miss-matched parenthesis
2024-08-05 08:18:27 +02:00
BarfingLemurs 400ae6f65f readme : update model list (#8851) 2024-08-05 08:54:10 +03:00
Georgi Gerganov f1ea5146d7 llama : better replace_all (#8852) 2024-08-05 08:53:39 +03:00
0cc4m 064cdc265f vulkan : fix Qantized Mat-Vec Mul on AMD GPUs for ncols < 64 (#8855)
* Fix Vulkan mul mat vec invalid results when ncols < warp size

* Only run backend ops mul mat vec block size test if block size not already covered
2024-08-05 08:52:55 +03:00
Georgi Gerganov 5587e57a76 sync : ggml
ggml-ci
2024-08-05 08:50:57 +03:00
0cc4m a3738b2fa7 vulkan : implement Stable Diffusion operators (ggml/904)
* Fix Vulkan repeat op

* Implement Vulkan concat op

* Delete old Vulkan shader generator

* Implement Vulkan im2col op

* Implement Vulkan unary gelu_quick op

* Implement Vulkan group_norm op

* Implement Vulkan timestep_embedding op

* Implement Vulkan upscale op

* Fix Vulkan vk_context tensor extra index issue

* Fix Vulkan matmul shader parameter bug

* Properly fix Vulkan matmul shader parameter bug

* Add Vulkan ADD f16 + f32 -> f16 operator support

* Implement Vulkan tanh op

* Fix Vulkan group count too large Validation error on non-Nvidia GPUs

* Throw error when too much memory is requested

* Fix another Vulkan group count too large Validation error on non-Nvidia GPUs

* Fix matmul MMQ condition

* Implement Vulkan pad op

* Fix Vulkan crash when tensor is used multiple times in a compute graph

* Add Vulkan CONCAT f16 + f16 -> f16 op

* Add Vulkan LEAKY_RELU op
2024-08-05 08:50:57 +03:00
Daniel Bevenius 655858ace0 ggml : move c parameter comment to ggml_rope_ext (ggml/901)
This commit moves the comment for the c parameter from ggml_rope to
ggml_rope_ext. The comment is currently incorrect as ggml_rope does not
have a c parameter (freq_factors tensor).

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-08-05 08:50:57 +03:00
wangshuai09 c02b0a8a4d cann: support q4_0 model (#8822) 2024-08-05 12:22:30 +08:00
Brandon Squizzato 0d6fb52be0 Install curl in runtime layer (#8693) 2024-08-04 20:17:16 +02:00
ardfork 978ba3d83d Server: Don't ignore llama.cpp params (#8754)
* Don't ignore llama.cpp params

* Add fallback for max_tokens
2024-08-04 20:16:23 +02:00
Brian Cunnie ecf6b7f23e batched-bench : handle empty -npl (#8839)
* [example] batched-bench "segmentation fault"

When `llama-batched-bench` is invoked _without_ setting `-npl`, "number
of parallel prompts", it segfaults.

The segfault is caused by invoking `max_element()` on a zero-length
vector, `n_pl`

This commit addresses that by first checking to see if the number of
parallel prompts is zero, and if so sets the maximum sequence size to 1;
otherwise, sets it to the original, the result of `max_element()`.

Fixes, when running `lldb build/bin/llama-batched-bench -- -m models/Meta-Llama-3-8B.gguf`

```
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x000000010000366c llama-batched-bench`main(argc=3, argv=0x000000016fdff268) at batched-bench.cpp:72:28
   69  	    llama_context_params ctx_params = llama_context_params_from_gpt_params(params);
   70
   71  	    // ensure enough sequences are available
-> 72  	    ctx_params.n_seq_max = *std::max_element(n_pl.begin(), n_pl.end());
```

* Update examples/batched-bench/batched-bench.cpp

Co-authored-by: compilade <git@compilade.net>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: compilade <git@compilade.net>
2024-08-04 13:55:03 +03:00
Daniel Bevenius 01aae2b497 baby-llama : remove duplicate vector include 2024-08-04 13:24:59 +03:00
Georgi Gerganov 4b77ea95f5 flake.lock: Update (#8847) 2024-08-03 19:53:20 -07:00
jdomke 76614f352e ggml : reading the runtime sve config of the cpu (#8709)
* ggml : reading the runtime sve config of the cpu

* change to one time init to prevent performance drop

* prefix variable to avoid possible conflicts

* revert xxhash fix and add brackets

---------

Co-authored-by: domke <673751-domke@users.noreply.gitlab.com>
2024-08-03 18:34:41 +02:00
Sigbjørn Skjæret b72c20b85c Fix conversion of unnormalized BF16->BF16 weights (#7843)
* add truncate_bf16

* truncate intermediate fp32 if converting bf16 to bf16

* fix masking in __compute_fp32_to_bf16

* np.int16 no longer used

* missing cast and additional numpy 2.x fix

* ggml-impl : do not flush bf16 subnormals to zero

* ggml : add reference fp32 to bf16 conversion

The fast version is no longer equivalent for all platforms
because of the handling of subnormal values.

* gguf-py : remove flush to zero for bf16 subnormals

* gguf-py : remove float32 truncation to bf16

Rounding achieves the same thing in the cases where this was used.

* missed prototype update in merge

* merge cleanup

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
2024-08-02 15:11:39 -04:00
Mengqing Cao e09a800f9a cann: Fix ggml_cann_im2col for 1D im2col (#8819)
* fix ggml_cann_im2col for 1D im2col

* fix build warning
2024-08-02 16:50:53 +08:00
Ouadie EL FAROUKI 0fbbd88458 [SYCL] Fixing wrong VDR iq4nl value (#8812) 2024-08-02 08:55:17 +08:00
matteo afbb4c1322 ggml-cuda: Adding support for unified memory (#8035)
* Adding support for unified memory

* adding again the documentation about unified memory

* refactoring: Moved the unified memory code in the correct location.

* Fixed compilation error when using hipblas

* cleaning up the documentation

* Updating the documentation

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* adding one more case where the PR should not be enabled

---------

Co-authored-by: matteo serva <matteo.serva@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-08-01 23:28:28 +02:00
Alex O'Connell b7a08fd5e0 Build: Only include execinfo.h on linux systems that support it (#8783)
* Only enable backtrace on GLIBC linux systems

* fix missing file from copy

* use glibc macro instead of defining a custom one
2024-08-01 18:53:46 +02:00
slaren 7a11eb3a26 cuda : fix dmmv cols requirement to 2*GGML_CUDA_DMMV_X (#8800)
* cuda : fix dmmv cols requirement to 2*GGML_CUDA_DMMV_X

* update asserts

* only use dmmv for supported types

* add test
2024-08-01 15:26:22 +02:00
wangshuai09 c8a0090922 cann: support q8_0 for Ascend backend (#8805) 2024-08-01 10:39:05 +08:00
Igor Okulist afbbcf3c04 server : update llama-server embedding flag documentation (#8779)
Fixes #8763
2024-07-31 19:59:09 -04:00
Clint Herron ed9d2854c9 Build: Fix potential race condition (#8781)
* Fix potential race condition as pointed out by @fairydreaming in #8776

* Reference the .o rather than rebuilding every time.

* Adding in CXXFLAGS and LDFLAGS

* Removing unnecessary linker flags.
2024-07-31 15:51:06 -04:00
pculliton 398ede5efe Adding Gemma 2 2B configs (#8784)
* Adding Gemma 2 2B configs

Updates to Q scaling and Gemma 2 model sizes to match v2 2B model.

* Update src/llama.cpp

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-07-31 17:12:10 +02:00
Borislav Stanimirov 44d28ddd5c cmake : fix use of external ggml (#8787) 2024-07-31 15:40:08 +02:00
Someone 268c566006 nix: cuda: rely on propagatedBuildInputs (#8772)
Listing individual outputs no longer necessary to reduce the runtime closure size after https://github.com/NixOS/nixpkgs/pull/323056.
2024-07-30 13:35:30 -07:00
Brian 7e72aa74fd py: add_array() will not add to kv store if value is an empty array (#8774)
* gguf_writer.py: add_array() should not add to kv store if empty

* Apply suggestions from code review

I was wondering if there was a specific reason for `if val` but good to hear we can safely use `len(val == 0`

Co-authored-by: compilade <git@compilade.net>

---------

Co-authored-by: compilade <git@compilade.net>
2024-07-31 00:57:03 +10:00
l3utterfly 7c27a19b2e added android implementation of ggml_print_backtrace_symbols (#8751)
* added android implementation of ggml_print_backtrace_symbols

* Update ggml/src/ggml.c

Co-authored-by: slaren <slarengh@gmail.com>

* Update ggml/src/ggml.c

Co-authored-by: slaren <slarengh@gmail.com>

* Update ggml/src/ggml.c

Co-authored-by: slaren <slarengh@gmail.com>

* Update ggml/src/ggml.c

Co-authored-by: slaren <slarengh@gmail.com>

* Update ggml/src/ggml.c

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-07-30 16:40:18 +02:00
Georgi Gerganov 140074bb86 flake.lock: Update (#8729) 2024-07-30 05:58:57 -07:00
wangshuai09 6e2b6000e5 cann: update cmake (#8765) 2024-07-30 12:37:35 +02:00
zhentaoyu c887d8b017 [SYCL] Add TIMESTEP_EMBEDDING OP (#8707)
Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
2024-07-30 14:56:51 +08:00
CarterLi999 75af08c475 ggml: bugfix: fix the inactive elements is agnostic for risc-v vector (#8748)
In these codes, we want to retain the value that they previously held
when mask[i] is false. So we should use undisturbed. With the default
agnostic policy of rvv intrinsic, these values can be held or be
written with 1s.

Co-authored-by: carter.li <carter.li@starfivetech.com>
2024-07-29 18:38:34 +02:00
R0CKSTAR 439b3fc75a cuda : organize vendor-specific headers into vendors directory (#8746)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-07-29 14:56:12 +02:00
Meng, Hengyu 0832de7236 [SYCL] add conv support (#8688) 2024-07-29 10:50:27 +08:00
Johannes Gäßler 6eeaeba126 cmake: use 1 more thread for non-ggml in CI (#8740) 2024-07-28 22:32:44 +02:00
Austin 4730faca61 chore : Fix vulkan related compiler warnings, add help text, improve CLI options (#8477)
* chore: Fix compiler warnings, add help text, improve CLI options

* Add prototypes for function definitions
* Invert logic of --no-clean option to be more intuitive
* Provide a new help prompt with clear instructions

* chore : Add ignore rule for vulkan shader generator

Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>

* Update ggml/src/vulkan-shaders/vulkan-shaders-gen.cpp

Co-authored-by: 0cc4m <picard12@live.de>

* chore : Remove void and apply C++ style empty parameters

* chore : Remove void and apply C++ style empty parameters

---------

Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
Co-authored-by: 0cc4m <picard12@live.de>
2024-07-28 09:52:42 +02:00
compilade 4c676c85e5 llama : refactor session file management (#8699)
* llama : refactor session file management

* llama : saving and restoring state checks for overflow

The size of the buffers should now be given to the functions working
with them, otherwise a truncated file could cause out of bound reads.

* llama : stream from session file instead of copying into a big buffer

Loading session files should no longer cause a memory usage spike.

* llama : llama_state_get_size returns the actual size instead of max

This is a breaking change, but makes that function *much* easier
to keep up to date, and it also makes it reflect the behavior
of llama_state_seq_get_size.

* llama : share code between whole and seq_id-specific state saving

Both session file types now use a more similar format.

* llama : no longer store all hparams in session files

Instead, the model arch name is stored.
The layer count and the embedding dimensions of the KV cache
are still verified when loading.
Storing all the hparams is not necessary.

* llama : fix uint64_t format type

* llama : various integer type cast and format string fixes

Some platforms use "%lu" and others "%llu" for uint64_t.
Not sure how to handle that, so casting to size_t when displaying errors.

* llama : remove _context suffix for llama_data_context

* llama : fix session file loading

llama_state_get_size cannot be used to get the max size anymore.

* llama : more graceful error handling of invalid session files

* llama : remove LLAMA_MAX_RNG_STATE

It's no longer necessary to limit the size of the RNG state,
because the max size of session files is not estimated anymore.

* llama : cast seq_id in comparison with unsigned n_seq_max
2024-07-28 00:42:05 -04:00
R0CKSTAR e54c35e4fb feat: Support Moore Threads GPU (#8383)
* Update doc for MUSA

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Add GGML_MUSA in Makefile

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Add GGML_MUSA in CMake

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* CUDA => MUSA

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* MUSA adds support for __vsubss4

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Fix CI build failure

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-07-28 01:41:25 +02:00
Georgi Gerganov 5e2727fe03 scripts : sync vulkan-shaders (#0) 2024-07-27 18:08:47 +03:00
Georgi Gerganov 56f20aa25d scripts : sync ggml-aarch64 sources 2024-07-27 18:07:33 +03:00
Georgi Gerganov 345c8c0c87 ggml : add missing semicolon (#0)
ggml-ci
2024-07-27 17:43:44 +03:00
Georgi Gerganov ae7985cd7b sync : ggml
ggml-ci
2024-07-27 17:43:44 +03:00
Mahesh Madhav a05ca93697 ggml : loop tiling optimizations for scalar path (ggml/898)
Apply a loop tiling technique to the generic path, which provides
performance upside for ISAs with enough registers to take advantage
of it. Also helps the compiler optimize this path.
2024-07-27 17:43:44 +03:00
Ivan Filipov 9f77d899b7 ggml: add support for float16 input tensors in pooling operations (ggml/895)
* Add support for float16 tensors in 1d pooling operations

* Add support for float16 input tensors in 2d pooling operations

* code cleanup

remove unnecessary casting during srow ptr initialization

---------

Co-authored-by: vanaka11 <vanaka1189@gmail.com>
2024-07-27 17:43:44 +03:00
Tony Wasserka 203b7f1531 vulkan : initialize vk_buffer_struct members to VK_NULL_HANDLE (ggml/893)
This prevents invalid frees when destroying a partially initialized
vk_buffer_struct. For example, this could happen in ggml_vk_create_buffer
when running out of device memory.

Co-authored-by: Tony Wasserka <neobrain@users.noreply.github.com>
2024-07-27 17:43:44 +03:00
Borislav Stanimirov d2b851bfa1 cmake : only enable GGML_NATIVE and x86 flags if not crosscompiling (ggml/885) 2024-07-27 17:43:44 +03:00
Daniel Bevenius c12b6e8ee7 ggml : remove unnecessary UNUSED macro call (ggml/880)
This commit removes an UNUSED macro call that is not needed as the
variable n0 is used in the code and will not produce a warning.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-07-27 17:43:44 +03:00
Jeffrey Morgan b5e95468b1 llama : add support for llama 3.1 rope scaling factors (#8676)
* Add llama 3.1 rope scaling factors to llama conversion and inference

This commit generates the rope factors on conversion and adds them to the resulting model as a tensor. At inference time, these factors are passed to the `ggml_rope_ext` rope oepration, improving results for context windows above 8192

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <git@compilade.net>

* address comments

* address comments

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <git@compilade.net>

---------

Co-authored-by: compilade <git@compilade.net>
2024-07-27 15:03:45 +03:00
Georgi Gerganov 92090eca21 llama : add function for model-based max number of graph nodes (#8622)
* llama : model-based max number of graph nodes

ggml-ci

* llama : disable 405B max_nodes path due to lack of complaints

ggml-ci
2024-07-27 14:59:29 +03:00
Daniel Bevenius 9d03d085dd common : add --no-warmup option for main/llama-cli (#8712)
This commit adds a --no-warmup option for llama-cli.

The motivation for this is that it can be convenient to skip the
warmup llama_decode call when debugging.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-07-27 13:45:02 +03:00
wangshuai09 bfb4c74981 cann: Fix Multi-NPU execution error (#8710)
* cann: fix multi-npu exec error

* cann: update comment  for ggml_backend_cann_supports_buft
2024-07-27 16:36:44 +08:00
slaren 2b1f616b20 ggml : reduce hash table reset cost (#8698)
* ggml : reduce hash table reset cost

* fix unreachable code warnings after GGML_ASSERT(false)

* GGML_ASSERT(false) -> GGML_ABORT("fatal error")

* GGML_ABORT use format string
2024-07-27 04:41:55 +02:00
Judd 01245f5b16 llama : fix order of parameters (#8706)
usage of `aclrtGetMemInfo` is correct:

https://www.hiascend.com/doc_center/source/zh/canncommercial/63RC2/inferapplicationdev/aclcppdevg/aclcppdevg_03_0103.html

Co-authored-by: Judd <foldl@boxvest.com>
2024-07-26 11:38:12 +03:00
Yaiko 01aec4a631 server : add Speech Recognition & Synthesis to UI (#8679)
* server : add Speech Recognition & Synthesis to UI

* server : add Speech Recognition & Synthesis to UI (fixes)
2024-07-26 00:10:16 +02:00
Xuan Son Nguyen 41cd47caab examples : export-lora : fix issue with quantized base models (#8687) 2024-07-25 23:49:39 +02:00
DavidKorczynski 49ce0ab6d4 ggml: handle ggml_init failure to fix NULL pointer deref (#8692)
`ggml_init` can fail if no unused context is found. In that case, a NULL-pointer deref will happen later in the code during a call to `ggml_set_on_alloc`.

This fixes it by bailing out if no context is found.
2024-07-25 23:23:05 +02:00
Georgi Gerganov 4226a8d10e llama : fix build + fix fabs compile warnings (#8683)
ggml-ci
2024-07-25 19:57:31 +03:00
Andreas (Andi) Kunar bf5a81df37 ggml : fix build on Windows with Snapdragon X (#8531)
* Improvements for Windows with Snapdragon X

* Revert "Improvements for Windows with Snapdragon X"

This reverts commit bf21397ae5.

* Improvements for Windows with Snapdragon X

* WOA build clarifications

* WIndows on ARM build clarifications

* cmake build for Windows clarifications

* Update docs/build.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: AndreasKunar <andreaskmsn.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-25 19:01:00 +03:00
Georgi Gerganov 88954f7fbd tests : fix printfs (#8068) 2024-07-25 18:58:04 +03:00
Chen Xi ed67bcb24f [SYCL] fix multi-gpu issue on sycl (#8554)
---------

Signed-off-by: Chen Xi <xi2chen@intel.com>
Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>
2024-07-25 19:45:18 +08:00
Georgi Gerganov eddcb5238b ggml : add and use ggml_cpu_has_llamafile() (#8664) 2024-07-25 12:37:42 +03:00
Xuan Son Nguyen be6d7c0791 examples : remove finetune and train-text-from-scratch (#8669)
* examples : remove finetune and train-text-from-scratch

* fix build

* update help message

* fix small typo for export-lora
2024-07-25 10:39:04 +02:00
Ujjawal Panchal 4b0eff3df5 docs : Quantum -> Quantized (#8666)
* docfix: imatrix readme, quantum models -> quantized models.

* docfix: server readme: quantum models -> quantized models.
2024-07-25 11:13:27 +03:00
Fan Shupei 8a4bad50a8 llama: use sliding window for phi3 (#8627)
* use sliding window for phi3

* fix typo, "data_swa" -> "data"

* [conver_hf_to_gguf.py] add phi3 sliding window
2024-07-25 10:21:09 +03:00
MorganRO8 68504f0970 readme : update games list (#8673)
Added link to game I made that depends on llama
2024-07-24 19:48:00 +03:00
Joe Todd f19bf99c01 Build Llama SYCL Intel with static libs (#8668)
Ensure SYCL CI builds both static & dynamic libs for testing purposes

Signed-off-by: Joe Todd <joe.todd@codeplay.com>
2024-07-24 14:36:00 +01:00
Thorsten Sommer 3a7ac5300a readme : update UI list [no ci] (#8505) 2024-07-24 15:52:30 +03:00
Xuan Son Nguyen 96952e7181 llama : fix llama_chat_format_single for mistral (#8657)
* fix `llama_chat_format_single` for mistral

* fix typo

* use printf
2024-07-24 13:48:46 +02:00
Joe Todd 79167d9e49 Re-add erroneously removed -fsycl from GGML_EXTRA_LIBS (#8667) 2024-07-24 11:55:26 +01:00
Xuan Son Nguyen b115105f05 add llama_lora_adapter_clear (#8653) 2024-07-24 11:25:19 +02:00
Xuan Son Nguyen de280085e7 examples : Fix llama-export-lora example (#8607)
* fix export-lora example

* add more logging

* reject merging subset

* better check

* typo
2024-07-23 23:48:37 +02:00
Vali Malinoiu b841d07408 server : fix URL.parse in the UI (#8646) 2024-07-23 17:37:42 +03:00
Joe Todd 64cf50a0ed sycl : Add support for non-release DPC++ & oneMKL (#8644)
* Update cmake to support nvidia hardware & open-source compiler
---------
Signed-off-by: Joe Todd <joe.todd@codeplay.com>
2024-07-23 14:58:37 +01:00
Georgi Gerganov 938943cdbf llama : move vocab, grammar and sampling into separate files (#8508)
* llama : move sampling code into llama-sampling

ggml-ci

* llama : move grammar code into llama-grammar

ggml-ci

* cont

ggml-ci

* cont : pre-fetch rules

* cont

ggml-ci

* llama : deprecate llama_sample_grammar

* llama : move tokenizers into llama-vocab

ggml-ci

* make : update llama.cpp deps [no ci]

* llama : redirect external API to internal APIs

ggml-ci

* llama : suffix the internal APIs with "_impl"

ggml-ci

* llama : clean-up
2024-07-23 13:10:17 +03:00
0cc4m 751fcfc6c3 Vulkan IQ4_NL Support (#8613)
* Fix Vulkan matmul tests compile errors

* Add Vulkan IQ4_NL support

* Fix Vulkan DeepSeek-Coder-V2-Lite MoE support
2024-07-23 10:56:49 +02:00
Jeroen Mostert 46e47417aa Allow all RDNA2 archs to use sdot4 intrinsic (#8629)
The check gating the use of `__builtin_amdgc_sdot4` specifically checks for gfx1030. This causes a severe perf regression for anything gfx103? that's not gfx1030 and not using `HSA_OVERRIDE_GFX_VERSION` (if you've built ROCm to support it). We already have a generic RDNA2 define, let's use it.
2024-07-23 10:50:40 +02:00
Georgi Gerganov e7e6487ba0 contrib : clarify PR squashing + module names (#8630)
* contrib : clarify PR squashing

* contrib : fix typo + add list of modules
2024-07-23 11:28:38 +03:00
luoyu-intel 063d99ad11 [SYCL] fix scratch size of softmax (#8642) 2024-07-23 15:43:28 +08:00
Keke Han 081fe431aa llama : fix codeshell support (#8599)
* llama : fix codeshell support

* llama : move codeshell after smollm below to respect the enum order
2024-07-22 19:43:43 +03:00
Jason Stillerman d94c6e0ccb llama : add support for SmolLm pre-tokenizer (#8609)
* Adding SmolLM Pre Tokenizer

* Update convert_hf_to_gguf_update.py

Co-authored-by: compilade <git@compilade.net>

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* handle regex

* removed .inp and out .out ggufs

---------

Co-authored-by: compilade <git@compilade.net>
2024-07-22 17:43:01 +03:00
Jiří Podivín 566daa5a5b *.py: Stylistic adjustments for python (#8233)
* Superflous parens in conditionals were removed.
* Unused args in function were removed.
* Replaced unused `idx` var with `_`
* Initializing file_format and format_version attributes
* Renaming constant to capitals
* Preventing redefinition of the `f` var

Signed-off-by: Jiri Podivin <jpodivin@redhat.com>
2024-07-22 23:44:53 +10:00
Georgi Gerganov 6f11a83e4e llama : allow overrides for tokenizer flags (#8614)
ggml-ci
2024-07-22 13:33:22 +03:00
Georgi Gerganov e093dd2382 tests : re-enable tokenizer tests (#8611)
* models : remove duplicated gpt-2 vocab

* models : remove old stablelm vocab

* tests : re-enable MPT tokenizer tests

* tests : re-enable DeepSeek tokenizer tests

* cmake : sort

ggml-ci
2024-07-22 13:32:49 +03:00
Douglas Hanley 50e05353e8 llama : add Mistral Nemo inference support (#8604) 2024-07-22 11:06:17 +03:00
Jan Boon 628154492a server : update doc to clarify n_keep when there is bos token (#8619) 2024-07-22 11:02:09 +03:00
Mark Zhuang 04bab6b7da ggml: fix compile error for RISC-V (#8623) 2024-07-22 10:56:45 +03:00
devojony b7c11d36e6 examples: fix android example cannot be generated continuously (#8621)
When generation ends `completion_loop()` should return a NULL, not the empty string
2024-07-22 09:54:42 +03:00
Georgi Gerganov 45f2c19cc5 flake.lock: Update (#8610) 2024-07-21 06:45:10 -07:00
M-A 22f281aa16 examples : Rewrite pydantic_models_to_grammar_examples.py (#8493)
Changes:

- Move each example into its own function. This makes the code much
  easier to read and understand.
- Make the program easy to only run one test by commenting out function
  calls in main().
- Make the output easy to parse by indenting the output for each example.
- Add shebang and +x bit to make it clear it's an executable.
- Make the host configurable via --host with a default 127.0.0.1:8080.
- Make the code look in the tools list to call the registered tool,
  instead of hardcoding the returned values. This makes the code more
  copy-pastable.
- Add error checking, so that the program exits 1 if the LLM didn't
  returned expected values. It's super useful to check for correctness.

Testing:

- Tested with Mistral-7B-Instruct-v0.3 in F16 and Q5_K_M and
  Meta-Llama-3-8B-Instruct in F16 and Q5_K_M.
  - I did not observe a failure even once in Mistral-7B-Instruct-v0.3.
  - Llama-3 failed about a third of the time in example_concurrent: it
    only returned one call instead of 3. Even for F16.

Potential follow ups:

- Do not fix the prompt encoding yet. Surprisingly it mostly works even
  if the prompt encoding is not model optimized.
- Add chained answer and response.

Test only change.
2024-07-20 22:09:17 -04:00
compilade 328884f421 gguf-py : fix some metadata name extraction edge cases (#8591)
* gguf-py : fix some metadata name extraction edge cases

* convert_lora : use the lora dir for the model card path

* gguf-py : more metadata edge cases fixes

Multiple finetune versions are now joined together,
and the removal of the basename annotation on trailing versions
is more robust.

* gguf-py : add more name metadata extraction tests

* convert_lora : fix default filename

The default filename was previously hardcoded.

* convert_hf : Model.fname_out can no longer be None

* gguf-py : do not use title case for naming convention

Some models use acronyms in lowercase,
which can't be title-cased like other words,
so it's best to simply use the same case
as in the original model name.

Note that the size label still has an uppercased suffix
to make it distinguishable from the context size of a finetune.
2024-07-20 21:58:49 -04:00
compilade c69c63039c convert_hf : fix Gemma v1 conversion (#8597)
* convert_hf : fix Gemma v1 conversion

* convert_hf : allow renaming tokens, but with a warning

* convert_hf : fix Gemma v1 not setting BOS and EOS tokens
2024-07-20 21:53:01 -04:00
Johannes Gäßler 69c487f4ed CUDA: MMQ code deduplication + iquant support (#8495)
* CUDA: MMQ code deduplication + iquant support

* 1 less parallel job for CI build
2024-07-20 22:25:26 +02:00
Georgi Gerganov 07283b1a90 gguf : handle null name during init (#8587) 2024-07-20 17:15:42 +03:00
Michael Coppola 940362224d llama : add support for Tekken pre-tokenizer (#8579)
* llama : Added support for Tekken pre-tokenizer (#8577)

Removed uneeded `vocab.tokenizer_clean_spaces` assignment

* llama : fix order of pre-tokenizers

* * Tekken pre-tokenizer no longer uses clean_up_tokenization_spaces
* Updated chkhsh for Tekken tokenizer

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-20 16:43:51 +03:00
Huifeng Ou 69b9945b44 llama.swiftui: fix end of generation bug (#8268)
* fix continuing generating blank lines after getting EOT token or EOS token from LLM

* change variable name to is_done (variable name suggested by ggerganov)

* minor : fix trailing whitespace

* minor : add space

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-20 16:09:37 +03:00
Brian c3776cacab gguf_dump.py: fix markddown kv array print (#8588)
* gguf_dump.py: fix markddown kv array print

* Update gguf-py/scripts/gguf_dump.py

Co-authored-by: compilade <git@compilade.net>

* gguf_dump.py: refactor kv array string handling

* gguf_dump.py: escape backticks inside of strings

* gguf_dump.py: inline code markdown escape handler added

>>> escape_markdown_inline_code("hello world")
'`hello world`'
>>> escape_markdown_inline_code("hello ` world")
'``hello ` world``'

* gguf_dump.py: handle edge case about backticks on start or end of a string

---------

Co-authored-by: compilade <git@compilade.net>
2024-07-20 17:35:25 +10:00
slaren 87e397d00b ggml : fix quant dot product with odd number of blocks (#8549)
* ggml : fix iq4_nl dot product with odd number of blocks

* ggml : fix odd blocks for ARM_NEON (#8556)

* ggml : fix iq4_nl dot product with odd number of blocks

* ggml : fix q4_1

* ggml : fix q5_0

* ggml : fix q5_1

* ggml : fix iq4_nl metal

ggml-ci

* ggml : fix q4_0

* ggml : fix q8_0

ggml-ci

* ggml : remove special Q4_0 code for first 2 blocks

* ggml : fix sumf redefinition

---------

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-19 17:17:27 +02:00
Brian 57b1d4f9eb convert-*.py: remove add_name from ChatGLMModel class (#8590) 2024-07-20 00:04:38 +10:00
Georgi Gerganov d197545530 llama : bump max layers from 256 to 512 (#8530)
* llama : bump max layers from 256 to 512

* llama : replace asserts with exceptions
2024-07-19 16:50:47 +03:00
Georgi Gerganov be0cfb4175 readme : fix server badge 2024-07-19 14:34:55 +03:00
Clint Herron b57eb9ca4f ggml : add friendlier error message to fopen errors (#8575)
* Add additional error information when model files fail to load.

* Adding additional error information to most instances of fopen.
2024-07-19 14:05:45 +03:00
Frank Mai f299aa98ec fix: typo of chatglm4 chat tmpl (#8586)
Signed-off-by: thxCode <thxcode0824@gmail.com>
2024-07-19 11:44:41 +02:00
Brian 3d0e4367d9 convert-*.py: add general.name kv override (#8571) 2024-07-19 17:51:51 +10:00
Johannes Gäßler a15ef8f8a0 CUDA: fix partial offloading for ne0 % 256 != 0 (#8572) 2024-07-18 23:48:47 +02:00
65a 705b7ecf60 cmake : install all ggml public headers (#8480)
Co-authored-by: 65a <65a@65a.invalid>
2024-07-18 17:47:12 +03:00
Eric Zhang 0d2c7321e9 server: use relative routes for static files in new UI (#8552)
* server: public: fix api_url on non-index pages

* server: public: use relative routes for static files in new UI
2024-07-18 12:43:49 +02:00
Brian 672a6f1018 convert-*.py: GGUF Naming Convention Refactor and Metadata Override Refactor (#7499)
Main thing is that the default output filename will take this form

{name}{parameters}{finetune}{version}{encoding}{kind}

In addition this add and remove some entries in the KV store and adds a metadata class with automatic heuristics capability to derive some values based on model card content

* No Change:
  - Internal GGUF Spec
    - `general.architecture`
    - `general.quantization_version`
    - `general.alignment`
    - `general.file_type`
  - General Model Details
    - `general.name`
    - `general.author`
    - `general.version`
    - `general.description`
  - Licensing details
    - `general.license`
  - Typically represents the converted GGUF repo (Unless made from scratch)
    - `general.url`
  - Model Source during conversion
    - `general.source.url`

* Removed:
  - Model Source during conversion
    - `general.source.huggingface.repository`

* Added:
  - General Model Details
    - `general.organization`
    - `general.finetune`
    - `general.basename`
    - `general.quantized_by`
    - `general.size_label`
  - Licensing details
    - `general.license.name`
    - `general.license.link`
  - Typically represents the converted GGUF repo (Unless made from scratch)
    - `general.doi`
    - `general.uuid`
    - `general.repo_url`
  - Model Source during conversion
    - `general.source.doi`
    - `general.source.uuid`
    - `general.source.repo_url`
  - Base Model Source
    - `general.base_model.count`
    - `general.base_model.{id}.name`
    - `general.base_model.{id}.author`
    - `general.base_model.{id}.version`
    - `general.base_model.{id}.organization`
    - `general.base_model.{id}.url` (Model Website/Paper)
    - `general.base_model.{id}.doi`
    - `general.base_model.{id}.uuid`
    - `general.base_model.{id}.repo_url` (Model Source Repository (git/svn/etc...))
  - Array based KV stores
    - `general.tags`
    - `general.languages`
    - `general.datasets`

---------

Co-authored-by: compilade <git@compilade.net>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-07-18 20:40:15 +10:00
RunningLeon 3807c3de04 server : respect --special cli arg (#8553) 2024-07-18 11:06:22 +03:00
Johannes Gäßler e02b597be3 lookup: fibonacci hashing, fix crashes (#8548) 2024-07-17 23:35:44 +02:00
Al Mochkin b3283448ce build : Fix docker build warnings (#8535) (#8537) 2024-07-17 20:21:55 +02:00
Brian 30f80ca0bc CONTRIBUTING.md : remove mention of noci (#8541) 2024-07-17 17:57:06 +03:00
hipudding 1bdd8ae19f [CANN] Add Ascend NPU backend (#6035)
* [CANN] Add Ascend NPU backend

Ascend is a full-stack AI computing infrastructure for industry
applications and services based on Huawei Ascend processors and
software.

CANN (Compute Architecture of Neural Networks), developped by
Huawei, is a heterogeneous computing architecture for AI.

Co-authored-by: wangshuai09 <391746016@qq.com>

* delete trailing whitespaces

* Modify the code based on review comment

* Rename LLAMA_CANN to GGML_CANN

* Make ggml-common.h private

* add ggml_cann prefix for acl funcs

* Add logging for CANN backend

* Delete Trailing whitespace

---------

Co-authored-by: wangshuai09 <391746016@qq.com>
2024-07-17 14:23:50 +03:00
Masaya, Kato da3913d8f9 batched: fix n_predict parameter (#8527) 2024-07-17 10:34:28 +03:00
Georgi Gerganov d65a8361fe llama : disable context-shift for DeepSeek v2 (#8501) 2024-07-17 10:32:59 +03:00
Johannes Gäßler 5e116e8dd5 make/cmake: add missing force MMQ/cuBLAS for HIP (#8515) 2024-07-16 21:20:59 +02:00
Brian 1666f92dcd gguf-hash : update clib.json to point to original xxhash repo (#8491)
* Update clib.json to point to Cyan4973 original xxhash

Convinced Cyan4973 to add clib.json directly to his repo, so can now point the clib package directly to him now. Previously pointed to my fork with the clib.json package metadata

https://github.com/Cyan4973/xxHash/pull/954

* gguf-hash: readme update to point to Cyan4973 xxHash repo [no ci]
2024-07-16 10:14:16 +03:00
Steve Bonds 37b12f92ab export-lora : handle help argument (#8497)
The --help option on export-lora isn't accepted as valid. The help still gets displayed by default, but the script exits with an error message and nonzero status.
2024-07-16 10:04:45 +03:00
Georgi Gerganov 0efec57787 llama : valign + remove unused ftype (#8502) 2024-07-16 10:00:30 +03:00
compilade 7acfd4e8d5 convert_hf : faster lazy safetensors (#8482)
* convert_hf : faster lazy safetensors

This makes '--dry-run' much, much faster.

* convert_hf : fix memory leak in lazy MoE conversion

The '_lazy' queue was sometimes self-referential,
which caused reference cycles of objects old enough
to avoid garbage collection until potential memory exhaustion.
2024-07-15 23:13:10 -04:00
Xuan Son Nguyen 97bdd26eee Refactor lora adapter support (#8332)
* lora: load to devide buft

* add patch tensor function

* correct tensor patch

* llama_lora_adapter_apply

* correct ggml_backend_tensor_copy

* add llm_build_mm

* fix auto merge

* update based on review comments

* add convert script

* no more transpose A

* add f16 convert

* add metadata check

* add sanity check

* fix ftype

* add requirements

* fix requirements

* fix outfile

* conversion: only allow selected models

* fix types

* cuda : do not use dmmv if the tensor does not have enough cols

* llama : lora fixes

* do not disable mmap with lora

Co-authored-by: slaren <slarengh@gmail.com>

* llm_build_lora_mm_id

* convert_lora : MoE LoRA conversion support

* convert_lora : prefer safetensors, similarly to convert_hf

* convert_hf : simplify modify_tensors for InternLM2

* convert_lora : lazy conversion

* llama : load and use alpha from LoRA adapters

* llama : use llm_build_lora_mm in most model graphs

* auto scale

* Revert "auto scale"

This reverts commit 42415a4874.

* remove redundant params

* Apply suggestions from code review

Co-authored-by: slaren <slarengh@gmail.com>

* change kv metadata

* move add_type to __init__

* convert_hf : move add_type to main()

* convert_lora : use the GGUFWriter from Model instead of overwriting it

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Francis Couture-Harpin <git@compilade.net>
2024-07-15 20:50:47 +02:00
Xuan Son Nguyen 4db8f60fe7 fix ci (#8494) 2024-07-15 19:23:10 +02:00
Daniel Bevenius 8fac431b06 ggml : suppress unknown pragma 'GCC' on windows (#8460)
This commit adds a macro guard to pragma GCC to avoid the following
warning on windows:

```console
C:\llama.cpp\ggml\src\ggml-aarch64.c(17,9): warning C4068:
unknown pragma 'GCC' [C:\lama.cpp\build\ggml\src\ggml.vcxproj]
```
2024-07-15 15:48:17 +03:00
M-A f17f39ff9c server: update README.md with llama-server --help output [no ci] (#8472)
The README.md had a stale information. In particular, the --ctx-size
"defaults to 512" confused me and I had to check the code to confirm
this was false. This the server is evolving rapidly, it's probably
better to keep the source of truth at a single place (in the source) and
generate the README.md based on that.

Did:

    make llama-server
    ./llama-server --help > t.txt
    vimdiff t.txt examples/server/README.md

I copied the content inside a backquote block. I would have preferred
proper text but it would require a fair amount of surgery to make the
current output compatible with markdown. A follow up could be to
automate this process with a script.

No functional change.
2024-07-15 15:04:56 +03:00
Georgi Gerganov 9104bc20ed common : add --no-cont-batching arg (#6358) 2024-07-15 14:54:58 +03:00
NikolaiLyssogor fc690b018e docs: fix links in development docs [no ci] (#8481)
Fixes a few links to within the repo that were broken in the reorganization of the
documentation in #8325.
2024-07-15 14:46:39 +03:00
Meng, Hengyu 16bdfa42ac [SYCL] add concat through dim 1/2 (#8483)
* add concat through dim 1/2
2024-07-15 19:32:15 +08:00
Georgi Gerganov 3dfda05956 llama : de-duplicate deepseek2 norm 2024-07-15 14:10:39 +03:00
0cc4m bda62d7999 Vulkan MMQ Fix (#8479)
* Fix incoherence by adding missing LOAD_VEC_A parameter

* Fix Vulkan op result checker build error
2024-07-15 09:38:52 +02:00
compilade 090fca7a07 pydantic : replace uses of __annotations__ with get_type_hints (#8474)
* pydantic : replace uses of __annotations__ with get_type_hints

* pydantic : fix Python 3.9 and 3.10 support
2024-07-14 19:51:21 -04:00
Georgi Gerganov aaab2419ea flake.lock: Update (#8475)
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/9f4128e00b0ae8ec65918efeba59db998750ead6?narHash=sha256-rwz8NJZV%2B387rnWpTYcXaRNvzUSnnF9aHONoJIYmiUQ%3D' (2024-07-03)
  → 'github:NixOS/nixpkgs/7e7c39ea35c5cdd002cd4588b03a3fb9ece6fad9?narHash=sha256-EYekUHJE2gxeo2pM/zM9Wlqw1Uw2XTJXOSAO79ksc4Y%3D' (2024-07-12)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-07-14 08:54:02 -07:00
Georgi Gerganov 73cf442e7b llama : fix Gemma-2 Query scaling factors (#8473)
* 9B - query_pre_attn_scalar = 256 not 224

See https://github.com/google/gemma_pytorch/commit/03e657582d17cb5a8617ebf333c1c16f3694670e

Gemma 9b should use 256 and not 224 (self.config.hidden_size // self.config.num_attention_heads)

* llama : fix Gemma-2 Query scaling factor

ggml-ci

---------

Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2024-07-14 14:05:09 +03:00
Brian e236528e76 gguf_hash.py: Add sha256 (#8470)
* gguf_hash.py: Add sha256

* gguf_hash.py: rename string UUIDv5 --> uuid

* Apply suggestions from code review

Co-authored-by: compilade <git@compilade.net>

---------

Co-authored-by: compilade <git@compilade.net>
2024-07-14 16:47:14 +10:00
compilade fa79495bb4 llama : fix pre-tokenization of non-special added tokens (#8228)
* llama : fix mpt and olmo pre-tokenizer

* llama : pre-tokenize non-special user-defined tokens first

* llama : fix detection of control-like user-defined tokens

* convert_hf : identify which user-defined tokens are control tokens

Only used in _set_vocab_gpt2() for now.

* convert_hf : identify more added control tokens for SPM tokenziers

This makes Gemma and Gemma-2 tokenize pretty much EVERYTHING correctly,
including HTML tags and consecutive spaces,
but it unfortunately requires model re-conversion.

There seems to be a weird behavior of the HF tokenizer for Gemma,
which prefers to use the 16-space token over more lengthy space tokens,
while using the SentencePiece tokenizer does not do this.
(the implementation in llama.cpp has the same behavior as SentencePiece)

* llama : fix wrong pre-tokenization of byte tokens

* llama : fix Viking pre-tokenizer regex

The order was previously wrong, which caused errors in some tests.

* llama : fix command-r detokenization

* convert_hf : reduce usages of the UNKNOWN token type

* llama : add UNKNOWN tokens in the special tokens cache

* convert_hf : reduce usages of UNKNOWN for InternLM2

This makes the changes from #8321 more consistent
with the other changes made here.

* test-tokenizer-random : reduce potential confilcts with #8379

* test-tokenizer-random : add a failing edge case for falcon
2024-07-13 23:35:10 -04:00
bandoti 17eb6aa8a9 vulkan : cmake integration (#8119)
* Add Vulkan to CMake pkg

* Add Sycl to CMake pkg

* Add OpenMP to CMake pkg

* Split generated shader file into separate translation unit

* Add CMake target for Vulkan shaders

* Update README.md

* Add make target for Vulkan shaders

* Use pkg-config to locate vulkan library

* Add vulkan SDK dep to ubuntu-22-cmake-vulkan workflow

* Clean up tabs

* Move sudo to apt-key invocation

* Forward GGML_EXTRA_LIBS to CMake config pkg

* Update vulkan obj file paths

* Add shaderc to nix pkg

* Add python3 to Vulkan nix build

* Link against ggml in cmake pkg

* Remove Python dependency from Vulkan build

* code review changes

* Remove trailing newline

* Add cflags from pkg-config to fix w64devkit build

* Update README.md

* Remove trailing whitespace

* Update README.md

* Remove trailing whitespace

* Fix doc heading

* Make glslc required Vulkan component

* remove clblast from nix pkg
2024-07-13 18:12:39 +02:00
Georgi Gerganov c917b67f06 metal : template-ify some of the kernels (#8447)
ggml-ci
2024-07-13 18:32:33 +03:00
Georgi Gerganov 4e24cffd8c server : handle content array in chat API (#8449)
* server : handle content array in chat API

* Update examples/server/utils.hpp

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-07-12 14:48:15 +03:00
Georgi Gerganov 6af51c0d96 main : print error on empty input (#8456) 2024-07-12 14:48:04 +03:00
Daniel Bevenius f53226245f llama : suppress unary minus operator warning (#8448)
This commit updates the _try_copy lambda and moves the unary minus
operator to after the cast to int32_t.

The motivation for this that currently the following warning is
generated on windows:

```console
llama.cpp\src\llama.cpp(21147,30): warning C4146: unary minus operator
applied to unsigned type, result still unsigned
```
2024-07-12 12:05:21 +03:00
Douglas Hanley c3ebcfa148 server : ensure batches are either all embed or all completion (#8420)
* make sure batches are all embed or all non-embed

* non-embedding batch for sampled tokens; fix unused params warning
2024-07-12 11:14:12 +03:00
Armen Kaleshian 8a4441ea1a docker : fix filename for convert-hf-to-gguf.py in tools.sh (#8441)
Commit b0a4699 changed the name of this script from convert-hf-to-gguf.py to
convert_hf_to_gguf.py breaking how convert is called from within a Docker
container.
2024-07-12 11:08:19 +03:00
Jiří Podivín 5aefbce27a convert : remove fsep token from GPTRefactForCausalLM (#8237)
The <filename> token used by Refact doesn't serve
the same purpose as the <file_separator> from CodeGemma.

Signed-off-by: Jiri Podivin <jpodivin@redhat.com>
2024-07-12 11:06:33 +03:00
Georgi Gerganov 71c1121d11 examples : sprintf -> snprintf (#8434)
* examples : sprintf -> snprintf

ggml-ci

* examples : use sizeof() instead of hardcoded constants
2024-07-12 10:46:14 +03:00
Georgi Gerganov 370b1f7e7a ggml : minor naming changes (#8433)
* ggml : minor naming changes

ggml-ci

* ggml : use PRId64 [no ci]

* ggml : revert FA K/Q names
2024-07-12 10:46:02 +03:00
Chen Xi b549a1bbef [SYCL] fix the mul_mat_id ut issues (#8427)
* fix part of mul_mat_id

* skip the bfloat 16 sycl ut

Signed-off-by: Chen Xi <xi2chen@intel.com>

---------

Signed-off-by: Chen Xi <xi2chen@intel.com>
Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>
Co-authored-by: Chen Xi <xi2chen@intel.com>
2024-07-12 08:52:04 +08:00
Nicholai Tukanov 368645698a ggml : add NVPL BLAS support (#8329) (#8425)
* ggml : add NVPL BLAS support

* ggml : replace `<BLASLIB>_ENABLE_CBLAS` with `GGML_BLAS_USE_<BLASLIB>`

---------

Co-authored-by: ntukanov <ntukanov@nvidia.com>
2024-07-11 18:49:15 +02:00
Daniel Bevenius b078c619aa cuda : suppress 'noreturn' warn in no_device_code (#8414)
* cuda : suppress 'noreturn' warn in no_device_code

This commit adds a while(true) loop to the no_device_code function in
common.cuh. This is done to suppress the warning:

```console
/ggml/src/ggml-cuda/template-instances/../common.cuh:346:1: warning:
function declared 'noreturn' should not return [-Winvalid-noreturn]
  346 | }
      | ^
```

The motivation for this is to reduce the number of warnings when
compilng with GGML_HIPBLAS=ON.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! cuda : suppress 'noreturn' warn in no_device_code

Update __trap macro instead of using a while loop to suppress the
warning.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-07-11 17:53:42 +02:00
Johannes Gäßler 808aba3916 CUDA: optimize and refactor MMQ (#8416)
* CUDA: optimize and refactor MMQ

* explicit q8_1 memory layouts, add documentation
2024-07-11 16:47:47 +02:00
Georgi Gerganov a977c11544 gitignore : deprecated binaries 2024-07-11 11:20:40 +03:00
compilade 9a55ffe6fb tokenize : add --no-parse-special option (#8423)
This should allow more easily explaining
how parse_special affects tokenization.
2024-07-11 10:41:48 +03:00
Georgi Gerganov 7a221b672e llama : use F32 precision in Qwen2 attention and no FA (#8412) 2024-07-11 10:21:30 +03:00
Clint Herron 278d0e1846 Initialize default slot sampling parameters from the global context. (#8418) 2024-07-10 20:08:17 -04:00
Clint Herron dd07a123b7 Name Migration: Build the deprecation-warning 'main' binary every time (#8404)
* Modify the deprecation-warning 'main' binary to build every time, instead of only when a legacy binary is present. This is to help users of tutorials and other instruction sets from knowing what to do when the 'main' binary is missing and they are trying to follow instructions.

* Adjusting 'server' name-deprecation binary to build all the time, similar to the 'main' legacy name binary.
2024-07-10 12:35:18 -04:00
AidanBeltonS f4444d992c [SYCL] Use multi_ptr to clean up deprecated warnings (#8256) 2024-07-10 16:10:49 +01:00
Georgi Gerganov 6b2a849d1f ggml : move sgemm sources to llamafile subfolder (#8394)
ggml-ci
2024-07-10 15:23:29 +03:00
Dibakar Gope 0f1a39f343 ggml : add AArch64 optimized GEMV and GEMM Q4 kernels (#5780)
* Arm AArch64: optimized GEMV and GEMM kernels for q4_0_q8_0, and q8_0_q8_0 quantization

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add copyright claim only to ggml-aarch64.cpp and ggml-aarch64.h files

* Arm AArch64: minor code refactoring for rebase

* Arm AArch64: minor code refactoring for resolving a build issue with cmake

* Arm AArch64: minor code refactoring to split the Q4_0_AARC64 type into three separate types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: minor code change for resolving a build issue with server-windows

* retrigger checks

* Arm AArch64: minor code changes for rebase

* Arm AArch64: minor changes to skip the pr#7433 vec_dot code for arm cpus with SVE VL not equal to 256 bits

* Arm AArch64: remove stale LLAMA_QKK_64 from CMakeLists.txt and delete build.zig

* Arm AArch64: add reference scalar gemm and gemv, and avoid dynamic memory allocations during quantization for Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: add multithreaded quantization support for the new types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: minor code refactoring

* Arm AArch64: simplify logic for calling gemm and gemv functions in ggml_compute_forward_mul_mat

* Arm AArch64: minimize changes in ggml_compute_forward_mul_mat

* Arm AArch64: minor code refactoring, and add reference scalar code to quantize routines for new quant types

* Arm AArch64: minor code refactoring

* Arm AArch64: minor code refactoring

* Arm AArch64: minor code refactoring

* rebase on the latest master commit 3fd62a6 and adapt to the new directory structure

* Arm AArch64: remove a redundant comment

* Arm AArch64: add pragma in ggml-aarch64.c to turn -Woverlength-strings warning off

* Arm AArch64: use __aarch64__ check to guard 64-bit neon kernels

* Arm AArch64: update docs/build.md README to include compile time flags for buiilding the Q4_0_4_4 quant type
2024-07-10 15:14:51 +03:00
M. Yusuf Sarıgöz 83321c6958 gguf-py rel pipeline (#8410)
* Upd gguf-py/readme

* Bump patch version for release
2024-07-10 15:12:35 +03:00
Borislav Stanimirov cc61948b1f llama : C++20 compatibility for u8 strings (#8408) 2024-07-10 14:45:44 +03:00
Borislav Stanimirov 7a80710d93 msvc : silence codecvt c++17 deprecation warnings (#8395) 2024-07-10 14:40:53 +03:00
fairydreaming a8be1e6f59 llama : add assert about missing llama_encode() call (#8400)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-07-10 14:38:58 +03:00
RunningLeon e4dd31ff89 py : fix converter for internlm2 (#8321)
* update internlm2

* remove unused file

* fix lint
2024-07-10 14:26:40 +03:00
laik 8f0fad42b9 py : fix extra space in convert_hf_to_gguf.py (#8407) 2024-07-10 14:19:10 +03:00
Clint Herron a59f8fdc85 Server: Enable setting default sampling parameters via command-line (#8402)
* Load server sampling parameters from the server context by default.

* Wordsmithing comment
2024-07-09 18:26:40 -04:00
Andy Salerno fd560fe680 Update README.md to fix broken link to docs (#8399)
Update the "Performance troubleshooting" doc link to be correct - the file was moved into a dir called 'development'
2024-07-09 14:58:44 -04:00
Clint Herron e500d6135a Deprecation warning to assist with migration to new binary names (#8283)
* Adding a simple program to provide a deprecation warning that can exist to help people notice the binary name change from #7809 and migrate to the new filenames.

* Build legacy replacement binaries only if they already exist. Check for their existence every time so that they are not ignored.
2024-07-09 11:54:43 -04:00
Johannes Gäßler a03e8dd99d make/cmake: LLAMA_NO_CCACHE -> GGML_NO_CCACHE (#8392) 2024-07-09 17:11:07 +02:00
Alberto Cabrera Pérez 5b0b8d8cfb sycl : Reenabled mmvq path for the SYCL Nvidia Backend (#8372)
* SYCL : Reenabled mmvq path for the SYCL Nvidia Backend

* Reduced verbosity of comment
2024-07-09 22:03:15 +08:00
Borislav Stanimirov 9925ca4087 cmake : allow external ggml (#8370) 2024-07-09 11:38:00 +03:00
daghanerdonmez 9beb2dda03 readme : fix typo [no ci] (#8389)
Bakus-Naur --> Backus-Naur
2024-07-09 09:16:00 +03:00
compilade 7d0e23d72e gguf-py : do not use internal numpy types (#7472) 2024-07-09 01:04:49 -04:00
Georgi Gerganov 7fdb6f73e3 flake.lock: Update (#8342)
Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/2a55567fcf15b1b1c7ed712a2c6fadaec7412ea8?narHash=sha256-iKzJcpdXih14qYVcZ9QC9XuZYnPc6T8YImb6dX166kw%3D' (2024-06-01)
  → 'github:hercules-ci/flake-parts/9227223f6d922fee3c7b190b2cc238a99527bbb7?narHash=sha256-pQMhCCHyQGRzdfAkdJ4cIWiw%2BJNuWsTX7f0ZYSyz0VY%3D' (2024-07-03)
• Updated input 'flake-parts/nixpkgs-lib':
    'https://github.com/NixOS/nixpkgs/archive/eb9ceca17df2ea50a250b6b27f7bf6ab0186f198.tar.gz?narHash=sha256-lIbdfCsf8LMFloheeE6N31%2BBMIeixqyQWbSr2vk79EQ%3D' (2024-06-01)
  → 'https://github.com/NixOS/nixpkgs/archive/5daf0514482af3f97abaefc78a6606365c9108e2.tar.gz?narHash=sha256-Fm2rDDs86sHy0/1jxTOKB1118Q0O3Uc7EC0iXvXKpbI%3D' (2024-07-01)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/b2852eb9365c6de48ffb0dc2c9562591f652242a?narHash=sha256-C8e9S7RzshSdHB7L%2Bv9I51af1gDM5unhJ2xO1ywxNH8%3D' (2024-06-27)
  → 'github:NixOS/nixpkgs/9f4128e00b0ae8ec65918efeba59db998750ead6?narHash=sha256-rwz8NJZV%2B387rnWpTYcXaRNvzUSnnF9aHONoJIYmiUQ%3D' (2024-07-03)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-07-08 15:36:38 -07:00
Alberto Cabrera Pérez a130eccef4 labeler : updated sycl to match docs and code refactor (#8373) 2024-07-08 22:35:17 +02:00
b4b4o c4dd11d1d3 readme : fix web link error [no ci] (#8347) 2024-07-08 17:19:24 +03:00
Alberto Cabrera Pérez 2ec846d558 sycl : fix powf call in device code (#8368) 2024-07-08 14:22:41 +01:00
Georgi Gerganov 3f2d538b81 scripts : fix sync for sycl 2024-07-08 13:51:31 +03:00
Georgi Gerganov 2ee44c9a18 sync : ggml
ggml-ci
2024-07-08 12:23:00 +03:00
Georgi Gerganov 6847d54c4f tests : fix whitespace (#0) 2024-07-08 12:23:00 +03:00
John Balis fde13b3bb9 feat: cuda implementation for ggml_conv_transpose_1d (ggml/854)
* conv transpose 1d passing test for 1d input and kernel

* working for different input and output channel counts, added test for variable stride

* initial draft appears to work with stride other than 1

* working with all old and new conv1d  tests

* added a test for large tensors

* removed use cuda hardcoding

* restored test-conv-transpose.c

* removed unused arugments, and fixed bug where test failure would cause subsequent tests to fail

* fixed accumulator bug

* added test to test-backend-ops

* fixed mistake

* addressed review

* fixed includes

* removed blank lines

* style and warning fixes

* return failure when test fails

* fix supports_op

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-07-08 12:23:00 +03:00
Kevin Wang 470939d483 common : preallocate sampling token data vector (#8363)
`emplace_back` repeatedly-called is slower than preallocating the vector to the vocab size and directly inserting the data. Some rudimentary profiling with `chrono` improves the performance of this block of code from ~500us/op to ~40us/op.

Overall, this slightly improves the sampling performance which has a more substantial impact for the `examples/lookahead` implementation -- I am able to see a ~10% performance boost in lookahead inference.
2024-07-08 10:26:53 +03:00
Georgi Gerganov 6f0dbf6ab0 infill : assert prefix/suffix tokens + remove old space logic (#8351) 2024-07-08 09:34:35 +03:00
Kevin Wang ffd00797d8 common : avoid unnecessary logits fetch (#8358) 2024-07-08 09:31:55 +03:00
toyer 04ce3a8b19 readme : add supported glm models (#8360) 2024-07-08 08:57:19 +03:00
compilade 3fd62a6b1c py : type-check all Python scripts with Pyright (#8341)
* py : type-check all Python scripts with Pyright

* server-tests : use trailing slash in openai base_url

* server-tests : add more type annotations

* server-tests : strip "chat" from base_url in oai_chat_completions

* server-tests : model metadata is a dict

* ci : disable pip cache in type-check workflow

The cache is not shared between branches, and it's 250MB in size,
so it would become quite a big part of the 10GB cache limit of the repo.

* py : fix new type errors from master branch

* tests : fix test-tokenizer-random.py

Apparently, gcc applies optimisations even when pre-processing,
which confuses pycparser.

* ci : only show warnings and errors in python type-check

The "information" level otherwise has entries
from 'examples/pydantic_models_to_grammar.py',
which could be confusing for someone trying to figure out what failed,
considering that these messages can safely be ignored
even though they look like errors.
2024-07-07 15:04:39 -04:00
Denis Spasyuk a8db2a9ce6 Update llama-cli documentation (#8315)
* Update README.md

* Update README.md

* Update README.md

fixed llama-cli/main, templates on some cmds added chat template sections and fixed typos in some areas

* Update README.md

* Update README.md

* Update README.md
2024-07-07 17:08:28 +02:00
Alex Tuddenham 4090ea5501 ci : add checks for cmake,make and ctest in ci/run.sh (#8200)
* Added checks for cmake,make and ctest

* Removed erroneous whitespace
2024-07-07 17:59:14 +03:00
Andy Tai f1948f1e10 readme : update bindings list (#8222)
* adding guile_llama_cpp  to binding list

* fix formatting

* fix formatting
2024-07-07 16:21:37 +03:00
Brian f7cab35ef9 gguf-hash: model wide and per tensor hashing using xxhash and sha1 (#8048)
CLI to hash GGUF files to detect difference on a per model and per tensor level

The hash type we support is:

- `--xxh64`: use xhash 64bit hash mode (default)
- `--sha1`: use sha1
- `--uuid`: use uuid
- `--sha256`: use sha256

While most POSIX systems already have hash checking programs like sha256sum, it
is designed to check entire files. This is not ideal for our purpose if we want
to check for consistency of the tensor data even if the metadata content of the
gguf KV store has been updated.

This program is designed to hash a gguf tensor payload on a 'per tensor layer'
in addition to a 'entire tensor model' hash. The intent is that the entire
tensor layer can be checked first but if there is any detected inconsistencies,
then the per tensor hash can be used to narrow down the specific tensor layer
that has inconsistencies.

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-07 22:58:43 +10:00
toyer 905942abdb llama : support glm3 and glm4 (#8031)
* add chatglm3-6b model support huggingface model:
 https://hf-mirror.com/THUDM/chatglm3-6b

Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com>

* remove .rotary_pos_emb.inv_freq and unuse code for chatglm3 model

Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com>

* fix lint error

Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com>

* optimize convert-hf-to-gguf.py for chatglm model

Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com>

* support glm-4-9b-chat

Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com>

* fix eos tokens to glm4

* remove unused log

* add preprocess to chatglm3 and chatglm4

* add eos_id_list to llama.cpp

* fix code style

* fix code style

* fix conflicts

* fix conflicts

* Revert "add eos_id_list to llama.cpp"

This reverts commit 3a4d5790bf.

* set <|endoftext|> as eos and <|user|> as eot

* fix chat template bug

* add comment to glm prefix and suffix

* fix conflicts and add rope_ratio & ChatGLMForConditionalGeneration

* fix chat template bug

* fix codestyle

* fix conflicts

* modified the general name of glm model

* fix conflicts

* remove prefix and suffix

* use normal glm4 chattempalte & use LLM_FFN_SWIGLU in phi3

* fix: resolve Flake8 errors in `convert-hf-to-gguf.py`

- Fix E302 by adding two blank lines before top-level function definitions
- Replace print statements to fix NP100
- Fix E303 by ensuring only one blank line between lines of code

* fix rope ratio to solve incorrect answers

* fix by comments

---------

Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com>
Co-authored-by: XingXing Qiao <qiaoxx@dingdao.com>
Co-authored-by: Umpire2018 <138990495+Umpire2018@users.noreply.github.com>
2024-07-07 15:52:10 +03:00
Georgi Gerganov b5040086d4 llama : fix n_rot default (#8348)
ggml-ci
2024-07-07 14:59:02 +03:00
compilade d39130a398 py : use cpu-only torch in requirements.txt (#8335) 2024-07-07 14:23:38 +03:00
standby24x7 b81ba1f96b finetune: Rename command name in README.md (#8343)
Rename an old command name "finetune" to "llama-finetune"
in README.md

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
2024-07-07 13:38:02 +03:00
standby24x7 210eb9ed0a finetune: Rename an old command name in finetune.sh (#8344)
This patch replaces an old commad "main" with "llama-cli"
in finetune.sh.
The part that I fixed is comment, so it doesn't change
the script.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
2024-07-07 13:37:47 +03:00
Bjarke Viksøe cb4d86c4d7 server: Retrieve prompt template in /props (#8337)
* server: Retrieve prompt template in /props

This PR adds the following:
- Expose the model's Jinja2 prompt template from the model in the /props endpoint.
- Change log-level from Error to Warning for warning about template mismatch.

The front-end stands a better chance of actually executing the Jinja template format correctly. Server is currently just guessing it.

Ideally this should have been inside a JSON block that expose the same key/value pairs as listed during startup in "llm_load_print_meta" function.

* Make string buffer dynamic

* Add doc and better string handling

* Using chat_template naming convention

* Use intermediate vector for string assignment
2024-07-07 11:10:38 +02:00
Derrick T. Woolworth 86e7299ef5 added support for Authorization Bearer tokens when downloading model (#8307)
* added support for Authorization Bearer tokens

* removed auth_token, removed set_ function, other small fixes

* Update common/common.cpp

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-07-06 22:32:04 +02:00
Xuan Son Nguyen 60d83a0149 update main readme (#8333) 2024-07-06 19:01:23 +02:00
Daniel Bevenius 87e25a1d1b llama : add early return for empty range (#8327)
* llama : add early return for empty range

This commit adds an early return to the llama_kv_cache_seq_add and
llama_kv_cache_seq_div functions.

The motivation for adding this is to avoid looping over the cache
when the range is empty. I ran into this when using the self-extend
feature in main.cpp.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* llama : add static_cast to fix CI warning/error

This commit attempts to fix the following warning/error:

```console
src/llama.cpp:7271:31: error:
comparison of integer expressions of different signedness:
‘int’ and ‘uint32_t’ {aka ‘unsigned int’} [-Werror=sign-compare]
 7271 |                         if (i < hparams.n_layer_dense_lead) {
      |                             ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
This can be reproduced locally by setting -Wsign-compare in the
Makefile.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! llama : add early return for empty range

Remove the setting of cache.head to 0 when the range is empty.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* Update src/llama.cpp

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-06 10:22:16 +03:00
jaime-m-p 213701b51a Detokenizer fixes (#8039)
* Add llama_detokenize():
  - Update header files location
  - UNKNOWN and CONTROL are 'special pieces'
  - Remove space after UNKNOWN and CONTROL
  - Refactor llama_token_to_piece()
  - Add flag: clean_up_tokenization_spaces
  - Symmetric params for llama_tokenize() and llama_detokenize()

* Update and fix tokenizer tests:
  - Using llama_detokenize()
  - Unexpected vocab type as test fail instead of error
    - Useful when automating tests:
    - If you don't know in advance the vocab type
    - Differenciate other loading errors
  - Skip unicode surrogaes and undefined
  - Gracefully exit threads
    - Using exit() is throwing random exceptions
  - Clean old known problematic codepoints
  - Minor: confusing hexadecimal codepoint

* Update bruteforce random tests
  - Add detokenizer checks
  - New generator: ascii_lr_strip
  - New generator: apostrophe
  - Add more vocabs files
  - Detokenize special tokens.
  - Replace errors with '\uFFFD' when detokenizing to 'utf-8'
  - More edge cases
  - Better detokenization results check

* Fix add_space_prefix, set false by default
* Better leading space removal
* Do not remove space when decoding special tokens
* Bugfix: custom regexs splits undefined unicode codepoints
* 'viking' detokenizer clean spaces
2024-07-05 19:01:35 +02:00
Xuan Son Nguyen be20e7f49d Reorganize documentation pages (#8325)
* re-organize docs

* add link among docs

* add link to build docs

* fix style

* de-duplicate sections
2024-07-05 18:08:32 +02:00
Georgi Gerganov 7ed03b8974 llama : fix compile warning (#8304) 2024-07-05 17:32:09 +03:00
Natsu 1d894a790e cmake : add GGML_BUILD and GGML_SHARED macro definitions (#8281) 2024-07-05 17:29:35 +03:00
Ouadie EL FAROUKI 1f3e1b66e2 Enabled more data types for oneMKL gemm_batch (#8236) 2024-07-05 13:23:25 +01:00
Georgi Gerganov 148ec970b6 convert : remove AWQ remnants (#8320) 2024-07-05 10:15:36 +03:00
Georgi Gerganov 2cccbaa008 llama : minor indentation during tensor loading (#8304)
* llama : minor indentation during tensor loading

ggml-ci

* llama : use int for layer iterators [no ci]
2024-07-05 10:15:24 +03:00
Johannes Gäßler 8e558309dc CUDA: MMQ support for iq4_nl, iq4_xs (#8278) 2024-07-05 09:06:31 +02:00
Daniele 0a423800ff CUDA: revert part of the RDNA1 optimizations (#8309)
The change on the launch_bounds was causing a small performance drop in perplexity of 25 t/s
2024-07-05 09:06:09 +02:00
Douglas Hanley d12f781074 llama : streamline embeddings from "non-embedding" models (#8087) 2024-07-05 10:05:56 +03:00
Johannes Gäßler bcefa03bc0 CUDA: fix MMQ stream-k rounding if ne00 % 128 != 0 (#8311) 2024-07-05 09:05:34 +02:00
Pieter Ouwerkerk 5a7447c569 readme : fix minor typos [no ci] (#8314) 2024-07-05 09:58:41 +03:00
Daniel Bevenius 61ecafa390 passkey : add short intro to README.md [no-ci] (#8317)
* passkey : add short intro to README.md [no-ci]

This commit adds a short introduction to the README.md file in the
examples/passkey directory.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* Update examples/passkey/README.md

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-05 09:14:24 +03:00
Georgi Gerganov aa5898dc53 llama : prefer n_ over num_ prefix (#8308) 2024-07-05 09:10:03 +03:00
Georgi Gerganov 6c05752c50 contributing : update guidelines (#8316) 2024-07-05 09:09:47 +03:00
luoyu-intel a9554e20b6 [SYCL] Fix WARP_SIZE=16 bug of Intel GPU (#8266)
* fix group_norm ut

* split softmax

* fix softmax

* add concat support condition

* revert debug code

* move QK_WARP_SIZE to presets.hpp
2024-07-05 13:06:13 +08:00
Georgi Gerganov e235b267a2 py : switch to snake_case (#8305)
* py : switch to snake_case

ggml-ci

* cont

ggml-ci

* cont

ggml-ci

* cont : fix link

* gguf-py : use snake_case in scripts entrypoint export

* py : rename requirements for convert_legacy_llama.py

Needed for scripts/check-requirements.sh

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
2024-07-05 07:53:33 +03:00
Neo Zhang Jianyu f09b7cb609 rm get_work_group_size() by local cache for performance (#8286)
Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>
2024-07-05 10:32:29 +08:00
Xuan Son Nguyen a38b884c6c cli: add EOT when user hit Ctrl+C (#8296)
* main: add need_insert_eot

* do not format system prompt if it is empty
2024-07-04 20:55:03 +02:00
Icecream95 d7fd29fff1 llama : add OpenELM support (#7359)
* Initial OpenELM support (270M only so far)

* Fill out missing entries in llama_model_type_name

* fixup! Initial OpenELM support (270M only so far)

Fix formatting

* llama : support all OpenELM models

* llama : add variable GQA and variable FFN sizes

Some metadata keys can now also be arrays to support setting
their value per-layer for models like OpenELM.

* llama : minor spacing changes

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* llama : use std::array for per-layer hparams

* llama : fix save/load state

* llama : do not print hparams for vocab-only models

* llama : handle n_head == 0

* llama : use const ref for print_f and fix division by zero

* llama : fix t5 uses of n_head and n_ff

* llama : minor comment

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-04 20:14:21 +03:00
Daniel Bevenius 6f63d646c1 tokenize : add --show-count (token) option (#8299)
This commit adds a new option to the tokenize example, --show-count.
When this is set the total number of tokens are printed to stdout.

This was added as an option as I was concerned that there might be
scripts that use the output from this program and it might be better to
not print this information by default.

The motivation for this is that can be useful to find out how many
tokens a file contains, for example when trying to determine prompt
input file sizes for testing.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-07-04 19:38:58 +03:00
ditsuke 51d2ebadbb build: Export hf-to-gguf as snakecase 2024-07-04 15:39:13 +00:00
ditsuke 1e920018d3 doc: Add context for why we add an explicit pytorch source 2024-07-04 15:39:13 +00:00
ditsuke 01a5f06550 chore: Remove rebase artifacts 2024-07-04 15:39:13 +00:00
ditsuke 07786a61a2 chore: Fixup requirements and build 2024-07-04 15:39:13 +00:00
ditsuke de14e2ea2b chore: ignore all __pychache__ 2024-07-04 15:39:13 +00:00
ditsuke 821922916f fix: Update script paths in CI scripts 2024-07-04 15:39:13 +00:00
ditsuke b1c3f26e5e fix: Actually include scripts in build
Not namespaced though :(
2024-07-04 15:39:13 +00:00
ditsuke b0a46993df build(python): Package scripts with pip-0517 compliance 2024-07-04 15:39:13 +00:00
fairydreaming 807b0c49ff Inference support for T5 and FLAN-T5 model families (#5763)
* llama : add inference support and model types for T5 and FLAN-T5 model families

* llama : add new API functions to support encoder-decoder models: llama_encode(), llama_model_has_encoder(), llama_model_decoder_start_token()

* common, llama-cli, llama-batched : add support for encoder-decoder models

* convert-hf : handle shared token embeddings tensors in T5Model

* convert-hf : add support for SentencePiece BPE tokenizer in T5Model (for Pile-T5 models)

* convert-hf : add MT5ForConditionalGeneration and UMT5ForConditionalGeneration to architectures supported by T5Model

* convert : add t5 tokenizer tests, use "slow" HF tokenizer for t5

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-04 15:46:11 +02:00
Daniel Bevenius f8c4c0738d tests : add _CRT_SECURE_NO_WARNINGS for WIN32 (#8231)
This commit adds the compile definition `_CRT_SECURE_NO_WARNINGS`
to the root cmake subproject.

The motivation for this is that currently the following warnings are
displayed when compiling the tests and common cmake subprojects:
```console
test-llama-grammar.cpp
C:\llama.cpp\src\.\llama.cpp(1406,77): warning C4996: 'strerror':
This function or variable may be unsafe. Consider using strerror_s
instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See
online help for details.
[C:\llama.cpp\build\tests\test-llama-grammar.vcxproj]
...
```

This compile definition is currently set for the `src` subproject
and this change moves into the root cmake project so that it is applied
to all cmake subprojects.
2024-07-04 13:53:42 +03:00
Daniel Bevenius 402d6feffa llama : suppress unref var in Windows MSVC (#8150)
* llama : suppress unref var in Windows MSVC

This commit suppresses two warnings that are currently generated for
src/llama.cpp when building on Windows MSVC

```console
C:\llama.cpp\src\llama.cpp(14349,45): warning C4101: 'ex':
unreferenced local variable [C:\llama.cpp\build\src\llama.vcxproj]
C:\llama.cpp\src\llama.cpp(19285,44): warning C4101: 'e':
unreferenced local variable [C:\llama.cpp\build\src\llama.vcxproj]
```

* Update src/llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-04 13:50:57 +03:00
Georgi Gerganov 20fc3804bf convert : fix gemma v1 tokenizer convert (#8248)
ggml-ci
2024-07-04 10:41:03 +03:00
AidanBeltonS f619024764 [SYCL] Remove unneeded semicolons (#8280) 2024-07-04 09:07:19 +08:00
Daniele d23287f122 Define and optimize RDNA1 (#8085) 2024-07-04 01:02:58 +02:00
slaren 5f2d4e60e2 ppl : fix n_seq_max for perplexity (#8277)
* ppl : fix n_seq_max for perplexity

* use 1 seq for kl_divergence
2024-07-03 20:33:31 +03:00
Xuan Son Nguyen 916248af1f fix phi 3 conversion (#8262) 2024-07-03 16:01:54 +02:00
Judd f8d6a23804 fix typo (#8267)
Co-authored-by: Judd <foldl@boxvest.com>
2024-07-03 14:40:16 +02:00
AidanBeltonS fadde67135 Dequant improvements rebase (#8255)
* Single load for half2

* Store scales in local mem

* Vec load quantized values
2024-07-03 09:55:34 +08:00
MistApproach a27152b602 fix: add missing short command line argument -mli for multiline-input (#8261) 2024-07-02 22:56:46 +02:00
Clint Herron 3e2618bc7b Adding step to clean target to remove legacy binary names to reduce upgrade / migration confusion arising from #7809. (#8257) 2024-07-02 13:19:56 -04:00
Clint Herron 07a3fc0608 Removes multiple newlines at the end of files that is breaking the editorconfig step of CI. (#8258) 2024-07-02 12:18:10 -04:00
Faisal Zaghloul 968967376d Add JAIS model(s) (#8118)
* Add `JAIS` model(s)

* cleanup

* address review comments

* remove hack

* un-hardcode max-alibi-bias

* minor tweaks

---------

Co-authored-by: fmz <quic_fzaghlou@quic.com>
2024-07-02 16:36:00 +02:00
Daniel Bevenius 023b8807e1 convert-hf : print output file name when completed (#8181)
* convert-hf : print output file name when completed

This commit adds the output file name to the log message when the
conversion is completed.

The motivation for this change is that when `--outfile` option is not
specified it migth not be obvious where the output file is written.

With this change the output of running the script will be something like
the following:
```console
INFO:hf-to-gguf:Model successfully exported to models/gemma-2-9b-it.gguf.
```

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! convert-hf : print output file name when completed

Updates the output of to support printing the directory if the output is
split into multiple files. Also the output file name is now retrieved
from the model_instance object.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! convert-hf : print output file name when completed

Use parent attribute of Path object and string interpolation.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! convert-hf : print output file name when completed

Use os.sep instead of hardcoding the path separator.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-07-02 09:40:49 +03:00
slaren 0e0590adab cuda : update supports_op for matrix multiplication (#8245) 2024-07-02 09:39:38 +03:00
luoyu-intel a9f3b10215 [SYCL] Fix win build conflict of math library (#8230)
* fix win build conflict of math library

* fix the condition: !(win32 & SYCL)

* revert warp_size=16
2024-07-02 12:50:07 +08:00
luoyu-intel d08c20edde [SYCL] Fix the sub group size of Intel (#8106)
* use warp_size macro for all sycl kernels

* fix mask of permute_sub_group_by_xor

* fix rms_norm with correct warp number

* fix rms_norm_f32/group_norm_f32

* move norm to norm.cpp file

* fix quantize bug

* fix mmvq's batch size
2024-07-02 10:16:00 +08:00
Xuan Son Nguyen 5fac350b9c Fix gemma2 tokenizer convert (#8244)
* fix gemma2 tokenizer convert

* remove scores

* improve code, fix new line issue
2024-07-02 01:07:23 +02:00
Johannes Gäßler cb5fad4c6c CUDA: refactor and optimize IQ MMVQ (#8215)
* CUDA: refactor and optimize IQ MMVQ

* uint -> uint32_t

* __dp4a -> ggml_cuda_dp4a

* remove MIN_CC_DP4A checks

* change default

* try CI fix
2024-07-01 20:39:06 +02:00
Mateusz Charytoniuk dae57a1ebc readme: add Paddler to the list of projects (#8239) 2024-07-01 20:13:22 +03:00
Xuan Son Nguyen 49122a873f gemma2: add sliding window mask (#8227)
* gemma2: add sliding window mask

* fix data_swa uninitialized

* better naming

* add co-author

Co-authored-by: Arlo Phoenix <arlo-phoenix@users.noreply.github.com>

* replace list with single tensor

* update

* llama : minor styling

* convert : add sanity check for query_pre_attn_scalar

* fix small typo in README

---------

Co-authored-by: Arlo Phoenix <arlo-phoenix@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-01 18:48:34 +02:00
Roni 0ddeff1023 readme : update tool list (#8209)
* Added gppm to Tool list in README

* Update README.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-01 15:48:16 +03:00
Michael Francis 3840b6f593 nix : enable curl (#8043)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-01 14:47:04 +03:00
Georgi Gerganov 257f8e41e2 nix : remove OpenCL remnants (#8235)
* nix : remove OpenCL remnants

* minor : remove parentheses
2024-07-01 14:46:18 +03:00
iacore 694c59cb42 Document BERT support. (#8205)
* Update README.md

document BERT support

* Update README.md
2024-07-01 13:40:58 +02:00
zhentaoyu 197fe6c1d7 [SYCL] Update SYCL-Rope op and Refactor (#8157)
* align with rope.cu and move sycl-op to a single file
2024-07-01 19:39:06 +08:00
Georgi Gerganov d0a7145ba9 flake.lock: Update (#8218) 2024-06-30 16:09:34 -07:00
Xuan Son Nguyen 9ef0780062 Fix new line issue with chat template, disable template when in-prefix/suffix is set (#8203)
* preserve new line llama_chat_format_single

* disable chat template if in-prefix/suffix is set

* remove redundant change
2024-06-30 20:27:13 +02:00
Andrei 1c5eba6f8e llama: Add attention and final logit soft-capping, update scaling factor to Gemma2 (#8197)
* Add attention and final logit softcapping.

* fix

* Add custom add_ functions

* Disable flash attention for Gemma2

* Update src/llama.cpp

Co-authored-by: slaren <slarengh@gmail.com>

* Add default value for attention and final logit softcap value

* Add custom kq scaling from Gemma2Attention

* Remove custom pre attention scaling and use computed value instead.

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-06-29 23:44:08 -04:00
Xuan Son Nguyen 72272b83a3 fix code typo in llama-cli (#8198) 2024-06-29 00:14:20 +02:00
Olivier Chafik 8748d8ac6f json: attempt to skip slow tests when running under emulator (#8189) 2024-06-28 18:02:05 +01:00
Xuan Son Nguyen 26a39bbd6b Add MiniCPM, Deepseek V2 chat template + clean up llama_chat_apply_template_internal (#8172)
* tmp_contains

* minicpm chat template

* add DeepSeek Lite template

* change deepseek-lite to deepseek2

* correct code comment

* correct code from master branch
2024-06-28 15:11:44 +02:00
Sigbjørn Skjæret 38373cfbab Add SPM infill support (#8016)
* add --spm-infill option

* support --spm-infill

* support --spm-infill
2024-06-28 12:53:43 +02:00
slaren b851b3fba0 cmake : allow user to override default options (#8178) 2024-06-28 12:37:45 +02:00
Olivier Chafik 139cc621e9 json: restore default additionalProperties to false, fix some pattern escapes (#8180)
* json: expand ESCAPED_IN_REGEXPS_BUT_NOT_IN_LITERALS charset

* json: revert default of additionalProperties to false

* Update README.md
2024-06-28 09:26:45 +01:00
pculliton e57dc62057 llama: Add support for Gemma2ForCausalLM (#8156)
* Inference support for Gemma 2 model family

* Update convert-hf-to-gguf.py, constants, and tensor mappings

* cleanup

* format fix

* Fix special token vocab bug

* Don't add space prefix

* fix deleted lines

* Update src/llama.cpp

Co-authored-by: slaren <slarengh@gmail.com>

* Add model type names

* Add control vector

* Fix model type identification

---------

Co-authored-by: Andrei Betlen <abetlen@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-06-27 21:00:43 -07:00
Xuan Son Nguyen a27aa50ab7 Add missing items in makefile (#8177) 2024-06-28 02:19:11 +02:00
Olivier Chafik cb0b06a8a6 json: update grammars/README w/ examples & note about additionalProperties (#8132)
* json: update grammars/README

* mention broken prefixItems

* add mention to llama-gbnf-validator

* json: explicit type: object for nested items object in cli example
2024-06-27 22:08:42 +01:00
loonerin 558f44bf83 CI: fix release build (Ubuntu+Mac) (#8170)
* CI: fix release build (Ubuntu)

PR #8006 changes defaults to build shared libs. However, CI for releases
expects static builds.

* CI: fix release build (Mac)

---------

Co-authored-by: loonerin <loonerin@users.noreply.github.com>
2024-06-27 21:01:23 +02:00
slaren 8172ee9da9 cmake : fix deprecated option names not working (#8171)
* cmake : fix deprecated option names not working

* remove LlAMA_OPENMP
2024-06-27 20:04:39 +02:00
Xuan Son Nguyen 16791b8f0b Add chatml fallback for cpp llama_chat_apply_template (#8160)
* add chatml fallback for cpp `llama_chat_apply_template`

* remove redundant code
2024-06-27 18:14:19 +02:00
Georgi Gerganov ab3679112d flake.lock: Update (#8071)
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/e9ee548d90ff586a6471b4ae80ae9cfcbceb3420?narHash=sha256-4Zu0RYRcAY/VWuu6awwq4opuiD//ahpc2aFHg2CWqFY%3D' (2024-06-13)
  → 'github:NixOS/nixpkgs/d603719ec6e294f034936c0d0dc06f689d91b6c3?narHash=sha256-k3JqJrkdoYwE3fHE6xGDY676AYmyh4U2Zw%2B0Bwe5DLU%3D' (2024-06-20)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Philip Taron <philip.taron@gmail.com>
2024-06-27 08:37:29 -07:00
jukofyork 97877eb10b Control vector loading fixes (#8137)
* Fixed leak in llama_control_vector_load_one() and allow llama_control_vector_load() to grow

* refactored `llama_control_vector_load_one()`

* allow multiple directions for same layer in same file

* llama_control_vector_load_one() and llama_control_vector_load() now break on error

* removed unnecessary ggml_free() call
2024-06-27 16:48:07 +02:00
Raj Hammeer Singh Hada 387952651a Delete examples/llama.android/llama/CMakeLists.txt (#8165)
* Delete examples/llama.android/llama/CMakeLists.txt

https://github.com/ggerganov/llama.cpp/pull/8145#issuecomment-2194534244

This file is not being used for building on Android. `llama.cpp/examples/llama.android/llama/src/main/cpp/CMakeLists.txt` is being used instead.

* Update CMakeLists.txt

Pick local llama.cpp files instead of fetching content from git
2024-06-27 16:39:29 +02:00
Sigbjørn Skjæret 6030c61281 Add Qwen2MoE 57B-A14B model identifier (#8158)
* Add Qwen2MoE 57B-A14B

* Add Qwen2MoE 57B-A14B
2024-06-27 16:27:41 +02:00
Johannes Gäßler 85a267daaa CUDA: fix MMQ stream-k for --split-mode row (#8167) 2024-06-27 16:26:05 +02:00
kustaaya f675b20a3b Added support for Viking pre-tokenizer (#8135)
Co-authored-by: kustaaya <kustaaya@protonmail.com>
2024-06-27 10:58:54 +02:00
Sigbjørn Skjæret 911e35bb8b llama : fix CodeLlama FIM token checks (#8144)
* account for space prefix character

* use find instead
2024-06-27 10:46:41 +03:00
Raj Hammeer Singh Hada ac146628e4 Fix llama-android.cpp for error - "common/common.h not found" (#8145)
- Path seems to be wrong for the common.h header file in llama-android.cpp file. Fixing the path so the Android Build doesn't fail with the error "There is no file common/common.h"
2024-06-27 03:57:57 +02:00
Daniel Bevenius 9b31a40c6d clip : suppress unused variable warnings (#8105)
* clip : suppress unused variable warnings

This commit suppresses unused variable warnings for the variables e in
the catch blocks.

The motivation for this change is to suppress the warnings that are
generated on Windows when using the MSVC compiler. The warnings are
not displayed when using GCC because GCC will mark all catch parameters
as used.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! clip : suppress unused variable warnings

Remove e (/*e*/) instead instead of using GGML_UNUSED.

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-06-27 01:50:09 +02:00
Georgi Gerganov c70d117c37 scripts : fix filename sync 2024-06-26 23:25:22 +03:00
slaren ae5d0f4b89 ci : publish new docker images only when the files change (#8142) 2024-06-26 21:59:28 +02:00
slaren 31ec3993f6 ggml : add GGML_CUDA_USE_GRAPHS option, restore GGML_CUDA_FORCE_CUBLAS (cmake) (#8140) 2024-06-26 21:34:14 +02:00
slaren c7ab7b612c make : fix missing -O3 (#8143) 2024-06-26 21:20:22 +03:00
Georgi Gerganov f2d48fffde sync : ggml 2024-06-26 19:39:19 +03:00
Georgi Gerganov 4713bf3093 authors : regen 2024-06-26 19:36:44 +03:00
Georgi Gerganov 0e814dfc42 devops : remove clblast + LLAMA_CUDA -> GGML_CUDA (#8139)
ggml-ci
2024-06-26 19:32:07 +03:00
Georgi Gerganov a95631ee97 readme : update API notes 2024-06-26 19:26:13 +03:00
Georgi Gerganov f3f65429c4 llama : reorganize source code + improve CMake (#8006)
* scripts : update sync [no ci]

* files : relocate [no ci]

* ci : disable kompute build [no ci]

* cmake : fixes [no ci]

* server : fix mingw build

ggml-ci

* cmake : minor [no ci]

* cmake : link math library [no ci]

* cmake : build normal ggml library (not object library) [no ci]

* cmake : fix kompute build

ggml-ci

* make,cmake : fix LLAMA_CUDA + replace GGML_CDEF_PRIVATE

ggml-ci

* move public backend headers to the public include directory (#8122)

* move public backend headers to the public include directory

* nix test

* spm : fix metal header

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* scripts : fix sync paths [no ci]

* scripts : sync ggml-blas.h [no ci]

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-06-26 18:33:02 +03:00
Isaac McFadyen 8854044561 Clarify default MMQ for CUDA and LLAMA_CUDA_FORCE_MMQ flag (#8115)
* Add message about int8 support

* Add suggestions from review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-06-26 08:29:28 +02:00
Johannes Gäßler c8771ab5f8 CUDA: fix misaligned shared memory read (#8123) 2024-06-26 08:28:02 +02:00
Eddie-Wang 494165f3b6 llama : extend llm_build_ffn() to support _scale tensors (#8103) 2024-06-26 09:27:46 +03:00
Olivier Chafik 9b2f16f805 json: better support for "type" unions (e.g. nullable arrays w/ typed items) (#7863)
* json: better suport for "type" arrays (e.g. `{"type": ["array", "null"], "items": {"type": "string"}}`)

* json: add test for type: [array, null] fix

* update tests
2024-06-26 01:46:35 +01:00
Olivier Chafik 6777c544bd json: fix additionalProperties, allow space after enum/const (#7840)
* json: default additionalProperty to true

* json: don't force additional props after normal properties!

* json: allow space after enum/const

* json: update pydantic example to set additionalProperties: false

* json: prevent additional props to redefine a typed prop

* port not_strings to python, add trailing space

* fix not_strings & port to js+py

* Update json-schema-to-grammar.cpp

* fix _not_strings for substring overlaps

* json: fix additionalProperties default, uncomment tests

* json: add integ. test case for additionalProperties

* json: nit: simplify condition

* reformat grammar integ tests w/ R"""()""" strings where there's escapes

* update # tokens in server test: consts can now have trailing space
2024-06-26 01:45:58 +01:00
jukofyork 163d50adaf fixes #7999 (adds control vectors to all build_XXX() functions in llama.cpp [needs testing] (#8060)
* fixes #7999

The `build_command_r` forgot to add the control vector.

* Fixes qwen2 too

* Fixed all models' control vectors

* Removed double calls to `cb(cur, "l_out", il)`

* Moved control vector logic to llama_control_vector:apply_to()
2024-06-25 22:47:40 +02:00
fairydreaming 6fcbf68235 llama : implement Unigram tokenizer needed by T5 and FLAN-T5 model families (#5763)
* llama : add T5 model architecture, tensors and model header parameters

* llama : add implementation of Unigram tokenizer with SentencePiece-like text normalization using precompiled charsmap

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-06-25 21:14:35 +02:00
Daniel Bevenius e6bf007744 llama : return nullptr from llama_grammar_init (#8093)
* llama : return nullptr from llama_grammar_init

This commit updates llama_grammar_init to return nullptr instead of
throwing an exception.

The motivation for this is that this function is declared inside an
extern "C" block and is intended/may be used from C code which will not
be able to handle exceptions thrown, and results in undefined behavior.

On Windows and using MSVC the following warning is currently generated:
```console
C:\llama.cpp\llama.cpp(13998,1): warning C4297: 'llama_grammar_init':
function assumed not to throw an exception but does
C:\llama.cpp\llama.cpp(13998,1): message :
__declspec(nothrow), throw(), noexcept(true), or noexcept was specified
on the function
```

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! llama : return nullptr from llama_grammar_init

Add checks for nullptr when calling llama_grammar_init.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Co-authored-by: Clint Herron <hanclinto@gmail.com>
2024-06-25 15:07:28 -04:00
Olivier Chafik 84631fe150 json: support integer minimum, maximum, exclusiveMinimum, exclusiveMaximum (#7797)
* json: support minimum for positive integer values

* json: fix min 0

* json: min + max integer constraints

* json: handle negative min / max integer bounds

* json: fix missing paren min/max bug

* json: proper paren fix

* json: integration test for schemas

* json: fix bounds tests

* Update json-schema-to-grammar.cpp

* json: fix negative max

* json: fix negative min (w/ more than 1 digit)

* Update test-grammar-integration.cpp

* json: nit: move string rules together

* json: port min/max integer support to Python & JS

* nit: move + rename _build_min_max_int

* fix min in [1, 9]

* Update test-grammar-integration.cpp

* add C++11-compatible replacement for std::string_view

* add min/max constrained int field to pydantic json schema example

* fix merge

* json: add integration tests for min/max bounds

* reshuffle/merge min/max integ test cases

* nits / cleanups

* defensive code against string out of bounds (apparently different behaviour of libstdc++ vs. clang's libc++, can't read final NULL char w/ former)
2024-06-25 20:06:20 +01:00
slaren dd047b476c disable docker CI on pull requests (#8110) 2024-06-25 19:20:06 +02:00
joecryptotoo 925c30956d Add healthchecks to llama-server containers (#8081)
* added healthcheck

* added healthcheck

* added healthcheck

* added healthcheck

* added healthcheck

* moved curl to base

* moved curl to base
2024-06-25 17:13:27 +02:00
Brian c8ad35955a Gguf dump start data offset via --data-offset and some extra refactor (#8054)
* gguf-dump: add --data-offset

* gguf-dump: add tensor data offset table

* gguf-dump: refactor GGUFReader for clarity

* gguf-dump: add --data-alignment

* gguf-dump.py: Rename variables and adjust comments

start_data_offset --> data_offset

_build_tensors_info_fields --> _build_tensor_info
2024-06-25 22:03:25 +10:00
Xuan Son Nguyen 49c03c79cd cvector: better prompt handling, add "mean vector" method (#8069)
* remove completions file

* fix inverted vector

* add mean method

* code style

* remove inverted pca hotfix
2024-06-25 13:59:54 +02:00
Xuan Son Nguyen 48e6b92cc3 Add chat template support for llama-cli (#8068)
* add chat template support for llama-cli

* add help message

* server: simplify format_chat

* more consistent naming

* improve

* add llama_chat_format_example

* fix server

* code style

* code style

* Update examples/main/main.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-25 21:56:49 +10:00
HanishKVC 3791ad2193 SimpleChat v3.1: Boolean chat request options in Settings UI, cache_prompt (#7950)
* SimpleChat: Allow for chat req bool options to be user controlled

* SimpleChat: Allow user to control cache_prompt flag in request

* SimpleChat: Add sample GUI images to readme file

Show the chat screen and the settings screen

* SimpleChat:Readme: Add quickstart block, title to image, cleanup

* SimpleChat: RePosition contents of the Info and Settings UI

Make it more logically structured and flow through.

* SimpleChat: Rename to apiRequestOptions from chatRequestOptions

So that it is not wrongly assumed that these request options are
used only for chat/completions endpoint. Rather these are used
for both the end points, so rename to match semantic better.

* SimpleChat: Update image included with readme wrt settings ui

* SimpleChat:ReadMe: Switch to webp screen image to reduce size
2024-06-25 21:27:35 +10:00
HatsuneMikuUwU33 f702a90e24 Update control vector help (#8104) 2024-06-25 10:44:48 +02:00
Meng, Hengyu 083bacce14 [SYCL] Re-enabled mul_mat_batched_sycl (#8095) 2024-06-25 10:19:20 +08:00
Johannes Gäßler 2df373ac40 CUDA: fix matrix multiplication algorithm choice (#8102) 2024-06-25 01:22:33 +02:00
Johannes Gäßler 3b099bcd9c CUDA: fix MMQ writeback for int8 tensor cores (#8100) 2024-06-24 22:15:33 +02:00
Johannes Gäßler a818f3028d CUDA: use MMQ instead of cuBLAS by default (#8075) 2024-06-24 17:43:42 +02:00
fairydreaming d62e4aaa02 gguf-py : fix tensor groups for encoder-decoder models in gguf-dump.py (#8090)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Brian <mofosyne@gmail.com>
2024-06-24 14:13:39 +02:00
Johannes Gäßler 9a590c8226 CUDA: optimize MMQ int8 tensor core performance (#8062)
* CUDA: optimize MMQ int8 tensor core performance

* only a single get_mma_tile_x_k function

* simplify code, make functions constexpr
2024-06-24 12:41:23 +02:00
Christian Zhou-Zheng 52fc8705a0 Option to split during conversion (#6942)
* support splits in convert.py

* Support split by size and dry run to write estimated shards/filesizes

* Move split functionality to new GGUFManager class

* fix improper function signature

* tentative push of convert-hf-to-gguf support

* resolve merge + SplitArguments for easier parsing

* Fix eager tensor memory leak and remove convert.py changes

Removed a memory leak caused by unexpected reference retention to eager tensors.

Also removed GGUFManager functionality in convert.py in favor of specializing for convert-hf-to-gguf.py.

* refactor SplitStrategy to be a deque

Instead of having SplitStrategy have a `data` field that is a deque, just have SplitStrategy be a subclass of deque itself.

* fix Q8 quantization

* remove unnecessary imports in gguf_manager

* fix final? merge issue

* fix gguf_writer placement and remove comments

* oops, actually fix gguf_writer placement

* reduce duplicated code from gguf_writer

* further simplify GGUFManager

* simplify even further and standardize with GGUFWriter

* reduce diffs with master

* form shards while adding tensors, SHA256 sums agree with master

* re-add type hint

Co-authored-by: compilade <git@compilade.net>

* GGUFWriter compatibility fix

Co-authored-by: compilade <git@compilade.net>

* Shard dataclass and un-negative dont_add_architecture

* type consistency in format_n_bytes_to_str

* move kv keys to constants.py

* make pathlib explicit

* base-1024 bytes to base-1000

* rename GGUFManager to GGUFWriterSplit

* Update gguf-py/gguf/constants.py

Co-authored-by: compilade <git@compilade.net>

* fix convert-hf-to-gguf.py permissions

* fix line endings

* Update gguf-py/gguf/gguf_writer_split.py

Co-authored-by: compilade <git@compilade.net>

* convert-hf : restore executable file permission

* examples/convert-legacy-llama.py: restore executable file permission

* reinstate original gguf package import and fix type annotation

* attempt to appease the linter

* attempt 2 to appease the linter

* attempt 3 to appease the linter

* comma consistency

* Update convert-hf-to-gguf.py

Co-authored-by: compilade <git@compilade.net>

* edit cmd line args

* use simplification from #7827

* kv/ti data are still wrong

* try to refactor kv data (still fails)

* fix ti data messiness

* tidy up

* fix linting

* actually make the linter happy

* cleanup round 1

* remove SplitStrategy, SplitArguments

* appease linter

* fix typing and clean up

* fix linting

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <git@compilade.net>

* progress bar, fix split logic

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <git@compilade.net>

* catch oversights

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <git@compilade.net>

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <git@compilade.net>

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <git@compilade.net>

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <git@compilade.net>

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <git@compilade.net>

* swap bar orders

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <git@compilade.net>

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <git@compilade.net>

* compatibility fix

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <git@compilade.net>

* Update convert-hf-to-gguf.py

Co-authored-by: compilade <git@compilade.net>

---------

Co-authored-by: Brian <mofosyne@gmail.com>
Co-authored-by: compilade <git@compilade.net>
2024-06-24 19:42:03 +10:00
slaren 8cb508d0d5 disable publishing the full-rocm docker image (#8083) 2024-06-24 08:36:11 +03:00
Yann Follet 646ef4a9cf embedding : more cli arguments (#7458)
* add parameters for embeddings
--embd-normalize
--embd-output-format
--embd-separator
description in the README.md

* Update README.md

fix tipo

* Trailing whitespace

* fix json generation, use " not '

* fix merge master

* fix code formating
group of parameters // embedding
print usage for embedding parameters

---------

Co-authored-by: Brian <mofosyne@gmail.com>
2024-06-24 08:30:24 +03:00
fairydreaming de0d6a68ac gguf-py, convert-hf : model conversion support for T5 and FLAN-T5 model variants (#5763)
* gguf-py : add T5 model architecture

* gguf-py : add separate tensors for encoder and decoder

* gguf-py : add new model header parameters: decoder_start_token_id, attention.relative_buckets_count, tokenizer.ggml.remove_extra_whitespaces, tokenizer.ggml.precompiled_charsmap

* convert-hf : add model conversion support for T5ForConditionalGeneration and T5WithLMHeadModel

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-06-24 07:06:05 +02:00
slaren 95f57bb5d5 ggml : remove ggml_task_type and GGML_PERF (#8017)
* ggml : remove ggml_task_type and GGML_PERF

* check abort_callback on main thread only

* vulkan : remove usage of ggml_compute_params

* remove LLAMA_PERF
2024-06-24 03:07:59 +02:00
Eddie-Wang e112b610a1 llama : add support for BitnetForCausalLM (#7931)
* hf bitnet v1

* hf bitnet e2e v2

* finish bitnet e2e

* finish f16 hf bitnet e2e

* remove unsed

* finish bitnet i2 e2e

* move i2s to quantize v1

* move i2 to quantize

* clean code

* clean code 2

* fix codestyle

* fix code

* fix

* fix code

* fix merge

* remove unused

* change table name

* fix whitespace

* delete redundant

* i2_s to absmax

* finish i2_s/i8_s vec_dot x86 simd

* i2s->q22

* fix code

* remove block scale

* add dequantize

* fix seq

* update avx2

* remove q2_2

* remove q22_grid

* fix whitespace

* reuse llm_build_kv

* fix bo

---------

Co-authored-by: root <root@wangjinheng>
2024-06-23 21:27:57 +03:00
Aarni Koskela 6a2f298bd7 server : fix JSON-Scheme typo (#7975) 2024-06-23 11:03:08 -04:00
Daniel Bevenius 11318d9aa1 Fix typo in llama_set_embeddings comment (#8077) 2024-06-23 15:39:45 +02:00
slaren b6b9a8e606 fix CI failures (#8066)
* test-backend-ops : increase cpy max nmse

* server ci : disable thread sanitizer
2024-06-23 13:14:45 +02:00
0cc4m 45c0e2e4c1 Refactor Vulkan backend to allow multiple contexts (#7961)
* Refactor Vulkan backend to allow multiple contexts

* Fix too many shader groups called validation error in llama3 on AMD and Intel GPUs

* Fix Vulkan debug build error
2024-06-23 10:21:25 +02:00
Clint Herron b5a5f34efa Removing extra blank lines that were breaking Lint. (#8067) 2024-06-22 14:28:18 -04:00
Xuan Son Nguyen 3e58b0ee35 cvector: fix CI + correct help message (#8064)
* cvector: fix CI + correct help message

* also correct --pca-iter
2024-06-22 18:11:30 +02:00
HatsuneMikuUwU33 adf480c3ab cvector-generator: Moe Moe Fixie-Fixie for Lots of Formats~! ♡(ᐢ ᴥ ᐢ)♡ (#8052)
* Update negative.txt

* Update positive.txt

* Update cvector-generator.cpp

* Update cvector-generator.cpp
2024-06-22 17:19:37 +02:00
0xspringtime 3aa184a8c7 convert-hf : change assert to exception (#8015) 2024-06-22 15:37:41 +02:00
ddh0 5b48cd53a8 Update llama-quantize ppl/file size output from LLaMA-v1 to Llama-3 values (#8058)
Uses the values computed by @JohannesGaessler in PR #7413
2024-06-22 15:16:10 +02:00
Clint Herron c5a8d4b749 JSON Schema to GBNF integration tests (#7790)
* Adding simple bare-bones test for end-to-end integration test for json validation against auto-generated JSON-schema grammars.

* Adding additional examples as documented in #7789 . Also adding the ability to automatically output improperly failing grammars to debug output files so they can more easily be examined in the gbnf-validator program.

* Uncommenting formerly commented tests so that they fail for others who are attempting to reproduce the bugs.

* Merging improved schema test methods added by @ochafik in #7797

* Adding #define to temporarily remove failing tests so that this PR can pass CI, but still be useful for other PRs that want to leverage the framework.

* Fixing nits from ochafik. Removing escape slashes, adding additional failing cases, fixing some other strings.

* Fixing grammar indentation to be consistent throughout file.
2024-06-21 23:18:36 -04:00
k.h.lai 557b653dc9 vulkan: detect multiple devices by deviceUUID instead of deviceID (#8022)
* vulkan: detect multiple devices by deviceUUID instead of deviceID

* vulkan: remove unneeded variables

* vulkan: fix id query
2024-06-21 10:28:20 +02:00
Eve 7d5e8777ae ggml : AVX IQ quants (#7845)
* initial iq4_xs

* fix ci

* iq4_nl

* iq1_m

* iq1_s

* iq2_xxs

* iq3_xxs

* iq2_s

* iq2_xs

* iq3_s before sllv

* iq3_s

* iq3_s small fix

* iq3_s sllv can be safely replaced with sse multiply
2024-06-21 08:57:36 +03:00
Georgi Gerganov a927b0f3dd llama : optimize long word tokenization with WPM (#8034)
ggml-ci
2024-06-21 08:51:28 +03:00
Douglas Hanley 80ea089d77 llama : allow pooled embeddings on any model (#7477)
* create append_pooling operation; allow to specify attention_type; add last token pooling; update examples

* find result_norm/result_embd tensors properly; update output allocation logic

* only use embd output for pooling_type NONE

* get rid of old causal_attn accessor

* take out attention_type; add in llama_set_embeddings

* bypass logits when doing non-NONE pooling
2024-06-21 08:38:22 +03:00
Shuichi Tsutsumi 0e64591e82 swiftui : enable stream updating (#7754) 2024-06-21 08:30:58 +03:00
Hamdoud Hakem b1ef562bc1 requirements : Bump torch and numpy for python3.12 (#8041) 2024-06-20 22:01:15 +02:00
Hamdoud Hakem 17b291a6a5 convert-hf : Fix the encoding in the convert-hf-to-gguf-update.py (#8040) 2024-06-20 21:59:59 +02:00
Johannes Gäßler abd894ad96 common: fix warning (#8036)
* common: fix warning

* Update common/common.cpp

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-06-20 16:40:13 +02:00
luoyu-intel de391e4c80 [SYCL] Fix windows build and inference (#8003)
* add sycl preset

* fix debug link error. fix windows crash

* update README
2024-06-20 21:19:05 +08:00
Johannes Gäßler d50f8897a7 CUDA: stream-k decomposition for MMQ (#8018)
* CUDA: stream-k decomposition for MMQ

* fix undefined memory reads for small matrices
2024-06-20 14:39:21 +02:00
Michael de Gans 2075a66a96 metal : fix ggml_metal_supports_op for BF16 (#8021)
Currently the Metal backend does not support BF16. `ggml_metal_supports_op` was returning true in these cases, leading to a crash with models converted with `--leave-output-tensor`. This commit checks if the first few sources types are BF16 and returns false if that's the case.
2024-06-20 08:32:01 +03:00
sasha0552 ba58993152 server : fix smart slot selection (#8020) 2024-06-20 09:57:10 +10:00
Michael de Gans a7854743c5 un-ignore build-info.cmake and build-info.sh (#7996)
* un-ignore `build-info.cmake` and `build-info.sh`

I am assuming that ignoring them was unintentional. If they are ignored, some tools, like cargo, will consider the files inexistent, even if they're comitted, for the purpose of publishing. This leads to the build failing in such cases.

* un-ignore `build-info.cpp.in`

For the same reason as the previous two files.

* Reorganize `.gitignore`

* Add exceptions for files mentioned by @slaren

I did leave .clang-tidy since it was explicitly ignored before.

* Add comments for organization
* Sort some lines for pretty
* Test with `make` and `cmake` builds to ensure no build artifacts might be comitted

* Remove `.clang-tidy` from `.gitignore`

Per comment by @ggerganov

* Remove `IDEWorkspaceChecks.plist` from root-level `.gitignore`
2024-06-19 22:10:42 +02:00
slaren 9c77ec1d74 ggml : synchronize threads using barriers (#7993) 2024-06-19 15:04:15 +02:00
Georgi Gerganov a04a953cab codecov : remove (#8004) 2024-06-19 13:04:36 +03:00
Meng, Hengyu 623494a478 [SYCL] refactor (#6408)
* seperate lower precision GEMM from the main files

* fix workgroup size hardcode
2024-06-19 09:11:51 +08:00
jaime-m-p 37bef89433 tokenizer : BPE fixes (#7530)
* Random test: add_bos_token, add_eos_token
* Random test: add BPE models for testing
* Custom regex split fails with codepoint 0
* Fix falcon punctuation regex
* Refactor llm_tokenizer_bpe: move code to constructor
* Move 'add_special_bos/eos' logic to llm_tokenizer_bpe
* Move tokenizer flags to vocab structure.
* Default values for special_add_bos/eos
* Build vocab.special_tokens_cache using vocab token types
* Generalize 'jina-v2' per token attributes
* Fix unicode whitespaces (deepseek-coder, deepseek-llm)
* Skip missing byte tokens (falcon)
* Better unicode data generation
* Replace char32_t with uint32_t
2024-06-18 18:40:52 +02:00
Sigbjørn Skjæret 91c188d6c2 Only use FIM middle token if it exists (#7648)
* Only use FIM middle if it exists

* Only use FIM middle if it exists
2024-06-18 22:19:45 +10:00
jojorne 84f6de17f6 Fix no gcc pragma on Windows (#7751) 2024-06-18 22:18:32 +10:00
Ulrich Drepper 61665277af Allow compiling with CUDA without CUDA runtime installed (#7989)
On hosts which are not prepared/dedicated to execute code using CUDA
it is still possible to compile llama.cpp with CUDA support by just
installing the development packages.  Missing are the runtime
libraries like /usr/lib64/libcuda.so* and currently the link step
will fail.

The development environment is prepared for such situations.  There
are stub libraries for all the CUDA libraries available in the
$(CUDA_PATH)/lib64/stubs directory.  Adding this directory to the end
of the search path will not change anything for environments which
currently work fine but will enable compiling llama.cpp also in case
the runtime code is not available.
2024-06-18 14:00:14 +02:00
Frank Mai b96f9afb0d chore: clean useless beam search param (#7985)
Signed-off-by: thxCode <thxcode0824@gmail.com>
2024-06-18 10:11:40 +03:00
Abheek Gulati 1193778105 readme : update UI list (#7943) 2024-06-18 09:57:41 +03:00
Georgi Gerganov 5326bcceeb ggml : sync 2024-06-18 09:50:45 +03:00
Georgi Gerganov e6ecc2be47 whisper : use ggml_backend_sched (whisper/2239)
* whisper : use ggml_backend_sched (wip)

* use sched in whisper_allocr

* whisper : single backend in whisper_context

* whisper : remove whisper_state->backends_used

* whisper : remove whisper_context->backend

* whisper : reset scheduler after init

* whisper : fix external encoder (e.g. CoreML)

* whisper : cleanup

* whisper : handle null GPU buffer types + fix sycl

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-06-18 09:50:40 +03:00
Ștefan-Gabriel Muscalu a94e6ff877 update: support Qwen2-57B-A14B (#7835)
* update: convert-hf-to-gguf.py to support Qwen2-57B-A14B

* fix: QWEN2MOE support for expert_feed_forward_length

previously, expert ff was taken from n_ff (intermediate size) but it is now properly taken from LLM_KV_EXPERT_FEED_FORWARD_LENGTH

n_ff_exp and n_ff_shared_exp are now properly calculated

* update: convert-hf-to-gguf.py cleanup for Qwen2MoeForCausalLM

* fix: QWEN2MOE support for expert_feed_forward_length

previously, expert ff was taken from n_ff (intermediate size) but it is now properly taken from LLM_KV_EXPERT_FEED_FORWARD_LENGTH

n_ff_exp and n_ff_shexp are now properly calculated
2024-06-17 21:08:46 +02:00
Srihari-mcw 5b6da18750 Make updates to type cast based on compiler instead of OS (#7851) 2024-06-17 20:23:17 +02:00
Georgi Gerganov 7c26775adb llama : disable FA if KV head size do not match (#7982) 2024-06-17 19:40:01 +03:00
Bryan Honof b473e95084 Add Nix and Flox install instructions (#7899) 2024-06-17 09:37:55 -06:00
slaren 99052cd227 sched : offload_op also requires supports_op (#7977) 2024-06-17 16:51:42 +02:00
Frank Mai c637fcd34d fix: divide 0 exception in mamba (#7932)
Signed-off-by: thxCode <thxcode0824@gmail.com>
2024-06-17 16:11:08 +02:00
Markus Tavenrath 6a2f0b3474 Implement non-mapped async IO for CUDA on Windows. (#7896)
* Implement non-mapped async IO for CUDA on Windows. On a fast Gen5 NVMe drive this change improves model load time by >3x while it should be the same (or slightly faster) on any other drive.

* Free resources except for backend.

* Change assertions to exceptions in llama_file, find correct cuda backend to create CUDA resources and respect the use_mmap flag again for CUDA.

* Apply suggestions from code review

Co-authored-by: slaren <slarengh@gmail.com>

* Fix editorconfig and unused variable

* Fix issues with Windows build

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-06-17 16:10:15 +02:00
Georgi Gerganov 21be9cab94 rpc : fix load/store misaligned addresses (#7948) 2024-06-17 11:09:20 +03:00
Brian 006167aaf6 gguf-dump.py: add --markdown dump output (#7853)
* gguf-dump.py: add --markdown dump output

* gguf-dump.py: Add toc

* gguf-dump.py: use standard tensor name lookup. Also add tensor ID field

* gguf-dump.py: Add tensor overview count

* gguf-dump.py: fix array preview

* gguf-dump.py: markdownTableWithAlignmentSupport() added

* Add type hints and spacing

Co-authored-by: compilade <git@compilade.net>

* gguf-dump.py: prettyfy dimention

* gguf-dump: right align element count

* gguf-dump.py: element count autosizing

* Apply suggestions from code review

Co-authored-by: compilade <git@compilade.net>

---------

Co-authored-by: compilade <git@compilade.net>
2024-06-17 15:25:20 +10:00
Neo Zhang df68d4fa5d [SYCL] Update README-sycl.md for Chapter "Recommended release" and "News" (#7946)
* Update README-sycl.md

* Update README-sycl.md

* Update README-sycl.md

* Update README-sycl.md
2024-06-17 11:17:07 +08:00
Calvin Laurenson 43b35e38ba Add support for sqrt on CUDA (#7953)
* cuda sqrt support

* enable cuda in pca

* fix comments in pca

* add test

* add sqrt to ggml_backend_cuda_supports_op

* fix test

* new line

* Use F32 sqrtf instead of F64 sqrt

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-06-17 00:23:04 +02:00
Georgi Gerganov 19b7a836f6 cuda : fix bounds check for src0 rows in MMVQ kernel (whisper/2231)
* cuda : fix bounds check for src0 rows in MMVQ kernel

* Update ggml-cuda/mmvq.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-06-16 20:32:49 +03:00
Hong Bo PENG b5fcf8ef5c ggml : fix and optimize ppc64le (ggml/849)
* fix compile issues introduced by loongarch_asx

* restore quant changes to merge

* fix compile issues introduced by loongarch_asx

* further optimize by using vec_msum & vec_sum4s on ppc64le
2024-06-16 20:32:49 +03:00
Daniel Bevenius 398105ff43 ggml : remove duplicate include of ggml-common.h (ggml/853)
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-06-16 20:32:49 +03:00
Georgi Gerganov bc6c457fa3 flake.lock: Update (#7951) 2024-06-16 09:16:21 -07:00
Georgi Gerganov 52399254b3 unicode : avoid char32_t (#7957)
ggml-ci
2024-06-16 14:51:40 +03:00
hopkins385 6fe1c62741 readme : update UI list [no ci] (#7958) 2024-06-16 14:51:18 +03:00
Georgi Gerganov cddaf028ad ggml : fix handling of zero blocks in IQ quants (#7955)
ggml-ci
2024-06-16 14:50:12 +03:00
Georgi Gerganov c8a82194a8 github : update pr template 2024-06-16 10:46:51 +03:00
0cc4m 7c7836d9d4 Vulkan Shader Refactor, Memory Debugging Option (#7947)
* Refactor shaders, extract GLSL code from ggml_vk_generate_shaders.py into vulkan-shaders directory

* Improve debug log code

* Add memory debug output option

* Fix flake8

* Fix unnecessary high llama-3 VRAM use
2024-06-16 07:17:31 +02:00
Xuan Son Nguyen 0c7b3595b9 Add cvector-generator example (#7514)
* add control-vector-generator

* calc diff

* add comments

* proof-of-concept stdlib implementation

Implements PCA and file writing using mostly standard libraries. The output is recognized as a functional control vector, but outputs gibberish.

* param parsing, refactor, comments

Added basic command-line parameters for outfile and one each positive/negative prompt.

Refactored some messy code in PCA computation and GGUF exporting.

Left a bunch of comments regarding further work needed.

* example template completions

Implements an example template set built from the positive/negative prompts like the control vector Python implementation.

* add multi prompts, multi-thread for PCA

* fix mem error

* add debugs

* fix matrix transpose multiplication

you have got to be kidding me

* preliminary template/multiprompt support

model is running out of context and that ought to be fixed (segfaulting) but other than that it looks goodish

* fix zero output & param parsing, functional templating

fixed a bug where the output file had no tensor data/was all zero

fixed a bug where single hyphen flags were not being correctly parsed

implements creation of templated prompts from input (still need to adapt based on model)

* fix square_diff matmul index range and CRLF->LF line endings

fixed a logic error where square_diff would not multiply all rows

fixed a formatting error where the provided completions.txt had CRLF line endings

* add command-line args for num threads, num completions file lines, always reload model

refactored a few things and did what the commit message says on the tin

* code aestheticization

* fix compiler warnings

* in-series multithreading for prompt embedding?

added commented-out code to attempt to start implementing mutlithreading for embedding in main

* remove unnecessary multithreading

* interim fix memory leak

* translated everything but PCA (I think)

* tentatively translate the rest

* fix ggml errors and make new ones

at least it compiles and runs

* fix cb_eval

* temporary commit while I move dev environments

it finally outputs a functioning control vector - "functioning" in the sense that it can be loaded and it clearly has the right idea, but makes the model incoherent

* update debug statements

* pre-tokenize so we can allocate correct memory to ctx_diffs_wrapped

* update comments

* (wip) refactor

* clean up PCA ggml implementation

* fix shape of v_diff_original

* add n_batch for pca

* working version

* remember to copy back the last_eigenvector

* fix n_completions

* bring back n_completions

* default n_pca_batch to 20

* fix macos build

* add to makefile all targets

* use ggml_format_name

* add readme

* fix .editorconfig

* use ggml_backend_tensor_copy

* attemp to fix compile problem on mac

* fix compile warn

* reuse allocr

* move param parser to common

* better error handling

* clean up a bit

* add print_usage

* shorten help msg

* beautify help msg

* escape prompt by default

* change compile target to llama-cvector-generator

* typo

* disable GPU for PCA

* code style

---------

Co-authored-by: Christian Zhou-Zheng <christianzhouzheng@gmail.com>
2024-06-15 18:53:40 +02:00
Meng, Hengyu 7b2f4a7d19 [SYCL] remove global variables (#7710)
* separate DPCT helpers outside

* replace global variables with context

* remove useless extra

* update mul_mat condition

* remove duplicate buft initialization

* remove duplicate extra and global work group size

* remove useless backend check

* remove duplicated extras

* use macro for group_size and remove cuda-related
2024-06-15 14:05:10 +08:00
olexiyb f8ec8877b7 ci : fix macos x86 build (#7940)
In order to use old `macos-latest` we should use `macos-12`

Potentially will fix: https://github.com/ggerganov/llama.cpp/issues/6975
2024-06-14 20:28:34 +03:00
Johannes Gäßler 76d66ee0be CUDA: faster q2_K, q3_K MMQ + int8 tensor cores (#7921)
* CUDA: faster q2_K, q3_K MMQ + int8 tensor cores

* try CI fix

* try CI fix

* try CI fix

* fix data race

* rever q2_K precision related changes
2024-06-14 18:41:49 +02:00
Georgi Gerganov 66ef1ceedf metal : utilize max shared memory for mul_mat_id (#7935) 2024-06-14 17:14:09 +03:00
Radoslav Gerganov e65bbf606c llama-bench : fix RPC indication (#7936)
Show "<backend_name>+RPC" when RPC offloading is used
2024-06-14 16:47:41 +03:00
Sigbjørn Skjæret 6fcd1331ef llama : more checks before assuming FIM tokens (#7644)
* More checks before assuming FIM tokens for Llama arch

* extensive token check
2024-06-14 13:20:04 +03:00
Elaine 41b9260f18 convert : add Poro-34B-chat tokenizer support (#7713)
* support for Poro chat pre-tokenizer

* add support for Poro pre-tokenizer

* Update convert-hf-to-gguf-update.py

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Change Poro-34B-chat to poro-chat

* Change Poro-34B-chat to poro-chat

* Update convert-hf-to-gguf-update.py

* Update llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-14 13:16:49 +03:00
Radoslav Gerganov 172c825684 rpc : fix ggml_backend_rpc_supports_buft() (#7918) 2024-06-13 15:18:44 +03:00
Galunid a55eb1bf0f readme : Remove outdated instructions from README.md (#7914) [no ci] 2024-06-13 09:42:41 +02:00
slaren f578b86b21 move BLAS to a separate backend (#6210)
* move BLAS to a separate backend

* rename GGML_USE_OPENBLAS to GGML_USE_BLAS

* alloc : reuse same buffer when the same buffer type if used multiple times

* set number of threads automatically for openblas and blis

* sched : print assignments when GGML_SCHED_DEBUG env variable is set

* sched : allow ops with weights on an incompatible buffer type

This will cause the weight to be copied to a backend that supports the
op, which is very costly. The weight should have been stored in a buffer
of a backend that can run the op, but llama.cpp cannot do this
automatically at the moment.

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-13 03:11:35 +02:00
Olivier Chafik 1c641e6aac build: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809)
* `main`/`server`: rename to `llama` / `llama-server` for consistency w/ homebrew

* server: update refs -> llama-server

gitignore llama-server

* server: simplify nix package

* main: update refs -> llama

fix examples/main ref

* main/server: fix targets

* update more names

* Update build.yml

* rm accidentally checked in bins

* update straggling refs

* Update .gitignore

* Update server-llm.sh

* main: target name -> llama-cli

* Prefix all example bins w/ llama-

* fix main refs

* rename {main->llama}-cmake-pkg binary

* prefix more cmake targets w/ llama-

* add/fix gbnf-validator subfolder to cmake

* sort cmake example subdirs

* rm bin files

* fix llama-lookup-* Makefile rules

* gitignore /llama-*

* rename Dockerfiles

* rename llama|main -> llama-cli; consistent RPM bin prefixes

* fix some missing -cli suffixes

* rename dockerfile w/ llama-cli

* rename(make): llama-baby-llama

* update dockerfile refs

* more llama-cli(.exe)

* fix test-eval-callback

* rename: llama-cli-cmake-pkg(.exe)

* address gbnf-validator unused fread warning (switched to C++ / ifstream)

* add two missing llama- prefixes

* Updating docs for eval-callback binary to use new `llama-` prefix.

* Updating a few lingering doc references for rename of main to llama-cli

* Updating `run-with-preset.py` to use new binary names.
Updating docs around `perplexity` binary rename.

* Updating documentation references for lookup-merge and export-lora

* Updating two small `main` references missed earlier in the finetune docs.

* Update apps.nix

* update grammar/README.md w/ new llama-* names

* update llama-rpc-server bin name + doc

* Revert "update llama-rpc-server bin name + doc"

This reverts commit e474ef1df4.

* add hot topic notice to README.md

* Update README.md

* Update README.md

* rename gguf-split & quantize bins refs in **/tests.sh

---------

Co-authored-by: HanClinto <hanclinto@gmail.com>
2024-06-13 00:41:52 +01:00
Johannes Gäßler 963552903f CUDA: fix broken oob check for FA vec f32 kernel (#7904) 2024-06-12 17:41:51 +02:00
Georgi Gerganov a9cae48003 tests : add non-cont unary tests (#7857)
* tests : add non-cont unary tests

* ggml : update unary asserts and "supports_op"

ggml-ci
2024-06-12 16:00:22 +03:00
Georgi Gerganov bfaa676b08 ggml : improve ggml_is_contiguous logic (#7856)
* ggml : improve ggml_is_contiguous logic

ggml-ci

* ggml : support more contiguous cases

ggml-ci
2024-06-12 15:24:20 +03:00
Georgi Gerganov 704a35b183 server : restore numeric prompts (#7883) 2024-06-12 14:42:29 +03:00
Meng, Hengyu dcf752707d update intel docker oneapi-basekit to 2024.1.1-devel-ubuntu22.04 (#7894)
In addition this reverts a workaround we had to do to workaround the upstream issue with expired intel GPG package keys in 2024.0.1-devel-ubuntu22.04
2024-06-12 19:05:35 +10:00
Patrice Ferlet f2b5764beb Fix a typo and add Fedora 40 pacakge to install for Vulkan (#7794) [no ci]
Fix "appropiate" to "appropriate" and add Fedora 40 packages to install to compile with Vulkan support
2024-06-12 11:18:16 +10:00
k.h.lai 73bac2b11d vulkan: select only one device for single gpu with multiple drivers (#7582) 2024-06-11 21:26:05 +02:00
0cc4m ef52d1d16a Update Vulkan RoPE implementation (#7818)
* Update Vulkan RoPE implementation

* Return nullptr on alloc_buffer when allocation fails, instead of throwing an exception

Minor fixes

* Fix segfault when running out of VRAM

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-06-11 21:20:29 +02:00
Deven Mistry 14f83526cd fix broken link in pr template (#7880) [no ci]
* fix broken link in pr template

* Update pull_request_template.md [no ci]

---------

Co-authored-by: Brian <mofosyne@gmail.com>
2024-06-12 02:18:58 +10:00
Brian 6fe42d073f github: move PR template to .github/ root (#7868) 2024-06-11 17:43:41 +03:00
Johannes Gäßler 148995e5e5 llama-bench: more compact markdown tables (#7879) 2024-06-11 14:45:40 +02:00
Georgi Gerganov 4bfe50f741 tests : check the Python version (#7872)
ggml-ci
2024-06-11 10:10:20 +03:00
Johannes Gäßler bdcb8f4222 CUDA: int8 tensor cores for MMQ (q4_K, q5_K, q6_K) (#7860) 2024-06-11 08:26:07 +02:00
slaren c2ce6c47e4 fix CUDA CI by using a windows-2019 image (#7861)
* try to fix CUDA ci with --allow-unsupported-compiler

* trigger when build.yml changes

* another test

* try exllama/bdashore3 method

* install vs build tools before cuda toolkit

* try win-2019
2024-06-11 08:59:20 +03:00
Olivier Chafik b61eb9644d json: refine constraint for whitespace to avoid runaways yet allow pretty print (#7866) 2024-06-11 02:22:57 +01:00
Olivier Chafik 396b18dfec json: document schema conversion in GBNF readme, align manual grammar examples & converters (#7841)
* json: fix char pattern in grammar converters

* json: prevent number precision & whitespace runaways in example grammars

* json: add doc to grammar readme
2024-06-11 01:00:30 +01:00
Jared Van Bortel 864a99e7a0 cmake : fix CMake requirement for CUDA (#7821) 2024-06-10 18:32:10 -04:00
slaren fd5ea0f897 ci : try win-2019 on server windows test (#7854) 2024-06-10 15:18:41 +03:00
Georgi Gerganov c28a83902c examples : remove --instruct remnants (#7846) 2024-06-10 15:00:15 +03:00
Georgi Gerganov d9da0e4986 server : improve "prompt" handling (#7847) 2024-06-10 14:59:55 +03:00
Johannes Gäßler 1f0dabda8d CUDA: use tensor cores for MMQ (#7676)
* CUDA: int8 tensor cores for MMQ (legacy quants)

* fix out-of-bounds writes

* __builtin_assume -> GGML_CUDA_ASSUME

* fix writeback returning too early
2024-06-10 11:45:13 +02:00
Ben Ashbaugh af4ae502dd use the correct SYCL context for host USM allocations (#7777)
Signed-off-by: Ben Ashbaugh <ben.ashbaugh@intel.com>
2024-06-10 10:21:31 +01:00
Georgi Gerganov 10ceba354a flake.lock: Update (#7838)
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/ad57eef4ef0659193044870c731987a6df5cf56b?narHash=sha256-SzDKxseEcHR5KzPXLwsemyTR/kaM9whxeiJohbL04rs%3D' (2024-05-29)
  → 'github:NixOS/nixpkgs/051f920625ab5aabe37c920346e3e69d7d34400e?narHash=sha256-4q0s6m0GUcN7q%2BY2DqD27iLvbcd1G50T2lv08kKxkSI%3D' (2024-06-07)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-06-09 16:04:50 -07:00
Georgi Gerganov e95beeb1fc imatrix : handle partial entries (#7833) 2024-06-09 20:19:35 +03:00
Nicolás Pérez 57bf62ce7c docs: Added initial PR template with directions for doc only changes and squash merges [no ci] (#7700)
This commit adds pull_request_template.md and CONTRIBUTING.md . It focuses on explaining to contributors the need to rate PR complexity level, when to add [no ci] and how to format PR title and descriptions.

Co-authored-by: Brian <mofosyne@gmail.com>
Co-authored-by: compilade <git@compilade.net>
2024-06-10 01:24:29 +10:00
mgroeber9110 3e2ee44315 server: do not remove whitespace at the start of a completion chunk (#7830) 2024-06-09 20:50:35 +10:00
Johannes Gäßler 42b53d192f CUDA: revise q8_1 data layout for mul_mat_q (#7824) 2024-06-09 09:42:25 +02:00
sasha0552 2decf57bc6 convert-hf : set the model name based on cli arg, if present (#7693)
`--model-name` argument was added a while ago but did not do anything.
This commit fixes this issue and enables this feature.
2024-06-09 16:39:25 +10:00
compilade 5795b94182 convert-hf : match model part name prefix and suffix (#7687)
In #7075, to fix the conversion of (some) models using model-00001-of-00001.safetensors instead of model.safetensors for a single model part we simply used the same logic as the part count to get the part names. 

But this doesn't always work correctly, like when unusual additional model files like consolidated.safetensors in https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3 are present.

This commit matching both the prefix and the suffix of the model part names should fix this problem without breaking any previously-supported upstream models. But according to report by @teleprint-me there is still some
persistent problem, but shall do in the meantime.
2024-06-09 12:47:25 +10:00
compilade ed9f252118 gguf-py : decouple adding metadata from writing in GGUFWriter (#7827)
Main changes of this PR is to consolidate GGUFWriter.add_key and GGUFWriter.add_val into GGUFWriter.add_key_value. 

In addition use_temp_file is now opt-in instead of opt-out defaulting to False.

Also GGUFWriter now does not require output file name until when actually writing to it.

And GGUFWriter doesn't really need to eagerly prepare the data layout of the metadata
2024-06-09 12:34:29 +10:00
slaren fe1e3917cf Revert "[SYCL] Update rpc-server.cpp to include SYCL backend (#7682)" (#7808)
This reverts commit 9422c5e34b.
2024-06-09 01:43:39 +02:00
Olivier Chafik d4d915d351 url: save -mu downloads to new cache location (#7826)
* url: save -mu download to new cache location

* url: fs_get_cache_file_path util

* url: tweak sig of fs_get_cache_file
2024-06-08 21:21:08 +02:00
sasha0552 7a16ce7db2 server : smart slot selection using Longest Common Prefix (#7728)
* server : Smart selection of available slot using Longest Common Substring

* add usage

* remove trailing whitespaces

* Use Longest Common Prefix (LCP) instead of LCS

* Rename argument
2024-06-08 10:50:31 +03:00
slaren da799b4189 vulkan : reuse parent extra for views (#7806)
* vulkan : reuse parent extra for views

* Fix validation error when multiple compute contexts are used in a graph

---------

Co-authored-by: 0cc4m <picard12@live.de>
2024-06-07 19:47:49 +02:00
Christian Zhou-Zheng c00fad71e5 gguf-split : change binary multi-byte units to decimal (#7803) 2024-06-07 15:56:01 +03:00
intelmatt 27615f5ab2 cmake : fix BUILD_SHARED_LIBS=ON build (#7784)
common depends on pthreads in Linux
2024-06-07 15:15:07 +03:00
Johannes Gäßler 7027b27d76 server: update cache_prompt documentation [no ci] (#7745) 2024-06-07 11:15:49 +02:00
woodx a5cabd7649 server : do not get prompt in infill mode (#7286)
* avoid to get prompt in infill mode and embedding mode

* remove embedding mode

* refactor format

---------

Co-authored-by: wudexiang <wudexiang@bytedance.com>
2024-06-07 10:09:45 +03:00
pengxin99 d5c938cd77 [SYCL] fix softmax r2r result wrong issue (#7811) 2024-06-07 14:28:26 +08:00
slaren c9ee7118d5 check for nans in imatrix and quantize (#7807)
* imatrix : detect nan/inf values

* quantize : check imatrix for nan/inf values
2024-06-07 09:01:29 +03:00
Georgi Gerganov ee459f40f6 server : fix --threads-http arg (#7801) 2024-06-06 19:19:59 +03:00
1454 changed files with 527162 additions and 310120 deletions
+164
View File
@@ -0,0 +1,164 @@
---
Language: Cpp
AlignAfterOpenBracket: Align
AlignArrayOfStructures: Left
AlignConsecutiveAssignments: AcrossComments
AlignConsecutiveBitFields: AcrossComments
AlignConsecutiveDeclarations: AcrossComments
AlignConsecutiveMacros: AcrossComments
# AlignConsecutiveShortCaseStatements: AcrossComments
AlignEscapedNewlines: Left # LeftWithLastLine
AlignOperands: Align
AlignTrailingComments:
Kind: Always
OverEmptyLines: 1
AllowAllArgumentsOnNextLine: true
AllowAllParametersOfDeclarationOnNextLine: false
# AllowBreakBeforeNoexceptSpecifier: OnlyWithParen
AllowShortBlocksOnASingleLine: Never
AllowShortCaseLabelsOnASingleLine: false
AllowShortFunctionsOnASingleLine: Inline
AllowShortIfStatementsOnASingleLine: Never
AllowShortLambdasOnASingleLine: Inline
AllowShortLoopsOnASingleLine: false
AlwaysBreakBeforeMultilineStrings: true
BinPackArguments: false
BinPackParameters: false # OnePerLine
BitFieldColonSpacing: Both
BreakBeforeBraces: Custom # Attach
BraceWrapping:
AfterCaseLabel: true
AfterClass: false
AfterControlStatement: false
AfterEnum: false
AfterFunction: false
AfterNamespace: false
AfterObjCDeclaration: false
AfterStruct: false
AfterUnion: false
AfterExternBlock: false
BeforeCatch: false
BeforeElse: false
BeforeLambdaBody: false
BeforeWhile: false
IndentBraces: false
SplitEmptyFunction: false
SplitEmptyRecord: false
SplitEmptyNamespace: false
# BreakAdjacentStringLiterals: true
BreakAfterAttributes: Never
BreakBeforeBinaryOperators: None
BreakBeforeInlineASMColon: OnlyMultiline
BreakBeforeTernaryOperators: false
# BreakBinaryOperations: Never
BreakConstructorInitializers: AfterColon
# BreakFunctionDefinitionParameters: false
BreakInheritanceList: AfterComma
BreakStringLiterals: true
# BreakTemplateDeclarations: Yes
ColumnLimit: 120
CommentPragmas: '^ IWYU pragma:'
CompactNamespaces: false
ConstructorInitializerIndentWidth: 4
ContinuationIndentWidth: 4
Cpp11BracedListStyle: false
DerivePointerAlignment: false
DisableFormat: false
EmptyLineBeforeAccessModifier: Leave
EmptyLineAfterAccessModifier: Never
ExperimentalAutoDetectBinPacking: false
FixNamespaceComments: true
IncludeBlocks: Regroup
IncludeCategories:
- Regex: '".*"'
Priority: 1
SortPriority: 0
- Regex: '^<.*\.h>'
Priority: 2
SortPriority: 0
- Regex: '^<.*'
Priority: 3
SortPriority: 0
- Regex: '.*'
Priority: 4
SortPriority: 0
IncludeIsMainRegex: '([-_](test|unittest))?$'
IncludeIsMainSourceRegex: ''
IndentAccessModifiers: false
IndentCaseBlocks: true
IndentCaseLabels: true
IndentExternBlock: NoIndent
IndentGotoLabels: false
IndentPPDirectives: AfterHash
IndentWidth: 4
IndentWrappedFunctionNames: false
InsertBraces: true # NOTE: may lead to incorrect formatting
InsertNewlineAtEOF: true
JavaScriptQuotes: Leave
JavaScriptWrapImports: true
KeepEmptyLinesAtTheStartOfBlocks: false
LambdaBodyIndentation: Signature
LineEnding: LF
MacroBlockBegin: ''
MacroBlockEnd: ''
MaxEmptyLinesToKeep: 1
NamespaceIndentation: None
ObjCBinPackProtocolList: Auto
ObjCBlockIndentWidth: 4
ObjCSpaceAfterProperty: true
ObjCSpaceBeforeProtocolList: true
PPIndentWidth: -1
PackConstructorInitializers: CurrentLine
PenaltyBreakAssignment: 2
PenaltyBreakBeforeFirstCallParameter: 1
PenaltyBreakComment: 300
PenaltyBreakFirstLessLess: 120
PenaltyBreakString: 1000
PenaltyBreakTemplateDeclaration: 10
PenaltyExcessCharacter: 1000000
PenaltyReturnTypeOnItsOwnLine: 200
PointerAlignment: Middle
QualifierAlignment: Left
#QualifierOrder: ['static', 'inline', 'friend', 'constexpr', 'const', 'volatile', 'type', 'restrict']
RawStringFormats:
- Language: Cpp
Delimiters:
- cc
- CC
- cpp
- Cpp
- CPP
- 'c++'
- 'C++'
CanonicalDelimiter: ''
ReferenceAlignment: Middle
ReflowComments: false # IndentOnly
SeparateDefinitionBlocks: Always
SortIncludes: CaseInsensitive
SortUsingDeclarations: LexicographicNumeric
SpaceAfterCStyleCast: true
SpaceAfterLogicalNot: false
SpaceAfterTemplateKeyword: true
SpaceBeforeAssignmentOperators: true
SpaceBeforeCpp11BracedList: false
SpaceBeforeCtorInitializerColon: true
SpaceBeforeInheritanceColon: true
SpaceBeforeParens: ControlStatements
SpaceBeforeRangeBasedForLoopColon: true
SpaceInEmptyBlock: false
SpaceInEmptyParentheses: false
SpacesBeforeTrailingComments: 2
SpacesInAngles: Never
SpacesInContainerLiterals: true
SpacesInLineCommentPrefix:
Minimum: 1
Maximum: -1
SpacesInParentheses: false
SpacesInSquareBrackets: false
SpaceBeforeSquareBrackets: false
Standard: c++17
TabWidth: 4
UseTab: Never
WhitespaceSensitiveMacros: ['STRINGIZE']
...
+3
View File
@@ -13,12 +13,15 @@ Checks: >
-readability-magic-numbers,
-readability-uppercase-literal-suffix,
-readability-simplify-boolean-expr,
-readability-math-missing-parentheses,
clang-analyzer-*,
-clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling,
performance-*,
portability-*,
-portability-simd-intrinsics,
misc-*,
-misc-const-correctness,
-misc-non-private-member-variables-in-classes,
-misc-no-recursion,
-misc-use-anonymous-namespace,
FormatStyle: none
+1 -1
View File
@@ -15,7 +15,7 @@ node('x86_runner1'){ // Running on x86 runner containing latest vecto
stage('Running llama.cpp'){
sh'''#!/bin/bash
module load gnu-bin2/0.1 # loading latest versions of vector qemu and vector gcc
qemu-riscv64 -L /softwares/gnu-bin2/sysroot -cpu rv64,v=true,vlen=256,elen=64,vext_spec=v1.0 ./main -m /home/alitariq/codellama-7b.Q4_K_M.gguf -p "Anything" -n 9 > llama_log.txt # Running llama.cpp on vector qemu-riscv64
qemu-riscv64 -L /softwares/gnu-bin2/sysroot -cpu rv64,v=true,vlen=256,elen=64,vext_spec=v1.0 ./llama-cli -m /home/alitariq/codellama-7b.Q4_K_M.gguf -p "Anything" -n 9 > llama_log.txt # Running llama.cpp on vector qemu-riscv64
cat llama_log.txt # Printing results
'''
}
+92
View File
@@ -0,0 +1,92 @@
ARG UBUNTU_VERSION=22.04
FROM ubuntu:$UBUNTU_VERSION AS build
ARG TARGETARCH
ARG GGML_CPU_ARM_ARCH=armv8-a
RUN apt-get update && \
apt-get install -y build-essential git cmake libcurl4-openssl-dev
WORKDIR /app
COPY . .
RUN if [ "$TARGETARCH" = "amd64" ]; then \
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD_TESTS=OFF -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON; \
elif [ "$TARGETARCH" = "arm64" ]; then \
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD_TESTS=OFF -DGGML_CPU_ARM_ARCH=${GGML_CPU_ARM_ARCH}; \
else \
echo "Unsupported architecture"; \
exit 1; \
fi && \
cmake --build build -j $(nproc)
RUN mkdir -p /app/lib && \
find build -name "*.so" -exec cp {} /app/lib \;
RUN mkdir -p /app/full \
&& cp build/bin/* /app/full \
&& cp *.py /app/full \
&& cp -r gguf-py /app/full \
&& cp -r requirements /app/full \
&& cp requirements.txt /app/full \
&& cp .devops/tools.sh /app/full/tools.sh
## Base image
FROM ubuntu:$UBUNTU_VERSION AS base
RUN apt-get update \
&& apt-get install -y libgomp1 curl\
&& apt autoremove -y \
&& apt clean -y \
&& rm -rf /tmp/* /var/tmp/* \
&& find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
&& find /var/cache -type f -delete
COPY --from=build /app/lib/ /app
### Full
FROM base AS full
COPY --from=build /app/full /app
WORKDIR /app
RUN apt-get update \
&& apt-get install -y \
git \
python3 \
python3-pip \
&& pip install --upgrade pip setuptools wheel \
&& pip install -r requirements.txt \
&& apt autoremove -y \
&& apt clean -y \
&& rm -rf /tmp/* /var/tmp/* \
&& find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
&& find /var/cache -type f -delete
ENTRYPOINT ["/app/tools.sh"]
### Light, CLI only
FROM base AS light
COPY --from=build /app/full/llama-cli /app
WORKDIR /app
ENTRYPOINT [ "/app/llama-cli" ]
### Server, Server only
FROM base AS server
ENV LLAMA_ARG_HOST=0.0.0.0
COPY --from=build /app/full/llama-server /app
WORKDIR /app
HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/health" ]
ENTRYPOINT [ "/app/llama-server" ]
+94
View File
@@ -0,0 +1,94 @@
ARG UBUNTU_VERSION=22.04
# This needs to generally match the container host's environment.
ARG CUDA_VERSION=12.4.0
# Target the CUDA build image
ARG BASE_CUDA_DEV_CONTAINER=nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}
ARG BASE_CUDA_RUN_CONTAINER=nvidia/cuda:${CUDA_VERSION}-runtime-ubuntu${UBUNTU_VERSION}
FROM ${BASE_CUDA_DEV_CONTAINER} AS build
# CUDA architecture to build for (defaults to all supported archs)
ARG CUDA_DOCKER_ARCH=default
RUN apt-get update && \
apt-get install -y build-essential cmake python3 python3-pip git libcurl4-openssl-dev libgomp1
WORKDIR /app
COPY . .
RUN if [ "${CUDA_DOCKER_ARCH}" != "default" ]; then \
export CMAKE_ARGS="-DCMAKE_CUDA_ARCHITECTURES=${CUDA_DOCKER_ARCH}"; \
fi && \
cmake -B build -DGGML_NATIVE=OFF -DGGML_CUDA=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DLLAMA_BUILD_TESTS=OFF ${CMAKE_ARGS} -DCMAKE_EXE_LINKER_FLAGS=-Wl,--allow-shlib-undefined . && \
cmake --build build --config Release -j$(nproc)
RUN mkdir -p /app/lib && \
find build -name "*.so" -exec cp {} /app/lib \;
RUN mkdir -p /app/full \
&& cp build/bin/* /app/full \
&& cp *.py /app/full \
&& cp -r gguf-py /app/full \
&& cp -r requirements /app/full \
&& cp requirements.txt /app/full \
&& cp .devops/tools.sh /app/full/tools.sh
## Base image
FROM ${BASE_CUDA_RUN_CONTAINER} AS base
RUN apt-get update \
&& apt-get install -y libgomp1 curl\
&& apt autoremove -y \
&& apt clean -y \
&& rm -rf /tmp/* /var/tmp/* \
&& find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
&& find /var/cache -type f -delete
COPY --from=build /app/lib/ /app
### Full
FROM base AS full
COPY --from=build /app/full /app
WORKDIR /app
RUN apt-get update \
&& apt-get install -y \
git \
python3 \
python3-pip \
&& pip install --upgrade pip setuptools wheel \
&& pip install -r requirements.txt \
&& apt autoremove -y \
&& apt clean -y \
&& rm -rf /tmp/* /var/tmp/* \
&& find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
&& find /var/cache -type f -delete
ENTRYPOINT ["/app/tools.sh"]
### Light, CLI only
FROM base AS light
COPY --from=build /app/full/llama-cli /app
WORKDIR /app
ENTRYPOINT [ "/app/llama-cli" ]
### Server, Server only
FROM base AS server
ENV LLAMA_ARG_HOST=0.0.0.0
COPY --from=build /app/full/llama-server /app
WORKDIR /app
HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/health" ]
ENTRYPOINT [ "/app/llama-server" ]
-36
View File
@@ -1,36 +0,0 @@
ARG UBUNTU_VERSION=22.04
# This needs to generally match the container host's environment.
ARG CUDA_VERSION=11.7.1
# Target the CUDA build image
ARG BASE_CUDA_DEV_CONTAINER=nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}
FROM ${BASE_CUDA_DEV_CONTAINER} as build
# Unless otherwise specified, we make a fat build.
ARG CUDA_DOCKER_ARCH=all
RUN apt-get update && \
apt-get install -y build-essential python3 python3-pip git libcurl4-openssl-dev libgomp1
COPY requirements.txt requirements.txt
COPY requirements requirements
RUN pip install --upgrade pip setuptools wheel \
&& pip install -r requirements.txt
WORKDIR /app
COPY . .
# Set nvcc architecture
ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
# Enable CUDA
ENV LLAMA_CUDA=1
# Enable cURL
ENV LLAMA_CURL=1
RUN make -j$(nproc)
ENTRYPOINT ["/app/.devops/tools.sh"]
-50
View File
@@ -1,50 +0,0 @@
ARG UBUNTU_VERSION=22.04
# This needs to generally match the container host's environment.
ARG ROCM_VERSION=5.6
# Target the CUDA build image
ARG BASE_ROCM_DEV_CONTAINER=rocm/dev-ubuntu-${UBUNTU_VERSION}:${ROCM_VERSION}-complete
FROM ${BASE_ROCM_DEV_CONTAINER} as build
# Unless otherwise specified, we make a fat build.
# List from https://github.com/ggerganov/llama.cpp/pull/1087#issuecomment-1682807878
# This is mostly tied to rocBLAS supported archs.
ARG ROCM_DOCKER_ARCH=\
gfx803 \
gfx900 \
gfx906 \
gfx908 \
gfx90a \
gfx1010 \
gfx1030 \
gfx1100 \
gfx1101 \
gfx1102
COPY requirements.txt requirements.txt
COPY requirements requirements
RUN pip install --upgrade pip setuptools wheel \
&& pip install -r requirements.txt
WORKDIR /app
COPY . .
# Set nvcc architecture
ENV GPU_TARGETS=${ROCM_DOCKER_ARCH}
# Enable ROCm
ENV LLAMA_HIPBLAS=1
ENV CC=/opt/rocm/llvm/bin/clang
ENV CXX=/opt/rocm/llvm/bin/clang++
# Enable cURL
ENV LLAMA_CURL=1
RUN apt-get update && \
apt-get install -y libcurl4-openssl-dev
RUN make -j$(nproc)
ENTRYPOINT ["/app/.devops/tools.sh"]
-25
View File
@@ -1,25 +0,0 @@
ARG UBUNTU_VERSION=22.04
FROM ubuntu:$UBUNTU_VERSION as build
RUN apt-get update && \
apt-get install -y build-essential python3 python3-pip git libcurl4-openssl-dev libgomp1
COPY requirements.txt requirements.txt
COPY requirements requirements
RUN pip install --upgrade pip setuptools wheel \
&& pip install -r requirements.txt
WORKDIR /app
COPY . .
ENV LLAMA_CURL=1
RUN make -j$(nproc)
ENV LC_ALL=C.utf8
ENTRYPOINT ["/app/.devops/tools.sh"]
+95
View File
@@ -0,0 +1,95 @@
ARG ONEAPI_VERSION=2025.1.1-0-devel-ubuntu24.04
## Build Image
FROM intel/oneapi-basekit:$ONEAPI_VERSION AS build
ARG GGML_SYCL_F16=OFF
RUN apt-get update && \
apt-get install -y git libcurl4-openssl-dev
WORKDIR /app
COPY . .
RUN if [ "${GGML_SYCL_F16}" = "ON" ]; then \
echo "GGML_SYCL_F16 is set" \
&& export OPT_SYCL_F16="-DGGML_SYCL_F16=ON"; \
fi && \
echo "Building with dynamic libs" && \
cmake -B build -DGGML_NATIVE=OFF -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DLLAMA_BUILD_TESTS=OFF ${OPT_SYCL_F16} && \
cmake --build build --config Release -j$(nproc)
RUN mkdir -p /app/lib && \
find build -name "*.so" -exec cp {} /app/lib \;
RUN mkdir -p /app/full \
&& cp build/bin/* /app/full \
&& cp *.py /app/full \
&& cp -r gguf-py /app/full \
&& cp -r requirements /app/full \
&& cp requirements.txt /app/full \
&& cp .devops/tools.sh /app/full/tools.sh
FROM intel/oneapi-basekit:$ONEAPI_VERSION AS base
RUN apt-get update \
&& apt-get install -y libgomp1 curl\
&& apt autoremove -y \
&& apt clean -y \
&& rm -rf /tmp/* /var/tmp/* \
&& find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
&& find /var/cache -type f -delete
### Full
FROM base AS full
COPY --from=build /app/lib/ /app
COPY --from=build /app/full /app
WORKDIR /app
RUN apt-get update && \
apt-get install -y \
git \
python3 \
python3-pip \
python3-venv && \
python3 -m venv /opt/venv && \
. /opt/venv/bin/activate && \
pip install --upgrade pip setuptools wheel && \
pip install -r requirements.txt && \
apt autoremove -y && \
apt clean -y && \
rm -rf /tmp/* /var/tmp/* && \
find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete && \
find /var/cache -type f -delete
ENV PATH="/opt/venv/bin:$PATH"
ENTRYPOINT ["/app/tools.sh"]
### Light, CLI only
FROM base AS light
COPY --from=build /app/lib/ /app
COPY --from=build /app/full/llama-cli /app
WORKDIR /app
ENTRYPOINT [ "/app/llama-cli" ]
### Server, Server only
FROM base AS server
ENV LLAMA_ARG_HOST=0.0.0.0
COPY --from=build /app/lib/ /app
COPY --from=build /app/full/llama-server /app
WORKDIR /app
HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/health" ]
ENTRYPOINT [ "/app/llama-server" ]
+44
View File
@@ -0,0 +1,44 @@
ARG ASCEND_VERSION=8.1.RC1.alpha001-910b-openeuler22.03-py3.10
FROM ascendai/cann:$ASCEND_VERSION AS build
WORKDIR /app
COPY . .
RUN yum install -y gcc g++ cmake make libcurl-devel
ENV ASCEND_TOOLKIT_HOME=/usr/local/Ascend/ascend-toolkit/latest
ENV LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:$LIBRARY_PATH
ENV LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:${ASCEND_TOOLKIT_HOME}/lib64/plugin/opskernel:${ASCEND_TOOLKIT_HOME}/lib64/plugin/nnengine:${ASCEND_TOOLKIT_HOME}/opp/built-in/op_impl/ai_core/tbe/op_tiling:${LD_LIBRARY_PATH}
ENV PYTHONPATH=${ASCEND_TOOLKIT_HOME}/python/site-packages:${ASCEND_TOOLKIT_HOME}/opp/built-in/op_impl/ai_core/tbe:${PYTHONPATH}
ENV PATH=${ASCEND_TOOLKIT_HOME}/bin:${ASCEND_TOOLKIT_HOME}/compiler/ccec_compiler/bin:${PATH}
ENV ASCEND_AICPU_PATH=${ASCEND_TOOLKIT_HOME}
ENV ASCEND_OPP_PATH=${ASCEND_TOOLKIT_HOME}/opp
ENV TOOLCHAIN_HOME=${ASCEND_TOOLKIT_HOME}/toolkit
ENV ASCEND_HOME_PATH=${ASCEND_TOOLKIT_HOME}
# find libascend_hal.so, because the drive hasn`t been mounted.
ENV LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/runtime/lib64/stub:$LD_LIBRARY_PATH
RUN echo "Building with static libs" && \
source /usr/local/Ascend/ascend-toolkit/set_env.sh --force && \
cmake -B build -DGGML_NATIVE=OFF -DGGML_CANN=ON -DBUILD_SHARED_LIBS=OFF -DLLAMA_BUILD_TESTS=OFF && \
cmake --build build --config Release --target llama-cli
# TODO: use image with NNRT
FROM ascendai/cann:$ASCEND_VERSION AS runtime
COPY --from=build /app/build/bin/llama-cli /llama-cli
ENV LC_ALL=C.utf8
ENV ASCEND_TOOLKIT_HOME=/usr/local/Ascend/ascend-toolkit/latest
ENV LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:$LIBRARY_PATH
ENV LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:${ASCEND_TOOLKIT_HOME}/lib64/plugin/opskernel:${ASCEND_TOOLKIT_HOME}/lib64/plugin/nnengine:${ASCEND_TOOLKIT_HOME}/opp/built-in/op_impl/ai_core/tbe/op_tiling:${LD_LIBRARY_PATH}
ENV PYTHONPATH=${ASCEND_TOOLKIT_HOME}/python/site-packages:${ASCEND_TOOLKIT_HOME}/opp/built-in/op_impl/ai_core/tbe:${PYTHONPATH}
ENV PATH=${ASCEND_TOOLKIT_HOME}/bin:${ASCEND_TOOLKIT_HOME}/compiler/ccec_compiler/bin:${PATH}
ENV ASCEND_AICPU_PATH=${ASCEND_TOOLKIT_HOME}
ENV ASCEND_OPP_PATH=${ASCEND_TOOLKIT_HOME}/opp
ENV TOOLCHAIN_HOME=${ASCEND_TOOLKIT_HOME}/toolkit
ENV ASCEND_HOME_PATH=${ASCEND_TOOLKIT_HOME}
ENTRYPOINT ["/llama-cli" ]
-84
View File
@@ -1,84 +0,0 @@
# SRPM for building from source and packaging an RPM for RPM-based distros.
# https://docs.fedoraproject.org/en-US/quick-docs/creating-rpm-packages
# Built and maintained by John Boero - boeroboy@gmail.com
# In honor of Seth Vidal https://www.redhat.com/it/blog/thank-you-seth-vidal
# Notes for llama.cpp:
# 1. Tags are currently based on hash - which will not sort asciibetically.
# We need to declare standard versioning if people want to sort latest releases.
# 2. Builds for CUDA/OpenCL support are separate, with different depenedencies.
# 3. NVidia's developer repo must be enabled with nvcc, cublas, clblas, etc installed.
# Example: https://developer.download.nvidia.com/compute/cuda/repos/fedora37/x86_64/cuda-fedora37.repo
# 4. OpenCL/CLBLAST support simply requires the ICD loader and basic opencl libraries.
# It is up to the user to install the correct vendor-specific support.
Name: llama.cpp-clblast
Version: %( date "+%%Y%%m%%d" )
Release: 1%{?dist}
Summary: OpenCL Inference of LLaMA model in C/C++
License: MIT
Source0: https://github.com/ggerganov/llama.cpp/archive/refs/heads/master.tar.gz
BuildRequires: coreutils make gcc-c++ git mesa-libOpenCL-devel clblast-devel
Requires: clblast
URL: https://github.com/ggerganov/llama.cpp
%define debug_package %{nil}
%define source_date_epoch_from_changelog 0
%description
CPU inference for Meta's Lllama2 models using default options.
%prep
%setup -n llama.cpp-master
%build
make -j LLAMA_CLBLAST=1
%install
mkdir -p %{buildroot}%{_bindir}/
cp -p main %{buildroot}%{_bindir}/llamaclblast
cp -p server %{buildroot}%{_bindir}/llamaclblastserver
cp -p simple %{buildroot}%{_bindir}/llamaclblastsimple
mkdir -p %{buildroot}/usr/lib/systemd/system
%{__cat} <<EOF > %{buildroot}/usr/lib/systemd/system/llamaclblast.service
[Unit]
Description=Llama.cpp server, CPU only (no GPU support in this build).
After=syslog.target network.target local-fs.target remote-fs.target nss-lookup.target
[Service]
Type=simple
EnvironmentFile=/etc/sysconfig/llama
ExecStart=/usr/bin/llamaclblastserver $LLAMA_ARGS
ExecReload=/bin/kill -s HUP $MAINPID
Restart=never
[Install]
WantedBy=default.target
EOF
mkdir -p %{buildroot}/etc/sysconfig
%{__cat} <<EOF > %{buildroot}/etc/sysconfig/llama
LLAMA_ARGS="-m /opt/llama2/ggml-model-f32.bin"
EOF
%clean
rm -rf %{buildroot}
rm -rf %{_builddir}/*
%files
%{_bindir}/llamaclblast
%{_bindir}/llamaclblastserver
%{_bindir}/llamaclblastsimple
/usr/lib/systemd/system/llamaclblast.service
%config /etc/sysconfig/llama
%pre
%post
%preun
%postun
%changelog
+10 -10
View File
@@ -17,10 +17,10 @@ Version: %( date "+%%Y%%m%%d" )
Release: 1%{?dist}
Summary: CPU Inference of LLaMA model in pure C/C++ (no CUDA/OpenCL)
License: MIT
Source0: https://github.com/ggerganov/llama.cpp/archive/refs/heads/master.tar.gz
Source0: https://github.com/ggml-org/llama.cpp/archive/refs/heads/master.tar.gz
BuildRequires: coreutils make gcc-c++ git cuda-toolkit
Requires: cuda-toolkit
URL: https://github.com/ggerganov/llama.cpp
URL: https://github.com/ggml-org/llama.cpp
%define debug_package %{nil}
%define source_date_epoch_from_changelog 0
@@ -32,13 +32,13 @@ CPU inference for Meta's Lllama2 models using default options.
%setup -n llama.cpp-master
%build
make -j LLAMA_CUDA=1
make -j GGML_CUDA=1
%install
mkdir -p %{buildroot}%{_bindir}/
cp -p main %{buildroot}%{_bindir}/llamacppcuda
cp -p server %{buildroot}%{_bindir}/llamacppcudaserver
cp -p simple %{buildroot}%{_bindir}/llamacppcudasimple
cp -p llama-cli %{buildroot}%{_bindir}/llama-cuda-cli
cp -p llama-server %{buildroot}%{_bindir}/llama-cuda-server
cp -p llama-simple %{buildroot}%{_bindir}/llama-cuda-simple
mkdir -p %{buildroot}/usr/lib/systemd/system
%{__cat} <<EOF > %{buildroot}/usr/lib/systemd/system/llamacuda.service
@@ -49,7 +49,7 @@ After=syslog.target network.target local-fs.target remote-fs.target nss-lookup.t
[Service]
Type=simple
EnvironmentFile=/etc/sysconfig/llama
ExecStart=/usr/bin/llamacppcudaserver $LLAMA_ARGS
ExecStart=/usr/bin/llama-cuda-server $LLAMA_ARGS
ExecReload=/bin/kill -s HUP $MAINPID
Restart=never
@@ -67,9 +67,9 @@ rm -rf %{buildroot}
rm -rf %{_builddir}/*
%files
%{_bindir}/llamacppcuda
%{_bindir}/llamacppcudaserver
%{_bindir}/llamacppcudasimple
%{_bindir}/llama-cuda-cli
%{_bindir}/llama-cuda-server
%{_bindir}/llama-cuda-simple
/usr/lib/systemd/system/llamacuda.service
%config /etc/sysconfig/llama
+9 -9
View File
@@ -18,10 +18,10 @@ Version: %( date "+%%Y%%m%%d" )
Release: 1%{?dist}
Summary: CPU Inference of LLaMA model in pure C/C++ (no CUDA/OpenCL)
License: MIT
Source0: https://github.com/ggerganov/llama.cpp/archive/refs/heads/master.tar.gz
Source0: https://github.com/ggml-org/llama.cpp/archive/refs/heads/master.tar.gz
BuildRequires: coreutils make gcc-c++ git libstdc++-devel
Requires: libstdc++
URL: https://github.com/ggerganov/llama.cpp
URL: https://github.com/ggml-org/llama.cpp
%define debug_package %{nil}
%define source_date_epoch_from_changelog 0
@@ -38,9 +38,9 @@ make -j
%install
mkdir -p %{buildroot}%{_bindir}/
cp -p main %{buildroot}%{_bindir}/llama
cp -p server %{buildroot}%{_bindir}/llamaserver
cp -p simple %{buildroot}%{_bindir}/llamasimple
cp -p llama-cli %{buildroot}%{_bindir}/llama-cli
cp -p llama-server %{buildroot}%{_bindir}/llama-server
cp -p llama-simple %{buildroot}%{_bindir}/llama-simple
mkdir -p %{buildroot}/usr/lib/systemd/system
%{__cat} <<EOF > %{buildroot}/usr/lib/systemd/system/llama.service
@@ -51,7 +51,7 @@ After=syslog.target network.target local-fs.target remote-fs.target nss-lookup.t
[Service]
Type=simple
EnvironmentFile=/etc/sysconfig/llama
ExecStart=/usr/bin/llamaserver $LLAMA_ARGS
ExecStart=/usr/bin/llama-server $LLAMA_ARGS
ExecReload=/bin/kill -s HUP $MAINPID
Restart=never
@@ -69,9 +69,9 @@ rm -rf %{buildroot}
rm -rf %{_builddir}/*
%files
%{_bindir}/llama
%{_bindir}/llamaserver
%{_bindir}/llamasimple
%{_bindir}/llama-cli
%{_bindir}/llama-server
%{_bindir}/llama-simple
/usr/lib/systemd/system/llama.service
%config /etc/sysconfig/llama
-35
View File
@@ -1,35 +0,0 @@
ARG UBUNTU_VERSION=22.04
# This needs to generally match the container host's environment.
ARG CUDA_VERSION=11.7.1
# Target the CUDA build image
ARG BASE_CUDA_DEV_CONTAINER=nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}
# Target the CUDA runtime image
ARG BASE_CUDA_RUN_CONTAINER=nvidia/cuda:${CUDA_VERSION}-runtime-ubuntu${UBUNTU_VERSION}
FROM ${BASE_CUDA_DEV_CONTAINER} as build
# Unless otherwise specified, we make a fat build.
ARG CUDA_DOCKER_ARCH=all
RUN apt-get update && \
apt-get install -y build-essential git
WORKDIR /app
COPY . .
# Set nvcc architecture
ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
# Enable CUDA
ENV LLAMA_CUDA=1
RUN make -j$(nproc) main
FROM ${BASE_CUDA_RUN_CONTAINER} as runtime
RUN apt-get update && \
apt-get install -y libgomp1
COPY --from=build /app/main /main
ENTRYPOINT [ "/main" ]
-34
View File
@@ -1,34 +0,0 @@
ARG ONEAPI_VERSION=2024.0.1-devel-ubuntu22.04
FROM intel/oneapi-basekit:$ONEAPI_VERSION as build
RUN wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | tee /usr/share/keyrings/intel-oneapi-archive-keyring.gpg > /dev/null && \
echo "deb [signed-by=/usr/share/keyrings/intel-oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main " | tee /etc/apt/sources.list.d/oneAPI.list && \
chmod 644 /usr/share/keyrings/intel-oneapi-archive-keyring.gpg && \
rm /etc/apt/sources.list.d/intel-graphics.list && \
wget -O- https://repositories.intel.com/graphics/intel-graphics.key | gpg --dearmor | tee /usr/share/keyrings/intel-graphics.gpg > /dev/null && \
echo "deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/graphics/ubuntu jammy arc" | tee /etc/apt/sources.list.d/intel.gpu.jammy.list && \
chmod 644 /usr/share/keyrings/intel-graphics.gpg
ARG LLAMA_SYCL_F16=OFF
RUN apt-get update && \
apt-get install -y git
WORKDIR /app
COPY . .
RUN if [ "${LLAMA_SYCL_F16}" = "ON" ]; then \
echo "LLAMA_SYCL_F16 is set" && \
export OPT_SYCL_F16="-DLLAMA_SYCL_F16=ON"; \
fi && \
cmake -B build -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx ${OPT_SYCL_F16} && \
cmake --build build --config Release --target main
FROM intel/oneapi-basekit:$ONEAPI_VERSION as runtime
COPY --from=build /app/build/bin/main /main
ENV LC_ALL=C.utf8
ENTRYPOINT [ "/main" ]
-45
View File
@@ -1,45 +0,0 @@
ARG UBUNTU_VERSION=22.04
# This needs to generally match the container host's environment.
ARG ROCM_VERSION=5.6
# Target the CUDA build image
ARG BASE_ROCM_DEV_CONTAINER=rocm/dev-ubuntu-${UBUNTU_VERSION}:${ROCM_VERSION}-complete
FROM ${BASE_ROCM_DEV_CONTAINER} as build
# Unless otherwise specified, we make a fat build.
# List from https://github.com/ggerganov/llama.cpp/pull/1087#issuecomment-1682807878
# This is mostly tied to rocBLAS supported archs.
ARG ROCM_DOCKER_ARCH=\
gfx803 \
gfx900 \
gfx906 \
gfx908 \
gfx90a \
gfx1010 \
gfx1030 \
gfx1100 \
gfx1101 \
gfx1102
COPY requirements.txt requirements.txt
COPY requirements requirements
RUN pip install --upgrade pip setuptools wheel \
&& pip install -r requirements.txt
WORKDIR /app
COPY . .
# Set nvcc architecture
ENV GPU_TARGETS=${ROCM_DOCKER_ARCH}
# Enable ROCm
ENV LLAMA_HIPBLAS=1
ENV CC=/opt/rocm/llvm/bin/clang
ENV CXX=/opt/rocm/llvm/bin/clang++
RUN make -j$(nproc) main
ENTRYPOINT [ "/app/main" ]
-27
View File
@@ -1,27 +0,0 @@
ARG UBUNTU_VERSION=jammy
FROM ubuntu:$UBUNTU_VERSION as build
# Install build tools
RUN apt update && apt install -y git build-essential cmake wget libgomp1
# Install Vulkan SDK
RUN wget -qO - https://packages.lunarg.com/lunarg-signing-key-pub.asc | apt-key add - && \
wget -qO /etc/apt/sources.list.d/lunarg-vulkan-jammy.list https://packages.lunarg.com/vulkan/lunarg-vulkan-jammy.list && \
apt update -y && \
apt-get install -y vulkan-sdk
# Build it
WORKDIR /app
COPY . .
RUN cmake -B build -DLLAMA_VULKAN=1 && \
cmake --build build --config Release --target main
# Clean up
WORKDIR /
RUN cp /app/build/bin/main /main && \
rm -rf /app
ENV LC_ALL=C.utf8
ENTRYPOINT [ "/main" ]
-23
View File
@@ -1,23 +0,0 @@
ARG UBUNTU_VERSION=22.04
FROM ubuntu:$UBUNTU_VERSION as build
RUN apt-get update && \
apt-get install -y build-essential git
WORKDIR /app
COPY . .
RUN make -j$(nproc) main
FROM ubuntu:$UBUNTU_VERSION as runtime
RUN apt-get update && \
apt-get install -y libgomp1
COPY --from=build /app/main /main
ENV LC_ALL=C.utf8
ENTRYPOINT [ "/main" ]
+101
View File
@@ -0,0 +1,101 @@
ARG UBUNTU_VERSION=22.04
# This needs to generally match the container host's environment.
ARG MUSA_VERSION=rc4.2.0
# Target the MUSA build image
ARG BASE_MUSA_DEV_CONTAINER=mthreads/musa:${MUSA_VERSION}-devel-ubuntu${UBUNTU_VERSION}-amd64
ARG BASE_MUSA_RUN_CONTAINER=mthreads/musa:${MUSA_VERSION}-runtime-ubuntu${UBUNTU_VERSION}-amd64
FROM ${BASE_MUSA_DEV_CONTAINER} AS build
# MUSA architecture to build for (defaults to all supported archs)
ARG MUSA_DOCKER_ARCH=default
RUN apt-get update && \
apt-get install -y \
build-essential \
cmake \
python3 \
python3-pip \
git \
libcurl4-openssl-dev \
libgomp1
WORKDIR /app
COPY . .
RUN if [ "${MUSA_DOCKER_ARCH}" != "default" ]; then \
export CMAKE_ARGS="-DMUSA_ARCHITECTURES=${MUSA_DOCKER_ARCH}"; \
fi && \
cmake -B build -DGGML_NATIVE=OFF -DGGML_MUSA=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DLLAMA_BUILD_TESTS=OFF ${CMAKE_ARGS} -DCMAKE_EXE_LINKER_FLAGS=-Wl,--allow-shlib-undefined . && \
cmake --build build --config Release -j$(nproc)
RUN mkdir -p /app/lib && \
find build -name "*.so" -exec cp {} /app/lib \;
RUN mkdir -p /app/full \
&& cp build/bin/* /app/full \
&& cp *.py /app/full \
&& cp -r gguf-py /app/full \
&& cp -r requirements /app/full \
&& cp requirements.txt /app/full \
&& cp .devops/tools.sh /app/full/tools.sh
## Base image
FROM ${BASE_MUSA_RUN_CONTAINER} AS base
RUN apt-get update \
&& apt-get install -y libgomp1 curl\
&& apt autoremove -y \
&& apt clean -y \
&& rm -rf /tmp/* /var/tmp/* \
&& find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
&& find /var/cache -type f -delete
COPY --from=build /app/lib/ /app
### Full
FROM base AS full
COPY --from=build /app/full /app
WORKDIR /app
RUN apt-get update \
&& apt-get install -y \
git \
python3 \
python3-pip \
&& pip install --upgrade pip setuptools wheel \
&& pip install -r requirements.txt \
&& apt autoremove -y \
&& apt clean -y \
&& rm -rf /tmp/* /var/tmp/* \
&& find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
&& find /var/cache -type f -delete
ENTRYPOINT ["/app/tools.sh"]
### Light, CLI only
FROM base AS light
COPY --from=build /app/full/llama-cli /app
WORKDIR /app
ENTRYPOINT [ "/app/llama-cli" ]
### Server, Server only
FROM base AS server
ENV LLAMA_ARG_HOST=0.0.0.0
COPY --from=build /app/full/llama-server /app
WORKDIR /app
HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/health" ]
ENTRYPOINT [ "/app/llama-server" ]
+2 -3
View File
@@ -6,11 +6,10 @@
let
inherit (config.packages) default;
binaries = [
"llama"
"llama-cli"
"llama-embedding"
"llama-server"
"quantize"
"train-text-from-scratch"
"llama-quantize"
];
mkApp = name: {
type = "app";
+46 -7
View File
@@ -1,13 +1,52 @@
{ inputs, ... }:
{
perSystem =
{ config, lib, ... }:
{
config,
lib,
system,
...
}:
{
devShells =
lib.concatMapAttrs
(name: package: {
${name} = package.passthru.shell;
${name + "-extra"} = package.passthru.shell-extra;
})
config.packages;
let
pkgs = import inputs.nixpkgs { inherit system; };
stdenv = pkgs.stdenv;
scripts = config.packages.python-scripts;
in
lib.pipe (config.packages) [
(lib.concatMapAttrs (
name: package: {
${name} = pkgs.mkShell {
name = "${name}";
inputsFrom = [ package ];
shellHook = ''
echo "Entering ${name} devShell"
'';
};
"${name}-extra" =
if (name == "python-scripts") then
null
else
pkgs.mkShell {
name = "${name}-extra";
inputsFrom = [
package
scripts
];
# Extra packages that *may* be used by some scripts
packages = [
pkgs.python3Packages.tiktoken
];
shellHook = ''
echo "Entering ${name} devShell"
addToSearchPath "LD_LIBRARY_PATH" "${lib.getLib stdenv.cc.cc}/lib"
'';
};
}
))
(lib.filterAttrs (name: value: value != null))
];
};
}
+8 -10
View File
@@ -26,16 +26,14 @@
config.cudaSupport = true;
config.allowUnfreePredicate =
p:
builtins.all
(
license:
license.free
|| builtins.elem license.shortName [
"CUDA EULA"
"cuDNN EULA"
]
)
(p.meta.licenses or [ p.meta.license ]);
builtins.all (
license:
license.free
|| builtins.elem license.shortName [
"CUDA EULA"
"cuDNN EULA"
]
) (p.meta.licenses or [ p.meta.license ]);
};
# Ensure dependencies use ROCm consistently
pkgsRocm = import inputs.nixpkgs {
+36
View File
@@ -0,0 +1,36 @@
{
lib,
llamaVersion,
numpy,
tqdm,
sentencepiece,
pyyaml,
poetry-core,
buildPythonPackage,
pytestCheckHook,
}:
buildPythonPackage {
pname = "gguf";
version = llamaVersion;
pyproject = true;
nativeBuildInputs = [ poetry-core ];
propagatedBuildInputs = [
numpy
tqdm
sentencepiece
pyyaml
];
src = lib.cleanSource ../../gguf-py;
pythonImportsCheck = [
"numpy"
"gguf"
];
nativeCheckInputs = [ pytestCheckHook ];
doCheck = true;
meta = with lib; {
description = "Python package for writing binary files in the GGUF format";
license = licenses.mit;
maintainers = [ maintainers.ditsuke ];
};
}
+155 -225
View File
@@ -3,33 +3,36 @@
glibc,
config,
stdenv,
mkShell,
runCommand,
cmake,
ninja,
pkg-config,
git,
python3,
mpi,
blas,
cudaPackages,
autoAddDriverRunpath,
darwin,
rocmPackages,
vulkan-headers,
vulkan-loader,
clblast,
useBlas ? builtins.all (x: !x) [
useCuda
useMetalKit
useOpenCL
useRocm
useVulkan
] && blas.meta.available,
curl,
shaderc,
useBlas ?
builtins.all (x: !x) [
useCuda
useMetalKit
useRocm
useVulkan
]
&& blas.meta.available,
useCuda ? config.cudaSupport,
useMetalKit ? stdenv.isAarch64 && stdenv.isDarwin && !useOpenCL,
useMpi ? false, # Increases the runtime closure size by ~700M
useOpenCL ? false,
useMetalKit ? stdenv.isAarch64 && stdenv.isDarwin,
# Increases the runtime closure size by ~700M
useMpi ? false,
useRocm ? config.rocmSupport,
rocmGpuTargets ? builtins.concatStringsSep ";" rocmPackages.clr.gpuTargets,
enableCurl ? true,
useVulkan ? false,
llamaVersion ? "0.0.0", # Arbitrary version, substituted by the flake
@@ -37,16 +40,16 @@
# otherwise we get libstdc++ errors downstream.
effectiveStdenv ? if useCuda then cudaPackages.backendStdenv else stdenv,
enableStatic ? effectiveStdenv.hostPlatform.isStatic,
precompileMetalShaders ? false
}@inputs:
precompileMetalShaders ? false,
}:
let
inherit (lib)
cmakeBool
cmakeFeature
optionalAttrs
optionals
strings
versionOlder
;
stdenv = throw "Use effectiveStdenv instead";
@@ -56,45 +59,17 @@ let
++ lib.optionals useCuda [ "CUDA" ]
++ lib.optionals useMetalKit [ "MetalKit" ]
++ lib.optionals useMpi [ "MPI" ]
++ lib.optionals useOpenCL [ "OpenCL" ]
++ lib.optionals useRocm [ "ROCm" ]
++ lib.optionals useVulkan [ "Vulkan" ];
pnameSuffix =
strings.optionalString (suffices != [ ])
"-${strings.concatMapStringsSep "-" strings.toLower suffices}";
descriptionSuffix =
strings.optionalString (suffices != [ ])
", accelerated with ${strings.concatStringsSep ", " suffices}";
descriptionSuffix = strings.optionalString (
suffices != [ ]
) ", accelerated with ${strings.concatStringsSep ", " suffices}";
executableSuffix = effectiveStdenv.hostPlatform.extensions.executable;
# TODO: package the Python in this repository in a Nix-like way.
# It'd be nice to migrate to buildPythonPackage, as well as ensure this repo
# is PEP 517-compatible, and ensure the correct .dist-info is generated.
# https://peps.python.org/pep-0517/
#
# TODO: Package up each Python script or service appropriately, by making
# them into "entrypoints"
llama-python = python3.withPackages (
ps: [
ps.numpy
ps.sentencepiece
]
);
# TODO(Green-Sky): find a better way to opt-into the heavy ml python runtime
llama-python-extra = python3.withPackages (
ps: [
ps.numpy
ps.sentencepiece
ps.tiktoken
ps.torchWithoutCuda
ps.transformers
]
);
xcrunHost = runCommand "xcrunHost" {} ''
xcrunHost = runCommand "xcrunHost" { } ''
mkdir -p $out/bin
ln -s /usr/bin/xcrun $out/bin
'';
@@ -111,16 +86,9 @@ let
++ optionals useMetalKit [ MetalKit ];
cudaBuildInputs = with cudaPackages; [
cuda_cccl.dev # <nv/target>
# A temporary hack for reducing the closure size, remove once cudaPackages
# have stopped using lndir: https://github.com/NixOS/nixpkgs/issues/271792
cuda_cudart.dev
cuda_cudart.lib
cuda_cudart.static
libcublas.dev
libcublas.lib
libcublas.static
cuda_cudart
cuda_cccl # <nv/target>
libcublas
];
rocmBuildInputs = with rocmPackages; [
@@ -132,187 +100,149 @@ let
vulkanBuildInputs = [
vulkan-headers
vulkan-loader
shaderc
];
in
effectiveStdenv.mkDerivation (
finalAttrs: {
pname = "llama-cpp${pnameSuffix}";
version = llamaVersion;
effectiveStdenv.mkDerivation (finalAttrs: {
pname = "llama-cpp${pnameSuffix}";
version = llamaVersion;
# Note: none of the files discarded here are visible in the sandbox or
# affect the output hash. This also means they can be modified without
# triggering a rebuild.
src = lib.cleanSourceWith {
filter =
name: type:
let
noneOf = builtins.all (x: !x);
baseName = baseNameOf name;
in
noneOf [
(lib.hasSuffix ".nix" name) # Ignore *.nix files when computing outPaths
(lib.hasSuffix ".md" name) # Ignore *.md changes whe computing outPaths
(lib.hasPrefix "." baseName) # Skip hidden files and directories
(baseName == "flake.lock")
];
src = lib.cleanSource ../../.;
};
postPatch = ''
substituteInPlace ./ggml-metal.m \
--replace '[bundle pathForResource:@"ggml-metal" ofType:@"metal"];' "@\"$out/bin/ggml-metal.metal\";"
substituteInPlace ./ggml-metal.m \
--replace '[bundle pathForResource:@"default" ofType:@"metallib"];' "@\"$out/bin/default.metallib\";"
'';
# With PR#6015 https://github.com/ggerganov/llama.cpp/pull/6015,
# `default.metallib` may be compiled with Metal compiler from XCode
# and we need to escape sandbox on MacOS to access Metal compiler.
# `xcrun` is used find the path of the Metal compiler, which is varible
# and not on $PATH
# see https://github.com/ggerganov/llama.cpp/pull/6118 for discussion
__noChroot = effectiveStdenv.isDarwin && useMetalKit && precompileMetalShaders;
nativeBuildInputs =
[
cmake
ninja
pkg-config
git
]
++ optionals useCuda [
cudaPackages.cuda_nvcc
# TODO: Replace with autoAddDriverRunpath
# once https://github.com/NixOS/nixpkgs/pull/275241 has been merged
cudaPackages.autoAddOpenGLRunpathHook
]
++ optionals (effectiveStdenv.hostPlatform.isGnu && enableStatic) [
glibc.static
] ++ optionals (effectiveStdenv.isDarwin && useMetalKit && precompileMetalShaders) [
xcrunHost
# Note: none of the files discarded here are visible in the sandbox or
# affect the output hash. This also means they can be modified without
# triggering a rebuild.
src = lib.cleanSourceWith {
filter =
name: type:
let
noneOf = builtins.all (x: !x);
baseName = baseNameOf name;
in
noneOf [
(lib.hasSuffix ".nix" name) # Ignore *.nix files when computing outPaths
(lib.hasSuffix ".md" name) # Ignore *.md changes whe computing outPaths
(lib.hasPrefix "." baseName) # Skip hidden files and directories
(baseName == "flake.lock")
];
src = lib.cleanSource ../../.;
};
buildInputs =
optionals effectiveStdenv.isDarwin darwinBuildInputs
++ optionals useCuda cudaBuildInputs
++ optionals useMpi [ mpi ]
++ optionals useOpenCL [ clblast ]
++ optionals useRocm rocmBuildInputs
++ optionals useBlas [ blas ]
++ optionals useVulkan vulkanBuildInputs;
postPatch = ''
substituteInPlace ./ggml/src/ggml-metal/ggml-metal.m \
--replace '[bundle pathForResource:@"ggml-metal" ofType:@"metal"];' "@\"$out/bin/ggml-metal.metal\";"
substituteInPlace ./ggml/src/ggml-metal/ggml-metal.m \
--replace '[bundle pathForResource:@"default" ofType:@"metallib"];' "@\"$out/bin/default.metallib\";"
'';
cmakeFlags =
[
(cmakeBool "LLAMA_NATIVE" false)
(cmakeBool "LLAMA_BUILD_SERVER" true)
(cmakeBool "BUILD_SHARED_LIBS" (!enableStatic))
(cmakeBool "CMAKE_SKIP_BUILD_RPATH" true)
(cmakeBool "LLAMA_BLAS" useBlas)
(cmakeBool "LLAMA_CLBLAST" useOpenCL)
(cmakeBool "LLAMA_CUDA" useCuda)
(cmakeBool "LLAMA_HIPBLAS" useRocm)
(cmakeBool "LLAMA_METAL" useMetalKit)
(cmakeBool "LLAMA_VULKAN" useVulkan)
(cmakeBool "LLAMA_STATIC" enableStatic)
]
++ optionals useCuda [
(
with cudaPackages.flags;
cmakeFeature "CMAKE_CUDA_ARCHITECTURES" (
builtins.concatStringsSep ";" (map dropDot cudaCapabilities)
)
# With PR#6015 https://github.com/ggml-org/llama.cpp/pull/6015,
# `default.metallib` may be compiled with Metal compiler from XCode
# and we need to escape sandbox on MacOS to access Metal compiler.
# `xcrun` is used find the path of the Metal compiler, which is varible
# and not on $PATH
# see https://github.com/ggml-org/llama.cpp/pull/6118 for discussion
__noChroot = effectiveStdenv.isDarwin && useMetalKit && precompileMetalShaders;
nativeBuildInputs =
[
cmake
ninja
pkg-config
git
]
++ optionals useCuda [
cudaPackages.cuda_nvcc
autoAddDriverRunpath
]
++ optionals (effectiveStdenv.hostPlatform.isGnu && enableStatic) [ glibc.static ]
++ optionals (effectiveStdenv.isDarwin && useMetalKit && precompileMetalShaders) [ xcrunHost ];
buildInputs =
optionals effectiveStdenv.isDarwin darwinBuildInputs
++ optionals useCuda cudaBuildInputs
++ optionals useMpi [ mpi ]
++ optionals useRocm rocmBuildInputs
++ optionals useBlas [ blas ]
++ optionals useVulkan vulkanBuildInputs
++ optionals enableCurl [ curl ];
cmakeFlags =
[
(cmakeBool "LLAMA_BUILD_SERVER" true)
(cmakeBool "BUILD_SHARED_LIBS" (!enableStatic))
(cmakeBool "CMAKE_SKIP_BUILD_RPATH" true)
(cmakeBool "LLAMA_CURL" enableCurl)
(cmakeBool "GGML_NATIVE" false)
(cmakeBool "GGML_BLAS" useBlas)
(cmakeBool "GGML_CUDA" useCuda)
(cmakeBool "GGML_HIP" useRocm)
(cmakeBool "GGML_METAL" useMetalKit)
(cmakeBool "GGML_VULKAN" useVulkan)
(cmakeBool "GGML_STATIC" enableStatic)
]
++ optionals useCuda [
(
with cudaPackages.flags;
cmakeFeature "CMAKE_CUDA_ARCHITECTURES" (
builtins.concatStringsSep ";" (map dropDot cudaCapabilities)
)
]
++ optionals useRocm [
(cmakeFeature "CMAKE_HIP_COMPILER" "${rocmPackages.llvm.clang}/bin/clang")
(cmakeFeature "CMAKE_HIP_ARCHITECTURES" (builtins.concatStringsSep ";" rocmPackages.clr.gpuTargets))
]
++ optionals useMetalKit [
(lib.cmakeFeature "CMAKE_C_FLAGS" "-D__ARM_FEATURE_DOTPROD=1")
(cmakeBool "LLAMA_METAL_EMBED_LIBRARY" (!precompileMetalShaders))
];
)
]
++ optionals useRocm [
(cmakeFeature "CMAKE_HIP_COMPILER" "${rocmPackages.llvm.clang}/bin/clang")
(cmakeFeature "CMAKE_HIP_ARCHITECTURES" rocmGpuTargets)
]
++ optionals useMetalKit [
(lib.cmakeFeature "CMAKE_C_FLAGS" "-D__ARM_FEATURE_DOTPROD=1")
(cmakeBool "GGML_METAL_EMBED_LIBRARY" (!precompileMetalShaders))
];
# Environment variables needed for ROCm
env = optionals useRocm {
ROCM_PATH = "${rocmPackages.clr}";
HIP_DEVICE_LIB_PATH = "${rocmPackages.rocm-device-libs}/amdgcn/bitcode";
};
# Environment variables needed for ROCm
env = optionalAttrs useRocm {
ROCM_PATH = "${rocmPackages.clr}";
HIP_DEVICE_LIB_PATH = "${rocmPackages.rocm-device-libs}/amdgcn/bitcode";
};
# TODO(SomeoneSerge): It's better to add proper install targets at the CMake level,
# if they haven't been added yet.
postInstall = ''
mv $out/bin/main${executableSuffix} $out/bin/llama${executableSuffix}
mv $out/bin/server${executableSuffix} $out/bin/llama-server${executableSuffix}
mkdir -p $out/include
cp $src/llama.h $out/include/
'';
# TODO(SomeoneSerge): It's better to add proper install targets at the CMake level,
# if they haven't been added yet.
postInstall = ''
mkdir -p $out/include
cp $src/include/llama.h $out/include/
'';
# Define the shells here, but don't add in the inputsFrom to avoid recursion.
passthru = {
inherit
useBlas
useCuda
useMetalKit
useMpi
useOpenCL
useRocm
useVulkan
;
meta = {
# Configurations we don't want even the CI to evaluate. Results in the
# "unsupported platform" messages. This is mostly a no-op, because
# cudaPackages would've refused to evaluate anyway.
badPlatforms = optionals useCuda lib.platforms.darwin;
shell = mkShell {
name = "shell-${finalAttrs.finalPackage.name}";
description = "contains numpy and sentencepiece";
buildInputs = [ llama-python ];
inputsFrom = [ finalAttrs.finalPackage ];
shellHook = ''
addToSearchPath "LD_LIBRARY_PATH" "${lib.getLib effectiveStdenv.cc.cc}/lib"
'';
};
# Configurations that are known to result in build failures. Can be
# overridden by importing Nixpkgs with `allowBroken = true`.
broken = (useMetalKit && !effectiveStdenv.isDarwin);
shell-extra = mkShell {
name = "shell-extra-${finalAttrs.finalPackage.name}";
description = "contains numpy, sentencepiece, torchWithoutCuda, and transformers";
buildInputs = [ llama-python-extra ];
inputsFrom = [ finalAttrs.finalPackage ];
};
};
description = "Inference of LLaMA model in pure C/C++${descriptionSuffix}";
homepage = "https://github.com/ggml-org/llama.cpp/";
license = lib.licenses.mit;
meta = {
# Configurations we don't want even the CI to evaluate. Results in the
# "unsupported platform" messages. This is mostly a no-op, because
# cudaPackages would've refused to evaluate anyway.
badPlatforms = optionals (useCuda || useOpenCL) lib.platforms.darwin;
# Accommodates `nix run` and `lib.getExe`
mainProgram = "llama-cli";
# Configurations that are known to result in build failures. Can be
# overridden by importing Nixpkgs with `allowBroken = true`.
broken = (useMetalKit && !effectiveStdenv.isDarwin);
# These people might respond, on the best effort basis, if you ping them
# in case of Nix-specific regressions or for reviewing Nix-specific PRs.
# Consider adding yourself to this list if you want to ensure this flake
# stays maintained and you're willing to invest your time. Do not add
# other people without their consent. Consider removing people after
# they've been unreachable for long periods of time.
description = "Inference of LLaMA model in pure C/C++${descriptionSuffix}";
homepage = "https://github.com/ggerganov/llama.cpp/";
license = lib.licenses.mit;
# Note that lib.maintainers is defined in Nixpkgs, but you may just add
# an attrset following the same format as in
# https://github.com/NixOS/nixpkgs/blob/f36a80e54da29775c78d7eff0e628c2b4e34d1d7/maintainers/maintainer-list.nix
maintainers = with lib.maintainers; [
philiptaron
SomeoneSerge
];
# Accommodates `nix run` and `lib.getExe`
mainProgram = "llama";
# These people might respond, on the best effort basis, if you ping them
# in case of Nix-specific regressions or for reviewing Nix-specific PRs.
# Consider adding yourself to this list if you want to ensure this flake
# stays maintained and you're willing to invest your time. Do not add
# other people without their consent. Consider removing people after
# they've been unreachable for long periods of time.
# Note that lib.maintainers is defined in Nixpkgs, but you may just add
# an attrset following the same format as in
# https://github.com/NixOS/nixpkgs/blob/f36a80e54da29775c78d7eff0e628c2b4e34d1d7/maintainers/maintainer-list.nix
maintainers = with lib.maintainers; [
philiptaron
SomeoneSerge
];
# Extend `badPlatforms` instead
platforms = lib.platforms.all;
};
}
)
# Extend `badPlatforms` instead
platforms = lib.platforms.all;
};
})
+66
View File
@@ -0,0 +1,66 @@
{
lib,
stdenv,
buildPythonPackage,
poetry-core,
mkShell,
python3Packages,
gguf-py,
}@inputs:
let
llama-python-deps = with python3Packages; [
numpy
sentencepiece
transformers
protobuf
torchWithoutCuda
gguf-py
tqdm
# for scripts/compare-llama-bench.py
gitpython
tabulate
# for examples/pydantic-models-to-grammar-examples.py
docstring-parser
pydantic
];
llama-python-test-deps = with python3Packages; [
# Server bench
matplotlib
# server tests
openai
pytest
prometheus-client
];
in
buildPythonPackage ({
pname = "llama-scripts";
version = "0.0.0";
pyproject = true;
# NOTE: The files filtered out here are not visible in the build sandbox, neither
# do they affect the output hash. They can be modified without triggering a rebuild.
src = lib.cleanSourceWith {
filter =
name: type:
let
any = builtins.any (x: x);
baseName = builtins.baseNameOf name;
in
any [
(lib.hasSuffix ".py" name)
(baseName == "README.md")
(baseName == "pyproject.toml")
];
src = lib.cleanSource ../../.;
};
nativeBuildInputs = [ poetry-core ];
nativeCheckInputs = llama-python-test-deps;
dependencies = llama-python-deps;
})
+31 -9
View File
@@ -1,19 +1,41 @@
{
lib,
newScope,
python3,
llamaVersion ? "0.0.0",
}:
let
pythonPackages = python3.pkgs;
buildPythonPackage = pythonPackages.buildPythonPackage;
numpy = pythonPackages.numpy;
tqdm = pythonPackages.tqdm;
sentencepiece = pythonPackages.sentencepiece;
pyyaml = pythonPackages.pyyaml;
poetry-core = pythonPackages.poetry-core;
pytestCheckHook = pythonPackages.pytestCheckHook;
in
# We're using `makeScope` instead of just writing out an attrset
# because it allows users to apply overlays later using `overrideScope'`.
# Cf. https://noogle.dev/f/lib/makeScope
lib.makeScope newScope (
self: {
inherit llamaVersion;
llama-cpp = self.callPackage ./package.nix { };
docker = self.callPackage ./docker.nix { };
docker-min = self.callPackage ./docker.nix { interactive = false; };
sif = self.callPackage ./sif.nix { };
}
)
lib.makeScope newScope (self: {
inherit llamaVersion;
gguf-py = self.callPackage ./package-gguf-py.nix {
inherit
buildPythonPackage
numpy
tqdm
sentencepiece
poetry-core
pyyaml
pytestCheckHook
;
};
python-scripts = self.callPackage ./python-scripts.nix { inherit buildPythonPackage poetry-core; };
llama-cpp = self.callPackage ./package.nix { };
docker = self.callPackage ./docker.nix { };
docker-min = self.callPackage ./docker.nix { interactive = false; };
sif = self.callPackage ./sif.nix { };
})
+113
View File
@@ -0,0 +1,113 @@
ARG UBUNTU_VERSION=24.04
# This needs to generally match the container host's environment.
ARG ROCM_VERSION=6.4
ARG AMDGPU_VERSION=6.4
# Target the CUDA build image
ARG BASE_ROCM_DEV_CONTAINER=rocm/dev-ubuntu-${UBUNTU_VERSION}:${ROCM_VERSION}-complete
### Build image
FROM ${BASE_ROCM_DEV_CONTAINER} AS build
# Unless otherwise specified, we make a fat build.
# List from https://github.com/ggml-org/llama.cpp/pull/1087#issuecomment-1682807878
# This is mostly tied to rocBLAS supported archs.
# gfx803, gfx900, gfx1032, gfx1101, gfx1102,not officialy supported
# gfx906 is deprecated
#check https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.2.4/reference/system-requirements.html
ARG ROCM_DOCKER_ARCH='gfx803,gfx900,gfx906,gfx908,gfx90a,gfx942,gfx1010,gfx1030,gfx1032,gfx1100,gfx1101,gfx1102'
#ARG ROCM_DOCKER_ARCH=gfx1100
# Set nvcc architectured
ENV AMDGPU_TARGETS=${ROCM_DOCKER_ARCH}
# Enable ROCm
# ENV CC=/opt/rocm/llvm/bin/clang
# ENV CXX=/opt/rocm/llvm/bin/clang++
RUN apt-get update \
&& apt-get install -y \
build-essential \
cmake \
git \
libcurl4-openssl-dev \
curl \
libgomp1
WORKDIR /app
COPY . .
RUN HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=$ROCM_DOCKER_ARCH -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DCMAKE_BUILD_TYPE=Release -DLLAMA_BUILD_TESTS=OFF \
&& cmake --build build --config Release -j$(nproc)
RUN mkdir -p /app/lib \
&& find build -name "*.so" -exec cp {} /app/lib \;
RUN mkdir -p /app/full \
&& cp build/bin/* /app/full \
&& cp *.py /app/full \
&& cp -r gguf-py /app/full \
&& cp -r requirements /app/full \
&& cp requirements.txt /app/full \
&& cp .devops/tools.sh /app/full/tools.sh
## Base image
FROM ${BASE_ROCM_DEV_CONTAINER} AS base
RUN apt-get update \
&& apt-get install -y libgomp1 curl\
&& apt autoremove -y \
&& apt clean -y \
&& rm -rf /tmp/* /var/tmp/* \
&& find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
&& find /var/cache -type f -delete
COPY --from=build /app/lib/ /app
### Full
FROM base AS full
COPY --from=build /app/full /app
WORKDIR /app
RUN apt-get update \
&& apt-get install -y \
git \
python3-pip \
python3 \
python3-wheel\
&& pip install --break-system-packages --upgrade setuptools \
&& pip install --break-system-packages -r requirements.txt \
&& apt autoremove -y \
&& apt clean -y \
&& rm -rf /tmp/* /var/tmp/* \
&& find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
&& find /var/cache -type f -delete
ENTRYPOINT ["/app/tools.sh"]
### Light, CLI only
FROM base AS light
COPY --from=build /app/full/llama-cli /app
WORKDIR /app
ENTRYPOINT [ "/app/llama-cli" ]
### Server, Server only
FROM base AS server
ENV LLAMA_ARG_HOST=0.0.0.0
COPY --from=build /app/full/llama-server /app
WORKDIR /app
HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/health" ]
ENTRYPOINT [ "/app/llama-server" ]
-37
View File
@@ -1,37 +0,0 @@
ARG UBUNTU_VERSION=22.04
# This needs to generally match the container host's environment.
ARG CUDA_VERSION=11.7.1
# Target the CUDA build image
ARG BASE_CUDA_DEV_CONTAINER=nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}
# Target the CUDA runtime image
ARG BASE_CUDA_RUN_CONTAINER=nvidia/cuda:${CUDA_VERSION}-runtime-ubuntu${UBUNTU_VERSION}
FROM ${BASE_CUDA_DEV_CONTAINER} as build
# Unless otherwise specified, we make a fat build.
ARG CUDA_DOCKER_ARCH=all
RUN apt-get update && \
apt-get install -y build-essential git libcurl4-openssl-dev
WORKDIR /app
COPY . .
# Set nvcc architecture
ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
# Enable CUDA
ENV LLAMA_CUDA=1
# Enable cURL
ENV LLAMA_CURL=1
RUN make -j$(nproc) server
FROM ${BASE_CUDA_RUN_CONTAINER} as runtime
RUN apt-get update && \
apt-get install -y libcurl4-openssl-dev libgomp1
COPY --from=build /app/server /server
ENTRYPOINT [ "/server" ]
-45
View File
@@ -1,45 +0,0 @@
ARG ONEAPI_VERSION=2024.0.1-devel-ubuntu22.04
FROM intel/oneapi-basekit:$ONEAPI_VERSION as build
RUN wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | tee /usr/share/keyrings/intel-oneapi-archive-keyring.gpg > /dev/null && \
echo "deb [signed-by=/usr/share/keyrings/intel-oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main " | tee /etc/apt/sources.list.d/oneAPI.list && \
chmod 644 /usr/share/keyrings/intel-oneapi-archive-keyring.gpg && \
rm /etc/apt/sources.list.d/intel-graphics.list && \
wget -O- https://repositories.intel.com/graphics/intel-graphics.key | gpg --dearmor | tee /usr/share/keyrings/intel-graphics.gpg > /dev/null && \
echo "deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/graphics/ubuntu jammy arc" | tee /etc/apt/sources.list.d/intel.gpu.jammy.list && \
chmod 644 /usr/share/keyrings/intel-graphics.gpg
ARG LLAMA_SYCL_F16=OFF
RUN apt-get update && \
apt-get install -y git libcurl4-openssl-dev
WORKDIR /app
COPY . .
RUN if [ "${LLAMA_SYCL_F16}" = "ON" ]; then \
echo "LLAMA_SYCL_F16 is set" && \
export OPT_SYCL_F16="-DLLAMA_SYCL_F16=ON"; \
fi && \
cmake -B build -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_CURL=ON ${OPT_SYCL_F16} && \
cmake --build build --config Release --target server
FROM intel/oneapi-basekit:$ONEAPI_VERSION as runtime
RUN wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | tee /usr/share/keyrings/intel-oneapi-archive-keyring.gpg > /dev/null && \
echo "deb [signed-by=/usr/share/keyrings/intel-oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main " | tee /etc/apt/sources.list.d/oneAPI.list && \
chmod 644 /usr/share/keyrings/intel-oneapi-archive-keyring.gpg && \
rm /etc/apt/sources.list.d/intel-graphics.list && \
wget -O- https://repositories.intel.com/graphics/intel-graphics.key | gpg --dearmor | tee /usr/share/keyrings/intel-graphics.gpg > /dev/null && \
echo "deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/graphics/ubuntu jammy arc" | tee /etc/apt/sources.list.d/intel.gpu.jammy.list && \
chmod 644 /usr/share/keyrings/intel-graphics.gpg
RUN apt-get update && \
apt-get install -y libcurl4-openssl-dev
COPY --from=build /app/build/bin/server /server
ENV LC_ALL=C.utf8
ENTRYPOINT [ "/server" ]
-50
View File
@@ -1,50 +0,0 @@
ARG UBUNTU_VERSION=22.04
# This needs to generally match the container host's environment.
ARG ROCM_VERSION=5.6
# Target the CUDA build image
ARG BASE_ROCM_DEV_CONTAINER=rocm/dev-ubuntu-${UBUNTU_VERSION}:${ROCM_VERSION}-complete
FROM ${BASE_ROCM_DEV_CONTAINER} as build
# Unless otherwise specified, we make a fat build.
# List from https://github.com/ggerganov/llama.cpp/pull/1087#issuecomment-1682807878
# This is mostly tied to rocBLAS supported archs.
ARG ROCM_DOCKER_ARCH=\
gfx803 \
gfx900 \
gfx906 \
gfx908 \
gfx90a \
gfx1010 \
gfx1030 \
gfx1100 \
gfx1101 \
gfx1102
COPY requirements.txt requirements.txt
COPY requirements requirements
RUN pip install --upgrade pip setuptools wheel \
&& pip install -r requirements.txt
WORKDIR /app
COPY . .
# Set nvcc architecture
ENV GPU_TARGETS=${ROCM_DOCKER_ARCH}
# Enable ROCm
ENV LLAMA_HIPBLAS=1
ENV CC=/opt/rocm/llvm/bin/clang
ENV CXX=/opt/rocm/llvm/bin/clang++
# Enable cURL
ENV LLAMA_CURL=1
RUN apt-get update && \
apt-get install -y libcurl4-openssl-dev
RUN make -j$(nproc)
ENTRYPOINT [ "/app/server" ]
-31
View File
@@ -1,31 +0,0 @@
ARG UBUNTU_VERSION=jammy
FROM ubuntu:$UBUNTU_VERSION as build
# Install build tools
RUN apt update && apt install -y git build-essential cmake wget
# Install Vulkan SDK
RUN wget -qO - https://packages.lunarg.com/lunarg-signing-key-pub.asc | apt-key add - && \
wget -qO /etc/apt/sources.list.d/lunarg-vulkan-jammy.list https://packages.lunarg.com/vulkan/lunarg-vulkan-jammy.list && \
apt update -y && \
apt-get install -y vulkan-sdk
# Install cURL
RUN apt-get update && \
apt-get install -y libcurl4-openssl-dev
# Build it
WORKDIR /app
COPY . .
RUN cmake -B build -DLLAMA_VULKAN=1 -DLLAMA_CURL=1 && \
cmake --build build --config Release --target server
# Clean up
WORKDIR /
RUN cp /app/build/bin/server /server && \
rm -rf /app
ENV LC_ALL=C.utf8
ENTRYPOINT [ "/server" ]
-25
View File
@@ -1,25 +0,0 @@
ARG UBUNTU_VERSION=22.04
FROM ubuntu:$UBUNTU_VERSION as build
RUN apt-get update && \
apt-get install -y build-essential git libcurl4-openssl-dev
WORKDIR /app
COPY . .
ENV LLAMA_CURL=1
RUN make -j$(nproc) server
FROM ubuntu:$UBUNTU_VERSION as runtime
RUN apt-get update && \
apt-get install -y libcurl4-openssl-dev libgomp1
COPY --from=build /app/server /server
ENV LC_ALL=C.utf8
ENTRYPOINT [ "/server" ]
+15 -11
View File
@@ -1,4 +1,4 @@
#!/bin/bash
#!/usr/bin/env bash
set -e
# Read the first argument into a variable
@@ -8,36 +8,40 @@ arg1="$1"
shift
if [[ "$arg1" == '--convert' || "$arg1" == '-c' ]]; then
python3 ./convert-hf-to-gguf.py "$@"
exec python3 ./convert_hf_to_gguf.py "$@"
elif [[ "$arg1" == '--quantize' || "$arg1" == '-q' ]]; then
./quantize "$@"
exec ./llama-quantize "$@"
elif [[ "$arg1" == '--run' || "$arg1" == '-r' ]]; then
./main "$@"
elif [[ "$arg1" == '--finetune' || "$arg1" == '-f' ]]; then
./finetune "$@"
exec ./llama-cli "$@"
elif [[ "$arg1" == '--bench' || "$arg1" == '-b' ]]; then
exec ./llama-bench "$@"
elif [[ "$arg1" == '--perplexity' || "$arg1" == '-p' ]]; then
exec ./llama-perplexity "$@"
elif [[ "$arg1" == '--all-in-one' || "$arg1" == '-a' ]]; then
echo "Converting PTH to GGML..."
for i in `ls $1/$2/ggml-model-f16.bin*`; do
for i in $(ls $1/$2/ggml-model-f16.bin*); do
if [ -f "${i/f16/q4_0}" ]; then
echo "Skip model quantization, it already exists: ${i/f16/q4_0}"
else
echo "Converting PTH to GGML: $i into ${i/f16/q4_0}..."
./quantize "$i" "${i/f16/q4_0}" q4_0
exec ./llama-quantize "$i" "${i/f16/q4_0}" q4_0
fi
done
elif [[ "$arg1" == '--server' || "$arg1" == '-s' ]]; then
./server "$@"
exec ./llama-server "$@"
else
echo "Unknown command: $arg1"
echo "Available commands: "
echo " --run (-r): Run a model previously converted into ggml"
echo " ex: -m /models/7B/ggml-model-q4_0.bin -p \"Building a website can be done in 10 simple steps:\" -n 512"
echo " --bench (-b): Benchmark the performance of the inference for various parameters."
echo " ex: -m model.gguf"
echo " --perplexity (-p): Measure the perplexity of a model over a given text."
echo " ex: -m model.gguf -f file.txt"
echo " --convert (-c): Convert a llama model into ggml"
echo " ex: --outtype f16 \"/models/7B/\" "
echo " --quantize (-q): Optimize with quantization process ggml"
echo " ex: \"/models/7B/ggml-model-f16.bin\" \"/models/7B/ggml-model-q4_0.bin\" 2"
echo " --finetune (-f): Run finetune command to create a lora finetune of the model"
echo " See documentation for finetune for command-line parameters"
echo " --all-in-one (-a): Execute --convert & --quantize"
echo " ex: \"/models/\" 7B"
echo " --server (-s): Run a model on the server"
+89
View File
@@ -0,0 +1,89 @@
ARG UBUNTU_VERSION=24.04
FROM ubuntu:$UBUNTU_VERSION AS build
# Install build tools
RUN apt update && apt install -y git build-essential cmake wget
# Install Vulkan SDK and cURL
RUN wget -qO - https://packages.lunarg.com/lunarg-signing-key-pub.asc | apt-key add - && \
wget -qO /etc/apt/sources.list.d/lunarg-vulkan-noble.list https://packages.lunarg.com/vulkan/lunarg-vulkan-noble.list && \
apt update -y && \
apt-get install -y vulkan-sdk libcurl4-openssl-dev curl
# Build it
WORKDIR /app
COPY . .
RUN cmake -B build -DGGML_NATIVE=OFF -DGGML_VULKAN=1 -DLLAMA_BUILD_TESTS=OFF -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON && \
cmake --build build --config Release -j$(nproc)
RUN mkdir -p /app/lib && \
find build -name "*.so" -exec cp {} /app/lib \;
RUN mkdir -p /app/full \
&& cp build/bin/* /app/full \
&& cp *.py /app/full \
&& cp -r gguf-py /app/full \
&& cp -r requirements /app/full \
&& cp requirements.txt /app/full \
&& cp .devops/tools.sh /app/full/tools.sh
## Base image
FROM ubuntu:$UBUNTU_VERSION AS base
RUN apt-get update \
&& apt-get install -y libgomp1 curl libvulkan-dev \
&& apt autoremove -y \
&& apt clean -y \
&& rm -rf /tmp/* /var/tmp/* \
&& find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
&& find /var/cache -type f -delete
COPY --from=build /app/lib/ /app
### Full
FROM base AS full
COPY --from=build /app/full /app
WORKDIR /app
RUN apt-get update \
&& apt-get install -y \
git \
python3 \
python3-pip \
python3-wheel \
&& pip install --break-system-packages --upgrade setuptools \
&& pip install --break-system-packages -r requirements.txt \
&& apt autoremove -y \
&& apt clean -y \
&& rm -rf /tmp/* /var/tmp/* \
&& find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
&& find /var/cache -type f -delete
ENTRYPOINT ["/app/tools.sh"]
### Light, CLI only
FROM base AS light
COPY --from=build /app/full/llama-cli /app
WORKDIR /app
ENTRYPOINT [ "/app/llama-cli" ]
### Server, Server only
FROM base AS server
ENV LLAMA_ARG_HOST=0.0.0.0
COPY --from=build /app/full/llama-server /app
WORKDIR /app
HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/health" ]
ENTRYPOINT [ "/app/llama-server" ]
+3 -3
View File
@@ -1,7 +1,7 @@
*.o
*.a
.cache/
.git/
# Do not ignore .git directory, otherwise the reported build number will always be 0
.github/
.gitignore
.vs/
@@ -12,8 +12,8 @@ build*/
models/*
/main
/quantize
/llama-cli
/llama-quantize
arm_neon.h
compile_commands.json
+1 -1
View File
@@ -1,5 +1,5 @@
{
"Exclude": ["^\\.gitmodules$"],
"Exclude": ["^\\.gitmodules$", "stb_image\\.h"],
"Disable": {
"IndentSize": true
}
+27 -1
View File
@@ -21,8 +21,34 @@ indent_style = tab
[prompts/*.txt]
insert_final_newline = unset
[examples/server/public/*]
[tools/server/public/*]
indent_size = 2
[tools/server/public/deps_*]
trim_trailing_whitespace = unset
indent_style = unset
indent_size = unset
[tools/server/deps_*]
trim_trailing_whitespace = unset
indent_style = unset
indent_size = unset
[examples/llama.swiftui/llama.swiftui.xcodeproj/*]
indent_style = tab
[tools/cvector-generator/*.txt]
trim_trailing_whitespace = unset
insert_final_newline = unset
[models/templates/*.jinja]
indent_style = unset
indent_size = unset
end_of_line = unset
charset = unset
trim_trailing_whitespace = unset
insert_final_newline = unset
[vendor/miniaudio/miniaudio.h]
trim_trailing_whitespace = unset
insert_final_newline = unset
+2 -1
View File
@@ -2,8 +2,9 @@
max-line-length = 125
ignore = E203,E211,E221,E225,E231,E241,E251,E261,E266,E501,E701,E704,W503
exclude =
# Do not traverse examples
# Do not traverse examples and tools
examples,
tools,
# Do not include package initializers
__init__.py,
# No need to traverse our git directory
-50
View File
@@ -1,50 +0,0 @@
name: Low Severity Bugs
description: Used to report low severity bugs in llama.cpp (e.g. cosmetic issues, non critical UI glitches)
title: "Bug: "
labels: ["bug-unconfirmed", "low severity"]
body:
- type: markdown
attributes:
value: |
Thanks for taking the time to fill out this bug report!
Please include information about your system, the steps to reproduce the bug,
and the version of llama.cpp that you are using.
If possible, please provide a minimal code example that reproduces the bug.
- type: textarea
id: what-happened
attributes:
label: What happened?
description: Also tell us, what did you expect to happen?
placeholder: Tell us what you see!
validations:
required: true
- type: textarea
id: version
attributes:
label: Name and Version
description: Which executable and which version of our software are you running? (use `--version` to get a version string)
placeholder: |
$./main --version
version: 2999 (42b4109e)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
validations:
required: true
- type: dropdown
id: operating-system
attributes:
label: What operating system are you seeing the problem on?
multiple: true
options:
- Linux
- Mac
- Windows
- BSD
- Other? (Please let us know in description)
validations:
required: false
- type: textarea
id: logs
attributes:
label: Relevant log output
description: Please copy and paste any relevant log output. This will be automatically formatted into code, so no need for backticks.
render: shell
@@ -0,0 +1,87 @@
name: Bug (compilation)
description: Something goes wrong when trying to compile llama.cpp.
title: "Compile bug: "
labels: ["bug-unconfirmed", "compilation"]
body:
- type: markdown
attributes:
value: >
Thanks for taking the time to fill out this bug report!
This issue template is intended for bug reports where the compilation of llama.cpp fails.
Before opening an issue, please confirm that the compilation still fails with `-DGGML_CCACHE=OFF`.
If the compilation succeeds with ccache disabled you should be able to permanently fix the issue
by clearing `~/.cache/ccache` (on Linux).
- type: textarea
id: commit
attributes:
label: Git commit
description: Which commit are you trying to compile?
placeholder: |
$git rev-parse HEAD
84a07a17b1b08cf2b9747c633a2372782848a27f
validations:
required: true
- type: dropdown
id: operating-system
attributes:
label: Operating systems
description: Which operating systems do you know to be affected?
multiple: true
options:
- Linux
- Mac
- Windows
- BSD
- Other? (Please let us know in description)
validations:
required: true
- type: dropdown
id: backends
attributes:
label: GGML backends
description: Which GGML backends do you know to be affected?
options: [AMX, BLAS, CPU, CUDA, HIP, Metal, Musa, RPC, SYCL, Vulkan, OpenCL]
multiple: true
validations:
required: true
- type: textarea
id: info
attributes:
label: Problem description & steps to reproduce
description: >
Please give us a summary of the problem and tell us how to reproduce it.
If you can narrow down the bug to specific compile flags, that information would be very much appreciated by us.
placeholder: >
I'm trying to compile llama.cpp with CUDA support on a fresh install of Ubuntu and get error XY.
Here are the exact commands that I used: ...
validations:
required: true
- type: textarea
id: first_bad_commit
attributes:
label: First Bad Commit
description: >
If the bug was not present on an earlier version: when did it start appearing?
If possible, please do a git bisect and identify the exact commit that introduced the bug.
validations:
required: false
- type: textarea
id: command
attributes:
label: Compile command
description: >
Please provide the exact command you used to compile llama.cpp. For example: `cmake -B ...`.
This will be automatically formatted into code, so no need for backticks.
render: shell
validations:
required: true
- type: textarea
id: logs
attributes:
label: Relevant log output
description: >
Please copy and paste any relevant log output, including any generated text.
This will be automatically formatted into code, so no need for backticks.
render: shell
validations:
required: true
+101
View File
@@ -0,0 +1,101 @@
name: Bug (model use)
description: Something goes wrong when using a model (in general, not specific to a single llama.cpp module).
title: "Eval bug: "
labels: ["bug-unconfirmed", "model evaluation"]
body:
- type: markdown
attributes:
value: >
Thanks for taking the time to fill out this bug report!
This issue template is intended for bug reports where the model evaluation results
(i.e. the generated text) are incorrect or llama.cpp crashes during model evaluation.
If you encountered the issue while using an external UI (e.g. ollama),
please reproduce your issue using one of the examples/binaries in this repository.
The `llama-cli` binary can be used for simple and reproducible model inference.
- type: textarea
id: version
attributes:
label: Name and Version
description: Which version of our software are you running? (use `--version` to get a version string)
placeholder: |
$./llama-cli --version
version: 2999 (42b4109e)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
validations:
required: true
- type: dropdown
id: operating-system
attributes:
label: Operating systems
description: Which operating systems do you know to be affected?
multiple: true
options:
- Linux
- Mac
- Windows
- BSD
- Other? (Please let us know in description)
validations:
required: true
- type: dropdown
id: backends
attributes:
label: GGML backends
description: Which GGML backends do you know to be affected?
options: [AMX, BLAS, CPU, CUDA, HIP, Metal, Musa, RPC, SYCL, Vulkan, OpenCL]
multiple: true
validations:
required: true
- type: textarea
id: hardware
attributes:
label: Hardware
description: Which CPUs/GPUs are you using?
placeholder: >
e.g. Ryzen 5950X + 2x RTX 4090
validations:
required: true
- type: textarea
id: model
attributes:
label: Models
description: >
Which model(s) at which quantization were you using when encountering the bug?
If you downloaded a GGUF file off of Huggingface, please provide a link.
placeholder: >
e.g. Meta LLaMA 3.1 Instruct 8b q4_K_M
validations:
required: false
- type: textarea
id: info
attributes:
label: Problem description & steps to reproduce
description: >
Please give us a summary of the problem and tell us how to reproduce it.
If you can narrow down the bug to specific hardware, compile flags, or command line arguments,
that information would be very much appreciated by us.
placeholder: >
e.g. when I run llama-cli with -ngl 99 I get garbled outputs.
When I use -ngl 0 it works correctly.
Here are the exact commands that I used: ...
validations:
required: true
- type: textarea
id: first_bad_commit
attributes:
label: First Bad Commit
description: >
If the bug was not present on an earlier version: when did it start appearing?
If possible, please do a git bisect and identify the exact commit that introduced the bug.
validations:
required: false
- type: textarea
id: logs
attributes:
label: Relevant log output
description: >
Please copy and paste any relevant log output, including the command that you entered and any generated text.
This will be automatically formatted into code, so no need for backticks.
render: shell
validations:
required: true
+91
View File
@@ -0,0 +1,91 @@
name: Bug (misc.)
description: Something is not working the way it should (and it's not covered by any of the above cases).
title: "Misc. bug: "
labels: ["bug-unconfirmed"]
body:
- type: markdown
attributes:
value: >
Thanks for taking the time to fill out this bug report!
This issue template is intended for miscellaneous bugs that don't fit into any other category.
If you encountered the issue while using an external UI (e.g. ollama),
please reproduce your issue using one of the examples/binaries in this repository.
- type: textarea
id: version
attributes:
label: Name and Version
description: Which version of our software is affected? (You can use `--version` to get a version string.)
placeholder: |
$./llama-cli --version
version: 2999 (42b4109e)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
validations:
required: true
- type: dropdown
id: operating-system
attributes:
label: Operating systems
description: Which operating systems do you know to be affected?
multiple: true
options:
- Linux
- Mac
- Windows
- BSD
- Other? (Please let us know in description)
validations:
required: false
- type: dropdown
id: module
attributes:
label: Which llama.cpp modules do you know to be affected?
multiple: true
options:
- Documentation/Github
- libllama (core library)
- llama-cli
- llama-server
- llama-bench
- llama-quantize
- Python/Bash scripts
- Test code
- Other (Please specify in the next section)
validations:
required: false
- type: textarea
id: command
attributes:
label: Command line
description: >
Please provide the exact commands you entered, if applicable. For example: `llama-server -m ... -c ...`, `llama-cli -m ...`, etc.
This will be automatically formatted into code, so no need for backticks.
render: shell
validations:
required: false
- type: textarea
id: info
attributes:
label: Problem description & steps to reproduce
description: >
Please give us a summary of the problem and tell us how to reproduce it (if applicable).
validations:
required: true
- type: textarea
id: first_bad_commit
attributes:
label: First Bad Commit
description: >
If the bug was not present on an earlier version and it's not trivial to track down: when did it start appearing?
If possible, please do a git bisect and identify the exact commit that introduced the bug.
validations:
required: false
- type: textarea
id: logs
attributes:
label: Relevant log output
description: >
If applicable, please copy and paste any relevant log output, including any generated text.
This will be automatically formatted into code, so no need for backticks.
render: shell
validations:
required: false
-50
View File
@@ -1,50 +0,0 @@
name: Medium Severity Bug
description: Used to report medium severity bugs in llama.cpp (e.g. Malfunctioning Features but generally still useable)
title: "Bug: "
labels: ["bug-unconfirmed", "medium severity"]
body:
- type: markdown
attributes:
value: |
Thanks for taking the time to fill out this bug report!
Please include information about your system, the steps to reproduce the bug,
and the version of llama.cpp that you are using.
If possible, please provide a minimal code example that reproduces the bug.
- type: textarea
id: what-happened
attributes:
label: What happened?
description: Also tell us, what did you expect to happen?
placeholder: Tell us what you see!
validations:
required: true
- type: textarea
id: version
attributes:
label: Name and Version
description: Which executable and which version of our software are you running? (use `--version` to get a version string)
placeholder: |
$./main --version
version: 2999 (42b4109e)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
validations:
required: true
- type: dropdown
id: operating-system
attributes:
label: What operating system are you seeing the problem on?
multiple: true
options:
- Linux
- Mac
- Windows
- BSD
- Other? (Please let us know in description)
validations:
required: false
- type: textarea
id: logs
attributes:
label: Relevant log output
description: Please copy and paste any relevant log output. This will be automatically formatted into code, so no need for backticks.
render: shell
@@ -1,12 +1,12 @@
name: Enhancement
description: Used to request enhancements for llama.cpp
description: Used to request enhancements for llama.cpp.
title: "Feature Request: "
labels: ["enhancement"]
body:
- type: markdown
attributes:
value: |
[Please post your idea first in Discussion if there is not yet a consensus for this enhancement request. This will help to keep this issue tracker focused on enhancements that the community has agreed needs to be implemented.](https://github.com/ggerganov/llama.cpp/discussions/categories/ideas)
[Please post your idea first in Discussion if there is not yet a consensus for this enhancement request. This will help to keep this issue tracker focused on enhancements that the community has agreed needs to be implemented.](https://github.com/ggml-org/llama.cpp/discussions/categories/ideas)
- type: checkboxes
id: prerequisites
@@ -16,11 +16,11 @@ body:
options:
- label: I am running the latest code. Mention the version if possible as well.
required: true
- label: I carefully followed the [README.md](https://github.com/ggerganov/llama.cpp/blob/master/README.md).
- label: I carefully followed the [README.md](https://github.com/ggml-org/llama.cpp/blob/master/README.md).
required: true
- label: I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
required: true
- label: I reviewed the [Discussions](https://github.com/ggerganov/llama.cpp/discussions), and have a new and useful enhancement to share.
- label: I reviewed the [Discussions](https://github.com/ggml-org/llama.cpp/discussions), and have a new and useful enhancement to share.
required: true
- type: textarea
-50
View File
@@ -1,50 +0,0 @@
name: High Severity Bug
description: Used to report high severity bugs in llama.cpp (e.g. Malfunctioning features hindering important common workflow)
title: "Bug: "
labels: ["bug-unconfirmed", "high severity"]
body:
- type: markdown
attributes:
value: |
Thanks for taking the time to fill out this bug report!
Please include information about your system, the steps to reproduce the bug,
and the version of llama.cpp that you are using.
If possible, please provide a minimal code example that reproduces the bug.
- type: textarea
id: what-happened
attributes:
label: What happened?
description: Also tell us, what did you expect to happen?
placeholder: Tell us what you see!
validations:
required: true
- type: textarea
id: version
attributes:
label: Name and Version
description: Which executable and which version of our software are you running? (use `--version` to get a version string)
placeholder: |
$./main --version
version: 2999 (42b4109e)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
validations:
required: true
- type: dropdown
id: operating-system
attributes:
label: What operating system are you seeing the problem on?
multiple: true
options:
- Linux
- Mac
- Windows
- BSD
- Other? (Please let us know in description)
validations:
required: false
- type: textarea
id: logs
attributes:
label: Relevant log output
description: Please copy and paste any relevant log output. This will be automatically formatted into code, so no need for backticks.
render: shell
@@ -1,12 +1,12 @@
name: Research
description: Track new technical research area
description: Track new technical research area.
title: "Research: "
labels: ["research 🔬"]
body:
- type: markdown
attributes:
value: |
Don't forget to check for any [duplicate research issue tickets](https://github.com/ggerganov/llama.cpp/issues?q=is%3Aopen+is%3Aissue+label%3A%22research+%F0%9F%94%AC%22)
Don't forget to check for any [duplicate research issue tickets](https://github.com/ggml-org/llama.cpp/issues?q=is%3Aopen+is%3Aissue+label%3A%22research+%F0%9F%94%AC%22)
- type: checkboxes
id: research-stage
@@ -1,50 +0,0 @@
name: Critical Severity Bug
description: Used to report critical severity bugs in llama.cpp (e.g. Crashing, Corrupted, Dataloss)
title: "Bug: "
labels: ["bug-unconfirmed", "critical severity"]
body:
- type: markdown
attributes:
value: |
Thanks for taking the time to fill out this bug report!
Please include information about your system, the steps to reproduce the bug,
and the version of llama.cpp that you are using.
If possible, please provide a minimal code example that reproduces the bug.
- type: textarea
id: what-happened
attributes:
label: What happened?
description: Also tell us, what did you expect to happen?
placeholder: Tell us what you see!
validations:
required: true
- type: textarea
id: version
attributes:
label: Name and Version
description: Which executable and which version of our software are you running? (use `--version` to get a version string)
placeholder: |
$./main --version
version: 2999 (42b4109e)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
validations:
required: true
- type: dropdown
id: operating-system
attributes:
label: What operating system are you seeing the problem on?
multiple: true
options:
- Linux
- Mac
- Windows
- BSD
- Other? (Please let us know in description)
validations:
required: false
- type: textarea
id: logs
attributes:
label: Relevant log output
description: Please copy and paste any relevant log output. This will be automatically formatted into code, so no need for backticks.
render: shell
@@ -1,13 +1,13 @@
name: Refactor (Maintainers)
description: Used to track refactoring opportunities
description: Used to track refactoring opportunities.
title: "Refactor: "
labels: ["refactor"]
body:
- type: markdown
attributes:
value: |
Don't forget to [check for existing refactor issue tickets](https://github.com/ggerganov/llama.cpp/issues?q=is%3Aopen+is%3Aissue+label%3Arefactoring) in case it's already covered.
Also you may want to check [Pull request refactor label as well](https://github.com/ggerganov/llama.cpp/pulls?q=is%3Aopen+is%3Apr+label%3Arefactoring) for duplicates too.
Don't forget to [check for existing refactor issue tickets](https://github.com/ggml-org/llama.cpp/issues?q=is%3Aopen+is%3Aissue+label%3Arefactoring) in case it's already covered.
Also you may want to check [Pull request refactor label as well](https://github.com/ggml-org/llama.cpp/pulls?q=is%3Aopen+is%3Apr+label%3Arefactoring) for duplicates too.
- type: textarea
id: background-description
+3 -5
View File
@@ -1,13 +1,11 @@
blank_issues_enabled: true
contact_links:
- name: Got an idea?
url: https://github.com/ggerganov/llama.cpp/discussions/categories/ideas
url: https://github.com/ggml-org/llama.cpp/discussions/categories/ideas
about: Pop it there. It may then become an enhancement ticket.
- name: Got a question?
url: https://github.com/ggerganov/llama.cpp/discussions/categories/q-a
url: https://github.com/ggml-org/llama.cpp/discussions/categories/q-a
about: Ask a question there!
- name: Want to contribute?
url: https://github.com/ggerganov/llama.cpp/wiki/contribute
url: https://github.com/ggml-org/llama.cpp/wiki/contribute
about: Head to the contribution guide page of the wiki for areas you can help with
+22
View File
@@ -0,0 +1,22 @@
name: "Determine tag name"
description: "Determine the tag name to use for a release"
outputs:
name:
description: "The name of the tag"
value: ${{ steps.tag.outputs.name }}
runs:
using: "composite"
steps:
- name: Determine tag name
id: tag
shell: bash
run: |
BUILD_NUMBER="$(git rev-list --count HEAD)"
SHORT_HASH="$(git rev-parse --short=7 HEAD)"
if [[ "${{ env.BRANCH_NAME }}" == "master" ]]; then
echo "name=b${BUILD_NUMBER}" >> $GITHUB_OUTPUT
else
SAFE_NAME=$(echo "${{ env.BRANCH_NAME }}" | tr '/' '-')
echo "name=${SAFE_NAME}-b${BUILD_NUMBER}-${SHORT_HASH}" >> $GITHUB_OUTPUT
fi
@@ -0,0 +1,67 @@
name: "Windows - Setup CUDA Toolkit"
description: "Setup CUDA Toolkit for Windows"
inputs:
cuda_version:
description: "CUDA toolkit version"
required: true
runs:
using: "composite"
steps:
- name: Install Cuda Toolkit 11.7
if: ${{ inputs.cuda_version == '11.7' }}
shell: pwsh
run: |
mkdir -p "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7"
choco install unzip -y
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_cudart/windows-x86_64/cuda_cudart-windows-x86_64-11.7.99-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_nvcc/windows-x86_64/cuda_nvcc-windows-x86_64-11.7.99-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_nvrtc/windows-x86_64/cuda_nvrtc-windows-x86_64-11.7.99-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/libcublas/windows-x86_64/libcublas-windows-x86_64-11.7.4.6-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_nvtx/windows-x86_64/cuda_nvtx-windows-x86_64-11.7.91-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/visual_studio_integration/windows-x86_64/visual_studio_integration-windows-x86_64-11.7.91-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_nvprof/windows-x86_64/cuda_nvprof-windows-x86_64-11.7.101-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_cccl/windows-x86_64/cuda_cccl-windows-x86_64-11.7.91-archive.zip"
unzip '*.zip' -d "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7"
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\cuda_cudart-windows-x86_64-11.7.99-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\cuda_nvcc-windows-x86_64-11.7.99-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\cuda_nvrtc-windows-x86_64-11.7.99-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\libcublas-windows-x86_64-11.7.4.6-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\cuda_nvtx-windows-x86_64-11.7.91-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\visual_studio_integration-windows-x86_64-11.7.91-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\cuda_nvprof-windows-x86_64-11.7.101-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\cuda_cccl-windows-x86_64-11.7.91-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7" /E /I /H /Y
echo "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\bin" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
echo "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\libnvvp" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
echo "CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7" | Out-File -FilePath $env:GITHUB_ENV -Append -Encoding utf8
echo "CUDA_PATH_V11_7=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7" | Out-File -FilePath $env:GITHUB_ENV -Append -Encoding utf8
- name: Install Cuda Toolkit 12.4
if: ${{ inputs.cuda_version == '12.4' }}
shell: pwsh
run: |
mkdir -p "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4"
choco install unzip -y
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_cudart/windows-x86_64/cuda_cudart-windows-x86_64-12.4.127-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_nvcc/windows-x86_64/cuda_nvcc-windows-x86_64-12.4.131-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_nvrtc/windows-x86_64/cuda_nvrtc-windows-x86_64-12.4.127-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/libcublas/windows-x86_64/libcublas-windows-x86_64-12.4.5.8-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_nvtx/windows-x86_64/cuda_nvtx-windows-x86_64-12.4.127-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_profiler_api/windows-x86_64/cuda_profiler_api-windows-x86_64-12.4.127-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/visual_studio_integration/windows-x86_64/visual_studio_integration-windows-x86_64-12.4.127-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_nvprof/windows-x86_64/cuda_nvprof-windows-x86_64-12.4.127-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_cccl/windows-x86_64/cuda_cccl-windows-x86_64-12.4.127-archive.zip"
unzip '*.zip' -d "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4"
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\cuda_cudart-windows-x86_64-12.4.127-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\cuda_nvcc-windows-x86_64-12.4.131-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\cuda_nvrtc-windows-x86_64-12.4.127-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\libcublas-windows-x86_64-12.4.5.8-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\cuda_nvtx-windows-x86_64-12.4.127-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\cuda_profiler_api-windows-x86_64-12.4.127-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\visual_studio_integration-windows-x86_64-12.4.127-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\cuda_nvprof-windows-x86_64-12.4.127-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\cuda_cccl-windows-x86_64-12.4.127-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4" /E /I /H /Y
echo "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
echo "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\libnvvp" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
echo "CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4" | Out-File -FilePath $env:GITHUB_ENV -Append -Encoding utf8
echo "CUDA_PATH_V12_4=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4" | Out-File -FilePath $env:GITHUB_ENV -Append -Encoding utf8
@@ -0,0 +1,30 @@
name: 'Windows - Setup CURL'
description: 'Composite action, to be reused in other workflow'
inputs:
curl_version:
description: 'CURL version'
required: false
default: '8.6.0_6'
architecture:
description: 'Architecture of the libcurl to download'
required: false
default: 'win64'
outputs:
curl_path:
description: "Path to the downloaded libcurl"
value: ${{ steps.get_libcurl.outputs.curl_path }}
runs:
using: "composite"
steps:
- name: libCURL
id: get_libcurl
shell: powershell
env:
CURL_VERSION: ${{ inputs.curl_version }}
ARCHITECTURE: ${{ inputs.architecture }}
run: |
curl.exe -o $env:RUNNER_TEMP/curl.zip -L "https://curl.se/windows/dl-${env:CURL_VERSION}/curl-${env:CURL_VERSION}-${env:ARCHITECTURE}-mingw.zip"
mkdir $env:RUNNER_TEMP/libcurl
tar.exe -xvf $env:RUNNER_TEMP/curl.zip --strip-components=1 -C $env:RUNNER_TEMP/libcurl
echo "curl_path=$env:RUNNER_TEMP/libcurl" >> $env:GITHUB_OUTPUT
+27 -23
View File
@@ -1,32 +1,27 @@
# https://github.com/actions/labeler
Kompute:
- changed-files:
- any-glob-to-any-file:
- ggml-kompute.h
- ggml-kompute.cpp
- README-kompute.md
Apple Metal:
- changed-files:
- any-glob-to-any-file:
- ggml-metal.h
- ggml-metal.cpp
- ggml/include/ggml-metal.h
- ggml/src/ggml-metal/**
- README-metal.md
SYCL:
- changed-files:
- any-glob-to-any-file:
- ggml-sycl.h
- ggml-sycl.cpp
- README-sycl.md
- ggml/include/ggml-sycl.h
- ggml/src/ggml-sycl/**
- docs/backend/SYCL.md
- examples/sycl/**
Nvidia GPU:
- changed-files:
- any-glob-to-any-file:
- ggml-cuda.h
- ggml-cuda/**
- ggml/include/ggml-cuda.h
- ggml/src/ggml-cuda/**
Vulkan:
- changed-files:
- any-glob-to-any-file:
- ggml_vk_generate_shaders.py
- ggml-vulkan*
- ggml/include/ggml-vulkan.h
- ggml/src/ggml-vulkan/**
documentation:
- changed-files:
- any-glob-to-any-file:
@@ -42,10 +37,11 @@ build:
- cmake/**
- CMakeLists.txt
- CMakePresets.json
- codecov.yml
examples:
- changed-files:
- any-glob-to-any-file: examples/**
- any-glob-to-any-file:
- examples/**
- tools/**
devops:
- changed-files:
- any-glob-to-any-file:
@@ -70,15 +66,11 @@ android:
server:
- changed-files:
- any-glob-to-any-file:
- examples/server/**
- tools/server/**
ggml:
- changed-files:
- any-glob-to-any-file:
- ggml.c
- ggml.h
- ggml-*.c
- ggml-*.h
- ggml-cuda/**
- ggml/**
nix:
- changed-files:
- any-glob-to-any-file:
@@ -88,3 +80,15 @@ nix:
embedding:
- changed-files:
- any-glob-to-any-file: examples/embedding/
Ascend NPU:
- changed-files:
- any-glob-to-any-file:
- ggml/include/ggml-cann.h
- ggml/src/ggml-cann/**
- docs/backend/CANN.md
OpenCL:
- changed-files:
- any-glob-to-any-file:
- ggml/include/ggml-opencl.h
- ggml/src/ggml-opencl/**
+1
View File
@@ -0,0 +1 @@
*Make sure to read the [contributing guidelines](https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md) before submitting a PR*
@@ -1,3 +1,6 @@
# TODO: there have been some issues with the workflow, so disabling for now
# https://github.com/ggml-org/llama.cpp/issues/7893
#
# Benchmark
name: Benchmark
@@ -24,10 +27,10 @@ on:
push:
branches:
- master
paths: ['llama.cpp', 'ggml.c', 'ggml-backend.c', 'ggml-quants.c', '**/*.cu', 'examples/server/*.h*', 'examples/server/*.cpp']
paths: ['llama.cpp', 'ggml.c', 'ggml-backend.cpp', 'ggml-quants.c', '**/*.cu', 'tools/server/*.h*', 'tools/server/*.cpp']
pull_request_target:
types: [opened, synchronize, reopened]
paths: ['llama.cpp', 'ggml.c', 'ggml-backend.c', 'ggml-quants.c', '**/*.cu', 'examples/server/*.h*', 'examples/server/*.cpp']
paths: ['llama.cpp', 'ggml.c', 'ggml-backend.cpp', 'ggml-quants.c', '**/*.cu', 'tools/server/*.h*', 'tools/server/*.cpp']
schedule:
- cron: '04 2 * * *'
@@ -54,17 +57,7 @@ jobs:
if: |
inputs.gpu-series == 'Standard_NC4as_T4_v3'
|| (
github.event_name == 'schedule'
&& github.ref_name == 'master'
&& github.repository_owner == 'ggerganov'
)
|| github.event_name == 'pull_request_target'
|| (
github.event_name == 'push'
&& github.event.ref == 'refs/heads/master'
&& github.repository_owner == 'ggerganov'
)
steps:
- name: Clone
id: checkout
@@ -76,7 +69,7 @@ jobs:
- name: Install python env
id: pipenv
run: |
cd examples/server/bench
cd tools/server/bench
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
@@ -86,7 +79,7 @@ jobs:
run: |
wget --quiet https://github.com/prometheus/prometheus/releases/download/v2.51.0/prometheus-2.51.0.linux-amd64.tar.gz
tar xzf prometheus*.tar.gz --strip-components=1
./prometheus --config.file=examples/server/bench/prometheus.yml &
./prometheus --config.file=tools/server/bench/prometheus.yml &
while ! nc -z localhost 9090; do
sleep 0.1
done
@@ -99,7 +92,7 @@ jobs:
- name: Install k6 and xk6-sse
id: k6_installation
run: |
cd examples/server/bench
cd tools/server/bench
go install go.k6.io/xk6/cmd/xk6@latest
xk6 build master \
--with github.com/phymbert/xk6-sse
@@ -109,9 +102,8 @@ jobs:
run: |
set -eux
cmake -B build \
-DLLAMA_NATIVE=OFF \
-DGGML_NATIVE=OFF \
-DLLAMA_BUILD_SERVER=ON \
-DLLAMA_CURL=ON \
-DLLAMA_CUBLAS=ON \
-DCUDAToolkit_ROOT=/usr/local/cuda \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
@@ -119,25 +111,27 @@ jobs:
-DLLAMA_FATAL_WARNINGS=OFF \
-DLLAMA_ALL_WARNINGS=OFF \
-DCMAKE_BUILD_TYPE=Release;
cmake --build build --config Release -j $(nproc) --target server
cmake --build build --config Release -j $(nproc) --target llama-server
- name: Download the dataset
id: download_dataset
run: |
cd examples/server/bench
cd tools/server/bench
wget --quiet https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
- name: Server bench
id: server_bench
env:
HEAD_REF: ${{ github.head_ref || github.ref_name }}
run: |
set -eux
cd examples/server/bench
cd tools/server/bench
source venv/bin/activate
python bench.py \
--runner-label ${{ env.RUNNER_LABEL }} \
--name ${{ github.job }} \
--branch ${{ github.head_ref || github.ref_name }} \
--branch $HEAD_REF \
--commit ${{ github.event.inputs.sha || github.event.pull_request.head.sha || github.sha }} \
--scenario script.js \
--duration ${{ github.event.inputs.duration || env.DURATION }} \
@@ -163,9 +157,9 @@ jobs:
name: bench-server-${{ github.job }}-${{ env.RUNNER_LABEL }}-${{ matrix.model }}-${{ matrix.ftype }}
compression-level: 9
path: |
examples/server/bench/*.jpg
examples/server/bench/*.json
examples/server/bench/*.log
tools/server/bench/*.jpg
tools/server/bench/*.json
tools/server/bench/*.log
- name: Commit status
uses: Sibz/github-status-action@v1
@@ -184,17 +178,17 @@ jobs:
with:
client_id: ${{secrets.IMGUR_CLIENT_ID}}
path: |
examples/server/bench/prompt_tokens_seconds.jpg
examples/server/bench/predicted_tokens_seconds.jpg
examples/server/bench/kv_cache_usage_ratio.jpg
examples/server/bench/requests_processing.jpg
tools/server/bench/prompt_tokens_seconds.jpg
tools/server/bench/predicted_tokens_seconds.jpg
tools/server/bench/kv_cache_usage_ratio.jpg
tools/server/bench/requests_processing.jpg
- name: Extract mermaid
id: set_mermaid
run: |
set -eux
cd examples/server/bench
cd tools/server/bench
PROMPT_TOKENS_SECONDS=$(cat prompt_tokens_seconds.mermaid)
echo "PROMPT_TOKENS_SECONDS<<EOF" >> $GITHUB_ENV
echo "$PROMPT_TOKENS_SECONDS" >> $GITHUB_ENV
+51
View File
@@ -0,0 +1,51 @@
name: Build relocatable cmake package
on:
workflow_dispatch:
workflow_call:
jobs:
linux:
runs-on: ubuntu-24.04
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Install dependencies
run: |
sudo apt update
sudo apt install -y build-essential tcl
- name: Build
run: |
PREFIX="$(pwd)"/inst
cmake -S . -B build -DCMAKE_PREFIX_PATH="$PREFIX" \
-DLLAMA_CURL=OFF -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_TOOLS=OFF \
-DLLAMA_BUILD_EXAMPLES=OFF -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release
cmake --install build --prefix "$PREFIX" --config Release
export LLAMA_CONFIG="$PREFIX"/lib/cmake/llama/llama-config.cmake
tclsh <<'EOF'
set build(commit) [string trim [exec git rev-parse --short HEAD]]
set build(number) [string trim [exec git rev-list --count HEAD]]
set build(version) "0.0.$build(number)"
set llamaconfig [read [open "$env(LLAMA_CONFIG)" r]]
set checks [list "set\\(LLAMA_VERSION \\s+$build(version)\\)" \
"set\\(LLAMA_BUILD_COMMIT\\s+$build(commit)\\)" \
"set\\(LLAMA_BUILD_NUMBER\\s+$build(number)\\)"]
puts -nonewline "Checking llama-config.cmake version... "
foreach check $checks {
if {![regexp -expanded -- $check $llamaconfig]} {
puts "\"$check\" failed!"
exit 1
}
}
puts "success."
EOF
cd examples/simple-cmake-pkg
cmake -S . -B build -DCMAKE_PREFIX_PATH="$PREFIX"/lib/cmake
cmake --build build
+346
View File
@@ -0,0 +1,346 @@
name: Build on Linux using cross-compiler
on:
workflow_dispatch:
workflow_call:
jobs:
ubuntu-24-riscv64-cpu-cross:
runs-on: ubuntu-24.04
steps:
- uses: actions/checkout@v4
- name: Setup Riscv
run: |
sudo dpkg --add-architecture riscv64
# Add arch-specific repositories for non-amd64 architectures
cat << EOF | sudo tee /etc/apt/sources.list.d/riscv64-ports.list
deb [arch=riscv64] http://ports.ubuntu.com/ubuntu-ports/ noble main universe
deb [arch=riscv64] http://ports.ubuntu.com/ubuntu-ports/ noble-updates main universe
deb [arch=riscv64] http://ports.ubuntu.com/ubuntu-ports/ noble-security main universe
deb [arch=riscv64] http://ports.ubuntu.com/ubuntu-ports/ noble-backports main universe
EOF
sudo apt-get update || true ;# Prevent failure due to missing URLs.
sudo apt-get install -y --no-install-recommends \
build-essential \
gcc-14-riscv64-linux-gnu \
g++-14-riscv64-linux-gnu
- name: Build
run: |
cmake -B build -DLLAMA_CURL=OFF \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_OPENMP=OFF \
-DLLAMA_BUILD_EXAMPLES=ON \
-DLLAMA_BUILD_TOOLS=ON \
-DLLAMA_BUILD_TESTS=OFF \
-DCMAKE_SYSTEM_NAME=Linux \
-DCMAKE_SYSTEM_PROCESSOR=riscv64 \
-DCMAKE_C_COMPILER=riscv64-linux-gnu-gcc-14 \
-DCMAKE_CXX_COMPILER=riscv64-linux-gnu-g++-14 \
-DCMAKE_POSITION_INDEPENDENT_CODE=ON \
-DCMAKE_FIND_ROOT_PATH=/usr/lib/riscv64-linux-gnu \
-DCMAKE_FIND_ROOT_PATH_MODE_PROGRAM=NEVER \
-DCMAKE_FIND_ROOT_PATH_MODE_LIBRARY=ONLY \
-DCMAKE_FIND_ROOT_PATH_MODE_INCLUDE=BOTH
cmake --build build --config Release -j $(nproc)
# ubuntu-24-riscv64-vulkan-cross:
# runs-on: ubuntu-24.04
# steps:
# - uses: actions/checkout@v4
# - name: Setup Riscv
# run: |
# sudo dpkg --add-architecture riscv64
# # Add arch-specific repositories for non-amd64 architectures
# cat << EOF | sudo tee /etc/apt/sources.list.d/riscv64-ports.list
# deb [arch=riscv64] http://ports.ubuntu.com/ubuntu-ports/ noble main universe
# deb [arch=riscv64] http://ports.ubuntu.com/ubuntu-ports/ noble-updates main universe
# deb [arch=riscv64] http://ports.ubuntu.com/ubuntu-ports/ noble-security main universe
# deb [arch=riscv64] http://ports.ubuntu.com/ubuntu-ports/ noble-backports main universe
# EOF
# sudo apt-get update || true ;# Prevent failure due to missing URLs.
# sudo apt-get install -y --no-install-recommends \
# build-essential \
# glslc \
# gcc-14-riscv64-linux-gnu \
# g++-14-riscv64-linux-gnu \
# libvulkan-dev:riscv64
# - name: Build
# run: |
# cmake -B build -DLLAMA_CURL=OFF \
# -DCMAKE_BUILD_TYPE=Release \
# -DGGML_VULKAN=ON \
# -DGGML_OPENMP=OFF \
# -DLLAMA_BUILD_EXAMPLES=ON \
# -DLLAMA_BUILD_TOOLS=ON \
# -DLLAMA_BUILD_TESTS=OFF \
# -DCMAKE_SYSTEM_NAME=Linux \
# -DCMAKE_SYSTEM_PROCESSOR=riscv64 \
# -DCMAKE_C_COMPILER=riscv64-linux-gnu-gcc-14 \
# -DCMAKE_CXX_COMPILER=riscv64-linux-gnu-g++-14 \
# -DCMAKE_POSITION_INDEPENDENT_CODE=ON \
# -DCMAKE_FIND_ROOT_PATH=/usr/lib/riscv64-linux-gnu \
# -DCMAKE_FIND_ROOT_PATH_MODE_PROGRAM=NEVER \
# -DCMAKE_FIND_ROOT_PATH_MODE_LIBRARY=ONLY \
# -DCMAKE_FIND_ROOT_PATH_MODE_INCLUDE=BOTH
# cmake --build build --config Release -j $(nproc)
# ubuntu-24-arm64-vulkan-cross:
# runs-on: ubuntu-24.04
# steps:
# - uses: actions/checkout@v4
# - name: Setup Arm64
# run: |
# sudo dpkg --add-architecture arm64
# # Add arch-specific repositories for non-amd64 architectures
# cat << EOF | sudo tee /etc/apt/sources.list.d/arm64-ports.list
# deb [arch=arm64] http://ports.ubuntu.com/ubuntu-ports/ noble main universe
# deb [arch=arm64] http://ports.ubuntu.com/ubuntu-ports/ noble-updates main universe
# deb [arch=arm64] http://ports.ubuntu.com/ubuntu-ports/ noble-security main universe
# deb [arch=arm64] http://ports.ubuntu.com/ubuntu-ports/ noble-backports main universe
# EOF
# sudo apt-get update || true ;# Prevent failure due to missing URLs.
# sudo apt-get install -y --no-install-recommends \
# build-essential \
# glslc \
# crossbuild-essential-arm64 \
# libvulkan-dev:arm64
# - name: Build
# run: |
# cmake -B build -DLLAMA_CURL=OFF \
# -DCMAKE_BUILD_TYPE=Release \
# -DGGML_VULKAN=ON \
# -DGGML_OPENMP=OFF \
# -DLLAMA_BUILD_EXAMPLES=ON \
# -DLLAMA_BUILD_TOOLS=ON \
# -DLLAMA_BUILD_TESTS=OFF \
# -DCMAKE_SYSTEM_NAME=Linux \
# -DCMAKE_SYSTEM_PROCESSOR=aarch64 \
# -DCMAKE_C_COMPILER=aarch64-linux-gnu-gcc \
# -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++ \
# -DCMAKE_POSITION_INDEPENDENT_CODE=ON \
# -DCMAKE_FIND_ROOT_PATH=/usr/lib/aarch64-linux-gnu \
# -DCMAKE_FIND_ROOT_PATH_MODE_PROGRAM=NEVER \
# -DCMAKE_FIND_ROOT_PATH_MODE_LIBRARY=ONLY \
# -DCMAKE_FIND_ROOT_PATH_MODE_INCLUDE=BOTH
# cmake --build build --config Release -j $(nproc)
ubuntu-24-ppc64el-cpu-cross:
runs-on: ubuntu-24.04
steps:
- uses: actions/checkout@v4
- name: Setup PowerPC64le
run: |
sudo dpkg --add-architecture ppc64el
# Add arch-specific repositories for non-amd64 architectures
cat << EOF | sudo tee /etc/apt/sources.list.d/ppc64el-ports.list
deb [arch=ppc64el] http://ports.ubuntu.com/ubuntu-ports/ noble main universe
deb [arch=ppc64el] http://ports.ubuntu.com/ubuntu-ports/ noble-updates main universe
deb [arch=ppc64el] http://ports.ubuntu.com/ubuntu-ports/ noble-security main universe
deb [arch=ppc64el] http://ports.ubuntu.com/ubuntu-ports/ noble-backports main universe
EOF
sudo apt-get update || true ;# Prevent failure due to missing URLs.
sudo apt-get install -y --no-install-recommends \
build-essential \
gcc-14-powerpc64le-linux-gnu \
g++-14-powerpc64le-linux-gnu
- name: Build
run: |
cmake -B build -DLLAMA_CURL=OFF \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_OPENMP=OFF \
-DLLAMA_BUILD_EXAMPLES=ON \
-DLLAMA_BUILD_TOOLS=ON \
-DLLAMA_BUILD_TESTS=OFF \
-DCMAKE_SYSTEM_NAME=Linux \
-DCMAKE_SYSTEM_PROCESSOR=ppc64 \
-DCMAKE_C_COMPILER=powerpc64le-linux-gnu-gcc-14 \
-DCMAKE_CXX_COMPILER=powerpc64le-linux-gnu-g++-14 \
-DCMAKE_POSITION_INDEPENDENT_CODE=ON \
-DCMAKE_FIND_ROOT_PATH=/usr/lib/powerpc64le-linux-gnu \
-DCMAKE_FIND_ROOT_PATH_MODE_PROGRAM=NEVER \
-DCMAKE_FIND_ROOT_PATH_MODE_LIBRARY=ONLY \
-DCMAKE_FIND_ROOT_PATH_MODE_INCLUDE=BOTH
cmake --build build --config Release -j $(nproc)
# ubuntu-24-ppc64el-vulkan-cross:
# runs-on: ubuntu-24.04
# steps:
# - uses: actions/checkout@v4
# - name: Setup PowerPC64le
# run: |
# sudo dpkg --add-architecture ppc64el
# # Add arch-specific repositories for non-amd64 architectures
# cat << EOF | sudo tee /etc/apt/sources.list.d/ppc64el-ports.list
# deb [arch=ppc64el] http://ports.ubuntu.com/ubuntu-ports/ noble main universe
# deb [arch=ppc64el] http://ports.ubuntu.com/ubuntu-ports/ noble-updates main universe
# deb [arch=ppc64el] http://ports.ubuntu.com/ubuntu-ports/ noble-security main universe
# deb [arch=ppc64el] http://ports.ubuntu.com/ubuntu-ports/ noble-backports main universe
# EOF
# sudo apt-get update || true ;# Prevent failure due to missing URLs.
# sudo apt-get install -y --no-install-recommends \
# build-essential \
# glslc \
# gcc-14-powerpc64le-linux-gnu \
# g++-14-powerpc64le-linux-gnu \
# libvulkan-dev:ppc64el
# - name: Build
# run: |
# cmake -B build -DLLAMA_CURL=OFF \
# -DCMAKE_BUILD_TYPE=Release \
# -DGGML_VULKAN=ON \
# -DGGML_OPENMP=OFF \
# -DLLAMA_BUILD_EXAMPLES=ON \
# -DLLAMA_BUILD_TOOLS=ON \
# -DLLAMA_BUILD_TESTS=OFF \
# -DCMAKE_SYSTEM_NAME=Linux \
# -DCMAKE_SYSTEM_PROCESSOR=ppc64 \
# -DCMAKE_C_COMPILER=powerpc64le-linux-gnu-gcc-14 \
# -DCMAKE_CXX_COMPILER=powerpc64le-linux-gnu-g++-14 \
# -DCMAKE_POSITION_INDEPENDENT_CODE=ON \
# -DCMAKE_FIND_ROOT_PATH=/usr/lib/powerpc64le-linux-gnu \
# -DCMAKE_FIND_ROOT_PATH_MODE_PROGRAM=NEVER \
# -DCMAKE_FIND_ROOT_PATH_MODE_LIBRARY=ONLY \
# -DCMAKE_FIND_ROOT_PATH_MODE_INCLUDE=BOTH
# cmake --build build --config Release -j $(nproc)
debian-13-loongarch64-cpu-cross:
runs-on: ubuntu-24.04
container: debian@sha256:653dfb9f86c3782e8369d5f7d29bb8faba1f4bff9025db46e807fa4c22903671
steps:
- uses: actions/checkout@v4
- name: Setup LoongArch
run: |
rm -f /etc/apt/sources.list.d/*
cat << EOF | tee /etc/apt/sources.list.d/debian-ports.list
deb http://snapshot.debian.org/archive/debian/20250515T202920Z/ trixie main
EOF
( echo 'quiet "true";'; \
echo 'APT::Get::Assume-Yes "true";'; \
echo 'APT::Install-Recommends "false";'; \
echo 'Acquire::Check-Valid-Until "false";'; \
echo 'Acquire::Retries "5";'; \
) > /etc/apt/apt.conf.d/99snapshot-repos
apt-get update
apt-get install -y ca-certificates debian-ports-archive-keyring cmake git zip
dpkg --add-architecture loong64
# Add arch-specific repositories for non-amd64 architectures
cat << EOF | tee /etc/apt/sources.list.d/loong64-ports.list
deb [arch=loong64] http://snapshot.debian.org/archive/debian-ports/20250515T194251Z/ sid main
EOF
apt-get update || true ;# Prevent failure due to missing URLs.
apt-get install -y --no-install-recommends \
build-essential \
gcc-14-loongarch64-linux-gnu \
g++-14-loongarch64-linux-gnu
- name: Build
run: |
cmake -B build -DLLAMA_CURL=OFF \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_OPENMP=OFF \
-DLLAMA_BUILD_EXAMPLES=ON \
-DLLAMA_BUILD_TOOLS=ON \
-DLLAMA_BUILD_TESTS=OFF \
-DCMAKE_SYSTEM_NAME=Linux \
-DCMAKE_SYSTEM_PROCESSOR=loongarch64 \
-DCMAKE_C_COMPILER=loongarch64-linux-gnu-gcc-14 \
-DCMAKE_CXX_COMPILER=loongarch64-linux-gnu-g++-14 \
-DCMAKE_POSITION_INDEPENDENT_CODE=ON \
-DCMAKE_FIND_ROOT_PATH=/usr/lib/loongarch64-linux-gnu \
-DCMAKE_FIND_ROOT_PATH_MODE_PROGRAM=NEVER \
-DCMAKE_FIND_ROOT_PATH_MODE_LIBRARY=ONLY \
-DCMAKE_FIND_ROOT_PATH_MODE_INCLUDE=BOTH
cmake --build build --config Release -j $(nproc)
debian-13-loongarch64-vulkan-cross:
runs-on: ubuntu-24.04
container: debian@sha256:653dfb9f86c3782e8369d5f7d29bb8faba1f4bff9025db46e807fa4c22903671
steps:
- uses: actions/checkout@v4
- name: Setup LoongArch
run: |
rm -f /etc/apt/sources.list.d/*
cat << EOF | tee /etc/apt/sources.list.d/debian-ports.list
deb http://snapshot.debian.org/archive/debian/20250515T202920Z/ trixie main
EOF
( echo 'quiet "true";'; \
echo 'APT::Get::Assume-Yes "true";'; \
echo 'APT::Install-Recommends "false";'; \
echo 'Acquire::Check-Valid-Until "false";'; \
echo 'Acquire::Retries "5";'; \
) > /etc/apt/apt.conf.d/99snapshot-repos
apt-get update
apt-get install -y ca-certificates debian-ports-archive-keyring cmake git zip
dpkg --add-architecture loong64
# Add arch-specific repositories for non-amd64 architectures
cat << EOF | tee /etc/apt/sources.list.d/loong64-ports.list
deb [arch=loong64] http://snapshot.debian.org/archive/debian-ports/20250515T194251Z/ sid main
EOF
apt-get update || true ;# Prevent failure due to missing URLs.
apt-get install -y --no-install-recommends \
build-essential \
glslc \
gcc-14-loongarch64-linux-gnu \
g++-14-loongarch64-linux-gnu \
libvulkan-dev:loong64
- name: Build
run: |
cmake -B build -DLLAMA_CURL=OFF \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_VULKAN=ON \
-DGGML_OPENMP=OFF \
-DLLAMA_BUILD_EXAMPLES=ON \
-DLLAMA_BUILD_TOOLS=ON \
-DLLAMA_BUILD_TESTS=OFF \
-DCMAKE_SYSTEM_NAME=Linux \
-DCMAKE_SYSTEM_PROCESSOR=loongarch64 \
-DCMAKE_C_COMPILER=loongarch64-linux-gnu-gcc-14 \
-DCMAKE_CXX_COMPILER=loongarch64-linux-gnu-g++-14 \
-DCMAKE_POSITION_INDEPENDENT_CODE=ON \
-DCMAKE_FIND_ROOT_PATH=/usr/lib/loongarch64-linux-gnu \
-DCMAKE_FIND_ROOT_PATH_MODE_PROGRAM=NEVER \
-DCMAKE_FIND_ROOT_PATH_MODE_LIBRARY=ONLY \
-DCMAKE_FIND_ROOT_PATH_MODE_INCLUDE=BOTH
cmake --build build --config Release -j $(nproc)
File diff suppressed because it is too large Load Diff
+6 -1
View File
@@ -3,6 +3,11 @@ on:
schedule:
- cron: "42 0 * * *"
# Fine-grant permission
# https://docs.github.com/en/actions/security-for-github-actions/security-guides/automatic-token-authentication#modifying-the-permissions-for-the-github_token
permissions:
issues: write
jobs:
close-issues:
runs-on: ubuntu-latest
@@ -12,7 +17,7 @@ jobs:
steps:
- uses: actions/stale@v5
with:
exempt-issue-labels: "refactor,help wanted,good first issue,research,bug"
exempt-issue-labels: "refactoring,help wanted,good first issue,research,bug,roadmap"
days-before-issue-stale: 30
days-before-issue-close: 14
stale-issue-label: "stale"
-40
View File
@@ -1,40 +0,0 @@
name: Code Coverage
on: [push, pull_request]
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref && github.ref || github.run_id }}
cancel-in-progress: true
jobs:
run:
runs-on: ubuntu-20.04
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Dependencies
run: |
sudo apt-get update
sudo apt-get install build-essential gcc-8 lcov
- name: Build
run: CC=gcc-8 make -j LLAMA_CODE_COVERAGE=1 tests
- name: Run tests
run: CC=gcc-8 make test
- name: Generate coverage report
run: |
make coverage
make lcov-report
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v3
env:
CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
with:
files: lcov-report/coverage.info
+112 -51
View File
@@ -10,49 +10,55 @@
name: Publish Docker image
on:
pull_request:
push:
branches:
- master
workflow_dispatch: # allows manual triggering
schedule:
# Rebuild daily rather than on every push because it is expensive
- cron: '12 4 * * *'
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref && github.ref || github.run_id }}
cancel-in-progress: true
# Fine-grant permission
# https://docs.github.com/en/actions/security-for-github-actions/security-guides/automatic-token-authentication#modifying-the-permissions-for-the-github_token
permissions:
packages: write
jobs:
push_to_registry:
name: Push Docker image to Docker Hub
if: github.event.pull_request.draft == false
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
env:
COMMIT_SHA: ${{ github.sha }}
strategy:
fail-fast: false
matrix:
config:
- { tag: "light", dockerfile: ".devops/main.Dockerfile", platforms: "linux/amd64,linux/arm64" }
- { tag: "full", dockerfile: ".devops/full.Dockerfile", platforms: "linux/amd64,linux/arm64" }
- { tag: "server", dockerfile: ".devops/server.Dockerfile", platforms: "linux/amd64,linux/arm64" }
# NOTE(canardletter): The CUDA builds on arm64 are very slow, so I
# have disabled them for now until the reason why
# is understood.
- { tag: "light-cuda", dockerfile: ".devops/main-cuda.Dockerfile", platforms: "linux/amd64" }
- { tag: "full-cuda", dockerfile: ".devops/full-cuda.Dockerfile", platforms: "linux/amd64" }
- { tag: "server-cuda", dockerfile: ".devops/server-cuda.Dockerfile", platforms: "linux/amd64" }
- { tag: "light-rocm", dockerfile: ".devops/main-rocm.Dockerfile", platforms: "linux/amd64,linux/arm64" }
- { tag: "full-rocm", dockerfile: ".devops/full-rocm.Dockerfile", platforms: "linux/amd64,linux/arm64" }
- { tag: "server-rocm", dockerfile: ".devops/server-rocm.Dockerfile", platforms: "linux/amd64,linux/arm64" }
- { tag: "light-intel", dockerfile: ".devops/main-intel.Dockerfile", platforms: "linux/amd64" }
- { tag: "server-intel", dockerfile: ".devops/server-intel.Dockerfile", platforms: "linux/amd64" }
# Multi-stage build
# Note: the arm64 images are failing, which prevents the amd64 images from being built
# https://github.com/ggml-org/llama.cpp/issues/11888
#- { tag: "cpu", dockerfile: ".devops/cpu.Dockerfile", platforms: "linux/amd64,linux/arm64", full: true, light: true, server: true, free_disk_space: false }
- { tag: "cpu", dockerfile: ".devops/cpu.Dockerfile", platforms: "linux/amd64", full: true, light: true, server: true, free_disk_space: false }
- { tag: "cuda", dockerfile: ".devops/cuda.Dockerfile", platforms: "linux/amd64", full: true, light: true, server: true, free_disk_space: false }
- { tag: "musa", dockerfile: ".devops/musa.Dockerfile", platforms: "linux/amd64", full: true, light: true, server: true, free_disk_space: true }
- { tag: "intel", dockerfile: ".devops/intel.Dockerfile", platforms: "linux/amd64", full: true, light: true, server: true, free_disk_space: true }
- { tag: "vulkan", dockerfile: ".devops/vulkan.Dockerfile", platforms: "linux/amd64", full: true, light: true, server: true, free_disk_space: false }
# Note: the rocm images are failing due to a compiler error and are disabled until this is fixed to allow the workflow to complete
#- {tag: "rocm", dockerfile: ".devops/rocm.Dockerfile", platforms: "linux/amd64,linux/arm64", full: true, light: true, server: true, free_disk_space: true }
steps:
- name: Check out the repo
uses: actions/checkout@v4
with:
fetch-depth: 0 # preserve git history, so we can determine the build number
- name: Set up QEMU
uses: docker/setup-qemu-action@v2
uses: docker/setup-qemu-action@v3
with:
image: tonistiigi/binfmt:qemu-v7.0.0-28
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v2
uses: docker/setup-buildx-action@v3
- name: Log in to Docker Hub
uses: docker/login-action@v2
@@ -61,9 +67,45 @@ jobs:
username: ${{ github.repository_owner }}
password: ${{ secrets.GITHUB_TOKEN }}
# https://github.com/jlumbroso/free-disk-space/tree/54081f138730dfa15788a46383842cd2f914a1be#example
- name: Determine tag name
id: tag
shell: bash
run: |
BUILD_NUMBER="$(git rev-list --count HEAD)"
SHORT_HASH="$(git rev-parse --short=7 HEAD)"
REPO_OWNER="${GITHUB_REPOSITORY_OWNER@L}" # to lower case
REPO_NAME="${{ github.event.repository.name }}"
# determine tag name postfix (build number, commit hash)
if [[ "${{ env.GITHUB_BRANCH_NAME }}" == "master" ]]; then
TAG_POSTFIX="-b${BUILD_NUMBER}"
else
SAFE_NAME=$(echo "${{ env.GITHUB_BRANCH_NAME }}" | tr '/' '-')
TAG_POSTFIX="-${SAFE_NAME}-${SHORT_HASH}"
fi
# list all tags possible
if [[ "${{ matrix.config.tag }}" == "cpu" ]]; then
TYPE=""
else
TYPE="-${{ matrix.config.tag }}"
fi
PREFIX="ghcr.io/${REPO_OWNER}/${REPO_NAME}:"
FULLTAGS="${PREFIX}full${TYPE},${PREFIX}full${TYPE}${TAG_POSTFIX}"
LIGHTTAGS="${PREFIX}light${TYPE},${PREFIX}light${TYPE}${TAG_POSTFIX}"
SERVERTAGS="${PREFIX}server${TYPE},${PREFIX}server${TYPE}${TAG_POSTFIX}"
echo "full_output_tags=$FULLTAGS" >> $GITHUB_OUTPUT
echo "light_output_tags=$LIGHTTAGS" >> $GITHUB_OUTPUT
echo "server_output_tags=$SERVERTAGS" >> $GITHUB_OUTPUT
echo "full_output_tags=$FULLTAGS" # print out for debugging
echo "light_output_tags=$LIGHTTAGS" # print out for debugging
echo "server_output_tags=$SERVERTAGS" # print out for debugging
env:
GITHUB_BRANCH_NAME: ${{ github.head_ref || github.ref_name }}
GITHUB_REPOSITORY_OWNER: '${{ github.repository_owner }}'
- name: Free Disk Space (Ubuntu)
uses: jlumbroso/free-disk-space@main
if: ${{ matrix.config.free_disk_space == true }}
uses: ggml-org/free-disk-space@v1.3.1
with:
# this might remove tools that are actually needed,
# if set to "true" but frees about 6 GB
@@ -78,40 +120,59 @@ jobs:
docker-images: true
swap-storage: true
- name: Determine tag name
id: tag
shell: bash
run: |
BUILD_NUMBER="$(git rev-list --count HEAD)"
SHORT_HASH="$(git rev-parse --short=7 HEAD)"
if [[ "${{ env.BRANCH_NAME }}" == "master" ]]; then
echo "name=b${BUILD_NUMBER}" >> $GITHUB_OUTPUT
else
SAFE_NAME=$(echo "${{ env.BRANCH_NAME }}" | tr '/' '-')
echo "name=${SAFE_NAME}-b${BUILD_NUMBER}-${SHORT_HASH}" >> $GITHUB_OUTPUT
fi
- name: Downcase github.repository_owner
run: |
echo "repository_owner_lowercase=${GITHUB_REPOSITORY_OWNER@L}" >> $GITHUB_ENV
env:
GITHUB_REPOSITORY_OWNER: '${{ github.repository_owner }}'
- name: Build and push Docker image (versioned)
if: github.event_name == 'push'
uses: docker/build-push-action@v4
- name: Build and push Full Docker image (tagged + versioned)
if: ${{ (github.event_name == 'push' || github.event_name == 'schedule' || github.event_name == 'workflow_dispatch') && matrix.config.full == true }}
uses: docker/build-push-action@v6
with:
context: .
push: true
platforms: ${{ matrix.config.platforms }}
tags: "ghcr.io/${{ env.repository_owner_lowercase }}/llama.cpp:${{ matrix.config.tag }}-${{ env.COMMIT_SHA }}"
# tag list is generated from step above
tags: ${{ steps.tag.outputs.full_output_tags }}
file: ${{ matrix.config.dockerfile }}
target: full
provenance: false
# using github experimental cache
cache-from: type=gha
cache-to: type=gha,mode=max
# return to this if the experimental github cache is having issues
#cache-to: type=local,dest=/tmp/.buildx-cache
#cache-from: type=local,src=/tmp/.buildx-cache
- name: Build and push Docker image (tagged)
uses: docker/build-push-action@v4
- name: Build and push Light Docker image (tagged + versioned)
if: ${{ (github.event_name == 'push' || github.event_name == 'schedule' || github.event_name == 'workflow_dispatch') && matrix.config.light == true }}
uses: docker/build-push-action@v6
with:
context: .
push: ${{ github.event_name == 'push' }}
push: true
platforms: ${{ matrix.config.platforms }}
tags: "ghcr.io/${{ env.repository_owner_lowercase }}/llama.cpp:${{ matrix.config.tag }},ghcr.io/${{ env.repository_owner_lowercase }}/llama.cpp:${{ matrix.config.tag }}-${{ steps.tag.outputs.name }}"
# tag list is generated from step above
tags: ${{ steps.tag.outputs.light_output_tags }}
file: ${{ matrix.config.dockerfile }}
target: light
provenance: false
# using github experimental cache
cache-from: type=gha
cache-to: type=gha,mode=max
# return to this if the experimental github cache is having issues
#cache-to: type=local,dest=/tmp/.buildx-cache
#cache-from: type=local,src=/tmp/.buildx-cache
- name: Build and push Server Docker image (tagged + versioned)
if: ${{ (github.event_name == 'push' || github.event_name == 'schedule' || github.event_name == 'workflow_dispatch') && matrix.config.server == true }}
uses: docker/build-push-action@v6
with:
context: .
push: true
platforms: ${{ matrix.config.platforms }}
# tag list is generated from step above
tags: ${{ steps.tag.outputs.server_output_tags }}
file: ${{ matrix.config.dockerfile }}
target: server
provenance: false
# using github experimental cache
cache-from: type=gha
cache-to: type=gha,mode=max
# return to this if the experimental github cache is having issues
#cache-to: type=local,dest=/tmp/.buildx-cache
#cache-from: type=local,src=/tmp/.buildx-cache
+3 -1
View File
@@ -23,5 +23,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: editorconfig-checker/action-editorconfig-checker@main
- uses: editorconfig-checker/action-editorconfig-checker@v2
with:
version: v3.0.3
- run: editorconfig-checker
+1 -1
View File
@@ -11,7 +11,7 @@ jobs:
steps:
- uses: actions/checkout@v4
with:
repository: "ggerganov/llama.cpp"
repository: "ggml-org/llama.cpp"
- uses: actions/labeler@v5
with:
configuration-path: '.github/labeler.yml'
-65
View File
@@ -1,65 +0,0 @@
name: Nix aarch64 builds
on:
workflow_dispatch: # allows manual triggering
schedule:
# Rebuild daily rather than on every push because QEMU is expensive (e.g.
# 1.5h instead of minutes with the cold cache).
#
# randint(0, 59), randint(0, 23)
- cron: '26 12 * * *'
# But also rebuild if we touched any of the Nix expressions:
push:
branches:
- master
paths: ['**/*.nix', 'flake.lock']
pull_request:
types: [opened, synchronize, reopened]
paths: ['**/*.nix', 'flake.lock']
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref && github.ref || github.run_id }}
cancel-in-progress: true
jobs:
nix-build-aarch64:
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Install QEMU
# Copy-paste from https://github.com/orgs/community/discussions/8305#discussioncomment-5888654
run: |
sudo apt-get update
sudo apt-get install -y qemu-user-static qemu-system-aarch64
sudo usermod -a -G kvm $USER
- name: Install Nix
uses: DeterminateSystems/nix-installer-action@v9
with:
github-token: ${{ secrets.GITHUB_TOKEN }}
extra-conf: |
extra-platforms = aarch64-linux
extra-system-features = nixos-test kvm
extra-substituters = https://llama-cpp.cachix.org https://cuda-maintainers.cachix.org
extra-trusted-public-keys = llama-cpp.cachix.org-1:H75X+w83wUKTIPSO1KWy9ADUrzThyGs8P5tmAbkWhQc= cuda-maintainers.cachix.org-1:0dq3bujKpuEPMCX6U4WylrUDZ9JyUG0VpVZa7CNfq5E=
- uses: DeterminateSystems/magic-nix-cache-action@v2
with:
upstream-cache: https://${{ matrix.cachixName }}.cachix.org
- name: Set-up cachix to push the results to
uses: cachix/cachix-action@v13
with:
authToken: '${{ secrets.CACHIX_AUTH_TOKEN }}'
name: llama-cpp
- name: Show all output paths
run: >
nix run github:nix-community/nix-eval-jobs
-- --gc-roots-dir gcroot
--flake
".#packages.aarch64-linux"
- name: Build
run: >
nix run github:Mic92/nix-fast-build
-- --skip-cached --no-nom
--systems aarch64-linux
--flake
".#checks.aarch64-linux"
-72
View File
@@ -1,72 +0,0 @@
name: Nix CI
on:
workflow_dispatch: # allows manual triggering
push:
branches:
- master
pull_request:
types: [opened, synchronize, reopened]
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref && github.ref || github.run_id }}
cancel-in-progress: true
jobs:
nix-eval:
strategy:
fail-fast: false
matrix:
os: [ ubuntu-latest, macos-latest ]
runs-on: ${{ matrix.os }}
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Install Nix
uses: DeterminateSystems/nix-installer-action@v9
with:
github-token: ${{ secrets.GITHUB_TOKEN }}
extra-conf: |
extra-substituters = https://llama-cpp.cachix.org https://cuda-maintainers.cachix.org
extra-trusted-public-keys = llama-cpp.cachix.org-1:H75X+w83wUKTIPSO1KWy9ADUrzThyGs8P5tmAbkWhQc= cuda-maintainers.cachix.org-1:0dq3bujKpuEPMCX6U4WylrUDZ9JyUG0VpVZa7CNfq5E=
- uses: DeterminateSystems/magic-nix-cache-action@v2
with:
upstream-cache: https://${{ matrix.cachixName }}.cachix.org
- name: List all flake outputs
run: nix flake show --all-systems
- name: Show all output paths
run: >
nix run github:nix-community/nix-eval-jobs
-- --gc-roots-dir gcroot
--flake
".#packages.$(nix eval --raw --impure --expr builtins.currentSystem)"
nix-build:
strategy:
fail-fast: false
matrix:
os: [ ubuntu-latest, macos-latest ]
runs-on: ${{ matrix.os }}
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Install Nix
uses: DeterminateSystems/nix-installer-action@v9
with:
github-token: ${{ secrets.GITHUB_TOKEN }}
extra-conf: |
extra-substituters = https://llama-cpp.cachix.org https://cuda-maintainers.cachix.org
extra-trusted-public-keys = llama-cpp.cachix.org-1:H75X+w83wUKTIPSO1KWy9ADUrzThyGs8P5tmAbkWhQc= cuda-maintainers.cachix.org-1:0dq3bujKpuEPMCX6U4WylrUDZ9JyUG0VpVZa7CNfq5E=
- uses: DeterminateSystems/magic-nix-cache-action@v2
with:
upstream-cache: https://${{ matrix.cachixName }}.cachix.org
- name: Set-up cachix to push the results to
uses: cachix/cachix-action@v13
with:
authToken: '${{ secrets.CACHIX_AUTH_TOKEN }}'
name: llama-cpp
- name: Build
run: >
nix run github:Mic92/nix-fast-build
-- --skip-cached --no-nom
--flake
".#checks.$(nix eval --raw --impure --expr builtins.currentSystem)"
-22
View File
@@ -1,22 +0,0 @@
name: update-flake-lock
on:
workflow_dispatch:
schedule:
- cron: '0 0 * * 0' # runs weekly on Sunday at 00:00
jobs:
lockfile:
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Install Nix
uses: DeterminateSystems/nix-installer-action@main
- name: Update flake.lock
uses: DeterminateSystems/update-flake-lock@main
with:
pr-title: "nix: update flake.lock"
pr-labels: |
nix
pr-reviewers: philiptaron,SomeoneSerge
token: ${{ secrets.FLAKE_TOKEN }}
-36
View File
@@ -1,36 +0,0 @@
# Make the flake discoverable on https://flakestry.dev and https://flakehub.com/flakes
name: "Publish a flake to flakestry & flakehub"
on:
push:
tags:
- "*"
workflow_dispatch:
inputs:
tag:
description: "The existing tag to publish"
type: "string"
required: true
jobs:
flakestry-publish:
runs-on: ubuntu-latest
permissions:
id-token: "write"
contents: "read"
steps:
- uses: flakestry/flakestry-publish@main
with:
version: "${{ inputs.tag || github.ref_name }}"
flakehub-publish:
runs-on: "ubuntu-latest"
permissions:
id-token: "write"
contents: "read"
steps:
- uses: "actions/checkout@v4"
with:
ref: "${{ (inputs.tag != null) && format('refs/tags/{0}', inputs.tag) || '' }}"
- uses: "DeterminateSystems/nix-installer-action@main"
- uses: "DeterminateSystems/flakehub-push@main"
with:
visibility: "public"
tag: "${{ inputs.tag }}"
@@ -6,15 +6,13 @@ on:
- '.github/workflows/python-check-requirements.yml'
- 'scripts/check-requirements.sh'
- 'convert*.py'
- 'requirements.txt'
- 'requirements/*.txt'
- '**/requirements*.txt'
pull_request:
paths:
- '.github/workflows/python-check-requirements.yml'
- 'scripts/check-requirements.sh'
- 'convert*.py'
- 'requirements.txt'
- 'requirements/*.txt'
- '**/requirements*.txt'
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref && github.ref || github.run_id }}
+8 -1
View File
@@ -1,6 +1,13 @@
name: flake8 Lint
on: [push, pull_request]
on:
push:
branches:
- master
paths: ['.github/workflows/python-lint.yml', '**/*.py']
pull_request:
types: [opened, synchronize, reopened]
paths: ['.github/workflows/python-lint.yml', '**/*.py']
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref && github.ref || github.run_id }}
+40
View File
@@ -0,0 +1,40 @@
name: Python Type-Check
on:
push:
paths:
- '.github/workflows/python-type-check.yml'
- 'pyrightconfig.json'
- '**.py'
- '**/requirements*.txt'
pull_request:
paths:
- '.github/workflows/python-type-check.yml'
- 'pyrightconfig.json'
- '**.py'
- '**/requirements*.txt'
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref && github.ref || github.run_id }}
cancel-in-progress: true
jobs:
python-type-check:
runs-on: ubuntu-latest
name: pyright type-check
steps:
- name: Check out source repository
uses: actions/checkout@v4
- name: Set up Python environment
uses: actions/setup-python@v5
with:
python-version: "3.11"
- name: Install Python dependencies
# TODO: use a venv
run: pip install -r requirements/requirements-all.txt
- name: Type-check with Pyright
uses: jakebailey/pyright-action@v2
with:
version: 1.1.382
level: warning
warnings: true
+755
View File
@@ -0,0 +1,755 @@
name: Release
on:
workflow_dispatch: # allows manual triggering
inputs:
create_release:
description: 'Create new release'
required: true
type: boolean
push:
branches:
- master
paths: ['.github/workflows/release.yml', '**/CMakeLists.txt', '**/.cmake', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.cuh', '**/*.swift', '**/*.m', '**/*.metal', '**/*.comp']
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref && github.ref || github.run_id }}
cancel-in-progress: true
env:
BRANCH_NAME: ${{ github.head_ref || github.ref_name }}
CMAKE_ARGS: "-DLLAMA_BUILD_EXAMPLES=OFF -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_TOOLS=ON -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON"
jobs:
macOS-arm64:
runs-on: macos-14
steps:
- name: Clone
id: checkout
uses: actions/checkout@v4
with:
fetch-depth: 0
- name: ccache
uses: hendrikmuhs/ccache-action@v1.2.16
with:
key: macOS-latest-cmake-arm64
evict-old-files: 1d
- name: Dependencies
id: depends
continue-on-error: true
run: |
brew update
brew install curl
- name: Build
id: cmake_build
run: |
sysctl -a
cmake -B build \
-DCMAKE_INSTALL_RPATH='@loader_path' \
-DCMAKE_BUILD_WITH_INSTALL_RPATH=ON \
-DLLAMA_FATAL_WARNINGS=ON \
-DGGML_METAL_USE_BF16=ON \
-DGGML_METAL_EMBED_LIBRARY=ON \
-DGGML_RPC=ON \
${{ env.CMAKE_ARGS }}
cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)
- name: Determine tag name
id: tag
uses: ./.github/actions/get-tag-name
- name: Pack artifacts
id: pack_artifacts
run: |
cp LICENSE ./build/bin/
zip -r llama-${{ steps.tag.outputs.name }}-bin-macos-arm64.zip ./build/bin/*
- name: Upload artifacts
uses: actions/upload-artifact@v4
with:
path: llama-${{ steps.tag.outputs.name }}-bin-macos-arm64.zip
name: llama-bin-macos-arm64.zip
macOS-x64:
runs-on: macos-13
steps:
- name: Clone
id: checkout
uses: actions/checkout@v4
with:
fetch-depth: 0
- name: ccache
uses: hendrikmuhs/ccache-action@v1.2.16
with:
key: macOS-latest-cmake-x64
evict-old-files: 1d
- name: Dependencies
id: depends
continue-on-error: true
run: |
brew update
brew install curl
- name: Build
id: cmake_build
run: |
sysctl -a
# Metal is disabled due to intermittent failures with Github runners not having a GPU:
# https://github.com/ggml-org/llama.cpp/actions/runs/8635935781/job/23674807267#step:5:2313
cmake -B build \
-DCMAKE_INSTALL_RPATH='@loader_path' \
-DCMAKE_BUILD_WITH_INSTALL_RPATH=ON \
-DLLAMA_FATAL_WARNINGS=ON \
-DGGML_METAL=OFF \
-DGGML_RPC=ON
cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)
- name: Determine tag name
id: tag
uses: ./.github/actions/get-tag-name
- name: Pack artifacts
id: pack_artifacts
run: |
cp LICENSE ./build/bin/
zip -r llama-${{ steps.tag.outputs.name }}-bin-macos-x64.zip ./build/bin/*
- name: Upload artifacts
uses: actions/upload-artifact@v4
with:
path: llama-${{ steps.tag.outputs.name }}-bin-macos-x64.zip
name: llama-bin-macos-x64.zip
ubuntu-22-cpu:
strategy:
matrix:
include:
- build: 'x64'
os: ubuntu-22.04
# GGML_BACKEND_DL and GGML_CPU_ALL_VARIANTS are not currently supported on arm
# - build: 'arm64'
# os: ubuntu-22.04-arm
runs-on: ${{ matrix.os }}
steps:
- name: Clone
id: checkout
uses: actions/checkout@v4
with:
fetch-depth: 0
- name: ccache
uses: hendrikmuhs/ccache-action@v1.2.16
with:
key: ubuntu-cpu-cmake
evict-old-files: 1d
- name: Dependencies
id: depends
run: |
sudo apt-get update
sudo apt-get install build-essential libcurl4-openssl-dev
- name: Build
id: cmake_build
run: |
cmake -B build \
-DCMAKE_INSTALL_RPATH='$ORIGIN' \
-DCMAKE_BUILD_WITH_INSTALL_RPATH=ON \
-DGGML_BACKEND_DL=ON \
-DGGML_NATIVE=OFF \
-DGGML_CPU_ALL_VARIANTS=ON \
-DLLAMA_FATAL_WARNINGS=ON \
${{ env.CMAKE_ARGS }}
cmake --build build --config Release -j $(nproc)
- name: Determine tag name
id: tag
uses: ./.github/actions/get-tag-name
- name: Pack artifacts
id: pack_artifacts
run: |
cp LICENSE ./build/bin/
zip -r llama-${{ steps.tag.outputs.name }}-bin-ubuntu-${{ matrix.build }}.zip ./build/bin/*
- name: Upload artifacts
uses: actions/upload-artifact@v4
with:
path: llama-${{ steps.tag.outputs.name }}-bin-ubuntu-${{ matrix.build }}.zip
name: llama-bin-ubuntu-${{ matrix.build }}.zip
ubuntu-22-vulkan:
runs-on: ubuntu-22.04
steps:
- name: Clone
id: checkout
uses: actions/checkout@v4
with:
fetch-depth: 0
- name: ccache
uses: hendrikmuhs/ccache-action@v1.2.16
with:
key: ubuntu-22-cmake-vulkan
evict-old-files: 1d
- name: Dependencies
id: depends
run: |
wget -qO - https://packages.lunarg.com/lunarg-signing-key-pub.asc | sudo apt-key add -
sudo wget -qO /etc/apt/sources.list.d/lunarg-vulkan-jammy.list https://packages.lunarg.com/vulkan/lunarg-vulkan-jammy.list
sudo apt-get update -y
sudo apt-get install -y build-essential mesa-vulkan-drivers vulkan-sdk libcurl4-openssl-dev
- name: Build
id: cmake_build
run: |
cmake -B build \
-DCMAKE_INSTALL_RPATH='$ORIGIN' \
-DCMAKE_BUILD_WITH_INSTALL_RPATH=ON \
-DGGML_BACKEND_DL=ON \
-DGGML_NATIVE=OFF \
-DGGML_CPU_ALL_VARIANTS=ON \
-DGGML_VULKAN=ON \
${{ env.CMAKE_ARGS }}
cmake --build build --config Release -j $(nproc)
- name: Determine tag name
id: tag
uses: ./.github/actions/get-tag-name
- name: Pack artifacts
id: pack_artifacts
run: |
cp LICENSE ./build/bin/
zip -r llama-${{ steps.tag.outputs.name }}-bin-ubuntu-vulkan-x64.zip ./build/bin/*
- name: Upload artifacts
uses: actions/upload-artifact@v4
with:
path: llama-${{ steps.tag.outputs.name }}-bin-ubuntu-vulkan-x64.zip
name: llama-bin-ubuntu-vulkan-x64.zip
windows-cpu:
runs-on: windows-2025
strategy:
matrix:
include:
- arch: 'x64'
- arch: 'arm64'
steps:
- name: Clone
uses: actions/checkout@v4
with:
fetch-depth: 0
- name: ccache
uses: hendrikmuhs/ccache-action@v1.2.16
with:
key: windows-latest-cmake-cpu-${{ matrix.arch }}
variant: ccache
evict-old-files: 1d
- name: Install Ninja
run: |
choco install ninja
- name: libCURL
id: get_libcurl
uses: ./.github/actions/windows-setup-curl
with:
architecture: ${{ matrix.arch == 'x64' && 'win64' || 'win64a' }}
- name: Build
shell: cmd
env:
CURL_PATH: ${{ steps.get_libcurl.outputs.curl_path }}
run: |
call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" ${{ matrix.arch == 'x64' && 'x64' || 'amd64_arm64' }}
cmake -S . -B build -G "Ninja Multi-Config" ^
-D CMAKE_TOOLCHAIN_FILE=cmake/${{ matrix.arch }}-windows-llvm.cmake ^
-DGGML_NATIVE=OFF ^
-DGGML_BACKEND_DL=ON ^
-DGGML_CPU_ALL_VARIANTS=${{ matrix.arch == 'x64' && 'ON' || 'OFF' }} ^
-DGGML_OPENMP=ON ^
-DCURL_LIBRARY="%CURL_PATH%/lib/libcurl.dll.a" -DCURL_INCLUDE_DIR="%CURL_PATH%/include" ^
${{ env.CMAKE_ARGS }}
cmake --build build --config Release
- name: Pack artifacts
id: pack_artifacts
env:
CURL_PATH: ${{ steps.get_libcurl.outputs.curl_path }}
run: |
Copy-Item $env:CURL_PATH\bin\libcurl-${{ matrix.arch }}.dll .\build\bin\Release\
Copy-Item "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Redist\MSVC\14.44.35112\debug_nonredist\${{ matrix.arch }}\Microsoft.VC143.OpenMP.LLVM\libomp140.${{ matrix.arch == 'x64' && 'x86_64' || 'aarch64' }}.dll" .\build\bin\Release\
7z a llama-bin-win-cpu-${{ matrix.arch }}.zip .\build\bin\Release\*
- name: Upload artifacts
uses: actions/upload-artifact@v4
with:
path: llama-bin-win-cpu-${{ matrix.arch }}.zip
name: llama-bin-win-cpu-${{ matrix.arch }}.zip
windows:
runs-on: windows-2025
env:
OPENBLAS_VERSION: 0.3.23
VULKAN_VERSION: 1.4.313.2
strategy:
matrix:
include:
- backend: 'vulkan'
arch: 'x64'
defines: '-DGGML_VULKAN=ON'
target: 'ggml-vulkan'
- backend: 'opencl-adreno'
arch: 'arm64'
defines: '-G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DCMAKE_PREFIX_PATH="$env:RUNNER_TEMP/opencl-arm64-release" -DGGML_OPENCL=ON -DGGML_OPENCL_USE_ADRENO_KERNELS=ON'
target: 'ggml-opencl'
steps:
- name: Clone
id: checkout
uses: actions/checkout@v4
- name: ccache
uses: hendrikmuhs/ccache-action@v1.2.16
with:
key: windows-latest-cmake-${{ matrix.backend }}-${{ matrix.arch }}
variant: ccache
evict-old-files: 1d
- name: Install Vulkan SDK
id: get_vulkan
if: ${{ matrix.backend == 'vulkan' }}
run: |
curl.exe -o $env:RUNNER_TEMP/VulkanSDK-Installer.exe -L "https://sdk.lunarg.com/sdk/download/${env:VULKAN_VERSION}/windows/vulkansdk-windows-X64-${env:VULKAN_VERSION}.exe"
& "$env:RUNNER_TEMP\VulkanSDK-Installer.exe" --accept-licenses --default-answer --confirm-command install
Add-Content $env:GITHUB_ENV "VULKAN_SDK=C:\VulkanSDK\${env:VULKAN_VERSION}"
Add-Content $env:GITHUB_PATH "C:\VulkanSDK\${env:VULKAN_VERSION}\bin"
- name: Install Ninja
id: install_ninja
run: |
choco install ninja
- name: Install OpenCL Headers and Libs
id: install_opencl
if: ${{ matrix.backend == 'opencl-adreno' && matrix.arch == 'arm64' }}
run: |
git clone https://github.com/KhronosGroup/OpenCL-Headers
cd OpenCL-Headers
cmake -B build `
-DBUILD_TESTING=OFF `
-DOPENCL_HEADERS_BUILD_TESTING=OFF `
-DOPENCL_HEADERS_BUILD_CXX_TESTS=OFF `
-DCMAKE_INSTALL_PREFIX="$env:RUNNER_TEMP/opencl-arm64-release"
cmake --build build --target install
git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader
cd OpenCL-ICD-Loader
cmake -B build-arm64-release `
-A arm64 `
-DCMAKE_PREFIX_PATH="$env:RUNNER_TEMP/opencl-arm64-release" `
-DCMAKE_INSTALL_PREFIX="$env:RUNNER_TEMP/opencl-arm64-release"
cmake --build build-arm64-release --target install --config release
- name: Build
id: cmake_build
run: |
cmake -S . -B build ${{ matrix.defines }} -DGGML_NATIVE=OFF -DGGML_CPU=OFF -DGGML_BACKEND_DL=ON -DLLAMA_CURL=OFF
cmake --build build --config Release --target ${{ matrix.target }}
- name: Pack artifacts
id: pack_artifacts
run: |
7z a llama-bin-win-${{ matrix.backend }}-${{ matrix.arch }}.zip .\build\bin\Release\${{ matrix.target }}.dll
- name: Upload artifacts
uses: actions/upload-artifact@v4
with:
path: llama-bin-win-${{ matrix.backend }}-${{ matrix.arch }}.zip
name: llama-bin-win-${{ matrix.backend }}-${{ matrix.arch }}.zip
windows-cuda:
runs-on: windows-2022
strategy:
matrix:
cuda: ['12.4']
steps:
- name: Clone
id: checkout
uses: actions/checkout@v4
- name: Install ccache
uses: hendrikmuhs/ccache-action@v1.2.16
with:
key: windows-cuda-${{ matrix.cuda }}
variant: ccache
evict-old-files: 1d
- name: Install Cuda Toolkit
uses: ./.github/actions/windows-setup-cuda
with:
cuda_version: ${{ matrix.cuda }}
- name: Install Ninja
id: install_ninja
run: |
choco install ninja
- name: Build
id: cmake_build
shell: cmd
run: |
call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" x64
cmake -S . -B build -G "Ninja Multi-Config" ^
-DGGML_BACKEND_DL=ON ^
-DGGML_NATIVE=OFF ^
-DGGML_CPU=OFF ^
-DGGML_CUDA=ON ^
-DLLAMA_CURL=OFF
set /A NINJA_JOBS=%NUMBER_OF_PROCESSORS%-1
cmake --build build --config Release -j %NINJA_JOBS% --target ggml-cuda
- name: Pack artifacts
id: pack_artifacts
run: |
7z a llama-bin-win-cuda-${{ matrix.cuda }}-x64.zip .\build\bin\Release\ggml-cuda.dll
- name: Upload artifacts
uses: actions/upload-artifact@v4
with:
path: llama-bin-win-cuda-${{ matrix.cuda }}-x64.zip
name: llama-bin-win-cuda-${{ matrix.cuda }}-x64.zip
- name: Copy and pack Cuda runtime
run: |
echo "Cuda install location: ${{ env.CUDA_PATH }}"
$dst='.\build\bin\cudart\'
robocopy "${{env.CUDA_PATH}}\bin" $dst cudart64_*.dll cublas64_*.dll cublasLt64_*.dll
robocopy "${{env.CUDA_PATH}}\lib" $dst cudart64_*.dll cublas64_*.dll cublasLt64_*.dll
7z a cudart-llama-bin-win-cuda-${{ matrix.cuda }}-x64.zip $dst\*
- name: Upload Cuda runtime
uses: actions/upload-artifact@v4
with:
path: cudart-llama-bin-win-cuda-${{ matrix.cuda }}-x64.zip
name: cudart-llama-bin-win-cuda-${{ matrix.cuda }}-x64.zip
windows-sycl:
runs-on: windows-2022
defaults:
run:
shell: bash
env:
WINDOWS_BASEKIT_URL: https://registrationcenter-download.intel.com/akdlm/IRC_NAS/7cd9bba0-7aab-4e30-b3ae-2221006a4a05/intel-oneapi-base-toolkit-2025.1.1.34_offline.exe
WINDOWS_DPCPP_MKL: intel.oneapi.win.cpp-dpcpp-common:intel.oneapi.win.mkl.devel:intel.oneapi.win.dnnl:intel.oneapi.win.tbb.devel
ONEAPI_ROOT: "C:/Program Files (x86)/Intel/oneAPI"
steps:
- name: Clone
id: checkout
uses: actions/checkout@v4
- name: ccache
uses: hendrikmuhs/ccache-action@v1.2.16
with:
key: windows-latest-cmake-sycl
variant: ccache
evict-old-files: 1d
- name: Install
run: |
scripts/install-oneapi.bat $WINDOWS_BASEKIT_URL $WINDOWS_DPCPP_MKL
- name: Build
id: cmake_build
shell: cmd
run: |
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force
cmake -G "Ninja" -B build ^
-DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=icx ^
-DCMAKE_BUILD_TYPE=Release ^
-DGGML_BACKEND_DL=ON -DBUILD_SHARED_LIBS=ON ^
-DGGML_CPU=OFF -DGGML_SYCL=ON ^
-DLLAMA_CURL=OFF
cmake --build build --target ggml-sycl -j
- name: Build the release package
id: pack_artifacts
run: |
echo "cp oneAPI running time dll files in ${{ env.ONEAPI_ROOT }} to ./build/bin"
cp "${{ env.ONEAPI_ROOT }}/mkl/latest/bin/mkl_sycl_blas.5.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/mkl/latest/bin/mkl_core.2.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/mkl/latest/bin/mkl_tbb_thread.2.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/ur_adapter_level_zero.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/ur_adapter_opencl.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/ur_loader.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/ur_win_proxy_loader.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/sycl8.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/svml_dispmd.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/libmmd.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/libiomp5md.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/dnnl/latest/bin/dnnl.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/tbb/latest/bin/tbb12.dll" ./build/bin
echo "cp oneAPI running time dll files to ./build/bin done"
7z a llama-bin-win-sycl-x64.zip ./build/bin/*
- name: Upload the release package
uses: actions/upload-artifact@v4
with:
path: llama-bin-win-sycl-x64.zip
name: llama-bin-win-sycl-x64.zip
windows-hip:
runs-on: windows-2022
strategy:
matrix:
include:
- name: "radeon"
gpu_targets: "gfx1100;gfx1101;gfx1102;gfx1030;gfx1031;gfx1032"
steps:
- name: Clone
id: checkout
uses: actions/checkout@v4
- name: Clone rocWMMA repository
id: clone_rocwmma
run: |
git clone https://github.com/rocm/rocwmma --branch rocm-6.2.4 --depth 1
- name: ccache
uses: hendrikmuhs/ccache-action@v1.2.16
with:
key: windows-latest-cmake-hip-${{ matrix.name }}-x64
evict-old-files: 1d
- name: Install
id: depends
run: |
$ErrorActionPreference = "Stop"
write-host "Downloading AMD HIP SDK Installer"
Invoke-WebRequest -Uri "https://download.amd.com/developer/eula/rocm-hub/AMD-Software-PRO-Edition-24.Q3-WinSvr2022-For-HIP.exe" -OutFile "${env:RUNNER_TEMP}\rocm-install.exe"
write-host "Installing AMD HIP SDK"
Start-Process "${env:RUNNER_TEMP}\rocm-install.exe" -ArgumentList '-install' -NoNewWindow -Wait
write-host "Completed AMD HIP SDK installation"
- name: Verify ROCm
id: verify
run: |
& 'C:\Program Files\AMD\ROCm\*\bin\clang.exe' --version
- name: Build
id: cmake_build
run: |
$env:HIP_PATH=$(Resolve-Path 'C:\Program Files\AMD\ROCm\*\bin\clang.exe' | split-path | split-path)
$env:CMAKE_PREFIX_PATH="${env:HIP_PATH}"
cmake -G "Unix Makefiles" -B build -S . `
-DCMAKE_C_COMPILER="${env:HIP_PATH}\bin\clang.exe" `
-DCMAKE_CXX_COMPILER="${env:HIP_PATH}\bin\clang++.exe" `
-DCMAKE_CXX_FLAGS="-I$($PWD.Path.Replace('\', '/'))/rocwmma/library/include/ -Wno-ignored-attributes -Wno-nested-anon-types" `
-DCMAKE_BUILD_TYPE=Release `
-DGGML_BACKEND_DL=ON `
-DGGML_NATIVE=OFF `
-DGGML_CPU=OFF `
-DAMDGPU_TARGETS="${{ matrix.gpu_targets }}" `
-DGGML_HIP_ROCWMMA_FATTN=ON `
-DGGML_HIP=ON `
-DLLAMA_CURL=OFF
cmake --build build --target ggml-hip -j ${env:NUMBER_OF_PROCESSORS}
md "build\bin\rocblas\library\"
cp "${env:HIP_PATH}\bin\hipblas.dll" "build\bin\"
cp "${env:HIP_PATH}\bin\rocblas.dll" "build\bin\"
cp "${env:HIP_PATH}\bin\rocblas\library\*" "build\bin\rocblas\library\"
- name: Pack artifacts
id: pack_artifacts
run: |
7z a llama-bin-win-hip-${{ matrix.name }}-x64.zip .\build\bin\*
- name: Upload artifacts
uses: actions/upload-artifact@v4
with:
path: llama-bin-win-hip-${{ matrix.name }}-x64.zip
name: llama-bin-win-hip-${{ matrix.name }}-x64.zip
ios-xcode-build:
runs-on: macos-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Build
id: cmake_build
run: |
sysctl -a
cmake -B build -G Xcode \
-DGGML_METAL_USE_BF16=ON \
-DGGML_METAL_EMBED_LIBRARY=ON \
-DLLAMA_CURL=OFF \
-DLLAMA_BUILD_EXAMPLES=OFF \
-DLLAMA_BUILD_TOOLS=OFF \
-DLLAMA_BUILD_TESTS=OFF \
-DLLAMA_BUILD_SERVER=OFF \
-DCMAKE_SYSTEM_NAME=iOS \
-DCMAKE_OSX_DEPLOYMENT_TARGET=14.0 \
-DCMAKE_XCODE_ATTRIBUTE_DEVELOPMENT_TEAM=ggml
cmake --build build --config Release -j $(sysctl -n hw.logicalcpu) -- CODE_SIGNING_ALLOWED=NO
- name: xcodebuild for swift package
id: xcodebuild
run: |
./build-xcframework.sh
- name: Build Xcode project
run: xcodebuild -project examples/llama.swiftui/llama.swiftui.xcodeproj -scheme llama.swiftui -sdk iphoneos CODE_SIGNING_REQUIRED=NO CODE_SIGN_IDENTITY= -destination 'generic/platform=iOS' FRAMEWORK_FOLDER_PATH=./build-ios build
- name: Determine tag name
id: tag
uses: ./.github/actions/get-tag-name
- name: Pack artifacts
id: pack_artifacts
run: |
zip --symlinks -r llama-${{ steps.tag.outputs.name }}-xcframework.zip build-apple/llama.xcframework
- name: Upload artifacts
uses: actions/upload-artifact@v4
with:
path: llama-${{ steps.tag.outputs.name }}-xcframework.zip
name: llama-${{ steps.tag.outputs.name }}-xcframework
release:
if: ${{ ( github.event_name == 'push' && github.ref == 'refs/heads/master' ) || github.event.inputs.create_release == 'true' }}
# Fine-grant permission
# https://docs.github.com/en/actions/security-for-github-actions/security-guides/automatic-token-authentication#modifying-the-permissions-for-the-github_token
permissions:
contents: write # for creating release
runs-on: ubuntu-latest
needs:
- windows
- windows-cpu
- windows-cuda
- windows-sycl
- windows-hip
- ubuntu-22-cpu
- ubuntu-22-vulkan
- macOS-arm64
- macOS-x64
- ios-xcode-build
steps:
- name: Clone
id: checkout
uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Determine tag name
id: tag
uses: ./.github/actions/get-tag-name
- name: Download artifacts
id: download-artifact
uses: actions/download-artifact@v4
with:
path: ./artifact
merge-multiple: true
- name: Move artifacts
id: move_artifacts
run: |
mkdir -p release
echo "Adding CPU backend files to existing zips..."
for arch in x64 arm64; do
cpu_zip="artifact/llama-bin-win-cpu-${arch}.zip"
temp_dir=$(mktemp -d)
echo "Extracting CPU backend for $arch..."
unzip "$cpu_zip" -d "$temp_dir"
echo "Adding CPU files to $arch zips..."
for target_zip in artifact/llama-bin-win-*-${arch}.zip; do
if [[ "$target_zip" == "$cpu_zip" ]]; then
continue
fi
echo "Adding CPU backend to $(basename "$target_zip")"
realpath_target_zip=$(realpath "$target_zip")
(cd "$temp_dir" && zip -r "$realpath_target_zip" .)
done
rm -rf "$temp_dir"
done
echo "Renaming and moving zips to release..."
for zip_file in artifact/llama-bin-win-*.zip; do
base_name=$(basename "$zip_file" .zip)
zip_name="llama-${{ steps.tag.outputs.name }}-${base_name#llama-}.zip"
echo "Moving $zip_file to release/$zip_name"
mv "$zip_file" "release/$zip_name"
done
echo "Moving other artifacts..."
mv -v artifact/*.zip release
- name: Create release
id: create_release
uses: ggml-org/action-create-release@v1
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with:
tag_name: ${{ steps.tag.outputs.name }}
- name: Upload release
id: upload_release
uses: actions/github-script@v3
with:
github-token: ${{secrets.GITHUB_TOKEN}}
script: |
const path = require('path');
const fs = require('fs');
const release_id = '${{ steps.create_release.outputs.id }}';
for (let file of await fs.readdirSync('./release')) {
if (path.extname(file) === '.zip') {
console.log('uploadReleaseAsset', file);
await github.repos.uploadReleaseAsset({
owner: context.repo.owner,
repo: context.repo.repo,
release_id: release_id,
name: file,
data: await fs.readFileSync(`./release/${file}`)
});
}
}
+107 -41
View File
@@ -15,12 +15,16 @@ on:
push:
branches:
- master
paths: ['.github/workflows/server.yml', '**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.swift', '**/*.m', 'examples/server/**.*']
pull_request_target:
paths: ['.github/workflows/server.yml', '**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.swift', '**/*.m', 'tools/server/**.*']
pull_request:
types: [opened, synchronize, reopened]
paths: ['.github/workflows/server.yml', '**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.swift', '**/*.m', 'examples/server/**.*']
schedule:
- cron: '2 4 * * *'
paths: ['.github/workflows/server.yml', '**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.swift', '**/*.m', 'tools/server/**.*']
env:
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_LOG_VERBOSITY: 10
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}-${{ github.head_ref || github.run_id }}
@@ -32,7 +36,7 @@ jobs:
strategy:
matrix:
sanitizer: [ADDRESS, THREAD, UNDEFINED]
sanitizer: [ADDRESS, UNDEFINED] # THREAD is broken
build_type: [RelWithDebInfo]
include:
- build_type: Release
@@ -70,52 +74,113 @@ jobs:
- name: Tests dependencies
id: test_dependencies
run: |
pip install -r examples/server/tests/requirements.txt
pip install -r tools/server/tests/requirements.txt
- name: Verify server deps
id: verify_server_deps
# Setup nodejs (to be used for verifying bundled index.html)
- uses: actions/setup-node@v4
with:
node-version: '22.11.0'
- name: WebUI - Install dependencies
id: webui_lint
run: |
cd tools/server/webui
npm ci
- name: WebUI - Check code format
id: webui_format
run: |
git config --global --add safe.directory $(realpath .)
cd examples/server
git ls-files --others --modified
cd tools/server/webui
git status
./deps.sh
npm run format
git status
not_ignored_files="$(git ls-files --others --modified)"
echo "Modified files: ${not_ignored_files}"
if [ -n "${not_ignored_files}" ]; then
echo "Repository is dirty or server deps are not built as expected"
echo "${not_ignored_files}"
modified_files="$(git status -s)"
echo "Modified files: ${modified_files}"
if [ -n "${modified_files}" ]; then
echo "Files do not follow coding style. To fix: npm run format"
echo "${modified_files}"
exit 1
fi
- name: Build
id: cmake_build
- name: Verify bundled index.html
id: verify_server_index_html
run: |
git config --global --add safe.directory $(realpath .)
cd tools/server/webui
git status
npm run build
git status
modified_files="$(git status -s)"
echo "Modified files: ${modified_files}"
if [ -n "${modified_files}" ]; then
echo "Repository is dirty or server/webui is not built as expected"
echo "Hint: You may need to follow Web UI build guide in server/README.md"
echo "${modified_files}"
exit 1
fi
- name: Build (no OpenMP)
id: cmake_build_no_openmp
if: ${{ matrix.sanitizer == 'THREAD' }}
run: |
cmake -B build \
-DLLAMA_NATIVE=OFF \
-DGGML_NATIVE=OFF \
-DLLAMA_BUILD_SERVER=ON \
-DCMAKE_BUILD_TYPE=${{ matrix.build_type }} \
-DLLAMA_SANITIZE_${{ matrix.sanitizer }}=ON \
-DGGML_OPENMP=OFF ;
cmake --build build --config ${{ matrix.build_type }} -j $(nproc) --target llama-server
- name: Build (sanitizers)
id: cmake_build_sanitizers
if: ${{ matrix.sanitizer != '' && matrix.sanitizer != 'THREAD' }}
run: |
cmake -B build \
-DGGML_NATIVE=OFF \
-DLLAMA_BUILD_SERVER=ON \
-DLLAMA_CURL=ON \
-DCMAKE_BUILD_TYPE=${{ matrix.build_type }} \
-DLLAMA_SANITIZE_${{ matrix.sanitizer }}=ON ;
cmake --build build --config ${{ matrix.build_type }} -j $(nproc) --target server
cmake --build build --config ${{ matrix.build_type }} -j $(nproc) --target llama-server
- name: Build (sanitizers)
id: cmake_build
if: ${{ matrix.sanitizer == '' }}
run: |
cmake -B build \
-DGGML_NATIVE=OFF \
-DLLAMA_BUILD_SERVER=ON \
-DCMAKE_BUILD_TYPE=${{ matrix.build_type }} ;
cmake --build build --config ${{ matrix.build_type }} -j $(nproc) --target llama-server
- name: Tests
id: server_integration_tests
if: ${{ matrix.sanitizer == '' }}
env:
GITHUB_ACTIONS: "true"
run: |
cd examples/server/tests
PORT=8888 ./tests.sh
cd tools/server/tests
./tests.sh
- name: Tests (sanitizers)
id: server_integration_tests_sanitizers
if: ${{ matrix.sanitizer != '' }}
run: |
cd tools/server/tests
LLAMA_SANITIZE=1 ./tests.sh
- name: Slow tests
id: server_integration_tests_slow
if: ${{ (github.event.schedule || github.event.inputs.slow_tests == 'true') && matrix.build_type == 'Release' }}
run: |
cd examples/server/tests
PORT=8888 ./tests.sh --stop --no-skipped --no-capture --tags slow
cd tools/server/tests
SLOW_TESTS=1 ./tests.sh
server-windows:
runs-on: windows-latest
runs-on: windows-2022
steps:
- name: Clone
@@ -127,18 +192,15 @@ jobs:
- name: libCURL
id: get_libcurl
env:
CURL_VERSION: 8.6.0_6
run: |
curl.exe -o $env:RUNNER_TEMP/curl.zip -L "https://curl.se/windows/dl-${env:CURL_VERSION}/curl-${env:CURL_VERSION}-win64-mingw.zip"
mkdir $env:RUNNER_TEMP/libcurl
tar.exe -xvf $env:RUNNER_TEMP/curl.zip --strip-components=1 -C $env:RUNNER_TEMP/libcurl
uses: ./.github/actions/windows-setup-curl
- name: Build
id: cmake_build
env:
CURL_PATH: ${{ steps.get_libcurl.outputs.curl_path }}
run: |
cmake -B build -DLLAMA_CURL=ON -DCURL_LIBRARY="$env:RUNNER_TEMP/libcurl/lib/libcurl.dll.a" -DCURL_INCLUDE_DIR="$env:RUNNER_TEMP/libcurl/include"
cmake --build build --config Release -j ${env:NUMBER_OF_PROCESSORS} --target server
cmake -B build -DCURL_LIBRARY="$env:CURL_PATH/lib/libcurl.dll.a" -DCURL_INCLUDE_DIR="$env:CURL_PATH/include"
cmake --build build --config Release -j ${env:NUMBER_OF_PROCESSORS} --target llama-server
- name: Python setup
id: setup_python
@@ -149,23 +211,27 @@ jobs:
- name: Tests dependencies
id: test_dependencies
run: |
pip install -r examples/server/tests/requirements.txt
pip install -r tools/server/tests/requirements.txt
- name: Copy Libcurl
id: prepare_libcurl
env:
CURL_PATH: ${{ steps.get_libcurl.outputs.curl_path }}
run: |
cp $env:RUNNER_TEMP/libcurl/bin/libcurl-x64.dll ./build/bin/Release/libcurl-x64.dll
cp $env:CURL_PATH/bin/libcurl-x64.dll ./build/bin/Release/libcurl-x64.dll
- name: Tests
id: server_integration_tests
if: ${{ !matrix.disabled_on_pr || !github.event.pull_request }}
run: |
cd examples/server/tests
behave.exe --summary --stop --no-capture --exclude 'issues|wrong_usages|passkey' --tags llama.cpp
cd tools/server/tests
$env:PYTHONIOENCODING = ":replace"
pytest -v -x -m "not slow"
- name: Slow tests
id: server_integration_tests_slow
if: ${{ (github.event.schedule || github.event.inputs.slow_tests == 'true') && matrix.build_type == 'Release' }}
run: |
cd examples/server/tests
behave.exe --stop --no-skipped --no-capture --tags slow
cd tools/server/tests
$env:SLOW_TESTS = "1"
pytest -v -x
+40
View File
@@ -0,0 +1,40 @@
name: Update Operations Documentation
on:
push:
paths:
- 'docs/ops/**'
- 'scripts/create_ops_docs.py'
pull_request:
paths:
- 'docs/ops/**'
- 'scripts/create_ops_docs.py'
jobs:
update-ops-docs:
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.x'
- name: Generate operations documentation to temporary file
run: |
mkdir -p /tmp/ops_check
./scripts/create_ops_docs.py /tmp/ops_check/ops.md
- name: Check if docs/ops.md matches generated version
run: |
if ! diff -q docs/ops.md /tmp/ops_check/ops.md; then
echo "Operations documentation (docs/ops.md) is not up to date with the backend CSV files."
echo "To fix: run ./scripts/create_ops_docs.py and commit the updated docs/ops.md along with your changes"
echo "Differences found:"
diff docs/ops.md /tmp/ops_check/ops.md || true
exit 1
fi
echo "Operations documentation is up to date."
+42
View File
@@ -0,0 +1,42 @@
name: Update Winget Package
on:
workflow_dispatch: # allows manual triggering
schedule:
- cron: '28 5 * * *' # Update every day at 5:28 UTC
jobs:
update:
name: Update Winget Package
runs-on: ubuntu-latest
steps:
- name: Install cargo binstall
uses: cargo-bins/cargo-binstall@268643a6b5ea099f5718ee5cd3ff7dc89a5eb49b
- name: Install komac
run: |
cargo binstall komac@2.11.2 -y
- name: Find latest release
id: find_latest_release
uses: actions/github-script@v6
with:
script: |
const { data: releases } = await github.rest.repos.listReleases({
owner: context.repo.owner,
repo: context.repo.repo,
});
console.log("Latest release:", releases[0].tag_name);
return releases[0].tag_name;
- name: Update manifest
env:
VERSION: ${{ steps.find_latest_release.outputs.result }}
run: |
echo "Updating manifest..."
komac update --version ${{ env.VERSION }} \
--urls "https://github.com/ggml-org/llama.cpp/releases/download/${{ env.VERSION }}/llama-${{ env.VERSION }}-bin-win-vulkan-x64.zip" \
--token ${{ secrets.WINGET_GITHUB_TOKEN }} \
--submit \
ggml.llamacpp
+105 -85
View File
@@ -1,129 +1,149 @@
*.o
# Extensions
*.a
*.so
*.bat
*.bin
*.d
*.dll
*.dot
*.etag
*.exe
*.gcda
*.gcno
*.gcov
*.gguf
*.gguf.json
*.bin
*.exe
*.dll
*.log
*.gcov
*.gcno
*.gcda
*.dot
*.bat
*.tmp
*.metallib
*.etag
*.lastModified
.DS_Store
.build/
*.log
*.metallib
*.o
*.so
*.swp
*.tmp
# IDE / OS
.cache/
.ccls-cache/
.direnv/
.DS_Store
.envrc
.idea/
.swiftpm
.venv
.clang-tidy
.vs/
.vscode/
.idea/
nppBackup
ggml-metal-embed.metal
lcov-report/
# Coverage
gcovr-report/
lcov-report/
# Build Artifacts
tags
.build/
build*
release
debug
!build-info.cmake
!build-info.cpp.in
!build-info.sh
!build.zig
cmake-build-*
!docs/build.md
/libllama.so
/llama-*
/vulkan-shaders-gen
android-ndk-*
arm_neon.h
cmake-build-*
CMakeSettings.json
compile_commands.json
ggml-metal-embed.metal
llama-batched-swift
/rpc-server
out/
tmp/
autogen-*.md
# Deprecated
/main
/server
# CI
!.github/workflows/*.yml
# Models
models/*
models-mnt
!models/.editorconfig
!models/ggml-vocab-*.gguf*
!models/templates
/Pipfile
/baby-llama
/beam-search
/benchmark-matmult
/convert-llama2c-to-ggml
/embd-input-test
/embedding
/eval-callback
/gguf
/gguf-llama-simple
/gguf-split
/gritlm
/imatrix
/infill
/libllama.so
/llama-bench
/llava-cli
/lookahead
/lookup
/lookup-create
/lookup-merge
/lookup-stats
/main
/metal
/passkey
/perplexity
/q8dot
/quantize
/quantize-stats
/result
/save-load-state
/server
/simple
/batched
/batched-bench
/export-lora
/finetune
/retrieval
/speculative
/parallel
/train-text-from-scratch
/tokenize
/vdot
/common/build-info.cpp
arm_neon.h
compile_commands.json
CMakeSettings.json
__pycache__
dist
# Zig
zig-out/
zig-cache/
# Logs
ppl-*.txt
qnt-*.txt
perf-*.txt
examples/jeopardy/results.txt
examples/server/*.html.hpp
examples/server/*.js.hpp
examples/server/*.mjs.hpp
examples/server/*.css.hpp
# Examples
poetry.lock
examples/jeopardy/results.txt
tools/server/*.css.hpp
tools/server/*.html.hpp
tools/server/*.js.hpp
tools/server/*.mjs.hpp
tools/server/*.gz.hpp
!build_64.sh
!examples/*.bat
!examples/*/*.kts
!examples/*/*/*.kts
!examples/sycl/*.bat
!examples/sycl/*.sh
# Server Web UI temporary files
node_modules
tools/server/webui/dist
# Python
/.venv
__pycache__/
*/poetry.lock
poetry.toml
nppBackup
# Nix
/result
# Test binaries
/tests/test-grammar-parser
/tests/test-llama-grammar
/tests/test-backend-ops
/tests/test-double-float
/tests/test-grad0
/tests/test-grammar-parser
/tests/test-llama-grammar
/tests/test-opt
/tests/test-quantize-fns
/tests/test-quantize-perf
/tests/test-rope
/tests/test-sampling
/tests/test-tokenizer-0
/tests/test-tokenizer-1-spm
/tests/test-tokenizer-1-bpe
/tests/test-rope
/tests/test-backend-ops
/tests/test-tokenizer-1-spm
# Scripts
!/scripts/install-oneapi.bat
# Test models for lora adapters
/lora-tests
# Local scripts
/run-vim.sh
/run-chat.sh
-3
View File
@@ -1,3 +0,0 @@
[submodule "kompute"]
path = kompute
url = https://github.com/nomic-ai/kompute.git
+452 -1
View File
File diff suppressed because it is too large Load Diff
+145 -1282
View File
File diff suppressed because it is too large Load Diff
+65 -19
View File
@@ -11,39 +11,85 @@
"CMAKE_INSTALL_RPATH": "$ORIGIN;$ORIGIN/.."
}
},
{ "name": "debug", "hidden": true, "cacheVariables": { "CMAKE_BUILD_TYPE": "Debug" } },
{ "name": "release", "hidden": true, "cacheVariables": { "CMAKE_BUILD_TYPE": "RelWithDebInfo" } },
{ "name": "static", "hidden": true, "cacheVariables": { "LLAMA_STATIC": "ON" } },
{
"name": "sycl-base",
"hidden": true,
"generator": "Ninja",
"binaryDir": "${sourceDir}/build-${presetName}",
"cacheVariables": {
"CMAKE_EXPORT_COMPILE_COMMANDS": "ON",
"CMAKE_CXX_COMPILER": "icx",
"CMAKE_C_COMPILER": "cl",
"GGML_SYCL": "ON",
"CMAKE_INSTALL_RPATH": "$ORIGIN;$ORIGIN/.."
}
},
{ "name": "debug", "hidden": true, "cacheVariables": { "CMAKE_BUILD_TYPE": "Debug" } },
{ "name": "release", "hidden": true, "cacheVariables": { "CMAKE_BUILD_TYPE": "Release" } },
{ "name": "reldbg", "hidden": true, "cacheVariables": { "CMAKE_BUILD_TYPE": "RelWithDebInfo" } },
{ "name": "static", "hidden": true, "cacheVariables": { "GGML_STATIC": "ON" } },
{ "name": "sycl_f16", "hidden": true, "cacheVariables": { "GGML_SYCL_F16": "ON" } },
{ "name": "vulkan", "hidden": true, "cacheVariables": { "GGML_VULKAN": "ON" } },
{
"name": "arm64-windows-msvc", "hidden": true,
"architecture": { "value": "arm64", "strategy": "external" },
"toolset": { "value": "host=x86_64", "strategy": "external" },
"name": "x64-windows-llvm", "hidden": true,
"cacheVariables": {
"CMAKE_TOOLCHAIN_FILE": "${sourceDir}/cmake/arm64-windows-msvc.cmake"
"CMAKE_TOOLCHAIN_FILE": "${sourceDir}/cmake/x64-windows-llvm.cmake"
}
},
{
"name": "arm64-windows-llvm", "hidden": true,
"architecture": { "value": "arm64", "strategy": "external" },
"toolset": { "value": "host=x86_64", "strategy": "external" },
"architecture": { "value": "arm64", "strategy": "external" },
"toolset": { "value": "host=x64", "strategy": "external" },
"cacheVariables": {
"CMAKE_TOOLCHAIN_FILE": "${sourceDir}/cmake/arm64-windows-llvm.cmake"
}
},
{ "name": "arm64-windows-llvm-debug" , "inherits": [ "base", "arm64-windows-llvm", "debug" ] },
{ "name": "arm64-windows-llvm-release", "inherits": [ "base", "arm64-windows-llvm", "release" ] },
{ "name": "arm64-windows-llvm+static-release", "inherits": [ "base", "arm64-windows-llvm", "release", "static" ] },
{
"name": "arm64-apple-clang", "hidden": true,
"architecture": { "value": "arm64", "strategy": "external" },
"toolset": { "value": "host=x64", "strategy": "external" },
"cacheVariables": {
"CMAKE_TOOLCHAIN_FILE": "${sourceDir}/cmake/arm64-apple-clang.cmake"
}
},
{
"name": "x64-linux-gcc", "hidden": true,
"cacheVariables": {
"CMAKE_C_COMPILER": "gcc",
"CMAKE_CXX_COMPILER": "g++"
}
},
{ "name": "x64-linux-gcc-debug", "inherits": [ "base", "x64-linux-gcc", "debug" ] },
{ "name": "x64-linux-gcc-release", "inherits": [ "base", "x64-linux-gcc", "release" ] },
{ "name": "x64-linux-gcc-reldbg", "inherits": [ "base", "x64-linux-gcc", "reldbg" ] },
{ "name": "x64-linux-gcc+static-release", "inherits": [ "base", "x64-linux-gcc", "release", "static" ] },
{ "name": "arm64-windows-msvc-debug" , "inherits": [ "base", "arm64-windows-msvc", "debug" ] },
{ "name": "arm64-windows-msvc-release", "inherits": [ "base", "arm64-windows-msvc", "release" ] },
{ "name": "arm64-windows-msvc+static-release", "inherits": [ "base", "arm64-windows-msvc", "release", "static" ] },
{ "name": "arm64-windows-llvm-debug", "inherits": [ "base", "arm64-windows-llvm", "debug" ] },
{ "name": "arm64-windows-llvm-release", "inherits": [ "base", "arm64-windows-llvm", "reldbg" ] },
{ "name": "arm64-windows-llvm+static-release", "inherits": [ "base", "arm64-windows-llvm", "reldbg", "static" ] },
{ "name": "x64-windows-msvc-debug" , "inherits": [ "base", "debug" ] },
{ "name": "x64-windows-msvc-release", "inherits": [ "base", "release" ] },
{ "name": "x64-windows-msvc+static-release", "inherits": [ "base", "release", "static" ] }
{ "name": "arm64-apple-clang-debug", "inherits": [ "base", "arm64-apple-clang", "debug" ] },
{ "name": "arm64-apple-clang-release", "inherits": [ "base", "arm64-apple-clang", "reldbg" ] },
{ "name": "arm64-apple-clang+static-release", "inherits": [ "base", "arm64-apple-clang", "reldbg", "static" ] },
{ "name": "x64-windows-llvm-debug", "inherits": [ "base", "x64-windows-llvm", "debug" ] },
{ "name": "x64-windows-llvm-release", "inherits": [ "base", "x64-windows-llvm", "release" ] },
{ "name": "x64-windows-llvm-reldbg", "inherits": [ "base", "x64-windows-llvm", "reldbg" ] },
{ "name": "x64-windows-llvm+static-release", "inherits": [ "base", "x64-windows-llvm", "reldbg", "static" ] },
{ "name": "x64-windows-msvc-debug", "inherits": [ "base", "debug" ] },
{ "name": "x64-windows-msvc-release", "inherits": [ "base", "reldbg" ] },
{ "name": "x64-windows-msvc+static-release", "inherits": [ "base", "reldbg", "static" ] },
{ "name": "x64-windows-sycl-debug", "inherits": [ "sycl-base", "debug" ] },
{ "name": "x64-windows-sycl-debug-f16", "inherits": [ "sycl-base", "debug", "sycl_f16" ] },
{ "name": "x64-windows-sycl-release", "inherits": [ "sycl-base", "release" ] },
{ "name": "x64-windows-sycl-release-f16", "inherits": [ "sycl-base", "release", "sycl_f16" ] },
{ "name": "x64-windows-vulkan-debug", "inherits": [ "base", "vulkan", "debug" ] },
{ "name": "x64-windows-vulkan-release", "inherits": [ "base", "vulkan", "release" ] }
]
}
+12
View File
@@ -0,0 +1,12 @@
# collaborators can optionally add themselves here to indicate their availability for reviewing related PRs
/ci/ @ggerganov
/.devops/*.Dockerfile @ngxson
/tools/server/ @ngxson
/ggml/src/ggml-cuda/fattn* @JohannesGaessler
/ggml/src/ggml-cuda/mmq.* @JohannesGaessler
/ggml/src/ggml-cuda/mmv.* @JohannesGaessler
/ggml/src/ggml-cuda/mmvq.* @JohannesGaessler
/ggml/src/ggml-opt.cpp @JohannesGaessler
/ggml/src/gguf.cpp @JohannesGaessler
/ggml/src/ggml-vulkan/ @0cc4m
+127
View File
@@ -0,0 +1,127 @@
# Pull requests (for contributors)
- llama.cpp uses the ggml tensor library for model evaluation. If you are unfamiliar with ggml, consider taking a look at the [examples in the ggml repository](https://github.com/ggml-org/ggml/tree/master/examples/). [simple](https://github.com/ggml-org/ggml/tree/master/examples/simple) shows the bare minimum for using ggml. [gpt-2](https://github.com/ggml-org/ggml/tree/master/examples/gpt-2) has minimal implementations for language model inference using GPT-2. [mnist](https://github.com/ggml-org/ggml/tree/master/examples/mnist) demonstrates how to train and evaluate a simple image classifier
- Test your changes:
- Execute [the full CI locally on your machine](ci/README.md) before publishing
- Verify that the perplexity and the performance are not affected negatively by your changes (use `llama-perplexity` and `llama-bench`)
- If you modified the `ggml` source, run the `test-backend-ops` tool to check whether different backend implementations of the `ggml` operators produce consistent results (this requires access to at least two different `ggml` backends)
- If you modified a `ggml` operator or added a new one, add the corresponding test cases to `test-backend-ops`
- Create separate PRs for each feature or fix. Avoid combining unrelated changes in a single PR
- Consider allowing write access to your branch for faster reviews, as reviewers can push commits directly
- If your PR becomes stale, don't hesitate to ping the maintainers in the comments
# Pull requests (for collaborators)
- Squash-merge PRs
- Use the following format for the squashed commit title: `<module> : <commit title> (#<issue_number>)`. For example: `utils : fix typo in utils.py (#1234)`
- Optionally pick a `<module>` from here: https://github.com/ggml-org/llama.cpp/wiki/Modules
- Consider adding yourself to [CODEOWNERS](CODEOWNERS)
# Coding guidelines
- Avoid adding third-party dependencies, extra files, extra headers, etc.
- Always consider cross-compatibility with other operating systems and architectures
- Avoid fancy-looking modern STL constructs, use basic `for` loops, avoid templates, keep it simple
- Vertical alignment makes things more readable and easier to batch edit
- Clean-up any trailing whitespaces, use 4 spaces for indentation, brackets on the same line, `void * ptr`, `int & a`
- Use sized integer types such as `int32_t` in the public API, e.g. `size_t` may also be appropriate for allocation sizes or byte offsets
- Declare structs with `struct foo {}` instead of `typedef struct foo {} foo`
- In C++ code omit optional `struct` and `enum` keyword whenever they are not necessary
```cpp
// OK
llama_context * ctx;
const llama_rope_type rope_type;
// not OK
struct llama_context * ctx;
const enum llama_rope_type rope_type;
```
_(NOTE: this guideline is yet to be applied to the `llama.cpp` codebase. New code should follow this guideline.)_
- Try to follow the existing patterns in the code (indentation, spaces, etc.). In case of doubt use `clang-format` (from clang-tools v15+) to format the added code
- For anything not covered in the current guidelines, refer to the [C++ Core Guidelines](https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines)
- Tensors store data in row-major order. We refer to dimension 0 as columns, 1 as rows, 2 as matrices
- Matrix multiplication is unconventional: [`C = ggml_mul_mat(ctx, A, B)`](https://github.com/ggml-org/llama.cpp/blob/880e352277fc017df4d5794f0c21c44e1eae2b84/ggml.h#L1058-L1064) means $C^T = A B^T \Leftrightarrow C = B A^T.$
![matmul](media/matmul.png)
# Naming guidelines
- Use `snake_case` for function, variable and type names
- Naming usually optimizes for longest common prefix (see https://github.com/ggml-org/ggml/pull/302#discussion_r1243240963)
```cpp
// not OK
int small_number;
int big_number;
// OK
int number_small;
int number_big;
```
- Enum values are always in upper case and prefixed with the enum name
```cpp
enum llama_vocab_type {
LLAMA_VOCAB_TYPE_NONE = 0,
LLAMA_VOCAB_TYPE_SPM = 1,
LLAMA_VOCAB_TYPE_BPE = 2,
LLAMA_VOCAB_TYPE_WPM = 3,
LLAMA_VOCAB_TYPE_UGM = 4,
LLAMA_VOCAB_TYPE_RWKV = 5,
};
```
- The general naming pattern is `<class>_<method>`, with `<method>` being `<action>_<noun>`
```cpp
llama_model_init(); // class: "llama_model", method: "init"
llama_sampler_chain_remove(); // class: "llama_sampler_chain", method: "remove"
llama_sampler_get_seed(); // class: "llama_sampler", method: "get_seed"
llama_set_embeddings(); // class: "llama_context", method: "set_embeddings"
llama_n_threads(); // class: "llama_context", method: "n_threads"
llama_adapter_lora_free(); // class: "llama_adapter_lora", method: "free"
```
- The `get` `<action>` can be omitted
- The `<noun>` can be omitted if not necessary
- The `_context` suffix of the `<class>` is optional. Use it to disambiguate symbols when needed
- Use `init`/`free` for constructor/destructor `<action>`
- Use the `_t` suffix when a type is supposed to be opaque to the user - it's not relevant to them if it is a struct or anything else
```cpp
typedef struct llama_context * llama_context_t;
enum llama_pooling_type llama_pooling_type(const llama_context_t ctx);
```
_(NOTE: this guideline is yet to be applied to the `llama.cpp` codebase. New code should follow this guideline)_
- C/C++ filenames are all lowercase with dashes. Headers use the `.h` extension. Source files use the `.c` or `.cpp` extension
- Python filenames are all lowercase with underscores
- _(TODO: abbreviations usage)_
# Preprocessor directives
- _(TODO: add guidelines with examples and apply them to the codebase)_
```cpp
#ifdef FOO
#endif // FOO
```
# Documentation
- Documentation is a community effort
- When you need to look into the source code to figure out how to use an API consider adding a short summary to the header file for future reference
- When you notice incorrect or outdated documentation, please update it
# Resources
The Github issues, PRs and discussions contain a lot of information that can be useful to get familiar with the codebase. For convenience, some of the more important information is referenced from Github projects:
https://github.com/ggml-org/llama.cpp/projects
+960 -407
View File
File diff suppressed because it is too large Load Diff
-78
View File
@@ -1,78 +0,0 @@
// swift-tools-version:5.5
import PackageDescription
var sources = [
"ggml.c",
"sgemm.cpp",
"llama.cpp",
"unicode.cpp",
"unicode-data.cpp",
"ggml-alloc.c",
"ggml-backend.c",
"ggml-quants.c",
]
var resources: [Resource] = []
var linkerSettings: [LinkerSetting] = []
var cSettings: [CSetting] = [
.unsafeFlags(["-Wno-shorten-64-to-32", "-O3", "-DNDEBUG"]),
.unsafeFlags(["-fno-objc-arc"]),
// NOTE: NEW_LAPACK will required iOS version 16.4+
// We should consider add this in the future when we drop support for iOS 14
// (ref: ref: https://developer.apple.com/documentation/accelerate/1513264-cblas_sgemm?language=objc)
// .define("ACCELERATE_NEW_LAPACK"),
// .define("ACCELERATE_LAPACK_ILP64")
]
#if canImport(Darwin)
sources.append("ggml-metal.m")
resources.append(.process("ggml-metal.metal"))
linkerSettings.append(.linkedFramework("Accelerate"))
cSettings.append(
contentsOf: [
.define("GGML_USE_ACCELERATE"),
.define("GGML_USE_METAL")
]
)
#endif
#if os(Linux)
cSettings.append(.define("_GNU_SOURCE"))
#endif
let package = Package(
name: "llama",
platforms: [
.macOS(.v12),
.iOS(.v14),
.watchOS(.v4),
.tvOS(.v14)
],
products: [
.library(name: "llama", targets: ["llama"]),
],
targets: [
.target(
name: "llama",
path: ".",
exclude: [
"cmake",
"examples",
"scripts",
"models",
"tests",
"CMakeLists.txt",
"ggml-cuda.cu",
"ggml-cuda.h",
"Makefile"
],
sources: sources,
resources: resources,
publicHeadersPath: "spm-headers",
cSettings: cSettings,
linkerSettings: linkerSettings
)
],
cxxLanguageStandard: .cxx11
)
-568
View File
@@ -1,568 +0,0 @@
# llama.cpp for SYCL
- [Background](#background)
- [News](#news)
- [OS](#os)
- [Hardware](#hardware)
- [Docker](#docker)
- [Linux](#linux)
- [Windows](#windows)
- [Environment Variable](#environment-variable)
- [Known Issue](#known-issues)
- [Q&A](#qa)
- [TODO](#todo)
## Background
**SYCL** is a high-level parallel programming model designed to improve developers productivity writing code across various hardware accelerators such as CPUs, GPUs, and FPGAs. It is a single-source language designed for heterogeneous computing and based on standard C++17.
**oneAPI** is an open ecosystem and a standard-based specification, supporting multiple architectures including but not limited to intel CPUs, GPUs and FPGAs. The key components of the oneAPI ecosystem include:
- **DPCPP** *(Data Parallel C++)*: The primary oneAPI SYCL implementation, which includes the icpx/icx Compilers.
- **oneAPI Libraries**: A set of highly optimized libraries targeting multiple domains *(e.g. oneMKL - Math Kernel Library)*.
- **oneAPI LevelZero**: A high performance low level interface for fine-grained control over intel iGPUs and dGPUs.
- **Nvidia & AMD Plugins**: These are plugins extending oneAPI's DPCPP support to SYCL on Nvidia and AMD GPU targets.
### Llama.cpp + SYCL
The llama.cpp SYCL backend is designed to support **Intel GPU** firstly. Based on the cross-platform feature of SYCL, it could support other vendor GPUs: Nvidia GPU (*AMD GPU coming*).
When targeting **Intel CPU**, it is recommended to use llama.cpp for [Intel oneMKL](README.md#intel-onemkl) backend.
It has the similar design of other llama.cpp BLAS-based paths such as *OpenBLAS, cuBLAS, etc..*. In beginning work, the oneAPI's [SYCLomatic](https://github.com/oneapi-src/SYCLomatic) open-source migration tool (Commercial release [Intel® DPC++ Compatibility Tool](https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compatibility-tool.html)) was used for this purpose.
## News
- 2024.4
- Support data types: GGML_TYPE_IQ4_NL, GGML_TYPE_IQ4_XS, GGML_TYPE_IQ3_XXS, GGML_TYPE_IQ3_S, GGML_TYPE_IQ2_XXS, GGML_TYPE_IQ2_XS, GGML_TYPE_IQ2_S, GGML_TYPE_IQ1_S, GGML_TYPE_IQ1_M.
- 2024.3
- Release binary files of Windows.
- A blog is published: **Run LLM on all Intel GPUs Using llama.cpp**: [intel.com](https://www.intel.com/content/www/us/en/developer/articles/technical/run-llm-on-all-gpus-using-llama-cpp-artical.html) or [medium.com](https://medium.com/@jianyu_neo/run-llm-on-all-intel-gpus-using-llama-cpp-fd2e2dcbd9bd).
- New base line is ready: [tag b2437](https://github.com/ggerganov/llama.cpp/tree/b2437).
- Support multiple cards: **--split-mode**: [none|layer]; not support [row], it's on developing.
- Support to assign main GPU by **--main-gpu**, replace $GGML_SYCL_DEVICE.
- Support detecting all GPUs with level-zero and same top **Max compute units**.
- Support OPs
- hardsigmoid
- hardswish
- pool2d
- 2024.1
- Create SYCL backend for Intel GPU.
- Support Windows build
## OS
| OS | Status | Verified |
|---------|---------|------------------------------------------------|
| Linux | Support | Ubuntu 22.04, Fedora Silverblue 39, Arch Linux |
| Windows | Support | Windows 11 |
## Hardware
### Intel GPU
**Verified devices**
| Intel GPU | Status | Verified Model |
|-------------------------------|---------|---------------------------------------|
| Intel Data Center Max Series | Support | Max 1550, 1100 |
| Intel Data Center Flex Series | Support | Flex 170 |
| Intel Arc Series | Support | Arc 770, 730M, Arc A750 |
| Intel built-in Arc GPU | Support | built-in Arc GPU in Meteor Lake |
| Intel iGPU | Support | iGPU in i5-1250P, i7-1260P, i7-1165G7 |
*Notes:*
- **Memory**
- The device memory is a limitation when running a large model. The loaded model size, *`llm_load_tensors: buffer_size`*, is displayed in the log when running `./bin/main`.
- Please make sure the GPU shared memory from the host is large enough to account for the model's size. For e.g. the *llama-2-7b.Q4_0* requires at least 8.0GB for integrated GPU and 4.0GB for discrete GPU.
- **Execution Unit (EU)**
- If the iGPU has less than 80 EUs, the inference speed will likely be too slow for practical use.
### Other Vendor GPU
**Verified devices**
| Nvidia GPU | Status | Verified Model |
|--------------------------|---------|----------------|
| Ampere Series | Support | A100, A4000 |
| Ampere Series *(Mobile)* | Support | RTX 40 Series |
## Docker
The docker build option is currently limited to *intel GPU* targets.
### Build image
```sh
# Using FP16
docker build -t llama-cpp-sycl --build-arg="LLAMA_SYCL_F16=ON" -f .devops/main-intel.Dockerfile .
```
*Notes*:
To build in default FP32 *(Slower than FP16 alternative)*, you can remove the `--build-arg="LLAMA_SYCL_F16=ON"` argument from the previous command.
You can also use the `.devops/server-intel.Dockerfile`, which builds the *"server"* alternative.
### Run container
```sh
# First, find all the DRI cards
ls -la /dev/dri
# Then, pick the card that you want to use (here for e.g. /dev/dri/card1).
docker run -it --rm -v "$(pwd):/app:Z" --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card1:/dev/dri/card1 llama-cpp-sycl -m "/app/models/YOUR_MODEL_FILE" -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33
```
*Notes:*
- Docker has been tested successfully on native Linux. WSL support has not been verified yet.
- You may need to install Intel GPU driver on the **host** machine *(Please refer to the [Linux configuration](#linux) for details)*.
## Linux
### I. Setup Environment
1. **Install GPU drivers**
- **Intel GPU**
Intel data center GPUs drivers installation guide and download page can be found here: [Get intel dGPU Drivers](https://dgpu-docs.intel.com/driver/installation.html#ubuntu-install-steps).
*Note*: for client GPUs *(iGPU & Arc A-Series)*, please refer to the [client iGPU driver installation](https://dgpu-docs.intel.com/driver/client/overview.html).
Once installed, add the user(s) to the `video` and `render` groups.
```sh
sudo usermod -aG render $USER
sudo usermod -aG video $USER
```
*Note*: logout/re-login for the changes to take effect.
Verify installation through `clinfo`:
```sh
sudo apt install clinfo
sudo clinfo -l
```
Sample output:
```sh
Platform #0: Intel(R) OpenCL Graphics
`-- Device #0: Intel(R) Arc(TM) A770 Graphics
Platform #0: Intel(R) OpenCL HD Graphics
`-- Device #0: Intel(R) Iris(R) Xe Graphics [0x9a49]
```
- **Nvidia GPU**
In order to target Nvidia GPUs through SYCL, please make sure the CUDA/CUBLAS native requirements *-found [here](README.md#cuda)-* are installed.
2. **Install Intel® oneAPI Base toolkit**
- **For Intel GPU**
The base toolkit can be obtained from the official [Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html) page.
Please follow the instructions for downloading and installing the Toolkit for Linux, and preferably keep the default installation values unchanged, notably the installation path *(`/opt/intel/oneapi` by default)*.
Following guidelines/code snippets assume the default installation values. Otherwise, please make sure the necessary changes are reflected where applicable.
Upon a successful installation, SYCL is enabled for the available intel devices, along with relevant libraries such as oneAPI MKL for intel GPUs.
- **Adding support to Nvidia GPUs**
**oneAPI Plugin**: In order to enable SYCL support on Nvidia GPUs, please install the [Codeplay oneAPI Plugin for Nvidia GPUs](https://developer.codeplay.com/products/oneapi/nvidia/download). User should also make sure the plugin version matches the installed base toolkit one *(previous step)* for a seamless "oneAPI on Nvidia GPU" setup.
**oneMKL for cuBlas**: The current oneMKL releases *(shipped with the oneAPI base-toolkit)* do not contain the cuBLAS backend. A build from source of the upstream [oneMKL](https://github.com/oneapi-src/oneMKL) with the *cuBLAS* backend enabled is thus required to run it on Nvidia GPUs.
```sh
git clone https://github.com/oneapi-src/oneMKL
cd oneMKL
cmake -B buildWithCublas -DCMAKE_CXX_COMPILER=icpx -DCMAKE_C_COMPILER=icx -DENABLE_MKLGPU_BACKEND=OFF -DENABLE_MKLCPU_BACKEND=OFF -DENABLE_CUBLAS_BACKEND=ON -DTARGET_DOMAINS=blas
cmake --build buildWithCublas --config Release
```
3. **Verify installation and environment**
In order to check the available SYCL devices on the machine, please use the `sycl-ls` command.
```sh
source /opt/intel/oneapi/setvars.sh
sycl-ls
```
- **Intel GPU**
When targeting an intel GPU, the user should expect one or more level-zero devices among the available SYCL devices. Please make sure that at least one GPU is present, for instance [`ext_oneapi_level_zero:gpu:0`] in the sample output below:
```
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.10.0.17_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i7-13700K OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO [23.30.26918.50]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26918]
```
- **Nvidia GPU**
Similarly, user targeting Nvidia GPUs should expect at least one SYCL-CUDA device [`ext_oneapi_cuda:gpu`] as bellow:
```
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[ext_oneapi_cuda:gpu:0] NVIDIA CUDA BACKEND, NVIDIA A100-PCIE-40GB 8.0 [CUDA 12.2]
```
### II. Build llama.cpp
#### Intel GPU
```sh
# Export relevant ENV variables
source /opt/intel/oneapi/setvars.sh
# Build LLAMA with MKL BLAS acceleration for intel GPU
# Option 1: Use FP32 (recommended for better performance in most cases)
cmake -B build -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
# Option 2: Use FP16
cmake -B build -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_SYCL_F16=ON
# build all binary
cmake --build build --config Release -j -v
```
#### Nvidia GPU
```sh
# Export relevant ENV variables
export LD_LIBRARY_PATH=/path/to/oneMKL/buildWithCublas/lib:$LD_LIBRARY_PATH
export LIBRARY_PATH=/path/to/oneMKL/buildWithCublas/lib:$LIBRARY_PATH
export CPLUS_INCLUDE_DIR=/path/to/oneMKL/buildWithCublas/include:$CPLUS_INCLUDE_DIR
export CPLUS_INCLUDE_DIR=/path/to/oneMKL/include:$CPLUS_INCLUDE_DIR
# Build LLAMA with Nvidia BLAS acceleration through SYCL
# Option 1: Use FP32 (recommended for better performance in most cases)
cmake -B build -DLLAMA_SYCL=ON -DLLAMA_SYCL_TARGET=NVIDIA -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
# Option 2: Use FP16
cmake -B build -DLLAMA_SYCL=ON -DLLAMA_SYCL_TARGET=NVIDIA -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_SYCL_F16=ON
# build all binary
cmake --build build --config Release -j -v
```
### III. Run the inference
1. Retrieve and prepare model
You can refer to the general [*Prepare and Quantize*](README.md#prepare-and-quantize) guide for model prepration, or simply download [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf) model as example.
2. Enable oneAPI running environment
```sh
source /opt/intel/oneapi/setvars.sh
```
3. List devices information
Similar to the native `sycl-ls`, available SYCL devices can be queried as follow:
```sh
./build/bin/ls-sycl-device
```
A example of such log in a system with 1 *intel CPU* and 1 *intel GPU* can look like the following:
```
found 6 SYCL devices:
| | | |Compute |Max compute|Max work|Max sub| |
|ID| Device Type| Name|capability|units |group |group |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]| Intel(R) Arc(TM) A770 Graphics| 1.3| 512| 1024| 32| 16225243136|
| 1|[level_zero:gpu:1]| Intel(R) UHD Graphics 770| 1.3| 32| 512| 32| 53651849216|
| 2| [opencl:gpu:0]| Intel(R) Arc(TM) A770 Graphics| 3.0| 512| 1024| 32| 16225243136|
| 3| [opencl:gpu:1]| Intel(R) UHD Graphics 770| 3.0| 32| 512| 32| 53651849216|
| 4| [opencl:cpu:0]| 13th Gen Intel(R) Core(TM) i7-13700K| 3.0| 24| 8192| 64| 67064815616|
| 5| [opencl:acc:0]| Intel(R) FPGA Emulation Device| 1.2| 24|67108864| 64| 67064815616|
```
| Attribute | Note |
|------------------------|-------------------------------------------------------------|
| compute capability 1.3 | Level-zero driver/runtime, recommended |
| compute capability 3.0 | OpenCL driver/runtime, slower than level-zero in most cases |
4. Launch inference
There are two device selection modes:
- Single device: Use one device target specified by the user.
- Multiple devices: Automatically select the devices with the same largest Max compute-units.
| Device selection | Parameter |
|------------------|----------------------------------------|
| Single device | --split-mode none --main-gpu DEVICE_ID |
| Multiple devices | --split-mode layer (default) |
Examples:
- Use device 0:
```sh
ZES_ENABLE_SYSMAN=1 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm none -mg 0
```
or run by script:
```sh
./examples/sycl/run_llama2.sh 0
```
- Use multiple devices:
```sh
ZES_ENABLE_SYSMAN=1 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm layer
```
Otherwise, you can run the script:
```sh
./examples/sycl/run_llama2.sh
```
*Notes:*
- Upon execution, verify the selected device(s) ID(s) in the output log, which can for instance be displayed as follow:
```sh
detect 1 SYCL GPUs: [0] with top Max compute units:512
```
Or
```sh
use 1 SYCL GPUs: [0] with Max compute units:512
```
## Windows
### I. Setup Environment
1. Install GPU driver
Intel GPU drivers instructions guide and download page can be found here: [Get intel GPU Drivers](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/software/drivers.html).
2. Install Visual Studio
If you already have a recent version of Microsoft Visual Studio, you can skip this step. Otherwise, please refer to the official download page for [Microsoft Visual Studio](https://visualstudio.microsoft.com/).
3. Install Intel® oneAPI Base toolkit
The base toolkit can be obtained from the official [Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html) page.
Please follow the instructions for downloading and installing the Toolkit for Windows, and preferably keep the default installation values unchanged, notably the installation path *(`C:\Program Files (x86)\Intel\oneAPI` by default)*.
Following guidelines/code snippets assume the default installation values. Otherwise, please make sure the necessary changes are reflected where applicable.
b. Enable oneAPI running environment:
- Type "oneAPI" in the search bar, then open the `Intel oneAPI command prompt for Intel 64 for Visual Studio 2022` App.
- On the command prompt, enable the runtime environment with the following:
```
"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64
```
c. Verify installation
In the oneAPI command line, run the following to print the available SYCL devices:
```
sycl-ls
```
There should be one or more *level-zero* GPU devices displayed as **[ext_oneapi_level_zero:gpu]**. Below is example of such output detecting an *intel Iris Xe* GPU as a Level-zero SYCL device:
Output (example):
```
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.10.0.17_160000]
[opencl:cpu:1] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Iris(R) Xe Graphics OpenCL 3.0 NEO [31.0.101.5186]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Iris(R) Xe Graphics 1.3 [1.3.28044]
```
4. Install build tools
a. Download & install cmake for Windows: https://cmake.org/download/
b. Download & install mingw-w64 make for Windows provided by w64devkit
- Download the 1.19.0 version of [w64devkit](https://github.com/skeeto/w64devkit/releases/download/v1.19.0/w64devkit-1.19.0.zip).
- Extract `w64devkit` on your pc.
- Add the **bin** folder path in the Windows system PATH environment (for e.g. `C:\xxx\w64devkit\bin\`).
### II. Build llama.cpp
On the oneAPI command line window, step into the llama.cpp main directory and run the following:
```
@call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force
# Option 1: Use FP32 (recommended for better performance in most cases)
cmake -B build -G "MinGW Makefiles" -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icx -DCMAKE_BUILD_TYPE=Release
# Option 2: Or FP16
cmake -B build -G "MinGW Makefiles" -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icx -DCMAKE_BUILD_TYPE=Release -DLLAMA_SYCL_F16=ON
cmake --build build --config Release -j
```
Otherwise, run the `win-build-sycl.bat` wrapper which encapsulates the former instructions:
```sh
.\examples\sycl\win-build-sycl.bat
```
*Notes:*
- By default, calling `make` will build all target binary files. In case of a minimal experimental setup, the user can build the inference executable only through `make main`.
### III. Run the inference
1. Retrieve and prepare model
You can refer to the general [*Prepare and Quantize*](README#prepare-and-quantize) guide for model prepration, or simply download [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf) model as example.
2. Enable oneAPI running environment
On the oneAPI command line window, run the following and step into the llama.cpp directory:
```
"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64
```
3. List devices information
Similar to the native `sycl-ls`, available SYCL devices can be queried as follow:
```
build\bin\ls-sycl-device.exe
```
The output of this command in a system with 1 *intel CPU* and 1 *intel GPU* would look like the following:
```
found 6 SYCL devices:
| | | |Compute |Max compute|Max work|Max sub| |
|ID| Device Type| Name|capability|units |group |group |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]| Intel(R) Arc(TM) A770 Graphics| 1.3| 512| 1024| 32| 16225243136|
| 1|[level_zero:gpu:1]| Intel(R) UHD Graphics 770| 1.3| 32| 512| 32| 53651849216|
| 2| [opencl:gpu:0]| Intel(R) Arc(TM) A770 Graphics| 3.0| 512| 1024| 32| 16225243136|
| 3| [opencl:gpu:1]| Intel(R) UHD Graphics 770| 3.0| 32| 512| 32| 53651849216|
| 4| [opencl:cpu:0]| 13th Gen Intel(R) Core(TM) i7-13700K| 3.0| 24| 8192| 64| 67064815616|
| 5| [opencl:acc:0]| Intel(R) FPGA Emulation Device| 1.2| 24|67108864| 64| 67064815616|
```
| Attribute | Note |
|------------------------|-----------------------------------------------------------|
| compute capability 1.3 | Level-zero running time, recommended |
| compute capability 3.0 | OpenCL running time, slower than level-zero in most cases |
4. Launch inference
There are two device selection modes:
- Single device: Use one device assigned by user.
- Multiple devices: Automatically choose the devices with the same biggest Max compute units.
| Device selection | Parameter |
|------------------|----------------------------------------|
| Single device | --split-mode none --main-gpu DEVICE_ID |
| Multiple devices | --split-mode layer (default) |
Examples:
- Use device 0:
```
build\bin\main.exe -m models\llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0 -sm none -mg 0
```
- Use multiple devices:
```
build\bin\main.exe -m models\llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0 -sm layer
```
Otherwise, run the following wrapper script:
```
.\examples\sycl\win-run-llama2.bat
```
Note:
- Upon execution, verify the selected device(s) ID(s) in the output log, which can for instance be displayed as follow:
```sh
detect 1 SYCL GPUs: [0] with top Max compute units:512
```
Or
```sh
use 1 SYCL GPUs: [0] with Max compute units:512
```
## Environment Variable
#### Build
| Name | Value | Function |
|--------------------|-----------------------------------|---------------------------------------------|
| LLAMA_SYCL | ON (mandatory) | Enable build with SYCL code path. |
| LLAMA_SYCL_TARGET | INTEL *(default)* \| NVIDIA | Set the SYCL target device type. |
| LLAMA_SYCL_F16 | OFF *(default)* \|ON *(optional)* | Enable FP16 build with SYCL code path. |
| CMAKE_C_COMPILER | icx | Set *icx* compiler for SYCL code path. |
| CMAKE_CXX_COMPILER | icpx *(Linux)*, icx *(Windows)* | Set `icpx/icx` compiler for SYCL code path. |
#### Runtime
| Name | Value | Function |
|-------------------|------------------|---------------------------------------------------------------------------------------------------------------------------|
| GGML_SYCL_DEBUG | 0 (default) or 1 | Enable log function by macro: GGML_SYCL_DEBUG |
| ZES_ENABLE_SYSMAN | 0 (default) or 1 | Support to get free memory of GPU by sycl::aspect::ext_intel_free_memory.<br>Recommended to use when --split-mode = layer |
## Known Issues
- `Split-mode:[row]` is not supported.
## Q&A
- Error: `error while loading shared libraries: libsycl.so.7: cannot open shared object file: No such file or directory`.
- Potential cause: Unavailable oneAPI installation or not set ENV variables.
- Solution: Install *oneAPI base toolkit* and enable its ENV through: `source /opt/intel/oneapi/setvars.sh`.
- General compiler error:
- Remove **build** folder or try a clean-build.
- I can **not** see `[ext_oneapi_level_zero:gpu]` afer installing the GPU driver on Linux.
Please double-check with `sudo sycl-ls`.
If it's present in the list, please add video/render group to your user then **logout/login** or restart your system:
```
sudo usermod -aG render $USER
sudo usermod -aG video $USER
```
Otherwise, please double-check the GPU driver installation steps.
### **GitHub contribution**:
Please add the **[SYCL]** prefix/tag in issues/PRs titles to help the SYCL-team check/address them without delay.
## TODO
- Support row layer split for multiple card runs.
+399 -792
View File
File diff suppressed because it is too large Load Diff
+3 -2
View File
@@ -40,7 +40,8 @@ To protect sensitive data from potential leaks or unauthorized access, it is cru
### Untrusted environments or networks
If you can't run your models in a secure and isolated environment or if it must be exposed to an untrusted network, make sure to take the following security precautions:
* Confirm the hash of any downloaded artifact (e.g. pre-trained model weights) matches a known-good value
* Do not use the RPC backend, [rpc-server](https://github.com/ggml-org/llama.cpp/tree/master/tools/rpc) and [llama-server](https://github.com/ggml-org/llama.cpp/tree/master/tools/server) functionality (see https://github.com/ggml-org/llama.cpp/pull/13061).
* Confirm the hash of any downloaded artifact (e.g. pre-trained model weights) matches a known-good value.
* Encrypt your data if sending it over the network.
### Multi-Tenant environments
@@ -62,6 +63,6 @@ Beware that none of the topics under [Using llama.cpp securely](#using-llamacpp-
<!-- normal version -->
However, If you have discovered a security vulnerability in this project, please report it privately. **Do not disclose it as a public issue.** This gives us time to work with you to fix the issue before public exposure, reducing the chance that the exploit will be used before a patch is released.
Please disclose it as a private [security advisory](https://github.com/ggerganov/llama.cpp/security/advisories/new).
Please disclose it as a private [security advisory](https://github.com/ggml-org/llama.cpp/security/advisories/new).
A team of volunteers on a reasonable-effort basis maintains this project. As such, please give us at least 90 days to work on a fix before public exposure.
+541
View File
@@ -0,0 +1,541 @@
#!/usr/bin/env bash
#
# Options
IOS_MIN_OS_VERSION=16.4
MACOS_MIN_OS_VERSION=13.3
VISIONOS_MIN_OS_VERSION=1.0
TVOS_MIN_OS_VERSION=16.4
BUILD_SHARED_LIBS=OFF
LLAMA_BUILD_EXAMPLES=OFF
LLAMA_BUILD_TOOLS=OFF
LLAMA_BUILD_TESTS=OFF
LLAMA_BUILD_SERVER=OFF
GGML_METAL=ON
GGML_METAL_EMBED_LIBRARY=ON
GGML_BLAS_DEFAULT=ON
GGML_METAL_USE_BF16=ON
GGML_OPENMP=OFF
COMMON_C_FLAGS="-Wno-macro-redefined -Wno-shorten-64-to-32 -Wno-unused-command-line-argument -g"
COMMON_CXX_FLAGS="-Wno-macro-redefined -Wno-shorten-64-to-32 -Wno-unused-command-line-argument -g"
# Common options for all builds
COMMON_CMAKE_ARGS=(
-DCMAKE_XCODE_ATTRIBUTE_CODE_SIGNING_REQUIRED=NO
-DCMAKE_XCODE_ATTRIBUTE_CODE_SIGN_IDENTITY=""
-DCMAKE_XCODE_ATTRIBUTE_CODE_SIGNING_ALLOWED=NO
-DCMAKE_XCODE_ATTRIBUTE_DEBUG_INFORMATION_FORMAT="dwarf-with-dsym"
-DCMAKE_XCODE_ATTRIBUTE_GCC_GENERATE_DEBUGGING_SYMBOLS=YES
-DCMAKE_XCODE_ATTRIBUTE_COPY_PHASE_STRIP=NO
-DCMAKE_XCODE_ATTRIBUTE_STRIP_INSTALLED_PRODUCT=NO
-DCMAKE_XCODE_ATTRIBUTE_DEVELOPMENT_TEAM=ggml
-DBUILD_SHARED_LIBS=${BUILD_SHARED_LIBS}
-DLLAMA_BUILD_EXAMPLES=${LLAMA_BUILD_EXAMPLES}
-DLLAMA_BUILD_TOOLS=${LLAMA_BUILD_TOOLS}
-DLLAMA_BUILD_TESTS=${LLAMA_BUILD_TESTS}
-DLLAMA_BUILD_SERVER=${LLAMA_BUILD_SERVER}
-DGGML_METAL_EMBED_LIBRARY=${GGML_METAL_EMBED_LIBRARY}
-DGGML_BLAS_DEFAULT=${GGML_BLAS_DEFAULT}
-DGGML_METAL=${GGML_METAL}
-DGGML_METAL_USE_BF16=${GGML_METAL_USE_BF16}
-DGGML_NATIVE=OFF
-DGGML_OPENMP=${GGML_OPENMP}
)
XCODE_VERSION=$(xcodebuild -version 2>/dev/null | head -n1 | awk '{ print $2 }')
MAJOR_VERSION=$(echo $XCODE_VERSION | cut -d. -f1)
MINOR_VERSION=$(echo $XCODE_VERSION | cut -d. -f2)
echo "Detected Xcode version: $XCODE_VERSION"
check_required_tool() {
local tool=$1
local install_message=$2
if ! command -v $tool &> /dev/null; then
echo "Error: $tool is required but not found."
echo "$install_message"
exit 1
fi
}
echo "Checking for required tools..."
check_required_tool "cmake" "Please install CMake 3.28.0 or later (brew install cmake)"
check_required_tool "xcodebuild" "Please install Xcode and Xcode Command Line Tools (xcode-select --install)"
check_required_tool "libtool" "Please install libtool which should be available with Xcode Command Line Tools (CLT). Make sure Xcode CLT is installed (xcode-select --install)"
check_required_tool "dsymutil" "Please install Xcode and Xcode Command Line Tools (xcode-select --install)"
set -e
## Clean up previous builds
rm -rf build-apple
rm -rf build-ios-sim
rm -rf build-ios-device
rm -rf build-macos
rm -rf build-visionos
rm -rf build-visionos-sim
rm -rf build-tvos-sim
rm -rf build-tvos-device
# Setup the xcframework build directory structure
setup_framework_structure() {
local build_dir=$1
local min_os_version=$2
local platform=$3 # "ios", "macos", "visionos", or "tvos"
local framework_name="llama"
echo "Creating ${platform}-style framework structure for ${build_dir}"
if [[ "$platform" == "macos" ]]; then
# macOS versioned structure uses versioned directories
mkdir -p ${build_dir}/framework/${framework_name}.framework/Versions/A/Headers
mkdir -p ${build_dir}/framework/${framework_name}.framework/Versions/A/Modules
mkdir -p ${build_dir}/framework/${framework_name}.framework/Versions/A/Resources
# Create symbolic links
ln -sf A ${build_dir}/framework/${framework_name}.framework/Versions/Current
ln -sf Versions/Current/Headers ${build_dir}/framework/${framework_name}.framework/Headers
ln -sf Versions/Current/Modules ${build_dir}/framework/${framework_name}.framework/Modules
ln -sf Versions/Current/Resources ${build_dir}/framework/${framework_name}.framework/Resources
ln -sf Versions/Current/${framework_name} ${build_dir}/framework/${framework_name}.framework/${framework_name}
# Set header and module paths
local header_path=${build_dir}/framework/${framework_name}.framework/Versions/A/Headers/
local module_path=${build_dir}/framework/${framework_name}.framework/Versions/A/Modules/
else
# iOS/VisionOS/tvOS use a flat structure
mkdir -p ${build_dir}/framework/${framework_name}.framework/Headers
mkdir -p ${build_dir}/framework/${framework_name}.framework/Modules
# Remove any existing structure to ensure clean build
rm -rf ${build_dir}/framework/${framework_name}.framework/Versions
# Set header and module paths
local header_path=${build_dir}/framework/${framework_name}.framework/Headers/
local module_path=${build_dir}/framework/${framework_name}.framework/Modules/
fi
# Copy all required headers (common for all platforms)
cp include/llama.h ${header_path}
cp ggml/include/ggml.h ${header_path}
cp ggml/include/ggml-opt.h ${header_path}
cp ggml/include/ggml-alloc.h ${header_path}
cp ggml/include/ggml-backend.h ${header_path}
cp ggml/include/ggml-metal.h ${header_path}
cp ggml/include/ggml-cpu.h ${header_path}
cp ggml/include/ggml-blas.h ${header_path}
cp ggml/include/gguf.h ${header_path}
# Create module map (common for all platforms)
cat > ${module_path}module.modulemap << EOF
framework module llama {
header "llama.h"
header "ggml.h"
header "ggml-alloc.h"
header "ggml-backend.h"
header "ggml-metal.h"
header "ggml-cpu.h"
header "ggml-blas.h"
header "gguf.h"
link "c++"
link framework "Accelerate"
link framework "Metal"
link framework "Foundation"
export *
}
EOF
# Platform-specific settings for Info.plist
local platform_name=""
local sdk_name=""
local supported_platform=""
case "$platform" in
"ios")
platform_name="iphoneos"
sdk_name="iphoneos${min_os_version}"
supported_platform="iPhoneOS"
local plist_path="${build_dir}/framework/${framework_name}.framework/Info.plist"
local device_family=' <key>UIDeviceFamily</key>
<array>
<integer>1</integer>
<integer>2</integer>
</array>'
;;
"macos")
platform_name="macosx"
sdk_name="macosx${min_os_version}"
supported_platform="MacOSX"
local plist_path="${build_dir}/framework/${framework_name}.framework/Versions/A/Resources/Info.plist"
local device_family=""
;;
"visionos")
platform_name="xros"
sdk_name="xros${min_os_version}"
supported_platform="XRPlatform"
local plist_path="${build_dir}/framework/${framework_name}.framework/Info.plist"
local device_family=""
;;
"tvos")
platform_name="appletvos"
sdk_name="appletvos${min_os_version}"
supported_platform="AppleTVOS"
local plist_path="${build_dir}/framework/${framework_name}.framework/Info.plist"
local device_family=' <key>UIDeviceFamily</key>
<array>
<integer>3</integer>
</array>'
;;
esac
# Create Info.plist
cat > ${plist_path} << EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>CFBundleDevelopmentRegion</key>
<string>en</string>
<key>CFBundleExecutable</key>
<string>llama</string>
<key>CFBundleIdentifier</key>
<string>org.ggml.llama</string>
<key>CFBundleInfoDictionaryVersion</key>
<string>6.0</string>
<key>CFBundleName</key>
<string>llama</string>
<key>CFBundlePackageType</key>
<string>FMWK</string>
<key>CFBundleShortVersionString</key>
<string>1.0</string>
<key>CFBundleVersion</key>
<string>1</string>
<key>MinimumOSVersion</key>
<string>${min_os_version}</string>
<key>CFBundleSupportedPlatforms</key>
<array>
<string>${supported_platform}</string>
</array>${device_family}
<key>DTPlatformName</key>
<string>${platform_name}</string>
<key>DTSDKName</key>
<string>${sdk_name}</string>
</dict>
</plist>
EOF
}
# Create dynamic libraries from static libraries.
combine_static_libraries() {
local build_dir="$1"
local release_dir="$2"
local platform="$3" # "ios", "macos", "visionos", or "tvos"
local is_simulator="$4"
local base_dir="$(pwd)"
local framework_name="llama"
# Determine output path based on platform
local output_lib=""
if [[ "$platform" == "macos" ]]; then
# macOS uses versioned structure
output_lib="${build_dir}/framework/${framework_name}.framework/Versions/A/${framework_name}"
else
# iOS, visionOS, and tvOS use a directory flat structure
output_lib="${build_dir}/framework/${framework_name}.framework/${framework_name}"
fi
local libs=(
"${base_dir}/${build_dir}/src/${release_dir}/libllama.a"
"${base_dir}/${build_dir}/ggml/src/${release_dir}/libggml.a"
"${base_dir}/${build_dir}/ggml/src/${release_dir}/libggml-base.a"
"${base_dir}/${build_dir}/ggml/src/${release_dir}/libggml-cpu.a"
"${base_dir}/${build_dir}/ggml/src/ggml-metal/${release_dir}/libggml-metal.a"
"${base_dir}/${build_dir}/ggml/src/ggml-blas/${release_dir}/libggml-blas.a"
)
# Create temporary directory for processing
local temp_dir="${base_dir}/${build_dir}/temp"
mkdir -p "${temp_dir}"
# Since we have multiple architectures libtool will find object files that do not
# match the target architecture. We suppress these warnings.
libtool -static -o "${temp_dir}/combined.a" "${libs[@]}" 2> /dev/null
# Determine SDK, architectures, and install_name based on platform and simulator flag.
local sdk=""
local archs=""
local min_version_flag=""
local install_name=""
case "$platform" in
"ios")
if [[ "$is_simulator" == "true" ]]; then
sdk="iphonesimulator"
archs="arm64 x86_64"
min_version_flag="-mios-simulator-version-min=${IOS_MIN_OS_VERSION}"
else
sdk="iphoneos"
archs="arm64"
min_version_flag="-mios-version-min=${IOS_MIN_OS_VERSION}"
fi
install_name="@rpath/llama.framework/llama"
;;
"macos")
sdk="macosx"
archs="arm64 x86_64"
min_version_flag="-mmacosx-version-min=${MACOS_MIN_OS_VERSION}"
install_name="@rpath/llama.framework/Versions/Current/llama"
;;
"visionos")
if [[ "$is_simulator" == "true" ]]; then
sdk="xrsimulator"
archs="arm64 x86_64"
min_version_flag="-mtargetos=xros${VISIONOS_MIN_OS_VERSION}-simulator"
else
sdk="xros"
archs="arm64"
min_version_flag="-mtargetos=xros${VISIONOS_MIN_OS_VERSION}"
fi
# Use flat structure for visionOS, same as iOS
install_name="@rpath/llama.framework/llama"
;;
"tvos")
if [[ "$is_simulator" == "true" ]]; then
sdk="appletvsimulator"
archs="arm64 x86_64"
min_version_flag="-mtvos-simulator-version-min=${TVOS_MIN_OS_VERSION}"
else
sdk="appletvos"
archs="arm64"
min_version_flag="-mtvos-version-min=${TVOS_MIN_OS_VERSION}"
fi
install_name="@rpath/llama.framework/llama"
;;
esac
# Build architecture flags
local arch_flags=""
for arch in $archs; do
arch_flags+=" -arch $arch"
done
# Create dynamic library
echo "Creating dynamic library for ${platform}."
xcrun -sdk $sdk clang++ -dynamiclib \
-isysroot $(xcrun --sdk $sdk --show-sdk-path) \
$arch_flags \
$min_version_flag \
-Wl,-force_load,"${temp_dir}/combined.a" \
-framework Foundation -framework Metal -framework Accelerate \
-install_name "$install_name" \
-o "${base_dir}/${output_lib}"
# Platform-specific post-processing for device builds
if [[ "$is_simulator" == "false" ]]; then
if command -v xcrun vtool &>/dev/null; then
case "$platform" in
"ios")
echo "Marking binary as a framework binary for iOS..."
xcrun vtool -set-build-version ios ${IOS_MIN_OS_VERSION} ${IOS_MIN_OS_VERSION} -replace \
-output "${base_dir}/${output_lib}" "${base_dir}/${output_lib}"
;;
"visionos")
echo "Marking binary as a framework binary for visionOS..."
if [[ "$MAJOR_VERSION" -gt 16 ]] || [[ "$MAJOR_VERSION" -eq 16 && "$MINOR_VERSION" -gt 2 ]]; then
echo "Xcode version greater than 16.2, using visionOS."
VISION_OS_BUILD_VERSION="visionos"
else
echo "Xcode version less than or equal to 16.2, using xros."
VISION_OS_BUILD_VERSION="xros"
fi
xcrun vtool -set-build-version ${VISION_OS_BUILD_VERSION} ${VISIONOS_MIN_OS_VERSION} ${VISIONOS_MIN_OS_VERSION} -replace \
-output "${base_dir}/${output_lib}" "${base_dir}/${output_lib}"
;;
"tvos")
echo "Marking binary as a framework binary for tvOS..."
xcrun vtool -set-build-version tvos ${TVOS_MIN_OS_VERSION} ${TVOS_MIN_OS_VERSION} -replace \
-output "${base_dir}/${output_lib}" "${base_dir}/${output_lib}"
;;
esac
else
echo "Warning: vtool not found. Binary may not pass App Store validation."
fi
fi
echo "Creating properly formatted dSYM..."
# Create a separate directory for dSYMs for all platforms
mkdir -p "${base_dir}/${build_dir}/dSYMs"
# iOS and visionOS style dSYM (flat structure)
if [[ "$platform" == "ios" || "$platform" == "visionos" || "$platform" == "tvos" ]]; then
# Generate dSYM in the dSYMs directory
xcrun dsymutil "${base_dir}/${output_lib}" -o "${base_dir}/${build_dir}/dSYMs/llama.dSYM"
# Create a copy of the binary that will be stripped
cp "${base_dir}/${output_lib}" "${temp_dir}/binary_to_strip"
# Strip debug symbols from the copy
xcrun strip -S "${temp_dir}/binary_to_strip" -o "${temp_dir}/stripped_lib"
# Replace the original with the stripped version
mv "${temp_dir}/stripped_lib" "${base_dir}/${output_lib}"
else
# macOS style dSYM
# First strip debug info to a separate file
xcrun strip -S "${base_dir}/${output_lib}" -o "${temp_dir}/stripped_lib"
# Generate dSYM in the dSYMs directory
xcrun dsymutil "${base_dir}/${output_lib}" -o "${base_dir}/${build_dir}/dSYMs/llama.dSYM"
# Replace original binary with stripped version
mv "${temp_dir}/stripped_lib" "${base_dir}/${output_lib}"
fi
# Remove any automatically generated dSYM files in the framework structure as they will
# otherwise case Invalid Bundle Structure validation errors.
if [ -d "${base_dir}/${output_lib}.dSYM" ]; then
echo "Removing generated dSYM file in framework structure: ${base_dir}/${output_lib}.dSYM"
rm -rf "${base_dir}/${output_lib}.dSYM"
fi
# Clean up
rm -rf "${temp_dir}"
}
echo "Building for iOS simulator..."
cmake -B build-ios-sim -G Xcode \
"${COMMON_CMAKE_ARGS[@]}" \
-DCMAKE_OSX_DEPLOYMENT_TARGET=${IOS_MIN_OS_VERSION} \
-DIOS=ON \
-DCMAKE_SYSTEM_NAME=iOS \
-DCMAKE_OSX_SYSROOT=iphonesimulator \
-DCMAKE_OSX_ARCHITECTURES="arm64;x86_64" \
-DCMAKE_XCODE_ATTRIBUTE_SUPPORTED_PLATFORMS=iphonesimulator \
-DCMAKE_C_FLAGS="${COMMON_C_FLAGS}" \
-DCMAKE_CXX_FLAGS="${COMMON_CXX_FLAGS}" \
-DLLAMA_CURL=OFF \
-S .
cmake --build build-ios-sim --config Release -- -quiet
echo "Building for iOS devices..."
cmake -B build-ios-device -G Xcode \
"${COMMON_CMAKE_ARGS[@]}" \
-DCMAKE_OSX_DEPLOYMENT_TARGET=${IOS_MIN_OS_VERSION} \
-DCMAKE_OSX_SYSROOT=iphoneos \
-DCMAKE_OSX_ARCHITECTURES="arm64" \
-DCMAKE_XCODE_ATTRIBUTE_SUPPORTED_PLATFORMS=iphoneos \
-DCMAKE_C_FLAGS="${COMMON_C_FLAGS}" \
-DCMAKE_CXX_FLAGS="${COMMON_CXX_FLAGS}" \
-DLLAMA_CURL=OFF \
-S .
cmake --build build-ios-device --config Release -- -quiet
echo "Building for macOS..."
cmake -B build-macos -G Xcode \
"${COMMON_CMAKE_ARGS[@]}" \
-DCMAKE_OSX_DEPLOYMENT_TARGET=${MACOS_MIN_OS_VERSION} \
-DCMAKE_OSX_ARCHITECTURES="arm64;x86_64" \
-DCMAKE_C_FLAGS="${COMMON_C_FLAGS}" \
-DCMAKE_CXX_FLAGS="${COMMON_CXX_FLAGS}" \
-DLLAMA_CURL=OFF \
-S .
cmake --build build-macos --config Release -- -quiet
echo "Building for visionOS..."
cmake -B build-visionos -G Xcode \
"${COMMON_CMAKE_ARGS[@]}" \
-DCMAKE_OSX_DEPLOYMENT_TARGET=${VISIONOS_MIN_OS_VERSION} \
-DCMAKE_OSX_ARCHITECTURES="arm64" \
-DCMAKE_SYSTEM_NAME=visionOS \
-DCMAKE_OSX_SYSROOT=xros \
-DCMAKE_XCODE_ATTRIBUTE_SUPPORTED_PLATFORMS=xros \
-DCMAKE_C_FLAGS="-D_XOPEN_SOURCE=700 ${COMMON_C_FLAGS}" \
-DCMAKE_CXX_FLAGS="-D_XOPEN_SOURCE=700 ${COMMON_CXX_FLAGS}" \
-DLLAMA_CURL=OFF \
-S .
cmake --build build-visionos --config Release -- -quiet
echo "Building for visionOS simulator..."
cmake -B build-visionos-sim -G Xcode \
"${COMMON_CMAKE_ARGS[@]}" \
-DCMAKE_OSX_DEPLOYMENT_TARGET=${VISIONOS_MIN_OS_VERSION} \
-DCMAKE_OSX_ARCHITECTURES="arm64;x86_64" \
-DCMAKE_SYSTEM_NAME=visionOS \
-DCMAKE_OSX_SYSROOT=xrsimulator \
-DCMAKE_XCODE_ATTRIBUTE_SUPPORTED_PLATFORMS=xrsimulator \
-DCMAKE_C_FLAGS="-D_XOPEN_SOURCE=700 ${COMMON_C_FLAGS}" \
-DCMAKE_CXX_FLAGS="-D_XOPEN_SOURCE=700 ${COMMON_CXX_FLAGS}" \
-DLLAMA_CURL=OFF \
-S .
cmake --build build-visionos-sim --config Release -- -quiet
# Add tvOS builds (might need the same u_int definitions as watchOS and visionOS)
echo "Building for tvOS simulator..."
cmake -B build-tvos-sim -G Xcode \
"${COMMON_CMAKE_ARGS[@]}" \
-DCMAKE_OSX_DEPLOYMENT_TARGET=${TVOS_MIN_OS_VERSION} \
-DCMAKE_SYSTEM_NAME=tvOS \
-DCMAKE_OSX_SYSROOT=appletvsimulator \
-DCMAKE_OSX_ARCHITECTURES="arm64;x86_64" \
-DGGML_METAL=ON \
-DCMAKE_XCODE_ATTRIBUTE_SUPPORTED_PLATFORMS=appletvsimulator \
-DCMAKE_C_FLAGS="${COMMON_C_FLAGS}" \
-DCMAKE_CXX_FLAGS="${COMMON_CXX_FLAGS}" \
-DLLAMA_CURL=OFF \
-S .
cmake --build build-tvos-sim --config Release -- -quiet
echo "Building for tvOS devices..."
cmake -B build-tvos-device -G Xcode \
"${COMMON_CMAKE_ARGS[@]}" \
-DCMAKE_OSX_DEPLOYMENT_TARGET=${TVOS_MIN_OS_VERSION} \
-DCMAKE_SYSTEM_NAME=tvOS \
-DCMAKE_OSX_SYSROOT=appletvos \
-DCMAKE_OSX_ARCHITECTURES="arm64" \
-DGGML_METAL=ON \
-DCMAKE_XCODE_ATTRIBUTE_SUPPORTED_PLATFORMS=appletvos \
-DCMAKE_C_FLAGS="${COMMON_C_FLAGS}" \
-DCMAKE_CXX_FLAGS="${COMMON_CXX_FLAGS}" \
-DLLAMA_CURL=OFF \
-S .
cmake --build build-tvos-device --config Release -- -quiet
# Setup frameworks and copy binaries and headers
echo "Setting up framework structures..."
setup_framework_structure "build-ios-sim" ${IOS_MIN_OS_VERSION} "ios"
setup_framework_structure "build-ios-device" ${IOS_MIN_OS_VERSION} "ios"
setup_framework_structure "build-macos" ${MACOS_MIN_OS_VERSION} "macos"
setup_framework_structure "build-visionos" ${VISIONOS_MIN_OS_VERSION} "visionos"
setup_framework_structure "build-visionos-sim" ${VISIONOS_MIN_OS_VERSION} "visionos"
setup_framework_structure "build-tvos-sim" ${TVOS_MIN_OS_VERSION} "tvos"
setup_framework_structure "build-tvos-device" ${TVOS_MIN_OS_VERSION} "tvos"
# Create dynamic libraries from static libraries
echo "Creating dynamic libraries from static libraries..."
combine_static_libraries "build-ios-sim" "Release-iphonesimulator" "ios" "true"
combine_static_libraries "build-ios-device" "Release-iphoneos" "ios" "false"
combine_static_libraries "build-macos" "Release" "macos" "false"
combine_static_libraries "build-visionos" "Release-xros" "visionos" "false"
combine_static_libraries "build-visionos-sim" "Release-xrsimulator" "visionos" "true"
combine_static_libraries "build-tvos-sim" "Release-appletvsimulator" "tvos" "true"
combine_static_libraries "build-tvos-device" "Release-appletvos" "tvos" "false"
# Create XCFramework with correct debug symbols paths
echo "Creating XCFramework..."
xcodebuild -create-xcframework \
-framework $(pwd)/build-ios-sim/framework/llama.framework \
-debug-symbols $(pwd)/build-ios-sim/dSYMs/llama.dSYM \
-framework $(pwd)/build-ios-device/framework/llama.framework \
-debug-symbols $(pwd)/build-ios-device/dSYMs/llama.dSYM \
-framework $(pwd)/build-macos/framework/llama.framework \
-debug-symbols $(pwd)/build-macos/dSYMS/llama.dSYM \
-framework $(pwd)/build-visionos/framework/llama.framework \
-debug-symbols $(pwd)/build-visionos/dSYMs/llama.dSYM \
-framework $(pwd)/build-visionos-sim/framework/llama.framework \
-debug-symbols $(pwd)/build-visionos-sim/dSYMs/llama.dSYM \
-framework $(pwd)/build-tvos-device/framework/llama.framework \
-debug-symbols $(pwd)/build-tvos-device/dSYMs/llama.dSYM \
-framework $(pwd)/build-tvos-sim/framework/llama.framework \
-debug-symbols $(pwd)/build-tvos-sim/dSYMs/llama.dSYM \
-output $(pwd)/build-apple/llama.xcframework
+41 -2
View File
@@ -1,11 +1,11 @@
# CI
In addition to [Github Actions](https://github.com/ggerganov/llama.cpp/actions) `llama.cpp` uses a custom CI framework:
In addition to [Github Actions](https://github.com/ggml-org/llama.cpp/actions) `llama.cpp` uses a custom CI framework:
https://github.com/ggml-org/ci
It monitors the `master` branch for new commits and runs the
[ci/run.sh](https://github.com/ggerganov/llama.cpp/blob/master/ci/run.sh) script on dedicated cloud instances. This allows us
[ci/run.sh](https://github.com/ggml-org/llama.cpp/blob/master/ci/run.sh) script on dedicated cloud instances. This allows us
to execute heavier workloads compared to just using Github Actions. Also with time, the cloud instances will be scaled
to cover various hardware architectures, including GPU and Apple Silicon instances.
@@ -26,4 +26,43 @@ GG_BUILD_CUDA=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
# with SYCL support
source /opt/intel/oneapi/setvars.sh
GG_BUILD_SYCL=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
# with MUSA support
GG_BUILD_MUSA=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
```
## Running MUSA CI in a Docker Container
Assuming `$PWD` is the root of the `llama.cpp` repository, follow these steps to set up and run MUSA CI in a Docker container:
### 1. Create a local directory to store cached models, configuration files and venv:
```bash
mkdir -p $HOME/llama.cpp/ci-cache
```
### 2. Create a local directory to store CI run results:
```bash
mkdir -p $HOME/llama.cpp/ci-results
```
### 3. Start a Docker container and run the CI:
```bash
docker run --privileged -it \
-v $HOME/llama.cpp/ci-cache:/ci-cache \
-v $HOME/llama.cpp/ci-results:/ci-results \
-v $PWD:/ws -w /ws \
mthreads/musa:rc4.2.0-devel-ubuntu22.04-amd64
```
Inside the container, execute the following commands:
```bash
apt update -y && apt install -y bc cmake ccache git python3.10-venv time unzip wget
git config --global --add safe.directory /ws
GG_BUILD_MUSA=1 bash ./ci/run.sh /ci-results /ci-cache
```
This setup ensures that the CI runs within an isolated Docker environment while maintaining cached files and results across runs.
+281 -142
View File
@@ -1,4 +1,4 @@
#/bin/bash
#!/usr/bin/env bash
#
# sample usage:
#
@@ -13,6 +13,15 @@
# # with SYCL support
# GG_BUILD_SYCL=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
#
# # with VULKAN support
# GG_BUILD_VULKAN=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
#
# # with WebGPU support
# GG_BUILD_WEBGPU=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
#
# # with MUSA support
# GG_BUILD_MUSA=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
#
if [ -z "$2" ]; then
echo "usage: $0 <output-dir> <mnt-dir>"
@@ -33,14 +42,27 @@ sd=`dirname $0`
cd $sd/../
SRC=`pwd`
CMAKE_EXTRA="-DLLAMA_FATAL_WARNINGS=ON"
CMAKE_EXTRA="-DLLAMA_FATAL_WARNINGS=ON -DLLAMA_CURL=ON"
if [ ! -z ${GG_BUILD_METAL} ]; then
CMAKE_EXTRA="${CMAKE_EXTRA} -DLLAMA_METAL_SHADER_DEBUG=ON"
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_METAL=ON -DGGML_METAL_USE_BF16=ON"
fi
if [ ! -z ${GG_BUILD_CUDA} ]; then
CMAKE_EXTRA="${CMAKE_EXTRA} -DLLAMA_CUDA=1"
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_CUDA=ON"
if command -v nvidia-smi >/dev/null 2>&1; then
CUDA_ARCH=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader,nounits 2>/dev/null | head -1 | tr -d '.')
if [[ -n "$CUDA_ARCH" && "$CUDA_ARCH" =~ ^[0-9]+$ ]]; then
CMAKE_EXTRA="${CMAKE_EXTRA} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH}"
else
echo "Warning: Using fallback CUDA architectures"
CMAKE_EXTRA="${CMAKE_EXTRA} -DCMAKE_CUDA_ARCHITECTURES=61;70;75;80;86;89"
fi
else
echo "Error: nvidia-smi not found, cannot build with CUDA"
exit 1
fi
fi
if [ ! -z ${GG_BUILD_SYCL} ]; then
@@ -49,8 +71,27 @@ if [ ! -z ${GG_BUILD_SYCL} ]; then
echo "source /opt/intel/oneapi/setvars.sh"
exit 1
fi
# Use only main GPU
export ONEAPI_DEVICE_SELECTOR="level_zero:0"
# Enable sysman for correct memory reporting
export ZES_ENABLE_SYSMAN=1
# to circumvent precision issues on CPY operations
export SYCL_PROGRAM_COMPILE_OPTIONS="-cl-fp32-correctly-rounded-divide-sqrt"
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_SYCL=1 -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_SYCL_F16=ON"
fi
CMAKE_EXTRA="${CMAKE_EXTRA} -DLLAMA_SYCL=1 DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_SYCL_F16=ON"
if [ ! -z ${GG_BUILD_VULKAN} ]; then
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_VULKAN=1"
fi
if [ ! -z ${GG_BUILD_WEBGPU} ]; then
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_WEBGPU=1"
fi
if [ ! -z ${GG_BUILD_MUSA} ]; then
# Use qy1 by default (MTT S80)
MUSA_ARCH=${MUSA_ARCH:-21}
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_MUSA=ON -DMUSA_ARCHITECTURES=${MUSA_ARCH}"
fi
## helpers
@@ -103,8 +144,11 @@ function gg_run_ctest_debug {
set -e
# Check cmake, make and ctest are installed
gg_check_build_requirements
(time cmake -DCMAKE_BUILD_TYPE=Debug ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time make -j ) 2>&1 | tee -a $OUT/${ci}-make.log
(time make -j$(nproc) ) 2>&1 | tee -a $OUT/${ci}-make.log
(time ctest --output-on-failure -L main -E test-opt ) 2>&1 | tee -a $OUT/${ci}-ctest.log
@@ -131,8 +175,11 @@ function gg_run_ctest_release {
set -e
# Check cmake, make and ctest are installed
gg_check_build_requirements
(time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time make -j ) 2>&1 | tee -a $OUT/${ci}-make.log
(time make -j$(nproc) ) 2>&1 | tee -a $OUT/${ci}-make.log
if [ -z ${GG_BUILD_LOW_PERF} ]; then
(time ctest --output-on-failure -L main ) 2>&1 | tee -a $OUT/${ci}-ctest.log
@@ -160,8 +207,8 @@ function gg_run_test_scripts_debug {
set -e
(cd ./examples/gguf-split && time bash tests.sh "$SRC/build-ci-debug/bin" "$MNT/models") 2>&1 | tee -a $OUT/${ci}-scripts.log
(cd ./examples/quantize && time bash tests.sh "$SRC/build-ci-debug/bin" "$MNT/models") 2>&1 | tee -a $OUT/${ci}-scripts.log
(cd ./tools/gguf-split && time bash tests.sh "$SRC/build-ci-debug/bin" "$MNT/models") 2>&1 | tee -a $OUT/${ci}-scripts.log
(cd ./tools/quantize && time bash tests.sh "$SRC/build-ci-debug/bin" "$MNT/models") 2>&1 | tee -a $OUT/${ci}-scripts.log
set +e
}
@@ -184,8 +231,8 @@ function gg_run_test_scripts_release {
set -e
(cd ./examples/gguf-split && time bash tests.sh "$SRC/build-ci-release/bin" "$MNT/models") 2>&1 | tee -a $OUT/${ci}-scripts.log
(cd ./examples/quantize && time bash tests.sh "$SRC/build-ci-release/bin" "$MNT/models") 2>&1 | tee -a $OUT/${ci}-scripts.log
(cd ./tools/gguf-split && time bash tests.sh "$SRC/build-ci-release/bin" "$MNT/models") 2>&1 | tee -a $OUT/${ci}-scripts.log
(cd ./tools/quantize && time bash tests.sh "$SRC/build-ci-release/bin" "$MNT/models") 2>&1 | tee -a $OUT/${ci}-scripts.log
set +e
}
@@ -260,7 +307,6 @@ function gg_sum_ctest_with_model_release {
}
# open_llama_7b_v2
# requires: GG_BUILD_CUDA
function gg_run_open_llama_7b_v2 {
cd ${SRC}
@@ -284,10 +330,10 @@ function gg_run_open_llama_7b_v2 {
set -e
(time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} -DLLAMA_CUDA=1 .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time make -j ) 2>&1 | tee -a $OUT/${ci}-make.log
(time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time make -j$(nproc) ) 2>&1 | tee -a $OUT/${ci}-make.log
python3 ../examples/convert-legacy-llama.py ${path_models} --outfile ${path_models}/ggml-model-f16.gguf
python3 ../examples/convert_legacy_llama.py ${path_models} --outfile ${path_models}/ggml-model-f16.gguf
model_f16="${path_models}/ggml-model-f16.gguf"
model_q8_0="${path_models}/ggml-model-q8_0.gguf"
@@ -303,47 +349,47 @@ function gg_run_open_llama_7b_v2 {
wiki_test="${path_wiki}/wiki.test.raw"
./bin/quantize ${model_f16} ${model_q8_0} q8_0
./bin/quantize ${model_f16} ${model_q4_0} q4_0
./bin/quantize ${model_f16} ${model_q4_1} q4_1
./bin/quantize ${model_f16} ${model_q5_0} q5_0
./bin/quantize ${model_f16} ${model_q5_1} q5_1
./bin/quantize ${model_f16} ${model_q2_k} q2_k
./bin/quantize ${model_f16} ${model_q3_k} q3_k
./bin/quantize ${model_f16} ${model_q4_k} q4_k
./bin/quantize ${model_f16} ${model_q5_k} q5_k
./bin/quantize ${model_f16} ${model_q6_k} q6_k
./bin/llama-quantize ${model_f16} ${model_q8_0} q8_0
./bin/llama-quantize ${model_f16} ${model_q4_0} q4_0
./bin/llama-quantize ${model_f16} ${model_q4_1} q4_1
./bin/llama-quantize ${model_f16} ${model_q5_0} q5_0
./bin/llama-quantize ${model_f16} ${model_q5_1} q5_1
./bin/llama-quantize ${model_f16} ${model_q2_k} q2_k
./bin/llama-quantize ${model_f16} ${model_q3_k} q3_k
./bin/llama-quantize ${model_f16} ${model_q4_k} q4_k
./bin/llama-quantize ${model_f16} ${model_q5_k} q5_k
./bin/llama-quantize ${model_f16} ${model_q6_k} q6_k
(time ./bin/main --model ${model_f16} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-f16.log
(time ./bin/main --model ${model_q8_0} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q8_0.log
(time ./bin/main --model ${model_q4_0} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q4_0.log
(time ./bin/main --model ${model_q4_1} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q4_1.log
(time ./bin/main --model ${model_q5_0} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q5_0.log
(time ./bin/main --model ${model_q5_1} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q5_1.log
(time ./bin/main --model ${model_q2_k} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q2_k.log
(time ./bin/main --model ${model_q3_k} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q3_k.log
(time ./bin/main --model ${model_q4_k} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q4_k.log
(time ./bin/main --model ${model_q5_k} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q5_k.log
(time ./bin/main --model ${model_q6_k} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q6_k.log
(time ./bin/llama-cli -no-cnv --model ${model_f16} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-f16.log
(time ./bin/llama-cli -no-cnv --model ${model_q8_0} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q8_0.log
(time ./bin/llama-cli -no-cnv --model ${model_q4_0} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q4_0.log
(time ./bin/llama-cli -no-cnv --model ${model_q4_1} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q4_1.log
(time ./bin/llama-cli -no-cnv --model ${model_q5_0} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q5_0.log
(time ./bin/llama-cli -no-cnv --model ${model_q5_1} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q5_1.log
(time ./bin/llama-cli -no-cnv --model ${model_q2_k} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q2_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q3_k} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q3_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q4_k} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q4_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q5_k} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q5_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q6_k} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q6_k.log
(time ./bin/perplexity --model ${model_f16} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-f16.log
(time ./bin/perplexity --model ${model_q8_0} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q8_0.log
(time ./bin/perplexity --model ${model_q4_0} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q4_0.log
(time ./bin/perplexity --model ${model_q4_1} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q4_1.log
(time ./bin/perplexity --model ${model_q5_0} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q5_0.log
(time ./bin/perplexity --model ${model_q5_1} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q5_1.log
(time ./bin/perplexity --model ${model_q2_k} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q2_k.log
(time ./bin/perplexity --model ${model_q3_k} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q3_k.log
(time ./bin/perplexity --model ${model_q4_k} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q4_k.log
(time ./bin/perplexity --model ${model_q5_k} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q5_k.log
(time ./bin/perplexity --model ${model_q6_k} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q6_k.log
(time ./bin/llama-perplexity --model ${model_f16} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-f16.log
(time ./bin/llama-perplexity --model ${model_q8_0} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q8_0.log
(time ./bin/llama-perplexity --model ${model_q4_0} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q4_0.log
(time ./bin/llama-perplexity --model ${model_q4_1} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q4_1.log
(time ./bin/llama-perplexity --model ${model_q5_0} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q5_0.log
(time ./bin/llama-perplexity --model ${model_q5_1} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q5_1.log
(time ./bin/llama-perplexity --model ${model_q2_k} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q2_k.log
(time ./bin/llama-perplexity --model ${model_q3_k} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q3_k.log
(time ./bin/llama-perplexity --model ${model_q4_k} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q4_k.log
(time ./bin/llama-perplexity --model ${model_q5_k} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q5_k.log
(time ./bin/llama-perplexity --model ${model_q6_k} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q6_k.log
(time ./bin/imatrix --model ${model_f16} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-imatrix.log
(time ./bin/llama-imatrix --model ${model_f16} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-imatrix.log
(time ./bin/save-load-state -ngl 10 --model ${model_q4_0} ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/save-load-state -fa -ngl 10 --model ${model_q4_0} ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/save-load-state -ngl 99 --model ${model_q4_0} ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/save-load-state -fa -ngl 99 --model ${model_q4_0} ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 10 -c 0 ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 10 -c 0 -fa ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 99 -c 0 ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 99 -c 0 -fa ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
function check_ppl {
qnt="$1"
@@ -419,9 +465,9 @@ function gg_run_pythia_1_4b {
set -e
(time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time make -j ) 2>&1 | tee -a $OUT/${ci}-make.log
(time make -j$(nproc) ) 2>&1 | tee -a $OUT/${ci}-make.log
python3 ../convert-hf-to-gguf.py ${path_models} --outfile ${path_models}/ggml-model-f16.gguf
python3 ../convert_hf_to_gguf.py ${path_models} --outfile ${path_models}/ggml-model-f16.gguf
model_f16="${path_models}/ggml-model-f16.gguf"
model_q8_0="${path_models}/ggml-model-q8_0.gguf"
@@ -437,45 +483,45 @@ function gg_run_pythia_1_4b {
wiki_test_60="${path_wiki}/wiki.test-60.raw"
./bin/quantize ${model_f16} ${model_q8_0} q8_0
./bin/quantize ${model_f16} ${model_q4_0} q4_0
./bin/quantize ${model_f16} ${model_q4_1} q4_1
./bin/quantize ${model_f16} ${model_q5_0} q5_0
./bin/quantize ${model_f16} ${model_q5_1} q5_1
./bin/quantize ${model_f16} ${model_q2_k} q2_k
./bin/quantize ${model_f16} ${model_q3_k} q3_k
./bin/quantize ${model_f16} ${model_q4_k} q4_k
./bin/quantize ${model_f16} ${model_q5_k} q5_k
./bin/quantize ${model_f16} ${model_q6_k} q6_k
./bin/llama-quantize ${model_f16} ${model_q8_0} q8_0
./bin/llama-quantize ${model_f16} ${model_q4_0} q4_0
./bin/llama-quantize ${model_f16} ${model_q4_1} q4_1
./bin/llama-quantize ${model_f16} ${model_q5_0} q5_0
./bin/llama-quantize ${model_f16} ${model_q5_1} q5_1
./bin/llama-quantize ${model_f16} ${model_q2_k} q2_k
./bin/llama-quantize ${model_f16} ${model_q3_k} q3_k
./bin/llama-quantize ${model_f16} ${model_q4_k} q4_k
./bin/llama-quantize ${model_f16} ${model_q5_k} q5_k
./bin/llama-quantize ${model_f16} ${model_q6_k} q6_k
(time ./bin/main --model ${model_f16} -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-f16.log
(time ./bin/main --model ${model_q8_0} -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q8_0.log
(time ./bin/main --model ${model_q4_0} -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q4_0.log
(time ./bin/main --model ${model_q4_1} -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q4_1.log
(time ./bin/main --model ${model_q5_0} -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q5_0.log
(time ./bin/main --model ${model_q5_1} -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q5_1.log
(time ./bin/main --model ${model_q2_k} -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q2_k.log
(time ./bin/main --model ${model_q3_k} -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q3_k.log
(time ./bin/main --model ${model_q4_k} -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q4_k.log
(time ./bin/main --model ${model_q5_k} -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q5_k.log
(time ./bin/main --model ${model_q6_k} -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q6_k.log
(time ./bin/llama-cli -no-cnv --model ${model_f16} -ngl 99 -c 0 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-f16.log
(time ./bin/llama-cli -no-cnv --model ${model_q8_0} -ngl 99 -c 0 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q8_0.log
(time ./bin/llama-cli -no-cnv --model ${model_q4_0} -ngl 99 -c 0 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q4_0.log
(time ./bin/llama-cli -no-cnv --model ${model_q4_1} -ngl 99 -c 0 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q4_1.log
(time ./bin/llama-cli -no-cnv --model ${model_q5_0} -ngl 99 -c 0 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q5_0.log
(time ./bin/llama-cli -no-cnv --model ${model_q5_1} -ngl 99 -c 0 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q5_1.log
(time ./bin/llama-cli -no-cnv --model ${model_q2_k} -ngl 99 -c 0 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q2_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q3_k} -ngl 99 -c 0 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q3_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q4_k} -ngl 99 -c 0 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q4_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q5_k} -ngl 99 -c 0 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q5_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q6_k} -ngl 99 -c 0 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q6_k.log
(time ./bin/perplexity --model ${model_f16} -f ${wiki_test_60} -c 128 -b 128 --chunks 1 ) 2>&1 | tee -a $OUT/${ci}-tg-f16.log
(time ./bin/perplexity --model ${model_q8_0} -f ${wiki_test_60} -c 128 -b 128 --chunks 1 ) 2>&1 | tee -a $OUT/${ci}-tg-q8_0.log
(time ./bin/perplexity --model ${model_q4_0} -f ${wiki_test_60} -c 128 -b 128 --chunks 1 ) 2>&1 | tee -a $OUT/${ci}-tg-q4_0.log
(time ./bin/perplexity --model ${model_q4_1} -f ${wiki_test_60} -c 128 -b 128 --chunks 1 ) 2>&1 | tee -a $OUT/${ci}-tg-q4_1.log
(time ./bin/perplexity --model ${model_q5_0} -f ${wiki_test_60} -c 128 -b 128 --chunks 1 ) 2>&1 | tee -a $OUT/${ci}-tg-q5_0.log
(time ./bin/perplexity --model ${model_q5_1} -f ${wiki_test_60} -c 128 -b 128 --chunks 1 ) 2>&1 | tee -a $OUT/${ci}-tg-q5_1.log
(time ./bin/perplexity --model ${model_q2_k} -f ${wiki_test_60} -c 128 -b 128 --chunks 1 ) 2>&1 | tee -a $OUT/${ci}-tg-q2_k.log
(time ./bin/perplexity --model ${model_q3_k} -f ${wiki_test_60} -c 128 -b 128 --chunks 1 ) 2>&1 | tee -a $OUT/${ci}-tg-q3_k.log
(time ./bin/perplexity --model ${model_q4_k} -f ${wiki_test_60} -c 128 -b 128 --chunks 1 ) 2>&1 | tee -a $OUT/${ci}-tg-q4_k.log
(time ./bin/perplexity --model ${model_q5_k} -f ${wiki_test_60} -c 128 -b 128 --chunks 1 ) 2>&1 | tee -a $OUT/${ci}-tg-q5_k.log
(time ./bin/perplexity --model ${model_q6_k} -f ${wiki_test_60} -c 128 -b 128 --chunks 1 ) 2>&1 | tee -a $OUT/${ci}-tg-q6_k.log
(time ./bin/llama-perplexity --model ${model_f16} -f ${wiki_test_60} -ngl 99 -c 128 -b 128 --chunks 1 ) 2>&1 | tee -a $OUT/${ci}-tg-f16.log
(time ./bin/llama-perplexity --model ${model_q8_0} -f ${wiki_test_60} -ngl 99 -c 128 -b 128 --chunks 1 ) 2>&1 | tee -a $OUT/${ci}-tg-q8_0.log
(time ./bin/llama-perplexity --model ${model_q4_0} -f ${wiki_test_60} -ngl 99 -c 128 -b 128 --chunks 1 ) 2>&1 | tee -a $OUT/${ci}-tg-q4_0.log
(time ./bin/llama-perplexity --model ${model_q4_1} -f ${wiki_test_60} -ngl 99 -c 128 -b 128 --chunks 1 ) 2>&1 | tee -a $OUT/${ci}-tg-q4_1.log
(time ./bin/llama-perplexity --model ${model_q5_0} -f ${wiki_test_60} -ngl 99 -c 128 -b 128 --chunks 1 ) 2>&1 | tee -a $OUT/${ci}-tg-q5_0.log
(time ./bin/llama-perplexity --model ${model_q5_1} -f ${wiki_test_60} -ngl 99 -c 128 -b 128 --chunks 1 ) 2>&1 | tee -a $OUT/${ci}-tg-q5_1.log
(time ./bin/llama-perplexity --model ${model_q2_k} -f ${wiki_test_60} -ngl 99 -c 128 -b 128 --chunks 1 ) 2>&1 | tee -a $OUT/${ci}-tg-q2_k.log
(time ./bin/llama-perplexity --model ${model_q3_k} -f ${wiki_test_60} -ngl 99 -c 128 -b 128 --chunks 1 ) 2>&1 | tee -a $OUT/${ci}-tg-q3_k.log
(time ./bin/llama-perplexity --model ${model_q4_k} -f ${wiki_test_60} -ngl 99 -c 128 -b 128 --chunks 1 ) 2>&1 | tee -a $OUT/${ci}-tg-q4_k.log
(time ./bin/llama-perplexity --model ${model_q5_k} -f ${wiki_test_60} -ngl 99 -c 128 -b 128 --chunks 1 ) 2>&1 | tee -a $OUT/${ci}-tg-q5_k.log
(time ./bin/llama-perplexity --model ${model_q6_k} -f ${wiki_test_60} -ngl 99 -c 128 -b 128 --chunks 1 ) 2>&1 | tee -a $OUT/${ci}-tg-q6_k.log
(time ./bin/imatrix --model ${model_f16} -f ${wiki_test_60} -c 128 -b 128 --chunks 1 ) 2>&1 | tee -a $OUT/${ci}-imatrix.log
(time ./bin/llama-imatrix --model ${model_f16} -f ${wiki_test_60} -ngl 99 -c 128 -b 128 --chunks 1 ) 2>&1 | tee -a $OUT/${ci}-imatrix.log
(time ./bin/save-load-state --model ${model_q4_0} ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/save-load-state -fa --model ${model_q4_0} ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 99 -c 0 ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 99 -c 0 -fa ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
function check_ppl {
qnt="$1"
@@ -529,7 +575,6 @@ function gg_sum_pythia_1_4b {
}
# pythia_2_8b
# requires: GG_BUILD_CUDA
function gg_run_pythia_2_8b {
cd ${SRC}
@@ -550,10 +595,10 @@ function gg_run_pythia_2_8b {
set -e
(time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} -DLLAMA_CUDA=1 .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time make -j ) 2>&1 | tee -a $OUT/${ci}-make.log
(time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time make -j$(nproc) ) 2>&1 | tee -a $OUT/${ci}-make.log
python3 ../convert-hf-to-gguf.py ${path_models} --outfile ${path_models}/ggml-model-f16.gguf
python3 ../convert_hf_to_gguf.py ${path_models} --outfile ${path_models}/ggml-model-f16.gguf
model_f16="${path_models}/ggml-model-f16.gguf"
model_q8_0="${path_models}/ggml-model-q8_0.gguf"
@@ -569,47 +614,47 @@ function gg_run_pythia_2_8b {
wiki_test="${path_wiki}/wiki.test.raw"
./bin/quantize ${model_f16} ${model_q8_0} q8_0
./bin/quantize ${model_f16} ${model_q4_0} q4_0
./bin/quantize ${model_f16} ${model_q4_1} q4_1
./bin/quantize ${model_f16} ${model_q5_0} q5_0
./bin/quantize ${model_f16} ${model_q5_1} q5_1
./bin/quantize ${model_f16} ${model_q2_k} q2_k
./bin/quantize ${model_f16} ${model_q3_k} q3_k
./bin/quantize ${model_f16} ${model_q4_k} q4_k
./bin/quantize ${model_f16} ${model_q5_k} q5_k
./bin/quantize ${model_f16} ${model_q6_k} q6_k
./bin/llama-quantize ${model_f16} ${model_q8_0} q8_0
./bin/llama-quantize ${model_f16} ${model_q4_0} q4_0
./bin/llama-quantize ${model_f16} ${model_q4_1} q4_1
./bin/llama-quantize ${model_f16} ${model_q5_0} q5_0
./bin/llama-quantize ${model_f16} ${model_q5_1} q5_1
./bin/llama-quantize ${model_f16} ${model_q2_k} q2_k
./bin/llama-quantize ${model_f16} ${model_q3_k} q3_k
./bin/llama-quantize ${model_f16} ${model_q4_k} q4_k
./bin/llama-quantize ${model_f16} ${model_q5_k} q5_k
./bin/llama-quantize ${model_f16} ${model_q6_k} q6_k
(time ./bin/main --model ${model_f16} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-f16.log
(time ./bin/main --model ${model_q8_0} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q8_0.log
(time ./bin/main --model ${model_q4_0} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q4_0.log
(time ./bin/main --model ${model_q4_1} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q4_1.log
(time ./bin/main --model ${model_q5_0} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q5_0.log
(time ./bin/main --model ${model_q5_1} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q5_1.log
(time ./bin/main --model ${model_q2_k} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q2_k.log
(time ./bin/main --model ${model_q3_k} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q3_k.log
(time ./bin/main --model ${model_q4_k} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q4_k.log
(time ./bin/main --model ${model_q5_k} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q5_k.log
(time ./bin/main --model ${model_q6_k} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q6_k.log
(time ./bin/llama-cli -no-cnv --model ${model_f16} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-f16.log
(time ./bin/llama-cli -no-cnv --model ${model_q8_0} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q8_0.log
(time ./bin/llama-cli -no-cnv --model ${model_q4_0} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q4_0.log
(time ./bin/llama-cli -no-cnv --model ${model_q4_1} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q4_1.log
(time ./bin/llama-cli -no-cnv --model ${model_q5_0} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q5_0.log
(time ./bin/llama-cli -no-cnv --model ${model_q5_1} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q5_1.log
(time ./bin/llama-cli -no-cnv --model ${model_q2_k} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q2_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q3_k} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q3_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q4_k} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q4_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q5_k} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q5_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q6_k} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q6_k.log
(time ./bin/perplexity --model ${model_f16} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-f16.log
(time ./bin/perplexity --model ${model_q8_0} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q8_0.log
(time ./bin/perplexity --model ${model_q4_0} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q4_0.log
(time ./bin/perplexity --model ${model_q4_1} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q4_1.log
(time ./bin/perplexity --model ${model_q5_0} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q5_0.log
(time ./bin/perplexity --model ${model_q5_1} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q5_1.log
(time ./bin/perplexity --model ${model_q2_k} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q2_k.log
(time ./bin/perplexity --model ${model_q3_k} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q3_k.log
(time ./bin/perplexity --model ${model_q4_k} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q4_k.log
(time ./bin/perplexity --model ${model_q5_k} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q5_k.log
(time ./bin/perplexity --model ${model_q6_k} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q6_k.log
(time ./bin/llama-perplexity --model ${model_f16} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-f16.log
(time ./bin/llama-perplexity --model ${model_q8_0} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q8_0.log
(time ./bin/llama-perplexity --model ${model_q4_0} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q4_0.log
(time ./bin/llama-perplexity --model ${model_q4_1} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q4_1.log
(time ./bin/llama-perplexity --model ${model_q5_0} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q5_0.log
(time ./bin/llama-perplexity --model ${model_q5_1} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q5_1.log
(time ./bin/llama-perplexity --model ${model_q2_k} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q2_k.log
(time ./bin/llama-perplexity --model ${model_q3_k} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q3_k.log
(time ./bin/llama-perplexity --model ${model_q4_k} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q4_k.log
(time ./bin/llama-perplexity --model ${model_q5_k} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q5_k.log
(time ./bin/llama-perplexity --model ${model_q6_k} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-tg-q6_k.log
(time ./bin/imatrix --model ${model_f16} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-imatrix.log
(time ./bin/llama-imatrix --model ${model_f16} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-imatrix.log
(time ./bin/save-load-state -ngl 10 --model ${model_q4_0} ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/save-load-state -fa -ngl 10 --model ${model_q4_0} ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/save-load-state -ngl 99 --model ${model_q4_0} ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/save-load-state -fa -ngl 99 --model ${model_q4_0} ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 10 -c 0 ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 10 -c 0 -fa ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 99 -c 0 ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 99 -c 0 -fa ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
function check_ppl {
qnt="$1"
@@ -686,17 +731,17 @@ function gg_run_embd_bge_small {
set -e
(time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time make -j ) 2>&1 | tee -a $OUT/${ci}-make.log
(time make -j$(nproc) ) 2>&1 | tee -a $OUT/${ci}-make.log
python3 ../convert-hf-to-gguf.py ${path_models} --outfile ${path_models}/ggml-model-f16.gguf
python3 ../convert_hf_to_gguf.py ${path_models} --outfile ${path_models}/ggml-model-f16.gguf
model_f16="${path_models}/ggml-model-f16.gguf"
model_q8_0="${path_models}/ggml-model-q8_0.gguf"
./bin/quantize ${model_f16} ${model_q8_0} q8_0
./bin/llama-quantize ${model_f16} ${model_q8_0} q8_0
(time ./bin/embedding --model ${model_f16} -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-f16.log
(time ./bin/embedding --model ${model_q8_0} -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q8_0.log
(time ./bin/llama-embedding --model ${model_f16} -p "I believe the meaning of life is" -ngl 99 -c 0 ) 2>&1 | tee -a $OUT/${ci}-tg-f16.log
(time ./bin/llama-embedding --model ${model_q8_0} -p "I believe the meaning of life is" -ngl 99 -c 0 ) 2>&1 | tee -a $OUT/${ci}-tg-q8_0.log
set +e
}
@@ -710,17 +755,104 @@ function gg_sum_embd_bge_small {
gg_printf '- q8_0:\n```\n%s\n```\n' "$(cat $OUT/${ci}-tg-q8_0.log)"
}
# rerank_tiny
function gg_run_rerank_tiny {
cd ${SRC}
gg_wget models-mnt/rerank-tiny/ https://huggingface.co/jinaai/jina-reranker-v1-tiny-en/raw/main/config.json
gg_wget models-mnt/rerank-tiny/ https://huggingface.co/jinaai/jina-reranker-v1-tiny-en/raw/main/tokenizer.json
gg_wget models-mnt/rerank-tiny/ https://huggingface.co/jinaai/jina-reranker-v1-tiny-en/raw/main/tokenizer_config.json
gg_wget models-mnt/rerank-tiny/ https://huggingface.co/jinaai/jina-reranker-v1-tiny-en/raw/main/special_tokens_map.json
gg_wget models-mnt/rerank-tiny/ https://huggingface.co/jinaai/jina-reranker-v1-tiny-en/resolve/main/pytorch_model.bin
gg_wget models-mnt/rerank-tiny/ https://huggingface.co/jinaai/jina-reranker-v1-tiny-en/raw/main/sentence_bert_config.json
gg_wget models-mnt/rerank-tiny/ https://huggingface.co/jinaai/jina-reranker-v1-tiny-en/raw/main/vocab.txt
gg_wget models-mnt/rerank-tiny/ https://huggingface.co/jinaai/jina-reranker-v1-tiny-en/raw/main/modules.json
gg_wget models-mnt/rerank-tiny/ https://huggingface.co/jinaai/jina-reranker-v1-tiny-en/raw/main/config.json
gg_wget models-mnt/rerank-tiny/1_Pooling https://huggingface.co/jinaai/jina-reranker-v1-tiny-en/raw/main/1_Pooling/config.json
path_models="../models-mnt/rerank-tiny"
rm -rf build-ci-release && mkdir build-ci-release && cd build-ci-release
set -e
(time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time make -j$(nproc) ) 2>&1 | tee -a $OUT/${ci}-make.log
python3 ../convert_hf_to_gguf.py ${path_models} --outfile ${path_models}/ggml-model-f16.gguf
model_f16="${path_models}/ggml-model-f16.gguf"
# for this model, the SEP token is "</s>"
(time ./bin/llama-embedding --model ${model_f16} -p "what is panda?\thi\nwhat is panda?\tit's a bear\nwhat is panda?\tThe giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China." -ngl 99 -c 0 --pooling rank --embd-normalize -1 --verbose-prompt) 2>&1 | tee -a $OUT/${ci}-rk-f16.log
# sample output
# rerank score 0: 0.029
# rerank score 1: 0.029
# rerank score 2: 0.135
# check that the score is in the range [$3, $4]
function check_score {
qnt="$1"
score=$(echo "$2" | grep -oE "[0-9]+\.[0-9]+" | tail -n 1)
if [ $(echo "$score < $3" | bc) -eq 1 ] || [ $(echo "$score > $4" | bc) -eq 1 ]; then
printf ' - %s @ %s (FAIL: score not in range [%s, %s])\n' "$qnt" "$score" "$3" "$4"
return 20
fi
printf ' - %s @ %s OK\n' "$qnt" "$score"
return 0
}
check_score "rerank score 0" "$(cat $OUT/${ci}-rk-f16.log | grep "rerank score 0")" "0.00" "0.05" | tee -a $OUT/${ci}-rk-f16.log
check_score "rerank score 1" "$(cat $OUT/${ci}-rk-f16.log | grep "rerank score 1")" "0.00" "0.05" | tee -a $OUT/${ci}-rk-f16.log
check_score "rerank score 2" "$(cat $OUT/${ci}-rk-f16.log | grep "rerank score 2")" "0.10" "0.30" | tee -a $OUT/${ci}-rk-f16.log
set +e
}
function gg_sum_rerank_tiny {
gg_printf '### %s\n\n' "${ci}"
gg_printf 'Rerank Tiny (Jina):\n'
gg_printf '- status: %s\n' "$(cat $OUT/${ci}.exit)"
gg_printf '- f16: \n```\n%s\n```\n' "$(cat $OUT/${ci}-rk-f16.log)"
}
function gg_check_build_requirements {
if ! command -v cmake &> /dev/null; then
gg_printf 'cmake not found, please install'
fi
if ! command -v make &> /dev/null; then
gg_printf 'make not found, please install'
fi
if ! command -v ctest &> /dev/null; then
gg_printf 'ctest not found, please install'
fi
}
## main
export LLAMA_LOG_PREFIX=1
export LLAMA_LOG_TIMESTAMPS=1
if [ -z ${GG_BUILD_LOW_PERF} ]; then
# Create symlink: ./llama.cpp/models-mnt -> $MNT/models/models-mnt
# Create symlink: ./llama.cpp/models-mnt -> $MNT/models
rm -rf ${SRC}/models-mnt
mnt_models=${MNT}/models
mkdir -p ${mnt_models}
ln -sfn ${mnt_models} ${SRC}/models-mnt
# Create a fresh python3 venv and enter it
python3 -m venv "$MNT/venv"
if ! python3 -m venv "$MNT/venv"; then
echo "Error: Failed to create Python virtual environment at $MNT/venv."
exit 1
fi
source "$MNT/venv/bin/activate"
pip install -r ${SRC}/requirements.txt --disable-pip-version-check
@@ -728,26 +860,33 @@ if [ -z ${GG_BUILD_LOW_PERF} ]; then
fi
ret=0
test $ret -eq 0 && gg_run ctest_debug
if [ -z ${GG_BUILD_SYCL} ]; then
# SYCL build breaks with debug build flags
test $ret -eq 0 && gg_run ctest_debug
fi
test $ret -eq 0 && gg_run ctest_release
if [ -z ${GG_BUILD_LOW_PERF} ]; then
test $ret -eq 0 && gg_run embd_bge_small
test $ret -eq 0 && gg_run rerank_tiny
if [ -z ${GG_BUILD_CLOUD} ] || [ ${GG_BUILD_EXTRA_TESTS_0} ]; then
test $ret -eq 0 && gg_run test_scripts_debug
if [ -z ${GG_BUILD_SYCL} ]; then
test $ret -eq 0 && gg_run test_scripts_debug
fi
test $ret -eq 0 && gg_run test_scripts_release
fi
if [ -z ${GG_BUILD_VRAM_GB} ] || [ ${GG_BUILD_VRAM_GB} -ge 8 ]; then
if [ -z ${GG_BUILD_CUDA} ]; then
if [ -z ${GG_BUILD_CUDA} ] && [ -z ${GG_BUILD_VULKAN} ]; then
test $ret -eq 0 && gg_run pythia_1_4b
else
test $ret -eq 0 && gg_run pythia_2_8b
#test $ret -eq 0 && gg_run open_llama_7b_v2
fi
test $ret -eq 0 && gg_run ctest_with_model_debug
if [ -z ${GG_BUILD_SYCL} ]; then
test $ret -eq 0 && gg_run ctest_with_model_debug
fi
test $ret -eq 0 && gg_run ctest_with_model_release
fi
fi
+16
View File
@@ -0,0 +1,16 @@
set( CMAKE_SYSTEM_NAME Darwin )
set( CMAKE_SYSTEM_PROCESSOR arm64 )
set( target arm64-apple-darwin-macho )
set( CMAKE_C_COMPILER clang )
set( CMAKE_CXX_COMPILER clang++ )
set( CMAKE_C_COMPILER_TARGET ${target} )
set( CMAKE_CXX_COMPILER_TARGET ${target} )
set( arch_c_flags "-march=armv8.4-a -fvectorize -ffp-model=fast -fno-finite-math-only" )
set( warn_c_flags "-Wno-format -Wno-unused-variable -Wno-unused-function" )
set( CMAKE_C_FLAGS_INIT "${arch_c_flags} ${warn_c_flags}" )
set( CMAKE_CXX_FLAGS_INIT "${arch_c_flags} ${warn_c_flags}" )
-6
View File
@@ -1,6 +0,0 @@
set( CMAKE_SYSTEM_NAME Windows )
set( CMAKE_SYSTEM_PROCESSOR arm64 )
set( target arm64-pc-windows-msvc )
set( CMAKE_C_COMPILER_TARGET ${target} )
set( CMAKE_CXX_COMPILER_TARGET ${target} )
@@ -41,14 +41,20 @@ endif()
if(MSVC)
set(BUILD_COMPILER "${CMAKE_C_COMPILER_ID} ${CMAKE_C_COMPILER_VERSION}")
set(BUILD_TARGET ${CMAKE_VS_PLATFORM_NAME})
if (CMAKE_VS_PLATFORM_NAME)
set(BUILD_TARGET ${CMAKE_VS_PLATFORM_NAME})
else()
set(BUILD_TARGET "${CMAKE_SYSTEM_NAME} ${CMAKE_SYSTEM_PROCESSOR}")
endif()
else()
execute_process(
COMMAND sh -c "$@ --version | head -1" _ ${CMAKE_C_COMPILER}
COMMAND ${CMAKE_C_COMPILER} --version
OUTPUT_VARIABLE OUT
OUTPUT_STRIP_TRAILING_WHITESPACE
)
string(REGEX REPLACE " *\n.*" "" OUT "${OUT}")
set(BUILD_COMPILER ${OUT})
execute_process(
COMMAND ${CMAKE_C_COMPILER} -dumpmachine
OUTPUT_VARIABLE OUT
+35
View File
@@ -0,0 +1,35 @@
include("ggml/cmake/common.cmake")
function(llama_add_compile_flags)
if (LLAMA_FATAL_WARNINGS)
if (CMAKE_CXX_COMPILER_ID MATCHES "GNU" OR CMAKE_CXX_COMPILER_ID MATCHES "Clang")
list(APPEND C_FLAGS -Werror)
list(APPEND CXX_FLAGS -Werror)
elseif (CMAKE_CXX_COMPILER_ID STREQUAL "MSVC")
add_compile_options(/WX)
endif()
endif()
if (LLAMA_ALL_WARNINGS)
if (NOT MSVC)
list(APPEND C_FLAGS -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes
-Werror=implicit-int -Werror=implicit-function-declaration)
list(APPEND CXX_FLAGS -Wmissing-declarations -Wmissing-noreturn)
list(APPEND WARNING_FLAGS -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function)
list(APPEND C_FLAGS ${WARNING_FLAGS})
list(APPEND CXX_FLAGS ${WARNING_FLAGS})
ggml_get_flags(${CMAKE_CXX_COMPILER_ID} ${CMAKE_CXX_COMPILER_VERSION})
add_compile_options("$<$<COMPILE_LANGUAGE:C>:${C_FLAGS};${GF_C_FLAGS}>"
"$<$<COMPILE_LANGUAGE:CXX>:${CXX_FLAGS};${GF_CXX_FLAGS}>")
else()
# todo : msvc
set(C_FLAGS "" PARENT_SCOPE)
set(CXX_FLAGS "" PARENT_SCOPE)
endif()
endif()
endfunction()
+22
View File
@@ -0,0 +1,22 @@
find_package(Git)
# the commit's SHA1
execute_process(COMMAND
"${GIT_EXECUTABLE}" describe --match=NeVeRmAtCh --always --abbrev=8
WORKING_DIRECTORY "${CMAKE_SOURCE_DIR}"
OUTPUT_VARIABLE GIT_SHA1
ERROR_QUIET OUTPUT_STRIP_TRAILING_WHITESPACE)
# the date of the commit
execute_process(COMMAND
"${GIT_EXECUTABLE}" log -1 --format=%ad --date=local
WORKING_DIRECTORY "${CMAKE_SOURCE_DIR}"
OUTPUT_VARIABLE GIT_DATE
ERROR_QUIET OUTPUT_STRIP_TRAILING_WHITESPACE)
# the subject of the commit
execute_process(COMMAND
"${GIT_EXECUTABLE}" log -1 --format=%s
WORKING_DIRECTORY "${CMAKE_SOURCE_DIR}"
OUTPUT_VARIABLE GIT_COMMIT_SUBJECT
ERROR_QUIET OUTPUT_STRIP_TRAILING_WHITESPACE)
+30
View File
@@ -0,0 +1,30 @@
set(LLAMA_VERSION @LLAMA_INSTALL_VERSION@)
set(LLAMA_BUILD_COMMIT @LLAMA_BUILD_COMMIT@)
set(LLAMA_BUILD_NUMBER @LLAMA_BUILD_NUMBER@)
set(LLAMA_SHARED_LIB @BUILD_SHARED_LIBS@)
@PACKAGE_INIT@
set_and_check(LLAMA_INCLUDE_DIR "@PACKAGE_LLAMA_INCLUDE_INSTALL_DIR@")
set_and_check(LLAMA_LIB_DIR "@PACKAGE_LLAMA_LIB_INSTALL_DIR@")
set_and_check(LLAMA_BIN_DIR "@PACKAGE_LLAMA_BIN_INSTALL_DIR@")
find_package(ggml REQUIRED HINTS ${LLAMA_LIB_DIR}/cmake)
find_library(llama_LIBRARY llama
REQUIRED
HINTS ${LLAMA_LIB_DIR}
NO_CMAKE_FIND_ROOT_PATH
)
add_library(llama UNKNOWN IMPORTED)
set_target_properties(llama
PROPERTIES
INTERFACE_INCLUDE_DIRECTORIES "${LLAMA_INCLUDE_DIR}"
INTERFACE_LINK_LIBRARIES "ggml::ggml;ggml::ggml-base;"
IMPORTED_LINK_INTERFACE_LANGUAGES "CXX"
IMPORTED_LOCATION "${llama_LIBRARY}"
INTERFACE_COMPILE_FEATURES c_std_90
POSITION_INDEPENDENT_CODE ON)
check_required_components(Llama)
+5 -5
View File
@@ -1,10 +1,10 @@
prefix=@CMAKE_INSTALL_PREFIX@
exec_prefix=${prefix}
libdir=${exec_prefix}/lib
includedir=${prefix}/include
exec_prefix=@CMAKE_INSTALL_PREFIX@
libdir=@CMAKE_INSTALL_FULL_LIBDIR@
includedir=@CMAKE_INSTALL_FULL_INCLUDEDIR@
Name: llama
Description: Port of Facebook's LLaMA model in C/C++
Version: @PROJECT_VERSION@
Libs: -L${libdir} -lllama
Version: @LLAMA_INSTALL_VERSION@
Libs: -L${libdir} -lggml -lggml-base -lllama
Cflags: -I${includedir}
+5
View File
@@ -0,0 +1,5 @@
set( CMAKE_SYSTEM_NAME Windows )
set( CMAKE_SYSTEM_PROCESSOR x86_64 )
set( CMAKE_C_COMPILER clang )
set( CMAKE_CXX_COMPILER clang++ )
-14
View File
@@ -1,14 +0,0 @@
comment: off
coverage:
status:
project:
default:
target: auto
threshold: 0
base: auto
patch:
default:
target: auto
threshold: 0
base: auto
+110 -35
View File
@@ -1,11 +1,14 @@
# common
find_package(Threads REQUIRED)
llama_add_compile_flags()
# Build info header
#
if(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/../.git")
set(GIT_DIR "${CMAKE_CURRENT_SOURCE_DIR}/../.git")
if(EXISTS "${PROJECT_SOURCE_DIR}/.git")
set(GIT_DIR "${PROJECT_SOURCE_DIR}/.git")
# Is git submodule
if(NOT IS_DIRECTORY "${GIT_DIR}")
@@ -15,34 +18,26 @@ if(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/../.git")
if (SLASH_POS EQUAL 0)
set(GIT_DIR "${REAL_GIT_DIR}")
else()
set(GIT_DIR "${CMAKE_CURRENT_SOURCE_DIR}/../${REAL_GIT_DIR}")
set(GIT_DIR "${PROJECT_SOURCE_DIR}/${REAL_GIT_DIR}")
endif()
endif()
if(EXISTS "${GIT_DIR}/index")
set(GIT_INDEX "${GIT_DIR}/index")
# For build-info.cpp below
set_property(DIRECTORY APPEND PROPERTY CMAKE_CONFIGURE_DEPENDS "${GIT_DIR}/index")
else()
message(WARNING "Git index not found in git repository.")
set(GIT_INDEX "")
endif()
else()
message(WARNING "Git repository not found; to enable automatic generation of build info, make sure Git is installed and the project is a Git repository.")
set(GIT_INDEX "")
endif()
# Add a custom command to rebuild build-info.cpp when .git/index changes
add_custom_command(
OUTPUT "${CMAKE_CURRENT_SOURCE_DIR}/build-info.cpp"
COMMENT "Generating build details from Git"
COMMAND ${CMAKE_COMMAND} -DMSVC=${MSVC} -DCMAKE_C_COMPILER_VERSION=${CMAKE_C_COMPILER_VERSION}
-DCMAKE_C_COMPILER_ID=${CMAKE_C_COMPILER_ID} -DCMAKE_VS_PLATFORM_NAME=${CMAKE_VS_PLATFORM_NAME}
-DCMAKE_C_COMPILER=${CMAKE_C_COMPILER} -P "${CMAKE_CURRENT_SOURCE_DIR}/../scripts/gen-build-info-cpp.cmake"
WORKING_DIRECTORY "${CMAKE_CURRENT_SOURCE_DIR}/.."
DEPENDS "${CMAKE_CURRENT_SOURCE_DIR}/build-info.cpp.in" ${GIT_INDEX}
VERBATIM
)
set(TEMPLATE_FILE "${CMAKE_CURRENT_SOURCE_DIR}/build-info.cpp.in")
set(OUTPUT_FILE "${CMAKE_CURRENT_BINARY_DIR}/build-info.cpp")
configure_file(${TEMPLATE_FILE} ${OUTPUT_FILE})
set(TARGET build_info)
add_library(${TARGET} OBJECT build-info.cpp)
add_library(${TARGET} OBJECT ${OUTPUT_FILE})
if (BUILD_SHARED_LIBS)
set_target_properties(${TARGET} PROPERTIES POSITION_INDEPENDENT_CODE ON)
endif()
@@ -50,21 +45,31 @@ endif()
set(TARGET common)
add_library(${TARGET} STATIC
arg.cpp
arg.h
base64.hpp
common.h
chat-parser.cpp
chat-parser.h
chat.cpp
chat.h
common.cpp
sampling.h
sampling.cpp
console.h
common.h
console.cpp
grammar-parser.h
grammar-parser.cpp
json.hpp
console.h
json-partial.cpp
json-partial.h
json-schema-to-grammar.cpp
train.h
train.cpp
ngram-cache.h
llguidance.cpp
log.cpp
log.h
ngram-cache.cpp
ngram-cache.h
regex-partial.cpp
regex-partial.h
sampling.cpp
sampling.h
speculative.cpp
speculative.h
)
if (BUILD_SHARED_LIBS)
@@ -75,13 +80,83 @@ set(LLAMA_COMMON_EXTRA_LIBS build_info)
# Use curl to download model url
if (LLAMA_CURL)
find_package(CURL REQUIRED)
add_definitions(-DLLAMA_USE_CURL)
find_package(CURL)
if (NOT CURL_FOUND)
message(FATAL_ERROR "Could NOT find CURL. Hint: to disable this feature, set -DLLAMA_CURL=OFF")
endif()
target_compile_definitions(${TARGET} PUBLIC LLAMA_USE_CURL)
include_directories(${CURL_INCLUDE_DIRS})
find_library(CURL_LIBRARY curl REQUIRED)
set(LLAMA_COMMON_EXTRA_LIBS ${LLAMA_COMMON_EXTRA_LIBS} ${CURL_LIBRARY})
set(LLAMA_COMMON_EXTRA_LIBS ${LLAMA_COMMON_EXTRA_LIBS} ${CURL_LIBRARIES})
endif ()
target_include_directories(${TARGET} PUBLIC .)
target_compile_features(${TARGET} PUBLIC cxx_std_11)
target_link_libraries(${TARGET} PRIVATE ${LLAMA_COMMON_EXTRA_LIBS} PUBLIC llama)
if (LLAMA_LLGUIDANCE)
include(ExternalProject)
set(LLGUIDANCE_SRC ${CMAKE_BINARY_DIR}/llguidance/source)
set(LLGUIDANCE_PATH ${LLGUIDANCE_SRC}/target/release)
# Set the correct library file extension based on platform
if (WIN32)
set(LLGUIDANCE_LIB_NAME "llguidance.lib")
# Add Windows-specific libraries
set(LLGUIDANCE_PLATFORM_LIBS
ws2_32 # Windows Sockets API
userenv # For GetUserProfileDirectoryW
ntdll # For NT functions
bcrypt # For BCryptGenRandom
)
else()
set(LLGUIDANCE_LIB_NAME "libllguidance.a")
set(LLGUIDANCE_PLATFORM_LIBS "")
endif()
ExternalProject_Add(llguidance_ext
GIT_REPOSITORY https://github.com/guidance-ai/llguidance
# v1.0.1:
GIT_TAG d795912fedc7d393de740177ea9ea761e7905774
PREFIX ${CMAKE_BINARY_DIR}/llguidance
SOURCE_DIR ${LLGUIDANCE_SRC}
BUILD_IN_SOURCE TRUE
CONFIGURE_COMMAND ""
BUILD_COMMAND cargo build --release --package llguidance
INSTALL_COMMAND ""
BUILD_BYPRODUCTS ${LLGUIDANCE_PATH}/${LLGUIDANCE_LIB_NAME} ${LLGUIDANCE_PATH}/llguidance.h
UPDATE_COMMAND ""
)
target_compile_definitions(${TARGET} PUBLIC LLAMA_USE_LLGUIDANCE)
add_library(llguidance STATIC IMPORTED)
set_target_properties(llguidance PROPERTIES IMPORTED_LOCATION ${LLGUIDANCE_PATH}/${LLGUIDANCE_LIB_NAME})
add_dependencies(llguidance llguidance_ext)
target_include_directories(${TARGET} PRIVATE ${LLGUIDANCE_PATH})
# Add platform libraries to the main target
set(LLAMA_COMMON_EXTRA_LIBS ${LLAMA_COMMON_EXTRA_LIBS} llguidance ${LLGUIDANCE_PLATFORM_LIBS})
endif ()
target_include_directories(${TARGET} PUBLIC . ../vendor)
target_compile_features (${TARGET} PUBLIC cxx_std_17)
target_link_libraries (${TARGET} PRIVATE ${LLAMA_COMMON_EXTRA_LIBS} PUBLIC llama Threads::Threads)
#
# copy the license files
#
# Check if running in GitHub Actions
if (DEFINED ENV{GITHUB_ACTIONS} AND "$ENV{GITHUB_ACTIONS}" STREQUAL "true")
message(STATUS "Running inside GitHub Actions - copying license files")
# Copy all files from licenses/ to build/bin/
file(GLOB LICENSE_FILES "${CMAKE_SOURCE_DIR}/licenses/*")
foreach(LICENSE_FILE ${LICENSE_FILES})
get_filename_component(FILENAME ${LICENSE_FILE} NAME)
add_custom_command(
POST_BUILD
TARGET ${TARGET}
COMMAND ${CMAKE_COMMAND} -E copy_if_different
"${LICENSE_FILE}"
"$<TARGET_FILE_DIR:llama>/${FILENAME}"
COMMENT "Copying ${FILENAME} to ${CMAKE_RUNTIME_OUTPUT_DIRECTORY}")
message(STATUS "Copying ${LICENSE_FILE} to ${CMAKE_RUNTIME_OUTPUT_DIRECTORY}/${FILENAME}")
endforeach()
endif()
+3471
View File
File diff suppressed because it is too large Load Diff

Some files were not shown because too many files have changed in this diff Show More