llama.cpp/bench-smem-m5-baseline.txt

=== SMEM M5 Benchmark: baseline ===
Model: Qwen3.5-35B-A3B-Q8_0.gguf
Date: Sat Mar 28 21:45:40 CDT 2026

--- turbo3 @ short ---
ggml_metal_device_init: testing tensor API for f16 support
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel                                  0x105cffcb0 | th_max = 1024 | th_width =   32
ggml_metal_device_init: testing tensor API for bfloat support
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel                                  0x105cfeb30 | th_max = 1024 | th_width =   32
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: turbo3 sparse V dequant enabled
ggml_metal_library_init: loaded in 6.440 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple10  (1010)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 115448.73 MB
| model                          |       size |     params | backend    | threads | type_k | type_v | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -----: | -----: | -: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | MTL,BLAS   |       1 | turbo3 | turbo3 |  1 |           tg128 |         78.47 ± 0.56 |

build: 13afec1 (178)

--- turbo3 @ 8192 ---
ggml_metal_device_init: testing tensor API for f16 support
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel                                  0x1040cfae0 | th_max = 1024 | th_width =   32
ggml_metal_device_init: testing tensor API for bfloat support
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel                                  0x1040ce960 | th_max = 1024 | th_width =   32
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: turbo3 sparse V dequant enabled
ggml_metal_library_init: loaded in 0.010 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple10  (1010)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 115448.73 MB
| model                          |       size |     params | backend    | threads | type_k | type_v | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -----: | -----: | -: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | MTL,BLAS   |       1 | turbo3 | turbo3 |  1 |          pp8192 |      2144.16 ± 30.18 |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | MTL,BLAS   |       1 | turbo3 | turbo3 |  1 |           tg128 |         78.90 ± 0.24 |

build: 13afec1 (178)

--- turbo3 @ 16384 ---
ggml_metal_device_init: testing tensor API for f16 support
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel                                  0x10500fc00 | th_max = 1024 | th_width =   32
ggml_metal_device_init: testing tensor API for bfloat support
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel                                  0x10500ea80 | th_max = 1024 | th_width =   32
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: turbo3 sparse V dequant enabled
ggml_metal_library_init: loaded in 0.008 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple10  (1010)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 115448.73 MB
| model                          |       size |     params | backend    | threads | type_k | type_v | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -----: | -----: | -: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | MTL,BLAS   |       1 | turbo3 | turbo3 |  1 |         pp16384 |      1704.41 ± 21.63 |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | MTL,BLAS   |       1 | turbo3 | turbo3 |  1 |           tg128 |         78.64 ± 0.44 |

build: 13afec1 (178)

--- turbo3 @ 32768 ---
ggml_metal_device_init: testing tensor API for f16 support
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel                                  0x101c8fb00 | th_max = 1024 | th_width =   32
ggml_metal_device_init: testing tensor API for bfloat support
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel                                  0x101c8e980 | th_max = 1024 | th_width =   32
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: turbo3 sparse V dequant enabled
ggml_metal_library_init: loaded in 0.013 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple10  (1010)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 115448.73 MB
| model                          |       size |     params | backend    | threads | type_k | type_v | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -----: | -----: | -: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | MTL,BLAS   |       1 | turbo3 | turbo3 |  1 |         pp32768 |       1238.85 ± 6.06 |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | MTL,BLAS   |       1 | turbo3 | turbo3 |  1 |           tg128 |         78.17 ± 0.69 |

build: 13afec1 (178)

--- turbo4 @ short ---
ggml_metal_device_init: testing tensor API for f16 support
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel                                  0x103c17f70 | th_max = 1024 | th_width =   32
ggml_metal_device_init: testing tensor API for bfloat support
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel                                  0x103c16df0 | th_max = 1024 | th_width =   32
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: turbo3 sparse V dequant enabled
ggml_metal_library_init: loaded in 0.008 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple10  (1010)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 115448.73 MB
| model                          |       size |     params | backend    | threads | type_k | type_v | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -----: | -----: | -: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | MTL,BLAS   |       1 | turbo4 | turbo4 |  1 |           tg128 |         80.40 ± 0.72 |

build: 13afec1 (178)

--- turbo4 @ 8192 ---
ggml_metal_device_init: testing tensor API for f16 support
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel                                  0x103e57d30 | th_max = 1024 | th_width =   32
ggml_metal_device_init: testing tensor API for bfloat support
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel                                  0x103e56bb0 | th_max = 1024 | th_width =   32
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: turbo3 sparse V dequant enabled
ggml_metal_library_init: loaded in 0.010 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple10  (1010)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 115448.73 MB
| model                          |       size |     params | backend    | threads | type_k | type_v | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -----: | -----: | -: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | MTL,BLAS   |       1 | turbo4 | turbo4 |  1 |          pp8192 |      2048.90 ± 43.42 |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | MTL,BLAS   |       1 | turbo4 | turbo4 |  1 |           tg128 |         79.84 ± 0.95 |

build: 13afec1 (178)

--- turbo4 @ 16384 ---
ggml_metal_device_init: testing tensor API for f16 support
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel                                  0x1060bf740 | th_max = 1024 | th_width =   32
ggml_metal_device_init: testing tensor API for bfloat support
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel                                  0x1060be5c0 | th_max = 1024 | th_width =   32
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: turbo3 sparse V dequant enabled
ggml_metal_library_init: loaded in 0.009 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple10  (1010)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 115448.73 MB
| model                          |       size |     params | backend    | threads | type_k | type_v | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -----: | -----: | -: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | MTL,BLAS   |       1 | turbo4 | turbo4 |  1 |         pp16384 |      1605.18 ± 20.70 |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | MTL,BLAS   |       1 | turbo4 | turbo4 |  1 |           tg128 |         79.45 ± 1.55 |

build: 13afec1 (178)

--- turbo4 @ 32768 ---
ggml_metal_device_init: testing tensor API for f16 support
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel                                  0x1040ef870 | th_max = 1024 | th_width =   32
ggml_metal_device_init: testing tensor API for bfloat support
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel                                  0x1040ee6f0 | th_max = 1024 | th_width =   32
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: turbo3 sparse V dequant enabled
ggml_metal_library_init: loaded in 0.010 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple10  (1010)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 115448.73 MB
| model                          |       size |     params | backend    | threads | type_k | type_v | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -----: | -----: | -: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | MTL,BLAS   |       1 | turbo4 | turbo4 |  1 |         pp32768 |       1157.30 ± 8.01 |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | MTL,BLAS   |       1 | turbo4 | turbo4 |  1 |           tg128 |         80.64 ± 0.72 |

build: 13afec1 (178)

--- q8_0 @ short ---
ggml_metal_device_init: testing tensor API for f16 support
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel                                  0x1055e78c0 | th_max = 1024 | th_width =   32
ggml_metal_device_init: testing tensor API for bfloat support
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel                                  0x1055e6740 | th_max = 1024 | th_width =   32
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: turbo3 sparse V dequant enabled
ggml_metal_library_init: loaded in 0.008 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple10  (1010)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 115448.73 MB
| model                          |       size |     params | backend    | threads | type_k | type_v | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -----: | -----: | -: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | MTL,BLAS   |       1 |   q8_0 |   q8_0 |  1 |           tg128 |         85.48 ± 1.34 |

build: 13afec1 (178)

--- q8_0 @ 8192 ---
ggml_metal_device_init: testing tensor API for f16 support
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel                                  0x105ac8540 | th_max = 1024 | th_width =   32
ggml_metal_device_init: testing tensor API for bfloat support
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel                                  0x105ac73c0 | th_max = 1024 | th_width =   32
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: turbo3 sparse V dequant enabled
ggml_metal_library_init: loaded in 0.010 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple10  (1010)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 115448.73 MB
| model                          |       size |     params | backend    | threads | type_k | type_v | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -----: | -----: | -: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | MTL,BLAS   |       1 |   q8_0 |   q8_0 |  1 |          pp8192 |      2106.47 ± 64.66 |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | MTL,BLAS   |       1 |   q8_0 |   q8_0 |  1 |           tg128 |         76.72 ± 2.13 |

build: 13afec1 (178)

--- q8_0 @ 16384 ---
ggml_metal_device_init: testing tensor API for f16 support
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel                                  0x103fefa70 | th_max = 1024 | th_width =   32
ggml_metal_device_init: testing tensor API for bfloat support
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel                                  0x103fee8f0 | th_max = 1024 | th_width =   32
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: turbo3 sparse V dequant enabled
ggml_metal_library_init: loaded in 0.008 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple10  (1010)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 115448.73 MB
| model                          |       size |     params | backend    | threads | type_k | type_v | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -----: | -----: | -: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | MTL,BLAS   |       1 |   q8_0 |   q8_0 |  1 |         pp16384 |      1723.71 ± 28.56 |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | MTL,BLAS   |       1 |   q8_0 |   q8_0 |  1 |           tg128 |         78.09 ± 3.70 |

build: 13afec1 (178)

--- q8_0 @ 32768 ---
ggml_metal_device_init: testing tensor API for f16 support
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel                                  0x1035f7b10 | th_max = 1024 | th_width =   32
ggml_metal_device_init: testing tensor API for bfloat support
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'dummy_kernel', name = 'dummy_kernel'
ggml_metal_library_compile_pipeline: loaded dummy_kernel                                  0x1035f6990 | th_max = 1024 | th_width =   32
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: turbo3 sparse V dequant enabled
ggml_metal_library_init: loaded in 0.008 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple10  (1010)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 115448.73 MB
| model                          |       size |     params | backend    | threads | type_k | type_v | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -----: | -----: | -: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | MTL,BLAS   |       1 |   q8_0 |   q8_0 |  1 |         pp32768 |      1216.99 ± 28.64 |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | MTL,BLAS   |       1 |   q8_0 |   q8_0 |  1 |           tg128 |         86.83 ± 0.34 |

build: 13afec1 (178)

=== Done: baseline ===