CLAMP f16 returns inf on HIP (RDNA3.5 / gfx1151) #1

Open
opened 2026-05-14 00:47:24 +02:00 by shahondin1624 · 0 comments
Owner

Summary

test-backend-ops -o CLAMP produces ERR = inf for all type=f16 cases on HIP with gfx1151. F32 cases pass.

Repro

cmake -B build-hip -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 -DLLAMA_BUILD_TESTS=ON
cmake --build build-hip -j 32
GGML_CUDA_DISABLE_GRAPHS=1 ./build-hip/bin/test-backend-ops -o CLAMP

Observed

[CLAMP] ERR = inf > 0.000000100   CLAMP(type=f16,ne=[10,5,4,3],min=-0.500000,max=0.500000): FAIL
[CLAMP] ERR = inf > 0.000000100   CLAMP(type=f16,ne=[7,1,5,3],min=-0.500000,max=0.500000): FAIL
[CLAMP] ERR = inf > 0.000000100   CLAMP(type=f16,ne=[1024,1024,1,1],min=-0.500000,max=0.500000): FAIL
  CLAMP(type=f32,...): OK (all 3 cases)

ERR is inf rather than a small tolerance miss, so the F16 kernel output contains actual infinities; this is not precision drift.
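To illustrate why an infinite ERR points at non-finite output rather than drift, here is a minimal sketch. It assumes the test compares backend output against a reference using a normalized mean squared error, which the 0.000000100 threshold in the log suggests; the `nmse` helper and the sample values are hypothetical, not taken from test-backend-ops.

```python
import math

def nmse(reference, actual):
    """Normalized mean squared error: sum((a - r)^2) / sum(r^2).

    Hypothetical stand-in for the metric the test harness is assumed to use.
    """
    err = 0.0
    norm = 0.0
    for r, a in zip(reference, actual):
        d = a - r
        err += d * d   # a single inf in `actual` makes this term inf
        norm += r * r
    return err / norm

# Made-up reference values, already inside the clamp range [-0.5, 0.5].
reference = [-0.5, 0.25, 0.5, -0.125]

# Precision drift: every element is slightly off, the error stays finite.
drifted = [r + 1e-3 for r in reference]

# Broken output: one element is an actual infinity.
broken = [-0.5, math.inf, 0.5, -0.125]

print(nmse(reference, drifted))  # small finite value above the 1e-7 threshold
print(nmse(reference, broken))   # inf
```

Under that assumption, drift of any realistic size yields a finite error, while a single infinite element is enough to push the metric to inf.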

Source

ggml/src/ggml-cuda/clamp.cu — templated kernel op_clamp_kernel<T> instantiated for half. Code is identical to upstream master; this is not introduced by the TurboQuant merge (commit 7161dee3f).

Scope of practical impact

In llama.cpp's model graphs, ggml_clamp is called on the output of build_lora_mm(...), which is F32 even when weights are F16/quantized. So inference graphs do not normally hit the F16 kernel path. The bug affects:

  • Direct calls to ggml_clamp on F16 tensors (none observed in the current model code in src/models/).
  • Anyone writing a graph where the F16 clamp kernel is actually invoked.

It does NOT currently affect MPT, OLMo, DBRX, MoE routers, or Qwen inference, because all those clamp call sites operate on F32 tensors.

Hardware / env

  • GPU: Radeon 8060S (gfx1151, RDNA3.5)
  • ROCm: 7.x (hipcc Clang 19.0.0)
  • Branch: master @ 7161dee3f

Workaround

None needed for current inference workloads — the F16 path isn't exercised by existing models. If a future model adds a direct F16 clamp, route through F32 explicitly via ggml_cast before clamping.
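As a rough host-side model of that routing (widen to a larger float type, clamp, then narrow back to F16), the sketch below may help when checking what a correct F16 CLAMP should produce. It is plain Python, not ggml code: `to_f16` and `clamp_via_wide` are hypothetical helpers, and Python's double-precision float stands in for F32. In a ggml graph the corresponding step would be a `ggml_cast` to F32 ahead of `ggml_clamp`.

```python
import math
import struct

def to_f16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def clamp_via_wide(values_f16, lo, hi):
    """Clamp half-precision values by widening first (hypothetical helper).

    Every half value is exactly representable in the wider type, so the
    clamp itself is exact; only the final narrowing rounds.
    """
    out = []
    for v in values_f16:
        wide = float(v)                    # widen (stand-in for F16 -> F32)
        clamped = min(max(wide, lo), hi)   # clamp in the wider type
        out.append(to_f16(clamped))        # narrow back to half precision
    return out

# Same bounds as the failing test cases: min=-0.5, max=0.5.
inputs = [to_f16(x) for x in (-1.0, -0.5, -0.3, 0.0, 0.3, 0.7, 1.0)]
result = clamp_via_wide(inputs, -0.5, 0.5)

# A correct clamp never yields a non-finite or out-of-range value here.
print(all(math.isfinite(v) and -0.5 <= v <= 0.5 for v in result))  # True
```

Because both bounds are exactly representable in half precision, the narrowed result cannot leave the range, so any infinity in the GPU output has to come from the kernel rather than from the inputs.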


Reference: shahondin1624/llama.cpp#1