Flash attention bf16 hipStreamEndCapture failure during CUDA-graph capture (RDNA3.5 / gfx1151) #2

Open
opened 2026-05-14 00:48:19 +02:00 by shahondin1624 · 0 comments

Summary

test-backend-ops aborts on FLASH_ATTN_EXT with type_K=bf16, type_V=bf16 when CUDA-graph capture is enabled on HIP gfx1151. Disabling graphs (GGML_CUDA_DISABLE_GRAPHS=1) bypasses the abort and the remaining FA tests pass.

Repro

cmake -B build-hip -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 -DLLAMA_BUILD_TESTS=ON
cmake --build build-hip -j 32
./build-hip/bin/test-backend-ops -o FLASH_ATTN_EXT      # aborts
GGML_CUDA_DISABLE_GRAPHS=1 ./build-hip/bin/test-backend-ops -o FLASH_ATTN_EXT   # passes

Observed

FLASH_ATTN_EXT(hsk=64,hsv=64,nh=4,nr23=[4,3],kv=512,nb=1,mask=1,sinks=0,...,type_K=bf16,type_V=bf16,permute=[0,1,2,3]):
ggml_backend_cuda_graph_compute: CUDA graph warmup complete
ggml_backend_cuda_graph_compute: CUDA graph warmup complete
ROCm error: operation failed due to a previous error during capture
  current device: 0, in function ggml_cuda_graph_evaluate_and_capture at ggml/src/ggml-cuda/ggml-cuda.cu:4474
  hipStreamEndCapture(cuda_ctx->stream(), &graph->graph)
ggml/src/ggml-cuda/ggml-cuda.cu:104: ROCm error

Diagnosis

The "CUDA graph warmup complete" message is logged twice (which may indicate two capture attempts), then hipStreamEndCapture fails with "operation failed due to a previous error during capture". This suggests that an operation issued while the stream was capturing failed or was rejected, invalidating the capture, and that the error only surfaces at hipStreamEndCapture. The most likely candidate is a bf16-specific call on the FA path that the HIP runtime cannot represent in a captured graph on RDNA3.5.

Without graphs, the same FA bf16 kernels execute successfully, and the turbo3, Q4_0, Q8_0, and F16 FA cases all print OK in the non-graph run.
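To narrow down which call invalidates the capture, it may help to rerun only the bf16 case with HIP runtime logging enabled. The following is a minimal sketch, not a verified procedure: `fa_bf16_debug` and the log file name are hypothetical, and it assumes that `AMD_LOG_LEVEL` is honored by the installed ROCm runtime and that `test-backend-ops` accepts `-p <regex>` to filter test parameters.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rerun only the bf16 FLASH_ATTN_EXT cases with HIP
# runtime logging enabled and save all output to a log file.
# Assumptions: AMD_LOG_LEVEL=3 enables informational runtime logging, and
# test-backend-ops supports -p <regex> as a test-parameter filter.
fa_bf16_debug() {
    bin="${1:-./build-hip/bin/test-backend-ops}"
    log="${2:-fa-bf16-capture.log}"
    status=0

    # Capture stdout and stderr; remember the exit status of the test binary.
    AMD_LOG_LEVEL=3 "$bin" -o FLASH_ATTN_EXT -p 'type_K=bf16' >"$log" 2>&1 || status=$?

    # Print the first lines that mention an error, with their line numbers.
    grep -n -i -m 20 'error' "$log" || true

    return "$status"
}
```

Calling `fa_bf16_debug` with no arguments uses the build path from the repro above. The first error line that appears before the `hipStreamEndCapture` failure is the one worth inspecting.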

Source

FA backend code is identical to upstream master; this is not introduced by the TurboQuant merge (commit 7161dee3f). The TurboQuant turbo3 FA cases pass on this hardware.

Scope of practical impact

  • Affects any model whose KV cache type is bf16 when running on HIP/gfx1151 with CUDA graphs enabled.
  • bf16-native model families on the radar: Gemma-2, Gemma-3, some Phi-3 variants, certain Llama-3 distillations published in bf16.
  • F16 / Q4_K_M / Q5_K_M / Q8_0 quantized GGUFs are unaffected.

Workaround

export GGML_CUDA_DISABLE_GRAPHS=1

Performance penalty is minor (single-digit %); inference output is correct.
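Because the failure is specific to bf16 K/V types, the workaround can be limited to runs that actually request a bf16 KV cache, so other configurations keep graphs enabled. The sketch below assumes the cache type is selected on the command line via `-ctk`/`-ctv` or `--cache-type-k`/`--cache-type-v`; `uses_bf16_kv_cache` and `run_llama` are hypothetical helper names, not part of llama.cpp.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (returns 0) if any argument pair selects a
# bf16 KV cache, e.g. "-ctk bf16" or "--cache-type-v bf16".
uses_bf16_kv_cache() {
    prev=""
    for arg in "$@"; do
        case "$prev" in
            -ctk|-ctv|--cache-type-k|--cache-type-v)
                if [ "$arg" = "bf16" ]; then
                    return 0
                fi
                ;;
        esac
        prev="$arg"
    done
    return 1
}

# Hypothetical wrapper: run the given command line unchanged, setting
# GGML_CUDA_DISABLE_GRAPHS=1 only when a bf16 KV cache is requested.
run_llama() {
    if uses_bf16_kv_cache "$@"; then
        GGML_CUDA_DISABLE_GRAPHS=1 "$@"
    else
        "$@"
    fi
}
```

Usage would look like `run_llama ./build-hip/bin/llama-server -m model.gguf -ctk bf16 -ctv bf16`, where the binary path and model name are placeholders.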

Hardware / env

  • GPU: Radeon 8060S (gfx1151, RDNA3.5)
  • ROCm: 7.x (hipcc Clang 19.0.0)
  • Branch: master @ 7161dee3f
Reference: shahondin1624/llama.cpp#2