llm-multiverse/docs/hardware-baseline.md
shahondin1624 0616dd3d58 docs: add hardware baseline and Ollama configuration (issue #6)
Consolidates all Phase 0 benchmark results: GPU detection, model
performance metrics, VRAM budget, and known issues for RX 9070 XT
with RDNA4 native ROCm support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 21:32:14 +01:00


# Hardware Baseline and Ollama Configuration

## Hardware

| Component | Specification |
|---|---|
| CPU | AMD Ryzen 7 2700X |
| RAM | 64 GB DDR4 |
| GPU | AMD Radeon RX 9070 XT (16 GB VRAM, RDNA4) |
| GPU compute target | gfx1201 (native ROCm support) |
| GPU driver | 60342.13 |
| OS | Fedora 42 |
| Kernel | 6.18.6-100.fc42.x86_64 |

## Ollama Configuration

| Setting | Value |
|---|---|
| Install method | `curl -fsSL https://ollama.com/install.sh \| sh` |
| ROCm | Bundled with Ollama (no separate ROCm install) |
| `HSA_OVERRIDE_GFX_VERSION` | Not needed; gfx1201 is detected natively |
| Host | `http://127.0.0.1:11434` |
| Model storage | `/usr/share/ollama/.ollama/models` |
| Service | systemd unit `ollama.service` |

## Model Benchmarks

All benchmarks were run with the browser closed to avoid VRAM contention.

### Inference Models

| Model | Role | VRAM | Processor | Prompt Eval | Gen Rate | Context |
|---|---|---|---|---|---|---|
| qwen2.5:7b-instruct | Orchestrator / tool selection | 4.9 GB | 100% GPU | 611 tok/s | 99 tok/s | 4096 |
| qwen2.5-coder:14b-instruct | Coder subagent | 9.7 GB | 100% GPU | 520 tok/s | 53 tok/s | 4096 |
| qwen2.5:14b-instruct | Researcher / summarization | 9.7 GB | 100% GPU | 625 tok/s | 54 tok/s | 4096 |
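The throughput numbers above can be spot-checked with `ollama run --verbose`, which prints timing statistics (prompt eval rate, eval rate) after each response. A sketch, assuming the service is running and the models are already pulled; the prompt is arbitrary:

```shell
# Sketch: re-measure prompt-eval and generation rates for each inference model.
for model in qwen2.5:7b-instruct qwen2.5-coder:14b-instruct qwen2.5:14b-instruct; do
  echo "=== $model ==="
  # --verbose emits timing stats (on stderr) after the response completes.
  ollama run --verbose "$model" "Explain a hash map in two sentences." 2>&1 \
    | grep -E 'eval rate|eval duration'
done
```

Rates vary a few percent run-to-run; compare against the table with that tolerance in mind.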

### Embedding Model

| Model | Role | VRAM | Dimensions | Cold Load | Warm Inference |
|---|---|---|---|---|---|
| nomic-embed-text | Embeddings | ~0.3 GB | 768 | ~4.6 s | ~30 ms (7 tokens) |

## Model Selection Rationale

The architecture document (`planning-agent-prompt.md`) assigns models as follows:

| Role | Model | Justification |
|---|---|---|
| Orchestrator / tool selection | qwen2.5:7b-instruct | Fast (99 tok/s), small (4.9 GB), leaves VRAM for the embedding model |
| Coder subagent | qwen2.5-coder:14b-instruct | Code-specialized, fits in 16 GB VRAM |
| Researcher / summarization | qwen2.5:14b-instruct | General-purpose 14B, same VRAM as the coder variant |
| Embeddings | nomic-embed-text | Minimal VRAM (~0.3 GB), 768-dim vectors, fast inference |

**Key constraint:** The two 14B models (coder and researcher) cannot run simultaneously, since each consumes 9.7 GB of VRAM. Ollama handles model swapping automatically between tasks. The 7B orchestrator model can potentially stay resident on the secondary (CPU) machine to keep GPU VRAM free for the subagent models.
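Residency can also be steered explicitly through the `keep_alive` parameter of Ollama's generate API: `-1` keeps a model loaded until it is unloaded, `0` unloads it immediately. A sketch against the host configured above:

```shell
# Sketch: pin the small orchestrator model and free a 14B slot on demand
# (assumes the Ollama service at 127.0.0.1:11434 as configured above).

# An empty generate request with keep_alive=-1 loads the model and keeps it resident.
curl -s http://127.0.0.1:11434/api/generate \
  -d '{"model": "qwen2.5:7b-instruct", "keep_alive": -1}'

# keep_alive=0 unloads a model immediately, freeing its 9.7 GB before the
# other 14B model is requested.
curl -s http://127.0.0.1:11434/api/generate \
  -d '{"model": "qwen2.5-coder:14b-instruct", "keep_alive": 0}'

# `ollama ps` shows which models are currently resident and their VRAM footprint.
ollama ps
```

This is optional: without it, Ollama's automatic swapping applies the same eviction on demand, at the cost of the ~11.4 s cold-load noted under Known Issues.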

## VRAM Budget

| Scenario | VRAM Used | VRAM Free |
|---|---|---|
| 7B + nomic-embed-text | ~5.2 GB | ~10.7 GB |
| 14B alone | 9.7 GB | 6.2 GB |
| 14B + nomic-embed-text | ~10.0 GB | ~5.9 GB |
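The free column implies roughly 15.9 GB usable out of the nominal 16 GB (the remainder is reserved by the driver/display). A small sketch that reproduces the table's arithmetic under that assumption:

```shell
# Sketch: reproduce the VRAM-budget arithmetic, assuming ~15.9 GB usable
# (nominal 16 GB minus the reservation implied by the table above).
usable=15.9
while IFS=: read -r name used; do
  free=$(awk -v t="$usable" -v u="$used" 'BEGIN { printf "%.1f", t - u }')
  echo "$name: ${used} GB used, ${free} GB free"
done <<'EOF'
7B + nomic-embed-text:5.2
14B alone:9.7
14B + nomic-embed-text:10.0
EOF
```

Any new model or context-size change can be checked against the same budget before deployment.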

## Known Issues

1. **Browser VRAM contention:** Brave crashed when a 14B model was loaded, likely due to VRAM exhaustion. Close GPU-heavy applications before running large models.
2. **Model swap latency:** Cold-loading a 14B model takes ~11.4 s. This is acceptable per the architecture doc ("Model inference latency is acceptable").
3. **Model store location:** The default path is `/usr/share/ollama/.ollama/models`; ensure sufficient disk space (~25 GB for all four models).
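If the default store lacks the ~25 GB, Ollama's documented `OLLAMA_MODELS` environment variable can relocate it via a systemd drop-in. A sketch; the target path `/var/lib/ollama/models` is an example, and the directory must be writable by the `ollama` service user:

```shell
# Sketch: relocate the Ollama model store to a larger filesystem.
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_MODELS=/var/lib/ollama/models"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama.service
```

Already-pulled models must be moved (or re-pulled) into the new path, since Ollama does not migrate the store automatically.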