llm-multiverse/docs/hardware-baseline.md
shahondin1624 0616dd3d58 docs: add hardware baseline and Ollama configuration (issue #6)
Consolidates all Phase 0 benchmark results: GPU detection, model
performance metrics, VRAM budget, and known issues for RX 9070 XT
with RDNA4 native ROCm support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 21:32:14 +01:00


# Hardware Baseline and Ollama Configuration

## Hardware

| Component | Specification |
|---|---|
| CPU | AMD Ryzen 7 2700X |
| RAM | 64 GB DDR4 |
| GPU | AMD Radeon RX 9070 XT (16 GB VRAM, RDNA4) |
| GPU compute target | gfx1201 (native ROCm support) |
| GPU driver | 60342.13 |
| OS | Fedora 42 |
| Kernel | 6.18.6-100.fc42.x86_64 |

## Ollama Configuration

| Setting | Value |
|---|---|
| Install method | `curl -fsSL https://ollama.com/install.sh \| sh` |
| ROCm | Bundled with Ollama (no separate ROCm install) |
| `HSA_OVERRIDE_GFX_VERSION` | Not needed; gfx1201 is detected natively |
| Host | `http://127.0.0.1:11434` |
| Model storage | `/usr/share/ollama/.ollama/models` |
| Service | systemd unit `ollama.service` |

## Model Benchmarks

All benchmarks were run with the browser closed to avoid VRAM contention.

### Inference Models

| Model | Role | VRAM | Processor | Prompt Eval | Gen Rate | Context |
|---|---|---|---|---|---|---|
| qwen2.5:7b-instruct | Orchestrator / tool selection | 4.9 GB | 100% GPU | 611 tok/s | 99 tok/s | 4096 |
| qwen2.5-coder:14b-instruct | Coder subagent | 9.7 GB | 100% GPU | 520 tok/s | 53 tok/s | 4096 |
| qwen2.5:14b-instruct | Researcher / summarization | 9.7 GB | 100% GPU | 625 tok/s | 54 tok/s | 4096 |
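The throughput numbers above can be spot-checked with `ollama run --verbose`, which prints timing statistics (prompt eval rate, eval rate) after each response. A sketch, assuming the service is running and the models are already pulled; the prompt is arbitrary:

```shell
# Sketch: re-measure prompt-eval and generation rates for each inference model.
for model in qwen2.5:7b-instruct qwen2.5-coder:14b-instruct qwen2.5:14b-instruct; do
  echo "=== $model ==="
  # --verbose emits timing stats (on stderr) after the response completes.
  ollama run --verbose "$model" "Explain a hash map in two sentences." 2>&1 \
    | grep -E 'eval rate|eval duration'
done
```

Rates vary a few percent run-to-run; compare against the table with that tolerance in mind.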

### Embedding Model

| Model | Role | VRAM | Dimensions | Cold Load | Warm Inference |
|---|---|---|---|---|---|
| nomic-embed-text | Embeddings | ~0.3 GB | 768 | ~4.6 s | ~30 ms (7 tokens) |

## Model Selection Rationale

The architecture document (`planning-agent-prompt.md`) assigns models as follows:

| Role | Model | Justification |
|---|---|---|
| Orchestrator / tool selection | qwen2.5:7b-instruct | Fast (99 tok/s), small (4.9 GB), leaves VRAM for the embedding model |
| Coder subagent | qwen2.5-coder:14b-instruct | Code-specialized, fits in 16 GB VRAM |
| Researcher / summarization | qwen2.5:14b-instruct | General-purpose 14B, same VRAM as the coder variant |
| Embeddings | nomic-embed-text | Minimal VRAM (~0.3 GB), 768-dim vectors, fast inference |

**Key constraint:** The two 14B models (coder and researcher) cannot run simultaneously, since each consumes 9.7 GB of VRAM. Ollama handles model swapping automatically between tasks. The 7B orchestrator model can potentially stay resident on the secondary (CPU) machine to keep GPU VRAM free for the subagent models.
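Residency can also be steered explicitly through the `keep_alive` parameter of Ollama's generate API: `-1` keeps a model loaded until it is unloaded, `0` unloads it immediately. A sketch against the host configured above:

```shell
# Sketch: pin the small orchestrator model and free a 14B slot on demand
# (assumes the Ollama service at 127.0.0.1:11434 as configured above).

# An empty generate request with keep_alive=-1 loads the model and keeps it resident.
curl -s http://127.0.0.1:11434/api/generate \
  -d '{"model": "qwen2.5:7b-instruct", "keep_alive": -1}'

# keep_alive=0 unloads a model immediately, freeing its 9.7 GB before the
# other 14B model is requested.
curl -s http://127.0.0.1:11434/api/generate \
  -d '{"model": "qwen2.5-coder:14b-instruct", "keep_alive": 0}'

# `ollama ps` shows which models are currently resident and their VRAM footprint.
ollama ps
```

This is optional: without it, Ollama's automatic swapping applies the same eviction on demand, at the cost of the ~11.4 s cold-load noted under Known Issues.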

## VRAM Budget

| Scenario | VRAM Used | VRAM Free |
|---|---|---|
| 7B + nomic-embed-text | ~5.2 GB | ~10.7 GB |
| 14B alone | 9.7 GB | 6.2 GB |
| 14B + nomic-embed-text | ~10.0 GB | ~5.9 GB |
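The free column implies roughly 15.9 GB usable out of the nominal 16 GB (the remainder is reserved by the driver/display). A small sketch that reproduces the table's arithmetic under that assumption:

```shell
# Sketch: reproduce the VRAM-budget arithmetic, assuming ~15.9 GB usable
# (nominal 16 GB minus the reservation implied by the table above).
usable=15.9
while IFS=: read -r name used; do
  free=$(awk -v t="$usable" -v u="$used" 'BEGIN { printf "%.1f", t - u }')
  echo "$name: ${used} GB used, ${free} GB free"
done <<'EOF'
7B + nomic-embed-text:5.2
14B alone:9.7
14B + nomic-embed-text:10.0
EOF
```

Any new model or context-size change can be checked against the same budget before deployment.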

## Known Issues

1. **Browser VRAM contention:** Brave crashed when a 14B model was loaded, likely due to VRAM exhaustion. Close GPU-heavy applications before running large models.
2. **Model swap latency:** Cold-loading a 14B model takes ~11.4 s. This is acceptable per the architecture doc ("Model inference latency is acceptable").
3. **Model store location:** The default path is `/usr/share/ollama/.ollama/models`; ensure sufficient disk space (~25 GB for all four models).
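If the default store lacks the ~25 GB, Ollama's documented `OLLAMA_MODELS` environment variable can relocate it via a systemd drop-in. A sketch; the target path `/var/lib/ollama/models` is an example, and the directory must be writable by the `ollama` service user:

```shell
# Sketch: relocate the Ollama model store to a larger filesystem.
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_MODELS=/var/lib/ollama/models"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama.service
```

Already-pulled models must be moved (or re-pulled) into the new path, since Ollama does not migrate the store automatically.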