# Hardware Baseline and Ollama Configuration
## Hardware

| Component | Specification |
|---|---|
| CPU | AMD Ryzen 7 2700X |
| RAM | 64 GB DDR4 |
| GPU | AMD Radeon RX 9070 XT (16 GB VRAM, RDNA4) |
| GPU Compute | gfx1201 (native ROCm support) |
| GPU Driver | 60342.13 |
| OS | Fedora 42 |
| Kernel | 6.18.6-100.fc42.x86_64 |
## Ollama Configuration

| Setting | Value |
|---|---|
| Install method | `curl -fsSL https://ollama.com/install.sh \| sh` |
| ROCm | Bundled with Ollama (no separate ROCm install) |
| `HSA_OVERRIDE_GFX_VERSION` | Not needed; gfx1201 detected natively |
| Host | `http://127.0.0.1:11434` |
| Model storage | `/usr/share/ollama/.ollama/models` |
| Service | systemd unit `ollama.service` |
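Since Ollama runs as a systemd service here, settings such as the bind address or model store location are changed through the service environment rather than a shell profile. A minimal drop-in sketch, assuming the defaults from the table above (`OLLAMA_HOST` and `OLLAMA_MODELS` are Ollama's documented environment variables; the file path follows standard systemd drop-in conventions):

```ini
# /etc/systemd/system/ollama.service.d/override.conf
# Drop-in override for the bundled ollama.service unit.
[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"
Environment="OLLAMA_MODELS=/usr/share/ollama/.ollama/models"
```

Apply with `sudo systemctl daemon-reload && sudo systemctl restart ollama`.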
## Model Benchmarks

All benchmarks were run with the browser closed to avoid VRAM contention.

### Inference Models

| Model | Role | VRAM | Processor | Prompt Eval | Gen Rate | Context |
|---|---|---|---|---|---|---|
| qwen2.5:7b-instruct | Orchestrator / tool selection | 4.9 GB | 100% GPU | 611 tok/s | 99 tok/s | 4096 |
| qwen2.5-coder:14b-instruct | Coder subagent | 9.7 GB | 100% GPU | 520 tok/s | 53 tok/s | 4096 |
| qwen2.5:14b-instruct | Researcher / summarization | 9.7 GB | 100% GPU | 625 tok/s | 54 tok/s | 4096 |
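The Prompt Eval and Gen Rate columns come from the timing summary that `ollama run <model> --verbose` prints after each response. A sketch of pulling the generation rate out of that summary — the sample line below is a hard-coded stand-in mimicking the assumed stats format, not captured from a live run:

```shell
# Stand-in for one line of `ollama run --verbose` timing output
# (assumed format; replace with real captured output in practice).
line="eval rate:            53.12 tokens/s"

# The third whitespace-separated field is the tokens/s figure.
gen_rate=$(printf '%s\n' "$line" | awk '{print $3}')
echo "gen rate: $gen_rate tok/s"
```

The same pattern works for the `prompt eval rate:` line to recover the Prompt Eval column.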
### Embedding Model

| Model | Role | VRAM | Dimensions | Cold Load | Warm Inference |
|---|---|---|---|---|---|
| nomic-embed-text | Embeddings | ~0.3 GB | 768 | ~4.6s | ~30ms (7 tokens) |
## Model Selection Rationale

The architecture document (`planning-agent-prompt.md`) assigns models as follows:

| Role | Model | Justification |
|---|---|---|
| Orchestrator / tool selection | qwen2.5:7b-instruct | Fast (99 tok/s), small (4.9 GB), leaves VRAM for embedding model |
| Coder subagent | qwen2.5-coder:14b-instruct | Code-specialized, fits in 16 GB VRAM |
| Researcher / summarization | qwen2.5:14b-instruct | General-purpose 14B, same VRAM as coder variant |
| Embeddings | nomic-embed-text | Minimal VRAM (~0.3 GB), 768-dim vectors, fast inference |
**Key constraint:** The two 14B models (coder and researcher) cannot run simultaneously, since each consumes 9.7 GB of VRAM. Ollama handles model swapping automatically between tasks. The 7B orchestrator model can potentially stay resident on the secondary (CPU) machine to keep GPU VRAM free for the subagent models.
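The constraint is simple arithmetic against the ~15.9 GB of VRAM the budget figures imply as usable on the 16 GB card. A sketch in integer tenths of a GB (sizes taken from the benchmark tables above):

```shell
# Why the two 14B models cannot co-reside on the GPU.
# All values are tenths of a GB to keep the arithmetic integer-only.
usable=159      # ~15.9 GB usable of the 16 GB card
coder=97        # qwen2.5-coder:14b-instruct
researcher=97   # qwen2.5:14b-instruct

needed=$((coder + researcher))
if [ "$needed" -gt "$usable" ]; then
  echo "both 14B models need $((needed / 10)).$((needed % 10)) GB > 15.9 GB usable"
fi
```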
## VRAM Budget

| Scenario | VRAM Used | VRAM Free |
|---|---|---|
| 7B + nomic-embed-text | ~5.2 GB | ~10.7 GB |
| 14B alone | 9.7 GB | 6.2 GB |
| 14B + nomic-embed-text | ~10.0 GB | ~5.9 GB |
## Known Issues

- Browser VRAM contention: Brave browser crashed when a 14B model was loaded, likely due to VRAM exhaustion. Close GPU-heavy applications before running large models.
- Model swap latency: Cold-loading a 14B model takes ~11.4s. This is acceptable per the architecture doc ("Model inference latency is acceptable").
- Ollama model store location: The default path is `/usr/share/ollama/.ollama/models`; ensure sufficient disk space (~25 GB for all four models).
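A quick pre-pull check of the filesystem holding the model store, as a sketch (the fallback to `/` is only there so the command runs on machines where the Ollama path does not exist yet):

```shell
# Report free space on the filesystem holding the Ollama model store.
store=/usr/share/ollama/.ollama/models
[ -d "$store" ] || store=/   # fall back so the check works anywhere

df -h "$store"
```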