
shahondin1624 198c9537c9 Add setup + status markdown docs

- setup-guide.md: client-side install + cert/preset recipes (partly
  superseded by ai-server/README.md which goes deeper on mTLS gotchas).
- report.md: point-in-time status report of the ai-server infrastructure
  setup (llama.cpp build, Caddy route, mTLS cert chain, remaining
  action items at time of writing).

Kept out of main to separate operational history from the runtime
extension code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-23 21:16:20 +02:00


AI Server Setup — Status Report

Goal

Configure the mini PC (192.168.2.3, Fedora 43) as an AI inference server. Only the user connects to it (via PI agent). The server exposes LLM endpoints through the Caddy server (192.168.2.2) with mTLS authentication.

What We've Done

✅ 1. llama.cpp Installed & Rebuilt

Original state: llama.cpp source at ~/llama.cpp, binary at ~/llama.cpp/build/bin/llama-server
Issue: The binary had a symbol mismatch — it looked for ggml_backend_init in backend .so files, but those exported ggml_backend_vk_init / ggml_backend_cpu_init. This caused load_backend: failed to find ggml_backend_init errors.

Fix: Pulled latest master and rebuilt from source:

cd ~/llama.cpp && git pull origin master
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j$(nproc)

Result: Binary now works (version 3, commit 12568ca). Vulkan backend loads successfully.

✅ 2. One Model Downloaded

~/models/Qwen_Qwen3.6-35B-A3B-Q8_0.gguf (35GB, Q8_0 quant)
Qwen3.6-35B-A3B is an MoE model (35B total params, 3B active per token)
Native context: 262k tokens

📋 Additional Models in Preset File (Not Yet Downloaded)

The llama-server router preset (~/.llama-models.ini) has these placeholders:

small-7b: 7B model (Q8_0, ~7GB)
medium-32b: 32B model (Q8_0, ~32GB)
large-70b-q5: 70B model (Q5_K_M, ~46GB)
large-70b-q6: 70B model (Q6_K, ~54GB)

These need actual .gguf files placed in ~/models/ to become active.

✅ 3. llama.cpp Router Mode Working

Tested manually — llama-server --models-dir ~/models --models-max 6 --models-autoload starts a router server
Auto-discovers models from the directory
On-demand loading via POST /models/load
On-demand unloading via POST /models/unload
OpenAI-compatible API at http://127.0.0.1:8080/v1/chat/completions
Test request succeeded — model loaded, responded to "Say hello in one word"
Any load_backend: failed to find ggml_backend_init messages still logged at startup are non-fatal warnings
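The load and unload calls listed above can be wrapped in a small helper. This is a minimal sketch, assuming the router accepts a JSON body of the form {"model": "<name>"} on both endpoints and listens on the local address shown above. `LLAMA_URL`, `model_payload`, and `router_call` are hypothetical names chosen for illustration, not part of llama.cpp.

```shell
#!/usr/bin/env bash
# Sketch only: the payload shape is an assumption; check it against the
# llama-server version actually installed.
LLAMA_URL="${LLAMA_URL:-http://127.0.0.1:8080}"

# Build the JSON body naming the model to load or unload.
# Note: the name is inserted verbatim, so it must not contain quotes.
model_payload() {
  printf '{"model":"%s"}' "$1"
}

# POST to /models/load or /models/unload (action = load | unload).
router_call() {
  local action="$1" model="$2"
  curl -fsS -X POST "${LLAMA_URL}/models/${action}" \
    -H 'Content-Type: application/json' \
    -d "$(model_payload "$model")"
}

# Example calls (require a running router):
#   router_call load   Qwen_Qwen3.6-35B-A3B-Q8_0
#   router_call unload Qwen_Qwen3.6-35B-A3B-Q8_0
```

Keeping the payload builder separate from the network call makes it possible to check the request body without a running server.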

✅ 4. Systemd Service Created

~/.config/systemd/user/llama-server.service — user-level service
Starts llama-server with router mode, Vulkan GPU offload
Preset file: ~/.llama-models.ini

✅ 5. Preset File Created

~/.llama-models.ini — per-model settings for Qwen3.6-35B-A3B and placeholders for future models
Includes ctx-size, temperature, cache-type, gpu-layers per model

✅ 6. Caddy Route Added

Added ai.shahondin1624.de route to Caddyfile
Route proxies to 192.168.2.3:8080 (the mini PC)
mTLS configuration with client_auth { mode require_and_verify }

✅ 7. mTLS Certificates Generated

Root CA: /mnt/ssdpool/@docker/caddy/certs/root-ca.pem
Caddy server cert: caddy.pem + caddy-key.pem (signed by root CA)
Client cert: client.crt + client.key (signed by root CA)
Client P12 bundle: client.p12 (ready for import to PI machines)
CA file for clients: root-ca.pem (clients need to trust this)

Follow-Up Work and Open Items

✅ 8. Caddy Container Running with mTLS

Fixed: The Caddy container was failing because of two issues:

The certs directory wasn't mounted into the container
The key file was named caddy.key instead of caddy-key.pem (as expected by the Caddyfile)

Fix applied:

Updated docker-compose.yml to add ./certs:/etc/caddy/certs:ro volume mount
Renamed caddy.key → caddy-key.pem to match Caddyfile expectations
Recreated the container with docker compose up -d

Verification:

Caddy is running and serving all routes
mTLS is active on ai.shahondin1624.de (strict SNI enforcement confirmed)
Chat completion test successful — Qwen3.6-35B-A3B responded correctly

Status: The AI inference pipeline is fully operational:

# Note: -k skips server certificate verification, which makes --cacert redundant;
# drop -k once the server cert validates against root-ca.pem.
curl https://ai.shahondin1624.de/v1/chat/completions -k \
  --cert client.pem --key client-key.pem --cacert root-ca.pem \
  -H 'Content-Type: application/json' \
  -d '{"model":"Qwen_Qwen3.6-35B-A3B-Q8_0","messages":[{"role":"user","content":"Say hello"}],"temperature":0.7}'
# Response: {"choices":[{"message":{"content":"Hello",...}}]}

✅ 9. systemd Service Running

Service file created at ~/.config/systemd/user/llama-server.service
Preset file at ~/.llama-models.ini
Model loaded and responding through Caddy
VRAM monitor script at ~/vram-monitor.sh (basic version, still works)
Confirmed: Service is active (model responds to requests via router API)

🟡 10. VRAM Monitor Script Needs Improvement

Basic version created at ~/vram-monitor.sh
Currently only logs loaded/unloaded status
Needs: Actual idle-time checking and auto-unload logic (requires /models/{name}/stats endpoint or similar)
Status: The llama.cpp router exposes /models/{alias}/stats — can query {"loaded":true/false,"vram_used":0,"cpu_used":0.0} for auto-unload decisions
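The stats fields described above report whether a model is loaded but not when it was last used, so the idle check would have to work from a timestamp the monitor tracks itself. The following is a minimal sketch of that decision only, under that assumption. `should_unload` and the 15-minute limit are hypothetical and are not taken from the existing ~/vram-monitor.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) when a model has been idle long
# enough to unload.
# $1 = current time (epoch seconds)
# $2 = time of the model's last request (epoch seconds, tracked by the monitor)
# $3 = idle limit in seconds
should_unload() {
  local now="$1" last_used="$2" idle_limit="$3"
  (( now - last_used >= idle_limit ))
}

# Example with fixed timestamps: last request 20 minutes ago, 15-minute limit.
if should_unload 1700001200 1700000000 900; then
  echo "unload"
else
  echo "keep"
fi
```

In a real monitor loop the first argument would come from `date +%s`, and a positive result would trigger the POST /models/unload call for that model.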

🟡 11. PI Machine Certificate Installation

Client cert bundle (client.p12) needs to be copied to the PI machine (192.168.2.35)
Root CA (root-ca.pem) needs to be trusted on the PI machine
PI agent config needs to reference the cert files
Certs location: /mnt/ssdpool/@docker/caddy/certs/client.p12 and root-ca.pem on Caddy server (192.168.2.2)

🟢 12. Model Testing & Selection

User wants to test different models to find the sweet spot
Currently only has Qwen3.6-35B-A3B-Q8_0 (35B, Q8_0)
Planning:
- 7B models → Q8_0 → ~7GB → fits easily, good for always-warm
- 13B models → Q8_0 → ~13GB → fits with room
- 32B models → Q8_0 → ~32GB → fits with room
- 70B models → Q6_K (~54GB) or Q5_K_M (~46GB) → fits, leaves room for KV cache
- KV cache quantization (--cache-type-k q8_0 --cache-type-v q8_0) essential for large contexts
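The size figures in this plan can be sanity-checked with simple arithmetic: weight size is roughly parameters times bits per weight, divided by 8. The sketch below assumes that approximation and ignores metadata and any tensors kept at higher precision, so real files differ slightly. `estimate_gib` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Rough size of GGUF weights in GiB: parameters x bits-per-weight / 8 bytes.
# $1 = parameters in billions
# $2 = average bits per weight for the quantization (assumed input)
estimate_gib() {
  awk -v p="$1" -v bpw="$2" \
    'BEGIN { printf "%.1f\n", p * 1e9 * bpw / 8 / 1073741824 }'
}

# Q8_0 packs 32 weights into 34 bytes, i.e. 8.5 bits per weight.
estimate_gib 35 8.5   # prints 34.6
estimate_gib 7 8.5    # prints 6.9
```

For Q6_K (210 bytes per 256 weights, 6.5625 bits per weight) the same formula gives about 53.5 GiB for a 70B model, in line with the ~54GB planning figure.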

🟢 13. Context Window Planning

User wants 256k context for small models, 32k-64k for 70B
Reality check:
- Qwen3.6-35B-A3B Q8_0 @ 262k context → ~27GB VRAM on RTX 5090 (32GB)
- On Strix Halo (~110GB shared RAM), even 70B @ 262k is tight
- Recommendation: 256k for 7B models, 128k for 13B, 64k for 32B, 32k for 70B
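The reason context length dominates memory on large models is that the KV cache grows linearly with it. A minimal sketch of the arithmetic follows, assuming standard attention where every layer stores K and V for every token. The shape used (80 layers, 8 KV heads, head dimension 128) is illustrative only and is not read from any model in this report; `kv_cache_mib` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# KV cache size in MiB:
#   2 (K and V) x layers x kv_heads x head_dim x context x bytes per element
# Bytes per element is passed as "bytes per 32 elements" to stay in integer
# math: f16 = 64, q8_0 = 34 (32 int8 values plus a 2-byte scale).
kv_cache_mib() {
  local layers="$1" kv_heads="$2" head_dim="$3" ctx="$4" bytes_per_32="$5"
  echo $(( 2 * layers * kv_heads * head_dim * ctx * bytes_per_32 / 32 / 1048576 ))
}

kv_cache_mib 80 8 128 32768 64   # f16 cache at 32k context: prints 10240
kv_cache_mib 80 8 128 32768 34   # q8_0 cache at 32k context: prints 5440
```

Under these assumptions the same shape at 262144 tokens needs 81920 MiB (80 GiB) with an f16 cache, which is why the large models are paired with shorter contexts and a quantized KV cache.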

🟢 14. Future: Whisper (STT) & Other Services

Whisper.cpp for speech-to-text (optional, later)
Would run as separate systemd service on mini PC
Exposed through Caddy as voice.shahondin1624.de

Current Status

✅ Caddy container is running with mTLS enabled on ai.shahondin1624.de
✅ LLM pipeline is fully operational: model loaded and responding to chat requests
✅ All 5 models configured in the router preset (1 loaded; the other 4 are placeholders until their .gguf files are downloaded)

Remaining Action Items

Priority	Item	Status
🟡 Medium	PI machine cert installation (192.168.2.35)	Needs manual copy of client.p12 + root-ca.pem
🟡 Medium	VRAM monitor auto-unload logic	Needs idle-time checking via `/models/{name}/stats`
🟢 Later	Test additional models (7B, 32B, 70B)	Need to download and configure
🟢 Later	Context window tuning	Per-model recommendations above
🟢 Later	Whisper (STT) service	Future enhancement

Quick Commands Reference

# Caddy container
ssh shahondin1624@192.168.2.2 "docker compose -f /mnt/ssdpool/@docker/caddy/docker-compose.yml up -d"
ssh shahondin1624@192.168.2.2 "docker logs caddy --tail 30"

# AI API test (from Caddy server)
ssh shahondin1624@192.168.2.2 "docker exec caddy curl https://ai.shahondin1624.de/models -k --cert /etc/caddy/certs/caddy.pem --key /etc/caddy/certs/caddy-key.pem --cacert /etc/caddy/certs/root-ca.pem"

# Mini PC (192.168.2.3) - needs password auth
ssh shahondin1624@192.168.2.3 'systemctl --user status llama-server.service'
ssh shahondin1624@192.168.2.3 'curl http://127.0.0.1:8080/models'
