- setup-guide.md: client-side install + cert/preset recipes (partly superseded by ai-server/README.md which goes deeper on mTLS gotchas). - report.md: point-in-time status report of the ai-server infrastructure setup (llama.cpp build, Caddy route, mTLS cert chain, remaining action items at time of writing). Kept out of main to separate operational history from the runtime extension code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AI Server Setup — Status Report
Goal
Configure the mini PC (192.168.2.3, Fedora 43) as an AI inference server. Only the user connects to it (via PI agent). The server exposes LLM endpoints through the Caddy server (192.168.2.2) with mTLS authentication.
What We've Done
✅ 1. llama.cpp Installed & Rebuilt
- Original state: llama.cpp source at ~/llama.cpp, binary at ~/llama.cpp/build/bin/llama-server
- Issue: The binary had a symbol mismatch. It looked for ggml_backend_init in the backend .so files, but those exported ggml_backend_vk_init / ggml_backend_cpu_init. This caused "load_backend: failed to find ggml_backend_init" errors.
- Fix: Pulled latest master and rebuilt from source:

      cd ~/llama.cpp && git pull origin master
      cmake -B build -DGGML_VULKAN=ON
      cmake --build build --config Release -j$(nproc)

- Result: Binary now works (version 3, commit 12568ca). Vulkan backend loads successfully.
✅ 2. One Model Downloaded
- ~/models/Qwen_Qwen3.6-35B-A3B-Q8_0.gguf (35GB, Q8_0 quant)
- Qwen3.6-35B-A3B is an MoE model (35B total params, 3B active per token)
- Native context: 262k tokens
📋 Additional Models in Preset File (Not Yet Downloaded)
The llama-server router preset (~/.llama-models.ini) has these placeholders:
- small-7b: 7B model (Q8_0, ~7GB)
- medium-32b: 32B model (Q8_0, ~32GB)
- large-70b-q5: 70B model (Q5_K_M, ~46GB)
- large-70b-q6: 70B model (Q6_K, ~54GB)
These need actual .gguf files placed in ~/models/ to become active.
✅ 3. llama.cpp Router Mode Working
- Tested manually: llama-server --models-dir ~/models --models-max 6 --models-autoload starts a router server
- Auto-discovers models from the directory
- On-demand loading via POST /models/load
- On-demand unloading via POST /models/unload
- OpenAI-compatible API at http://127.0.0.1:8080/v1/chat/completions
- Test request succeeded: model loaded, responded to "Say hello in one word"
- "load_backend: failed to find ggml_backend_init" errors are harmless warnings (not fatal)
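The router endpoints listed above can be exercised with a few small curl wrappers. This is a minimal sketch, not the commands used during the test: the helper names are hypothetical, and the `{"model": "<name>"}` request body for `/models/load` and `/models/unload` is an assumption about the router API that should be checked against the llama.cpp server documentation for the installed build.

```sh
#!/bin/sh
# Base URL of the router; defaults to the local address from the manual test.
LLAMA_URL="${LLAMA_URL:-http://127.0.0.1:8080}"

# model_payload NAME
# Print the JSON body {"model":"NAME"} (the "model" key is an assumption).
model_payload() {
  printf '{"model":"%s"}' "$1"
}

# router_model_action load|unload NAME
# POST to /models/load or /models/unload for the given model name.
router_model_action() {
  curl -sS -X POST "$LLAMA_URL/models/$1" \
    -H 'Content-Type: application/json' \
    -d "$(model_payload "$2")"
}

# chat_once NAME PROMPT
# Send a single-turn request to the OpenAI-compatible chat endpoint.
# PROMPT is pasted into the JSON unescaped, so it must not contain
# double quotes or backslashes.
chat_once() {
  curl -sS "$LLAMA_URL/v1/chat/completions" \
    -H 'Content-Type: application/json' \
    -d "{\"model\":\"$1\",\"messages\":[{\"role\":\"user\",\"content\":\"$2\"}]}"
}
```

For example, `router_model_action load Qwen_Qwen3.6-35B-A3B-Q8_0` followed by `chat_once Qwen_Qwen3.6-35B-A3B-Q8_0 "Say hello in one word"` would repeat the manual test, assuming the model name matches what the router discovered.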
✅ 4. Systemd Service Created
- ~/.config/systemd/user/llama-server.service: user-level service
- Starts llama-server with router mode, Vulkan GPU offload
- Preset file: ~/.llama-models.ini
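The report does not reproduce the unit file, so the following is only a sketch of what a user-level unit for router mode might look like. The router flags are the ones from the manual test. The `--host 0.0.0.0 --port 8080` arguments are an assumption, added because Caddy on another machine has to reach the server. The preset file is deliberately left out because the flag that loads it is not stated here. `render_unit` is a hypothetical helper name.

```sh
#!/bin/sh
# render_unit
# Print a candidate llama-server.service to stdout.
# %h is the systemd specifier for the user's home directory.
render_unit() {
  cat <<'EOF'
[Unit]
Description=llama.cpp router server (sketch)

[Service]
# Router flags taken from the manual test; --host/--port are assumptions.
ExecStart=%h/llama.cpp/build/bin/llama-server --models-dir %h/models --models-max 6 --models-autoload --host 0.0.0.0 --port 8080
Restart=on-failure

[Install]
WantedBy=default.target
EOF
}

# Possible installation steps (run manually on the mini PC):
#   render_unit > ~/.config/systemd/user/llama-server.service
#   systemctl --user daemon-reload
#   systemctl --user enable --now llama-server.service
#   loginctl enable-linger "$USER"   # keep user services running without a login session
```

Printing the unit from a function keeps the sketch reviewable before anything is written to disk; the real unit on the mini PC may differ.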
✅ 5. Preset File Created
- ~/.llama-models.ini: per-model settings for Qwen3.6-35B-A3B and placeholders for future models
- Includes ctx-size, temperature, cache-type, gpu-layers per model
✅ 6. Caddy Route Added
- Added ai.shahondin1624.de route to Caddyfile
- Route proxies to 192.168.2.3:8080 (the mini PC)
- mTLS configuration with client_auth { mode require_and_verify }
✅ 7. mTLS Certificates Generated
- Root CA: /mnt/ssdpool/@docker/caddy/certs/root-ca.pem
- Caddy server cert: caddy.pem + caddy-key.pem (signed by root CA)
- Client cert: client.crt + client.key (signed by root CA)
- Client P12 bundle: client.p12 (ready for import to PI machines)
- CA file for clients: root-ca.pem (clients need to trust this)
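The report does not record how these files were produced. One common way to build such a chain is with plain openssl, sketched below. The function names, subject names, key sizes, validity periods, and the `root-ca-key.pem` file name are all illustrative assumptions. The Caddy server certificate would follow the same pattern as the client certificate, with a subjectAltName for the hostname added.

```sh
#!/bin/sh
# Key size for the root CA; overridable for quick experiments.
CA_BITS="${CA_BITS:-4096}"

# make_root_ca DIR
# Create a self-signed root CA (root-ca.pem + root-ca-key.pem) in DIR.
make_root_ca() {
  openssl req -x509 -newkey "rsa:$CA_BITS" -nodes -days 3650 \
    -subj "/CN=Example Home Root CA" \
    -keyout "$1/root-ca-key.pem" -out "$1/root-ca.pem"
}

# make_client_cert DIR P12_PASSWORD
# Create client.key, sign client.crt with the root CA in DIR,
# and bundle both into a password-protected client.p12.
make_client_cert() {
  openssl req -newkey rsa:2048 -nodes -subj "/CN=example-client" \
    -keyout "$1/client.key" -out "$1/client.csr" &&
  openssl x509 -req -in "$1/client.csr" -days 365 \
    -CA "$1/root-ca.pem" -CAkey "$1/root-ca-key.pem" -CAcreateserial \
    -out "$1/client.crt" &&
  openssl pkcs12 -export -inkey "$1/client.key" -in "$1/client.crt" \
    -certfile "$1/root-ca.pem" -passout "pass:$2" -out "$1/client.p12"
}
```

Running `openssl verify -CAfile root-ca.pem client.crt` afterwards is a quick way to confirm that the client certificate chains to the root CA.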
What Still Needs To Be Done
✅ 8. Caddy Container Running with mTLS
Fixed: The Caddy container was failing because of two issues:
- The certs directory wasn't mounted into the container
- The key file was named caddy.key instead of caddy-key.pem (as expected by the Caddyfile)
Fix applied:
- Updated docker-compose.yml to add the ./certs:/etc/caddy/certs:ro volume mount
- Renamed caddy.key → caddy-key.pem to match Caddyfile expectations
- Recreated the container with docker compose up -d
Verification:
- Caddy is running and serving all routes
- mTLS is active on ai.shahondin1624.de (strict SNI enforcement confirmed)
- Chat completion test successful: Qwen3.6-35B-A3B responded correctly
Status: The AI inference pipeline is fully operational:
curl https://ai.shahondin1624.de/v1/chat/completions -k \
--cert client.pem --key client-key.pem --cacert root-ca.pem \
-H 'Content-Type: application/json' \
-d '{"model":"Qwen_Qwen3.6-35B-A3B-Q8_0","messages":[{"role":"user","content":"Say hello"}],"temperature":0.7}'
# Response: {"choices":[{"message":{"content":"Hello",...}}]}
✅ 9. systemd Service Running
- Service file created at ~/.config/systemd/user/llama-server.service
- Preset file at ~/.llama-models.ini
- Model loaded and responding through Caddy
- VRAM monitor script at ~/vram-monitor.sh (basic version, still works)
- Confirmed: Service is active (model responds to requests via router API)
🟡 10. VRAM Monitor Script Needs Improvement
- Basic version created at ~/vram-monitor.sh
- Currently only logs loaded/unloaded status
- Needs: Actual idle-time checking and auto-unload logic (requires /models/{name}/stats endpoint or similar)
- Status: The llama.cpp router exposes /models/{alias}/stats, which can be queried for {"loaded":true/false,"vram_used":0,"cpu_used":0.0} to drive auto-unload decisions
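The missing idle-time logic could be structured roughly as follows. This is a sketch under several assumptions: the stats endpoint and its `"loaded"` field are taken from the description above and not verified, the `{"model": "<name>"}` unload body is assumed, and the stats response shown carries no last-activity time, so the caller has to supply one (for example from a proxy access log). All function names are hypothetical.

```sh
#!/bin/sh
LLAMA_URL="${LLAMA_URL:-http://127.0.0.1:8080}"
# Seconds of inactivity after which a model is unloaded (illustrative default).
IDLE_LIMIT="${IDLE_LIMIT:-900}"

# should_unload NOW LAST_USED [LIMIT]
# Succeed when the model has been idle for at least LIMIT seconds.
should_unload() {
  [ $(( $1 - $2 )) -ge "${3:-$IDLE_LIMIT}" ]
}

# is_loaded
# Read a stats JSON document on stdin; succeed if it reports "loaded":true.
is_loaded() {
  grep -Eq '"loaded"[[:space:]]*:[[:space:]]*true'
}

# unload_model ALIAS
# Ask the router to unload the model (request body is an assumption).
unload_model() {
  curl -sS -X POST "$LLAMA_URL/models/unload" \
    -H 'Content-Type: application/json' \
    -d "{\"model\":\"$1\"}"
}

# check_model ALIAS LAST_USED_EPOCH
# One monitoring pass: unload ALIAS if it is loaded and has been idle too long.
check_model() {
  stats=$(curl -sS "$LLAMA_URL/models/$1/stats") || return 1
  if printf '%s' "$stats" | is_loaded && should_unload "$(date +%s)" "$2"; then
    unload_model "$1"
  fi
}
```

Keeping the idle decision in `should_unload` separate from the HTTP calls makes the threshold logic easy to test without a running server.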
🟡 11. PI Machine Certificate Installation
- Client cert bundle (client.p12) needs to be copied to the PI machine (192.168.2.35)
- Root CA (root-ca.pem) needs to be trusted on the PI machine
- PI agent config needs to reference the cert files
- Certs location: /mnt/ssdpool/@docker/caddy/certs/client.p12 and root-ca.pem on Caddy server (192.168.2.2)
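Tools such as curl usually want separate PEM files, not a P12 bundle. Once client.p12 has been copied to the PI machine, it could be split as sketched below. The function name is hypothetical, and the output names client.pem and client-key.pem are chosen only to match the curl test earlier in this report. The bundle password is whatever was set at export time.

```sh
#!/bin/sh
# unpack_p12 P12_FILE PASSWORD OUTDIR
# Extract the client certificate and the unencrypted private key
# from a PKCS#12 bundle into OUTDIR.
unpack_p12() {
  openssl pkcs12 -in "$1" -passin "pass:$2" -clcerts -nokeys \
    -out "$3/client.pem" &&
  openssl pkcs12 -in "$1" -passin "pass:$2" -nocerts -nodes \
    -out "$3/client-key.pem" &&
  chmod 600 "$3/client-key.pem"   # private key: readable by the owner only
}

# Possible copy step, run on the PI machine (paths from this report):
#   scp shahondin1624@192.168.2.2:/mnt/ssdpool/@docker/caddy/certs/client.p12 .
#   scp shahondin1624@192.168.2.2:/mnt/ssdpool/@docker/caddy/certs/root-ca.pem .
```

How the root CA is trusted system-wide depends on the PI machine's operating system, which this report does not state. Passing `--cacert root-ca.pem` per request avoids touching the system trust store.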
🟢 12. Model Testing & Selection
- User wants to test different models to find the sweet spot
- Currently only has Qwen3.6-35B-A3B-Q8_0 (35B, Q8_0)
- Planning:
- 7B models → Q8_0 → ~7GB → fits easily, good for always-warm
- 13B models → Q8_0 → ~13GB → fits with room
- 32B models → Q8_0 → ~32GB → fits with room
- 70B models → Q6_K (~54GB) or Q5_K_M (~46GB) → fits, leaves room for KV cache
- KV cache quantization (--cache-type-k q8_0 --cache-type-v q8_0) essential for large contexts
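The size figures in this plan can be roughly reproduced with the rule of thumb size ≈ parameters × bits per weight / 8. The sketch below reports the result in GiB and uses 8.5 bits per weight for Q8_0 and 6.5625 for Q6_K, which follow from those formats' block layouts. Mixed quants such as Q5_K_M vary by model, so any value used for them is an approximation. `gguf_gib` is a hypothetical helper, and real GGUF files deviate somewhat because of metadata and tensors kept at higher precision.

```sh
#!/bin/sh
# gguf_gib PARAMS_BILLION BITS_PER_WEIGHT
# Estimate the weight size in GiB: params * bits / 8 bytes, divided by 2^30.
gguf_gib() {
  awk -v p="$1" -v b="$2" \
    'BEGIN { printf "%.1f\n", p * 1e9 * b / 8 / 1073741824 }'
}

gguf_gib 7 8.5        # 7B at Q8_0   -> 6.9
gguf_gib 13 8.5       # 13B at Q8_0  -> 12.9
gguf_gib 32 8.5       # 32B at Q8_0  -> 31.7
gguf_gib 70 6.5625    # 70B at Q6_K  -> 53.5
```

These estimates cover weights only. KV cache and compute buffers come on top and grow with context size.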
🟢 13. Context Window Planning
- User wants 256k context for small models, 32k-64k for 70B
- Reality check:
- Qwen3.6-35B-A3B Q8_0 @ 262k context → ~27GB VRAM on RTX 5090 (32GB)
- On Strix Halo (~110GB shared RAM), even 70B @ 262k is tight
- Recommendation: 256k for 7B models, 128k for 13B, 64k for 32B, 32k for 70B
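To sanity-check these context recommendations, the KV cache of a standard grouped-query-attention transformer can be estimated as 2 × layers × KV heads × head dimension × context length × bytes per element. In the sketch below, the dimensions (80 layers, 8 KV heads, head dimension 128) are those of a typical Llama-3-style 70B model and are not taken from this report. 2 bytes per element corresponds to f16 and 1.0625 to q8_0. `kv_cache_gib` is a hypothetical helper, and models with non-standard attention layouts will not follow this formula.

```sh
#!/bin/sh
# kv_cache_gib LAYERS KV_HEADS HEAD_DIM CTX_TOKENS BYTES_PER_ELEM
# Estimate KV cache size in GiB; the leading factor 2 covers keys and values.
kv_cache_gib() {
  awk -v l="$1" -v h="$2" -v d="$3" -v c="$4" -v b="$5" \
    'BEGIN { printf "%.1f\n", 2 * l * h * d * c * b / 1073741824 }'
}

kv_cache_gib 80 8 128 32768 2         # 70B-style, 32k context, f16   -> 10.0
kv_cache_gib 80 8 128 262144 2        # 70B-style, 262k context, f16  -> 80.0
kv_cache_gib 80 8 128 262144 1.0625   # 70B-style, 262k context, q8_0 -> 42.5
```

Under these assumptions, a 70B model at 262k context with a q8_0 cache needs about 42.5 GiB for the cache on top of roughly 54GB of Q6_K weights. That is consistent with calling it tight on a machine with about 110GB of shared memory.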
🟢 14. Future: Whisper (STT) & Other Services
- Whisper.cpp for speech-to-text (optional, later)
- Would run as separate systemd service on mini PC
- Exposed through Caddy as voice.shahondin1624.de
Current Status
✅ Caddy container is running with mTLS enabled on ai.shahondin1624.de
✅ LLM pipeline is fully operational — model loaded, responding to chat requests
✅ All 5 models configured in the router (1 loaded, 4 on-demand)
Remaining Action Items
| Priority | Item | Status |
|---|---|---|
| 🟡 Medium | PI machine cert installation (192.168.2.35) | Needs manual copy of client.p12 + root-ca.pem |
| 🟡 Medium | VRAM monitor auto-unload logic | Needs idle-time checking via /models/{name}/stats |
| 🟢 Later | Test additional models (7B, 32B, 70B) | Need to download and configure |
| 🟢 Later | Context window tuning | Per-model recommendations above |
| 🟢 Later | Whisper (STT) service | Future enhancement |
Quick Commands Reference
# Caddy container
ssh shahondin1624@192.168.2.2 "docker compose -f /mnt/ssdpool/@docker/caddy/docker-compose.yml up -d"
ssh shahondin1624@192.168.2.2 "docker logs caddy --tail 30"
# AI API test (from Caddy server)
ssh shahondin1624@192.168.2.2 "docker exec caddy curl https://ai.shahondin1624.de/models -k --cert /etc/caddy/certs/caddy.pem --key /etc/caddy/certs/caddy-key.pem --cacert /etc/caddy/certs/root-ca.pem"
# Mini PC (192.168.2.3) - needs password auth
ssh shahondin1624@192.168.2.3 'systemctl --user status llama-server.service'
ssh shahondin1624@192.168.2.3 'curl http://127.0.0.1:8080/models'