pi-extensions/report.md
shahondin1624 198c9537c9 Add setup + status markdown docs
- setup-guide.md: client-side install + cert/preset recipes (partly
  superseded by ai-server/README.md which goes deeper on mTLS gotchas).
- report.md: point-in-time status report of the ai-server infrastructure
  setup (llama.cpp build, Caddy route, mTLS cert chain, remaining
  action items at time of writing).

Kept out of main to separate operational history from the runtime
extension code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 21:16:20 +02:00

AI Server Setup — Status Report

Goal

Configure the mini PC (192.168.2.3, Fedora 43) as an AI inference server. Only the user connects to it (via PI agent). The server exposes LLM endpoints through the Caddy server (192.168.2.2) with mTLS authentication.


What We've Done

1. llama.cpp Installed & Rebuilt

  • Original state: llama.cpp source at ~/llama.cpp, binary at ~/llama.cpp/build/bin/llama-server
  • Issue: The binary had a symbol mismatch — it looked for ggml_backend_init in backend .so files, but those exported ggml_backend_vk_init / ggml_backend_cpu_init. This caused load_backend: failed to find ggml_backend_init errors.
  • Fix: Pulled latest master and rebuilt from source:
    cd ~/llama.cpp && git pull origin master
    cmake -B build -DGGML_VULKAN=ON
    cmake --build build --config Release -j$(nproc)
    
  • Result: Binary now works (version 3, commit 12568ca). Vulkan backend loads successfully.

2. One Model Downloaded

  • ~/models/Qwen_Qwen3.6-35B-A3B-Q8_0.gguf (35GB, Q8_0 quant)
  • Qwen3.6-35B-A3B is an MoE model (35B total params, 3B active per token)
  • Native context: 262k tokens

📋 Additional Models in Preset File (Not Yet Downloaded)

The llama-server router preset (~/.llama-models.ini) has these placeholders:

  • small-7b — 7B model (Q8_0, ~7GB)
  • medium-32b — 32B model (Q8_0, ~32GB)
  • large-70b-q5 — 70B model (Q5_K_M, ~46GB)
  • large-70b-q6 — 70B model (Q6_K, ~54GB)

These need actual .gguf files placed in ~/models/ to become active.

3. llama.cpp Router Mode Working

  • Tested manually — llama-server --models-dir ~/models --models-max 6 --models-autoload starts a router server
  • Auto-discovers models from the directory
  • On-demand loading via POST /models/load
  • On-demand unloading via POST /models/unload
  • OpenAI-compatible API at http://127.0.0.1:8080/v1/chat/completions
  • Test request succeeded — model loaded, responded to "Say hello in one word"
  • Any remaining load_backend: failed to find ggml_backend_init messages are harmless warnings (not fatal)
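
The router calls listed above can be wrapped in a few shell helpers. This is a minimal sketch, not the commands used during the test: it assumes the load and unload endpoints accept a JSON body of the form {"model": "<name>"}, and the function names and the BASE_URL variable are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: helpers for the llama-server router API on the mini PC.
# Assumption: /models/load and /models/unload take {"model":"<name>"}.
BASE_URL="${BASE_URL:-http://127.0.0.1:8080}"

# Build the JSON body shared by the load and unload calls.
model_body() {
  printf '{"model":"%s"}' "$1"
}

# List the models the router has discovered.
list_models() {
  curl -sS "$BASE_URL/models"
}

# Ask the router to load a model on demand.
load_model() {
  curl -sS -X POST "$BASE_URL/models/load" \
    -H 'Content-Type: application/json' -d "$(model_body "$1")"
}

# Ask the router to unload a model and free its memory.
unload_model() {
  curl -sS -X POST "$BASE_URL/models/unload" \
    -H 'Content-Type: application/json' -d "$(model_body "$1")"
}
```

Usage on the mini PC would be along the lines of `load_model Qwen_Qwen3.6-35B-A3B-Q8_0` followed by `list_models` to confirm the state change.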

4. Systemd Service Created

  • ~/.config/systemd/user/llama-server.service — user-level service
  • Starts llama-server with router mode, Vulkan GPU offload
  • Preset file: ~/.llama-models.ini
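
The unit file itself is not reproduced in this report. The sketch below shows what a user-level unit for this setup might look like, under these assumptions: the binary and model paths are the ones from sections 1 and 2, the router flags are the ones from the manual test in section 3, the preset file is passed with --models-preset, and the server listens on 0.0.0.0:8080 so the Caddy host can reach it. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print a user-level systemd unit for llama-server in router mode.
# %h is systemd's specifier for the user's home directory.
render_llama_unit() {
  cat <<'EOF'
[Unit]
Description=llama.cpp router server

[Service]
ExecStart=%h/llama.cpp/build/bin/llama-server \
  --host 0.0.0.0 --port 8080 \
  --models-dir %h/models --models-max 6 --models-autoload \
  --models-preset %h/.llama-models.ini
Restart=on-failure

[Install]
WantedBy=default.target
EOF
}

# Install and start (left as comments so sourcing this file changes nothing):
#   render_llama_unit > ~/.config/systemd/user/llama-server.service
#   systemctl --user daemon-reload
#   systemctl --user enable --now llama-server.service
#   loginctl enable-linger "$USER"   # keep the service running without a login session
```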

5. Preset File Created

  • ~/.llama-models.ini — per-model settings for Qwen3.6-35B-A3B and placeholders for future models
  • Includes ctx-size, temperature, cache-type, gpu-layers per model

6. Caddy Route Added

  • Added ai.shahondin1624.de route to Caddyfile
  • Route proxies to 192.168.2.3:8080 (the mini PC)
  • mTLS configuration with client_auth { mode require_and_verify }
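
For orientation, a site block matching the three points above might look like the sketch below. It is an assumption about the shape of the route, not a copy of the real Caddyfile: the in-container cert paths follow the /etc/caddy/certs mount described later in this report, and the trust_pool form is the Caddy 2.8+ syntax (older releases used trusted_ca_cert_file). The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print a Caddyfile site block for the mTLS-protected AI route.
render_ai_route() {
  cat <<'EOF'
ai.shahondin1624.de {
    tls /etc/caddy/certs/caddy.pem /etc/caddy/certs/caddy-key.pem {
        client_auth {
            mode require_and_verify
            # Clients must present a cert signed by this CA.
            trust_pool file {
                pem_file /etc/caddy/certs/root-ca.pem
            }
        }
    }
    reverse_proxy 192.168.2.3:8080
}
EOF
}

# After editing the real Caddyfile, validate it inside the container:
#   docker exec caddy caddy validate --config /etc/caddy/Caddyfile
```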

7. mTLS Certificates Generated

  • Root CA: /mnt/ssdpool/@docker/caddy/certs/root-ca.pem
  • Caddy server cert: caddy.pem + caddy-key.pem (signed by root CA)
  • Client cert: client.crt + client.key (signed by root CA)
  • Client P12 bundle: client.p12 (ready for import to PI machines)
  • CA file for clients: root-ca.pem (clients need to trust this)
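
The exact commands used to create these files are not recorded here. A chain with the same file names can be produced with openssl roughly as follows. This is a sketch under assumptions: the root-ca.key name, the subject names, the validity periods, and the .csr/.ext helper files are all hypothetical, and the function name is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: build a private root CA, a server cert, and a client cert + P12 bundle.
# Usage: make_mtls_chain <output-dir> <server-dns-name> <p12-password>
# The body runs in a subshell so cd and set -e do not leak into the caller.
make_mtls_chain() (
  set -e
  dir="$1"; host="$2"; pass="$3"
  mkdir -p "$dir"; cd "$dir"

  # 1. Self-signed root CA.
  openssl req -x509 -newkey rsa:4096 -sha256 -nodes -days 3650 \
    -subj "/CN=Home Lab Root CA" -keyout root-ca.key -out root-ca.pem

  # 2. Server key and CSR, signed by the CA with a SAN for the hostname.
  openssl req -newkey rsa:2048 -nodes -subj "/CN=$host" \
    -keyout caddy-key.pem -out caddy.csr
  printf 'subjectAltName=DNS:%s\nextendedKeyUsage=serverAuth\n' "$host" > caddy.ext
  openssl x509 -req -in caddy.csr -CA root-ca.pem -CAkey root-ca.key \
    -CAcreateserial -days 825 -sha256 -extfile caddy.ext -out caddy.pem

  # 3. Client key and CSR, signed by the same CA for client authentication.
  openssl req -newkey rsa:2048 -nodes -subj "/CN=pi-agent" \
    -keyout client.key -out client.csr
  printf 'extendedKeyUsage=clientAuth\n' > client.ext
  openssl x509 -req -in client.csr -CA root-ca.pem -CAkey root-ca.key \
    -CAcreateserial -days 825 -sha256 -extfile client.ext -out client.crt

  # 4. P12 bundle for import on client machines.
  openssl pkcs12 -export -inkey client.key -in client.crt \
    -certfile root-ca.pem -passout "pass:$pass" -out client.p12
)
```

Running `openssl verify -CAfile root-ca.pem caddy.pem client.crt` in the output directory confirms that both leaf certificates chain to the root.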

What Still Needs To Be Done

8. Caddy Container Running with mTLS

Fixed: The Caddy container was failing because of two issues:

  1. The certs directory wasn't mounted into the container
  2. The key file was named caddy.key instead of caddy-key.pem (as expected by the Caddyfile)

Fix applied:

  • Updated docker-compose.yml to add ./certs:/etc/caddy/certs:ro volume mount
  • Renamed caddy.key → caddy-key.pem to match Caddyfile expectations
  • Recreated the container with docker compose up -d

Verification:

  • Caddy is running and serving all routes
  • mTLS is active on ai.shahondin1624.de (strict SNI enforcement confirmed)
  • Chat completion test successful — Qwen3.6-35B-A3B responded correctly

Status: The AI inference pipeline is fully operational:

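# client.pem and client-key.pem stand for the client certificate and key from section 7 (client.crt / client.key).
# -k skips server certificate verification, so --cacert only takes effect once -k is dropped.
curl https://ai.shahondin1624.de/v1/chat/completions -k \
  --cert client.pem --key client-key.pem --cacert root-ca.pem \
  -H 'Content-Type: application/json' \
  -d '{"model":"Qwen_Qwen3.6-35B-A3B-Q8_0","messages":[{"role":"user","content":"Say hello"}],"temperature":0.7}'
# Response: {"choices":[{"message":{"content":"Hello",...}}]}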

9. systemd Service Running

  • Service file created at ~/.config/systemd/user/llama-server.service
  • Preset file at ~/.llama-models.ini
  • Model loaded and responding through Caddy
  • VRAM monitor script at ~/vram-monitor.sh (basic version, still works)
  • Confirmed: Service is active (model responds to requests via router API)

🟡 10. VRAM Monitor Script Needs Improvement

  • Basic version created at ~/vram-monitor.sh
  • Currently only logs loaded/unloaded status
  • Needs: Actual idle-time checking and auto-unload logic (requires /models/{name}/stats endpoint or similar)
  • Status: The llama.cpp router exposes /models/{alias}/stats — can query {"loaded":true/false,"vram_used":0,"cpu_used":0.0} for auto-unload decisions
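
The missing auto-unload step could be structured as below. This is a sketch, not the current ~/vram-monitor.sh: it assumes the monitor records a last-used timestamp per model itself (the sketch does not rely on the router reporting one), that the unload endpoint accepts {"model": "<name>"}, and the IDLE_LIMIT default and function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: idle-based auto-unload decision for a VRAM monitor.
BASE_URL="${BASE_URL:-http://127.0.0.1:8080}"
IDLE_LIMIT="${IDLE_LIMIT:-900}"   # seconds of inactivity before unloading (hypothetical default)

# Succeeds (exit 0) when the model has been idle for at least the limit.
# Arguments: <last_used_epoch> <now_epoch> <limit_seconds>
is_idle() {
  local last_used="$1" now="$2" limit="$3"
  [ $(( now - last_used )) -ge "$limit" ]
}

# Unload one model through the router API.
unload_model() {
  curl -sS -X POST "$BASE_URL/models/unload" \
    -H 'Content-Type: application/json' \
    -d "{\"model\":\"$1\"}"
}

# Unload the model only if it has been idle long enough.
# Arguments: <model_name> <last_used_epoch>
maybe_unload() {
  local model="$1" last_used="$2"
  if is_idle "$last_used" "$(date +%s)" "$IDLE_LIMIT"; then
    unload_model "$model"
  fi
}
```

Keeping the decision in is_idle, separate from the curl call, lets the threshold logic be checked without a running server.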

🟡 11. PI Machine Certificate Installation

  • Client cert bundle (client.p12) needs to be copied to the PI machine (192.168.2.35)
  • Root CA (root-ca.pem) needs to be trusted on the PI machine
  • PI agent config needs to reference the cert files
  • Certs location: /mnt/ssdpool/@docker/caddy/certs/client.p12 and root-ca.pem on Caddy server (192.168.2.2)
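
Clients that expect PEM files rather than a P12 bundle can unpack client.p12 after copying it. The helper below is a minimal sketch: the function name and output layout are hypothetical, and the bundle password is whatever was set when client.p12 was exported.

```shell
#!/usr/bin/env bash
# Sketch: unpack a P12 bundle into PEM files usable with curl-style clients.
# Usage: p12_to_pem <bundle.p12> <password> <output-dir>
p12_to_pem() {
  local p12="$1" pass="$2" out="$3"
  mkdir -p "$out" || return 1
  # Client certificate only (no CA certs, no key).
  openssl pkcs12 -in "$p12" -passin "pass:$pass" -clcerts -nokeys \
    -out "$out/client.crt" || return 1
  # Private key, unencrypted; umask keeps it readable by the owner only.
  ( umask 077
    openssl pkcs12 -in "$p12" -passin "pass:$pass" -nocerts -nodes \
      -out "$out/client.key" )
}

# Copy step, run on the PI machine (paths from this report):
#   scp shahondin1624@192.168.2.2:/mnt/ssdpool/@docker/caddy/certs/client.p12 .
#   scp shahondin1624@192.168.2.2:/mnt/ssdpool/@docker/caddy/certs/root-ca.pem .
# Then check the route with the unpacked files:
#   curl https://ai.shahondin1624.de/models \
#     --cert client.crt --key client.key --cacert root-ca.pem
```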

🟢 12. Model Testing & Selection

  • User wants to test different models to find the sweet spot
  • Currently only has Qwen3.6-35B-A3B-Q8_0 (35B, Q8_0)
  • Planning:
    • 7B models → Q8_0 → ~7GB → fits easily, good for always-warm
    • 13B models → Q8_0 → ~13GB → fits with room
    • 32B models → Q8_0 → ~32GB → fits with room
    • 70B models → Q6_K (~54GB) or Q5_K_M (~46GB) → fits, leaves room for KV cache
    • KV cache quantization (--cache-type-k q8_0 --cache-type-v q8_0) essential for large contexts
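
The sizes in this plan follow from parameter count times bits per weight. The helper below is a rough sketch of that arithmetic: it uses 8.5 bits per weight for Q8_0 and 6.5625 for Q6_K, reports GiB, and ignores metadata and tensors kept at higher precision, so real files differ somewhat. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: estimate GGUF weight size in GiB.
# Usage: gguf_size_gib <params_in_billions> <bits_per_weight>
gguf_size_gib() {
  LC_ALL=C awk -v p="$1" -v bpw="$2" \
    'BEGIN { printf "%.1f\n", p * 1e9 * bpw / 8 / (1024 ^ 3) }'
}

gguf_size_gib 7  8.5      # Q8_0, prints 6.9
gguf_size_gib 13 8.5      # Q8_0, prints 12.9
gguf_size_gib 32 8.5      # Q8_0, prints 31.7
gguf_size_gib 70 6.5625   # Q6_K, prints 53.5
```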

🟢 13. Context Window Planning

  • User wants 256k context for small models, 32k-64k for 70B
  • Reality check:
    • Qwen3.6-35B-A3B Q8_0 @ 262k context → ~27GB VRAM on RTX 5090 (32GB)
    • On Strix Halo (~110GB shared RAM), even 70B @ 262k is tight
    • Recommendation: 256k for 7B models, 128k for 13B, 64k for 32B, 32k for 70B
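
The context recommendations can be sanity-checked with the usual KV cache estimate: 2 (K and V) × layers × KV heads × head dimension × context length × bytes per element. The sketch below uses an illustrative dense 70B-class shape (80 layers, 8 KV heads, head dimension 128), which is an assumption and not the shape of any model in the preset file. Models with hybrid or windowed attention will need less.

```shell
#!/usr/bin/env bash
# Sketch: estimate KV cache size in GiB for a dense transformer with GQA.
# Usage: kv_cache_gib <layers> <kv_heads> <head_dim> <ctx_tokens> <bytes_per_element>
kv_cache_gib() {
  LC_ALL=C awk -v l="$1" -v h="$2" -v d="$3" -v c="$4" -v b="$5" \
    'BEGIN { printf "%.1f\n", 2 * l * h * d * c * b / (1024 ^ 3) }'
}

kv_cache_gib 80 8 128 32768  2        # f16 cache at 32k, prints 10.0
kv_cache_gib 80 8 128 262144 2        # f16 cache at 256k, prints 80.0
kv_cache_gib 80 8 128 262144 1.0625   # q8_0 cache at 256k (34 bytes per 32 values), prints 42.5
```

Under these assumptions a 70B model at Q6_K plus a q8_0 cache at 256k lands close to the ~110GB budget, which is why the shorter contexts are recommended for the largest models.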

🟢 14. Future: Whisper (STT) & Other Services

  • Whisper.cpp for speech-to-text (optional, later)
  • Would run as separate systemd service on mini PC
  • Exposed through Caddy as voice.shahondin1624.de

Current Status

  • Caddy container is running with mTLS enabled on ai.shahondin1624.de
  • LLM pipeline is fully operational: model loaded, responding to chat requests
  • All 5 models configured in the router (1 loaded, 4 on-demand)

Remaining Action Items

| Priority | Item | Status |
|---|---|---|
| 🟡 Medium | PI machine cert installation (192.168.2.35) | Needs manual copy of client.p12 + root-ca.pem |
| 🟡 Medium | VRAM monitor auto-unload logic | Needs idle-time checking via /models/{name}/stats |
| 🟢 Later | Test additional models (7B, 32B, 70B) | Need to download and configure |
| 🟢 Later | Context window tuning | Per-model recommendations above |
| 🟢 Later | Whisper (STT) service | Future enhancement |

Quick Commands Reference

# Caddy container
ssh shahondin1624@192.168.2.2 "docker compose -f /mnt/ssdpool/@docker/caddy/docker-compose.yml up -d"
ssh shahondin1624@192.168.2.2 "docker logs caddy --tail 30"

# AI API test (from Caddy server)
ssh shahondin1624@192.168.2.2 "docker exec caddy curl https://ai.shahondin1624.de/models -k --cert /etc/caddy/certs/caddy.pem --key /etc/caddy/certs/caddy-key.pem --cacert /etc/caddy/certs/root-ca.pem"

# Mini PC (192.168.2.3) - needs password auth
ssh shahondin1624@192.168.2.3 'systemctl --user status llama-server.service'
ssh shahondin1624@192.168.2.3 'curl http://127.0.0.1:8080/models'