Add setup + status markdown docs

- setup-guide.md: client-side install + cert/preset recipes (partly superseded by ai-server/README.md which goes deeper on mTLS gotchas).
- report.md: point-in-time status report of the ai-server infrastructure setup (llama.cpp build, Caddy route, mTLS cert chain, remaining action items at time of writing).

Kept out of main to separate operational history from the runtime extension code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 21:16:20 +02:00
parent e321f90fe9
commit 198c9537c9
3 changed files with 396 additions and 0 deletions
@@ -10,3 +10,5 @@
 !/README.md
 !/ai-server/
 !/local-llama.ts
+!/report.md
+!/setup-guide.md
@@ -0,0 +1,165 @@
+# AI Server Setup — Status Report
+
+## Goal
+Configure the mini PC (192.168.2.3, Fedora 43) as an AI inference server. Only the user connects to it (via PI agent). The server exposes LLM endpoints through the Caddy server (192.168.2.2) with mTLS authentication.
+
+---
+
+## What We've Done
+
+### ✅ 1. llama.cpp Installed & Rebuilt
+- **Original state:** llama.cpp source at `~/llama.cpp`, binary at `~/llama.cpp/build/bin/llama-server`
+- **Issue:** The binary had a symbol mismatch — it looked for `ggml_backend_init` in backend `.so` files, but those exported `ggml_backend_vk_init` / `ggml_backend_cpu_init`. This caused `load_backend: failed to find ggml_backend_init` errors.
+- **Fix:** Pulled latest master and rebuilt from source:
+  ```bash
+  cd ~/llama.cpp && git pull origin master
+  cmake -B build -DGGML_VULKAN=ON
+  cmake --build build --config Release -j$(nproc)
+  ```
+- **Result:** Binary now works (version 3, commit 12568ca). Vulkan backend loads successfully.
+
+### ✅ 2. One Model Downloaded
+- `~/models/Qwen_Qwen3.6-35B-A3B-Q8_0.gguf` (35GB, Q8_0 quant)
+- Qwen3.6-35B-A3B is an MoE model (35B total params, 3B active per token)
+- Native context: 262k tokens
+
+### 📋 Additional Models in Preset File (Not Yet Downloaded)
+The llama-server router preset (`~/.llama-models.ini`) has these placeholders:
+- `small-7b` — 7B model (Q8_0, ~7GB)
+- `medium-32b` — 32B model (Q8_0, ~32GB)
+- `large-70b-q5` — 70B model (Q5_K_M, ~46GB)
+- `large-70b-q6` — 70B model (Q6_K, ~54GB)
+
+These need actual `.gguf` files placed in `~/models/` to become active.
+
+### ✅ 3. llama.cpp Router Mode Working
+- Tested manually — `llama-server --models-dir ~/models --models-max 6 --models-autoload` starts a router server
+- Auto-discovers models from the directory
+- On-demand loading via `POST /models/load`
+- On-demand unloading via `POST /models/unload`
+- OpenAI-compatible API at `http://127.0.0.1:8080/v1/chat/completions`
+- **Test request succeeded** — model loaded, responded to "Say hello in one word"
+- Any `load_backend: failed to find ggml_backend_init` messages still printed at startup are harmless warnings (not fatal)
+
+### ✅ 4. Systemd Service Created
+- `~/.config/systemd/user/llama-server.service` — user-level service
+- Starts llama-server with router mode, Vulkan GPU offload
+- Preset file: `~/.llama-models.ini`
+
+### ✅ 5. Preset File Created
+- `~/.llama-models.ini` — per-model settings for Qwen3.6-35B-A3B and placeholders for future models
+- Includes ctx-size, temperature, cache-type, gpu-layers per model
+
+### ✅ 6. Caddy Route Added
+- Added `ai.shahondin1624.de` route to Caddyfile
+- Route proxies to `192.168.2.3:8080` (the mini PC)
+- mTLS configuration with `client_auth { mode require_and_verify }`
+
+### ✅ 7. mTLS Certificates Generated
+- **Root CA:** `/mnt/ssdpool/@docker/caddy/certs/root-ca.pem`
+- **Caddy server cert:** `caddy.pem` + `caddy-key.pem` (signed by root CA)
+- **Client cert:** `client.crt` + `client.key` (signed by root CA)
+- **Client P12 bundle:** `client.p12` (ready for import to PI machines)
+- **CA file for clients:** `root-ca.pem` (clients need to trust this)
+
+---
+
+## What Still Needs To Be Done
+
+### ✅ 8. Caddy Container Running with mTLS
+**Fixed:** The Caddy container was failing because of two issues:
+1. The `certs` directory wasn't mounted into the container
+2. The key file was named `caddy.key` instead of `caddy-key.pem` (as expected by the Caddyfile)
+
+**Fix applied:**
+- Updated `docker-compose.yml` to add `./certs:/etc/caddy/certs:ro` volume mount
+- Renamed `caddy.key` → `caddy-key.pem` to match Caddyfile expectations
+- Recreated the container with `docker compose up -d`
+
+**Verification:**
+- Caddy is running and serving all routes
+- mTLS is active on `ai.shahondin1624.de` (strict SNI enforcement confirmed)
+- Chat completion test successful — Qwen3.6-35B-A3B responded correctly
+
+**Status:** The AI inference pipeline is fully operational:
+```bash
+curl https://ai.shahondin1624.de/v1/chat/completions -k \
+  --cert client.pem --key client-key.pem --cacert root-ca.pem \
+  -H 'Content-Type: application/json' \
+  -d '{"model":"Qwen_Qwen3.6-35B-A3B-Q8_0","messages":[{"role":"user","content":"Say hello"}],"temperature":0.7}'
+# Response: {"choices":[{"message":{"content":"Hello",...}}]}
+```
+
+### ✅ 9. systemd Service Running
+- Service file created at `~/.config/systemd/user/llama-server.service`
+- Preset file at `~/.llama-models.ini`
+- Model loaded and responding through Caddy
+- VRAM monitor script at `~/vram-monitor.sh` (basic version, still works)
+- **Confirmed:** Service is active (model responds to requests via router API)
+
+### 🟡 10. VRAM Monitor Script Needs Improvement
+- Basic version created at `~/vram-monitor.sh`
+- Currently only logs loaded/unloaded status
+- **Needs:** Actual idle-time checking and auto-unload logic (requires `/models/{name}/stats` endpoint or similar)
+- **Status:** The llama.cpp router exposes `/models/{alias}/stats` — can query `{"loaded":true/false,"vram_used":0,"cpu_used":0.0}` for auto-unload decisions
+
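The auto-unload decision itself can be sketched independently of how the monitor gathers its data. The following is a minimal sketch, not the actual `vram-monitor.sh` logic: the `ModelActivity` shape, the `selectIdleModels` name, the keep-warm list, and the idea that the monitor records a last-request timestamp per model are all assumptions made for illustration.

```typescript
// Hypothetical shape: one entry per model the monitor is tracking.
interface ModelActivity {
  id: string;            // preset section name / model ID
  loaded: boolean;       // whether the router currently has the model loaded
  lastRequestMs: number; // epoch millis of the last request the monitor observed
}

// Return the IDs of loaded models that have been idle for at least `idleLimitMs`.
// Models listed in `keepWarm` are never selected.
function selectIdleModels(
  models: ModelActivity[],
  nowMs: number,
  idleLimitMs: number,
  keepWarm: string[] = [],
): string[] {
  return models
    .filter((m) => m.loaded)
    .filter((m) => !keepWarm.includes(m.id))
    .filter((m) => nowMs - m.lastRequestMs >= idleLimitMs)
    .map((m) => m.id);
}

// Example: 30-minute idle limit, with the 7B preset kept always warm.
const exampleNow = 10_000_000;
const idleIds = selectIdleModels(
  [
    { id: "small-7b", loaded: true, lastRequestMs: 0 },
    { id: "medium-32b", loaded: true, lastRequestMs: exampleNow - 31 * 60_000 },
    { id: "large-70b-q5", loaded: true, lastRequestMs: exampleNow - 5 * 60_000 },
    { id: "large-70b-q6", loaded: false, lastRequestMs: 0 },
  ],
  exampleNow,
  30 * 60_000,
  ["small-7b"],
);
console.log(idleIds); // only "medium-32b" is loaded, not kept warm, and past the limit
```

Each returned ID would then be passed to `POST /models/unload`. Keeping the selection a pure function makes the idle threshold easy to test without a running router.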
+### 🟡 11. PI Machine Certificate Installation
+- Client cert bundle (`client.p12`) needs to be copied to the PI machine (192.168.2.35)
+- Root CA (`root-ca.pem`) needs to be trusted on the PI machine
+- PI agent config needs to reference the cert files
+- **Certs location:** `/mnt/ssdpool/@docker/caddy/certs/client.p12` and `root-ca.pem` on Caddy server (192.168.2.2)
+
+### 🟢 12. Model Testing & Selection
+- User wants to test different models to find the sweet spot
+- Currently only has Qwen3.6-35B-A3B-Q8_0 (35B, Q8_0)
+- **Planning:**
+  - 7B models → Q8_0 → ~7GB → fits easily, good for always-warm
+  - 13B models → Q8_0 → ~13GB → fits with room
+  - 32B models → Q8_0 → ~32GB → fits with room
+  - 70B models → Q6_K (~54GB) or Q5_K_M (~46GB) → fits, leaves room for KV cache
+  - KV cache quantization (`--cache-type-k q8_0 --cache-type-v q8_0`) essential for large contexts
+
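The size figures in this planning list follow from parameter count times bits per weight. Below is a rough estimator under stated assumptions: the function and constant names are made up for this sketch, Q8_0 and Q6_K use the bit widths implied by their block layouts, and the Q5_K_M figure is an approximate average because that quant mixes tensor types. Results are in GiB and cover weights only, not KV cache or compute buffers.

```typescript
// Approximate bits per weight for a few llama.cpp quant types.
// Q5_K_M is a mixed quant, so 5.7 is an empirical average that varies per model.
const BITS_PER_WEIGHT = {
  Q8_0: 8.5,
  Q6_K: 6.5625,
  Q5_K_M: 5.7,
} as const;

type Quant = keyof typeof BITS_PER_WEIGHT;

const BYTES_PER_GIB = 1024 * 1024 * 1024;

// Estimated size of the model weights in GiB.
function estimateWeightsGiB(paramCount: number, quant: Quant): number {
  const bytes = (paramCount * BITS_PER_WEIGHT[quant]) / 8;
  return bytes / BYTES_PER_GIB;
}

// Print estimates for the model classes discussed above.
const candidates: [string, number, Quant][] = [
  ["7B", 7e9, "Q8_0"],
  ["13B", 13e9, "Q8_0"],
  ["32B", 32e9, "Q8_0"],
  ["70B", 70e9, "Q5_K_M"],
  ["70B", 70e9, "Q6_K"],
];
for (const [label, params, quant] of candidates) {
  console.log(`${label} ${quant}: ~${estimateWeightsGiB(params, quant).toFixed(1)} GiB`);
}
```

With these factors a 7B model at Q8_0 comes out near 7 GiB, and the 70B quants land close to the 46 and 54 figures quoted in the list.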
+### 🟢 13. Context Window Planning
+- User wants 256k context for small models, 32k-64k for 70B
+- Reality check:
+  - Qwen3.6-35B-A3B Q8_0 @ 262k context → ~27GB VRAM on RTX 5090 (32GB)
+  - On Strix Halo (~110GB shared RAM), even 70B @ 262k is tight
+  - **Recommendation:** 256k for 7B models, 128k for 13B, 64k for 32B, 32k for 70B
+
+### 🟢 14. Future: Whisper (STT) & Other Services
+- Whisper.cpp for speech-to-text (optional, later)
+- Would run as separate systemd service on mini PC
+- Exposed through Caddy as `voice.shahondin1624.de`
+
+---
+
+## Current Status
+
+✅ **Caddy container is running** with mTLS enabled on `ai.shahondin1624.de`
+✅ **LLM pipeline is fully operational** — model loaded, responding to chat requests
+✅ **All 5 model presets configured** in the router (1 downloaded and loaded, 4 placeholders awaiting `.gguf` files)
+
+## Remaining Action Items
+
+| Priority | Item | Status |
+|----------|------|--------|
+| 🟡 Medium | PI machine cert installation (192.168.2.35) | Needs manual copy of client.p12 + root-ca.pem |
+| 🟡 Medium | VRAM monitor auto-unload logic | Needs idle-time checking via `/models/{name}/stats` |
+| 🟢 Later | Test additional models (7B, 32B, 70B) | Need to download and configure |
+| 🟢 Later | Context window tuning | Per-model recommendations above |
+| 🟢 Later | Whisper (STT) service | Future enhancement |
+
+## Quick Commands Reference
+
+```bash
+# Caddy container
+ssh shahondin1624@192.168.2.2 "docker compose -f /mnt/ssdpool/@docker/caddy/docker-compose.yml up -d"
+ssh shahondin1624@192.168.2.2 "docker logs caddy --tail 30"
+
+# AI API test (from Caddy server)
+ssh shahondin1624@192.168.2.2 "docker exec caddy curl https://ai.shahondin1624.de/models -k --cert /etc/caddy/certs/caddy.pem --key /etc/caddy/certs/caddy-key.pem --cacert /etc/caddy/certs/root-ca.pem"
+
+# Mini PC (192.168.2.3) - needs password auth
+ssh shahondin1624@192.168.2.3 'systemctl --user status llama-server.service'
+ssh shahondin1624@192.168.2.3 'curl http://127.0.0.1:8080/models'
+```
@@ -0,0 +1,229 @@
+# AI Server Setup Guide
+
+This guide covers connecting clients to the AI inference server via mTLS, using the pi extension, and configuring new models on the server.
+
+---
+
+## 1. Server Infrastructure
+
+| Component | Host | Purpose |
+|-----------|------|---------|
+| **Caddy** (reverse proxy + mTLS) | `192.168.2.2` | HTTPS termination, mTLS authentication |
+| **Mini PC** (AI inference) | `192.168.2.3` | llama.cpp router, model serving |
+
+---
+
+## 2. Generate Client Certificates
+
+On the **Caddy server** (`192.168.2.2`):
+
+```bash
+cd /mnt/ssdpool/@docker/caddy/certs
+
+# Generate Root CA (if not already done)
+openssl genrsa -out root-ca.key 4096
+openssl req -new -x509 -days 3650 -key root-ca.key -out root-ca.pem -subj "/CN=ShahODin Root CA/O=ShahODin/C=DE"
+
+# Generate client key + CSR
+openssl genrsa -out client.key 4096
+openssl req -new -key client.key -out client.csr -subj "/CN=ShahODin Client/O=ShahODin/C=DE"
+
+# Sign client cert with root CA
+openssl x509 -req -in client.csr -CA root-ca.pem -CAkey root-ca.key -CAcreateserial -out client.crt -days 3650
+
+# Create P12 bundle (for easy import)
+openssl pkcs12 -export -out client.p12 -inkey client.key -in client.crt -certfile root-ca.pem -passout pass:
+```
+
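+Copy the cert files to your client machine. The rest of this guide refers to the client cert and key as `client.pem` and `client-key.pem`, so rename `client.crt` and `client.key` accordingly after copying: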
+
+```bash
+scp client.crt client.key root-ca.pem user@client:~/.pi/agent/certs/
+```
+
+---
+
+## 3. Connect a New Client via mTLS
+
+### 3.1 Copy Certificates
+
+On the **client machine**, place the files in:
+
+```bash
+mkdir -p ~/.pi/agent/certs
+# Copy client.pem, client-key.pem, root-ca.pem from the Caddy server
+```
+
+### 3.2 Create the ai-server Extension
+
+Copy the `ai-server/` directory to `~/.pi/agent/extensions/ai-server/`:
+
+```
+~/.pi/agent/extensions/ai-server/
+├── index.ts      # entry point (pi discovers this via the `<name>/index.ts` convention)
+├── config.ts     # URLs, cert paths, model presets
+├── messages.ts   # Context → OpenAI messages conversion
+└── stream.ts     # mTLS HTTPS + SSE parsing, emits pi-ai event stream
+```
+
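To illustrate the SSE parsing that `stream.ts` is described as doing, here is a simplified sketch. It is not the extension's real code: `collectStreamText` and `ChatChunk` are hypothetical names, and the sketch assumes OpenAI-style streaming (`data: {...}` lines ending with a `data: [DONE]` sentinel) with the whole body already in memory.

```typescript
// Minimal shape of one streamed chunk; only the field used below is modelled.
interface ChatChunk {
  choices: { delta: { content?: string } }[];
}

// Parse a complete SSE body and concatenate the streamed text.
// A real streaming client must also buffer partial lines between network reads.
function collectStreamText(sseBody: string): { text: string; done: boolean } {
  let text = "";
  let done = false;
  for (const rawLine of sseBody.split("\n")) {
    const line = rawLine.trim();
    if (!line.startsWith("data:")) continue; // skip blank lines and other SSE fields
    const payload = line.slice("data:".length).trim();
    if (payload === "[DONE]") {
      done = true; // end-of-stream sentinel
      break;
    }
    const chunk = JSON.parse(payload) as ChatChunk;
    text += chunk.choices[0]?.delta.content ?? "";
  }
  return { text, done };
}

// Example body: a role-only chunk, two content chunks, then the sentinel.
const sampleBody = [
  'data: {"choices":[{"delta":{"role":"assistant"}}]}',
  "",
  'data: {"choices":[{"delta":{"content":"Hel"}}]}',
  "",
  'data: {"choices":[{"delta":{"content":"lo"}}]}',
  "",
  "data: [DONE]",
  "",
].join("\n");
console.log(collectStreamText(sampleBody)); // { text: "Hello", done: true }
```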
+The extension auto-discovers certs from `~/.pi/agent/certs/`. No config changes needed if certs are in the default location.
+
+Override via env vars if you want: `AI_SERVER_URL`, `AI_SERVER_CERTS_DIR`, `AI_SERVER_CA`, `AI_SERVER_CLIENT_CERT`, `AI_SERVER_CLIENT_KEY`, `AI_SERVER_TIMEOUT_MS`.
+
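The override order can be pictured with the sketch below. It is an illustration, not the contents of the real `config.ts`: the `resolveConfig` function and `AiServerConfig` type are hypothetical, the default file names are taken from the Quick Reference table in this guide, and the 120-second fallback timeout is a placeholder assumption.

```typescript
interface AiServerConfig {
  url: string;
  ca: string;
  clientCert: string;
  clientKey: string;
  timeoutMs: number;
}

type Env = Record<string, string | undefined>;

// Resolve settings from environment variables, falling back to the default cert location.
function resolveConfig(env: Env, homeDir: string): AiServerConfig {
  const certsDir = env["AI_SERVER_CERTS_DIR"] ?? `${homeDir}/.pi/agent/certs`;
  const timeout = Number(env["AI_SERVER_TIMEOUT_MS"]);
  return {
    url: env["AI_SERVER_URL"] ?? "https://ai.shahondin1624.de",
    // Per-file variables win over the directory-level default.
    ca: env["AI_SERVER_CA"] ?? `${certsDir}/root-ca.pem`,
    clientCert: env["AI_SERVER_CLIENT_CERT"] ?? `${certsDir}/client.pem`,
    clientKey: env["AI_SERVER_CLIENT_KEY"] ?? `${certsDir}/client-key.pem`,
    // 120 s is a placeholder default for this sketch, not a documented value.
    timeoutMs: Number.isFinite(timeout) && timeout > 0 ? timeout : 120_000,
  };
}

// Example: only the certs directory is overridden.
const exampleConfig = resolveConfig({ AI_SERVER_CERTS_DIR: "/etc/pi/certs" }, "/home/user");
console.log(exampleConfig.clientCert); // "/etc/pi/certs/client.pem"
```

In Node, a call such as `resolveConfig(process.env, os.homedir())` would apply the same rules to the real environment.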
+### 3.3 Restart pi
+
+```bash
+# Close pi and restart — the extension loads automatically
+pi
+```
+
+The model `Qwen3.6-35B-A3B (AI Server, mTLS)` will appear alongside your local models.
+
+### 3.4 Select the Remote Model
+
+In pi, type **`/model`** (or press **Ctrl+L**) and choose **Qwen3.6-35B-A3B (AI Server, mTLS)**.
+
+---
+
+## 4. Configure New Models on the Server
+
+### 4.1 Add a Model Preset
+
+SSH into the mini PC (`192.168.2.3`) and edit the preset file:
+
+```bash
+ssh shahondin1624@192.168.2.3
+nano ~/.llama-models.ini
+```
+
+Add a new section — the section name is the model ID used in API requests:
+
+```ini
+[your-model-id]
+model = /home/ai-server/models/your-model.gguf
+ctx-size = 32768
+temp = 0.7
+cache-type-k = q8_0
+cache-type-v = q8_0
+n-gpu-layers = 99
+```
+
+### 4.2 Update the Extension (if needed)
+
+To expose a new model to pi, add an entry to the `MODELS` array in `~/.pi/agent/extensions/ai-server/config.ts`. The `id` must match the preset section name in `~/.llama-models.ini`:
+
+```typescript
+export const MODELS: ServerModel[] = [
+  {
+    id: "Qwen_Qwen3.6-35B-A3B-Q8_0",
+    name: "Qwen3.6-35B-A3B (AI Server, mTLS)",
+    reasoning: true,
+    contextWindow: 262_144,
+    maxTokens: 16_384,
+  },
+  {
+    id: "your-model-id",
+    name: "Your Model",
+    reasoning: false,
+    contextWindow: 65_536,
+    maxTokens: 8_192,
+  },
+];
+```
+
+Run `/reload` in pi (or restart) to pick up the new list.
+
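Because a mismatch between a `MODELS` entry and the preset section name surfaces only as a "model not found" error at request time, a small consistency check can help. This is a hedged sketch: `presetSections` and `missingPresets` are hypothetical helpers, and the parser only recognises `[section]` header lines, which is all the check needs.

```typescript
// Collect section names such as [your-model-id] from INI text.
function presetSections(iniText: string): string[] {
  const sections: string[] = [];
  for (const line of iniText.split("\n")) {
    const name = /^\s*\[([^\]]+)\]\s*$/.exec(line)?.[1];
    if (name) sections.push(name.trim());
  }
  return sections;
}

// Return the model IDs that have no matching preset section.
function missingPresets(modelIds: string[], iniText: string): string[] {
  const known = new Set(presetSections(iniText));
  return modelIds.filter((id) => !known.has(id));
}

// Example preset text with one section; the second ID has no preset yet.
const exampleIni = [
  "[Qwen_Qwen3.6-35B-A3B-Q8_0]",
  "ctx-size = 262144",
  "cache-type-k = q8_0",
].join("\n");
console.log(missingPresets(["Qwen_Qwen3.6-35B-A3B-Q8_0", "your-model-id"], exampleIni));
// lists "your-model-id" as missing
```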
+### 4.3 Reload the Model
+
+After adding a model to the preset, reload it via the router API:
+
+```bash
+# Unload any existing instance
+curl -s http://127.0.0.1:8080/models/unload \
+  -H "Content-Type: application/json" \
+  -d '{"id":"your-model-id"}'
+
+# Load with new preset settings
+sleep 2
+curl -s http://127.0.0.1:8080/models/load \
+  -H "Content-Type: application/json" \
+  -d '{"id":"your-model-id"}'
+```
+
+### 4.4 Change Context Size
+
+```bash
+# Update preset
+sed -i 's/^ctx-size = 32768$/ctx-size = 65536/' ~/.llama-models.ini
+
+# Reload model
+curl -s http://127.0.0.1:8080/models/unload -H "Content-Type: application/json" -d '{"id":"your-model-id"}'
+sleep 2
+curl -s http://127.0.0.1:8080/models/load -H "Content-Type: application/json" -d '{"id":"your-model-id"}'
+```
+
+### 4.5 Verify
+
+```bash
+curl -s http://127.0.0.1:8080/models | jq '.data[] | select(.id == "your-model-id")'
+```
+
+---
+
+## 5. Testing the Connection
+
+### 5.1 From Terminal (curl)
+
+```bash
+curl -s https://ai.shahondin1624.de/v1/chat/completions \
+  --cert ~/.pi/agent/certs/client.pem \
+  --key ~/.pi/agent/certs/client-key.pem \
+  --cacert ~/.pi/agent/certs/root-ca.pem \
+  -H "Content-Type: application/json" \
+  -d '{"model":"Qwen_Qwen3.6-35B-A3B-Q8_0","messages":[{"role":"user","content":"Say hi"}]}'
+```
+
+### 5.2 From pi
+
+Use `/model` to switch to the ai-server model and chat normally.
+
+---
+
+## 6. Quick Reference
+
+| File | Purpose |
+|------|---------|
+| `~/.pi/agent/certs/client.pem` | Client certificate |
+| `~/.pi/agent/certs/client-key.pem` | Client private key |
+| `~/.pi/agent/certs/root-ca.pem` | Root CA (trust anchor) |
+| `~/.pi/agent/extensions/ai-server/` | PI extension directory (mTLS provider; entry = `index.ts`) |
+| `~/.pi/agent/extensions/report.md` | Setup status report |
+| `~/.llama-models.ini` | Server-side model presets |
+| `/mnt/ssdpool/@docker/caddy/certs/` | Caddy server cert store |
+| `/mnt/ssdpool/@docker/caddy/docker-compose.yml` | Caddy container config |
+
+---
+
+## 7. Troubleshooting
+
+| Issue | Fix |
+|-------|-----|
+| `certificate verify failed` | Ensure `root-ca.pem` is trusted; check `--cacert` path |
+| `client certificate required` | Verify `client.pem` and `client-key.pem` are in `~/.pi/agent/certs/` |
+| Extension not loading | Check `~/.pi/agent/extensions/ai-server/index.ts` syntax; restart pi |
+| Model not found | Ensure model ID matches the `.ini` section name exactly |
+| Slow model loading | Use `POST /models/load` to pre-warm; monitor with VRAM stats |
+
+---
+
+## 8. Recommended Context Sizes
+
+| Model | Weights size | Recommended ctx-size |
+|-------|--------------|----------------------|
+| 7B Q8_0 | ~7 GB | 131072 (128k) |
+| 13B Q8_0 | ~13 GB | 65536 (64k) |
+| 32B Q8_0 | ~32 GB | 65536 (64k) |
+| 70B Q5_K_M | ~46 GB | 32768 (32k) |
+| 70B Q6_K | ~54 GB | 32768 (32k) |
+
+KV cache quantization (`cache-type-k = q8_0`, `cache-type-v = q8_0`) is essential for large contexts.
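
To see why cache quantization matters, the KV cache size can be estimated from the attention shape and the context length. The sketch below assumes standard full attention with grouped-query heads. The `kvCacheGiB` helper is hypothetical, and the example shape (80 layers, 8 KV heads, head dimension 128) is a Llama-style 70B layout used only for illustration, so check the metadata of the model you actually load.

```typescript
interface AttentionShape {
  layers: number;  // transformer layers
  kvHeads: number; // key/value heads (fewer than query heads under GQA)
  headDim: number; // dimension per head
}

// Bytes per cached element: f16 uses 2 bytes; q8_0 stores 32 values in 34 bytes.
const KV_BYTES_PER_ELEMENT = { f16: 2, q8_0: 34 / 32 } as const;

type CacheType = keyof typeof KV_BYTES_PER_ELEMENT;

// Estimated KV cache size in GiB for one sequence at the given context length.
function kvCacheGiB(shape: AttentionShape, ctxSize: number, cacheType: CacheType): number {
  // The factor 2 covers keys plus values.
  const elements = 2 * shape.layers * shape.kvHeads * shape.headDim * ctxSize;
  return (elements * KV_BYTES_PER_ELEMENT[cacheType]) / (1024 * 1024 * 1024);
}

// Assumed 70B-class shape at the recommended 32k context.
const shape70b: AttentionShape = { layers: 80, kvHeads: 8, headDim: 128 };
console.log(kvCacheGiB(shape70b, 32768, "f16"));  // 10
console.log(kvCacheGiB(shape70b, 32768, "q8_0")); // 5.3125
```

Under these assumptions, `q8_0` roughly halves the cache, which is the headroom the recommended context sizes rely on. Models with sliding-window or hybrid attention will deviate from this estimate.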