f7af660727
Endpoint rewrites:
- GET /v1/models + /running → merged listModels() with running flag
- POST /models/load → GET /upstream/<id>/health (warm load)
- POST /models/unload → POST /api/models/unload/<id> (no body)
- Added POST /api/models/unload for unloadAll()
Config migration:
- Preset path: ~/.llama-models.ini → ~/.config/llama-swap/config.yaml
- Service unit: llama-server.service → llama-swap.service
- setPresetKey() rewritten from INI awk to YAML-aware awk for
editing --ctx-size/--temp/--n-gpu-layers in cmd: blocks
Per-model ctx-size (fixes 0/33k bug):
- parseCtxMapFromYaml(): walks config.yaml, extracts --ctx-size N per
model block → Map<id, ctxSize>
- extractCtxFromRunningCmd(): parses --ctx-size from /running cmd string
- discoverModels(): Promise.all(listModels, listRunning, readPreset),
ctx priority: running cmd → yaml → 32768 fallback
- Removed broken extractCtxSize stub and dangling imports
Tests: 14 passing (parseCtxMapFromYaml ×5, extractCtxFromRunningCmd ×3,
isShardArtefact ×3, isReasoningModel ×3)
README: full rewrite covering llama-swap architecture, YAML config format,
new endpoints, troubleshooting table updated.
386 lines
16 KiB
Markdown
# ai-server: pi extension for a self-hosted llama-swap server behind mTLS

A multi-file pi extension that exposes a remote llama-swap instance as a
provider to pi, with dynamic model discovery and admin slash commands. Chat
streams use client-certificate TLS so the endpoint can be exposed over the
public internet without a bearer token.

---

## 1. Architecture

```
┌────────────┐  mTLS (HTTPS)  ┌──────────────┐  HTTP   ┌──────────────────┐
│ pi client  │───────────────►│ Caddy        │────────►│ llama-swap       │
│ (this ext) │                │ 192.168.2.2  │         │ 192.168.2.3:8080 │
└────────────┘  client cert   │ ai.…         │         │ swap mode        │
                              └──────────────┘         │ globalTTL: 1800  │
                                                       │ scheduler: one   │
                                                       └──────────────────┘
                                                                │
                                                 ~/.config/llama-swap/config.yaml
                                                       (YAML model config)
```

- **Caddy** terminates TLS and enforces `require_and_verify` client-cert auth
  on `ai.shahondin1624.de`. Plaintext HTTP is forwarded to llama-swap.
- **llama-swap** runs in swap mode, managing model lifecycle (load/unload/swap)
  with a YAML config at `~/.config/llama-swap/config.yaml`.
- **This extension** performs OpenAI-compatible chat streaming over mTLS and
  surfaces admin endpoints as pi slash commands.

## 2. Extension layout

```
~/.pi/agent/extensions/ai-server/
├── index.ts      entry: async discovery + registerProvider + commands
├── config.ts     URLs, SSH host, cert paths, MODELS[] fallback
├── messages.ts   Context → OpenAI chat/completions messages
├── stream.ts     custom streamSimple: SSE parse, mTLS HTTPS, pi-ai events
├── admin.ts      router HTTP client + SSH helpers (YAML edit, systemctl)
└── README.md     this file
```

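The "SSE parse" step listed for `stream.ts` can be pictured with a small sketch. It assumes the usual OpenAI-style stream (`data: {json}` lines, terminated by `data: [DONE]`); the class name `SseDeltaParser` and its shape are hypothetical and are not taken from the extension's source.

```typescript
// Hypothetical helper: accumulates network chunks and returns completed text deltas.
interface ChatChunk {
  choices?: Array<{ delta?: { content?: unknown } }>;
}

class SseDeltaParser {
  private buffer = "";
  done = false;

  // Feed one chunk of the response body; returns the text deltas it completed.
  push(chunk: string): string[] {
    if (this.done) return [];
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is an unfinished line (or "") and stays buffered.
    this.buffer = lines.pop() ?? "";

    const deltas: string[] = [];
    for (const rawLine of lines) {
      const line = rawLine.trim(); // also strips a trailing \r
      if (!line.startsWith("data:")) continue; // blank separators, comments, other fields
      const payload = line.slice(5).trim();
      if (payload === "[DONE]") {
        this.done = true;
        break;
      }
      let parsed: unknown;
      try {
        parsed = JSON.parse(payload);
      } catch {
        continue; // skip payloads that are not valid JSON
      }
      const content = (parsed as ChatChunk | null)?.choices?.[0]?.delta?.content;
      if (typeof content === "string" && content.length > 0) deltas.push(content);
    }
    return deltas;
  }
}
```

Buffering the trailing partial line matters because a TLS record boundary can fall in the middle of a JSON payload; only complete lines are parsed.
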
## 3. Environment variables

All are optional; the defaults match the current host.

| Env var | Default | Purpose |
|---|---|---|
| `AI_SERVER_URL` | `https://ai.shahondin1624.de` | Base URL of the Caddy endpoint |
| `AI_SERVER_CERTS_DIR` | `~/.pi/agent/certs` | Dir holding client cert + key + CA |
| `AI_SERVER_CA` | `<certs>/root-ca.pem` | CA file |
| `AI_SERVER_CLIENT_CERT` | `<certs>/client.pem` | Client cert |
| `AI_SERVER_CLIENT_KEY` | `<certs>/client-key.pem` | Client private key |
| `AI_SERVER_TIMEOUT_MS` | `300000` | Per-request stream timeout |
| `AI_SERVER_SSH_HOST` | `ai-server@192.168.2.3` | SSH target for admin commands |
| `AI_SERVER_PRESET_PATH` | `~/.config/llama-swap/config.yaml` | YAML config on the SSH target |
| `AI_SERVER_SERVICE_UNIT` | `llama-swap.service` | systemd unit name |
| `AI_SERVER_MODELS_PATH` | `/v1/models` | Models list endpoint |
| `AI_SERVER_RUNNING_PATH` | `/running` | Currently running models endpoint |
| `AI_SERVER_UNLOAD_PATH` | `/api/models/unload/<id>` | Unload single model |
| `AI_SERVER_UNLOAD_ALL_PATH` | `/api/models/unload` | Unload all models |
| `AI_SERVER_UPSTREAM_HEALTH_PATH` | `/upstream/<id>/health` | Warm-load / health endpoint |

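As an illustration of how the defaults in this table could be resolved, here is a minimal sketch. The interface `AiServerConfig`, its field names, and the function `resolveConfig` are hypothetical and may not match `config.ts`; only the variable names and default values come from the table.

```typescript
// Hypothetical config shape; field names are illustrative only.
interface AiServerConfig {
  url: string;
  certsDir: string;
  ca: string;
  clientCert: string;
  clientKey: string;
  timeoutMs: number;
}

type Env = Record<string, string | undefined>;

// Resolve settings from an env-like object, falling back to the documented defaults.
// Note: "~" is kept verbatim here; expanding it to the home directory is left to the caller.
function resolveConfig(env: Env): AiServerConfig {
  const certsDir = env["AI_SERVER_CERTS_DIR"] ?? "~/.pi/agent/certs";
  const rawTimeout = Number(env["AI_SERVER_TIMEOUT_MS"]);
  return {
    url: env["AI_SERVER_URL"] ?? "https://ai.shahondin1624.de",
    certsDir,
    // The three cert paths default to files inside the certs directory.
    ca: env["AI_SERVER_CA"] ?? `${certsDir}/root-ca.pem`,
    clientCert: env["AI_SERVER_CLIENT_CERT"] ?? `${certsDir}/client.pem`,
    clientKey: env["AI_SERVER_CLIENT_KEY"] ?? `${certsDir}/client-key.pem`,
    // Use the default when the variable is unset or not a positive number.
    timeoutMs: Number.isFinite(rawTimeout) && rawTimeout > 0 ? rawTimeout : 300000,
  };
}
```

Passing the environment in as a plain object (for example `resolveConfig(process.env)` under Node) keeps the function easy to test. Overriding only `AI_SERVER_CERTS_DIR` moves all three certificate paths at once.
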
## 4. Server-side setup (192.168.2.3)

### 4.1 llama-swap install

```bash
npm install -g llama-swap
# or use the binary release from the llama-swap GitHub repo
```

### 4.2 Model storage

```
~/models/<model-name>.gguf
```

### 4.3 Config file: `~/.config/llama-swap/config.yaml`

llama-swap uses a YAML config file. Each model is defined under `models:` with
a `cmd:` block containing the llama-server invocation.

```yaml
globalTTL: 1800
models:
  Qwen_Qwen3.6-35B-A3B-Q8_0:
    cmd: |
      /home/ai-server/llama.cpp/build/bin/llama-server
      --model /home/ai-server/models/Qwen_Qwen3.6-35B-A3B-Q8_0.gguf
      --ctx-size 262144
      --temp 0.7
      --cache-type-k q8_0
      --cache-type-v q8_0
      --n-gpu-layers 99

  MiniMax-M2.7-IQ3_XXS:
    cmd: |
      /home/ai-server/llama.cpp/build/bin/llama-server
      --model /home/ai-server/models/MiniMax-M2.7-UD-IQ3_XXS.gguf
      --ctx-size 131072
      --temp 1.0
      --cache-type-k q8_0
      --cache-type-v q8_0
      --n-gpu-layers 99
```

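A per-model context size can be recovered from this file by scanning each `cmd:` block for `--ctx-size`. The sketch below is a simplified line-based walker, not a YAML parser and not the extension's actual code. It assumes the layout shown above (model ids indented two spaces under `models:`, unquoted), and the names `extractCtxFromCmd`, `parseCtxMap`, and `resolveCtx` are hypothetical. The priority order (running command, then YAML, then a 32768 fallback) follows the extension's change notes.

```typescript
const FALLBACK_CTX = 32768;

// Pull the numeric value of --ctx-size out of a command string, if present.
function extractCtxFromCmd(cmd: string): number | undefined {
  const match = /--ctx-size[=\s]+(\d+)/.exec(cmd);
  if (match === null) return undefined;
  const digits = match[1];
  return digits === undefined ? undefined : Number(digits);
}

// Walk the config text and map each model id to the --ctx-size in its block.
function parseCtxMap(yamlText: string): Map<string, number> {
  const ctxById = new Map<string, number>();
  let inModels = false;
  let currentId: string | undefined;

  for (const line of yamlText.split(/\r?\n/)) {
    if (/^\s*(#|$)/.test(line)) continue; // blank lines and comments
    if (/^\S/.test(line)) {
      // Any top-level key either opens or closes the models: section.
      inModels = /^models:\s*$/.test(line);
      currentId = undefined;
      continue;
    }
    if (!inModels) continue;
    // Assumption: model ids sit at exactly two spaces of indentation.
    const id = /^ {2}([^\s:#][^:]*):\s*$/.exec(line)?.[1];
    if (id !== undefined) {
      currentId = id;
      continue;
    }
    const ctx = extractCtxFromCmd(line);
    if (currentId !== undefined && ctx !== undefined) ctxById.set(currentId, ctx);
  }
  return ctxById;
}

// Prefer the live command line, then the YAML value, then the fallback.
function resolveCtx(id: string, runningCmd: string | undefined, yamlCtx: Map<string, number>): number {
  const fromRunning = runningCmd === undefined ? undefined : extractCtxFromCmd(runningCmd);
  return fromRunning ?? yamlCtx.get(id) ?? FALLBACK_CTX;
}
```

For the config above, `parseCtxMap` would map `Qwen_Qwen3.6-35B-A3B-Q8_0` to 262144 and `MiniMax-M2.7-IQ3_XXS` to 131072. Quoted ids or unusual indentation would need a real YAML parser.
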
### 4.4 Systemd user service: `~/.config/systemd/user/llama-swap.service`

```ini
[Unit]
Description=LLaMA-swap AI Server (Swap Mode)
After=network.target
Wants=network.target

[Service]
Type=simple
User=ai-server
Group=ai-server
WorkingDirectory=/home/ai-server
ExecStart=/home/ai-server/node_modules/.bin/llama-swap \
    --host 0.0.0.0 \
    --port 8080 \
    --config /home/ai-server/.config/llama-swap/config.yaml

LimitNOFILE=65536
# systemd has no LimitMEMLOCK_BYTES directive; to cap locked memory at
# 100 GiB instead of lifting the limit, set LimitMEMLOCK=107374182400.
LimitMEMLOCK=unlimited

Restart=on-failure
RestartSec=5
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=default.target
```

Enable and start:

```bash
systemctl --user daemon-reload && systemctl --user enable --now llama-swap.service
loginctl enable-linger $(whoami)  # keep user services running across logouts
```

### 4.5 Router HTTP API (reference)

| Method | Path | Body | Notes |
|---|---|---|---|
| `GET` | `/v1/models` | none | List models; `{"data":[{id,object,created,owned_by}]}` |
| `GET` | `/running` | none | Currently loaded models; `{"running":[{id,...}]}` |
| `POST` | `/api/models/unload` | none | Unload all models; returns `{"msg":"ok"}` |
| `POST` | `/api/models/unload/<id>` | none | Unload specific model; plain text `OK` |
| `GET` | `/upstream/<id>/health` | none | Warm-load model (forces spawn without inference) |
| `GET` | `/health` | none | Plain text `OK` (not JSON) |
| `POST` | `/v1/chat/completions` | OpenAI Chat Completions payload | What pi and the web UI use |

> **Note:** Response bodies are mixed JSON and plain text. The extension's
> `routerRequest()` falls back to `{raw: buf}` for non-JSON responses, so
> unload calls won't crash; they return `{raw: "OK"}`.

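The JSON-or-plain-text fallback described in the note, and the way the `/v1/models` and `/running` responses could be combined into one list, can be sketched as follows. This is an illustration under the response shapes given in the table; `parseRouterBody` and `mergeRunning` are hypothetical names and do not reproduce `routerRequest()` itself.

```typescript
// Parse a response body as JSON, or wrap it as { raw } when it is plain text.
function parseRouterBody(buf: string): unknown {
  try {
    return JSON.parse(buf);
  } catch {
    return { raw: buf };
  }
}

interface ModelEntry {
  id: string;
}

interface ModelStatus {
  id: string;
  running: boolean;
}

// Combine the models list with the running list into one list carrying a running flag.
// Assumption: both responses identify models by an `id` field, as in the table above.
function mergeRunning(
  models: { data: ModelEntry[] },
  running: { running: ModelEntry[] },
): ModelStatus[] {
  const live = new Set(running.running.map((entry) => entry.id));
  return models.data.map((entry) => ({ id: entry.id, running: live.has(entry.id) }));
}
```

With this shape, a plain-text `OK` from an unload call becomes `{ raw: "OK" }`, while `{"msg":"ok"}` is returned as a parsed object.
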
## 5. Caddy + mTLS setup (192.168.2.2)

Caddy config lives at `/mnt/ssdpool/@docker/caddy/` (Caddyfile, docker-compose,
certs). The domain `ai.shahondin1624.de` is configured with strict mTLS:

```caddy
ai.shahondin1624.de {
    tls /etc/caddy/certs/caddy.pem /etc/caddy/certs/caddy-key.pem {
        client_auth {
            mode require_and_verify
            trusted_ca_cert_file /etc/caddy/certs/root-ca.pem
        }
    }
    reverse_proxy 192.168.2.3:8080
}
```

The volume mount in docker-compose must expose `./certs` into the container at
`/etc/caddy/certs:ro`, because Caddy cannot read cert files that aren't inside
its filesystem namespace.

### 5.1 Certificate generation

Run on the Caddy host (192.168.2.2):

```bash
cd /mnt/ssdpool/@docker/caddy/certs
openssl genrsa -out root-ca.key 4096
openssl req -new -x509 -days 3650 -key root-ca.key -out root-ca.pem -subj "/CN=ShahODin Root CA/O=ShahODin/C=DE"
openssl genrsa -out client.key 4096
openssl req -new -key client.key -out client.csr -subj "/CN=ShahODin Client/O=ShahODin/C=DE"
openssl x509 -req -in client.csr -CA root-ca.pem -CAkey root-ca.key -CAcreateserial -out client.crt -days 3650
```

Bundle into a PKCS#12 for browser import. **Use `-legacy`** so NSS-based stores
(Firefox, Chromium on Linux, Brave Flatpak) can read it: OpenSSL 3 defaults to
PBES2/AES-256, which older parsers reject.

```bash
openssl pkcs12 -legacy -export -out client-legacy.p12 -inkey client.key -in client.crt -certfile root-ca.pem -passout pass:
```

Files needed on each client: `client.crt` (as `client.pem`), `client.key` (as
`client-key.pem`), `root-ca.pem`. For CLI usage copy them to `~/.pi/agent/certs/`
on the client machine; the extension reads them from there.

## 6. Client-side: installing the extension

```bash
# 1) Copy certs to the canonical client location
mkdir -p ~/.pi/agent/certs
scp user@caddy-host:/mnt/ssdpool/@docker/caddy/certs/client.crt ~/.pi/agent/certs/client.pem
scp user@caddy-host:/mnt/ssdpool/@docker/caddy/certs/client.key ~/.pi/agent/certs/client-key.pem
scp user@caddy-host:/mnt/ssdpool/@docker/caddy/certs/root-ca.pem ~/.pi/agent/certs/

# 2) Copy the extension directory
scp -r user@source:~/.pi/agent/extensions/ai-server ~/.pi/agent/extensions/

# 3) Optionally configure SSH key auth to the AI server (for admin commands)
ssh-copy-id ai-server@192.168.2.3
```

Run `/reload` in pi: the extension loads, discovers models from the router,
registers the `ai-server` provider, and installs the admin slash commands.

## 7. Slash commands

| Command | Purpose | Transport |
|---|---|---|
| `/ai-server-status` | Tabular view of models, status, ctx size | HTTPS mTLS |
| `/ai-server-refresh` | Re-discover models and re-register the provider | HTTPS mTLS |
| `/ai-server-load <id>` | Warm-load a model via `/upstream/<id>/health` | HTTPS mTLS |
| `/ai-server-unload <id>` | Unload a model via `/api/models/unload/<id>` | HTTPS mTLS |
| `/ai-server-ctx <id> <size>` | Edit YAML config ctx-size, reload the model | SSH + HTTPS |
| `/ai-server-preset` | Print the server's llama-swap config (YAML) | SSH |
| `/ai-server-restart` | `systemctl --user restart llama-swap.service` | SSH |

`<id>` arguments tab-complete against the live router model list.

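The load and unload commands both substitute a model id into a path template containing `<id>`. A minimal sketch of that substitution is shown here; `expandPath` and `commandUrl` are hypothetical helper names, and percent-encoding the id is an assumption about what the router accepts (ids made of letters, digits, `.`, `_`, and `-` pass through unchanged).

```typescript
// Replace the <id> placeholder in a path template with the (escaped) model id.
function expandPath(template: string, id: string): string {
  // A function replacer avoids any special handling of "$" in the replacement.
  return template.replace("<id>", () => encodeURIComponent(id));
}

// Join the base URL and an expanded path without doubling the slash.
function commandUrl(baseUrl: string, template: string, id: string): string {
  return baseUrl.replace(/\/+$/, "") + expandPath(template, id);
}
```

For example, `commandUrl("https://ai.shahondin1624.de", "/upstream/<id>/health", "Qwen_Qwen3.6-35B-A3B-Q8_0")` yields the warm-load URL used by `/ai-server-load`.
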
## 8. Adding a new model

```bash
# On the AI server
ssh ai-server@192.168.2.3
cd ~/models && hf download <author>/<repo> --include '*<quant>*' --local-dir .

# Add a config block to ~/.config/llama-swap/config.yaml (see example in §4.3)
```

Then from pi:

```
/ai-server-refresh        # discovers the new model
/ai-server-load <id>      # first load may take a minute for a cold GGUF
```

No extension-side config changes are needed; discovery picks it up.

## 9. Browser access to the built-in web UI

Navigate to `https://ai.shahondin1624.de/` in any browser that has the client
cert and trusts the root CA.

### 9.1 Firefox (simplest path, always works)

Firefox uses its own NSS trust exclusively. Import `client-legacy.p12` under
*Preferences → Privacy & Security → Certificates → Your Certificates*, and
`root-ca.pem` under *Authorities* with "trust to identify websites" checked.

### 9.2 Chromium / Brave

Chromium on Linux now uses the bundled **Chrome Root Store** (CRS) for server
cert validation. Neither `/etc/pki/ca-trust/source/anchors/` (system trust) nor
the user's `~/.pki/nssdb` alone is consulted for server cert chain verification
in recent Brave/Chrome builds. Two workarounds:

1. **`brave://certificate-manager/` → Custom** (Chromium ≥137). Import
   `root-ca.pem` here and flag it as trusted for websites. This is the modern
   replacement for the removed `ChromeRootStoreEnabled` policy.
2. **Fallback: Firefox.** If the Custom tab isn't available or the feature is
   still buggy in a given build, use Firefox for the web UI. The mTLS client
   cert import path is straightforward there.

Client-cert auth (the mTLS handshake itself) still works via NSS even when
server cert validation goes through CRS, so installing the client `.p12` into
NSS is enough for the handshake. Only the padlock/trust UI is affected by the
CRS issue.

### 9.3 Brave Flatpak specifics

The Brave Flatpak has its own isolated NSS database at
`~/.var/app/com.brave.Browser/.pki/nssdb/`. Import directly into it:

```bash
pk12util -d sql:$HOME/.var/app/com.brave.Browser/.pki/nssdb -i ~/client-legacy.p12 -W ''
certutil -d sql:$HOME/.var/app/com.brave.Browser/.pki/nssdb -A -t "CT,C,C" -n "ShahODin Root CA" -i ~/root-ca.pem
```

To stop the "select a certificate" prompt on each page load, write a Brave
enterprise policy:

```bash
sudo flatpak override com.brave.Browser --filesystem=/etc/brave:ro
sudo install -D -m 644 /path/to/policy.json /etc/brave/policies/managed/shahondin1624.json
flatpak kill com.brave.Browser
```

Where `policy.json` contains:

```json
{
  "AutoSelectCertificateForUrls": [
    "{\"pattern\":\"https://ai.shahondin1624.de\",\"filter\":{\"ISSUER\":{\"CN\":\"ShahODin Root CA\"}}}"
  ]
}
```

Verify under `brave://policy`. The policy must show status **OK**, not
**Error** (an Error usually means the key has been renamed or removed upstream).

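Each entry of `AutoSelectCertificateForUrls` is itself a JSON document encoded as a string, which is why the policy file contains escaped quotes. Hand-escaping is error-prone, so the file can be generated instead. The following is a minimal sketch; the function names `autoSelectEntry` and `buildPolicy` are hypothetical.

```typescript
// Build one policy entry: a JSON object serialized to a string.
function autoSelectEntry(pattern: string, issuerCn: string): string {
  return JSON.stringify({ pattern, filter: { ISSUER: { CN: issuerCn } } });
}

// Build the full policy file content; the entry is escaped a second time
// when the outer document is serialized.
function buildPolicy(pattern: string, issuerCn: string): string {
  const policy = { AutoSelectCertificateForUrls: [autoSelectEntry(pattern, issuerCn)] };
  return JSON.stringify(policy, null, 2);
}
```

Calling `buildPolicy("https://ai.shahondin1624.de", "ShahODin Root CA")` produces the same structure as the `policy.json` shown above; writing the returned string to a file is left to the caller.
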
## 10. Troubleshooting

| Symptom | Likely cause | Fix |
|---|---|---|
| pi: `HTTP 400: request exceeds available context size` | Model config has a small `--ctx-size` | Increase `--ctx-size` in the YAML config |
| pi: `HTTP 400: File Not Found` on load | Wrong model id (check `/v1/models`) | Use the exact id from the models list |
| Model shows as `[unloaded]` in `/ai-server-status` | Model isn't currently loaded in llama-swap | Run `/ai-server-load <id>` to warm it |
| First request is slow | Cold model load with no preload configured | Add `hooks.on_startup.preload: [<id>]` to config |
| `certutil: unable to open …root-ca.pem` | CA file not yet scp'd locally | Copy `root-ca.pem` from the Caddy host |
| Brave: p12 import "Invalid or corrupt file" | OpenSSL 3 default PBES2/AES-256 encryption | Regenerate with `openssl pkcs12 -legacy -export …` |
| Brave: site loads but padlock is red | Chrome Root Store issue | Use `brave://certificate-manager/` → Custom |
| Cert selection prompt appears on every page load | `AutoSelectCertificateForUrls` policy missing or malformed | See §9.3 |
| `update-ca-trust` (system trust) has no effect on Brave | Brave is a Flatpak; sandbox doesn't see host `/etc/pki/ca-trust` | Import directly into the sandbox's NSS DB (§9.3) |
| Chat first-token latency seems long | Cold model load | First chat turn may wait 10–60s while the GGUF is mmap'd in |
| `/ai-server-restart` fails | Wrong service unit name | Check `AI_SERVER_SERVICE_UNIT` / create the proper unit |
| `/ai-server-ctx` fails | YAML format changed | Edit `~/.config/llama-swap/config.yaml` manually first |

## 11. Security notes

- The client private key (`client.key` / `client-key.pem` / `client-legacy.p12`)
  is the sole credential for API access. Treat it like an SSH key: do not
  share, do not commit, do not email.
- To revoke a client, regenerate the root CA's cert list and remove/rename the
  offending client cert file on Caddy. (Proper CRL/OCSP is not set up; this
  is a single-user deployment.)
- The `apiKey: "ai-server-mtls"` string in `index.ts` is a placeholder required
  by the pi model registry; no bearer token is sent over the wire. All auth is
  cert-based.
- Every admin slash command with a mutating side-effect (`ctx`, `restart`) is
  gated behind a `ctx.ui.confirm` dialog.

## 12. Paths reference

### On the AI server (192.168.2.3)

| Path | Purpose |
|---|---|
| `~/llama.cpp/` | llama.cpp source + build tree |
| `~/llama.cpp/build/bin/llama-server` | Binary (invoked by llama-swap) |
| `~/models/*.gguf` | Model weights |
| `~/.config/llama-swap/config.yaml` | llama-swap YAML config |
| `~/.config/systemd/user/llama-swap.service` | Service unit |
| `~/vram-monitor.sh` | Optional idle-unload cron helper |

### On the Caddy host (192.168.2.2)

| Path | Purpose |
|---|---|
| `/mnt/ssdpool/@docker/caddy/Caddyfile` | Caddy config |
| `/mnt/ssdpool/@docker/caddy/docker-compose.yml` | Caddy container definition |
| `/mnt/ssdpool/@docker/caddy/certs/root-ca.pem` | Root CA (public) |
| `/mnt/ssdpool/@docker/caddy/certs/root-ca.key` | Root CA private key (keep offline-ish) |
| `/mnt/ssdpool/@docker/caddy/certs/caddy.pem` + `caddy-key.pem` | Server cert for `ai.shahondin1624.de` |
| `/mnt/ssdpool/@docker/caddy/certs/client.crt` + `client.key` | Client cert/key |
| `/mnt/ssdpool/@docker/caddy/certs/client-legacy.p12` | Browser-import bundle (legacy-encoded) |

### On each pi client

| Path | Purpose |
|---|---|
| `~/.pi/agent/certs/client.pem` | Client cert |
| `~/.pi/agent/certs/client-key.pem` | Client private key |
| `~/.pi/agent/certs/root-ca.pem` | Root CA |
| `~/.pi/agent/extensions/ai-server/` | This extension |