pi-extensions/ai-server/README.md
shahondin1624 f7af660727 migrate ai-server extension from llama.cpp router to llama-swap
Endpoint rewrites:
  - GET /v1/models + /running → merged listModels() with running flag
  - POST /models/load → GET /upstream/<id>/health (warm load)
  - POST /models/unload → POST /api/models/unload/<id> (no body)
  - Added POST /api/models/unload for unloadAll()

Config migration:
  - Preset path: ~/.llama-models.ini → ~/.config/llama-swap/config.yaml
  - Service unit: llama-server.service → llama-swap.service
  - setPresetKey() rewritten from INI awk to YAML-aware awk for
    editing --ctx-size/--temp/--n-gpu-layers in cmd: blocks

Per-model ctx-size (fixes 0/33k bug):
  - parseCtxMapFromYaml(): walks config.yaml, extracts --ctx-size N per
    model block → Map<id, ctxSize>
  - extractCtxFromRunningCmd(): parses --ctx-size from /running cmd string
  - discoverModels(): Promise.all(listModels, listRunning, readPreset),
    ctx priority: running cmd → yaml → 32768 fallback
  - Removed broken extractCtxSize stub and dangling imports

Tests: 14 passing (parseCtxMapFromYaml ×5, extractCtxFromRunningCmd ×3,
isShardArtefact ×3, isReasoningModel ×3)

README: full rewrite covering llama-swap architecture, YAML config format,
new endpoints, troubleshooting table updated.
2026-05-27 10:42:19 +02:00


ai-server — PI extension for a self-hosted llama-swap server behind mTLS

A multi-file pi extension that exposes a remote llama-swap instance as a provider to pi, with dynamic model discovery and admin slash commands. Chat streams use client-certificate TLS so the endpoint can be exposed over the public internet without a bearer token.


1. Architecture

┌────────────┐  mTLS (HTTPS)  ┌──────────────┐  HTTP   ┌──────────────────┐
│ pi client  │───────────────►│ Caddy        │────────►│ llama-swap       │
│ (this ext) │                │ 192.168.2.2  │         │ 192.168.2.3:8080 │
└────────────┘  client cert   │ ai.…         │         │ swap mode        │
                              └──────────────┘         │ globalTTL: 1800  │
                                                       │ scheduler: one   │
                                                       └──────────────────┘
                                                                │
                                                ~/.config/llama-swap/config.yaml
                                                       (YAML model config)
  • Caddy terminates TLS and enforces require_and_verify client-cert auth on ai.shahondin1624.de. Plaintext HTTP is forwarded to llama-swap.
  • llama-swap runs in swap mode, managing model lifecycle (load/unload/swap) with a YAML config at ~/.config/llama-swap/config.yaml.
  • This extension performs OpenAI-compatible chat streaming over mTLS and surfaces admin endpoints as pi slash commands.

2. Extension layout

~/.pi/agent/extensions/ai-server/
├── index.ts       entry: async discovery + registerProvider + commands
├── config.ts      URLs, SSH host, cert paths, MODELS[] fallback
├── messages.ts    Context → OpenAI chat/completions messages
├── stream.ts      custom streamSimple: SSE parse, mTLS HTTPS, pi-ai events
├── admin.ts       router HTTP client + SSH helpers (YAML edit, systemctl)
└── README.md      this file
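
The layout above lists stream.ts as the place where SSE parsing happens. As an illustration only, an incremental parser for OpenAI-style `data:` events could look like the sketch below. The function name `parseSse` and the `SseResult` shape are hypothetical and not taken from the extension.

```typescript
// Hypothetical sketch: incremental SSE parsing for OpenAI-style chat streams.
// Names and types are illustrative, not the extension's actual code.

interface SseResult {
  deltas: string[]; // content fragments found in complete events
  done: boolean;    // true once "data: [DONE]" was seen
  rest: string;     // incomplete trailing data to prepend to the next chunk
}

function parseSse(buffer: string): SseResult {
  const deltas: string[] = [];
  let done = false;
  // Events are separated by a blank line; the last piece may be incomplete.
  const parts = buffer.split(/\r?\n\r?\n/);
  const rest = parts.pop() ?? "";
  for (const part of parts) {
    for (const line of part.split(/\r?\n/)) {
      if (!line.startsWith("data:")) continue; // ignore comments and other fields
      const payload = line.slice(5).trim();
      if (payload === "[DONE]") {
        done = true;
        continue;
      }
      try {
        const json = JSON.parse(payload);
        const text = json?.choices?.[0]?.delta?.content;
        if (typeof text === "string") deltas.push(text);
      } catch {
        // Malformed JSON in a complete event: skip it rather than abort the stream.
      }
    }
  }
  return { deltas, done, rest };
}
```

The caller would keep `rest` and prepend it to the next network chunk, so an event split across two TCP reads is still parsed once it is complete.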

3. Environment variables

All are optional — the defaults match the current host.

| Env var | Default | Purpose |
|---|---|---|
| `AI_SERVER_URL` | `https://ai.shahondin1624.de` | Base URL of the Caddy endpoint |
| `AI_SERVER_CERTS_DIR` | `~/.pi/agent/certs` | Dir holding client cert + key + CA |
| `AI_SERVER_CA` | `<certs>/root-ca.pem` | CA file |
| `AI_SERVER_CLIENT_CERT` | `<certs>/client.pem` | Client cert |
| `AI_SERVER_CLIENT_KEY` | `<certs>/client-key.pem` | Client private key |
| `AI_SERVER_TIMEOUT_MS` | `300000` | Per-request stream timeout |
| `AI_SERVER_SSH_HOST` | `ai-server@192.168.2.3` | SSH target for admin commands |
| `AI_SERVER_PRESET_PATH` | `~/.config/llama-swap/config.yaml` | YAML config on the SSH target |
| `AI_SERVER_SERVICE_UNIT` | `llama-swap.service` | systemd unit name |
| `AI_SERVER_MODELS_PATH` | `/v1/models` | Models list endpoint |
| `AI_SERVER_RUNNING_PATH` | `/running` | Currently running models endpoint |
| `AI_SERVER_UNLOAD_PATH` | `/api/models/unload/<id>` | Unload single model |
| `AI_SERVER_UNLOAD_ALL_PATH` | `/api/models/unload` | Unload all models |
| `AI_SERVER_UPSTREAM_HEALTH_PATH` | `/upstream/<id>/health` | Warm-load / health endpoint |
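
To make the defaulting rules concrete, here is a minimal sketch of how a subset of these variables could be resolved. The `AiServerConfig` shape, `resolveConfig`, and `fillPath` are hypothetical names, and treating `<id>` as a literal placeholder to substitute is an assumption rather than something config.ts is known to do.

```typescript
// Hypothetical sketch of env-var resolution using the defaults from the table.
// The AiServerConfig shape is illustrative, not the extension's config.ts.

interface AiServerConfig {
  url: string;
  certsDir: string;
  ca: string;
  clientCert: string;
  clientKey: string;
  timeoutMs: number;
  unloadPath: string;
  upstreamHealthPath: string;
}

type Env = Record<string, string | undefined>;

function resolveConfig(env: Env, home: string): AiServerConfig {
  const certsDir = env.AI_SERVER_CERTS_DIR ?? `${home}/.pi/agent/certs`;
  const timeout = Number(env.AI_SERVER_TIMEOUT_MS);
  return {
    url: env.AI_SERVER_URL ?? "https://ai.shahondin1624.de",
    certsDir,
    // Cert paths default to files inside certsDir unless overridden individually.
    ca: env.AI_SERVER_CA ?? `${certsDir}/root-ca.pem`,
    clientCert: env.AI_SERVER_CLIENT_CERT ?? `${certsDir}/client.pem`,
    clientKey: env.AI_SERVER_CLIENT_KEY ?? `${certsDir}/client-key.pem`,
    // Use the default when the variable is unset or not a positive number.
    timeoutMs: Number.isFinite(timeout) && timeout > 0 ? timeout : 300000,
    unloadPath: env.AI_SERVER_UNLOAD_PATH ?? "/api/models/unload/<id>",
    upstreamHealthPath: env.AI_SERVER_UPSTREAM_HEALTH_PATH ?? "/upstream/<id>/health",
  };
}

// Substitute the <id> placeholder (assumed literal), URL-encoding the model id.
function fillPath(template: string, id: string): string {
  return template.replace("<id>", encodeURIComponent(id));
}
```

In a real process this would be called as `resolveConfig(process.env, os.homedir())`; passing both in as arguments keeps the function easy to test.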

4. Server-side setup (192.168.2.3)

4.1 llama-swap install

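npm install -g llama-swap
# or use the binary release from the llama-swap GitHub repo
# Either way, the unit in §4.4 expects the binary at /home/ai-server/node_modules/.bin/llama-swap;
# adjust ExecStart there if your install puts it elsewhere.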

4.2 Model storage

~/models/<model-name>.gguf

4.3 Config file — ~/.config/llama-swap/config.yaml

llama-swap uses a YAML config file. Each model is defined under models: with a cmd: block containing the llama-server invocation.

globalTTL: 1800
models:
  Qwen_Qwen3.6-35B-A3B-Q8_0:
    # ${PORT} is a llama-swap macro: each upstream gets its own port instead of
    # llama-server's default 8080, which llama-swap itself listens on.
    cmd: |
      /home/ai-server/llama.cpp/build/bin/llama-server
      --port ${PORT}
      --model /home/ai-server/models/Qwen_Qwen3.6-35B-A3B-Q8_0.gguf
      --ctx-size 262144
      --temp 0.7
      --cache-type-k q8_0
      --cache-type-v q8_0
      --n-gpu-layers 99

  MiniMax-M2.7-IQ3_XXS:
    cmd: |
      /home/ai-server/llama.cpp/build/bin/llama-server
      --port ${PORT}
      --model /home/ai-server/models/MiniMax-M2.7-UD-IQ3_XXS.gguf
      --ctx-size 131072
      --temp 1.0
      --cache-type-k q8_0
      --cache-type-v q8_0
      --n-gpu-layers 99
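
One way to read the per-model `--ctx-size` values out of a file laid out like this is a small line-based scan. The sketch below is an assumption-laden illustration: `parseCtxMap` is a hypothetical name, and it relies on model ids sitting at two-space indentation under `models:` exactly as in the example. A real YAML parser would be more robust.

```typescript
// Hypothetical sketch: line-based extraction of --ctx-size per model.
// Assumes the layout shown above (model ids indented two spaces under "models:").

function parseCtxMap(yamlText: string): Map<string, number> {
  const ctx = new Map<string, number>();
  let inModels = false;
  let current: string | null = null;
  for (const line of yamlText.split(/\r?\n/)) {
    if (/^\s*(#.*)?$/.test(line)) continue; // skip blank lines and comments
    if (/^\S/.test(line)) {
      // Any top-level key either starts or ends the models section.
      inModels = /^models:\s*$/.test(line);
      current = null;
      continue;
    }
    if (!inModels) continue;
    // A model id is a key at exactly two spaces of indentation, optionally quoted.
    const id = line.match(/^ {2}("?)([^\s"][^"]*?)\1:\s*$/);
    if (id) {
      current = id[2];
      continue;
    }
    const size = line.match(/--ctx-size[ =](\d+)/);
    if (size && current !== null && !ctx.has(current)) {
      ctx.set(current, Number(size[1])); // first occurrence per model wins
    }
  }
  return ctx;
}
```

Models without a `--ctx-size` line are simply absent from the map, so the caller decides which fallback to apply.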

4.4 Systemd user service — ~/.config/systemd/user/llama-swap.service

[Unit]
Description=LLaMA-swap AI Server (Swap Mode)
After=network.target
Wants=network.target

[Service]
Type=simple
# No User=/Group= here: a systemd user unit already runs as the owning user (ai-server) and cannot switch accounts.
WorkingDirectory=/home/ai-server
ExecStart=/home/ai-server/node_modules/.bin/llama-swap \
    --listen 0.0.0.0:8080 \
    --config /home/ai-server/.config/llama-swap/config.yaml

LimitNOFILE=65536
LimitMEMLOCK=unlimited

Restart=on-failure
RestartSec=5
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=default.target

Enable and start:

systemctl --user daemon-reload && systemctl --user enable --now llama-swap.service
loginctl enable-linger $(whoami)   # keep user services running across logouts

4.5 llama-swap HTTP API (reference)

| Method | Path | Body | Notes |
|---|---|---|---|
| GET | `/v1/models` | (none) | List models; `{"data":[{id,object,created,owned_by}]}` |
| GET | `/running` | (none) | Currently loaded models; `{"running":[{id,...}]}` |
| POST | `/api/models/unload` | (none) | Unload all models; returns `{"msg":"ok"}` |
| POST | `/api/models/unload/<id>` | (none) | Unload specific model; plain text `OK` |
| GET | `/upstream/<id>/health` | (none) | Warm-load model (forces spawn without inference) |
| GET | `/health` | (none) | Plain text `OK` (not JSON) |
| POST | `/v1/chat/completions` | OpenAI Chat Completions payload | What pi and the web UI use |

Note: Response bodies are mixed JSON and plain text. The extension's routerRequest() falls back to {raw: buf} for non-JSON responses, so unload calls won't crash — they'll return {raw: "OK"}.
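
The fallback described in the note can be sketched in a few lines. This is not the extension's `routerRequest()` itself; `parseRouterBody` and `unloadRequest` are hypothetical helpers that only illustrate the JSON-or-raw behaviour and the two unload paths from the table.

```typescript
// Hypothetical sketch of the JSON-or-raw fallback and the unload endpoints.

// Parse a response body as JSON; wrap anything else as { raw: <text> }.
function parseRouterBody(buf: string): unknown {
  try {
    return JSON.parse(buf);
  } catch {
    return { raw: buf };
  }
}

// Build the request for unloading one model, or all models when id is omitted.
function unloadRequest(id?: string): { method: "POST"; path: string } {
  const base = "/api/models/unload";
  return {
    method: "POST",
    path: id === undefined ? base : `${base}/${encodeURIComponent(id)}`,
  };
}
```

With this shape, a plain-text `OK` from a single-model unload becomes `{ raw: "OK" }`, while the unload-all response keeps its `msg` field.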

5. Caddy + mTLS setup (192.168.2.2)

Caddy config lives at /mnt/ssdpool/@docker/caddy/ (Caddyfile, docker-compose, certs). The domain ai.shahondin1624.de is configured with strict mTLS:

ai.shahondin1624.de {
    tls /etc/caddy/certs/caddy.pem /etc/caddy/certs/caddy-key.pem {
        client_auth {
            mode require_and_verify
            trusted_ca_cert_file /etc/caddy/certs/root-ca.pem
        }
    }
    reverse_proxy 192.168.2.3:8080
}

The volume mount in docker-compose must expose ./certs into the container at /etc/caddy/certs:ro — Caddy cannot read cert files that aren't inside its filesystem namespace.

5.1 Certificate generation

Run on the Caddy host (192.168.2.2):

cd /mnt/ssdpool/@docker/caddy/certs
openssl genrsa -out root-ca.key 4096
openssl req -new -x509 -days 3650 -key root-ca.key -out root-ca.pem -subj "/CN=ShahODin Root CA/O=ShahODin/C=DE"
openssl genrsa -out client.key 4096
openssl req -new -key client.key -out client.csr -subj "/CN=ShahODin Client/O=ShahODin/C=DE"
openssl x509 -req -in client.csr -CA root-ca.pem -CAkey root-ca.key -CAcreateserial -out client.crt -days 3650

Bundle into a PKCS#12 for browser import. Use -legacy so NSS-based stores (Firefox, Chromium on Linux, Brave Flatpak) can read it — OpenSSL 3 defaults to PBES2/AES-256 which older parsers reject:

openssl pkcs12 -legacy -export -out client-legacy.p12 -inkey client.key -in client.crt -certfile root-ca.pem -passout pass:

Files needed on each client: client.crt (as client.pem), client.key (as client-key.pem), root-ca.pem. For CLI usage copy them to ~/.pi/agent/certs/ on the client machine; the extension reads them from there.
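
For orientation, presenting these three files from Node.js could look like the following sketch. It assumes the default file names from §3; `certPaths` and `createMtlsAgent` are hypothetical names and not the extension's own code.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";
import * as https from "node:https";

// Hypothetical sketch: build an https.Agent that presents the client cert.

// Resolve the three PEM files inside the certs directory (default names from §3).
function certPaths(certsDir: string): { ca: string; cert: string; key: string } {
  return {
    ca: path.join(certsDir, "root-ca.pem"),
    cert: path.join(certsDir, "client.pem"),
    key: path.join(certsDir, "client-key.pem"),
  };
}

function createMtlsAgent(certsDir: string): https.Agent {
  const p = certPaths(certsDir);
  return new https.Agent({
    ca: fs.readFileSync(p.ca),     // replaces the default trust store with the private root CA
    cert: fs.readFileSync(p.cert), // client certificate presented to Caddy
    key: fs.readFileSync(p.key),   // matching private key
    keepAlive: true,
  });
}
```

The agent would then be passed as the `agent` option of `https.request`, so every request to the Caddy endpoint carries the client certificate.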

6. Client-side — installing the extension

# 1) Copy certs to the canonical client location
mkdir -p ~/.pi/agent/certs
scp user@caddy-host:/mnt/ssdpool/@docker/caddy/certs/client.crt ~/.pi/agent/certs/client.pem
scp user@caddy-host:/mnt/ssdpool/@docker/caddy/certs/client.key ~/.pi/agent/certs/client-key.pem
scp user@caddy-host:/mnt/ssdpool/@docker/caddy/certs/root-ca.pem ~/.pi/agent/certs/

# 2) Copy the extension directory
scp -r user@source:~/.pi/agent/extensions/ai-server ~/.pi/agent/extensions/

# 3) Optionally configure SSH key auth to the AI server (for admin commands)
ssh-copy-id ai-server@192.168.2.3

Run /reload in pi — the extension loads, discovers models from the router, registers the ai-server provider, and installs the admin slash commands.
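
The discovery step can be pictured as a merge of the `/v1/models` and `/running` responses. The sketch below is hypothetical: field names beyond `id` (notably `cmd` on running entries) are assumptions, and the context-size priority (running command line, then YAML config, then a 32768 fallback) follows the commit description rather than inspected code.

```typescript
// Hypothetical sketch of discovery: merge /v1/models with /running and pick a
// context size. Field names beyond "id" (notably "cmd") are assumptions.

interface ModelEntry { id: string }
interface RunningEntry { id: string; cmd?: string }
interface DiscoveredModel { id: string; running: boolean; ctxSize: number }

const FALLBACK_CTX = 32768;

// Extract --ctx-size N (or --ctx-size=N) from a command line, if present.
function ctxFromCmd(cmd: string | undefined): number | undefined {
  const m = cmd?.match(/--ctx-size[ =](\d+)/);
  return m ? Number(m[1]) : undefined;
}

function mergeModels(
  models: ModelEntry[],
  running: RunningEntry[],
  yamlCtx: Map<string, number>,
): DiscoveredModel[] {
  const runningById = new Map(running.map((r) => [r.id, r] as const));
  return models.map(({ id }) => {
    const live = runningById.get(id);
    return {
      id,
      running: live !== undefined,
      // Priority: running command line, then YAML config, then fallback.
      ctxSize: ctxFromCmd(live?.cmd) ?? yamlCtx.get(id) ?? FALLBACK_CTX,
    };
  });
}
```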

7. Slash commands

| Command | Purpose | Transport |
|---|---|---|
| `/ai-server-status` | Tabular view of models, status, ctx size | HTTPS mTLS |
| `/ai-server-refresh` | Re-discover models and re-register the provider | HTTPS mTLS |
| `/ai-server-load <id>` | Warm-load a model via `/upstream/<id>/health` | HTTPS mTLS |
| `/ai-server-unload <id>` | Unload a model via `/api/models/unload/<id>` | HTTPS mTLS |
| `/ai-server-ctx <id> <size>` | Edit YAML config ctx-size, reload the model | SSH + HTTPS |
| `/ai-server-preset` | Print the server's llama-swap config (YAML) | SSH |
| `/ai-server-restart` | `systemctl --user restart llama-swap.service` | SSH |

<id> arguments tab-complete against the live router model list.
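
The edit behind `/ai-server-ctx` amounts to rewriting one `--ctx-size` value inside one model block. The commit history describes the extension doing this with awk over SSH; the TypeScript below is a different, purely illustrative take on the same edit. `setCtxSize` is a hypothetical name and the code assumes the file layout from §4.3.

```typescript
// Hypothetical sketch: rewrite --ctx-size for one model in config.yaml text.
// Not the extension's awk-based implementation; assumes the §4.3 layout.

function setCtxSize(yamlText: string, modelId: string, size: number): string {
  if (!Number.isInteger(size) || size <= 0) {
    throw new Error(`invalid ctx size: ${size}`);
  }
  let current: string | null = null;
  let changed = false;
  const out = yamlText.split("\n").map((line) => {
    // A model id is a key at exactly two spaces of indentation.
    const id = line.match(/^ {2}("?)([^\s"][^"]*?)\1:\s*$/);
    if (id) {
      current = id[2];
      return line;
    }
    if (/^\S/.test(line)) {
      current = null; // a top-level key ends the current model block
      return line;
    }
    if (current === modelId && /--ctx-size[ =]\d+/.test(line)) {
      changed = true;
      // A replacer function avoids "$1" + digits being read as another group.
      return line.replace(/(--ctx-size[ =])\d+/, (_m, prefix: string) => prefix + size);
    }
    return line;
  });
  if (!changed) throw new Error(`no --ctx-size found for model ${modelId}`);
  return out.join("\n");
}
```

Throwing when nothing was changed makes a layout mismatch visible instead of silently writing back an unmodified file.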

8. Adding a new model

# On the AI server
ssh ai-server@192.168.2.3
cd ~/models && hf download <author>/<repo> --include '*<quant>*' --local-dir .

# Add a config block to ~/.config/llama-swap/config.yaml (see example in §4.3)

Then from pi:

/ai-server-refresh      # discovers the new model
/ai-server-load <id>    # first load may take a minute for a cold GGUF

No extension-side config changes are needed — discovery picks it up.

9. Browser access to the built-in web UI

Navigate to https://ai.shahondin1624.de/ in any browser that has the client cert and trusts the root CA.

9.1 Firefox (simplest path, always works)

Firefox uses its own NSS trust exclusively. Import client-legacy.p12 under Preferences → Privacy & Security → Certificates → Your Certificates, and root-ca.pem under Authorities with "trust to identify websites" checked.

9.2 Chromium / Brave

Chromium on Linux now uses the bundled Chrome Root Store for server cert validation. Neither /etc/pki/ca-trust/source/anchors/ (system trust) nor the user's ~/.pki/nssdb alone is consulted for server cert chain verification in recent Brave/Chrome builds. Two workarounds:

  1. brave://certificate-manager/ → Custom (Chromium ≥137) — import root-ca.pem here and flag it as trusted for websites. This is the modern replacement for the removed ChromeRootStoreEnabled policy.
  2. Fallback: Firefox — if the Custom tab isn't available or the feature is still buggy in a given build, use Firefox for the web UI. The mTLS client cert import path is straightforward there.

Client-cert auth (the mTLS handshake itself) still works via NSS even when server cert validation goes through the Chrome Root Store (CRS), so installing the client .p12 into NSS is enough for the handshake. Only the padlock/trust UI is affected by the CRS issue.

9.3 Brave Flatpak specifics

The Brave Flatpak has its own isolated NSS database at ~/.var/app/com.brave.Browser/.pki/nssdb/. Import directly into it:

pk12util -d sql:$HOME/.var/app/com.brave.Browser/.pki/nssdb -i ~/client-legacy.p12 -W ''
certutil -d sql:$HOME/.var/app/com.brave.Browser/.pki/nssdb -A -t "CT,C,C" -n "ShahODin Root CA" -i ~/root-ca.pem

To stop the "select a certificate" prompt on each page load, write a Brave enterprise policy:

sudo flatpak override com.brave.Browser --filesystem=/etc/brave:ro
sudo install -D -m 644 /path/to/policy.json /etc/brave/policies/managed/shahondin1624.json
flatpak kill com.brave.Browser

Where policy.json contains:

{
  "AutoSelectCertificateForUrls": [
    "{\"pattern\":\"https://ai.shahondin1624.de\",\"filter\":{\"ISSUER\":{\"CN\":\"ShahODin Root CA\"}}}"
  ]
}

Verify under brave://policy. The policy must show status OK, not Error (an Error usually means the key has been renamed or removed upstream).
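
Because each entry in AutoSelectCertificateForUrls is a JSON document embedded in a JSON string, hand-written escaping is easy to get wrong. A small generator avoids that; the sketch below uses a hypothetical function name, `buildAutoSelectPolicy`, and reproduces the policy shown above.

```typescript
// Hypothetical sketch: generate policy.json so the nested JSON string is
// escaped by JSON.stringify instead of by hand.

function buildAutoSelectPolicy(pattern: string, issuerCn: string): string {
  const rule = { pattern, filter: { ISSUER: { CN: issuerCn } } };
  // Each list entry is itself a JSON document encoded as a string.
  const policy = { AutoSelectCertificateForUrls: [JSON.stringify(rule)] };
  return JSON.stringify(policy, null, 2);
}

console.log(buildAutoSelectPolicy("https://ai.shahondin1624.de", "ShahODin Root CA"));
```

Redirecting the output to a file gives a policy.json that can be installed with the commands above.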

10. Troubleshooting

| Symptom | Likely cause | Fix |
|---|---|---|
| pi: `HTTP 400: request exceeds available context size` | Model config has a small `--ctx-size` | Increase `--ctx-size` in the YAML config |
| pi: `HTTP 400: File Not Found` on load | Wrong model id; check `/v1/models` | Use the exact id from the models list |
| Model shows as `[unloaded]` in `/ai-server-status` | Model isn't currently loaded in llama-swap | Run `/ai-server-load <id>` to warm it |
| First request is slow | Cold model load; no preload configured | Add `hooks.on_startup.preload: [<id>]` to config |
| `certutil: unable to open …root-ca.pem` | CA file not yet scp'd locally | Copy root-ca.pem from the Caddy host |
| Brave: p12 import "Invalid or corrupt file" | OpenSSL 3 default PBES2/AES-256 encryption | Regenerate with `openssl pkcs12 -legacy -export …` |
| Brave: site loads but padlock is red | Chrome Root Store issue | Use brave://certificate-manager/ → Custom |
| Cert selection prompt appears on every page load | `AutoSelectCertificateForUrls` policy missing or malformed | See §9.3 |
| System-trust `update-ca-trust` has no effect on Brave | Brave is a Flatpak; sandbox doesn't see host /etc/pki/ca-trust | Import directly into the sandbox's NSS DB (§9.3) |
| Chat first-token latency seems long | Cold model load | First chat turn may wait 10 to 60 s while the GGUF is mmap'd in |
| `/ai-server-restart` fails | Wrong service unit name | Check `AI_SERVER_SERVICE_UNIT` / create the proper unit |
| `/ai-server-ctx` fails | YAML format changed | Edit ~/.config/llama-swap/config.yaml manually first |

11. Security notes

  • The client private key (client.key / client-key.pem / client-legacy.p12) is the sole credential for API access. Treat it like an SSH key — do not share, do not commit, do not email.
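  • Caddy accepts any client certificate signed by root-ca.pem, so removing or renaming a client cert file on the Caddy host does not by itself revoke it. Proper CRL/OCSP is not set up (this is a single-user deployment); to revoke a client, generate a new root CA, point trusted_ca_cert_file at it, and reissue the client certs that should keep access.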
  • The apiKey: "ai-server-mtls" string in index.ts is a placeholder required by the pi model registry; no bearer token is sent over the wire. All auth is cert-based.
  • Every admin slash command with a mutating side-effect (ctx, restart) is gated behind a ctx.ui.confirm dialog.

12. Paths reference

On the AI server (192.168.2.3)

| Path | Purpose |
|---|---|
| `~/llama.cpp/` | llama.cpp source + build tree |
| `~/llama.cpp/build/bin/llama-server` | Binary (invoked by llama-swap) |
| `~/models/*.gguf` | Model weights |
| `~/.config/llama-swap/config.yaml` | llama-swap YAML config |
| `~/.config/systemd/user/llama-swap.service` | Service unit |
| `~/vram-monitor.sh` | Optional idle-unload cron helper |

On the Caddy host (192.168.2.2)

| Path | Purpose |
|---|---|
| `/mnt/ssdpool/@docker/caddy/Caddyfile` | Caddy config |
| `/mnt/ssdpool/@docker/caddy/docker-compose.yml` | Caddy container definition |
| `/mnt/ssdpool/@docker/caddy/certs/root-ca.pem` | Root CA (public) |
| `/mnt/ssdpool/@docker/caddy/certs/root-ca.key` | Root CA private key (keep offline-ish) |
| `/mnt/ssdpool/@docker/caddy/certs/caddy.pem` + `caddy-key.pem` | Server cert for ai.shahondin1624.de |
| `/mnt/ssdpool/@docker/caddy/certs/client.crt` + `client.key` | Client cert/key |
| `/mnt/ssdpool/@docker/caddy/certs/client-legacy.p12` | Browser-import bundle (legacy-encoded) |

On each pi client

| Path | Purpose |
|---|---|
| `~/.pi/agent/certs/client.pem` | Client cert |
| `~/.pi/agent/certs/client-key.pem` | Client private key |
| `~/.pi/agent/certs/root-ca.pem` | Root CA |
| `~/.pi/agent/extensions/ai-server/` | This extension |