Files

shahondin1624 99ad3630fc stream: report cached tokens; indicator: suppress pi's "Working..."

Two small fixes:

ai-server/stream.ts
- llama.cpp reports cached prompt tokens via
    usage.prompt_tokens_details.cached_tokens
  and we were ignoring it. Populate output.usage.cacheRead so pi's
  footer can show the "R<tokens>" field. cacheRead is a subset of
  prompt_tokens (already counted in input), so totalTokens stays
  input + output — no double-counting.

dark-mechanicus-indicator.ts
- Pi appends "Working... (ESC to interrupt)" next to custom working
  indicator frames via a separate message slot. Call
  ctx.ui.setWorkingMessage("") on session_start + every turn_start to
  clear that suffix so the indicator line is just
    ⚙ <quote> · <elapsed>
  with no trailing "Working...".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-23 23:28:16 +02:00
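
The token accounting described in the commit message above can be sketched as follows. This is a minimal illustration, not the code in stream.ts: the `OpenAIUsage` and `PiUsage` interfaces and the `toPiUsage` helper are hypothetical names. Only `usage.prompt_tokens_details.cached_tokens`, `cacheRead`, and the `input + output` total come from the message.

```typescript
// Usage block as sent by an OpenAI-compatible server. The optional
// prompt_tokens_details.cached_tokens field is the one named in the commit.
interface OpenAIUsage {
  prompt_tokens: number;
  completion_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
}

// Assumed shape of the usage object handed to pi (hypothetical).
interface PiUsage {
  input: number;
  output: number;
  cacheRead: number;
  totalTokens: number;
}

function toPiUsage(usage: OpenAIUsage): PiUsage {
  const input = usage.prompt_tokens;
  const output = usage.completion_tokens;
  // Cached tokens are a subset of the prompt tokens, so they are reported
  // separately but never added to the total a second time.
  const cacheRead = usage.prompt_tokens_details?.cached_tokens ?? 0;
  return { input, output, cacheRead, totalTokens: input + output };
}
```

With 1000 prompt tokens (800 of them cached) and 50 completion tokens, the total stays at 1050 while `cacheRead` reports 800.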

admin.ts

ai-server: stop pinning server cert to private CA (LE is now live)

2026-04-23 23:03:35 +02:00

config.ts

Initial commit — ai-server and local-llama extensions

2026-04-23 21:14:40 +02:00

index.ts

Gate startup logs behind PI_DEBUG + skip GGUF-shard phantom entries

2026-04-23 22:38:09 +02:00

messages.ts

Initial commit — ai-server and local-llama extensions

2026-04-23 21:14:40 +02:00

README.md

Initial commit — ai-server and local-llama extensions

2026-04-23 21:14:40 +02:00

stream.ts

stream: report cached tokens; indicator: suppress pi's "Working..."

2026-04-23 23:28:16 +02:00

README.md

ai-server — PI extension for a self-hosted llama.cpp router behind mTLS

A multi-file pi extension that exposes a remote llama.cpp router as a provider to pi, with dynamic model discovery and admin slash commands. Chat streams use client-certificate TLS so the endpoint can be exposed over the public internet without a bearer token.

1. Architecture

┌────────────┐  mTLS (HTTPS)  ┌──────────────┐  HTTP   ┌──────────────────┐
│ pi client  │───────────────►│ Caddy        │────────►│ llama-server     │
│ (this ext) │                │ 192.168.2.2  │         │ 192.168.2.3:8080 │
└────────────┘  client cert   │ ai.…         │         │ router mode      │
                              └──────────────┘         │ --models-max 1   │
                                                       └──────────────────┘
                                                                │
                                                       ~/.llama-models.ini
                                                       (per-model presets)

Caddy terminates TLS and enforces require_and_verify client-cert auth on ai.shahondin1624.de. Plaintext HTTP is forwarded to the llama-server router.
llama-server runs in router mode with --models-max 1, so exactly one worker is loaded at a time; selecting a different model unloads the previous one.
This extension performs OpenAI-compatible chat streaming over mTLS and surfaces admin endpoints as pi slash commands.
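
As an illustration of the client side of this flow, the sketch below builds a Node.js HTTPS request that presents a client certificate instead of a bearer token. The helper names (`chatRequestOptions`, `postChat`) and the `ClientTls` type are hypothetical and not taken from stream.ts. The `cert`, `key`, and `ca` fields are standard Node.js TLS options.

```typescript
import * as https from "node:https";

// PEM material for the handshake (contents of client.pem, client-key.pem
// and, when a private CA signs the server cert, root-ca.pem).
interface ClientTls {
  cert: string | Buffer;
  key: string | Buffer;
  ca?: string | Buffer;
}

// Build request options for the OpenAI-compatible chat endpoint.
function chatRequestOptions(baseUrl: string, tls: ClientTls): https.RequestOptions {
  const url = new URL("/v1/chat/completions", baseUrl);
  return {
    protocol: url.protocol,
    hostname: url.hostname,
    port: url.port || 443,
    path: url.pathname,
    method: "POST",
    headers: { "Content-Type": "application/json", Accept: "text/event-stream" },
    cert: tls.cert,
    key: tls.key,
    ca: tls.ca, // undefined falls back to Node's default trust store
    rejectUnauthorized: true, // keep server certificate checks enabled
  };
}

// Send the payload and hand raw response text to the caller as it arrives.
function postChat(baseUrl: string, tls: ClientTls, body: unknown, onChunk: (text: string) => void) {
  const req = https.request(chatRequestOptions(baseUrl, tls), (res) => {
    res.setEncoding("utf8");
    res.on("data", onChunk);
  });
  req.end(JSON.stringify(body));
  return req;
}
```

No Authorization header is set anywhere; the certificate presented during the TLS handshake is the only credential.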

2. Extension layout

~/.pi/agent/extensions/ai-server/
├── index.ts       entry: async discovery + registerProvider + commands
├── config.ts      URLs, SSH host, cert paths, MODELS[] fallback
├── messages.ts    Context → OpenAI chat/completions messages
├── stream.ts      custom streamSimple: SSE parse, mTLS HTTPS, pi-ai events
├── admin.ts       router HTTP client + SSH helpers (preset edit, systemctl)
└── README.md      this file
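
To make the "SSE parse" role of stream.ts more concrete, here is a minimal sketch of an incremental parser for OpenAI-style chat streams. It is an assumption about the general technique, not a copy of the extension's code; the `SseParser` class is a hypothetical name, and it only extracts `choices[0].delta.content`.

```typescript
// Feed raw text chunks as they arrive; each call returns the text deltas
// that were completed by that chunk. Partial lines are kept in the buffer.
class SseParser {
  private buffer = "";
  done = false;

  push(chunk: string): string[] {
    this.buffer += chunk;
    const deltas: string[] = [];
    let newline = this.buffer.indexOf("\n");
    while (newline >= 0) {
      const line = this.buffer.slice(0, newline).replace(/\r$/, "");
      this.buffer = this.buffer.slice(newline + 1);
      newline = this.buffer.indexOf("\n");
      if (!line.startsWith("data:")) continue; // blank lines, comments, other fields
      const payload = line.slice(5).trim();
      if (payload === "[DONE]") {
        this.done = true; // end-of-stream sentinel used by OpenAI-style APIs
        continue;
      }
      const event = JSON.parse(payload);
      const text = event.choices?.[0]?.delta?.content;
      if (typeof text === "string" && text.length > 0) deltas.push(text);
    }
    return deltas;
  }
}
```

Buffering until a newline arrives matters because a network chunk can end in the middle of a JSON payload.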

3. Environment variables

All are optional — the defaults match the current host.

Env var	Default	Purpose
`AI_SERVER_URL`	`https://ai.shahondin1624.de`	Base URL of the Caddy endpoint
`AI_SERVER_CERTS_DIR`	`~/.pi/agent/certs`	Dir holding client cert + key + CA
`AI_SERVER_CA`	`<certs>/root-ca.pem`	CA file
`AI_SERVER_CLIENT_CERT`	`<certs>/client.pem`	Client cert
`AI_SERVER_CLIENT_KEY`	`<certs>/client-key.pem`	Client private key
`AI_SERVER_TIMEOUT_MS`	`300000`	Per-request stream timeout
`AI_SERVER_SSH_HOST`	`ai-server@192.168.2.3`	SSH target for admin commands
`AI_SERVER_PRESET_PATH`	`~/.llama-models.ini`	Preset path on the SSH target
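
The defaults in the table could be resolved along these lines. This is a sketch under assumptions: `loadConfig` and `expandHome` are hypothetical names, and the real config.ts may structure this differently.

```typescript
import * as os from "node:os";
import * as path from "node:path";

type Env = Record<string, string | undefined>;

// Expand a leading "~" to the local user's home directory.
function expandHome(p: string): string {
  return p === "~" || p.startsWith("~/") ? path.join(os.homedir(), p.slice(1)) : p;
}

function loadConfig(env: Env) {
  const certsDir = expandHome(env.AI_SERVER_CERTS_DIR ?? "~/.pi/agent/certs");
  const timeout = Number(env.AI_SERVER_TIMEOUT_MS ?? 300000);
  return {
    url: env.AI_SERVER_URL ?? "https://ai.shahondin1624.de",
    ca: expandHome(env.AI_SERVER_CA ?? path.join(certsDir, "root-ca.pem")),
    clientCert: expandHome(env.AI_SERVER_CLIENT_CERT ?? path.join(certsDir, "client.pem")),
    clientKey: expandHome(env.AI_SERVER_CLIENT_KEY ?? path.join(certsDir, "client-key.pem")),
    // Fall back to the default when the variable is not a positive number.
    timeoutMs: Number.isFinite(timeout) && timeout > 0 ? timeout : 300000,
    sshHost: env.AI_SERVER_SSH_HOST ?? "ai-server@192.168.2.3",
    // Resolved on the SSH target, so the "~" is deliberately left unexpanded.
    presetPath: env.AI_SERVER_PRESET_PATH ?? "~/.llama-models.ini",
  };
}
```

Calling `loadConfig(process.env)` with none of the variables set yields the defaults from the table, with the cert paths rooted in the local home directory.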

4. Server-side setup (192.168.2.3)

4.1 llama.cpp build

git clone https://github.com/ggerganov/llama.cpp ~/llama.cpp
cd ~/llama.cpp && cmake -B build -DGGML_VULKAN=ON && cmake --build build --config Release -j$(nproc)

Vulkan is used for GPU offload on the Strix Halo iGPU (no ROCm needed). The binary ends up at ~/llama.cpp/build/bin/llama-server.

4.2 Model storage

~/models/<model-name>.gguf

Multi-shard GGUFs (*-00001-of-NNNNN.gguf) work too — point the preset at the first shard and llama.cpp auto-loads the rest.
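
Because only the first shard is a valid entry point, a directory listing contains files that should not be offered as models. One possible filter is sketched below; the helper names are hypothetical, and the five-digit pattern is taken from the file-name convention shown above. How index.ts actually skips such entries is not shown in this README.

```typescript
// Matches the shard suffix "-00002-of-00005.gguf" and captures both numbers.
const SHARD_RE = /-(\d{5})-of-(\d{5})\.gguf$/;

// True for shard files other than the first one.
function isTrailingShard(fileName: string): boolean {
  const match = SHARD_RE.exec(fileName);
  return match !== null && Number(match[1]) > 1;
}

// Keep single-file GGUFs and first shards; drop everything else.
function loadableGgufs(fileNames: string[]): string[] {
  return fileNames.filter((name) => name.endsWith(".gguf") && !isTrailingShard(name));
}
```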

4.3 Preset file — `~/.llama-models.ini`

Router mode consults this file. Each [section] is a model id usable in API requests. The section name and model = path are the only required fields; the rest become --flag value args to the per-model worker when it spawns.

[Qwen_Qwen3.6-35B-A3B-Q8_0]
model = /home/ai-server/models/Qwen_Qwen3.6-35B-A3B-Q8_0.gguf
ctx-size = 262144
temp = 0.7
cache-type-k = q8_0
cache-type-v = q8_0
n-gpu-layers = 99

[MiniMax-M2.7-IQ3_XXS]
model = /home/ai-server/models/MiniMax-M2.7-UD-IQ3_XXS-00001-of-NNNNN.gguf
ctx-size = 131072
temp = 1.0
cache-type-k = q8_0
cache-type-v = q8_0
n-gpu-layers = 99

Placeholder sections (without model =) show up in GET /models but are filtered out by the extension's discovery — they would fail on load.
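
The mapping from preset keys to worker flags, and the filtering of placeholder sections, can be sketched as below. This is a simplified illustration: the parser handles only section headers and key = value lines, and the function names are hypothetical. llama.cpp's own preset handling may differ in details.

```typescript
type Preset = Record<string, string>;

// Minimal INI reader: "[section]" headers and "key = value" lines only.
function parsePresets(ini: string): Map<string, Preset> {
  const presets = new Map<string, Preset>();
  let current: Preset | undefined;
  for (const raw of ini.split(/\r?\n/)) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#") || line.startsWith(";")) continue;
    const header = /^\[(.+)\]$/.exec(line);
    if (header) {
      current = {};
      presets.set(header[1], current);
      continue;
    }
    const eq = line.indexOf("=");
    if (eq > 0 && current) current[line.slice(0, eq).trim()] = line.slice(eq + 1).trim();
  }
  return presets;
}

// Every key except "model" becomes a "--key value" pair for the worker.
function workerArgs(preset: Preset): string[] {
  const args: string[] = [];
  for (const key of Object.keys(preset)) {
    if (key !== "model") args.push(`--${key}`, preset[key]);
  }
  return args;
}

// Sections without "model =" are placeholders and cannot be loaded.
function loadableIds(presets: Map<string, Preset>): string[] {
  const ids: string[] = [];
  presets.forEach((preset, id) => {
    if (preset.model) ids.push(id);
  });
  return ids;
}
```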

4.4 Systemd user service — `~/.config/systemd/user/llama-server.service`

[Unit]
Description=LLaMA.cpp AI Server (Router Mode, Vulkan)
After=network.target
Wants=network.target

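[Service]
Type=simple
# No User=/Group= here: a user unit already runs as the owning account,
# and the user manager cannot switch credentials (fails with 216/GROUP).
WorkingDirectory=/home/ai-server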
ExecStart=/home/ai-server/llama.cpp/build/bin/llama-server \
    --host 0.0.0.0 \
    --port 8080 \
    --models-dir /home/ai-server/models \
    --models-max 1 \
    --models-autoload \
    --models-preset /home/ai-server/.llama-models.ini \
    --gpu-layers 99 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0

LimitNOFILE=65536
LimitMEMLOCK=unlimited

Restart=on-failure
RestartSec=5
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=default.target

Important flags:

No -c <N> at the router level. That flag is inherited by every child worker and silently caps the preset's ctx-size. Let per-model presets win.
--models-max 1 enforces single-model concurrency (matters on shared unified-memory hardware where two workers would fight for VRAM).
--models-autoload spawns workers on demand via POST /models/load.
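
The capping behaviour described in the first point can be expressed as a small check. The helpers below are hypothetical and only model the rule as stated here (a router-level value limits whatever the preset requests); they do not inspect llama.cpp itself.

```typescript
// Scan the router's own argument list (not a worker's) for a context-size flag.
function routerCtxOverride(routerArgs: string[]): number | undefined {
  for (let i = 0; i < routerArgs.length - 1; i++) {
    if (routerArgs[i] === "-c" || routerArgs[i] === "--ctx-size") {
      const size = Number(routerArgs[i + 1]);
      return Number.isFinite(size) ? size : undefined;
    }
  }
  return undefined;
}

// Effective context under the rule above: the router-level value, when
// present, caps the preset's ctx-size.
function effectiveCtx(presetCtx: number, routerArgs: string[]): number {
  const cap = routerCtxOverride(routerArgs);
  return cap === undefined ? presetCtx : Math.min(presetCtx, cap);
}
```

A check like this could run against the ExecStart arguments before a restart to catch a stray -c early.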

Enable and start:

systemctl --user daemon-reload && systemctl --user enable --now llama-server.service
loginctl enable-linger $(whoami)   # keep user services running across logouts

4.5 Router HTTP API (reference)

Method	Path	Body	Notes
`GET`	`/models`	—	List models; `status.args` contains the spawned worker's command line
`POST`	`/models/load`	`{"model":"<id>"}`	Payload key is `model`, not `id`
`POST`	`/models/unload`	`{"model":"<id>"}`	Same
`GET`	`/health`	—	`{"status":"ok"}` when router is up
`POST`	`/v1/chat/completions`	OpenAI Chat Completions payload	What pi and the web UI use
`GET`	`/`	—	Built-in SvelteKit chat UI with a model picker
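
A thin request builder makes the `model` payload key hard to get wrong. The sketch below is illustrative only: `RouterRequest` and `modelAction` are hypothetical names, and the result is meant to be handed to whatever HTTP client performs the call.

```typescript
interface RouterRequest {
  method: "GET" | "POST";
  path: string;
  body?: string;
}

// Build the request for POST /models/load or POST /models/unload.
// The JSON key is "model", matching the table above.
function modelAction(action: "load" | "unload", id: string): RouterRequest {
  return { method: "POST", path: `/models/${action}`, body: JSON.stringify({ model: id }) };
}

// GET /models has no body.
const listModels: RouterRequest = { method: "GET", path: "/models" };
```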

5. Caddy + mTLS setup (192.168.2.2)

Caddy config lives at /mnt/ssdpool/@docker/caddy/ (Caddyfile, docker-compose, certs). The domain ai.shahondin1624.de is configured with strict mTLS:

ai.shahondin1624.de {
    tls /etc/caddy/certs/caddy.pem /etc/caddy/certs/caddy-key.pem {
        client_auth {
            mode require_and_verify
            trusted_ca_cert_file /etc/caddy/certs/root-ca.pem
        }
    }
    reverse_proxy 192.168.2.3:8080
}

The volume mount in docker-compose must expose ./certs into the container at /etc/caddy/certs:ro — Caddy cannot read cert files that aren't inside its filesystem namespace.

5.1 Certificate generation

Run on the Caddy host (192.168.2.2):

cd /mnt/ssdpool/@docker/caddy/certs
openssl genrsa -out root-ca.key 4096
openssl req -new -x509 -days 3650 -key root-ca.key -out root-ca.pem -subj "/CN=ShahODin Root CA/O=ShahODin/C=DE"
openssl genrsa -out client.key 4096
openssl req -new -key client.key -out client.csr -subj "/CN=ShahODin Client/O=ShahODin/C=DE"
openssl x509 -req -in client.csr -CA root-ca.pem -CAkey root-ca.key -CAcreateserial -out client.crt -days 3650

Bundle into a PKCS#12 for browser import. Use -legacy so NSS-based stores (Firefox, Chromium on Linux, Brave Flatpak) can read it — OpenSSL 3 defaults to PBES2/AES-256 which older parsers reject:

openssl pkcs12 -legacy -export -out client-legacy.p12 -inkey client.key -in client.crt -certfile root-ca.pem -passout pass:

Files needed on each client: client.crt (as client.pem), client.key (as client-key.pem), root-ca.pem. For CLI usage copy them to ~/.pi/agent/certs/ on the client machine; the extension reads them from there.

6. Client-side — installing the extension

# 1) Copy certs to the canonical client location
mkdir -p ~/.pi/agent/certs
scp user@caddy-host:/mnt/ssdpool/@docker/caddy/certs/client.crt ~/.pi/agent/certs/client.pem
scp user@caddy-host:/mnt/ssdpool/@docker/caddy/certs/client.key ~/.pi/agent/certs/client-key.pem
scp user@caddy-host:/mnt/ssdpool/@docker/caddy/certs/root-ca.pem ~/.pi/agent/certs/

# 2) Copy the extension directory
scp -r user@source:~/.pi/agent/extensions/ai-server ~/.pi/agent/extensions/

# 3) Optionally configure SSH key auth to the AI server (for admin commands)
ssh-copy-id ai-server@192.168.2.3

Run /reload in pi — the extension loads, discovers models from the router, registers the ai-server provider, and installs the admin slash commands.

7. Slash commands

Command	Purpose	Transport
`/ai-server-status`	Tabular view of models, status, ctx size	HTTPS mTLS
`/ai-server-refresh`	Re-discover models and re-register the provider	HTTPS mTLS
`/ai-server-load <id>`	Load a model on-demand	HTTPS mTLS
`/ai-server-unload <id>`	Unload a model	HTTPS mTLS
`/ai-server-ctx <id> <size>`	Edit preset ctx-size, unload + reload	SSH + HTTPS
`/ai-server-preset`	Print the server's `~/.llama-models.ini`	SSH
`/ai-server-restart`	`systemctl --user restart llama-server.service`	SSH

<id> arguments tab-complete against the live router model list.
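
Argument handling for these commands might look like the sketch below. Both helpers are hypothetical: the pi command-registration API is not shown in this README, so only the pure parsing and completion logic is illustrated, and case-insensitive matching is an assumption.

```typescript
// Return the live model ids that start with what the user has typed so far.
function completeModelId(prefix: string, liveIds: string[]): string[] {
  const needle = prefix.toLowerCase();
  return liveIds.filter((id) => id.toLowerCase().startsWith(needle)).sort();
}

// Parse the argument string of "/ai-server-ctx <id> <size>".
function parseCtxArgs(argLine: string): { id: string; size: number } {
  const [id, rawSize, ...extra] = argLine.trim().split(/\s+/);
  const size = Number(rawSize);
  if (!id || extra.length > 0 || !Number.isInteger(size) || size <= 0) {
    throw new Error("usage: /ai-server-ctx <id> <size>");
  }
  return { id, size };
}
```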

8. Adding a new model

# On the AI server
ssh ai-server@192.168.2.3
cd ~/models && hf download <author>/<repo> --include '*<quant>*' --local-dir .

# Add a preset section to ~/.llama-models.ini — section name = model id
# (see example in §4.3)

Then from pi:

/ai-server-refresh      # discovers the new preset
/ai-server-load <id>    # first load may take a minute for a cold GGUF

No extension-side config changes are needed — discovery picks it up.

9. Browser access to the built-in web UI

llama-server ships a SvelteKit chat UI at / with a model picker. Navigate to https://ai.shahondin1624.de/ in any browser that has the client cert and trusts the root CA.

9.1 Firefox (simplest path, always works)

Firefox uses its own NSS trust exclusively. Import client-legacy.p12 under Preferences → Privacy & Security → Certificates → Your Certificates, and root-ca.pem under Authorities with "trust to identify websites" checked.

9.2 Chromium / Brave

Chromium on Linux now uses the bundled Chrome Root Store (CRS) for server cert validation. In recent Brave/Chrome builds, neither /etc/pki/ca-trust/source/anchors/ (system trust) nor the user's ~/.pki/nssdb is consulted on its own for server cert chain verification. Two workarounds:

brave://certificate-manager/ → Custom (Chromium ≥137): import root-ca.pem here and flag it as trusted for websites. This is the modern replacement for the removed ChromeRootStoreEnabled policy.
Fallback: Firefox. If the Custom tab isn't available or the feature is still buggy in a given build, use Firefox for the web UI. The mTLS client cert import path is straightforward there.

Client-cert auth (the mTLS handshake itself) still works via NSS even when server cert validation goes through CRS, so installing the client .p12 into NSS is enough for the handshake. Only the padlock/trust UI is affected by the CRS issue.

9.3 Brave Flatpak specifics

The Brave Flatpak has its own isolated NSS database at ~/.var/app/com.brave.Browser/.pki/nssdb/. Import directly into it:

pk12util -d sql:$HOME/.var/app/com.brave.Browser/.pki/nssdb -i ~/client-legacy.p12 -W ''
certutil -d sql:$HOME/.var/app/com.brave.Browser/.pki/nssdb -A -t "CT,C,C" -n "ShahODin Root CA" -i ~/root-ca.pem

To stop the "select a certificate" prompt on each page load, write a Brave enterprise policy:

sudo flatpak override com.brave.Browser --filesystem=/etc/brave:ro
sudo install -D -m 644 /path/to/policy.json /etc/brave/policies/managed/shahondin1624.json
flatpak kill com.brave.Browser

Where policy.json contains:

{
  "AutoSelectCertificateForUrls": [
    "{\"pattern\":\"https://ai.shahondin1624.de\",\"filter\":{\"ISSUER\":{\"CN\":\"ShahODin Root CA\"}}}"
  ]
}

Verify under brave://policy. The policy must show status OK, not Error (an Error usually means the key has been renamed or removed upstream).
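
Each entry in AutoSelectCertificateForUrls is itself a JSON document encoded as a string, which is easy to mis-escape by hand. One way to avoid that is to generate the file; the sketch below uses a hypothetical `autoSelectPolicy` helper and reproduces the structure shown above.

```typescript
// Build the policy file contents. The inner rule is serialised first so that
// the outer JSON.stringify takes care of escaping its quotes.
function autoSelectPolicy(origin: string, issuerCn: string): string {
  const rule = JSON.stringify({ pattern: origin, filter: { ISSUER: { CN: issuerCn } } });
  return JSON.stringify({ AutoSelectCertificateForUrls: [rule] }, null, 2);
}

const policyJson = autoSelectPolicy("https://ai.shahondin1624.de", "ShahODin Root CA");
```

Writing `policyJson` to policy.json gives the same content as the hand-written file, with the escaping handled by the serialiser.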

10. Troubleshooting

Symptom	Likely cause	Fix
pi: `HTTP 400: request exceeds available context size`	Router started with `-c <small>`, overriding the preset's larger `ctx-size`	Remove the router-level `-c` flag from the systemd ExecStart
pi: `HTTP 400: File Not Found` on `/models/load`	Wrong JSON body key (older versions used `id`)	Must be `{"model":"<id>"}` — the extension's `admin.ts` already does this
`certutil: unable to open …root-ca.pem`	CA file not yet scp'd locally	Copy `root-ca.pem` from the Caddy host
Brave: p12 import "Invalid or corrupt file"	OpenSSL 3 default PBES2/AES-256 encryption	Regenerate with `openssl pkcs12 -legacy -export …`
Brave: site loads but padlock is red, `ChromeRootStoreEnabled: Error` in `brave://policy`	Policy was removed upstream	Use `brave://certificate-manager/` → Custom, or use Firefox
Cert selection prompt appears on every page load	`AutoSelectCertificateForUrls` policy missing or malformed	See §9.3
System-trust update-ca-trust has no effect on Brave	Brave is a Flatpak; sandbox doesn't see host `/etc/pki/ca-trust`	Import directly into the sandbox's NSS DB (§9.3)
Model shows as `[no model path]` in `/ai-server-status`	Preset section in `~/.llama-models.ini` has no `model =` line	Add the path, then `/ai-server-refresh`
Chat first-token latency seems long	Cold model load is not counted separately	First chat turn may wait 10–60s while the GGUF is mmap'd in; subsequent turns stream immediately

11. Security notes

The client private key (client.key / client-key.pem / client-legacy.p12) is the sole credential for API access. Treat it like an SSH key — do not share, do not commit, do not email.
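Caddy trusts any certificate signed by the root CA listed in trusted_ca_cert_file, so with the configuration in §5 removing or renaming a client cert file on the Caddy host does not revoke it. To revoke a client, issue a new root CA, re-sign the remaining client certs with it, and point Caddy at the new CA file. (Proper CRL/OCSP is not set up; this is a single-user deployment.)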
The apiKey: "ai-server-mtls" string in index.ts is a placeholder required by the pi model registry; no bearer token is sent over the wire. All auth is cert-based.
Every admin slash command with a mutating side-effect (ctx, restart) is gated behind a ctx.ui.confirm dialog.

12. Paths reference

On the AI server (192.168.2.3)

Path	Purpose
`~/llama.cpp/`	llama.cpp source + build tree
`~/llama.cpp/build/bin/llama-server`	Binary
`~/models/*.gguf`	Model weights
`~/.llama-models.ini`	Router preset file
`~/.config/systemd/user/llama-server.service`	Service unit
`~/vram-monitor.sh`	Optional idle-unload cron helper

On the Caddy host (192.168.2.2)

Path	Purpose
`/mnt/ssdpool/@docker/caddy/Caddyfile`	Caddy config
`/mnt/ssdpool/@docker/caddy/docker-compose.yml`	Caddy container definition
`/mnt/ssdpool/@docker/caddy/certs/root-ca.pem`	Root CA (public)
`/mnt/ssdpool/@docker/caddy/certs/root-ca.key`	Root CA private key (keep offline-ish)
`/mnt/ssdpool/@docker/caddy/certs/caddy.pem` + `caddy-key.pem`	Server cert for `ai.shahondin1624.de`
`/mnt/ssdpool/@docker/caddy/certs/client.crt` + `client.key`	Client cert/key
`/mnt/ssdpool/@docker/caddy/certs/client-legacy.p12`	Browser-import bundle (legacy-encoded)

On each pi client

Path	Purpose
`~/.pi/agent/certs/client.pem`	Client cert
`~/.pi/agent/certs/client-key.pem`	Client private key
`~/.pi/agent/certs/root-ca.pem`	Root CA
`~/.pi/agent/extensions/ai-server/`	This extension
