migrate ai-server extension from llama.cpp router to llama-swap

Endpoint rewrites:
  - GET /v1/models + /running → merged listModels() with running flag
  - POST /models/load → GET /upstream/<id>/health (warm load)
  - POST /models/unload → POST /api/models/unload/<id> (no body)
  - Added POST /api/models/unload for unloadAll()

Config migration:
  - Preset path: ~/.llama-models.ini → ~/.config/llama-swap/config.yaml
  - Service unit: llama-server.service → llama-swap.service
  - setPresetKey() rewritten from INI awk to YAML-aware awk for
    editing --ctx-size/--temp/--n-gpu-layers in cmd: blocks
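
The edit that setPresetKey() performs remotely can be sketched as a pure string transform. This is a minimal illustration, not the shipped awk script: `replaceFlagInModelBlock` is a hypothetical name, and the sketch assumes model ids sit at exactly two spaces of indent under `models:` and that the flag already exists inside the `cmd:` block.

```typescript
// Illustrative only: a local, pure-string version of the remote awk edit.
// Assumption: ids are indented by exactly two spaces under `models:`.
function replaceFlagInModelBlock(
  yaml: string,
  modelId: string,
  flag: string, // e.g. "--ctx-size"
  value: string, // e.g. "65536"
): { text: string; replaced: boolean } {
  let inBlock = false;
  let replaced = false;
  const out = yaml.split("\n").map((line) => {
    const indent = line.length - line.trimStart().length;
    const isBlank = line.trim() === "";
    if (!isBlank && indent <= 2) {
      // Any key at indent <= 2 closes the previous block; enter ours only
      // when the line is exactly "<modelId>:" at indent 2.
      inBlock = indent === 2 && line.trim() === `${modelId}:`;
      return line;
    }
    if (inBlock && !replaced) {
      const tokens = line.trim().split(/\s+/);
      const i = tokens.indexOf(flag);
      if (i >= 0 && i + 1 < tokens.length) {
        tokens[i + 1] = value;
        replaced = true;
        return line.slice(0, indent) + tokens.join(" ");
      }
    }
    return line;
  });
  return { text: out.join("\n"), replaced };
}
```

Returning `replaced` lets a caller report "flag not found" explicitly instead of silently writing the file back unchanged.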

Per-model ctx-size (fixes 0/33k bug):
  - parseCtxMapFromYaml(): walks config.yaml, extracts --ctx-size N per
    model block → Map<id, ctxSize>
  - extractCtxFromRunningCmd(): parses --ctx-size from /running cmd string
  - discoverModels(): Promise.all(listModels, listRunning, readPreset),
    ctx priority: running cmd → yaml → 32768 fallback
  - Removed broken extractCtxSize stub and dangling imports
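
The ctx priority above can be sketched as follows. This is a hedged re-implementation for illustration: the real helpers live in router-utils.ts and admin.ts and may differ in detail, `resolveCtx` is a hypothetical name, and accepting the `--ctx-size=N` form is an assumption.

```typescript
// Illustrative sketch; not the shipped implementation.
function extractCtxFromRunningCmd(cmd: string | undefined): number | null {
  if (!cmd) return null;
  // Assumption: accept both "--ctx-size 32768" and "--ctx-size=32768".
  const m = /--ctx-size[=\s]+(\d+)/.exec(cmd);
  if (!m) return null;
  const n = Number(m[1]);
  return Number.isFinite(n) && n > 0 ? n : null;
}

// Priority: running cmd, then config.yaml, then the 32768 fallback.
function resolveCtx(
  id: string,
  fromRunning: Map<string, number>,
  fromYaml: Map<string, number>,
  fallback = 32768,
): number {
  return fromRunning.get(id) ?? fromYaml.get(id) ?? fallback;
}
```

Using `??` rather than `||` keeps the chain explicit about "missing" versus "present"; the extractor already rejects zero and non-numeric values, so neither map should hold a falsy entry.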

Tests: 14 passing (parseCtxMapFromYaml ×5, extractCtxFromRunningCmd ×3,
isShardArtefact ×3, isReasoningModel ×3)

README: full rewrite covering llama-swap architecture, YAML config format,
new endpoints, troubleshooting table updated.
shahondin1624
2026-05-27 10:42:19 +02:00
parent 6a70995a98
commit f7af660727
6 changed files with 414 additions and 171 deletions
+87 -93
@@ -1,9 +1,9 @@
# ai-server — PI extension for a self-hosted llama.cpp router behind mTLS # ai-server — PI extension for a self-hosted llama-swap server behind mTLS
A multi-file pi extension that exposes a remote llama.cpp router as a provider A multi-file pi extension that exposes a remote llama-swap instance as a
to pi, with dynamic model discovery and admin slash commands. Chat streams use provider to pi, with dynamic model discovery and admin slash commands. Chat
client-certificate TLS so the endpoint can be exposed over the public internet streams use client-certificate TLS so the endpoint can be exposed over the
without a bearer token. public internet without a bearer token.
--- ---
@@ -11,21 +11,21 @@ without a bearer token.
``` ```
┌────────────┐ mTLS (HTTPS) ┌──────────────┐ HTTP ┌─────────────────┐ ┌────────────┐ mTLS (HTTPS) ┌──────────────┐ HTTP ┌─────────────────┐
│ pi client │───────────────►│ Caddy │────────►│ llama-server │ pi client │───────────────►│ Caddy │────────►│ llama-swap
│ (this ext) │ │ 192.168.2.2 │ │ 192.168.2.3:8080 │ │ (this ext) │ │ 192.168.2.2 │ │ 192.168.2.3:8080 │
└────────────┘ client cert │ ai.… │ │ router mode │ └────────────┘ client cert │ ai.… │ │ swap mode
└──────────────┘ │ --models-max 1 └──────────────┘ │ globalTTL: 1800
│ scheduler: one │
└─────────────────┘ └─────────────────┘
~/.llama-models.ini ~/.config/llama-swap/config.yaml
(per-model presets) (YAML model config)
``` ```
- **Caddy** terminates TLS and enforces `require_and_verify` client-cert auth on - **Caddy** terminates TLS and enforces `require_and_verify` client-cert auth
`ai.shahondin1624.de`. Plaintext HTTP is forwarded to the llama-server router. on `ai.shahondin1624.de`. Plaintext HTTP is forwarded to llama-swap.
- **llama-server** runs in `--models-mode router` with `--models-max 1`, so - **llama-swap** runs in swap mode, managing model lifecycle (load/unload/swap)
exactly one worker is loaded at a time; selecting a different model unloads with a YAML config at `~/.config/llama-swap/config.yaml`.
the previous one.
- **This extension** performs OpenAI-compatible chat streaming over mTLS and - **This extension** performs OpenAI-compatible chat streaming over mTLS and
surfaces admin endpoints as pi slash commands. surfaces admin endpoints as pi slash commands.
@@ -37,7 +37,7 @@ without a bearer token.
├── config.ts URLs, SSH host, cert paths, MODELS[] fallback ├── config.ts URLs, SSH host, cert paths, MODELS[] fallback
├── messages.ts Context → OpenAI chat/completions messages ├── messages.ts Context → OpenAI chat/completions messages
├── stream.ts custom streamSimple: SSE parse, mTLS HTTPS, pi-ai events ├── stream.ts custom streamSimple: SSE parse, mTLS HTTPS, pi-ai events
├── admin.ts router HTTP client + SSH helpers (preset edit, systemctl) ├── admin.ts router HTTP client + SSH helpers (YAML edit, systemctl)
└── README.md this file └── README.md this file
``` ```
@@ -54,61 +54,63 @@ All are optional — the defaults match the current host.
| `AI_SERVER_CLIENT_KEY` | `<certs>/client-key.pem` | Client private key | | `AI_SERVER_CLIENT_KEY` | `<certs>/client-key.pem` | Client private key |
| `AI_SERVER_TIMEOUT_MS` | `300000` | Per-request stream timeout | | `AI_SERVER_TIMEOUT_MS` | `300000` | Per-request stream timeout |
| `AI_SERVER_SSH_HOST` | `ai-server@192.168.2.3` | SSH target for admin commands | | `AI_SERVER_SSH_HOST` | `ai-server@192.168.2.3` | SSH target for admin commands |
| `AI_SERVER_PRESET_PATH` | `~/.llama-models.ini` | Preset path on the SSH target | | `AI_SERVER_PRESET_PATH` | `~/.config/llama-swap/config.yaml` | YAML config on the SSH target |
| `AI_SERVER_SERVICE_UNIT` | `llama-swap.service` | systemd unit name |
| `AI_SERVER_MODELS_PATH` | `/v1/models` | Models list endpoint |
| `AI_SERVER_RUNNING_PATH` | `/running` | Currently running models endpoint |
| `AI_SERVER_UNLOAD_PATH` | `/api/models/unload/<id>` | Unload single model |
| `AI_SERVER_UNLOAD_ALL_PATH` | `/api/models/unload` | Unload all models |
| `AI_SERVER_UPSTREAM_HEALTH_PATH` | `/upstream/<id>/health` | Warm-load / health endpoint |
## 4. Server-side setup (192.168.2.3) ## 4. Server-side setup (192.168.2.3)
### 4.1 llama.cpp build ### 4.1 llama-swap install
```bash ```bash
git clone https://github.com/ggerganov/llama.cpp ~/llama.cpp npm install -g llama-swap
cd ~/llama.cpp && cmake -B build -DGGML_VULKAN=ON && cmake --build build --config Release -j$(nproc) # or use the binary release from the llama-swap GitHub repo
``` ```
Vulkan is used for GPU offload on the Strix Halo iGPU (no ROCm needed). The
binary ends up at `~/llama.cpp/build/bin/llama-server`.
### 4.2 Model storage ### 4.2 Model storage
``` ```
~/models/<model-name>.gguf ~/models/<model-name>.gguf
``` ```
Multi-shard GGUFs (`*-00001-of-NNNNN.gguf`) work too — point the preset at the ### 4.3 Config file — `~/.config/llama-swap/config.yaml`
first shard and llama.cpp auto-loads the rest.
### 4.3 Preset file — `~/.llama-models.ini` llama-swap uses a YAML config file. Each model is defined under `models:` with
a `cmd:` block containing the llama-server invocation.
Router mode consults this file. Each `[section]` is a model id usable in API ```yaml
requests. The section name and `model =` path are the only required fields; globalTTL: 1800
the rest become `--flag value` args to the per-model worker when it spawns. models:
Qwen_Qwen3.6-35B-A3B-Q8_0:
cmd: |
/home/ai-server/llama.cpp/build/bin/llama-server
--model /home/ai-server/models/Qwen_Qwen3.6-35B-A3B-Q8_0.gguf
--ctx-size 262144
--temp 0.7
--cache-type-k q8_0
--cache-type-v q8_0
--n-gpu-layers 99
```ini MiniMax-M2.7-IQ3_XXS:
[Qwen_Qwen3.6-35B-A3B-Q8_0] cmd: |
model = /home/ai-server/models/Qwen_Qwen3.6-35B-A3B-Q8_0.gguf /home/ai-server/llama.cpp/build/bin/llama-server
ctx-size = 262144 --model /home/ai-server/models/MiniMax-M2.7-UD-IQ3_XXS.gguf
temp = 0.7 --ctx-size 131072
cache-type-k = q8_0 --temp 1.0
cache-type-v = q8_0 --cache-type-k q8_0
n-gpu-layers = 99 --cache-type-v q8_0
--n-gpu-layers 99
[MiniMax-M2.7-IQ3_XXS]
model = /home/ai-server/models/MiniMax-M2.7-UD-IQ3_XXS-00001-of-NNNNN.gguf
ctx-size = 131072
temp = 1.0
cache-type-k = q8_0
cache-type-v = q8_0
n-gpu-layers = 99
``` ```
Placeholder sections (without `model =`) show up in `GET /models` but are ### 4.4 Systemd user service — `~/.config/systemd/user/llama-swap.service`
filtered out by the extension's discovery — they would fail on load.
### 4.4 Systemd user service — `~/.config/systemd/user/llama-server.service`
```ini ```ini
[Unit] [Unit]
Description=LLaMA.cpp AI Server (Router Mode, Vulkan) Description=LLaMA-swap AI Server (Swap Mode)
After=network.target After=network.target
Wants=network.target Wants=network.target
@@ -117,16 +119,10 @@ Type=simple
User=ai-server User=ai-server
Group=ai-server Group=ai-server
WorkingDirectory=/home/ai-server WorkingDirectory=/home/ai-server
ExecStart=/home/ai-server/llama.cpp/build/bin/llama-server \ ExecStart=/home/ai-server/node_modules/.bin/llama-swap \
--host 0.0.0.0 \ --host 0.0.0.0 \
--port 8080 \ --port 8080 \
--models-dir /home/ai-server/models \ --config /home/ai-server/.config/llama-swap/config.yaml
--models-max 1 \
--models-autoload \
--models-preset /home/ai-server/.llama-models.ini \
--gpu-layers 99 \
--cache-type-k q8_0 \
--cache-type-v q8_0
LimitNOFILE=65536 LimitNOFILE=65536
LimitMEMLOCK=unlimited LimitMEMLOCK=unlimited
@@ -141,18 +137,10 @@ StandardError=journal
WantedBy=default.target WantedBy=default.target
``` ```
Important flags:
- **No `-c <N>`** at the router level. That flag is inherited by every child
worker and silently caps the preset's `ctx-size`. Let per-model presets win.
- **`--models-max 1`** enforces single-model concurrency (matters on shared
unified-memory hardware where two workers would fight for VRAM).
- **`--models-autoload`** spawns workers on demand via `POST /models/load`.
Enable and start: Enable and start:
```bash ```bash
systemctl --user daemon-reload && systemctl --user enable --now llama-server.service systemctl --user daemon-reload && systemctl --user enable --now llama-swap.service
loginctl enable-linger $(whoami) # keep user services running across logouts loginctl enable-linger $(whoami) # keep user services running across logouts
``` ```
@@ -160,12 +148,17 @@ loginctl enable-linger $(whoami) # keep user services running across logouts
| Method | Path | Body | Notes | | Method | Path | Body | Notes |
|---|---|---|---| |---|---|---|---|
| `GET` | `/models` | — | List models; `status.args` contains the spawned worker's command line | | `GET` | `/v1/models` | — | List models; `{"data":[{id,object,created,owned_by}]}` |
| `POST` | `/models/load` | `{"model":"<id>"}` | Payload key is `model`, **not** `id` | | `GET` | `/running` | — | Currently loaded models; `{"running":[{id,...}]}` |
| `POST` | `/models/unload` | `{"model":"<id>"}` | Same | | `POST` | `/api/models/unload` | — | Unload all models; returns `{"msg":"ok"}` |
| `GET` | `/health` | — | `{"status":"ok"}` when router is up | | `POST` | `/api/models/unload/<id>` | — | Unload specific model; plain text `OK` |
| `GET` | `/upstream/<id>/health` | — | Warm-load model (forces spawn without inference) |
| `GET` | `/health` | — | Plain text `OK` (not JSON) |
| `POST` | `/v1/chat/completions` | OpenAI Chat Completions payload | What pi and the web UI use | | `POST` | `/v1/chat/completions` | OpenAI Chat Completions payload | What pi and the web UI use |
| `GET` | `/` | — | Built-in SvelteKit chat UI with a model picker |
> **Note:** Response bodies are mixed JSON and plain text. The extension's
> `routerRequest()` falls back to `{raw: buf}` for non-JSON responses, so
> unload calls won't crash — they'll return `{raw: "OK"}`.
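
The fallback described in the note can be sketched like this. It is a minimal illustration under stated assumptions: `parseRouterBody` is a hypothetical helper name, and the actual `routerRequest()` body is not shown in this diff.

```typescript
// Illustrative sketch of the JSON-or-plain-text fallback; the helper name
// is hypothetical and the real routerRequest() may be structured differently.
function parseRouterBody(buf: string): unknown {
  try {
    return JSON.parse(buf);
  } catch {
    // Plain-text replies such as "OK" are not valid JSON; wrap them.
    return { raw: buf };
  }
}
```

One caveat of this approach: a plain-text body that happens to be valid JSON (for example a bare number) is returned parsed rather than wrapped, so callers should not rely on `raw` being present for every non-object reply.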
## 5. Caddy + mTLS setup (192.168.2.2) ## 5. Caddy + mTLS setup (192.168.2.2)
@@ -238,11 +231,11 @@ registers the `ai-server` provider, and installs the admin slash commands.
|---|---|---| |---|---|---|
| `/ai-server-status` | Tabular view of models, status, ctx size | HTTPS mTLS | | `/ai-server-status` | Tabular view of models, status, ctx size | HTTPS mTLS |
| `/ai-server-refresh` | Re-discover models and re-register the provider | HTTPS mTLS | | `/ai-server-refresh` | Re-discover models and re-register the provider | HTTPS mTLS |
| `/ai-server-load <id>` | Load a model on-demand | HTTPS mTLS | | `/ai-server-load <id>` | Warm-load a model via `/upstream/<id>/health` | HTTPS mTLS |
| `/ai-server-unload <id>` | Unload a model | HTTPS mTLS | | `/ai-server-unload <id>` | Unload a model via `/api/models/unload/<id>` | HTTPS mTLS |
| `/ai-server-ctx <id> <size>` | Edit preset ctx-size, unload + reload | SSH + HTTPS | | `/ai-server-ctx <id> <size>` | Edit YAML config ctx-size, reload the model | SSH + HTTPS |
| `/ai-server-preset` | Print the server's `~/.llama-models.ini` | SSH | | `/ai-server-preset` | Print the server's llama-swap config (YAML) | SSH |
| `/ai-server-restart` | `systemctl --user restart llama-server.service` | SSH | | `/ai-server-restart` | `systemctl --user restart llama-swap.service` | SSH |
`<id>` arguments tab-complete against the live router model list. `<id>` arguments tab-complete against the live router model list.
@@ -253,14 +246,13 @@ registers the `ai-server` provider, and installs the admin slash commands.
ssh ai-server@192.168.2.3 ssh ai-server@192.168.2.3
cd ~/models && hf download <author>/<repo> --include '*<quant>*' --local-dir . cd ~/models && hf download <author>/<repo> --include '*<quant>*' --local-dir .
# Add a preset section to ~/.llama-models.ini — section name = model id # Add a config block to ~/.config/llama-swap/config.yaml (see example in §4.3)
# (see example in §4.3)
``` ```
Then from pi: Then from pi:
``` ```
/ai-server-refresh # discovers the new preset /ai-server-refresh # discovers the new model
/ai-server-load <id> # first load may take a minute for a cold GGUF /ai-server-load <id> # first load may take a minute for a cold GGUF
``` ```
@@ -268,9 +260,8 @@ No extension-side config changes are needed — discovery picks it up.
## 9. Browser access to the built-in web UI ## 9. Browser access to the built-in web UI
`llama-server` ships a SvelteKit chat UI at `/` with a model picker. Navigate to Navigate to `https://ai.shahondin1624.de/` in any browser that has the client
`https://ai.shahondin1624.de/` in any browser that has the client cert and cert and trusts the root CA.
trusts the root CA.
### 9.1 Firefox (simplest path, always works) ### 9.1 Firefox (simplest path, always works)
@@ -332,15 +323,18 @@ Verify under `brave://policy`. The policy must show status **OK**, not
| Symptom | Likely cause | Fix | | Symptom | Likely cause | Fix |
|---|---|---| |---|---|---|
| pi: `HTTP 400: request exceeds available context size` | Router started with `-c <small>`, overriding the preset's larger `ctx-size` | Remove the router-level `-c` flag from the systemd ExecStart | | pi: `HTTP 400: request exceeds available context size` | Model config has a small `--ctx-size` | Increase `--ctx-size` in the YAML config |
| pi: `HTTP 400: File Not Found` on `/models/load` | Wrong JSON body key (older versions used `id`) | Must be `{"model":"<id>"}` — the extension's `admin.ts` already does this | | pi: `HTTP 400: File Not Found` on load | Wrong model id — check `/v1/models` | Use the exact id from the models list |
| Model shows as `[unloaded]` in `/ai-server-status` | Model isn't currently loaded in llama-swap | Run `/ai-server-load <id>` to warm it |
| First request is slow | Cold model load — no preload configured | Add `hooks.on_startup.preload: [<id>]` to config |
| `certutil: unable to open …root-ca.pem` | CA file not yet scp'd locally | Copy `root-ca.pem` from the Caddy host | | `certutil: unable to open …root-ca.pem` | CA file not yet scp'd locally | Copy `root-ca.pem` from the Caddy host |
| Brave: p12 import "Invalid or corrupt file" | OpenSSL 3 default PBES2/AES-256 encryption | Regenerate with `openssl pkcs12 -legacy -export …` | | Brave: p12 import "Invalid or corrupt file" | OpenSSL 3 default PBES2/AES-256 encryption | Regenerate with `openssl pkcs12 -legacy -export …` |
| Brave: site loads but padlock is red, `ChromeRootStoreEnabled: Error` in `brave://policy` | Policy was removed upstream | Use `brave://certificate-manager/` → Custom, or use Firefox | | Brave: site loads but padlock is red | Chrome Root Store issue | Use `brave://certificate-manager/` → Custom |
| Cert selection prompt appears on every page load | `AutoSelectCertificateForUrls` policy missing or malformed | See §9.3 | | Cert selection prompt appears on every page load | `AutoSelectCertificateForUrls` policy missing or malformed | See §9.3 |
| System-trust update-ca-trust has no effect on Brave | Brave is a Flatpak; sandbox doesn't see host `/etc/pki/ca-trust` | Import directly into the sandbox's NSS DB (§9.3) | | System-trust update-ca-trust has no effect on Brave | Brave is a Flatpak; sandbox doesn't see host `/etc/pki/ca-trust` | Import directly into the sandbox's NSS DB (§9.3) |
| Model shows as `[no model path]` in `/ai-server-status` | Preset section in `~/.llama-models.ini` has no `model =` line | Add the path, then `/ai-server-refresh` | | Chat first-token latency seems long | Cold model load | First chat turn may wait 10–60s while the GGUF mmap's in |
| Chat first-token latency seems long | Cold model load is not counted separately | First chat turn may wait 10–60s while the GGUF mmap's in; subsequent turns stream immediately | | `/ai-server-restart` fails | Wrong service unit name | Check `AI_SERVER_SERVICE_UNIT` / create the proper unit |
| `/ai-server-ctx` fails | YAML format changed | Edit `~/.config/llama-swap/config.yaml` manually first |
## 11. Security notes ## 11. Security notes
@@ -348,8 +342,8 @@ Verify under `brave://policy`. The policy must show status **OK**, not
is the sole credential for API access. Treat it like an SSH key — do not is the sole credential for API access. Treat it like an SSH key — do not
share, do not commit, do not email. share, do not commit, do not email.
- To revoke a client, regenerate the root CA's cert list and remove/rename the - To revoke a client, regenerate the root CA's cert list and remove/rename the
offending client cert file on Caddy. (Proper CRL/OCSP is not set up — this is offending client cert file on Caddy. (Proper CRL/OCSP is not set up — this
a single-user deployment.) is a single-user deployment.)
- The `apiKey: "ai-server-mtls"` string in `index.ts` is a placeholder required - The `apiKey: "ai-server-mtls"` string in `index.ts` is a placeholder required
by the pi model registry; no bearer token is sent over the wire. All auth is by the pi model registry; no bearer token is sent over the wire. All auth is
cert-based. cert-based.
@@ -363,10 +357,10 @@ Verify under `brave://policy`. The policy must show status **OK**, not
| Path | Purpose | | Path | Purpose |
|---|---| |---|---|
| `~/llama.cpp/` | llama.cpp source + build tree | | `~/llama.cpp/` | llama.cpp source + build tree |
| `~/llama.cpp/build/bin/llama-server` | Binary | | `~/llama.cpp/build/bin/llama-server` | Binary (invoked by llama-swap) |
| `~/models/*.gguf` | Model weights | | `~/models/*.gguf` | Model weights |
| `~/.llama-models.ini` | Router preset file | | `~/.config/llama-swap/config.yaml` | llama-swap YAML config |
| `~/.config/systemd/user/llama-server.service` | Service unit | | `~/.config/systemd/user/llama-swap.service` | Service unit |
| `~/vram-monitor.sh` | Optional idle-unload cron helper | | `~/vram-monitor.sh` | Optional idle-unload cron helper |
### On the Caddy host (192.168.2.2) ### On the Caddy host (192.168.2.2)
+142 -26
@@ -3,21 +3,28 @@ import * as https from "node:https";
import { URL } from "node:url"; import { URL } from "node:url";
import { promisify } from "node:util"; import { promisify } from "node:util";
import { import {
AI_SERVER_MODELS_PATH,
AI_SERVER_PRESET_PATH, AI_SERVER_PRESET_PATH,
AI_SERVER_RUNNING_PATH,
AI_SERVER_SERVICE_UNIT,
AI_SERVER_SSH_HOST, AI_SERVER_SSH_HOST,
AI_SERVER_UNLOAD_ALL_PATH,
AI_SERVER_UNLOAD_PATH,
AI_SERVER_UPSTREAM_HEALTH_PATH,
AI_SERVER_URL, AI_SERVER_URL,
type ServerModel, type ServerModel,
getAdminTimeoutMs, getAdminTimeoutMs,
loadCerts, loadCerts,
} from "./config.js"; } from "./config.js";
import { import {
extractCtxSize, parseCtxMapFromYaml,
extractCtxFromRunningCmd,
isReasoningModel, isReasoningModel,
isShardArtefact, isShardArtefact,
} from "./router-utils.js"; } from "./router-utils.js";
// Re-export so existing index.ts imports keep working. // Re-export so existing index.ts imports keep working.
export { extractCtxSize, isReasoningModel }; export { isReasoningModel };
const exec = promisify(execCb); const exec = promisify(execCb);
@@ -84,12 +91,33 @@ async function routerRequest(
export interface RouterModel { export interface RouterModel {
id: string; id: string;
status: { value: "loaded" | "unloaded" | "loading"; args: string[] }; object?: string;
created?: number;
owned_by?: string;
/** Whether the model is currently loaded in llama-swap. */
running?: boolean;
} }
export async function listModels(): Promise<RouterModel[]> { export async function listModels(): Promise<RouterModel[]> {
const data = await routerRequest("GET", "/models"); // llama-swap: GET /v1/models returns { data: [{ id, object, created, owned_by }] }
return (data?.data ?? []) as RouterModel[]; // GET /running returns { running: [{ id, ... }] }
// We merge: every model from /v1/models gets a `running` flag from /running.
const [modelsRes, runningRes] = await Promise.all([
routerRequest("GET", AI_SERVER_MODELS_PATH),
routerRequest("GET", AI_SERVER_RUNNING_PATH),
]);
const models: RouterModel[] = (modelsRes?.data ?? []) as RouterModel[];
const runningIds = new Set<string>();
if (runningRes?.running && Array.isArray(runningRes.running)) {
for (const entry of runningRes.running as Record<string, unknown>[]) {
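// Entries are keyed by `model` in the RunningEntry shape used by
// discoverModels(); accept `id` as well so both shapes are handled.
const key = entry.model ?? entry.id;
if (key) runningIds.add(String(key));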
}
}
for (const m of models) {
m.running = runningIds.has(m.id);
}
return models;
} }
// Short TTL cache for listModels — tab-completion calls the completer on // Short TTL cache for listModels — tab-completion calls the completer on
@@ -113,32 +141,67 @@ export function invalidateListModelsCache(): void {
} }
export async function loadModel(id: string): Promise<unknown> { export async function loadModel(id: string): Promise<unknown> {
// The router's handler reads `body["model"]`; passing `{id}` yields a 404. // llama-swap: GET /upstream/<id>/health forces a spawn (warm load).
const r = await routerRequest("POST", "/models/load", { model: id }); // 2xx = success; plain text OK body is acceptable.
const r = await routerRequest("GET", AI_SERVER_UPSTREAM_HEALTH_PATH(id));
invalidateListModelsCache(); invalidateListModelsCache();
return r; return r;
} }
export async function unloadModel(id: string): Promise<unknown> { export async function unloadModel(id: string): Promise<unknown> {
const r = await routerRequest("POST", "/models/unload", { model: id }); // llama-swap: POST /api/models/unload/<id>, no body. Returns plain text "OK".
const r = await routerRequest("POST", AI_SERVER_UNLOAD_PATH(id));
invalidateListModelsCache(); invalidateListModelsCache();
return r; return r;
} }
// A preset is "runnable" only if it has a --model path. Placeholder sections export async function unloadAll(): Promise<unknown> {
// like [small-7b] without model = ... show up in /models but have no --model // llama-swap: POST /api/models/unload, no body.
// arg and would fail on load. const r = await routerRequest("POST", AI_SERVER_UNLOAD_ALL_PATH);
function isRunnable(m: RouterModel): boolean { invalidateListModelsCache();
return (m.status?.args ?? []).includes("--model"); return r;
}
// llama-swap /v1/models only returns registered presets (all have a model
// path). Placeholder sections are not exposed. We only filter out shard
// artefacts.
interface RunningEntry {
model: string;
cmd?: string;
state?: string;
ttl?: number;
proxy?: string;
}
async function listRunning(): Promise<RunningEntry[]> {
const res = await routerRequest("GET", AI_SERVER_RUNNING_PATH);
return Array.isArray((res as any)?.running)
? (res as any).running
: [];
} }
export async function discoverModels(): Promise<ServerModel[]> { export async function discoverModels(): Promise<ServerModel[]> {
const models = await listModels(); const [models, running, yaml] = await Promise.all([
listModels(),
listRunning().catch(() => [] as RunningEntry[]),
readPreset().catch(() => ""),
]);
const ctxFromYaml = parseCtxMapFromYaml(yaml);
const ctxFromRunning = new Map<string, number>();
for (const r of running) {
const n = extractCtxFromRunningCmd(r.cmd);
if (n) ctxFromRunning.set(r.model, n);
}
return models return models
.filter(isRunnable)
.filter((m) => !isShardArtefact(m.id)) .filter((m) => !isShardArtefact(m.id))
.map((m) => { .map((m) => {
const ctx = extractCtxSize(m) ?? 32768; const ctx =
ctxFromRunning.get(m.id) ?? // live process is authoritative
ctxFromYaml.get(m.id) ?? // config.yaml is next best
32768; // last-resort fallback
return { return {
id: m.id, id: m.id,
name: `${m.id} (AI Server)`, name: `${m.id} (AI Server)`,
@@ -177,30 +240,83 @@ export async function readPreset(): Promise<string> {
} }
/** /**
* Set a `key = value` line inside a named [section] of the preset file. * Set a `key = value` inside a named YAML section for llama-swap.
* Preserves comments and all other lines. Errors if the key is absent. *
* llama-swap config.yaml structure (relevant excerpt):
*
* models:
* Qwen_Qwen3.6-35B-A3B-Q8_0:
* cmd: |
* /path/to/llama-server --model /path/to/gguf ...
* --ctx-size 32768
* --temp 0.7
*
* This function finds the `<id>:` block under `models:`, locates the
* `--ctx-size N` line (or other supported flags), and replaces N.
*
* Supported keys: ctx-size, temp, n-gpu-layers
*/ */
export async function setPresetKey( export async function setPresetKey(
section: string, section: string,
key: string, key: string,
value: string, value: string,
): Promise<void> { ): Promise<void> {
// Map short key names to the actual CLI flag used in cmd:
const flagMap: Record<string, string> = {
"ctx-size": "--ctx-size",
"temp": "--temp",
"n-gpu-layers": "--n-gpu-layers",
};
const flag = flagMap[key] ?? `--${key}`;
// We use an awk-based approach on the YAML file:
// 1. Find the <section>: block under models:
// 2. Within that block, find the --flag N line
// 3. Replace N with the new value
//
// The awk script works line-by-line:
// - When we see ` ${section}:` under models:, enter editing mode
// - While editing, look for `--flag <number>` and replace it
// - Exit editing mode when we hit a line at the same or lesser indent
// that is not under this section
const escapedSection = section.replace(/[.[\]*/^$]/g, "\\$&");
const escapedFlag = flag.replace(/[.[\]*/^$]/g, "\\$&");
const awkScript = ` const awkScript = `
awk -v sec="[${section}]" -v key=${shQuote(key)} -v val=${shQuote(value)} ' awk -v sec="${escapedSection}" -v flag="${escapedFlag}" -v val="${value}" '
BEGIN { in_s = 0; found = 0 } BEGIN { in_sec = 0; indent = 0 }
/^\\[/ { in_s = ($0 == sec) } {
in_s && $1 == key && $2 == "=" { print key " = " val; found = 1; next } # Detect section header: " <section>:" (2-space indent, key followed by colon)
{ print } if (!in_sec && match($0, /^[[:space:]]{2}'${escapedSection}':[[:space:]]*$/)) {
END { if (!found) exit 2 } in_sec = 1;
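indent = 2;
# Print the header and move on: the header itself sits at indent 2, so
# falling through to the leave-section check below would reset in_sec.
print; next;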
}
# If we are in a section, check if we left it
if (in_sec) {
lineIndent = 0;
m = match($0, /^[[:space:]]*/);
if (m > 0) lineIndent = RLENGTH;
# If indent is <= 2 and line is not empty and not a continuation of cmd,
# we have left this section
if (lineIndent <= 2 && $0 !~ /^[[:space:]]*$/) {
in_sec = 0;
}
}
if (in_sec && match($0, " " flag " [0-9]+")) {
sub(flag " [0-9]+", flag " " val);
}
print
}
' ${AI_SERVER_PRESET_PATH} > ${AI_SERVER_PRESET_PATH}.tmp && mv ${AI_SERVER_PRESET_PATH}.tmp ${AI_SERVER_PRESET_PATH} ' ${AI_SERVER_PRESET_PATH} > ${AI_SERVER_PRESET_PATH}.tmp && mv ${AI_SERVER_PRESET_PATH}.tmp ${AI_SERVER_PRESET_PATH}
`.trim(); `.trim();
try { try {
await runSsh(awkScript); await runSsh(awkScript);
} catch (err: any) { } catch (err: any) {
const msg = err?.message ?? String(err); const msg = err?.message ?? String(err);
if (msg.includes("exit code 2") || msg.match(/exit.*2/)) { if (msg.includes("exit code 2") || msg.match(/exit.*2/)) {
throw new Error( throw new Error(
`Key "${key}" not found in [${section}] — add it to the preset manually first.`, `Key "${key}" not found for model "${section}" — add it to the preset manually first.`,
); );
} }
throw err; throw err;
@@ -209,7 +325,7 @@ awk -v sec="[${section}]" -v key=${shQuote(key)} -v val=${shQuote(value)} '
export async function restartService(): Promise<string> { export async function restartService(): Promise<string> {
return runSsh( return runSsh(
"systemctl --user restart llama-server.service && systemctl --user is-active llama-server.service", `systemctl --user restart ${AI_SERVER_SERVICE_UNIT} && systemctl --user is-active ${AI_SERVER_SERVICE_UNIT}`,
); );
} }
+22 -1
@@ -13,8 +13,29 @@ export const AI_SERVER_CHAT_PATH = "/v1/chat/completions";
// SSH target for admin operations (preset edits, systemctl). Uses key auth. // SSH target for admin operations (preset edits, systemctl). Uses key auth.
export const AI_SERVER_SSH_HOST = export const AI_SERVER_SSH_HOST =
process.env.AI_SERVER_SSH_HOST ?? "ai-server@192.168.2.3"; process.env.AI_SERVER_SSH_HOST ?? "ai-server@192.168.2.3";
// llama-swap endpoint paths
export const AI_SERVER_MODELS_PATH =
process.env.AI_SERVER_MODELS_PATH ?? "/v1/models";
export const AI_SERVER_RUNNING_PATH =
process.env.AI_SERVER_RUNNING_PATH ?? "/running";
export const AI_SERVER_UNLOAD_ALL_PATH =
process.env.AI_SERVER_UNLOAD_ALL_PATH ?? "/api/models/unload";
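// An env override may contain the literal placeholder `<id>` (as documented
// in the README table); it is substituted so the model id is not dropped.
export const AI_SERVER_UNLOAD_PATH = (id: string) =>
(process.env.AI_SERVER_UNLOAD_PATH ?? "/api/models/unload/<id>").replace(
"<id>",
encodeURIComponent(id),
);
export const AI_SERVER_UPSTREAM_HEALTH_PATH = (id: string) =>
(process.env.AI_SERVER_UPSTREAM_HEALTH_PATH ?? "/upstream/<id>/health").replace(
"<id>",
encodeURIComponent(id),
);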
// llama-swap config file (YAML, replaces old INI preset)
export const AI_SERVER_PRESET_PATH = export const AI_SERVER_PRESET_PATH =
process.env.AI_SERVER_PRESET_PATH ?? "~/.llama-models.ini"; process.env.AI_SERVER_PRESET_PATH ??
"~/.config/llama-swap/config.yaml";
// systemd service unit for llama-swap
export const AI_SERVER_SERVICE_UNIT =
process.env.AI_SERVER_SERVICE_UNIT ?? "llama-swap.service";
// Distinct api id so registering streamSimple does NOT overwrite the // Distinct api id so registering streamSimple does NOT overwrite the
// built-in openai-completions provider (the api-registry keys by api name). // built-in openai-completions provider (the api-registry keys by api name).
+4 -10
@@ -1,7 +1,6 @@
import type { ExtensionAPI } from "@mariozechner/pi-coding-agent"; import type { ExtensionAPI } from "@mariozechner/pi-coding-agent";
import { import {
discoverModels, discoverModels,
extractCtxSize,
listModels, listModels,
listModelsCached, listModelsCached,
loadModel, loadModel,
@@ -122,13 +121,8 @@ export default async function (pi: ExtensionAPI) {
const routerModels = await listModels(); const routerModels = await listModels();
const lines = [`AI Server: ${AI_SERVER_URL}`]; const lines = [`AI Server: ${AI_SERVER_URL}`];
for (const m of routerModels) { for (const m of routerModels) {
const status = m.status?.value ?? "?"; const status = m.running ? "loaded" : "unloaded";
const ctx = extractCtxSize(m); lines.push(` ${m.id} [${status}]`);
const hasModel = (m.status?.args ?? []).includes("--model");
const marker = hasModel ? " " : " [no model path]";
lines.push(
` ${m.id} [${status}] ctx=${ctx ?? "?"}${marker}`,
);
} }
ctx.ui.notify(lines.join("\n"), "info"); ctx.ui.notify(lines.join("\n"), "info");
} catch (err) { } catch (err) {
@@ -246,7 +240,7 @@ export default async function (pi: ExtensionAPI) {
}); });
pi.registerCommand("ai-server-preset", { pi.registerCommand("ai-server-preset", {
description: "Print ~/.llama-models.ini on the ai-server", description: "Print llama-swap config on the ai-server",
handler: async (_args, ctx) => { handler: async (_args, ctx) => {
try { try {
const text = await readPreset(); const text = await readPreset();
@@ -261,7 +255,7 @@ export default async function (pi: ExtensionAPI) {
   });
   pi.registerCommand("ai-server-restart", {
-    description: "Restart the ai-server llama-server service",
+    description: "Restart the ai-server llama-swap service",
     handler: async (_args, ctx) => {
       const ok = await ctx.ui.confirm(
         "Restart llama-server?",
+70 -9
@@ -6,19 +6,80 @@
 export interface RouterModelMeta {
   id: string;
-  status?: { value: string; args: string[] };
+  object?: string;
+  created?: number;
+  owned_by?: string;
+  /** Whether the model is currently loaded in llama-swap. */
+  running?: boolean;
 }
 /**
- * Pull `--ctx-size <N>` out of the worker's argv. Returns null if the flag
- * is missing, at the end of argv, or the value isn't a number.
+ * Parse ctx-size values from every model block in llama-swap's config.yaml.
+ *
+ * The YAML has a structure like:
+ *
+ *   models:
+ *     Qwen_Qwen3.6-35B-A3B-Q8_0:
+ *       cmd: |
+ *         /path/to/llama-server
+ *         --ctx-size 262144
+ *         --temp 0.7
+ *
+ * This function scans for `--ctx-size N` lines within each model block and
+ * returns a Map of id → ctxSize. If a model appears multiple times it keeps
+ * the last value found.
  */
-export function extractCtxSize(m: RouterModelMeta): number | null {
-  const args = m.status?.args ?? [];
-  const i = args.indexOf("--ctx-size");
-  if (i < 0 || i + 1 >= args.length) return null;
-  const n = Number(args[i + 1]);
-  return Number.isFinite(n) ? n : null;
+export function parseCtxMapFromYaml(yaml: string): Map<string, number> {
+  const map = new Map<string, number>();
+  let currentId: string | null = null;
+  for (const raw of yaml.split("\n")) {
+    const line = raw.replace(/\r$/, "");
+    // Skip comments / blank
+    if (!line.trim() || line.trim().startsWith("#")) continue;
+    // New model block: exactly two-space indent, "<id>:" with nothing
+    // meaningful after the colon (llama-swap uses 2-space indent under
+    // `models:`).
+    const idMatch = /^  ([A-Za-z0-9._-]+):\s*$/.exec(line);
+    if (idMatch) {
+      currentId = idMatch[1];
+      continue;
+    }
+    // Top-level key resets context (e.g. `macros:`, `hooks:`)
+    if (/^[A-Za-z]/.test(line)) {
+      currentId = null;
+      continue;
+    }
+    if (!currentId) continue;
+    // Look for --ctx-size N anywhere in the line (handles indented cmd:
+    // blocks where the flag is on its own line).
+    const ctx = /--ctx-size\s+(\d+)/.exec(line);
+    if (ctx) {
+      map.set(currentId, Number(ctx[1]));
+      currentId = null; // one ctx per model
+    }
+  }
+  return map;
+}
+/**
+ * Extract ctx-size from a /running entry's `cmd` string.
+ *
+ * The /running endpoint returns entries like:
+ *   { model: "Qwen_...", cmd: "/path/llama-server --model ... --ctx-size 262144 ...", ... }
+ *
+ * This is the authoritative source for the currently loaded model's ctx.
+ */
+export function extractCtxFromRunningCmd(cmd: string | undefined): number | null {
+  if (!cmd) return null;
+  const m = /--ctx-size\s+(\d+)/.exec(cmd);
+  return m ? Number(m[1]) : null;
 }
 /**
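The commit message gives the ctx priority as running cmd → yaml → 32768 fallback. A minimal sketch of how the two sources could be combined, assuming a hypothetical `resolveCtx` helper and `DEFAULT_CTX` constant (both names are illustrative; the actual `discoverModels()` body is not part of this diff):

```typescript
// Fallback used when neither source reports a context size
// (value taken from the commit message; the constant name is an assumption).
const DEFAULT_CTX = 32768;

// Same regex approach as extractCtxFromRunningCmd, repeated here so the
// sketch is self-contained.
function ctxFromCmd(cmd: string | undefined): number | null {
  if (!cmd) return null;
  const m = /--ctx-size\s+(\d+)/.exec(cmd);
  return m ? Number(m[1]) : null;
}

// Hypothetical resolver: running cmd first, then the YAML map, then the default.
function resolveCtx(
  id: string,
  runningCmd: string | undefined,
  yamlCtx: Map<string, number>,
): number {
  return ctxFromCmd(runningCmd) ?? yamlCtx.get(id) ?? DEFAULT_CTX;
}

const yamlCtx = new Map<string, number>([["model-a", 131072]]);
const fromRunning = resolveCtx("model-a", "/path/llama-server --ctx-size 262144", yamlCtx);
const fromYaml = resolveCtx("model-a", undefined, yamlCtx);
const fromDefault = resolveCtx("model-b", undefined, yamlCtx);
console.log(fromRunning, fromYaml, fromDefault);
```

Using `??` rather than `||` keeps an explicit `--ctx-size 0` from falling through to the next source, which may or may not match what the real implementation does.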
+81 -24
@@ -7,45 +7,102 @@
 import assert from "node:assert/strict";
 import { test } from "node:test";
 import {
-  extractCtxSize,
+  parseCtxMapFromYaml,
+  extractCtxFromRunningCmd,
   isReasoningModel,
   isShardArtefact,
 } from "../ai-server/router-utils.ts";
-// ── extractCtxSize ──────────────────────────────────────────────────────
-test("extractCtxSize: --ctx-size present with value", () => {
-  const m = {
-    id: "x",
-    status: { value: "loaded", args: ["--host", "127.0.0.1", "--ctx-size", "131072"] },
-  };
-  assert.equal(extractCtxSize(m), 131072);
+// ── parseCtxMapFromYaml ─────────────────────────────────────────────────
+test("parseCtxMapFromYaml: extracts ctx-size from model blocks", () => {
+  const yaml = `
+models:
+  Qwen_Qwen3.6-35B-A3B-Q8_0:
+    cmd: |
+      /home/ai-server/llama.cpp/build/bin/llama-server
+      --model /home/ai-server/models/Qwen_Qwen3.6-35B-A3B-Q8_0.gguf
+      --ctx-size 262144
+      --temp 0.7
+  MiniMax-M2.7-IQ3_XXS:
+    cmd: |
+      /home/ai-server/llama.cpp/build/bin/llama-server
+      --model /home/ai-server/models/MiniMax-M2.7-UD-IQ3_XXS.gguf
+      --ctx-size 131072
+      --temp 1.0
+`;
+  const map = parseCtxMapFromYaml(yaml);
+  assert.equal(map.get("Qwen_Qwen3.6-35B-A3B-Q8_0"), 262144);
+  assert.equal(map.get("MiniMax-M2.7-IQ3_XXS"), 131072);
+  assert.equal(map.size, 2);
 });
-test("extractCtxSize: missing --ctx-size -> null", () => {
-  assert.equal(extractCtxSize({ id: "x", status: { value: "loaded", args: ["--host", "127"] } }), null);
+test("parseCtxMapFromYaml: skips comments and blank lines", () => {
+  const yaml = `
+# This is a comment
+models:
+  # Model with large context
+  Qwen_Qwen3.6-35B-A3B-Q8_0:
+    cmd: |
+      /path/to/server
+      --ctx-size 65536
+      --temp 0.7
+`;
+  const map = parseCtxMapFromYaml(yaml);
+  assert.equal(map.get("Qwen_Qwen3.6-35B-A3B-Q8_0"), 65536);
 });
-test("extractCtxSize: --ctx-size at end of argv -> null (no value)", () => {
-  assert.equal(extractCtxSize({ id: "x", status: { value: "loaded", args: ["--ctx-size"] } }), null);
+test("parseCtxMapFromYaml: resets on top-level keys", () => {
+  const yaml = `
+models:
+  Qwen_Qwen3.6-35B-A3B-Q8_0:
+    cmd: |
+      /path/to/server
+      --ctx-size 262144
+hooks:
+  on_startup:
+    preload:
+      - Qwen_Qwen3.6-35B-A3B-Q8_0
+`;
+  const map = parseCtxMapFromYaml(yaml);
+  assert.equal(map.get("Qwen_Qwen3.6-35B-A3B-Q8_0"), 262144);
+  // "preload" is not a valid model id pattern, but even if it were,
+  // it's under hooks: so should not be included.
+  assert.ok(!map.has("preload"));
 });
-test("extractCtxSize: non-numeric value -> null", () => {
-  assert.equal(
-    extractCtxSize({ id: "x", status: { value: "loaded", args: ["--ctx-size", "notanumber"] } }),
-    null,
-  );
+test("parseCtxMapFromYaml: empty yaml returns empty map", () => {
+  const map = parseCtxMapFromYaml("");
+  assert.equal(map.size, 0);
 });
-test("extractCtxSize: zero is valid (not null)", () => {
-  assert.equal(
-    extractCtxSize({ id: "x", status: { value: "loaded", args: ["--ctx-size", "0"] } }),
-    0,
-  );
+test("parseCtxMapFromYaml: model without ctx-size is skipped", () => {
+  const yaml = `
+models:
+  SmallModel:
+    cmd: |
+      /path/to/server
+      --temp 0.7
+`;
+  const map = parseCtxMapFromYaml(yaml);
+  assert.equal(map.get("SmallModel"), undefined);
+  assert.equal(map.size, 0);
 });
-test("extractCtxSize: missing status entirely -> null", () => {
-  assert.equal(extractCtxSize({ id: "x" }), null);
+// ── extractCtxFromRunningCmd ────────────────────────────────────────────
+test("extractCtxFromRunningCmd: parses --ctx-size from cmd string", () => {
+  const cmd = "/home/ai-server/llama.cpp/build/bin/llama-server --model /home/ai-server/models/Qwen.gguf --ctx-size 262144 --temp 0.7";
+  assert.equal(extractCtxFromRunningCmd(cmd), 262144);
+});
+test("extractCtxFromRunningCmd: undefined cmd returns null", () => {
+  assert.equal(extractCtxFromRunningCmd(undefined), null);
+});
+test("extractCtxFromRunningCmd: cmd without --ctx-size returns null", () => {
+  assert.equal(extractCtxFromRunningCmd("/path/to/server --temp 0.7"), null);
 });
 // ── isShardArtefact ─────────────────────────────────────────────────────