migrate ai-server extension from llama.cpp router to llama-swap
Endpoint rewrites:
- GET /v1/models + /running → merged listModels() with running flag
- POST /models/load → GET /upstream/<id>/health (warm load)
- POST /models/unload → POST /api/models/unload/<id> (no body)
- Added POST /api/models/unload for unloadAll()
Config migration:
- Preset path: ~/.llama-models.ini → ~/.config/llama-swap/config.yaml
- Service unit: llama-server.service → llama-swap.service
- setPresetKey() rewritten from INI awk to YAML-aware awk for
editing --ctx-size/--temp/--n-gpu-layers in cmd: blocks
Per-model ctx-size (fixes 0/33k bug):
- parseCtxMapFromYaml(): walks config.yaml, extracts --ctx-size N per
model block → Map<id, ctxSize>
- extractCtxFromRunningCmd(): parses --ctx-size from /running cmd string
- discoverModels(): Promise.all(listModels, listRunning, readPreset),
ctx priority: running cmd → yaml → 32768 fallback
- Removed broken extractCtxSize stub and dangling imports
Tests: 14 passing (parseCtxMapFromYaml ×5, extractCtxFromRunningCmd ×3,
isShardArtefact ×3, isReasoningModel ×3)
README: full rewrite covering the llama-swap architecture, YAML config format,
and new endpoints; troubleshooting table updated.
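
The ctx priority described above (running cmd, then yaml, then the 32768 fallback) can be sketched roughly as follows. This is an illustrative sketch only: `extractCtxFromCmd` and `resolveCtx` are hypothetical names, and the real helpers in router-utils may parse the command string differently.

```typescript
// Hypothetical sketch; names and regex are assumptions, not the extension's code.
const FALLBACK_CTX = 32768;

/** Pull the integer following `--ctx-size` out of a command string, if any. */
function extractCtxFromCmd(cmd: string | undefined): number | undefined {
  if (!cmd) return undefined;
  const m = /(?:^|\s)--ctx-size[=\s]+(\d+)(?=\s|$)/.exec(cmd);
  if (!m) return undefined;
  const n = Number.parseInt(m[1] ?? "", 10);
  return Number.isFinite(n) && n > 0 ? n : undefined;
}

/** Resolve a model's context size: live process first, then config, then fallback. */
function resolveCtx(
  id: string,
  runningCmds: Map<string, string>, // model id -> cmd string reported by /running
  yamlCtx: Map<string, number>,     // model id -> --ctx-size found in config.yaml
): number {
  return extractCtxFromCmd(runningCmds.get(id)) ?? yamlCtx.get(id) ?? FALLBACK_CTX;
}
```

Keeping the resolution in one pure function makes the ordering easy to unit-test without a live server.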
+87
-93
@@ -1,9 +1,9 @@
|
|||||||
# ai-server — PI extension for a self-hosted llama.cpp router behind mTLS
|
# ai-server — PI extension for a self-hosted llama-swap server behind mTLS
|
||||||
|
|
||||||
A multi-file pi extension that exposes a remote llama.cpp router as a provider
|
A multi-file pi extension that exposes a remote llama-swap instance as a
|
||||||
to pi, with dynamic model discovery and admin slash commands. Chat streams use
|
provider to pi, with dynamic model discovery and admin slash commands. Chat
|
||||||
client-certificate TLS so the endpoint can be exposed over the public internet
|
streams use client-certificate TLS so the endpoint can be exposed over the
|
||||||
without a bearer token.
|
public internet without a bearer token.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -11,21 +11,21 @@ without a bearer token.
|
|||||||
|
|
||||||
```
|
```
|
||||||
┌────────────┐ mTLS (HTTPS) ┌──────────────┐ HTTP ┌─────────────────┐
|
┌────────────┐ mTLS (HTTPS) ┌──────────────┐ HTTP ┌─────────────────┐
|
||||||
│ pi client │───────────────►│ Caddy │────────►│ llama-server │
|
│ pi client │───────────────►│ Caddy │────────►│ llama-swap │
|
||||||
│ (this ext) │ │ 192.168.2.2 │ │ 192.168.2.3:8080 │
|
│ (this ext) │ │ 192.168.2.2 │ │ 192.168.2.3:8080 │
|
||||||
└────────────┘ client cert │ ai.… │ │ router mode │
|
└────────────┘ client cert │ ai.… │ │ swap mode │
|
||||||
└──────────────┘ │ --models-max 1 │
|
└──────────────┘ │ globalTTL: 1800 │
|
||||||
|
│ scheduler: one │
|
||||||
└─────────────────┘
|
└─────────────────┘
|
||||||
│
|
│
|
||||||
~/.llama-models.ini
|
~/.config/llama-swap/config.yaml
|
||||||
(per-model presets)
|
(YAML model config)
|
||||||
```
|
```
|
||||||
|
|
||||||
- **Caddy** terminates TLS and enforces `require_and_verify` client-cert auth on
|
- **Caddy** terminates TLS and enforces `require_and_verify` client-cert auth
|
||||||
`ai.shahondin1624.de`. Plaintext HTTP is forwarded to the llama-server router.
|
on `ai.shahondin1624.de`. Plaintext HTTP is forwarded to llama-swap.
|
||||||
- **llama-server** runs in `--models-mode router` with `--models-max 1`, so
|
- **llama-swap** runs in swap mode, managing model lifecycle (load/unload/swap)
|
||||||
exactly one worker is loaded at a time; selecting a different model unloads
|
with a YAML config at `~/.config/llama-swap/config.yaml`.
|
||||||
the previous one.
|
|
||||||
- **This extension** performs OpenAI-compatible chat streaming over mTLS and
|
- **This extension** performs OpenAI-compatible chat streaming over mTLS and
|
||||||
surfaces admin endpoints as pi slash commands.
|
surfaces admin endpoints as pi slash commands.
|
||||||
|
|
||||||
@@ -37,7 +37,7 @@ without a bearer token.
|
|||||||
├── config.ts URLs, SSH host, cert paths, MODELS[] fallback
|
├── config.ts URLs, SSH host, cert paths, MODELS[] fallback
|
||||||
├── messages.ts Context → OpenAI chat/completions messages
|
├── messages.ts Context → OpenAI chat/completions messages
|
||||||
├── stream.ts custom streamSimple: SSE parse, mTLS HTTPS, pi-ai events
|
├── stream.ts custom streamSimple: SSE parse, mTLS HTTPS, pi-ai events
|
||||||
├── admin.ts router HTTP client + SSH helpers (preset edit, systemctl)
|
├── admin.ts router HTTP client + SSH helpers (YAML edit, systemctl)
|
||||||
└── README.md this file
|
└── README.md this file
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -54,61 +54,63 @@ All are optional — the defaults match the current host.
|
|||||||
| `AI_SERVER_CLIENT_KEY` | `<certs>/client-key.pem` | Client private key |
|
| `AI_SERVER_CLIENT_KEY` | `<certs>/client-key.pem` | Client private key |
|
||||||
| `AI_SERVER_TIMEOUT_MS` | `300000` | Per-request stream timeout |
|
| `AI_SERVER_TIMEOUT_MS` | `300000` | Per-request stream timeout |
|
||||||
| `AI_SERVER_SSH_HOST` | `ai-server@192.168.2.3` | SSH target for admin commands |
|
| `AI_SERVER_SSH_HOST` | `ai-server@192.168.2.3` | SSH target for admin commands |
|
||||||
| `AI_SERVER_PRESET_PATH` | `~/.llama-models.ini` | Preset path on the SSH target |
|
| `AI_SERVER_PRESET_PATH` | `~/.config/llama-swap/config.yaml` | YAML config on the SSH target |
|
||||||
|
| `AI_SERVER_SERVICE_UNIT` | `llama-swap.service` | systemd unit name |
|
||||||
|
| `AI_SERVER_MODELS_PATH` | `/v1/models` | Models list endpoint |
|
||||||
|
| `AI_SERVER_RUNNING_PATH` | `/running` | Currently running models endpoint |
|
||||||
|
| `AI_SERVER_UNLOAD_PATH` | `/api/models/unload/<id>` | Unload single model |
|
||||||
|
| `AI_SERVER_UNLOAD_ALL_PATH` | `/api/models/unload` | Unload all models |
|
||||||
|
| `AI_SERVER_UPSTREAM_HEALTH_PATH` | `/upstream/<id>/health` | Warm-load / health endpoint |
|
||||||
|
|
||||||
## 4. Server-side setup (192.168.2.3)
|
## 4. Server-side setup (192.168.2.3)
|
||||||
|
|
||||||
### 4.1 llama.cpp build
|
### 4.1 llama-swap install
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
git clone https://github.com/ggerganov/llama.cpp ~/llama.cpp
|
npm install -g llama-swap
|
||||||
cd ~/llama.cpp && cmake -B build -DGGML_VULKAN=ON && cmake --build build --config Release -j$(nproc)
|
# or use the binary release from the llama-swap GitHub repo
|
||||||
```
|
```
|
||||||
|
|
||||||
Vulkan is used for GPU offload on the Strix Halo iGPU (no ROCm needed). The
|
|
||||||
binary ends up at `~/llama.cpp/build/bin/llama-server`.
|
|
||||||
|
|
||||||
### 4.2 Model storage
|
### 4.2 Model storage
|
||||||
|
|
||||||
```
|
```
|
||||||
~/models/<model-name>.gguf
|
~/models/<model-name>.gguf
|
||||||
```
|
```
|
||||||
|
|
||||||
Multi-shard GGUFs (`*-00001-of-NNNNN.gguf`) work too — point the preset at the
|
### 4.3 Config file — `~/.config/llama-swap/config.yaml`
|
||||||
first shard and llama.cpp auto-loads the rest.
|
|
||||||
|
|
||||||
### 4.3 Preset file — `~/.llama-models.ini`
|
llama-swap uses a YAML config file. Each model is defined under `models:` with
|
||||||
|
a `cmd:` block containing the llama-server invocation.
|
||||||
|
|
||||||
Router mode consults this file. Each `[section]` is a model id usable in API
|
```yaml
|
||||||
requests. The section name and `model =` path are the only required fields;
|
globalTTL: 1800
|
||||||
the rest become `--flag value` args to the per-model worker when it spawns.
|
models:
|
||||||
|
Qwen_Qwen3.6-35B-A3B-Q8_0:
|
||||||
|
cmd: |
|
||||||
|
/home/ai-server/llama.cpp/build/bin/llama-server
|
||||||
|
--model /home/ai-server/models/Qwen_Qwen3.6-35B-A3B-Q8_0.gguf
|
||||||
|
--ctx-size 262144
|
||||||
|
--temp 0.7
|
||||||
|
--cache-type-k q8_0
|
||||||
|
--cache-type-v q8_0
|
||||||
|
--n-gpu-layers 99
|
||||||
|
|
||||||
```ini
|
MiniMax-M2.7-IQ3_XXS:
|
||||||
[Qwen_Qwen3.6-35B-A3B-Q8_0]
|
cmd: |
|
||||||
model = /home/ai-server/models/Qwen_Qwen3.6-35B-A3B-Q8_0.gguf
|
/home/ai-server/llama.cpp/build/bin/llama-server
|
||||||
ctx-size = 262144
|
--model /home/ai-server/models/MiniMax-M2.7-UD-IQ3_XXS.gguf
|
||||||
temp = 0.7
|
--ctx-size 131072
|
||||||
cache-type-k = q8_0
|
--temp 1.0
|
||||||
cache-type-v = q8_0
|
--cache-type-k q8_0
|
||||||
n-gpu-layers = 99
|
--cache-type-v q8_0
|
||||||
|
--n-gpu-layers 99
|
||||||
[MiniMax-M2.7-IQ3_XXS]
|
|
||||||
model = /home/ai-server/models/MiniMax-M2.7-UD-IQ3_XXS-00001-of-NNNNN.gguf
|
|
||||||
ctx-size = 131072
|
|
||||||
temp = 1.0
|
|
||||||
cache-type-k = q8_0
|
|
||||||
cache-type-v = q8_0
|
|
||||||
n-gpu-layers = 99
|
|
||||||
```
|
```
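
To illustrate how a per-model `--ctx-size` could be read back out of a config shaped like the example above, here is a minimal line-based sketch. It assumes model ids sit at a two-space indent under `models:` and that each `cmd:` block carries at most one `--ctx-size`; `parseCtxMap` is a hypothetical name, and the extension's actual `parseCtxMapFromYaml()` may work differently.

```typescript
// Hypothetical sketch: walks the YAML text line by line without a YAML library.
function parseCtxMap(yaml: string): Map<string, number> {
  const out = new Map<string, number>();
  let inModels = false;
  let current: string | undefined;
  for (const line of yaml.split(/\r?\n/)) {
    if (/^\s*(#.*)?$/.test(line)) continue; // skip blank lines and comments
    if (/^\S/.test(line)) {
      // A top-level key: only `models:` opens the section we care about.
      inModels = /^models:\s*$/.test(line);
      current = undefined;
      continue;
    }
    if (!inModels) continue;
    // A model header is a key at exactly two spaces of indent, e.g. "  my-model:".
    const header = /^ {2}([^\s:#][^:]*):\s*$/.exec(line);
    if (header) {
      current = header[1];
      continue;
    }
    if (!current || out.has(current)) continue;
    const m = /(?:^|\s)--ctx-size[=\s]+(\d+)(?=\s|$)/.exec(line);
    if (m) out.set(current, Number.parseInt(m[1] ?? "", 10));
  }
  return out;
}
```

A real YAML parser would be more robust against quoting and alternative indentation; the line-based walk is only workable because the config layout is fixed.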
|
||||||
|
|
||||||
Placeholder sections (without `model =`) show up in `GET /models` but are
|
### 4.4 Systemd user service — `~/.config/systemd/user/llama-swap.service`
|
||||||
filtered out by the extension's discovery — they would fail on load.
|
|
||||||
|
|
||||||
### 4.4 Systemd user service — `~/.config/systemd/user/llama-server.service`
|
|
||||||
|
|
||||||
```ini
|
```ini
|
||||||
[Unit]
|
[Unit]
|
||||||
Description=LLaMA.cpp AI Server (Router Mode, Vulkan)
|
Description=LLaMA-swap AI Server (Swap Mode)
|
||||||
After=network.target
|
After=network.target
|
||||||
Wants=network.target
|
Wants=network.target
|
||||||
|
|
||||||
@@ -117,16 +119,10 @@ Type=simple
|
|||||||
User=ai-server
|
User=ai-server
|
||||||
Group=ai-server
|
Group=ai-server
|
||||||
WorkingDirectory=/home/ai-server
|
WorkingDirectory=/home/ai-server
|
||||||
ExecStart=/home/ai-server/llama.cpp/build/bin/llama-server \
|
ExecStart=/home/ai-server/node_modules/.bin/llama-swap \
|
||||||
--host 0.0.0.0 \
|
--host 0.0.0.0 \
|
||||||
--port 8080 \
|
--port 8080 \
|
||||||
--models-dir /home/ai-server/models \
|
--config /home/ai-server/.config/llama-swap/config.yaml
|
||||||
--models-max 1 \
|
|
||||||
--models-autoload \
|
|
||||||
--models-preset /home/ai-server/.llama-models.ini \
|
|
||||||
--gpu-layers 99 \
|
|
||||||
--cache-type-k q8_0 \
|
|
||||||
--cache-type-v q8_0
|
|
||||||
|
|
||||||
LimitNOFILE=65536
|
LimitNOFILE=65536
|
||||||
LimitMEMLOCK=unlimited
|
LimitMEMLOCK=unlimited
|
||||||
@@ -141,18 +137,10 @@ StandardError=journal
|
|||||||
WantedBy=default.target
|
WantedBy=default.target
|
||||||
```
|
```
|
||||||
|
|
||||||
Important flags:
|
|
||||||
|
|
||||||
- **No `-c <N>`** at the router level. That flag is inherited by every child
|
|
||||||
worker and silently caps the preset's `ctx-size`. Let per-model presets win.
|
|
||||||
- **`--models-max 1`** enforces single-model concurrency (matters on shared
|
|
||||||
unified-memory hardware where two workers would fight for VRAM).
|
|
||||||
- **`--models-autoload`** spawns workers on demand via `POST /models/load`.
|
|
||||||
|
|
||||||
Enable and start:
|
Enable and start:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
systemctl --user daemon-reload && systemctl --user enable --now llama-server.service
|
systemctl --user daemon-reload && systemctl --user enable --now llama-swap.service
|
||||||
loginctl enable-linger $(whoami) # keep user services running across logouts
|
loginctl enable-linger $(whoami) # keep user services running across logouts
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -160,12 +148,17 @@ loginctl enable-linger $(whoami) # keep user services running across logouts
|
|||||||
|
|
||||||
| Method | Path | Body | Notes |
|
| Method | Path | Body | Notes |
|
||||||
|---|---|---|---|
|
|---|---|---|---|
|
||||||
| `GET` | `/models` | — | List models; `status.args` contains the spawned worker's command line |
|
| `GET` | `/v1/models` | — | List models; `{"data":[{id,object,created,owned_by}]}` |
|
||||||
| `POST` | `/models/load` | `{"model":"<id>"}` | Payload key is `model`, **not** `id` |
|
| `GET` | `/running` | — | Currently loaded models; `{"running":[{id,...}]}` |
|
||||||
| `POST` | `/models/unload` | `{"model":"<id>"}` | Same |
|
| `POST` | `/api/models/unload` | — | Unload all models; returns `{"msg":"ok"}` |
|
||||||
| `GET` | `/health` | — | `{"status":"ok"}` when router is up |
|
| `POST` | `/api/models/unload/<id>` | — | Unload specific model; plain text `OK` |
|
||||||
|
| `GET` | `/upstream/<id>/health` | — | Warm-load model (forces spawn without inference) |
|
||||||
|
| `GET` | `/health` | — | Plain text `OK` (not JSON) |
|
||||||
| `POST` | `/v1/chat/completions` | OpenAI Chat Completions payload | What pi and the web UI use |
|
| `POST` | `/v1/chat/completions` | OpenAI Chat Completions payload | What pi and the web UI use |
|
||||||
| `GET` | `/` | — | Built-in SvelteKit chat UI with a model picker |
|
|
||||||
|
> **Note:** Response bodies are mixed JSON and plain text. The extension's
|
||||||
|
> `routerRequest()` falls back to `{raw: buf}` for non-JSON responses, so
|
||||||
|
> unload calls won't crash — they'll return `{raw: "OK"}`.
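
The fallback described in the note can be sketched as below. This is a minimal illustration, assuming the body has already been collected into a string; `parseRouterBody` is a hypothetical name and not necessarily how `routerRequest()` is structured.

```typescript
// Hypothetical sketch of a JSON-or-plain-text response parser.
function parseRouterBody(buf: string): unknown {
  try {
    // JSON endpoints such as /v1/models and /running parse normally.
    return JSON.parse(buf);
  } catch {
    // Plain-text endpoints (for example a bare "OK") are wrapped instead of throwing.
    return { raw: buf };
  }
}
```

Callers that only care about success can then rely on the HTTP status code and treat the parsed body as informational.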
|
||||||
|
|
||||||
## 5. Caddy + mTLS setup (192.168.2.2)
|
## 5. Caddy + mTLS setup (192.168.2.2)
|
||||||
|
|
||||||
@@ -238,11 +231,11 @@ registers the `ai-server` provider, and installs the admin slash commands.
|
|||||||
|---|---|---|
|
|---|---|---|
|
||||||
| `/ai-server-status` | Tabular view of models, status, ctx size | HTTPS mTLS |
|
| `/ai-server-status` | Tabular view of models, status, ctx size | HTTPS mTLS |
|
||||||
| `/ai-server-refresh` | Re-discover models and re-register the provider | HTTPS mTLS |
|
| `/ai-server-refresh` | Re-discover models and re-register the provider | HTTPS mTLS |
|
||||||
| `/ai-server-load <id>` | Load a model on-demand | HTTPS mTLS |
|
| `/ai-server-load <id>` | Warm-load a model via `/upstream/<id>/health` | HTTPS mTLS |
|
||||||
| `/ai-server-unload <id>` | Unload a model | HTTPS mTLS |
|
| `/ai-server-unload <id>` | Unload a model via `/api/models/unload/<id>` | HTTPS mTLS |
|
||||||
| `/ai-server-ctx <id> <size>` | Edit preset ctx-size, unload + reload | SSH + HTTPS |
|
| `/ai-server-ctx <id> <size>` | Edit YAML config ctx-size, reload the model | SSH + HTTPS |
|
||||||
| `/ai-server-preset` | Print the server's `~/.llama-models.ini` | SSH |
|
| `/ai-server-preset` | Print the server's llama-swap config (YAML) | SSH |
|
||||||
| `/ai-server-restart` | `systemctl --user restart llama-server.service` | SSH |
|
| `/ai-server-restart` | `systemctl --user restart llama-swap.service` | SSH |
|
||||||
|
|
||||||
`<id>` arguments tab-complete against the live router model list.
|
`<id>` arguments tab-complete against the live router model list.
|
||||||
|
|
||||||
@@ -253,14 +246,13 @@ registers the `ai-server` provider, and installs the admin slash commands.
|
|||||||
ssh ai-server@192.168.2.3
|
ssh ai-server@192.168.2.3
|
||||||
cd ~/models && hf download <author>/<repo> --include '*<quant>*' --local-dir .
|
cd ~/models && hf download <author>/<repo> --include '*<quant>*' --local-dir .
|
||||||
|
|
||||||
# Add a preset section to ~/.llama-models.ini — section name = model id
|
# Add a config block to ~/.config/llama-swap/config.yaml (see example in §4.3)
|
||||||
# (see example in §4.3)
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Then from pi:
|
Then from pi:
|
||||||
|
|
||||||
```
|
```
|
||||||
/ai-server-refresh # discovers the new preset
|
/ai-server-refresh # discovers the new model
|
||||||
/ai-server-load <id> # first load may take a minute for a cold GGUF
|
/ai-server-load <id> # first load may take a minute for a cold GGUF
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -268,9 +260,8 @@ No extension-side config changes are needed — discovery picks it up.
|
|||||||
|
|
||||||
## 9. Browser access to the built-in web UI
|
## 9. Browser access to the built-in web UI
|
||||||
|
|
||||||
`llama-server` ships a SvelteKit chat UI at `/` with a model picker. Navigate to
|
Navigate to `https://ai.shahondin1624.de/` in any browser that has the client
|
||||||
`https://ai.shahondin1624.de/` in any browser that has the client cert and
|
cert and trusts the root CA.
|
||||||
trusts the root CA.
|
|
||||||
|
|
||||||
### 9.1 Firefox (simplest path, always works)
|
### 9.1 Firefox (simplest path, always works)
|
||||||
|
|
||||||
@@ -332,15 +323,18 @@ Verify under `brave://policy`. The policy must show status **OK**, not
|
|||||||
|
|
||||||
| Symptom | Likely cause | Fix |
|
| Symptom | Likely cause | Fix |
|
||||||
|---|---|---|
|
|---|---|---|
|
||||||
| pi: `HTTP 400: request exceeds available context size` | Router started with `-c <small>`, overriding the preset's larger `ctx-size` | Remove the router-level `-c` flag from the systemd ExecStart |
|
| pi: `HTTP 400: request exceeds available context size` | Model config has a small `--ctx-size` | Increase `--ctx-size` in the YAML config |
|
||||||
| pi: `HTTP 400: File Not Found` on `/models/load` | Wrong JSON body key (older versions used `id`) | Must be `{"model":"<id>"}` — the extension's `admin.ts` already does this |
|
| pi: `HTTP 400: File Not Found` on load | Wrong model id — check `/v1/models` | Use the exact id from the models list |
|
||||||
|
| Model shows as `[unloaded]` in `/ai-server-status` | Model isn't currently loaded in llama-swap | Run `/ai-server-load <id>` to warm it |
|
||||||
|
| First request is slow | Cold model load — no preload configured | Add `hooks.on_startup.preload: [<id>]` to config |
|
||||||
| `certutil: unable to open …root-ca.pem` | CA file not yet scp'd locally | Copy `root-ca.pem` from the Caddy host |
|
| `certutil: unable to open …root-ca.pem` | CA file not yet scp'd locally | Copy `root-ca.pem` from the Caddy host |
|
||||||
| Brave: p12 import "Invalid or corrupt file" | OpenSSL 3 default PBES2/AES-256 encryption | Regenerate with `openssl pkcs12 -legacy -export …` |
|
| Brave: p12 import "Invalid or corrupt file" | OpenSSL 3 default PBES2/AES-256 encryption | Regenerate with `openssl pkcs12 -legacy -export …` |
|
||||||
| Brave: site loads but padlock is red, `ChromeRootStoreEnabled: Error` in `brave://policy` | Policy was removed upstream | Use `brave://certificate-manager/` → Custom, or use Firefox |
|
| Brave: site loads but padlock is red | Chrome Root Store issue | Use `brave://certificate-manager/` → Custom |
|
||||||
| Cert selection prompt appears on every page load | `AutoSelectCertificateForUrls` policy missing or malformed | See §9.3 |
|
| Cert selection prompt appears on every page load | `AutoSelectCertificateForUrls` policy missing or malformed | See §9.3 |
|
||||||
| System-trust update-ca-trust has no effect on Brave | Brave is a Flatpak; sandbox doesn't see host `/etc/pki/ca-trust` | Import directly into the sandbox's NSS DB (§9.3) |
|
| System-trust update-ca-trust has no effect on Brave | Brave is a Flatpak; sandbox doesn't see host `/etc/pki/ca-trust` | Import directly into the sandbox's NSS DB (§9.3) |
|
||||||
| Model shows as `[no model path]` in `/ai-server-status` | Preset section in `~/.llama-models.ini` has no `model =` line | Add the path, then `/ai-server-refresh` |
|
| Chat first-token latency seems long | Cold model load | First chat turn may wait 10–60s while the GGUF is memory-mapped; later turns stream immediately |
|
||||||
| Chat first-token latency seems long | Cold model load is not counted separately | First chat turn may wait 10–60s while the GGUF mmap's in; subsequent turns stream immediately |
|
| `/ai-server-restart` fails | Wrong service unit name | Check `AI_SERVER_SERVICE_UNIT` / create the proper unit |
|
||||||
|
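| `/ai-server-ctx` fails | The flag is missing from the model's `cmd:` block, or the YAML layout differs from what the editor expects | Add the flag to `~/.config/llama-swap/config.yaml` manually, then retry |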
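
When `/ai-server-ctx` fails, it can help to reproduce the edit locally on a copy of the config. The sketch below is one way to rewrite a flag inside a single model block in plain TypeScript. It is not the extension's implementation (which runs awk over SSH); `setFlagInModelBlock` and `escapeRe` are hypothetical helpers, and the two-space model-header layout is an assumption taken from the example in §4.3.

```typescript
// Hypothetical sketch: pure-string rewrite of `--flag value` inside one model block.
function escapeRe(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function setFlagInModelBlock(yaml: string, id: string, flag: string, value: string): string {
  const flagRe = new RegExp(`(^|\\s)${escapeRe(flag)}([=\\s]+)\\S+`);
  const out: string[] = [];
  let inBlock = false;
  let replaced = false;
  for (const line of yaml.split("\n")) {
    const header = /^ {2}([^\s:#][^:]*):\s*$/.exec(line);
    if (/^\S/.test(line)) {
      inBlock = false; // a top-level key ends any model block
    } else if (header) {
      inBlock = header[1] === id; // entering (or leaving) the target block
    } else if (inBlock && !replaced && flagRe.test(line)) {
      out.push(
        line.replace(flagRe, (_m: string, pre: string, sep: string) => `${pre}${flag}${sep}${value}`),
      );
      replaced = true;
      continue;
    }
    out.push(line);
  }
  if (!replaced) throw new Error(`${flag} not found for model "${id}"`);
  return out.join("\n");
}
```

Throwing when the flag is absent mirrors the advice in the table: the flag has to exist in the `cmd:` block before it can be edited.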
|
||||||
|
|
||||||
## 11. Security notes
|
## 11. Security notes
|
||||||
|
|
||||||
@@ -348,8 +342,8 @@ Verify under `brave://policy`. The policy must show status **OK**, not
|
|||||||
is the sole credential for API access. Treat it like an SSH key — do not
|
is the sole credential for API access. Treat it like an SSH key — do not
|
||||||
share, do not commit, do not email.
|
share, do not commit, do not email.
|
||||||
- To revoke a client, regenerate the root CA's cert list and remove/rename the
|
- To revoke a client, regenerate the root CA's cert list and remove/rename the
|
||||||
offending client cert file on Caddy. (Proper CRL/OCSP is not set up — this is
|
offending client cert file on Caddy. (Proper CRL/OCSP is not set up — this
|
||||||
a single-user deployment.)
|
is a single-user deployment.)
|
||||||
- The `apiKey: "ai-server-mtls"` string in `index.ts` is a placeholder required
|
- The `apiKey: "ai-server-mtls"` string in `index.ts` is a placeholder required
|
||||||
by the pi model registry; no bearer token is sent over the wire. All auth is
|
by the pi model registry; no bearer token is sent over the wire. All auth is
|
||||||
cert-based.
|
cert-based.
|
||||||
@@ -363,10 +357,10 @@ Verify under `brave://policy`. The policy must show status **OK**, not
|
|||||||
| Path | Purpose |
|
| Path | Purpose |
|
||||||
|---|---|
|
|---|---|
|
||||||
| `~/llama.cpp/` | llama.cpp source + build tree |
|
| `~/llama.cpp/` | llama.cpp source + build tree |
|
||||||
| `~/llama.cpp/build/bin/llama-server` | Binary |
|
| `~/llama.cpp/build/bin/llama-server` | Binary (invoked by llama-swap) |
|
||||||
| `~/models/*.gguf` | Model weights |
|
| `~/models/*.gguf` | Model weights |
|
||||||
| `~/.llama-models.ini` | Router preset file |
|
| `~/.config/llama-swap/config.yaml` | llama-swap YAML config |
|
||||||
| `~/.config/systemd/user/llama-server.service` | Service unit |
|
| `~/.config/systemd/user/llama-swap.service` | Service unit |
|
||||||
| `~/vram-monitor.sh` | Optional idle-unload cron helper |
|
| `~/vram-monitor.sh` | Optional idle-unload cron helper |
|
||||||
|
|
||||||
### On the Caddy host (192.168.2.2)
|
### On the Caddy host (192.168.2.2)
|
||||||
|
|||||||
+142
-26
@@ -3,21 +3,28 @@ import * as https from "node:https";
|
|||||||
import { URL } from "node:url";
|
import { URL } from "node:url";
|
||||||
import { promisify } from "node:util";
|
import { promisify } from "node:util";
|
||||||
import {
|
import {
|
||||||
|
AI_SERVER_MODELS_PATH,
|
||||||
AI_SERVER_PRESET_PATH,
|
AI_SERVER_PRESET_PATH,
|
||||||
|
AI_SERVER_RUNNING_PATH,
|
||||||
|
AI_SERVER_SERVICE_UNIT,
|
||||||
AI_SERVER_SSH_HOST,
|
AI_SERVER_SSH_HOST,
|
||||||
|
AI_SERVER_UNLOAD_ALL_PATH,
|
||||||
|
AI_SERVER_UNLOAD_PATH,
|
||||||
|
AI_SERVER_UPSTREAM_HEALTH_PATH,
|
||||||
AI_SERVER_URL,
|
AI_SERVER_URL,
|
||||||
type ServerModel,
|
type ServerModel,
|
||||||
getAdminTimeoutMs,
|
getAdminTimeoutMs,
|
||||||
loadCerts,
|
loadCerts,
|
||||||
} from "./config.js";
|
} from "./config.js";
|
||||||
import {
|
import {
|
||||||
extractCtxSize,
|
parseCtxMapFromYaml,
|
||||||
|
extractCtxFromRunningCmd,
|
||||||
isReasoningModel,
|
isReasoningModel,
|
||||||
isShardArtefact,
|
isShardArtefact,
|
||||||
} from "./router-utils.js";
|
} from "./router-utils.js";
|
||||||
|
|
||||||
// Re-export so existing index.ts imports keep working.
|
// Re-export so existing index.ts imports keep working.
|
||||||
export { extractCtxSize, isReasoningModel };
|
export { isReasoningModel };
|
||||||
|
|
||||||
const exec = promisify(execCb);
|
const exec = promisify(execCb);
|
||||||
|
|
||||||
@@ -84,12 +91,33 @@ async function routerRequest(
|
|||||||
|
|
||||||
export interface RouterModel {
|
export interface RouterModel {
|
||||||
id: string;
|
id: string;
|
||||||
status: { value: "loaded" | "unloaded" | "loading"; args: string[] };
|
object?: string;
|
||||||
|
created?: number;
|
||||||
|
owned_by?: string;
|
||||||
|
/** Whether the model is currently loaded in llama-swap. */
|
||||||
|
running?: boolean;
|
||||||
}
|
}
|
||||||
|
|
||||||
export async function listModels(): Promise<RouterModel[]> {
|
export async function listModels(): Promise<RouterModel[]> {
|
||||||
const data = await routerRequest("GET", "/models");
|
// llama-swap: GET /v1/models returns { data: [{ id, object, created, owned_by }] }
|
||||||
return (data?.data ?? []) as RouterModel[];
|
// GET /running returns { running: [{ id, ... }] }
|
||||||
|
// We merge: every model from /v1/models gets a `running` flag from /running.
|
||||||
|
const [modelsRes, runningRes] = await Promise.all([
|
||||||
|
routerRequest("GET", AI_SERVER_MODELS_PATH),
|
||||||
|
routerRequest("GET", AI_SERVER_RUNNING_PATH),
|
||||||
|
]);
|
||||||
|
|
||||||
|
const models: RouterModel[] = (modelsRes?.data ?? []) as RouterModel[];
|
||||||
|
const runningIds = new Set<string>();
|
||||||
|
if (runningRes?.running && Array.isArray(runningRes.running)) {
|
||||||
|
for (const entry of runningRes.running as Record<string, unknown>[]) {
|
||||||
|
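// /running entries are keyed by `model` (see RunningEntry below); accept `id` as a fallback.
const runningKey = entry.model ?? entry.id;
if (runningKey) runningIds.add(String(runningKey));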
|
||||||
|
}
|
||||||
|
}
|
||||||
|
for (const m of models) {
|
||||||
|
m.running = runningIds.has(m.id);
|
||||||
|
}
|
||||||
|
return models;
|
||||||
}
|
}
|
||||||
|
|
||||||
// Short TTL cache for listModels — tab-completion calls the completer on
|
// Short TTL cache for listModels — tab-completion calls the completer on
|
||||||
@@ -113,32 +141,67 @@ export function invalidateListModelsCache(): void {
|
|||||||
}
|
}
|
||||||
|
|
||||||
export async function loadModel(id: string): Promise<unknown> {
|
export async function loadModel(id: string): Promise<unknown> {
|
||||||
// The router's handler reads `body["model"]`; passing `{id}` yields a 404.
|
// llama-swap: GET /upstream/<id>/health forces a spawn (warm load).
|
||||||
const r = await routerRequest("POST", "/models/load", { model: id });
|
// 2xx = success; plain text OK body is acceptable.
|
||||||
|
const r = await routerRequest("GET", AI_SERVER_UPSTREAM_HEALTH_PATH(id));
|
||||||
invalidateListModelsCache();
|
invalidateListModelsCache();
|
||||||
return r;
|
return r;
|
||||||
}
|
}
|
||||||
|
|
||||||
export async function unloadModel(id: string): Promise<unknown> {
|
export async function unloadModel(id: string): Promise<unknown> {
|
||||||
const r = await routerRequest("POST", "/models/unload", { model: id });
|
// llama-swap: POST /api/models/unload/<id>, no body. Returns plain text "OK".
|
||||||
|
const r = await routerRequest("POST", AI_SERVER_UNLOAD_PATH(id));
|
||||||
invalidateListModelsCache();
|
invalidateListModelsCache();
|
||||||
return r;
|
return r;
|
||||||
}
|
}
|
||||||
|
|
||||||
// A preset is "runnable" only if it has a --model path. Placeholder sections
|
export async function unloadAll(): Promise<unknown> {
|
||||||
// like [small-7b] without model = ... show up in /models but have no --model
|
// llama-swap: POST /api/models/unload, no body.
|
||||||
// arg and would fail on load.
|
const r = await routerRequest("POST", AI_SERVER_UNLOAD_ALL_PATH);
|
||||||
function isRunnable(m: RouterModel): boolean {
|
invalidateListModelsCache();
|
||||||
return (m.status?.args ?? []).includes("--model");
|
return r;
|
||||||
|
}
|
||||||
|
|
||||||
|
// llama-swap /v1/models only returns registered presets (all have a model
|
||||||
|
// path). Placeholder sections are not exposed. We only filter out shard
|
||||||
|
// artefacts.
|
||||||
|
|
||||||
|
interface RunningEntry {
|
||||||
|
model: string;
|
||||||
|
cmd?: string;
|
||||||
|
state?: string;
|
||||||
|
ttl?: number;
|
||||||
|
proxy?: string;
|
||||||
|
}
|
||||||
|
|
||||||
|
async function listRunning(): Promise<RunningEntry[]> {
|
||||||
|
const res = await routerRequest("GET", AI_SERVER_RUNNING_PATH);
|
||||||
|
return Array.isArray((res as any)?.running)
|
||||||
|
? (res as any).running
|
||||||
|
: [];
|
||||||
}
|
}
|
||||||
|
|
||||||
export async function discoverModels(): Promise<ServerModel[]> {
|
export async function discoverModels(): Promise<ServerModel[]> {
|
||||||
const models = await listModels();
|
const [models, running, yaml] = await Promise.all([
|
||||||
|
listModels(),
|
||||||
|
listRunning().catch(() => [] as RunningEntry[]),
|
||||||
|
readPreset().catch(() => ""),
|
||||||
|
]);
|
||||||
|
|
||||||
|
const ctxFromYaml = parseCtxMapFromYaml(yaml);
|
||||||
|
const ctxFromRunning = new Map<string, number>();
|
||||||
|
for (const r of running) {
|
||||||
|
const n = extractCtxFromRunningCmd(r.cmd);
|
||||||
|
if (n) ctxFromRunning.set(r.model, n);
|
||||||
|
}
|
||||||
|
|
||||||
return models
|
return models
|
||||||
.filter(isRunnable)
|
|
||||||
.filter((m) => !isShardArtefact(m.id))
|
.filter((m) => !isShardArtefact(m.id))
|
||||||
.map((m) => {
|
.map((m) => {
|
||||||
const ctx = extractCtxSize(m) ?? 32768;
|
const ctx =
|
||||||
|
ctxFromRunning.get(m.id) ?? // live process is authoritative
|
||||||
|
ctxFromYaml.get(m.id) ?? // config.yaml is next best
|
||||||
|
32768; // last-resort fallback
|
||||||
return {
|
return {
|
||||||
id: m.id,
|
id: m.id,
|
||||||
name: `${m.id} (AI Server)`,
|
name: `${m.id} (AI Server)`,
|
||||||
@@ -177,30 +240,83 @@ export async function readPreset(): Promise<string> {
|
|||||||
}
|
}
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* Set a `key = value` line inside a named [section] of the preset file.
|
* Replace the value of a `--flag N` argument inside a model's `cmd:` block in the llama-swap YAML config.
|
||||||
* Preserves comments and all other lines. Errors if the key is absent.
|
*
|
||||||
|
* llama-swap config.yaml structure (relevant excerpt):
|
||||||
|
*
|
||||||
|
* models:
|
||||||
|
* Qwen_Qwen3.6-35B-A3B-Q8_0:
|
||||||
|
* cmd: |
|
||||||
|
* /path/to/llama-server --model /path/to/gguf ...
|
||||||
|
* --ctx-size 32768
|
||||||
|
* --temp 0.7
|
||||||
|
*
|
||||||
|
* This function finds the `<id>:` block under `models:`, locates the
|
||||||
|
* `--ctx-size N` line (or other supported flags), and replaces N.
|
||||||
|
*
|
||||||
|
* Supported keys: ctx-size, temp, n-gpu-layers
|
||||||
*/
|
*/
|
||||||
export async function setPresetKey(
|
export async function setPresetKey(
|
||||||
section: string,
|
section: string,
|
||||||
key: string,
|
key: string,
|
||||||
value: string,
|
value: string,
|
||||||
): Promise<void> {
|
): Promise<void> {
|
||||||
|
// Map short key names to the actual CLI flag used in cmd:
|
||||||
|
const flagMap: Record<string, string> = {
|
||||||
|
"ctx-size": "--ctx-size",
|
||||||
|
"temp": "--temp",
|
||||||
|
"n-gpu-layers": "--n-gpu-layers",
|
||||||
|
};
|
||||||
|
const flag = flagMap[key] ?? `--${key}`;
|
||||||
|
|
||||||
|
// We use an awk-based approach on the YAML file:
|
||||||
|
// 1. Find the <section>: block under models:
|
||||||
|
// 2. Within that block, find the --flag N line
|
||||||
|
// 3. Replace N with the new value
|
||||||
|
//
|
||||||
|
// The awk script works line-by-line:
|
||||||
|
// - When we see ` ${section}:` under models:, enter editing mode
|
||||||
|
// - While editing, look for `--flag <number>` and replace it
|
||||||
|
// - Exit editing mode when we hit a line at the same or lesser indent
|
||||||
|
// that is not under this section
|
||||||
|
const escapedSection = section.replace(/[.[\]*/^$]/g, "\\$&");
|
||||||
|
const escapedFlag = flag.replace(/[.[\]*/^$]/g, "\\$&");
|
||||||
|
|
||||||
const awkScript = `
|
const awkScript = `
|
||||||
awk -v sec="[${section}]" -v key=${shQuote(key)} -v val=${shQuote(value)} '
|
awk -v sec="${escapedSection}" -v flag="${escapedFlag}" -v val=${shQuote(value)} '
|
||||||
BEGIN { in_s = 0; found = 0 }
|
BEGIN { in_sec = 0; indent = 0 }
|
||||||
/^\\[/ { in_s = ($0 == sec) }
|
{
|
||||||
in_s && $1 == key && $2 == "=" { print key " = " val; found = 1; next }
|
# Detect section header: " <section>:" (2-space indent, key followed by colon)
|
||||||
{ print }
|
if (!in_sec && match($0, /^[[:space:]]{2}'${escapedSection}':[[:space:]]*$/)) {
|
||||||
END { if (!found) exit 2 }
|
in_sec = 1;
|
||||||
|
indent = 2;
|
||||||
|
}
|
||||||
|
# If we are in a section, check if we left it
|
||||||
|
else if (in_sec) {  # skip the exit check on the header line itself, which also has indent 2
|
||||||
|
lineIndent = 0;
|
||||||
|
m = match($0, /^[[:space:]]*/);
|
||||||
|
if (m > 0) lineIndent = RLENGTH;
|
||||||
|
# If indent is <= 2 and line is not empty and not a continuation of cmd,
|
||||||
|
# we have left this section
|
||||||
|
if (lineIndent <= 2 && $0 !~ /^[[:space:]]*$/) {
|
||||||
|
in_sec = 0;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
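if (in_sec && match($0, " " flag " [0-9.]+")) {  # [0-9.]+ also matches decimals such as --temp 0.7
|
||||||
|
sub(flag " [0-9.]+", flag " " val); found = 1;
|
||||||
|
}
|
||||||
|
print
|
||||||
|
}
END { if (!found) exit 2 }  # non-zero exit skips the mv and lets the caller report a missing flag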
|
||||||
' ${AI_SERVER_PRESET_PATH} > ${AI_SERVER_PRESET_PATH}.tmp && mv ${AI_SERVER_PRESET_PATH}.tmp ${AI_SERVER_PRESET_PATH}
|
' ${AI_SERVER_PRESET_PATH} > ${AI_SERVER_PRESET_PATH}.tmp && mv ${AI_SERVER_PRESET_PATH}.tmp ${AI_SERVER_PRESET_PATH}
|
||||||
`.trim();
|
`.trim();
|
||||||
|
|
||||||
try {
|
try {
|
||||||
await runSsh(awkScript);
|
await runSsh(awkScript);
|
||||||
} catch (err: any) {
|
} catch (err: any) {
|
||||||
const msg = err?.message ?? String(err);
|
const msg = err?.message ?? String(err);
|
||||||
if (msg.includes("exit code 2") || msg.match(/exit.*2/)) {
|
if (msg.includes("exit code 2") || msg.match(/exit.*2/)) {
|
||||||
throw new Error(
|
throw new Error(
|
||||||
`Key "${key}" not found in [${section}] — add it to the preset manually first.`,
|
`Key "${key}" not found for model "${section}" — add it to the preset manually first.`,
|
||||||
);
|
);
|
||||||
}
|
}
|
||||||
throw err;
|
throw err;
|
||||||
@@ -209,7 +325,7 @@ awk -v sec="[${section}]" -v key=${shQuote(key)} -v val=${shQuote(value)} '
 
 export async function restartService(): Promise<string> {
   return runSsh(
-    "systemctl --user restart llama-server.service && systemctl --user is-active llama-server.service",
+    `systemctl --user restart ${AI_SERVER_SERVICE_UNIT} && systemctl --user is-active ${AI_SERVER_SERVICE_UNIT}`,
   );
 }
 
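The block-scoped edit that the awk script performs (enter the model's block, rewrite the first `--flag N`, leave at the next line indented two spaces or less) can also be sketched as a pure function, which is easier to unit-test than a remote shell pipeline. This is only an illustration under stated assumptions: `setFlagInModelBlock` is a hypothetical name that does not exist in the commit, and model headers are assumed to sit at exactly two spaces of indent under `models:`.

```typescript
// Hypothetical helper (not part of the commit): rewrites `--flag <number>`
// inside one model block of a llama-swap style YAML string.
function setFlagInModelBlock(
  yaml: string,
  modelId: string,
  flag: string,
  value: string,
): { text: string; replaced: boolean } {
  const escapeRe = (s: string): string =>
    s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Assumption: model headers are "  <id>:" with exactly two leading spaces.
  const header = new RegExp(`^ {2}${escapeRe(modelId)}:\\s*$`);
  // Accept integers and decimals so that values such as 0.7 are replaced whole.
  const flagRe = new RegExp(`(^|\\s)${escapeRe(flag)}\\s+[0-9.]+`);

  let inBlock = false;
  let replaced = false;
  const out = yaml.split("\n").map((line) => {
    if (!inBlock) {
      if (header.test(line)) inBlock = true;
      return line;
    }
    const indent = line.length - line.trimStart().length;
    if (line.trim() !== "" && indent <= 2) {
      // A sibling model header or a top-level key ends the block.
      inBlock = false;
      return line;
    }
    if (!replaced && flagRe.test(line)) {
      replaced = true;
      return line.replace(
        flagRe,
        (_m: string, lead: string) => `${lead}${flag} ${value}`,
      );
    }
    return line;
  });
  return { text: out.join("\n"), replaced };
}
```

Returning `replaced` lets a caller report a missing flag explicitly instead of inferring it from a process exit code.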
+22
-1
@@ -13,8 +13,29 @@ export const AI_SERVER_CHAT_PATH = "/v1/chat/completions";
 // SSH target for admin operations (preset edits, systemctl). Uses key auth.
 export const AI_SERVER_SSH_HOST =
   process.env.AI_SERVER_SSH_HOST ?? "ai-server@192.168.2.3";
+
+// llama-swap endpoint paths
+export const AI_SERVER_MODELS_PATH =
+  process.env.AI_SERVER_MODELS_PATH ?? "/v1/models";
+export const AI_SERVER_RUNNING_PATH =
+  process.env.AI_SERVER_RUNNING_PATH ?? "/running";
+export const AI_SERVER_UNLOAD_ALL_PATH =
+  process.env.AI_SERVER_UNLOAD_ALL_PATH ?? "/api/models/unload";
+export const AI_SERVER_UNLOAD_PATH = (id: string) =>
+  process.env.AI_SERVER_UNLOAD_PATH ??
+  `/api/models/unload/${encodeURIComponent(id)}`;
+export const AI_SERVER_UPSTREAM_HEALTH_PATH = (id: string) =>
+  process.env.AI_SERVER_UPSTREAM_HEALTH_PATH ??
+  `/upstream/${encodeURIComponent(id)}/health`;
+
+// llama-swap config file (YAML, replaces old INI preset)
 export const AI_SERVER_PRESET_PATH =
-  process.env.AI_SERVER_PRESET_PATH ?? "~/.llama-models.ini";
+  process.env.AI_SERVER_PRESET_PATH ??
+  "~/.config/llama-swap/config.yaml";
+
+// systemd service unit for llama-swap
+export const AI_SERVER_SERVICE_UNIT =
+  process.env.AI_SERVER_SERVICE_UNIT ?? "llama-swap.service";
 
 // Distinct api id so registering streamSimple does NOT overwrite the
 // built-in openai-completions provider (the api-registry keys by api name).
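To make the per-model path helpers concrete, here is a minimal sketch of how they might be combined with a base URL. The local functions are stand-ins for the constants added in this hunk with the environment overrides left out, `adminUrl` is a hypothetical helper, and the host name is a placeholder.

```typescript
// Stand-ins for AI_SERVER_UNLOAD_PATH / AI_SERVER_UPSTREAM_HEALTH_PATH
// (environment overrides omitted for brevity).
const unloadPath = (id: string): string =>
  `/api/models/unload/${encodeURIComponent(id)}`;
const upstreamHealthPath = (id: string): string =>
  `/upstream/${encodeURIComponent(id)}/health`;

// Hypothetical helper: join a base URL and a path without doubling slashes.
function adminUrl(base: string, path: string): string {
  return `${base.replace(/\/+$/, "")}${path}`;
}

const base = "https://ai-server.example/"; // placeholder host, an assumption
console.log(adminUrl(base, unloadPath("Qwen_Qwen3.6-35B-A3B-Q8_0")));
console.log(adminUrl(base, upstreamHealthPath("org/model:Q4")));
```

Note that in the hunk above an environment override replaces the whole path string, so a value set through `AI_SERVER_UNLOAD_PATH` or `AI_SERVER_UPSTREAM_HEALTH_PATH` is returned as-is and the model id is not substituted into it.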
+4
-10
@@ -1,7 +1,6 @@
 import type { ExtensionAPI } from "@mariozechner/pi-coding-agent";
 import {
   discoverModels,
-  extractCtxSize,
   listModels,
   listModelsCached,
   loadModel,
@@ -122,13 +121,8 @@ export default async function (pi: ExtensionAPI) {
       const routerModels = await listModels();
       const lines = [`AI Server: ${AI_SERVER_URL}`];
       for (const m of routerModels) {
-        const status = m.status?.value ?? "?";
-        const ctx = extractCtxSize(m);
-        const hasModel = (m.status?.args ?? []).includes("--model");
-        const marker = hasModel ? " " : " [no model path]";
-        lines.push(
-          ` ${m.id} [${status}] ctx=${ctx ?? "?"}${marker}`,
-        );
+        const status = m.running ? "loaded" : "unloaded";
+        lines.push(` ${m.id} [${status}]`);
       }
       ctx.ui.notify(lines.join("\n"), "info");
     } catch (err) {
@@ -246,7 +240,7 @@ export default async function (pi: ExtensionAPI) {
   });
 
   pi.registerCommand("ai-server-preset", {
-    description: "Print ~/.llama-models.ini on the ai-server",
+    description: "Print llama-swap config on the ai-server",
     handler: async (_args, ctx) => {
       try {
         const text = await readPreset();
@@ -261,7 +255,7 @@ export default async function (pi: ExtensionAPI) {
   });
 
   pi.registerCommand("ai-server-restart", {
-    description: "Restart the ai-server llama-server service",
+    description: "Restart the ai-server llama-swap service",
     handler: async (_args, ctx) => {
       const ok = await ctx.ui.confirm(
         "Restart llama-server?",
@@ -6,19 +6,80 @@
 
 export interface RouterModelMeta {
   id: string;
-  status?: { value: string; args: string[] };
+  object?: string;
+  created?: number;
+  owned_by?: string;
+  /** Whether the model is currently loaded in llama-swap. */
+  running?: boolean;
 }
 
 /**
- * Pull `--ctx-size <N>` out of the worker's argv. Returns null if the flag
- * is missing, at the end of argv, or the value isn't a number.
+ * Parse ctx-size values from every model block in llama-swap's config.yaml.
+ *
+ * The YAML has a structure like:
+ *
+ *   models:
+ *     Qwen_Qwen3.6-35B-A3B-Q8_0:
+ *       cmd: |
+ *         /path/to/llama-server
+ *         --ctx-size 262144
+ *         --temp 0.7
+ *
+ * This function scans for `--ctx-size N` lines within each model block and
+ * returns a Map of id → ctxSize. If a model appears multiple times it keeps
+ * the last value found.
  */
-export function extractCtxSize(m: RouterModelMeta): number | null {
-  const args = m.status?.args ?? [];
-  const i = args.indexOf("--ctx-size");
-  if (i < 0 || i + 1 >= args.length) return null;
-  const n = Number(args[i + 1]);
-  return Number.isFinite(n) ? n : null;
+export function parseCtxMapFromYaml(yaml: string): Map<string, number> {
+  const map = new Map<string, number>();
+  let currentId: string | null = null;
+
+  for (const raw of yaml.split("\n")) {
+    const line = raw.replace(/\r$/, "");
+
+    // Skip comments / blank
+    if (!line.trim() || line.trim().startsWith("#")) continue;
+
+    // New model block: exactly two-space indent, "<id>:" with nothing
+    // meaningful after the colon (llama-swap uses 2-space indent under
+    // `models:`).
+    const idMatch = /^  ([A-Za-z0-9._-]+):\s*$/.exec(line);
+    if (idMatch) {
+      currentId = idMatch[1];
+      continue;
+    }
+
+    // Top-level key resets context (e.g. `macros:`, `hooks:`)
+    if (/^[A-Za-z]/.test(line)) {
+      currentId = null;
+      continue;
+    }
+
+    if (!currentId) continue;
+
+    // Look for --ctx-size N anywhere in the line (handles indented cmd:
+    // blocks where the flag is on its own line).
+    const ctx = /--ctx-size\s+(\d+)/.exec(line);
+    if (ctx) {
+      map.set(currentId, Number(ctx[1]));
+      currentId = null; // one ctx per model
+    }
+  }
+
+  return map;
+}
+
+/**
+ * Extract ctx-size from a /running entry's `cmd` string.
+ *
+ * The /running endpoint returns entries like:
+ *   { model: "Qwen_...", cmd: "/path/llama-server --model ... --ctx-size 262144 ...", ... }
+ *
+ * This is the authoritative source for the currently loaded model's ctx.
+ */
+export function extractCtxFromRunningCmd(cmd: string | undefined): number | null {
+  if (!cmd) return null;
+  const m = /--ctx-size\s+(\d+)/.exec(cmd);
+  return m ? Number(m[1]) : null;
 }
 
 /**
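The commit message gives the lookup order for a model's context size as running cmd, then YAML, then a 32768 fallback. The sketch below shows one way the two parsers could be combined to implement that order. `resolveCtx`, `ctxFromCmd` and the `RunningEntry` shape are illustrative assumptions, not code from the commit; `ctxFromCmd` simply mirrors `extractCtxFromRunningCmd` so the block is self-contained.

```typescript
// Fallback value named in the commit message.
const FALLBACK_CTX = 32768;

// Mirrors extractCtxFromRunningCmd from the diff above.
function ctxFromCmd(cmd: string | undefined): number | null {
  if (!cmd) return null;
  const m = /--ctx-size\s+(\d+)/.exec(cmd);
  return m ? Number(m[1]) : null;
}

// Assumed shape of a /running entry; only `model` and `cmd` are used here.
interface RunningEntry {
  model: string;
  cmd?: string;
}

// Hypothetical resolver: running cmd first, then config.yaml, then fallback.
function resolveCtx(
  id: string,
  running: RunningEntry[],
  yamlCtx: Map<string, number>,
): number {
  const entry = running.find((r) => r.model === id);
  return ctxFromCmd(entry?.cmd) ?? yamlCtx.get(id) ?? FALLBACK_CTX;
}
```

Using `??` rather than `||` keeps a parsed value of `0` from falling through to the next source.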
+81
-24
@@ -7,45 +7,102 @@
 import assert from "node:assert/strict";
 import { test } from "node:test";
 import {
-  extractCtxSize,
+  parseCtxMapFromYaml,
+  extractCtxFromRunningCmd,
   isReasoningModel,
   isShardArtefact,
 } from "../ai-server/router-utils.ts";
 
-// ── extractCtxSize ──────────────────────────────────────────────────────
+// ── parseCtxMapFromYaml ─────────────────────────────────────────────────
 
-test("extractCtxSize: --ctx-size present with value", () => {
-  const m = {
-    id: "x",
-    status: { value: "loaded", args: ["--host", "127.0.0.1", "--ctx-size", "131072"] },
-  };
-  assert.equal(extractCtxSize(m), 131072);
+test("parseCtxMapFromYaml: extracts ctx-size from model blocks", () => {
+  const yaml = `
+models:
+  Qwen_Qwen3.6-35B-A3B-Q8_0:
+    cmd: |
+      /home/ai-server/llama.cpp/build/bin/llama-server
+      --model /home/ai-server/models/Qwen_Qwen3.6-35B-A3B-Q8_0.gguf
+      --ctx-size 262144
+      --temp 0.7
+  MiniMax-M2.7-IQ3_XXS:
+    cmd: |
+      /home/ai-server/llama.cpp/build/bin/llama-server
+      --model /home/ai-server/models/MiniMax-M2.7-UD-IQ3_XXS.gguf
+      --ctx-size 131072
+      --temp 1.0
+`;
+  const map = parseCtxMapFromYaml(yaml);
+  assert.equal(map.get("Qwen_Qwen3.6-35B-A3B-Q8_0"), 262144);
+  assert.equal(map.get("MiniMax-M2.7-IQ3_XXS"), 131072);
+  assert.equal(map.size, 2);
 });
 
-test("extractCtxSize: missing --ctx-size -> null", () => {
-  assert.equal(extractCtxSize({ id: "x", status: { value: "loaded", args: ["--host", "127"] } }), null);
+test("parseCtxMapFromYaml: skips comments and blank lines", () => {
+  const yaml = `
+# This is a comment
+models:
+
+  # Model with large context
+  Qwen_Qwen3.6-35B-A3B-Q8_0:
+    cmd: |
+      /path/to/server
+      --ctx-size 65536
+      --temp 0.7
+`;
+  const map = parseCtxMapFromYaml(yaml);
+  assert.equal(map.get("Qwen_Qwen3.6-35B-A3B-Q8_0"), 65536);
 });
 
-test("extractCtxSize: --ctx-size at end of argv -> null (no value)", () => {
-  assert.equal(extractCtxSize({ id: "x", status: { value: "loaded", args: ["--ctx-size"] } }), null);
+test("parseCtxMapFromYaml: resets on top-level keys", () => {
+  const yaml = `
+models:
+  Qwen_Qwen3.6-35B-A3B-Q8_0:
+    cmd: |
+      /path/to/server
+      --ctx-size 262144
+hooks:
+  on_startup:
+    preload:
+      - Qwen_Qwen3.6-35B-A3B-Q8_0
+`;
+  const map = parseCtxMapFromYaml(yaml);
+  assert.equal(map.get("Qwen_Qwen3.6-35B-A3B-Q8_0"), 262144);
+  // "preload" is not a valid model id pattern, but even if it were,
+  // it's under hooks: so should not be included.
+  assert.ok(!map.has("preload"));
 });
 
-test("extractCtxSize: non-numeric value -> null", () => {
-  assert.equal(
-    extractCtxSize({ id: "x", status: { value: "loaded", args: ["--ctx-size", "notanumber"] } }),
-    null,
-  );
+test("parseCtxMapFromYaml: empty yaml returns empty map", () => {
+  const map = parseCtxMapFromYaml("");
+  assert.equal(map.size, 0);
 });
 
-test("extractCtxSize: zero is valid (not null)", () => {
-  assert.equal(
-    extractCtxSize({ id: "x", status: { value: "loaded", args: ["--ctx-size", "0"] } }),
-    0,
-  );
+test("parseCtxMapFromYaml: model without ctx-size is skipped", () => {
+  const yaml = `
+models:
+  SmallModel:
+    cmd: |
+      /path/to/server
+      --temp 0.7
+`;
+  const map = parseCtxMapFromYaml(yaml);
+  assert.equal(map.get("SmallModel"), undefined);
+  assert.equal(map.size, 0);
 });
 
-test("extractCtxSize: missing status entirely -> null", () => {
-  assert.equal(extractCtxSize({ id: "x" }), null);
+// ── extractCtxFromRunningCmd ────────────────────────────────────────────
+
+test("extractCtxFromRunningCmd: parses --ctx-size from cmd string", () => {
+  const cmd = "/home/ai-server/llama.cpp/build/bin/llama-server --model /home/ai-server/models/Qwen.gguf --ctx-size 262144 --temp 0.7";
+  assert.equal(extractCtxFromRunningCmd(cmd), 262144);
+});
+
+test("extractCtxFromRunningCmd: undefined cmd returns null", () => {
+  assert.equal(extractCtxFromRunningCmd(undefined), null);
+});
+
+test("extractCtxFromRunningCmd: cmd without --ctx-size returns null", () => {
+  assert.equal(extractCtxFromRunningCmd("/path/to/server --temp 0.7"), null);
 });
 
 // ── isShardArtefact ─────────────────────────────────────────────────────