10 Commits

Author SHA1 Message Date
shahondin1624 f7af660727 migrate ai-server extension from llama.cpp router to llama-swap
Endpoint rewrites:
  - GET /v1/models + /running → merged listModels() with running flag
  - POST /models/load → GET /upstream/<id>/health (warm load)
  - POST /models/unload → POST /api/models/unload/<id> (no body)
  - Added POST /api/models/unload for unloadAll()

Config migration:
  - Preset path: ~/.llama-models.ini → ~/.config/llama-swap/config.yaml
  - Service unit: llama-server.service → llama-swap.service
  - setPresetKey() rewritten from INI awk to YAML-aware awk for
    editing --ctx-size/--temp/--n-gpu-layers in cmd: blocks

Per-model ctx-size (fixes 0/33k bug):
  - parseCtxMapFromYaml(): walks config.yaml, extracts --ctx-size N per
    model block → Map<id, ctxSize>
  - extractCtxFromRunningCmd(): parses --ctx-size from /running cmd string
  - discoverModels(): runs listModels, listRunning and readPreset
    concurrently via Promise.all;
    ctx priority: running cmd → yaml → 32768 fallback
  - Removed broken extractCtxSize stub and dangling imports
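
A minimal sketch of the ctx-size resolution described above. The function bodies are not part of the commit message, so everything below is an assumption: the line-based YAML walk assumes llama-swap's top-level `models:` map with entries indented by two spaces, `resolveCtxSize` is a hypothetical name for the priority chain, and treating 0 as "unknown" is a guess.

```typescript
// Hypothetical sketch; the real implementations are not shown in the commit.
const CTX_FLAG = /--ctx-size(?:\s+|=)(\d+)/;
const FALLBACK_CTX = 32768;

// Pull "--ctx-size N" (or "--ctx-size=N") out of a command string.
function extractCtxFromRunningCmd(cmd: string): number | undefined {
  const match = CTX_FLAG.exec(cmd);
  if (!match) return undefined;
  const n = Number(match[1]);
  return n > 0 ? n : undefined; // assumption: 0 means "unknown"
}

// Walk config.yaml line by line and record the first --ctx-size per model.
// Assumes a top-level `models:` key and two-space-indented model ids.
function parseCtxMapFromYaml(yaml: string): Map<string, number> {
  const ctxById = new Map<string, number>();
  let inModels = false;
  let currentId: string | undefined;
  for (const line of yaml.split(/\r?\n/)) {
    if (/^\s*(#|$)/.test(line)) continue; // skip blank lines and comments
    if (/^\S/.test(line)) {
      // Any unindented line starts a new top-level section.
      inModels = /^models:\s*$/.test(line);
      currentId = undefined;
      continue;
    }
    if (!inModels) continue;
    const key = /^ {2}["']?([^"'\s:]+)["']?:\s*$/.exec(line);
    if (key) {
      currentId = key[1];
      continue;
    }
    const ctx = extractCtxFromRunningCmd(line);
    if (ctx !== undefined && currentId !== undefined && !ctxById.has(currentId)) {
      ctxById.set(currentId, ctx);
    }
  }
  return ctxById;
}

// Priority chain from the commit: running cmd, then yaml, then 32768.
function resolveCtxSize(
  id: string,
  runningCmd: string | undefined,
  yamlCtx: Map<string, number>,
): number {
  const fromRunning =
    runningCmd === undefined ? undefined : extractCtxFromRunningCmd(runningCmd);
  return fromRunning ?? yamlCtx.get(id) ?? FALLBACK_CTX;
}
```

A real YAML parser would be more robust than the line walk; the sketch only illustrates the lookup order.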

Tests: 14 passing (parseCtxMapFromYaml ×5, extractCtxFromRunningCmd ×3,
isShardArtefact ×3, isReasoningModel ×3)

README: full rewrite covering llama-swap architecture, YAML config format,
new endpoints, troubleshooting table updated.
2026-05-27 10:42:19 +02:00
Tobias Addicks 01564df5be Refactor extension structure 2026-05-17 22:55:46 +02:00
shahondin1624 98c140ac03 feat(ai-server): wire pi settings → mTLS extension; per-model reasoning; configurable admin timeout
- config.ts: add getAdminTimeoutMs() reading from
  AI_SERVER_ADMIN_TIMEOUT_MS env or settings.json
  retry.provider.adminTimeoutMs (default = inference timeout, capped at 5 min).
  Refactor settings access into a cached readPiSettings() helper shared by
  both timeout resolvers.
- stream.ts: forward options.reasoning (pi-mono's defaultThinkingLevel) to
  llama.cpp via chat_template_kwargs.enable_thinking +
  reasoning_effort, gated on per-model reasoning capability. Add TCP keepalive
  (30s) on the request socket to prevent NAT/middlebox idle drops during long
  silent prefills (root cause of the recent read ETIMEDOUT).
- router-utils.ts: add isReasoningModel(id) with a substring-match list of
  known reasoning families (MiniMax-M, Qwen3.6, Qwen3-Coder, Qwen3-VL,
  MiMo-V2, gpt-oss, Devstral). Unanchored to handle HF-style Org_Model ids.
- admin.ts: replace hardcoded 30s router HTTP timeout with getAdminTimeoutMs;
  use isReasoningModel(id) in discoverModels() instead of blanket
  reasoning: true.
- index.ts: add informational compat block (thinkingFormat,
  supportsReasoningEffort, maxTokensField, etc.) to model registrations so
  pi-mono's UI / capability detection reflects per-model reasoning support.
- tests: 3 new isReasoningModel test groups (positive, negative, unknown).
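
A rough sketch of the per-model reasoning gate described above. The family list is taken from the commit message; case-insensitive matching, the `reasoningFields` helper name, and passing the thinking level straight through as `reasoning_effort` are assumptions, not the actual router-utils.ts or stream.ts code.

```typescript
// Substrings from the commit message. Unanchored so HF-style ids such as
// "Org_Model" still match. Case-insensitivity is an assumption.
const REASONING_FAMILIES = [
  "MiniMax-M",
  "Qwen3.6",
  "Qwen3-Coder",
  "Qwen3-VL",
  "MiMo-V2",
  "gpt-oss",
  "Devstral",
];

function isReasoningModel(id: string): boolean {
  const lower = id.toLowerCase();
  return REASONING_FAMILIES.some((family) => lower.includes(family.toLowerCase()));
}

// Hypothetical helper: extra request-body fields, emitted only when the
// caller asked for reasoning AND the model is a known reasoning family.
function reasoningFields(
  id: string,
  reasoning: string | undefined,
): Record<string, unknown> {
  if (reasoning === undefined || !isReasoningModel(id)) return {};
  return {
    chat_template_kwargs: { enable_thinking: true },
    reasoning_effort: reasoning, // assumption: level is forwarded unchanged
  };
}
```

The result could be spread into the chat-completion body, so non-reasoning models receive no extra fields at all.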

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 19:33:56 +02:00
shahondin1624 f1ceeb4363 Refactor pass: shared utils, non-blocking discovery, safer monkey-patches
Audit produced five concrete improvements:

1) New shared/ module (zero-dep pure utilities)
   - shared/ansi.ts: hexToRgb (throws on malformed input instead of
     silently producing NaN), fgFromHex, stripAnsi, visibleWidth,
     ANSI_RESET_FG / ANSI_RESET_ALL constants.
   - shared/format.ts: formatTokens, formatElapsed.
   - shared/ctx.ts: safely() and safelyAsync() helpers for dealing with
     pi's "stale after session replacement or reload" ExtensionRunner
     semantics.

   Removes duplicate helpers from mechanicus-footer, markdown-body-color,
   dark-mechanicus-indicator.
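
For illustration, the ANSI helpers named above might look roughly like this. The tuple return type, the six-digit-only input format, and the constant values (standard SGR codes) are assumptions; only the "throw instead of NaN" behaviour comes from the commit message.

```typescript
// Standard SGR sequences; assumed to be what the shared constants hold.
const ANSI_RESET_FG = "\x1b[39m";
const ANSI_RESET_ALL = "\x1b[0m";

// Parse "#rrggbb" or "rrggbb". Throws on malformed input so callers never
// see NaN components.
function hexToRgb(hex: string): [number, number, number] {
  const match = /^#?([0-9a-f]{6})$/i.exec(hex.trim());
  if (!match) {
    throw new Error(`Invalid hex colour: ${JSON.stringify(hex)}`);
  }
  const value = parseInt(match[1], 16);
  return [(value >> 16) & 0xff, (value >> 8) & 0xff, value & 0xff];
}

// 24-bit foreground colour escape sequence.
function fgFromHex(hex: string): string {
  const [r, g, b] = hexToRgb(hex);
  return `\x1b[38;2;${r};${g};${b}m`;
}

// Remove SGR colour/style sequences only (not every ANSI escape).
function stripAnsi(text: string): string {
  return text.replace(/\x1b\[[0-9;]*m/g, "");
}

// Width after stripping; assumes one column per UTF-16 code unit.
function visibleWidth(text: string): number {
  return stripAnsi(text).length;
}
```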

2) ai-server: non-blocking startup + short-race timeout
   - Factory registers STATIC_MODELS immediately so pi startup isn't
     blocked on the HTTPS round-trip.
   - Races discoverModels() against a 300ms timeout. On LAN (~40ms) the
     live list wins and pi --list-models sees the real models. If the
     timeout fires first, the static fallback stays registered and
     background discovery updates the provider later.
   - listModelsCached() with 5s TTL for tab completions (was firing a
     round-trip on every keystroke).
   - loadModel/unloadModel invalidate the cache.
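
The startup race and the TTL cache above could be sketched as follows. `raceWithTimeout` and `TtlCache` are hypothetical names, and caching the in-flight promise (rather than the resolved value) is a design assumption that also de-duplicates concurrent callers.

```typescript
// Resolve with `fallback` if `task` has not settled within `ms`.
// A rejection of `task` before the timeout is propagated to the caller.
function raceWithTimeout<T>(task: Promise<T>, ms: number, fallback: T): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => resolve(fallback), ms);
    task.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

// Small TTL cache around an async loader. The clock is injectable so the
// expiry logic can be exercised without waiting.
class TtlCache<T> {
  private readonly load: () => Promise<T>;
  private readonly ttlMs: number;
  private readonly now: () => number;
  private value: Promise<T> | undefined;
  private fetchedAt = 0;

  constructor(load: () => Promise<T>, ttlMs: number, now: () => number = Date.now) {
    this.load = load;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(): Promise<T> {
    const t = this.now();
    if (this.value === undefined || t - this.fetchedAt >= this.ttlMs) {
      // Note: a rejected promise stays cached until expiry or invalidate().
      this.value = this.load();
      this.fetchedAt = t;
    }
    return this.value;
  }

  // Called after load/unload so the next lookup refetches.
  invalidate(): void {
    this.value = undefined;
  }
}
```

With these, startup would be along the lines of `raceWithTimeout(discoverModels(), 300, STATIC_MODELS)`, and tab completion would read from a `TtlCache` built with a 5000 ms TTL.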

3) dark-mechanicus-indicator: stale-ctx guard
   - Wrap the setInterval ticker body in safely() so a race between
     session_shutdown and the ticker can't crash node. Same pattern as
     the earlier footer fix.

4) Safer monkey-patches in markdown-body-color and mechanicus-thinking-label
   - Feature-detect Markdown/Editor/AssistantMessageComponent's target
     method before patching. Warn-and-skip rather than silently creating
     a broken prototype if a pi-tui upgrade renames the internal method.
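
The feature-detect-then-patch pattern could be sketched like this. `patchMethod` and its signature are hypothetical; the sketch only shows the warn-and-skip idea with `Record<string, unknown>` access instead of `as any`.

```typescript
type AnyFn = (this: unknown, ...args: unknown[]) => unknown;

// Replace proto[name] with wrap(original), but only if the method exists.
// Returns false (and leaves the prototype untouched) when it does not.
function patchMethod(
  proto: object,
  name: string,
  wrap: (original: AnyFn) => AnyFn,
  label: string,
): boolean {
  const table = proto as Record<string, unknown>;
  const original = table[name];
  if (typeof original !== "function") {
    console.warn(`[${label}] ${name}() not found; skipping patch`);
    return false;
  }
  table[name] = wrap(original as AnyFn);
  return true;
}
```

If an upstream upgrade renames the target method, the call returns `false` and logs a warning, and no property is added to the prototype.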

5) Minor
   - Replaced five `as any` casts with typed Record<string, unknown>
     access in the monkey-patch sites.
   - ai-server debug log only fires when actual discovery succeeds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 00:05:18 +02:00
shahondin1624 139dcbce74 messages: drop assistant entries with only thinking content
Pi's session can contain assistant entries whose AssistantMessage.content
is entirely thinking blocks (no text, no tool calls) — typical after an
aborted turn or when reasoning is edited out. Our contextToOpenAIMessages
was emitting those as { role: "assistant", content: null }.

When such a message is at the end of the context, llama.cpp's chat
template interprets the trailing assistant entry as an "assistant
response prefill" attempt. Reasoning-model templates (MiniMax M2.7,
Qwen, etc.) have enable_thinking set, and the server rejects this
combination with HTTP 400:
    "Assistant response prefill is incompatible with enable_thinking."

Fix: skip assistant entries where extractAssistantText and
extractToolCalls both return empty. Thinking blocks aren't re-fed to
the model anyway, so dropping the wrapper message loses no information.
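
A minimal sketch of the skip rule, using simplified, hypothetical content-block shapes (pi's real AssistantMessage type is not shown in the commit). The two extractor names come from the message above; their bodies here are assumptions.

```typescript
// Hypothetical, simplified shapes for illustration only.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "thinking"; thinking: string }
  | { type: "toolCall"; id: string; name: string; arguments: unknown };

type ToolCallBlock = Extract<ContentBlock, { type: "toolCall" }>;

interface AssistantEntry {
  role: "assistant";
  content: ContentBlock[];
}

function extractAssistantText(content: ContentBlock[]): string {
  return content.flatMap((b) => (b.type === "text" ? [b.text] : [])).join("");
}

function extractToolCalls(content: ContentBlock[]): ToolCallBlock[] {
  return content.filter((b): b is ToolCallBlock => b.type === "toolCall");
}

// An entry is emitted only if it carries text or at least one tool call;
// thinking-only entries are dropped instead of becoming content: null.
function shouldEmitAssistant(entry: AssistantEntry): boolean {
  return (
    extractAssistantText(entry.content).length > 0 ||
    extractToolCalls(entry.content).length > 0
  );
}
```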

Adds two regression tests in tests/messages.test.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 23:48:16 +02:00
shahondin1624 752fdbaff1 Expand tests — unit coverage for router-utils + cache/LE contract
Refactor:
- Extracted extractCtxSize + isShardArtefact from ai-server/admin.ts to
  a new ai-server/router-utils.ts with zero relative imports. Makes them
  directly loadable in tests with Node's --experimental-strip-types
  (no jiti needed). admin.ts re-exports extractCtxSize so index.ts is
  unchanged.

tests/router-utils.test.ts (9 cases):
- extractCtxSize: present/value, missing, end-of-argv, non-numeric,
  zero (edge), missing status.
- isShardArtefact: positive cases (5-digit, numeric, no zero-padding),
  negative cases (clean preset names, non-shard numeric patterns,
  shard pattern mid-string).

tests/integration.test.ts (2 new cases):
- "server cert is publicly trusted": verifies that curl without the
  --cacert flag reaches /health. Catches an LE regression (cert
  reverting to self-signed).
- "chat completion returns usage with prompt_tokens_details":
  sanity-checks the server contract our stream.ts now reads for
  cache-token reporting. Picks any loaded runnable model; skips
  cleanly when none loaded.

Totals: 28 tests (13 messages + 9 router-utils + 6 integration).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 23:44:44 +02:00
shahondin1624 99ad3630fc stream: report cached tokens; indicator: suppress pi's "Working..."
Two small fixes:

ai-server/stream.ts
- llama.cpp reports cached prompt tokens via
    usage.prompt_tokens_details.cached_tokens
  and we were ignoring it. Populate output.usage.cacheRead so pi's
  footer can show the "R<tokens>" field. cacheRead is a subset of
  prompt_tokens (already counted in input), so totalTokens stays
  input + output — no double-counting.
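
The accounting above can be sketched as a small mapping function. The `usage` field names on the input side are the ones quoted in the commit message; the output interface and the `toOutputUsage` name are assumptions for illustration.

```typescript
// Server-side usage shape, limited to the fields named in the commit.
interface ServerUsage {
  prompt_tokens: number;
  completion_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
}

// Hypothetical client-side shape.
interface OutputUsage {
  input: number;
  output: number;
  cacheRead: number;
  totalTokens: number;
}

function toOutputUsage(usage: ServerUsage): OutputUsage {
  const cacheRead = usage.prompt_tokens_details?.cached_tokens ?? 0;
  return {
    input: usage.prompt_tokens,
    output: usage.completion_tokens,
    cacheRead,
    // cacheRead is already part of prompt_tokens, so it is not added again.
    totalTokens: usage.prompt_tokens + usage.completion_tokens,
  };
}
```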

dark-mechanicus-indicator.ts
- Pi appends "Working... (ESC to interrupt)" next to custom working
  indicator frames via a separate message slot. Call
  ctx.ui.setWorkingMessage("") on session_start + every turn_start to
  clear that suffix so the indicator line is just
    ⚙ <quote> · <elapsed>
  with no trailing "Working...".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 23:28:16 +02:00
shahondin1624 b2cb1667c7 ai-server: stop pinning server cert to private CA (LE is now live)
Caddy now serves a Let's Encrypt cert for ai.shahondin1624.de. Passing
`ca: root-ca.pem` to https.request made Node replace its default trust
list with only the private CA, which cannot validate the LE-issued
server cert, so every request failed with ECONNRESET /
CERT_HAS_EXPIRED-style errors on the client side.

Dropping the `ca:` option lets Node fall back to its built-in Mozilla
CA bundle, which includes Let's Encrypt. Client cert/key still passed;
mTLS remains enforced server-side.

If the server ever reverts to a self-signed cert, re-add `ca: certs.ca`
or set NODE_EXTRA_CA_CERTS.
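
As an illustration, the TLS options could be assembled so that `ca` is only set on request. `buildTlsOptions`, `ClientCerts` and the `pinPrivateCa` switch are hypothetical names; the sketch relies only on the documented behaviour that supplying `ca` replaces Node's default trust list.

```typescript
// Hypothetical holder for the loaded PEM material.
interface ClientCerts {
  cert: string | Uint8Array;
  key: string | Uint8Array;
  ca?: string | Uint8Array;
}

// Subset of https.request options relevant to mTLS.
interface TlsRequestOptions {
  cert: string | Uint8Array;
  key: string | Uint8Array;
  ca?: string | Uint8Array;
}

// Always send the client cert/key (mTLS). Only pin the private CA when
// asked, because setting `ca` overrides the built-in CA bundle.
function buildTlsOptions(certs: ClientCerts, pinPrivateCa = false): TlsRequestOptions {
  const options: TlsRequestOptions = { cert: certs.cert, key: certs.key };
  if (pinPrivateCa && certs.ca !== undefined) {
    options.ca = certs.ca;
  }
  return options;
}
```

The result would be spread into the options object passed to `https.request`; with the default `pinPrivateCa = false` the server cert is checked against the built-in bundle.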

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 23:03:35 +02:00
shahondin1624 39d0797dc9 Gate startup logs behind PI_DEBUG + skip GGUF-shard phantom entries
- All [ai-server] / [markdown-body-color] / [mechanicus-thinking-label]
  console.log calls now fire only when PI_DEBUG is set. Default boot is
  clean.
- ai-server's discoverModels now filters out ids matching
  /-\d+-of-\d+$/ — llama.cpp's --models-autoload registers every GGUF
  shard as its own id, duplicating the preset's consolidated model.
  These shard-named phantoms are no longer surfaced to pi.
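
The shard filter can be sketched with the regex quoted above. The function name `isShardArtefact` appears elsewhere in this log; `visibleModelIds` is a hypothetical helper added for illustration.

```typescript
// Matches ids ending in "-<n>-of-<m>", e.g. "model-00001-of-00003".
const SHARD_SUFFIX = /-\d+-of-\d+$/;

function isShardArtefact(id: string): boolean {
  return SHARD_SUFFIX.test(id);
}

// Drop shard-named phantom entries before surfacing ids to pi.
function visibleModelIds(ids: string[]): string[] {
  return ids.filter((id) => !isShardArtefact(id));
}
```

Because the pattern is anchored at the end, an id that merely contains a shard-like fragment mid-string is kept.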

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 22:38:09 +02:00
shahondin1624 e321f90fe9 Initial commit — ai-server and local-llama extensions
- ai-server/: multi-file pi extension that talks to a remote llama.cpp
  router over mTLS (custom streamSimple), with dynamic model discovery
  and admin slash commands for load/unload/ctx-size/restart/preset.
  Includes README.md documenting the full mTLS + systemd + Caddy setup.
- local-llama.ts: minimal extension registering a local llama.cpp server
  as an OpenAI-compatible provider.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 21:14:40 +02:00