pi-extensions

Author	SHA1	Message	Date
shahondin1624	f7af660727	migrate ai-server extension from llama.cpp router to llama-swap Endpoint rewrites: - GET /v1/models + /running → merged listModels() with running flag - POST /models/load → GET /upstream/<id>/health (warm load) - POST /models/unload → POST /api/models/unload/<id> (no body) - Added POST /api/models/unload for unloadAll() Config migration: - Preset path: ~/.llama-models.ini → ~/.config/llama-swap/config.yaml - Service unit: llama-server.service → llama-swap.service - setPresetKey() rewritten from INI awk to YAML-aware awk for editing --ctx-size/--temp/--n-gpu-layers in cmd: blocks Per-model ctx-size (fixes 0/33k bug): - parseCtxMapFromYaml(): walks config.yaml, extracts --ctx-size N per model block → Map<id, ctxSize> - extractCtxFromRunningCmd(): parses --ctx-size from /running cmd string - discoverModels(): Promise.all(listModels, listRunning, readPreset), ctx priority: running cmd → yaml → 32768 fallback - Removed broken extractCtxSize stub and dangling imports Tests: 14 passing (parseCtxMapFromYaml ×5, extractCtxFromRunningCmd ×3, isShardArtefact ×3, isReasoningModel ×3) README: full rewrite covering llama-swap architecture, YAML config format, new endpoints, troubleshooting table updated.	2026-05-27 10:42:19 +02:00
Tobias Addicks	01564df5be	Refactor extension structure	2026-05-17 22:55:46 +02:00
shahondin1624	98c140ac03	feat(ai-server): wire pi settings → mTLS extension; per-model reasoning; configurable admin timeout - config.ts: add getAdminTimeoutMs() reading from AI_SERVER_ADMIN_TIMEOUT_MS env or settings.json retry.provider.adminTimeoutMs (default = inference timeout, capped at 5min). Refactor settings access into a cached readPiSettings() helper shared by both timeout resolvers. - stream.ts: forward options.reasoning (pi-mono's defaultThinkingLevel) to llama.cpp via chat_template_kwargs.enable_thinking + reasoning_effort, gated on per-model reasoning capability. Add TCP keepalive (30s) on the request socket to prevent NAT/middlebox idle drops during long silent prefills (root cause of the recent read ETIMEDOUT). - router-utils.ts: add isReasoningModel(id) with a substring-match list of known reasoning families (MiniMax-M, Qwen3.6, Qwen3-Coder, Qwen3-VL, MiMo-V2, gpt-oss, Devstral). Unanchored to handle HF-style Org_Model ids. - admin.ts: replace hardcoded 30s router HTTP timeout with getAdminTimeoutMs; use isReasoningModel(id) in discoverModels() instead of blanket reasoning: true. - index.ts: add informational compat block (thinkingFormat, supportsReasoningEffort, maxTokensField, etc.) to model registrations so pi-mono's UI / capability detection reflects per-model reasoning support. - tests: 3 new isReasoningModel test groups (positive, negative, unknown). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 19:33:56 +02:00
shahondin1624	f1ceeb4363	Refactor pass: shared utils, non-blocking discovery, safer monkey-patches Audit produced five concrete improvements: 1) New shared/ module (zero-dep pure utilities) - shared/ansi.ts: hexToRgb (throws on malformed input instead of silently producing NaN), fgFromHex, stripAnsi, visibleWidth, ANSI_RESET_FG / ANSI_RESET_ALL constants. - shared/format.ts: formatTokens, formatElapsed. - shared/ctx.ts: safely() and safelyAsync() helpers for dealing with pi's "stale after session replacement or reload" ExtensionRunner semantics. Removes duplicate helpers from mechanicus-footer, markdown-body-color, dark-mechanicus-indicator. 2) ai-server: non-blocking startup + short-race timeout - Factory registers STATIC_MODELS immediately so pi startup isn't blocked on the HTTPS round-trip. - Races discoverModels() against a 300ms timeout. On LAN (~40ms) the live list wins and pi --list-models sees the real models. Past the timeout, fallback remains and background discovery updates the provider later. - listModelsCached() with 5s TTL for tab completions (was firing a round-trip on every keystroke). - loadModel/unloadModel invalidate the cache. 3) dark-mechanicus-indicator: stale-ctx guard - Wrap the setInterval ticker body in safely() so a race between session_shutdown and the ticker can't crash node. Same pattern as the earlier footer fix. 4) Safer monkey-patches in markdown-body-color and mechanicus-thinking-label - Feature-detect Markdown/Editor/AssistantMessageComponent's target method before patching. Warn-and-skip rather than silently create a broken prototype if a pi-tui upgrade renames the internal method. 5) Minor - Replaced five `as any` casts with typed Record<string, unknown> access in the monkey-patch sites. - ai-server debug log only fires when actual discovery succeeds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 00:05:18 +02:00
shahondin1624	139dcbce74	messages: drop assistant entries with only thinking content Pi's session can contain assistant entries whose AssistantMessage.content is entirely thinking blocks (no text, no tool calls) — typical after an aborted turn or when reasoning is edited out. Our contextToOpenAIMessages was emitting those as { role: "assistant", content: null }. When such a message is at the end of the context, llama.cpp's chat template interprets the trailing assistant entry as an "assistant response prefill" attempt. Reasoning-model templates (MiniMax M2.7, Qwen, etc.) have enable_thinking set, and the server rejects this combination with HTTP 400: "Assistant response prefill is incompatible with enable_thinking." Fix: skip assistant entries where extractAssistantText and extractToolCalls both return empty. Thinking blocks aren't re-fed to the model anyway, so dropping the wrapper message loses no information. + two regression tests in tests/messages.test.ts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 23:48:16 +02:00
shahondin1624	752fdbaff1	Expand tests — unit coverage for router-utils + cache/LE contract Refactor: - Extracted extractCtxSize + isShardArtefact from ai-server/admin.ts to a new ai-server/router-utils.ts with zero relative imports. Makes them directly loadable in tests with Node's --experimental-strip-types (no jiti needed). admin.ts re-exports extractCtxSize so index.ts is unchanged. tests/router-utils.test.ts (9 cases): - extractCtxSize: present/value, missing, end-of-argv, non-numeric, zero (edge), missing status. - isShardArtefact: positive cases (5-digit, numeric, no zero-padding), negative cases (clean preset names, non-shard numeric patterns, shard pattern mid-string). tests/integration.test.ts (2 new cases): - "server cert is publicly trusted": verifies curl without --cacert flag reaches /health. Catches LE regression (cert reverting to self-signed). - "chat completion returns usage with prompt_tokens_details": sanity-checks the server contract our stream.ts now reads for cache-token reporting. Picks any loaded runnable model; skips cleanly when none loaded. Totals: 28 tests (13 messages + 9 router-utils + 6 integration). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 23:44:44 +02:00
shahondin1624	99ad3630fc	stream: report cached tokens; indicator: suppress pi's "Working..." Two small fixes: ai-server/stream.ts - llama.cpp reports cached prompt tokens via usage.prompt_tokens_details.cached_tokens and we were ignoring it. Populate output.usage.cacheRead so pi's footer can show the "R<tokens>" field. cacheRead is a subset of prompt_tokens (already counted in input), so totalTokens stays input + output — no double-counting. dark-mechanicus-indicator.ts - Pi appends "Working... (ESC to interrupt)" next to custom working indicator frames via a separate message slot. Call ctx.ui.setWorkingMessage("") on session_start + every turn_start to clear that suffix so the indicator line is just ⚙ <quote> · <elapsed> with no trailing "Working...". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 23:28:16 +02:00
shahondin1624	b2cb1667c7	ai-server: stop pinning server cert to private CA (LE is now live) Now that Caddy serves a Let's Encrypt cert for ai.shahondin1624.de, passing `ca: root-ca.pem` to https.request made Node override its default trust list with only the private CA — which no longer chains the LE-issued server cert, so every request failed with ECONNRESET / CERT_HAS_EXPIRED-style errors on the client side. Dropping the `ca:` option lets Node fall back to its built-in Mozilla CA bundle, which includes Let's Encrypt. Client cert/key still passed; mTLS remains enforced server-side. If the server ever reverts to a self-signed cert, re-add `ca: certs.ca` or set NODE_EXTRA_CA_CERTS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 23:03:35 +02:00
shahondin1624	39d0797dc9	Gate startup logs behind PI_DEBUG + skip GGUF-shard phantom entries - All [ai-server] / [markdown-body-color] / [mechanicus-thinking-label] console.log calls now fire only when PI_DEBUG is set. Default boot is clean. - ai-server's discoverModels now filters out ids matching /-\d+-of-\d+$/ — llama.cpp's --models-autoload registers every GGUF shard as its own id, duplicating the preset's consolidated model. These shard-named phantoms are no longer surfaced to pi. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 22:38:09 +02:00
shahondin1624	e321f90fe9	Initial commit — ai-server and local-llama extensions - ai-server/: multi-file pi extension that talks to a remote llama.cpp router over mTLS (custom streamSimple), with dynamic model discovery and admin slash commands for load/unload/ctx-size/restart/preset. Includes README.md documenting the full mTLS + systemd + Caddy setup. - local-llama.ts: minimal extension registering a local llama.cpp server as an OpenAI-compatible provider. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 21:14:40 +02:00
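
The shard filter from 39d0797dc9 and the per-model ctx-size priority from f7af660727 (running cmd → yaml → 32768 fallback) can be sketched roughly as below. This is a hypothetical reconstruction from the commit messages only, not the code in ai-server/router-utils.ts: the function bodies, the resolveCtxSize name, the accepted `--ctx-size=N` form, and the decision to treat a zero value as "not set" are all assumptions.

```typescript
// Assumed shape of the shard pattern quoted in 39d0797dc9: ids ending in "-<n>-of-<m>".
const SHARD_SUFFIX = /-\d+-of-\d+$/;

// True for GGUF shard ids such as "model-00001-of-00003"; the pattern is
// anchored at the end, so a shard-like run in the middle of an id does not match.
function isShardArtefact(id: string): boolean {
  return SHARD_SUFFIX.test(id);
}

// Pulls the numeric value of --ctx-size out of a command string.
// Returns undefined when the flag is absent, non-numeric, or zero
// (treating zero as "not set" is an assumption made for this sketch).
function extractCtxFromRunningCmd(cmd: string): number | undefined {
  const match = /--ctx-size(?:\s+|=)(\d+)/.exec(cmd);
  if (!match) return undefined;
  const n = Number.parseInt(match[1], 10);
  return n > 0 ? n : undefined;
}

// Hypothetical helper showing the priority order named in the commit:
// running cmd first, then the value parsed from config.yaml, then 32768.
function resolveCtxSize(
  runningCmd: string | undefined,
  yamlCtx: number | undefined,
): number {
  const fromRunning = runningCmd ? extractCtxFromRunningCmd(runningCmd) : undefined;
  return fromRunning ?? yamlCtx ?? 32768;
}
```

With these definitions, a model that is running with an explicit --ctx-size reports that value, a model known only from the YAML preset reports the preset value, and anything else falls back to 32768.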

10 Commits
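
The fix described in 139dcbce74 (skip assistant entries that carry only thinking blocks) might look something like the sketch below. The ContentBlock and AssistantEntry types and both function names are hypothetical stand-ins for pi's real message types and for extractAssistantText / extractToolCalls, which are not shown in the log.

```typescript
// Hypothetical, simplified content model; pi's actual AssistantMessage type may differ.
type ContentBlock =
  | { type: "thinking"; thinking: string }
  | { type: "text"; text: string }
  | { type: "toolCall"; id: string; name: string };

interface AssistantEntry {
  role: "assistant";
  content: ContentBlock[];
}

// An entry is worth replaying only if it has non-empty text or at least one tool call.
function hasReplayableContent(entry: AssistantEntry): boolean {
  return entry.content.some(
    (block) =>
      (block.type === "text" && block.text.trim() !== "") ||
      block.type === "toolCall",
  );
}

// Drops thinking-only entries so they are never emitted as
// { role: "assistant", content: null }, which a reasoning-model chat template
// would read as a trailing prefill attempt.
function dropThinkingOnly(entries: AssistantEntry[]): AssistantEntry[] {
  return entries.filter(hasReplayableContent);
}
```

Since thinking blocks are not fed back to the model in any case, filtering the whole wrapper entry loses no information, which matches the reasoning given in the commit message.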