- config.ts: add getAdminTimeoutMs() reading from
AI_SERVER_ADMIN_TIMEOUT_MS env or settings.json
retry.provider.adminTimeoutMs (default = inference timeout, capped at 5min).
Refactor settings access into a cached readPiSettings() helper shared by
both timeout resolvers.
- stream.ts: forward options.reasoning (pi-mono's defaultThinkingLevel) to
llama.cpp via chat_template_kwargs.enable_thinking +
reasoning_effort, gated on per-model reasoning capability. Add TCP keepalive
(30s) on the request socket to prevent NAT/middlebox idle drops during long
silent prefills (root cause of the recent read ETIMEDOUT).
- router-utils.ts: add isReasoningModel(id) with a substring-match list of
known reasoning families (MiniMax-M, Qwen3.6, Qwen3-Coder, Qwen3-VL,
MiMo-V2, gpt-oss, Devstral). The match is unanchored so that HF-style
Org_Model ids also match.
- admin.ts: replace hardcoded 30s router HTTP timeout with getAdminTimeoutMs();
use isReasoningModel(id) in discoverModels() instead of blanket
reasoning: true.
- index.ts: add informational compat block (thinkingFormat,
supportsReasoningEffort, maxTokensField, etc.) to model registrations so
pi-mono's UI / capability detection reflects per-model reasoning support.
- tests: 3 new isReasoningModel test groups (positive, negative, unknown).
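
A minimal sketch of the unanchored substring match described for
isReasoningModel above. The family names are copied from this message;
the constant name and the case-insensitive comparison are assumptions,
not necessarily what router-utils.ts does.

```typescript
// Family list copied from the commit text. The constant name is hypothetical.
const REASONING_FAMILIES: readonly string[] = [
  "MiniMax-M",
  "Qwen3.6",
  "Qwen3-Coder",
  "Qwen3-VL",
  "MiMo-V2",
  "gpt-oss",
  "Devstral",
];

// Unanchored substring match, so ids with an "Org_" prefix or a quantisation
// suffix still match. Assumption: the comparison ignores case.
function isReasoningModel(id: string): boolean {
  const needle = id.toLowerCase();
  return REASONING_FAMILIES.some((family) => needle.includes(family.toLowerCase()));
}
```

Because the match is unanchored, an id such as "unsloth_Qwen3-Coder-30B"
(a made-up example) is recognised without listing every org prefix.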
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Audit produced five concrete improvements:
1) New shared/ module (zero-dep pure utilities)
- shared/ansi.ts: hexToRgb (throws on malformed input instead of
silently producing NaN), fgFromHex, stripAnsi, visibleWidth,
ANSI_RESET_FG / ANSI_RESET_ALL constants.
- shared/format.ts: formatTokens, formatElapsed.
- shared/ctx.ts: safely() and safelyAsync() helpers for dealing with
pi's "stale after session replacement or reload" ExtensionRunner
semantics.
Removes duplicate helpers from mechanicus-footer, markdown-body-color,
dark-mechanicus-indicator.
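
A sketch of the "throw instead of NaN" behaviour described for hexToRgb
above. The return shape, the accepted input format (six hex digits with
an optional "#") and the error text are assumptions; only the function
names come from this message.

```typescript
// Hypothetical return shape; the commit only states that malformed input throws.
type Rgb = { r: number; g: number; b: number };

function hexToRgb(hex: string): Rgb {
  // Accept exactly six hex digits with an optional leading "#".
  const match = /^#?([0-9a-fA-F]{6})$/.exec(hex);
  if (match === null) {
    throw new Error(`hexToRgb: malformed hex colour "${hex}"`);
  }
  const value = parseInt(match[1], 16);
  return { r: (value >> 16) & 0xff, g: (value >> 8) & 0xff, b: value & 0xff };
}

// 24-bit foreground SGR sequence: ESC [ 38 ; 2 ; r ; g ; b m
function fgFromHex(hex: string): string {
  const { r, g, b } = hexToRgb(hex);
  return `\x1b[38;2;${r};${g};${b}m`;
}
```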
2) ai-server: non-blocking startup + short-race timeout
- Factory registers STATIC_MODELS immediately so pi startup isn't
blocked on the HTTPS round-trip.
- Races discoverModels() against a 300ms timeout. On LAN (~40ms) the
live list wins and pi --list-models sees the real models. If the
timeout fires first, the static fallback stays registered and
background discovery updates the provider later.
- listModelsCached() with 5s TTL for tab completions (previously every
keystroke fired a round-trip).
- loadModel/unloadModel invalidate the cache.
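
The short-race pattern above can be sketched roughly as follows. The
helper name raceWithTimeout is hypothetical, and treating a rejected
discovery call like a timeout is an assumption; this message does not say
how failures are handled.

```typescript
// Resolve with `work` if it settles within `ms`, otherwise with `fallback`.
// Assumption: a rejection of `work` also yields the fallback.
function raceWithTimeout<T>(work: Promise<T>, ms: number, fallback: T): Promise<T> {
  return new Promise<T>((resolve) => {
    const timer = setTimeout(() => resolve(fallback), ms);
    work.then(
      (value) => {
        clearTimeout(timer); // do not keep the event loop alive after a win
        resolve(value);
      },
      () => {
        clearTimeout(timer);
        resolve(fallback);
      },
    );
  });
}
```

In this sketch the caller would pass discoverModels() as `work`, 300 as
`ms` and the static list as `fallback`. The discovery promise keeps
running after a timeout, so a later `.then` on it can still update the
provider.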
3) dark-mechanicus-indicator: stale-ctx guard
- Wrap the setInterval ticker body in safely() so a race between
session_shutdown and the ticker can't crash Node. Same pattern as
the earlier footer fix.
4) Safer monkey-patches in markdown-body-color and mechanicus-thinking-label
- Feature-detect the target method on Markdown/Editor/
AssistantMessageComponent before patching. Warn and skip rather than
silently creating a broken prototype if a pi-tui upgrade renames the
internal method.
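
A sketch of the feature-detect-then-patch idea above. patchMethod and
its signature are hypothetical; the real extensions patch specific pi-tui
classes whose method names are not given in this message.

```typescript
type AnyFn = (...args: unknown[]) => unknown;

// Hypothetical helper: wrap `proto[method]` only if it is currently a function.
// Returns true when the patch was applied, false when it was skipped.
function patchMethod(
  proto: Record<string, unknown>,
  method: string,
  wrap: (original: AnyFn) => AnyFn,
  warn: (message: string) => void = console.warn,
): boolean {
  const original = proto[method];
  if (typeof original !== "function") {
    // The upstream method was renamed or removed: leave the prototype untouched.
    warn(`patchMethod: "${method}" not found, skipping patch`);
    return false;
  }
  proto[method] = wrap(original as AnyFn);
  return true;
}
```

Typing the prototype as Record<string, unknown> is what allows the
lookup without an `as any` cast.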
5) Minor
- Replaced five `as any` casts with typed Record<string, unknown>
access in the monkey-patch sites.
- ai-server debug log fires only when discovery actually succeeds.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pi's session can contain assistant entries whose AssistantMessage.content
is entirely thinking blocks (no text, no tool calls) — typical after an
aborted turn or when reasoning is edited out. Our contextToOpenAIMessages
was emitting those as { role: "assistant", content: null }.
When such a message is at the end of the context, llama.cpp's chat
template interprets the trailing assistant entry as an "assistant
response prefill" attempt. Reasoning-model templates (MiniMax M2.7,
Qwen, etc.) have enable_thinking set, and the server rejects this
combination with HTTP 400:
"Assistant response prefill is incompatible with enable_thinking."
Fix: skip assistant entries where extractAssistantText and
extractToolCalls both return empty. Thinking blocks aren't re-fed to
the model anyway, so dropping the wrapper message loses no information.
Adds two regression tests in tests/messages.test.ts.
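
A simplified sketch of the skip rule described above. The message and
block types are hypothetical stand-ins for pi's real types, and the
helper bodies are guesses; only the names contextToOpenAIMessages,
extractAssistantText and extractToolCalls come from this message.

```typescript
// Hypothetical, simplified stand-ins for pi's session types.
type ContentBlock =
  | { type: "thinking"; thinking: string }
  | { type: "text"; text: string }
  | { type: "toolCall"; id: string; name: string; arguments: string };

type SessionMessage =
  | { role: "user"; content: string }
  | { role: "assistant"; content: ContentBlock[] };

type ToolCall = { id: string; type: "function"; function: { name: string; arguments: string } };
type OpenAIMessage = { role: "user" | "assistant"; content: string | null; tool_calls?: ToolCall[] };

function extractAssistantText(blocks: ContentBlock[]): string {
  let text = "";
  for (const block of blocks) {
    if (block.type === "text") text += block.text;
  }
  return text;
}

function extractToolCalls(blocks: ContentBlock[]): ToolCall[] {
  const calls: ToolCall[] = [];
  for (const block of blocks) {
    if (block.type === "toolCall") {
      calls.push({ id: block.id, type: "function", function: { name: block.name, arguments: block.arguments } });
    }
  }
  return calls;
}

function contextToOpenAIMessages(messages: SessionMessage[]): OpenAIMessage[] {
  const out: OpenAIMessage[] = [];
  for (const message of messages) {
    if (message.role === "user") {
      out.push({ role: "user", content: message.content });
      continue;
    }
    const text = extractAssistantText(message.content);
    const toolCalls = extractToolCalls(message.content);
    // Thinking-only entry: skip it instead of emitting content: null.
    if (text === "" && toolCalls.length === 0) continue;
    const converted: OpenAIMessage = { role: "assistant", content: text === "" ? null : text };
    if (toolCalls.length > 0) converted.tool_calls = toolCalls;
    out.push(converted);
  }
  return out;
}
```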
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Refactor:
- Extracted extractCtxSize + isShardArtefact from ai-server/admin.ts to
a new ai-server/router-utils.ts with zero relative imports. Makes them
directly loadable in tests with Node's --experimental-strip-types
(no jiti needed). admin.ts re-exports extractCtxSize so index.ts is
unchanged.
tests/router-utils.test.ts (9 cases):
- extractCtxSize: present/value, missing, end-of-argv, non-numeric,
zero (edge), missing status.
- isShardArtefact: positive cases (5-digit, numeric, no zero-padding),
negative cases (clean preset names, non-shard numeric patterns,
shard pattern mid-string).
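
A sketch of what a helper like extractCtxSize might look like, covering
the present, missing, end-of-argv and non-numeric cases listed above.
The signature is an assumption: it takes the launch arguments as a
string array and looks for llama.cpp's --ctx-size flag. How the real
helper handles zero and the "missing status" case is not stated here;
this sketch returns 0 for "0" and undefined when no argument list is
available.

```typescript
// Assumption: the router exposes each model's launch arguments as a string array.
function extractCtxSize(args: readonly string[] | undefined): number | undefined {
  if (args === undefined) return undefined;
  const flagIndex = args.indexOf("--ctx-size");
  // Flag absent, or present as the last element with no value after it.
  if (flagIndex === -1 || flagIndex === args.length - 1) return undefined;
  const raw = args[flagIndex + 1];
  // Only plain decimal digits count as a context size.
  if (!/^\d+$/.test(raw)) return undefined;
  return Number(raw);
}
```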
tests/integration.test.ts (2 new cases):
- "server cert is publicly trusted": verifies that curl without the
--cacert flag reaches /health. Catches a Let's Encrypt regression
(cert reverting to self-signed).
- "chat completion returns usage with prompt_tokens_details":
sanity-checks the server contract our stream.ts now reads for
cache-token reporting. Picks any loaded runnable model; skips
cleanly when none loaded.
Totals: 28 tests (13 messages + 9 router-utils + 6 integration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two small fixes:
ai-server/stream.ts
- llama.cpp reports cached prompt tokens via
usage.prompt_tokens_details.cached_tokens
and we were ignoring it. Populate output.usage.cacheRead so pi's
footer can show the "R<tokens>" field. cacheRead is a subset of
prompt_tokens (already counted in input), so totalTokens stays
input + output — no double-counting.
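
The accounting above can be sketched as follows. The response field
names follow the OpenAI-style usage object quoted in this message, and
the output field names are the ones mentioned here; the type and
function names are hypothetical.

```typescript
// Usage object as reported in the chat completion response.
type LlamaUsage = {
  prompt_tokens: number;
  completion_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
};

// Hypothetical shape for output.usage, limited to the fields discussed here.
type OutputUsage = { input: number; output: number; cacheRead: number; totalTokens: number };

function mapUsage(usage: LlamaUsage): OutputUsage {
  const input = usage.prompt_tokens;
  const output = usage.completion_tokens;
  // Cached tokens are already part of prompt_tokens, so they are reported
  // separately but never added to the total.
  const cacheRead = usage.prompt_tokens_details?.cached_tokens ?? 0;
  return { input, output, cacheRead, totalTokens: input + output };
}
```

With prompt_tokens = 100, cached_tokens = 80 and completion_tokens = 20,
this yields cacheRead = 80 and totalTokens = 120, not 200.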
dark-mechanicus-indicator.ts
- Pi appends "Working... (ESC to interrupt)" next to custom working
indicator frames via a separate message slot. Call
ctx.ui.setWorkingMessage("") on session_start + every turn_start to
clear that suffix so the indicator line is just
⚙ <quote> · <elapsed>
with no trailing "Working...".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Now that Caddy serves a Let's Encrypt cert for ai.shahondin1624.de,
passing `ca: root-ca.pem` to https.request made Node replace its
default trust store with only the private CA. The LE-issued server cert
does not chain to that CA, so every request failed with
ECONNRESET / CERT_HAS_EXPIRED-style errors on the client side.
Dropping the `ca:` option lets Node fall back to its built-in Mozilla
CA bundle, which includes Let's Encrypt. The client cert/key are still
passed; mTLS remains enforced server-side.
If the server ever reverts to a self-signed cert, re-add `ca: certs.ca`
or set NODE_EXTRA_CA_CERTS.
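
A sketch of the resulting TLS options, with the private CA as an
explicit opt-in for the self-signed case. buildTlsOptions and its
parameter names are hypothetical; cert, key and ca are the standard
option names accepted by https.request.

```typescript
// PEM contents loaded elsewhere; plain strings keep the sketch dependency-free.
type ClientCerts = { cert: string; key: string; ca: string };
type TlsOptions = { cert: string; key: string; ca?: string };

function buildTlsOptions(certs: ClientCerts, pinPrivateCa: boolean = false): TlsOptions {
  // Client cert and key are always sent, so mTLS keeps working.
  const options: TlsOptions = { cert: certs.cert, key: certs.key };
  // Setting `ca` replaces Node's built-in trust store, so only do it when
  // the server presents a cert issued by the private CA.
  if (pinPrivateCa) options.ca = certs.ca;
  return options;
}
```

The returned object would be spread into the https.request options
alongside host, path and method.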
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- All [ai-server] / [markdown-body-color] / [mechanicus-thinking-label]
console.log calls now fire only when PI_DEBUG is set. Default boot is
clean.
- ai-server's discoverModels now filters out ids matching
/-\d+-of-\d+$/ — llama.cpp's --models-autoload registers every GGUF
shard as its own id, duplicating the preset's consolidated model.
These shard-named phantoms are no longer surfaced to pi.
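
A sketch of the shard filter using the pattern quoted above. The name
isShardArtefact matches the helper named elsewhere in this log;
visibleModelIds is a hypothetical wrapper added for illustration.

```typescript
// Matches ids that end in "-<n>-of-<m>", the suffix used for GGUF shards.
const SHARD_SUFFIX = /-\d+-of-\d+$/;

function isShardArtefact(id: string): boolean {
  return SHARD_SUFFIX.test(id);
}

// Hypothetical wrapper: drop shard ids before surfacing the list to pi.
function visibleModelIds(ids: readonly string[]): string[] {
  return ids.filter((id) => !isShardArtefact(id));
}
```

The `$` anchor means the pattern is only treated as a shard marker at
the end of the id, so the same pattern mid-string is left alone.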
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- ai-server/: multi-file pi extension that talks to a remote llama.cpp
router over mTLS (custom streamSimple), with dynamic model discovery
and admin slash commands for load/unload/ctx-size/restart/preset.
Includes README.md documenting the full mTLS + systemd + Caddy setup.
- local-llama.ts: minimal extension registering a local llama.cpp server
as an OpenAI-compatible provider.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>