Minimal shell wrapper around the llama.cpp router's OpenAI-compatible API
(/v1/chat/completions), gated by the same mTLS cert as the pi extension.
Single file; runtime deps are bash, curl, and jq. Useful for scripts and
agents (Claude Code, etc.) that want to delegate generation without
pulling in a full SDK.

Features:
  --list / --status / --load <model>
  --stream <model> "..."   SSE token-stream output
  --raw <model> '...'      full OpenAI-format JSON bodies (also @file)
  --prompt-file <path>     reads the prompt from disk via jq --rawfile,
                           bypassing Linux's MAX_ARG_STRLEN (128 KiB per
                           argv string) so prompts up to the model's
                           context window work
  --temperature / --top-p / --max-tokens / --system
                           sampling and system-prompt overrides
  Auto-retry with exponential backoff on transient empty/non-JSON
  responses (model-loading window). Short-circuits on structured 4xx
  errors (e.g. exceed_context_size).
  AI_CERT_DIR / AI_ENDPOINT / AI_RETRIES env overrides.

Also adds scripts/AI-COMPLETE.md with install and usage docs, and a row
in the top-level README's scripts table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>