Local Agentic System — Planning Agent Prompt
You are a senior software architect and implementation planner. Your task is to produce a detailed, phased implementation plan for the system described below. The plan must respect every constraint and design decision listed here — none of them are negotiable unless explicitly marked as flexible. Produce a dependency-ordered build sequence, identify risks early, and flag any decision that requires user input before proceeding.
Project Intent
Build a fully local, privacy-respecting, agentic AI system that can handle software development, file and document management, system administration, and research tasks. The system must be deployable on a single machine and horizontally distributable across multiple machines with minimal configuration change. Performance of orchestration and service infrastructure must be as high as possible. Model inference latency is acceptable. The user must retain full control and auditability of all agent behavior.
Hardware
- Primary machine (GPU): AMD Ryzen 7 2700x, 64GB DDR4, AMD Radeon RX 9070 XT (16GB VRAM, RDNA4). ROCm support for RDNA4 is unconfirmed — Ollama on bare metal (not in Docker) is preferred for GPU access to avoid ROCm-in-Docker complexity.
- Secondary machine (Server): 32GB DDR3, CPU only. Suitable for orchestrator, memory service, search service, and other non-inference workloads.
- Network overhead between machines is acceptable.
Core Architectural Principles
1. Microservice Architecture (gRPC + Protobuf)
- All services communicate via gRPC over Protocol Buffers.
- On a single machine: Docker internal network (service DNS, e.g. `memory-service:50051`).
- On multiple machines: Docker Swarm encrypted overlay network (`driver: overlay`, `encrypted: true`). This replaces mTLS between internal services — Docker handles key exchange. No manual certificate management for internal traffic.
- External edge traffic only: Caddy (v2) handles HTTPS termination. Internal services are never exposed externally. Caddy proxies to the orchestrator entry point only.
- gRPC backends must speak h2c (HTTP/2 cleartext) since Caddy terminates TLS at edge.
- Unix socket vs TCP is a configuration concern only — service code must not care which transport it uses.
2. Service Isolation via Docker
- Two Docker networks:
  - `edge`: Caddy ↔ orchestrator entry point only.
  - `internal`: All inter-service communication. Declared `internal: true` in Compose — Docker enforces no external routing.
- Ollama runs on the host directly (not in Docker). The model-gateway service reaches it via `host.docker.internal:11434`.
- Secrets service has access to the host keyring (see Secrets section).
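A minimal Compose sketch of the two-network layout described above. Service names, image tags, and build paths are illustrative assumptions; `host-gateway` as an `extra_hosts` target requires Docker 20.10 or later.

```yaml
networks:
  edge: {}
  internal:
    internal: true        # Docker refuses to route this network externally

services:
  caddy:
    image: caddy:2
    ports: ["443:443"]
    networks: [edge]      # edge only; never sees internal services
  orchestrator:
    build: ./orchestrator
    networks: [edge, internal]   # the only service on both networks
  model-gateway:
    build: ./model-gateway
    networks: [internal]
    extra_hosts:
      - "host.docker.internal:host-gateway"  # reach bare-metal Ollama on :11434
```

The `internal: true` declaration is what makes the isolation structural rather than policy-based: no firewall rule to forget.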
3. Language Boundaries
- Rust (Tonic + Tokio): All performance-critical services: model gateway, memory service, tool broker, secrets service, audit service. Tonic for gRPC, Axum if any HTTP endpoint is needed.
- Python (asyncio + uvloop + grpcio): Orchestrator only. Coordination logic, not performance-critical. Justified because inference latency dwarfs all orchestration overhead.
- Python: Search service (Trafilatura/readability-lxml are Python libraries, service is I/O bound).
- No LangChain, CrewAI, or similar frameworks. Direct gRPC calls, direct Ollama HTTP. Explicit is better than abstracted here.
4. Python Performance Stack (non-negotiable substitutions)
- `uvloop` — replaces the default asyncio event loop with libuv (C). Install at process entry before anything else.
- `orjson` — Rust-backed JSON. Replaces stdlib `json` everywhere.
- `msgspec` — Rust-backed schema validation. Replaces Pydantic for all agent schemas.
- `grpcio` — standard Python gRPC client.
- `httpx` — async HTTP client for Ollama calls where gRPC is not used.
Service Definitions
Model Gateway (Rust)
- Wraps Ollama HTTP API.
- Exposes streaming token inference and synchronous inference via gRPC.
- Handles model routing: dispatches to different Ollama models based on task complexity hint in request (simple → 3B/7B, reasoning/code → 14B).
- Never manages model weights directly — Ollama does that.
Memory Service (Rust)
- Storage: DuckDB with VSS extension (vector similarity search).
- Embedding model: nomic-embed-text via Ollama. Embeddings generated per field.
- Memory entry schema: `id, name, description, tags[], correlating_ids[], corpus, name_embedding, description_embedding, corpus_embedding, created_at, last_accessed, access_count, source_provenance`
- Staged retrieval (coarse-to-fine, non-negotiable):
  - Embed query → cosine similarity on `name_embedding` → top 20
  - Cosine similarity on `description_embedding` of the top 20 → top 5
  - Full corpus load of the top 5 → optional corpus embedding re-rank
  - Correlation expansion: the agent may request descriptions of `correlating_ids` for top results, then decide whether to pull the full corpus of any.
- Extraction step: When a corpus is read in full, a lightweight model call extracts only the segment relevant to the query. This extracted segment — not the full corpus — is what enters agent context and what is cached.
- Cache: Keyed on semantic similarity of query (embedding-based, not exact string match). Cache entry stores extracted relevant segment + provenance. TTL configurable per memory type. Invalidated on write to any memory in the result set.
- Memory write gating: Subagents flag `new_memory_candidates[]` in the return schema. The orchestrator decides what to persist. Prevents noise accumulation.
- Memory poisoning protection: Memories sourced from external content (web, files) are tagged `source: external`. This tag survives retrieval and is visible to agents, framing the content as data, not instructions. External-sourced writes pass through a summarization step that strips imperative constructions before storage.
- Exposes: `QueryMemory` (streaming), `WriteMemory`, `GetCorrelated`.
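The staged retrieval can be sketched in pure Python. In the real service these similarity passes would be DuckDB VSS queries over the embedding columns; the entry dicts and function names here are illustrative only.

```python
from math import sqrt


def cosine(a: list[float], b: list[float]) -> float:
    """Plain cosine similarity; DuckDB's VSS extension would do this in SQL."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def staged_retrieval(query_emb, entries, coarse_k=20, fine_k=5):
    """Coarse-to-fine: cheap name embeddings first, descriptions second,
    and only the final few entries ever have their corpus loaded."""
    # Stage 1: cosine similarity on name_embedding -> top coarse_k (20)
    coarse = sorted(entries,
                    key=lambda e: cosine(query_emb, e["name_embedding"]),
                    reverse=True)[:coarse_k]
    # Stage 2: cosine similarity on description_embedding of those -> top fine_k (5)
    fine = sorted(coarse,
                  key=lambda e: cosine(query_emb, e["description_embedding"]),
                  reverse=True)[:fine_k]
    # Stage 3: full corpus load of the survivors (optional corpus-embedding re-rank)
    return [e["id"] for e in fine]
```

The staging is what keeps retrieval cheap: the expensive corpus load touches at most `fine_k` rows regardless of store size.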
Tool Broker (Rust)
- Single enforcement point for all tool calls. No tool executes without passing through the broker.
- Loads Agent Type Manifest at startup. Manifest is static config, never passed to any model.
- Enforcement layers (in order):
- Session override check (user-set flags, see User Control section)
- Agent type manifest check (is this tool in the agent type's allowed set?)
- Lineage constraint (spawned agents cannot exceed parent's allowed tool set — enforced by intersecting allowed sets up the spawn chain)
- Path allowlist check for filesystem tools (per agent type, broker-enforced)
- Network egress check: web tools must route through SearXNG only, no arbitrary external HTTP
- Execute or return structured denial
- Tool discovery (separate from execution):
- Agent sends task description → broker calls tool selector (lightweight model, ~100 token prompt) → returns ranked tool names → broker intersects with agent's allowed set → agent receives only the intersection, fully defined.
- Agents never receive their full allowed tool set — only what's relevant to the current task (typically 2–4 tools).
- Loop/thrash detection: Broker tracks per-session: spawn depth (hard max: 4), identical tool call repetition (hard max: 3 within window), total tool calls per task (configurable). On limit hit: graceful failure returned to orchestrator, not silent hang.
- Credential injection: Tools requiring credentials declare a placeholder in their definition. Broker fetches actual credential from Secrets Service at execution time. Credential value never appears in any agent context or log entry.
- Prompt injection firewall: All external content (tool results, file reads, web content) passes through a sanitization step before being returned as tool results. Rule-based filter for injection patterns ("ignore previous instructions", "you are now", imperative AI-directed sentences, etc.).
- Tool result tagging: All tool results are tagged `[TOOL_RESULT: UNTRUSTED]` in the response framing so the model has explicit context that this is data, not instruction.
- Exposes: `DiscoverTools`, `ExecuteTool` (streaming for long ops), `ValidateCall` (dry-run).
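The enforcement layering can be sketched as follows. The real broker is Rust; this Python mock only shows the ordering and the structured-denial shape, and all names are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    override: str = "NONE"                      # NONE | RELAX | ALL
    disabled_tools: set = field(default_factory=set)


def check_call(session, manifest, lineage, agent_type, tool, path=None):
    """Run the enforcement layers in order; the first failing layer
    returns a structured denial rather than raising or hanging."""
    # 1. Session override check (user-set flags win)
    if tool in session.disabled_tools:
        return {"allowed": False, "reason": f"{tool} disabled for session"}
    if session.override == "ALL":
        return {"allowed": True, "reason": "session override ALL"}
    # 2. Agent type manifest check, then 3. lineage intersection:
    #    a spawned agent can never exceed any ancestor's allowed set
    allowed = set(manifest[agent_type]["allowed_tools"])
    for ancestor in lineage:
        allowed &= set(manifest[ancestor]["allowed_tools"])
    if tool not in allowed:
        return {"allowed": False, "reason": "not in effective allowed set"}
    # 4. Path allowlist check for filesystem tools
    paths = manifest[agent_type].get("allowed_paths", {}).get(tool)
    if paths is not None:
        if path is None or not any(path.startswith(p) for p in paths):
            return {"allowed": False, "reason": "path outside allowlist"}
    # 5. (egress check and execution would follow here)
    return {"allowed": True, "reason": "ok"}
```

Because every layer returns the same denial shape, the orchestrator can treat all refusals uniformly and log them to the audit service.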
Secrets Service (Rust)
- Wraps the Linux Secret Service API via D-Bus → libsecret → GNOME Keyring or KWallet. No `.env` files. No plaintext credentials anywhere.
- Alternative if there is no persistent desktop session (server deployment): Linux kernel keyring (`keyutils`) — stored in kernel memory, never hits disk.
- Only the Tool Broker's service identity is authorized to call the Secrets Service. Enforced at the network/TLS identity level on multi-machine deployments, and by Docker internal network isolation on a single machine.
- The D-Bus socket is mounted into the secrets-service container, or the secrets service runs on the host with a Unix socket mounted into Docker.
- Exposes: `GetSecret` only. No list, no write from the broker — secrets are pre-populated by the user directly.
Audit Service (Rust)
- Append-only structured log of every: tool invocation, broker decision (allow/deny), memory read/write, subagent spawn, session config change.
- Write-only gRPC interface exposed to all other services. No read RPC. Reads are only possible via direct access to the audit process/file by the user.
- Log entries include: timestamp, agent_id, agent_type, lineage[], action, tool_name, params_hash (not params — never log credential-adjacent data), result_status, session_id.
- Exposes: `Append` only.
Search Service (Python)
- Wraps local SearXNG instance (Docker, JSON API).
- Pipeline: SearXNG query → snippet relevance filter → readability-lxml (libxml2 backed) for clean text extraction → summarization model call → structured result.
- Raw web content never enters any agent context. Always summarized first.
- Structured result schema: `{claim, source_url, confidence, date, summary}`.
- Exposes: `Search` (gRPC).
Orchestrator (Python)
- Coordinates agent lifecycle. Does not directly execute tools.
- Own allowed tools: `[memory_read, memory_write]` only.
- Decomposes user requests into subtasks with explicit dependency tags.
- Dispatches independent subtasks in parallel via `asyncio.gather`.
- Receives only compressed summaries from subagents — never raw tool output or file contents.
- Manages rolling context compaction: when context exceeds 60% of window, oldest N turns are summarized (async, background) and replaced with a structured bullet summary preserving decisions, artifacts (paths only), and open questions.
- Tracks session token budget and triggers compaction earlier if burn rate is high.
- Decides what memory candidates from subagents to persist.
- Applies confidence signals from subagent results to decide whether to trust, verify, or surface uncertainty to user.
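A dispatch sketch for the decomposition described above: subtasks carry dependency tags, and each wave of ready subtasks runs concurrently under `asyncio.gather`. The subtask dict shape and `run_subtask` stub are illustrative assumptions.

```python
import asyncio


async def run_subtask(subtask: dict) -> dict:
    # In the real system this spawns a subagent via the broker and awaits its
    # structured JSON return; here we just echo a success for illustration.
    await asyncio.sleep(0)
    return {"status": "success", "summary": subtask["id"]}


async def dispatch(subtasks: list[dict]) -> dict[str, dict]:
    """Run subtasks in dependency order; independent ones run concurrently."""
    done: dict[str, dict] = {}
    pending = list(subtasks)
    while pending:
        # A wave is every pending subtask whose dependencies are all satisfied
        wave = [t for t in pending if all(d in done for d in t["deps"])]
        if not wave:
            raise RuntimeError("dependency cycle in subtask graph")
        results = await asyncio.gather(*(run_subtask(t) for t in wave))
        for t, r in zip(wave, results):
            done[t["id"]] = r
        pending = [t for t in pending if t["id"] not in done]
    return done
```

Explicit dependency tags make the parallelism safe by construction: a subtask can never observe a half-finished prerequisite.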
Agent Types and Manifest
Static YAML config, loaded at startup, never in any context window:
```yaml
agent_types:
  orchestrator:
    allowed_tools: [memory_read, memory_write]
    can_spawn: [assistant, researcher, coder, sysadmin]
    inherit_constraints: true
  assistant:
    allowed_tools: [web_search, memory_read, memory_write]
    can_spawn: [researcher]
    inherit_constraints: true
  researcher:
    allowed_tools: [web_search, memory_read, memory_write, fs_read]
    can_spawn: []
  coder:
    allowed_tools: [fs_read, fs_write, run_code, memory_read, memory_write, web_search]
    allowed_paths:
      fs_write: ["~/projects/", "/tmp/agent-sandbox/"]
      fs_read: ["~/projects/", "/tmp/agent-sandbox/"]
    can_spawn: [researcher]
    inherit_constraints: true
  sysadmin:
    allowed_tools: [fs_read, fs_write, run_shell, package_install, memory_read]
    allowed_paths:
      fs_write: ["~/projects/", "/tmp/agent-sandbox/", "/etc/managed/"]
      fs_read: ["~/", "/etc/", "/var/log/"]
    can_spawn: [coder]
    inherit_constraints: true
```
`inherit_constraints: true` means a spawned agent's effective allowed set is `intersection(its_own_manifest, parent's_effective_allowed_set)`. This is enforced by the broker via the lineage chain passed in every tool call.
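A worked example of that intersection, using the manifest above (only the two relevant agent types reproduced):

```python
MANIFEST = {
    "sysadmin": {"allowed_tools": ["fs_read", "fs_write", "run_shell",
                                   "package_install", "memory_read"]},
    "coder": {"allowed_tools": ["fs_read", "fs_write", "run_code",
                                "memory_read", "memory_write", "web_search"]},
}


def effective_allowed(chain: list[str]) -> set[str]:
    """Intersect allowed sets down the spawn chain (root first)."""
    allowed = set(MANIFEST[chain[0]]["allowed_tools"])
    for agent_type in chain[1:]:
        allowed &= set(MANIFEST[agent_type]["allowed_tools"])
    return allowed


# A coder spawned by a sysadmin loses run_code, memory_write, and web_search,
# because the parent never held them:
assert effective_allowed(["sysadmin", "coder"]) == {"fs_read", "fs_write", "memory_read"}
```

The intersection only ever shrinks, so no spawn chain can launder a capability back in.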
User Control (Session Config)
Set before a session or per message. A per-message override reverts to the default after that message.
OverrideLevel:
NONE — full manifest + broker enforcement (default)
RELAX — high-risk tools unlocked, lineage still enforced
ALL — broker passes everything through
Additional flags:
- `--disable-tool=<name>`: blacklist a specific tool for the session
- `--grant <agent_type>:<tool>`: whitelist a specific tool for a specific agent type
Override flags are part of session config, stored outside agent context, read by broker only.
Context Window Management
Subagent context structure (ordered, static parts first for prefix cache reuse):
[System prompt + tool definitions] ← static, prefix-cached
[Task context] ← injected by orchestrator, minimal
[Tool call history] ← last 2-3 turns verbatim, older summarized
[Current turn]
Orchestrator context structure:
[System prompt] ← static
[Compacted history summary] ← rolling, max ~200 tokens
[Current task state] ← structured JSON
[Last N subagent results] ← summaries only, never raw output
[Current turn]
Compaction trigger: 60% of context window consumed.
Compaction method: model call on oldest N turns →
"Summarize preserving: decisions made, artifacts produced (paths only), open questions."
Token budget: tracked per session, injected as max_tokens per API call.
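The trigger reduces to a simple check. The 60% threshold comes from this section; the burn-rate heuristic and its numbers are illustrative assumptions.

```python
def should_compact(used_tokens: int, window: int,
                   burn_rate: float = 0.0, high_burn: float = 0.02) -> bool:
    """Compact at 60% of the window, or earlier when burn rate is high.
    burn_rate = fraction of the window consumed per turn (illustrative metric)."""
    threshold = 0.60 if burn_rate < high_burn else 0.45  # earlier under high burn
    return used_tokens >= threshold * window
```

Because compaction runs as an async background summarization, triggering it early is cheap; triggering it late risks truncation mid-task.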
Subagent Return Schema (non-negotiable contract)
Every subagent returns structured JSON only. Orchestrator never parses free text:
```json
{
  "status": "success | partial | failed",
  "summary": "3 sentence max",
  "artifacts": ["path/to/file"],
  "result_quality": "verified | inferred | uncertain",
  "source": "tool_output | model_knowledge | web",
  "new_memory_candidates": [],
  "failure_reason": null
}
```
Failure is always structured. Subagents never improvise on failure or hallucinate results.
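A validation sketch for this contract. The real stack mandates msgspec for this; stdlib `json` is used here only so the sketch stands alone, and the field checks mirror the schema above.

```python
import json

REQUIRED = {"status", "summary", "artifacts", "result_quality",
            "source", "new_memory_candidates", "failure_reason"}
STATUS = {"success", "partial", "failed"}
QUALITY = {"verified", "inferred", "uncertain"}


def parse_return(raw: str) -> dict:
    """Reject anything that is not the exact structured contract;
    free text from a subagent is a hard failure, never parsed leniently."""
    payload = json.loads(raw)              # raises on free text
    missing = REQUIRED - payload.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if payload["status"] not in STATUS:
        raise ValueError("invalid status")
    if payload["result_quality"] not in QUALITY:
        raise ValueError("invalid result_quality")
    return payload
```

Failing loudly at the boundary keeps the orchestrator's own logic free of free-text parsing heuristics.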
Models (via Ollama on bare metal)
| Role | Model | VRAM (4-bit) |
|---|---|---|
| Orchestrator / tool selection | Qwen2.5 7B Instruct | ~4GB |
| Coder subagent | Qwen2.5-Coder 14B Instruct | ~8GB |
| Researcher / summarization | Qwen2.5 14B Instruct | ~8GB |
| Embeddings | nomic-embed-text | minimal |
Models are swapped by Ollama between tasks. Two 14B models do not run simultaneously. The server (CPU) can run the orchestrator model permanently to keep GPU VRAM free for subagent models.
Protobuf Contracts (to be written first, before any service code)
Services to define: ModelGateway, MemoryService, ToolBroker, SecretsService, AuditService, SearchService. Key patterns:
- Streaming where output is large or progressive (token generation, corpus retrieval, long tool ops).
- Blocking/unary where simplicity is preferred (cache hits, validation, writes).
- Every message includes `session_id` and `agent_lineage[]` fields for audit and broker enforcement.
Implementation Order (dependency-ordered, do not reorder arbitrarily)
0. ROCm / Ollama verification — confirm the RX 9070 XT is recognized by Ollama on bare metal. Pull and run Qwen2.5 7B Instruct and nomic-embed-text. Establish baseline inference speed.
1. Proto definitions — write all `.proto` files before any service code. Generate Rust (prost) and Python (grpcio-tools) stubs. All service contracts are locked here.
2. Audit Service — build first because everything else logs to it. Write-only gRPC, append-only file log, no read endpoint.
3. Secrets Service — build before the broker since the broker depends on it for credential injection. Wrap libsecret. Expose `GetSecret` only. Test with a dummy credential round-trip.
4. Memory Service — DuckDB + VSS setup, staged retrieval implementation, extraction step (model call via the model gateway — so the model gateway must be stubbed first or extraction deferred). Cache layer. Provenance tagging.
5. Model Gateway — Ollama HTTP wrapper, gRPC streaming inference, model routing logic. Dependency of the memory service extraction step and all agents.
6. Search Service — SearXNG Docker container + readability-lxml pipeline + summarization call via the model gateway. Structured result output.
7. Tool Broker — enforcement layers in order, tool discovery (tool selector model call), path allowlists, loop detection, credential injection wired to the secrets service, injection firewall, result tagging. This is the most complex service — build incrementally and test each enforcement layer in isolation.
8. Single subagent (no orchestrator yet) — build one agent type end to end (researcher recommended: web_search + memory_read tools, no filesystem risk). Validate: tool discovery, broker enforcement, memory read/write, return schema, context structure, compaction trigger.
9. Orchestrator — task decomposition schema, parallel dispatch, context management, memory write gating, confidence signal handling, session config application. Wire to a single subagent first.
10. Remaining agent types — coder, sysadmin, assistant. Each adds new tool categories and path allowlists. Test lineage constraint enforcement explicitly.
11. Docker Compose (single machine) — containerize all services except Ollama. Internal network with `internal: true`. Caddy edge config. Verify service DNS routing. Mount the secrets D-Bus socket or run the secrets service on the host.
12. Multi-machine extension — convert the internal network to an encrypted overlay. Move inference workload to the GPU machine, everything else to the server. Verify no service code changes are required.
Abstractions to Track (living reference)
| Abstraction | Owner | Never exposed to |
|---|---|---|
| Agent Type Manifest | Broker (loaded at startup) | Any model |
| Lineage chain | Broker (assembled per call) | Any model |
| Session config / overrides | Broker + session layer | Any model |
| Credential values | Secrets Service → Broker | Any model, any log |
| Enforcement logic | Broker | Any model |
| Path allowlists | Broker | Any model |
| Audit log (read) | User only | Any service |
| Raw web/file content | Search/file tool internals | Orchestrator context |
| Full memory corpus | Memory service | Orchestrator context (extracted segment only) |
| Tool full allowed set | Broker (filtered before delivery) | Agents (see only relevant 2-4) |
Non-Negotiable Constraints Summary
- No tool executes without passing through the broker.
- Lineage constraints are enforced at the broker, not in any prompt.
- Credentials never appear in any agent context, log entry, or environment variable.
- Raw external content (web, files) never enters orchestrator context.
- Audit log is write-only from all services; read access is user-only.
- Memory writes from external sources are sanitized and provenance-tagged before storage.
- Subagent return schema is always structured JSON — no free text to orchestrator.
- Ollama runs on bare metal (not Docker) for GPU compatibility.
- Internal Docker network is declared `internal: true` — no external routing.
- Proto definitions are written and locked before any service implementation begins.