Pi Agent 98f4e01d18 feat: implement context compaction for subagent prompt (issue #70)
Add context compaction to the researcher agent to handle long-running
research tasks that exceed the context window budget. When estimated
tokens exceed 60% of max_tokens, older history entries are summarized
via the Model Gateway's unary Inference RPC and replaced with a
compact bullet-point summary, preserving the 3 most recent entries.

Changes:
- clients.py: Add inference() unary method to ModelGatewayClient
- prompt.py: Add compact() method, compaction prompt template, and
  _truncate_entries() fallback for gateway failures
- researcher.py: Replace hard context overflow termination with
  compaction-then-continue logic
- 93 tests pass with 95%+ coverage on modified files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 20:22:02 +01:00


Implementation Plan — Issue #70: Implement context compaction for subagent

Metadata

| Field | Value |
| --- | --- |
| Issue | #70 |
| Title | Implement context compaction for subagent |
| Milestone | Phase 8: First Subagent (Researcher) |
| Labels | |
| Status | IMPLEMENTING |
| Language | Python |
| Related Plans | issue-069.md, issue-068.md, issue-072.md |
| Blocked by | #69 |

Acceptance Criteria

  • Monitor context window token usage
  • Trigger compaction when approaching token limit
  • Summarize older tool results into concise summaries
  • Preserve system prompt and current task untouched
  • Preserve most recent N tool results in full
  • Compacted context maintains coherent reasoning chain

Architecture Analysis

Service Context

This feature lives entirely in the orchestrator service (services/orchestrator/). It modifies three existing modules:

  • prompt.py — PromptBuilder gains a compact() method that replaces older history entries with a summarized version
  • researcher.py — ResearcherAgent.run() replaces the current "context overflow -> PARTIAL" termination with a compaction-then-continue strategy
  • clients.py — ModelGatewayClient gains a non-streaming inference() method used to generate the compaction summary

The compaction summary is generated via a non-streaming Inference RPC call to the Model Gateway (the same pattern used by the search service's Summarizer class).

Existing Patterns

  • Search service summarizer (services/search/src/search_service/summarizer.py): Demonstrates the Inference (non-streaming, unary) RPC pattern — build an InferenceRequest with InferenceParams, call self._stub.Inference(request), read response.text. Uses TASK_COMPLEXITY_SIMPLE. Falls back to truncation on gRPC error. This is the exact pattern we will follow for generating compaction summaries.
  • PromptBuilder.needs_compaction() (prompt.py on the issue-69 branch): Already implements the 60% threshold check using estimate_tokens() (chars / 4 heuristic). Currently, when this returns True, the researcher loop terminates with PARTIAL. After this issue, it triggers compaction instead.
  • ModelGatewayClient (clients.py on the issue-69 branch): Currently only exposes stream_inference() (streaming RPC). We need to add an inference() method for the non-streaming unary Inference RPC used by compaction.
  • AgentConfig (config.py): Holds max_tokens: int = 4096 — the total context budget. Compaction threshold is derived as int(max_tokens * 0.6).
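The threshold check described in these patterns can be sketched as follows. The chars/4 heuristic, the 0.6 factor, and the 4096 default come from the plan; the free-function shape is an assumption (the real check is PromptBuilder.needs_compaction()):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic from the issue-69 branch: ~4 characters per token.
    return len(text) // 4


def needs_compaction(prompt_text: str, max_tokens: int = 4096) -> bool:
    # Trigger compaction once the estimate crosses 60% of the budget
    # (int(4096 * 0.6) == 2457 tokens for the default config).
    threshold = int(max_tokens * 0.6)
    return estimate_tokens(prompt_text) > threshold
```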

Context Window Structure (from researcher spec)

+---------------------------------------------------+
| 1. SYSTEM PROMPT         (never compacted)        |
| 2. TOOL DEFINITIONS      (never compacted)        |
| 3. TASK DESCRIPTION      (never compacted)        |
+---------------------------------------------------+
| 4. MEMORY CONTEXT        (compactable)            |
| 5. HISTORY entries       (older ones compacted)   |
|    - tool_call                                    |
|    - tool_result                                  |
|    - reasoning                                    |
+---------------------------------------------------+
| 6. MOST RECENT entries   (never compacted)        |
|    (last 2-3 tool results preserved in full)      |
+---------------------------------------------------+

Dependencies

  • Model Gateway Inference RPC — non-streaming unary endpoint, already implemented in the Model Gateway service (issue #42). The orchestrator's ModelGatewayClient needs a new inference() method to call it.
  • llm_multiverse.v1.model_gateway_pb2 — InferenceRequest, InferenceParams, InferenceResponse messages (already generated).
  • No new external libraries required.

Implementation Steps

1. Configuration Constants

Add compaction-related constants to prompt.py (module-level, not in config — these are prompt builder internals):

# Number of most recent history entries to preserve in full during compaction.
COMPACTION_PRESERVE_RECENT = 3

# Maximum tokens for the compaction summary output.
COMPACTION_SUMMARY_MAX_TOKENS = 300

# Task complexity for the compaction inference call (use SIMPLE — this is
# a straightforward summarization task, no complex reasoning needed).
COMPACTION_TASK_COMPLEXITY = model_gateway_pb2.TASK_COMPLEXITY_SIMPLE

These are not user-facing configuration. The COMPACTION_PRESERVE_RECENT = 3 value keeps the most recent tool call/result pairs intact so the model has immediate context for its next decision. The summary max tokens of 300 is generous enough for a structured bullet list but small enough to free meaningful space.

2. Compaction Prompt Template

Define the compaction prompt as a module-level constant in prompt.py:

COMPACTION_PROMPT_TEMPLATE = """\
Summarize the following agent interaction history into a concise structured summary.

Preserve:
- Decisions made and their rationale
- Artifacts produced (file paths, URLs, tool names — identifiers only)
- Key findings and facts discovered
- Open questions or unresolved issues
- Tool calls that failed and why

Format as a bullet list. Be concise. Do not include raw tool output.

## History to summarize
{history}
"""

Key design decisions:

  • The prompt explicitly instructs the model to preserve decisions, artifacts (paths only), and open questions — this ensures the compacted summary maintains a coherent reasoning chain.
  • "Do not include raw tool output" prevents the summary from just re-stating verbose tool results.
  • The output is a structured bullet list, which is token-efficient and easy for the agent model to parse in subsequent iterations.
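For illustration, building the final compaction prompt is a single format() call over the joined entry contents (the template here is abbreviated and the entry strings are hypothetical):

```python
# Abbreviated stand-in for COMPACTION_PROMPT_TEMPLATE; illustration only.
TEMPLATE = (
    "Summarize the following agent interaction history into a concise "
    "structured summary.\n\n## History to summarize\n{history}"
)

# Hypothetical history entry contents.
entries = [
    "[tool_call] web_search(query='context compaction')",
    "[tool_result] 5 results returned (truncated)",
]

# Join entry contents with double newlines, then fill the template.
prompt = TEMPLATE.format(history="\n\n".join(entries))
```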

3. Add inference() to ModelGatewayClient

Add a non-streaming unary inference() method to clients.py:

async def inference(
    self,
    session_context: common_pb2.SessionContext,
    prompt: str,
    max_tokens: int,
    task_complexity: int = model_gateway_pb2.TASK_COMPLEXITY_SIMPLE,
) -> str:
    """Non-streaming inference call. Returns the full response text.

    Used for summarization tasks (context compaction) where streaming
    is unnecessary. Raises grpc.aio.AioRpcError on failure.
    """
    request = model_gateway_pb2.InferenceRequest(
        params=model_gateway_pb2.InferenceParams(
            context=session_context,
            prompt=prompt,
            task_complexity=task_complexity,
            max_tokens=max_tokens,
        ),
    )
    response = await self._stub.Inference(request)
    return response.text

This mirrors the search service's Summarizer.summarize() call pattern, but omits the fallback-to-truncation logic (compaction has its own fallback strategy — see step 5).

4. Add compact() Method to PromptBuilder

Add an async compact() method to the PromptBuilder class in prompt.py. This is the core of the feature.

Method Signature

async def compact(
    self,
    gateway: ModelGatewayClient,
    session_context: common_pb2.SessionContext,
) -> bool:
    """Compact older history entries by summarizing them via the Model Gateway.

    Returns True if compaction freed enough space to continue.
    Returns False if the context is still too large after compaction
    (e.g., the fixed sections alone exceed the budget).
    """

Algorithm

  1. Determine what to preserve: The last COMPACTION_PRESERVE_RECENT (3) entries in self._history are never compacted. If len(self._history) <= COMPACTION_PRESERVE_RECENT, there is nothing to compact — return False.

  2. Split history: Partition self._history into two slices:

    • to_compact = self._history[:-COMPACTION_PRESERVE_RECENT] — older entries to summarize
    • to_preserve = self._history[-COMPACTION_PRESERVE_RECENT:] — recent entries kept in full
  3. Build compaction prompt: Format COMPACTION_PROMPT_TEMPLATE with the content of to_compact entries joined by double newlines.

  4. Call Model Gateway Inference: Use gateway.inference() with the compaction prompt, COMPACTION_SUMMARY_MAX_TOKENS, and TASK_COMPLEXITY_SIMPLE.

  5. Handle failure: If the Inference call raises grpc.aio.AioRpcError, fall back to a naive truncation strategy — take the first 200 characters of each compacted entry and join them. This ensures compaction never crashes the loop, matching the search summarizer's graceful degradation pattern.

  6. Replace history: Set self._history to a single HistoryEntry(kind="compacted_summary", content=summary_text) followed by the to_preserve entries.

  7. Also compact memory context: If self._memory_context is non-empty, fold it into the compaction input text (prepend as "## Prior Memory Context\n...") so it is summarized together. Then clear self._memory_context — the summary now covers it.

  8. Check post-compaction size: Call self.needs_compaction() again. If still True, return False (the fixed sections are too large — nothing more can be freed). Otherwise return True.
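The split, fallback, and replace steps above can be sketched as pure helpers. HistoryEntry and the helper names are assumptions for illustration; the real logic lives inside PromptBuilder.compact():

```python
from dataclasses import dataclass

COMPACTION_PRESERVE_RECENT = 3
TRUNCATE_CHARS = 200  # per-entry limit for the step-5 fallback


@dataclass
class HistoryEntry:
    kind: str
    content: str


def split_history(history: list[HistoryEntry]):
    """Steps 1-2: partition into entries to summarize vs. entries to preserve.

    Returns (None, history) when there are too few entries to compact.
    """
    if len(history) <= COMPACTION_PRESERVE_RECENT:
        return None, history
    return history[:-COMPACTION_PRESERVE_RECENT], history[-COMPACTION_PRESERVE_RECENT:]


def truncate_entries(entries: list[HistoryEntry]) -> str:
    """Step-5 fallback: naive truncation when the gateway call fails."""
    return "\n\n".join(e.content[:TRUNCATE_CHARS] for e in entries)


def rebuild_history(summary: str, preserved: list[HistoryEntry]) -> list[HistoryEntry]:
    """Step 6: one summary entry followed by the preserved recent entries."""
    return [HistoryEntry(kind="compacted_summary", content=summary)] + preserved
```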

Why compact() is async

The method needs to call the Model Gateway's Inference RPC, which is an async gRPC call. This is the only async operation in PromptBuilder. The method receives the ModelGatewayClient and SessionContext as parameters rather than storing them on the builder, keeping the builder lightweight and testable.

5. Modify Researcher Loop to Use Compaction

In researcher.py, change the context overflow handling in the main loop from:

if prompt_builder.needs_compaction():
    return _build_partial_result("Context overflow", used_web_search)

To:

if prompt_builder.needs_compaction():
    compacted = await prompt_builder.compact(self._gateway, session_ctx)
    if not compacted:
        # Compaction could not free enough space — terminate
        return _build_partial_result("Context overflow after compaction", used_web_search)
    logger.info(
        "Context compacted for agent %s (iteration %d)", agent_id, iteration
    )

This is the minimal change to the loop. The loop continues normally after successful compaction — the next iteration will build the prompt from the compacted history, which fits within the token budget.
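The resulting control flow can be exercised with toy stand-ins (FakeBuilder is a test double, not the real PromptBuilder; the single-pass "SUCCESS" is a toy simplification of the agent loop):

```python
import asyncio


class FakeBuilder:
    """Toy stand-in: over budget until compact() succeeds (or never, if it fails)."""

    def __init__(self, compaction_works: bool):
        self._over = True
        self._works = compaction_works

    def needs_compaction(self) -> bool:
        return self._over

    async def compact(self, gateway, session_ctx) -> bool:
        self._over = not self._works
        return self._works


async def run_loop(builder, gateway=None, session_ctx=None) -> str:
    for _ in range(10):  # max iterations, toy value
        if builder.needs_compaction():
            if not await builder.compact(gateway, session_ctx):
                return "PARTIAL"  # "Context overflow after compaction"
        # ... normal inference / tool-call step would run here ...
        return "SUCCESS"  # toy: finish once the prompt fits
    return "PARTIAL"
```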

Edge Case: Repeated Compaction

The loop may trigger compaction multiple times during a long research task. Each compaction summarizes the oldest entries again (including any previous compacted summary). This is acceptable because:

  • Each summary is progressively more condensed
  • The COMPACTION_PRESERVE_RECENT entries are always fresh
  • The structured bullet list format survives re-summarization well

6. Tests

All tests in services/orchestrator/tests/.

tests/test_prompt.py — New compaction tests (~10 tests)

Add to the existing test_prompt.py:

  1. test_compact_replaces_old_history — Add 6 history entries, compact. Verify self._history has 1 compacted summary + 3 preserved entries (4 total).

  2. test_compact_preserves_recent_entries — Add 5 entries, compact. Verify the last 3 entries are identical to the originals (by content).

  3. test_compact_too_few_entries_returns_false — Add only 2 history entries (fewer than COMPACTION_PRESERVE_RECENT). Verify compact() returns False without calling the gateway.

  4. test_compact_includes_memory_context — Set memory context, add history. Compact. Verify the compaction prompt sent to the gateway includes the memory context text. Verify _memory_context is cleared after compaction.

  5. test_compact_clears_memory_after_compaction — Set memory context, compact. Verify self._memory_context is empty and a subsequent build() does not include the memory section.

  6. test_compact_gateway_failure_falls_back_to_truncation — Mock the gateway to raise AioRpcError. Verify compaction still succeeds (returns True or False depending on size) and uses truncated text instead of a model summary.

  7. test_compact_returns_false_when_still_too_large — Set a very small max_tokens where even the system prompt exceeds the budget. Verify compact() returns False.

  8. test_compact_summary_entry_kind — After compaction, verify the first history entry has kind="compacted_summary".

  9. test_compact_prompt_contains_old_entries — Mock the gateway, capture the prompt passed to inference(). Verify it contains the content of the old (compacted) entries but not the preserved recent entries.

  10. test_build_after_compact_is_coherent — Compact, then call build(). Verify the resulting prompt string has the compacted summary in the History section followed by the preserved entries, and no duplicate content.

tests/test_clients.py — New inference method test (~3 tests)

Add to the existing test_clients.py:

  1. test_inference_returns_text — Mock Inference RPC to return a response with text="summary". Verify inference() returns "summary".

  2. test_inference_passes_params — Capture the request sent to the mock. Verify prompt, max_tokens, task_complexity are correctly set.

  3. test_inference_grpc_error_propagates — Mock raises UNAVAILABLE. Verify AioRpcError propagates.
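The mocking pattern for these tests might look like the sketch below. StubClient is a minimal stand-in mirroring the unary call shape from step 3, not the real ModelGatewayClient, and the dict request stands in for InferenceRequest:

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


class StubClient:
    """Minimal stand-in for ModelGatewayClient's unary inference() method."""

    def __init__(self, stub):
        self._stub = stub

    async def inference(self, prompt: str, max_tokens: int) -> str:
        # A plain dict stands in for the real InferenceRequest message.
        request = {"prompt": prompt, "max_tokens": max_tokens}
        response = await self._stub.Inference(request)
        return response.text


def test_inference_returns_text():
    # Mock the unary Inference RPC to return a response with text="summary".
    stub = MagicMock()
    stub.Inference = AsyncMock(return_value=MagicMock(text="summary"))
    client = StubClient(stub)
    assert asyncio.run(client.inference("compact this", 300)) == "summary"
    stub.Inference.assert_awaited_once()
```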

tests/test_researcher.py — Modified compaction integration tests (~4 tests)

Modify/add to the existing test_researcher.py:

  1. test_context_overflow_triggers_compaction_then_continues — Set small max_tokens. Mock the gateway to return: tool call, tool call (triggers compaction), then done signal. Mock the Inference RPC (for compaction) to return a short summary. Verify the loop does NOT terminate on context overflow but instead continues and returns SUCCESS.

  2. test_compaction_failure_terminates_with_partial — Set small max_tokens. Mock compaction Inference to fail AND truncation fallback still too large. Verify PARTIAL result with "Context overflow after compaction" reason.

  3. test_multiple_compactions_in_long_session — Set moderate max_tokens. Mock gateway to return many tool calls (triggers compaction twice). Verify both compactions succeed and the loop eventually returns SUCCESS.

  4. test_context_overflow_replaces_old_termination — Verify the old behavior (immediate PARTIAL on needs_compaction()) is gone. When needs_compaction() fires, compaction is attempted first.

Files to Create/Modify

| File | Action | Purpose |
| --- | --- | --- |
| services/orchestrator/src/orchestrator/prompt.py | Modify | Add COMPACTION_PROMPT_TEMPLATE, COMPACTION_PRESERVE_RECENT, COMPACTION_SUMMARY_MAX_TOKENS constants; add compact() async method to PromptBuilder |
| services/orchestrator/src/orchestrator/clients.py | Modify | Add inference() unary method to ModelGatewayClient |
| services/orchestrator/src/orchestrator/researcher.py | Modify | Replace context overflow termination with compaction-then-continue logic |
| services/orchestrator/tests/test_prompt.py | Modify | Add ~10 compaction-related tests |
| services/orchestrator/tests/test_clients.py | Modify | Add ~3 tests for the new inference() method |
| services/orchestrator/tests/test_researcher.py | Modify | Add/modify ~4 tests for compaction integration in the agent loop |

Risks and Edge Cases

  • Compaction summary quality: The compaction summary is only as good as the model producing it. A poor summary could lose critical context, causing the agent to repeat work or make contradictory decisions. Mitigation: the compaction prompt explicitly lists what to preserve (decisions, artifacts, open questions). Using TASK_COMPLEXITY_SIMPLE routes to a fast model; if summary quality is poor in practice, this can be bumped to TASK_COMPLEXITY_COMPLEX as a tuning knob.
  • Compaction latency: Each compaction adds one extra Model Gateway call (non-streaming). For TASK_COMPLEXITY_SIMPLE with 300 max tokens, this should be fast (~1-2s). However, repeated compactions in a long session add cumulative latency. Mitigation: compaction only triggers at 60% capacity, which should be infrequent (typically 0-2 times per research task).
  • Recursive summarization degradation: If compaction triggers multiple times, each round summarizes a previous summary. Information loss compounds. Mitigation: the structured bullet list format is resilient to re-summarization. The most recent 3 entries are always preserved in full, so the model always has concrete recent context.
  • Gateway unavailable during compaction: If the Model Gateway is down, the compaction falls back to naive truncation. The truncated text may not be coherent. Mitigation: the agent loop can still function with truncated history — it just has less context. If the gateway is fully down, the next stream_inference() call will also fail, terminating the loop with FAILED.
  • Fixed sections exceed budget: If the system prompt + tool definitions + task description alone exceed max_tokens * 0.6, compaction cannot help. Mitigation: compact() returns False, and the loop terminates with PARTIAL. This is an inherent limitation — the task or tool set is too large for the configured token budget. The operator should increase max_tokens in AgentConfig.
  • Thread safety: PromptBuilder.compact() mutates _history and _memory_context. The researcher loop is single-threaded (one agent per run() call), so this is safe. No concurrent access to PromptBuilder is expected.

Deviation Log

(Filled during implementation if deviations from plan occur)

| Deviation | Reason |
| --- | --- |