# Implementation Plan — Issue #70: Implement context compaction for subagent

## Metadata
| Field | Value |
|---|---|
| Issue | #70 |
| Title | Implement context compaction for subagent |
| Milestone | Phase 8: First Subagent (Researcher) |
| Labels | — |
| Status | IMPLEMENTING |
| Language | Python |
| Related Plans | issue-069.md, issue-068.md, issue-072.md |
| Blocked by | #69 |
## Acceptance Criteria
- Monitor context window token usage
- Trigger compaction when approaching token limit
- Summarize older tool results into concise summaries
- Preserve system prompt and current task untouched
- Preserve most recent N tool results in full
- Compacted context maintains coherent reasoning chain
## Architecture Analysis

### Service Context

This feature lives entirely in the orchestrator service (`services/orchestrator/`). It modifies two existing modules:

- `prompt.py` — `PromptBuilder` gains a `compact()` method that replaces older history entries with a summarized version
- `researcher.py` — `ResearcherAgent.run()` replaces the current "context overflow -> PARTIAL" termination with a compaction-then-continue strategy
The compaction summary is generated via a non-streaming Inference RPC call to the Model Gateway (the same pattern used by the search service's Summarizer class).
### Existing Patterns

- Search service summarizer (`services/search/src/search_service/summarizer.py`): Demonstrates the `Inference` (non-streaming, unary) RPC pattern — build an `InferenceRequest` with `InferenceParams`, call `self._stub.Inference(request)`, read `response.text`. Uses `TASK_COMPLEXITY_SIMPLE`. Falls back to truncation on gRPC error. This is the exact pattern we will follow for generating compaction summaries.
- `PromptBuilder.needs_compaction()` (`prompt.py` on the issue-69 branch): Already implements the 60% threshold check using `estimate_tokens()` (chars / 4 heuristic). Currently, when this returns `True`, the researcher loop terminates with PARTIAL. After this issue, it triggers compaction instead.
- `ModelGatewayClient` (`clients.py` on the issue-69 branch): Currently only exposes `stream_inference()` (streaming RPC). We need to add an `inference()` method for the non-streaming unary `Inference` RPC used by compaction.
- `AgentConfig` (`config.py`): Holds `max_tokens: int = 4096` — the total context budget. The compaction threshold is derived as `int(max_tokens * 0.6)`.
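The 60% threshold check described above is simple enough to sketch. This is an illustrative reconstruction, not the code on the issue-69 branch: the chars/4 heuristic and the 0.6 factor come from the plan, while the class shape and field names here are assumptions.

```python
# Illustrative sketch of the needs_compaction() threshold check.
# The chars/4 heuristic and the 0.6 factor come from the plan; the
# class shape and field names are assumptions for this sketch.

COMPACTION_THRESHOLD_RATIO = 0.6


def estimate_tokens(text: str) -> int:
    """Cheap token estimate: roughly 4 characters per token."""
    return len(text) // 4


class PromptBuilderSketch:
    def __init__(self, max_tokens: int = 4096) -> None:
        self.max_tokens = max_tokens
        self._parts: list[str] = []

    def add(self, text: str) -> None:
        self._parts.append(text)

    def needs_compaction(self) -> bool:
        # Trigger when the estimated prompt size exceeds 60% of the budget.
        budget = int(self.max_tokens * COMPACTION_THRESHOLD_RATIO)
        return estimate_tokens("\n".join(self._parts)) > budget
```

With `max_tokens=4096` this trips once the assembled prompt exceeds roughly 2457 estimated tokens (about 9,800 characters).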
### Context Window Structure (from researcher spec)

```
+--------------------------------------------------+
| 1. SYSTEM PROMPT        (never compacted)        |
| 2. TOOL DEFINITIONS     (never compacted)        |
| 3. TASK DESCRIPTION     (never compacted)        |
+--------------------------------------------------+
| 4. MEMORY CONTEXT       (compactable)            |
| 5. HISTORY entries      (older ones compacted)   |
|    - tool_call                                   |
|    - tool_result                                 |
|    - reasoning                                   |
+--------------------------------------------------+
| 6. MOST RECENT entries  (never compacted)        |
|    (last 2-3 tool results preserved in full)     |
+--------------------------------------------------+
```
### Dependencies

- Model Gateway `Inference` RPC — non-streaming unary endpoint, already implemented in the Model Gateway service (issue #42). The orchestrator's `ModelGatewayClient` needs a new `inference()` method to call it.
- `llm_multiverse.v1.model_gateway_pb2` — `InferenceRequest`, `InferenceParams`, `InferenceResponse` messages (already generated).
- No new external libraries required.
## Implementation Steps

### 1. Configuration Constants

Add compaction-related constants to `prompt.py` (module-level, not in config — these are prompt builder internals):

```python
from llm_multiverse.v1 import model_gateway_pb2

# Number of most recent history entries to preserve in full during compaction.
COMPACTION_PRESERVE_RECENT = 3

# Maximum tokens for the compaction summary output.
COMPACTION_SUMMARY_MAX_TOKENS = 300

# Task complexity for the compaction inference call (use SIMPLE — this is
# a straightforward summarization task, no complex reasoning needed).
COMPACTION_TASK_COMPLEXITY = model_gateway_pb2.TASK_COMPLEXITY_SIMPLE
```

These are not user-facing configuration. `COMPACTION_PRESERVE_RECENT = 3` keeps the most recent tool call/result pairs intact so the model has immediate context for its next decision. The 300-token summary cap is generous enough for a structured bullet list but small enough to free meaningful space.
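As a back-of-envelope check on "free meaningful space" (the 1,500-token history figure below is purely illustrative, not a measurement):

```python
# Rough arithmetic for one compaction with the defaults above.
# old_history_tokens is an assumed example figure, not a measurement.
max_tokens = 4096
threshold = int(max_tokens * 0.6)   # compaction triggers above this
old_history_tokens = 1500           # assumed size of the entries being summarized
freed = old_history_tokens - 300    # summary replaces them at <= 300 tokens
print(threshold, freed)
```

So a single compaction of 1,500 tokens of old history buys back roughly half the 2,457-token trigger budget.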
### 2. Compaction Prompt Template

Define the compaction prompt as a module-level constant in `prompt.py`:

```python
COMPACTION_PROMPT_TEMPLATE = """\
Summarize the following agent interaction history into a concise structured summary.
Preserve:
- Decisions made and their rationale
- Artifacts produced (file paths, URLs, tool names — identifiers only)
- Key findings and facts discovered
- Open questions or unresolved issues
- Tool calls that failed and why
Format as a bullet list. Be concise. Do not include raw tool output.
## History to summarize
{history}
"""
```
Key design decisions:
- The prompt explicitly instructs the model to preserve decisions, artifacts (paths only), and open questions — this ensures the compacted summary maintains a coherent reasoning chain.
- "Do not include raw tool output" prevents the summary from just re-stating verbose tool results.
- The output is a structured bullet list, which is token-efficient and easy for the agent model to parse in subsequent iterations.
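As a quick illustration of how the template is rendered (the template below is abridged and the history entries are invented for this sketch; in the real code `compact()` does this formatting):

```python
# Rendering the compaction prompt with str.format. The template is
# abridged here; the example history entries are invented.
COMPACTION_PROMPT_TEMPLATE = """\
Summarize the following agent interaction history into a concise structured summary.
Format as a bullet list. Be concise. Do not include raw tool output.
## History to summarize
{history}
"""

entries = [
    "tool_call: web_search(query='context compaction strategies')",
    "tool_result: 5 results returned; top result discusses summarization",
]
prompt = COMPACTION_PROMPT_TEMPLATE.format(history="\n\n".join(entries))
print(prompt)
```

Because the template's only brace pair is `{history}`, plain `str.format` is safe here; any literal braces added later would need doubling.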
### 3. Add `inference()` to `ModelGatewayClient`

Add a non-streaming unary `inference()` method to `clients.py`:

```python
async def inference(
    self,
    session_context: common_pb2.SessionContext,
    prompt: str,
    max_tokens: int,
    task_complexity: int = model_gateway_pb2.TASK_COMPLEXITY_SIMPLE,
) -> str:
    """Non-streaming inference call. Returns the full response text.

    Used for summarization tasks (context compaction) where streaming
    is unnecessary. Raises grpc.aio.AioRpcError on failure.
    """
    request = model_gateway_pb2.InferenceRequest(
        params=model_gateway_pb2.InferenceParams(
            context=session_context,
            prompt=prompt,
            task_complexity=task_complexity,
            max_tokens=max_tokens,
        ),
    )
    response = await self._stub.Inference(request)
    return response.text
```
This mirrors the search service's `Summarizer.summarize()` call pattern exactly, but without the fallback-to-truncation logic (compaction has its own fallback strategy — see step 4).
### 4. Add `compact()` Method to `PromptBuilder`

Add an async `compact()` method to the `PromptBuilder` class in `prompt.py`. This is the core of the feature.

#### Method Signature

```python
async def compact(
    self,
    gateway: ModelGatewayClient,
    session_context: common_pb2.SessionContext,
) -> bool:
    """Compact older history entries by summarizing them via the Model Gateway.

    Returns True if compaction freed enough space to continue.
    Returns False if the context is still too large after compaction
    (e.g., the fixed sections alone exceed the budget).
    """
```
#### Algorithm

1. Determine what to preserve: the last `COMPACTION_PRESERVE_RECENT` (3) entries in `self._history` are never compacted. If `len(self._history) <= COMPACTION_PRESERVE_RECENT`, there is nothing to compact — return `False`.
2. Split history: partition `self._history` into two slices:
   - `to_compact = self._history[:-COMPACTION_PRESERVE_RECENT]` — older entries to summarize
   - `to_preserve = self._history[-COMPACTION_PRESERVE_RECENT:]` — recent entries kept in full
3. Build the compaction prompt: format `COMPACTION_PROMPT_TEMPLATE` with the content of the `to_compact` entries joined by double newlines.
4. Call Model Gateway Inference: use `gateway.inference()` with the compaction prompt, `COMPACTION_SUMMARY_MAX_TOKENS`, and `TASK_COMPLEXITY_SIMPLE`.
5. Handle failure: if the Inference call raises `grpc.aio.AioRpcError`, fall back to a naive truncation strategy — take the first 200 characters of each compacted entry and join them. This ensures compaction never crashes the loop, matching the search summarizer's graceful degradation pattern.
6. Replace history: set `self._history` to a single `HistoryEntry(kind="compacted_summary", content=summary_text)` followed by the `to_preserve` entries.
7. Also compact memory context: if `self._memory_context` is non-empty, fold it into the compaction input text (prepended as `## Prior Memory Context\n...`) so it is summarized together. Then clear `self._memory_context` — the summary now covers it.
8. Check post-compaction size: call `self.needs_compaction()` again. If still `True`, return `False` (the fixed sections are too large — nothing more can be freed). Otherwise return `True`.
#### Why `compact()` is async

The method calls the Model Gateway's Inference RPC, which is an async gRPC call — the only async operation in `PromptBuilder`. The method receives the `ModelGatewayClient` and `SessionContext` as parameters rather than storing them on the builder, keeping the builder lightweight and testable.
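Under these assumptions, the whole algorithm can be sketched end to end. Everything here is illustrative rather than the final implementation: the `HistoryEntry` shape, the `GatewayError` stand-in for `grpc.aio.AioRpcError` (so the sketch runs without grpcio), the abridged template, and the simplified token heuristic that ignores the fixed sections.

```python
import asyncio
from dataclasses import dataclass, field

COMPACTION_PRESERVE_RECENT = 3
COMPACTION_SUMMARY_MAX_TOKENS = 300
# Abridged stand-in for the real template defined in step 2.
COMPACTION_PROMPT_TEMPLATE = "Summarize this history as bullets:\n{history}"


class GatewayError(Exception):
    """Stand-in for grpc.aio.AioRpcError so the sketch runs without grpcio."""


@dataclass
class HistoryEntry:
    kind: str
    content: str


@dataclass
class PromptBuilderSketch:
    max_tokens: int = 4096
    _history: list = field(default_factory=list)
    _memory_context: str = ""

    def needs_compaction(self) -> bool:
        # chars/4 heuristic against 60% of the budget (fixed sections omitted).
        tokens = sum(len(e.content) for e in self._history) // 4
        return tokens > int(self.max_tokens * 0.6)

    async def compact(self, gateway, session_context) -> bool:
        # 1. Nothing to do if only the preserved tail exists.
        if len(self._history) <= COMPACTION_PRESERVE_RECENT:
            return False
        # 2. Split into older entries (summarized) and the preserved tail.
        to_compact = self._history[:-COMPACTION_PRESERVE_RECENT]
        to_preserve = self._history[-COMPACTION_PRESERVE_RECENT:]
        # 3 + 7. Fold memory context into the summarization input.
        parts = [e.content for e in to_compact]
        if self._memory_context:
            parts.insert(0, "## Prior Memory Context\n" + self._memory_context)
        history_text = "\n\n".join(parts)
        # 4 + 5. Summarize via the gateway; truncate naively on failure.
        try:
            summary = await gateway.inference(
                session_context=session_context,
                prompt=COMPACTION_PROMPT_TEMPLATE.format(history=history_text),
                max_tokens=COMPACTION_SUMMARY_MAX_TOKENS,
            )
        except GatewayError:
            summary = "\n".join(p[:200] for p in parts)
        # 6. Replace the summarized entries with a single summary entry.
        self._history = [HistoryEntry("compacted_summary", summary)] + to_preserve
        self._memory_context = ""
        # 8. Report whether the context now fits the budget.
        return not self.needs_compaction()


class FakeGateway:
    async def inference(self, session_context, prompt, max_tokens):
        return "- searched the web\n- found three relevant sources"


builder = PromptBuilderSketch(max_tokens=100)
builder._history = [HistoryEntry("tool_result", "x" * 50) for _ in range(6)]
ok = asyncio.run(builder.compact(FakeGateway(), session_context=None))
print(ok, len(builder._history), builder._history[0].kind)
```

Note how the fake gateway keeps the sketch testable without any gRPC machinery — the same seam the unit tests in step 6 rely on.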
### 5. Modify Researcher Loop to Use Compaction

In `researcher.py`, change the context overflow handling in the main loop from:

```python
if prompt_builder.needs_compaction():
    return _build_partial_result("Context overflow", used_web_search)
```

to:

```python
if prompt_builder.needs_compaction():
    compacted = await prompt_builder.compact(self._gateway, session_ctx)
    if not compacted:
        # Compaction could not free enough space — terminate
        return _build_partial_result("Context overflow after compaction", used_web_search)
    logger.info(
        "Context compacted for agent %s (iteration %d)", agent_id, iteration
    )
```
This is the minimal change to the loop. The loop continues normally after successful compaction — the next iteration will build the prompt from the compacted history, which fits within the token budget.
#### Edge Case: Repeated Compaction
The loop may trigger compaction multiple times during a long research task. Each compaction summarizes the oldest entries again (including any previous compacted summary). This is acceptable because:
- Each summary is progressively more condensed
- The `COMPACTION_PRESERVE_RECENT` entries are always fresh
- The structured bullet list format survives re-summarization well
### 6. Tests

All tests live in `services/orchestrator/tests/`.
#### `tests/test_prompt.py` — New compaction tests (~10 tests)

Add to the existing `test_prompt.py`:
- `test_compact_replaces_old_history` — Add 6 history entries, compact. Verify `self._history` has 1 compacted summary + 3 preserved entries (4 total).
- `test_compact_preserves_recent_entries` — Add 5 entries, compact. Verify the last 3 entries are identical to the originals (by content).
- `test_compact_too_few_entries_returns_false` — Add only 2 history entries (fewer than `COMPACTION_PRESERVE_RECENT`). Verify `compact()` returns `False` without calling the gateway.
- `test_compact_includes_memory_context` — Set memory context, add history, compact. Verify the compaction prompt sent to the gateway includes the memory context text, and that `_memory_context` is cleared after compaction.
- `test_compact_clears_memory_after_compaction` — Set memory context, compact. Verify `self._memory_context` is empty and a subsequent `build()` does not include the memory section.
- `test_compact_gateway_failure_falls_back_to_truncation` — Mock the gateway to raise `AioRpcError`. Verify compaction still succeeds (returns `True` or `False` depending on size) and uses truncated text instead of a model summary.
- `test_compact_returns_false_when_still_too_large` — Set a very small `max_tokens` where even the system prompt exceeds the budget. Verify `compact()` returns `False`.
- `test_compact_summary_entry_kind` — After compaction, verify the first history entry has `kind="compacted_summary"`.
- `test_compact_prompt_contains_old_entries` — Mock the gateway, capture the prompt passed to `inference()`. Verify it contains the content of the old (compacted) entries but not the preserved recent entries.
- `test_build_after_compact_is_coherent` — Compact, then call `build()`. Verify the resulting prompt string has the compacted summary in the History section followed by the preserved entries, with no duplicate content.
#### `tests/test_clients.py` — New inference method tests (~3 tests)

Add to the existing `test_clients.py`:
- `test_inference_returns_text` — Mock the `Inference` RPC to return a response with `text="summary"`. Verify `inference()` returns `"summary"`.
- `test_inference_passes_params` — Capture the request sent to the mock. Verify `prompt`, `max_tokens`, and `task_complexity` are correctly set.
- `test_inference_grpc_error_propagates` — Mock raises `UNAVAILABLE`. Verify `AioRpcError` propagates.
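The first of these can be sketched with `unittest.mock.AsyncMock`. The stub-injection constructor and the elided request plumbing below are assumptions of this sketch — the real `ModelGatewayClient` builds a full `InferenceRequest` as shown in step 3:

```python
# Hypothetical shape of test_inference_returns_text. The client sketch
# accepts an injected stub; the real constructor and request types differ.
import asyncio
from types import SimpleNamespace
from unittest.mock import AsyncMock


class ModelGatewayClientSketch:
    def __init__(self, stub) -> None:
        self._stub = stub

    async def inference(self, session_context, prompt, max_tokens,
                        task_complexity=0) -> str:
        # Request construction elided; see step 3 for the real method.
        response = await self._stub.Inference(object())
        return response.text


def test_inference_returns_text() -> None:
    stub = SimpleNamespace(
        Inference=AsyncMock(return_value=SimpleNamespace(text="summary"))
    )
    client = ModelGatewayClientSketch(stub)
    assert asyncio.run(client.inference(None, "prompt", 300)) == "summary"


test_inference_returns_text()
```

`AsyncMock` also records the awaited call, which is what `test_inference_passes_params` would inspect via `stub.Inference.await_args`.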
#### `tests/test_researcher.py` — Compaction integration tests (~4 tests)

Modify/add to the existing `test_researcher.py`:
- `test_context_overflow_triggers_compaction_then_continues` — Set small `max_tokens`. Mock the gateway to return: tool call, tool call (triggers compaction), then done signal. Mock the `Inference` RPC (for compaction) to return a short summary. Verify the loop does NOT terminate on context overflow but instead continues and returns SUCCESS.
- `test_compaction_failure_terminates_with_partial` — Set small `max_tokens`. Mock the compaction `Inference` to fail AND make the truncation fallback still too large. Verify a PARTIAL result with the "Context overflow after compaction" reason.
- `test_multiple_compactions_in_long_session` — Set moderate `max_tokens`. Mock the gateway to return many tool calls (triggering compaction twice). Verify both compactions succeed and the loop eventually returns SUCCESS.
- `test_context_overflow_replaces_old_termination` — Verify the old behavior (immediate PARTIAL on `needs_compaction()`) is gone. When `needs_compaction()` fires, compaction is attempted first.
## Files to Create/Modify

| File | Action | Purpose |
|---|---|---|
| `services/orchestrator/src/orchestrator/prompt.py` | Modify | Add `COMPACTION_PROMPT_TEMPLATE`, `COMPACTION_PRESERVE_RECENT`, `COMPACTION_SUMMARY_MAX_TOKENS` constants; add async `compact()` method to `PromptBuilder` |
| `services/orchestrator/src/orchestrator/clients.py` | Modify | Add unary `inference()` method to `ModelGatewayClient` |
| `services/orchestrator/src/orchestrator/researcher.py` | Modify | Replace context overflow termination with compaction-then-continue logic |
| `services/orchestrator/tests/test_prompt.py` | Modify | Add ~10 compaction-related tests |
| `services/orchestrator/tests/test_clients.py` | Modify | Add ~3 tests for the new `inference()` method |
| `services/orchestrator/tests/test_researcher.py` | Modify | Add/modify ~4 tests for compaction integration in the agent loop |
## Risks and Edge Cases

- Compaction summary quality: The compaction summary is only as good as the model producing it. A poor summary could lose critical context, causing the agent to repeat work or make contradictory decisions. Mitigation: the compaction prompt explicitly lists what to preserve (decisions, artifacts, open questions). Using `TASK_COMPLEXITY_SIMPLE` routes to a fast model; if summary quality is poor in practice, this can be bumped to `TASK_COMPLEXITY_COMPLEX` as a tuning knob.
- Compaction latency: Each compaction adds one extra Model Gateway call (non-streaming). For `TASK_COMPLEXITY_SIMPLE` with 300 max tokens, this should be fast (~1-2 s). However, repeated compactions in a long session add cumulative latency. Mitigation: compaction only triggers at 60% capacity, which should be infrequent (typically 0-2 times per research task).
- Recursive summarization degradation: If compaction triggers multiple times, each round summarizes a previous summary, so information loss compounds. Mitigation: the structured bullet list format is resilient to re-summarization, and the most recent 3 entries are always preserved in full, so the model always has concrete recent context.
- Gateway unavailable during compaction: If the Model Gateway is down, compaction falls back to naive truncation. The truncated text may not be coherent. Mitigation: the agent loop can still function with truncated history; it just has less context. If the gateway is fully down, the next `stream_inference()` call will also fail, terminating the loop with FAILED.
- Fixed sections exceed budget: If the system prompt + tool definitions + task description alone exceed `max_tokens * 0.6`, compaction cannot help. Mitigation: `compact()` returns `False`, and the loop terminates with PARTIAL. This is an inherent limitation: the task or tool set is too large for the configured token budget. The operator should increase `max_tokens` in `AgentConfig`.
- Thread safety: `PromptBuilder.compact()` mutates `_history` and `_memory_context`. The researcher loop is single-threaded (one agent per `run()` call), so this is safe. No concurrent access to `PromptBuilder` is expected.
## Deviation Log

(Filled in during implementation if deviations from the plan occur.)

| Deviation | Reason |
|---|---|