Implementation Plan — Issue #32: Implement semantic cache
Metadata
| Field | Value |
|---|---|
| Issue | #32 |
| Title | Implement semantic cache |
| Milestone | Phase 4: Memory Service |
| Labels | |
| Status | DONE |
| Language | Rust |
| Related Plans | issue-028.md, issue-029.md, issue-030.md, issue-031.md |
| Blocked by | #31 (completed) |
Acceptance Criteria
- Cache keyed by query embedding similarity (not exact match)
- Configurable similarity threshold for cache hits
- TTL-based cache expiration
- Cache invalidation on new memory writes
- Metrics: cache hit/miss rate tracking
Architecture Analysis
Service Context
This issue belongs to the Memory Service (Rust). It implements the semantic cache layer described in the architecture document:
> Cache: Keyed on semantic similarity of query (embedding-based, not exact string match). Cache entry stores extracted relevant segment + provenance. TTL configurable per memory type. Invalidated on write to any memory in the result set.
The cache sits between the query_memory gRPC handler and the retrieval pipeline. When a query arrives, the cache is checked first by computing the query's embedding and comparing it against cached query embeddings using cosine similarity. If a cache hit is found (similarity exceeds a configurable threshold), the cached results are returned directly, bypassing the full retrieval pipeline and extraction step.
Affected gRPC endpoints:
- `QueryMemory` (server-streaming) — cache lookup is inserted before the retrieval pipeline; cache population is inserted after retrieval + extraction completes.
- `WriteMemory` — on write, the cache must be invalidated for any cached entry whose result set includes the written memory ID.
Proto messages used:
- `QueryMemoryRequest` — the query text is embedded and used as the cache key.
- `QueryMemoryResponse` — cached responses set `is_cached = true`. The `cached_extracted_segment` and `extraction_confidence` fields are populated from the cache entry.
- No proto changes required — the existing `is_cached` field on `QueryMemoryResponse` (field 4) already exists for this purpose.
Existing Patterns
- Config pattern: `RetrievalConfig` and `ExtractionConfig` in `services/memory/src/config.rs` use `#[derive(Debug, Clone, Deserialize)]` with `#[serde(default)]` and named default functions. The cache config should follow the same pattern.
- Builder pattern for service: `MemoryServiceImpl` uses builder methods like `with_embedding_client()` and `with_extraction_client()` at `services/memory/src/service.rs:54-67`. The cache can be built into the service directly (always present, not optional) since it is purely in-memory.
- Embedding generation: The `EmbeddingClient` wrapped in `Arc<Mutex<EmbeddingClient>>` at `services/memory/src/service.rs:26` provides `generate()` to produce query embeddings. The cache will reuse the query embedding already generated for the retrieval pipeline (at `services/memory/src/service.rs:139-145`).
- DuckDB connection pattern: `DuckDbManager::with_connection()` at `services/memory/src/db/mod.rs:84-90` uses `Mutex<Connection>`. The cache is in-memory (not in DuckDB) to avoid database lock contention.
- Module organization: Each feature area has its own module directory (`embedding/`, `extraction/`, `retrieval/`). The cache should follow the same pattern as a `cache/` module.
Dependencies
- No new external crate dependencies — the cache is an in-memory `Vec`-backed structure protected by `tokio::sync::RwLock`. Cosine similarity cannot reuse DuckDB's `array_cosine_similarity` because the cache operates outside DuckDB, so a pure-Rust cosine similarity function is needed.
- Embedding client — the cache requires query embeddings. The `query_memory` handler already generates the query embedding before the pipeline runs (at `services/memory/src/service.rs:139-145`). The cache lookup and population reuse this embedding.
- `std::time::Instant` — for TTL tracking.
- `std::sync::atomic` — for lock-free cache metrics counters.
Implementation Steps
1. Types & Configuration
Add cache configuration to `services/memory/src/config.rs`:

```rust
/// Configuration for the semantic query cache.
#[derive(Debug, Clone, Deserialize)]
pub struct CacheConfig {
    /// Whether the cache is enabled (default: true).
    #[serde(default = "default_cache_enabled")]
    pub enabled: bool,
    /// Cosine similarity threshold for cache hits (default: 0.95).
    /// A cached query embedding must have cosine similarity >= this value
    /// with the incoming query embedding to be considered a hit.
    #[serde(default = "default_cache_similarity_threshold")]
    pub similarity_threshold: f32,
    /// Time-to-live for cache entries in seconds (default: 300 = 5 minutes).
    #[serde(default = "default_cache_ttl_secs")]
    pub ttl_secs: u64,
    /// Maximum number of entries in the cache (default: 1000).
    /// When exceeded, the oldest entry is evicted.
    #[serde(default = "default_cache_max_entries")]
    pub max_entries: usize,
}
```
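The named default functions referenced by the `serde` attributes above would mirror the existing config pattern. A minimal sketch — the function names come from the attributes and the values from the documented defaults:

```rust
/// Default functions backing the `#[serde(default = "...")]` attributes
/// on `CacheConfig`. Values match the documented defaults.
fn default_cache_enabled() -> bool {
    true
}

fn default_cache_similarity_threshold() -> f32 {
    0.95
}

fn default_cache_ttl_secs() -> u64 {
    300 // 5 minutes
}

fn default_cache_max_entries() -> usize {
    1000
}
```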
Add a `cache: CacheConfig` field to the `Config` struct with `#[serde(default)]`.
Define cache types in a new `services/memory/src/cache/mod.rs`:

```rust
use std::time::Instant;

/// A single entry in the semantic cache.
///
/// Stores the query embedding (as the cache key), the retrieval results,
/// and metadata for TTL tracking and invalidation.
#[derive(Debug, Clone)]
pub struct CacheEntry {
    /// The embedding of the cached query (used for similarity matching).
    pub query_embedding: Vec<f32>,
    /// The original query text (for debugging/logging).
    pub query_text: String,
    /// The tag filter used in the original query (cache entries are scoped by tag).
    pub tag_filter: Option<String>,
    /// Cached response data (rank, entry, scores, extraction results).
    pub results: Vec<CachedResult>,
    /// Memory IDs in the result set (for invalidation on write).
    pub result_memory_ids: Vec<String>,
    /// When this entry was created (for TTL expiration).
    pub created_at: Instant,
}

/// A single cached retrieval result.
#[derive(Debug, Clone)]
pub struct CachedResult {
    /// The rank of the result in the original retrieval.
    pub rank: u32,
    /// The memory entry (proto format).
    pub entry: llm_multiverse_proto::llm_multiverse::v1::MemoryEntry,
    /// Cosine similarity score from the retrieval pipeline.
    pub cosine_similarity: f32,
    /// Extracted segment (if extraction was performed).
    pub extracted_segment: Option<String>,
    /// Extraction confidence (if extraction was performed).
    pub extraction_confidence: Option<f32>,
}

/// Cache hit/miss metrics.
#[derive(Debug)]
pub struct CacheMetrics {
    /// Total cache hit count.
    pub hits: std::sync::atomic::AtomicU64,
    /// Total cache miss count.
    pub misses: std::sync::atomic::AtomicU64,
    /// Total cache evictions (TTL or capacity).
    pub evictions: std::sync::atomic::AtomicU64,
    /// Total cache invalidations (due to writes).
    pub invalidations: std::sync::atomic::AtomicU64,
}
```
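Since these counters are independent statistics rather than synchronization points, `Ordering::Relaxed` suffices. A brief sketch (the helper names here are illustrative, not part of the plan's API):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Record a cache hit; Relaxed ordering is enough for a pure statistic.
fn record_hit(hits: &AtomicU64) {
    hits.fetch_add(1, Ordering::Relaxed);
}

/// Read the current hit count for a metrics snapshot.
fn read_hits(hits: &AtomicU64) -> u64 {
    hits.load(Ordering::Relaxed)
}
```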
2. Core Logic
Add the semantic cache manager to the same `services/memory/src/cache/mod.rs`:

```rust
use tokio::sync::RwLock;

use crate::config::CacheConfig;

/// Semantic cache for deduplicating similar queries.
///
/// Keyed by query embedding similarity rather than exact string match.
/// Thread-safe via `RwLock` for concurrent read access (cache lookups)
/// with exclusive write access (cache population, invalidation, eviction).
pub struct SemanticCache {
    config: CacheConfig,
    entries: RwLock<Vec<CacheEntry>>,
    metrics: CacheMetrics,
}

impl SemanticCache {
    /// Create a new semantic cache with the given configuration.
    pub fn new(config: CacheConfig) -> Self;

    /// Look up a cache entry by query embedding similarity.
    ///
    /// Computes cosine similarity between the `query_embedding` and each
    /// cached entry's embedding. Returns the first entry whose similarity
    /// meets or exceeds `config.similarity_threshold` and whose TTL has
    /// not expired.
    ///
    /// Also filters by `tag_filter` — a cached entry is only a hit if its
    /// tag filter matches the incoming query's tag filter.
    ///
    /// Updates metrics (hit or miss counter).
    ///
    /// Returns `None` on miss.
    pub async fn lookup(
        &self,
        query_embedding: &[f32],
        tag_filter: Option<&str>,
    ) -> Option<Vec<CachedResult>>;

    /// Insert a new cache entry.
    ///
    /// If the cache is at `max_entries` capacity, evicts the oldest entry
    /// (by `created_at`). Also removes any expired entries during insertion.
    pub async fn insert(&self, entry: CacheEntry);

    /// Invalidate all cache entries whose result set includes the given memory ID.
    ///
    /// Called when a memory is written or updated. This ensures stale cached
    /// results are not served after the underlying data changes.
    ///
    /// Updates the invalidation counter.
    pub async fn invalidate_by_memory_id(&self, memory_id: &str);

    /// Invalidate all cache entries (full cache flush).
    pub async fn invalidate_all(&self);

    /// Remove expired entries (TTL check).
    ///
    /// Called during `insert` to lazily clean up expired entries; `lookup`
    /// holds only the read lock and merely skips expired entries.
    async fn evict_expired(&self, entries: &mut Vec<CacheEntry>);

    /// Get a snapshot of cache metrics.
    pub fn metrics(&self) -> CacheMetricsSnapshot;
}
```
Create `services/memory/src/cache/similarity.rs` — pure-Rust cosine similarity:

```rust
/// Compute cosine similarity between two vectors.
///
/// Returns a value in [-1.0, 1.0]. Returns 0.0 if either vector has zero magnitude.
///
/// This is a pure-Rust implementation (not DuckDB's `array_cosine_similarity`)
/// because the cache operates in-memory, outside of DuckDB.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32;
```
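A possible body for this function, using `f64` accumulators to limit precision loss on long (e.g. 768-dim) vectors — a sketch, not the final implementation; the length-mismatch guard is an added assumption:

```rust
/// Compute cosine similarity between two vectors.
/// Returns 0.0 if either vector has zero magnitude or the lengths differ.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() || a.is_empty() {
        return 0.0;
    }
    // Accumulate in f64 to reduce rounding error, then cast back to f32.
    let mut dot = 0.0f64;
    let mut norm_a = 0.0f64;
    let mut norm_b = 0.0f64;
    for (&x, &y) in a.iter().zip(b.iter()) {
        dot += x as f64 * y as f64;
        norm_a += (x as f64) * (x as f64);
        norm_b += (y as f64) * (y as f64);
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    (dot / (norm_a.sqrt() * norm_b.sqrt())) as f32
}
```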
Key implementation details for `lookup()`:

- Acquire a read lock on `entries`.
- Get the current time for the TTL check.
- Iterate over entries, skipping expired ones.
- For each non-expired entry, check the tag filter match.
- Compute `cosine_similarity(query_embedding, entry.query_embedding)`.
- If similarity >= `config.similarity_threshold`, increment the hit counter and return the cloned results.
- If no hit is found, increment the miss counter and return `None`.
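Stripped of locking and metrics, the matching step reduces to a single predicate over the entries. A simplified sketch with a pared-down entry type (the real `CacheEntry` carries results and a `created_at` timestamp instead of the `expired` flag):

```rust
/// Pared-down cache entry for illustrating the match predicate.
struct Entry {
    embedding: Vec<f32>,
    tag_filter: Option<String>,
    expired: bool, // stand-in for the `created_at` + TTL check
}

/// Plain f32 cosine similarity, sufficient for this illustration.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Return the index of the first non-expired entry whose tag filter matches
/// and whose embedding similarity meets the threshold.
fn find_hit(
    entries: &[Entry],
    query: &[f32],
    tag: Option<&str>,
    threshold: f32,
) -> Option<usize> {
    entries.iter().position(|e| {
        !e.expired
            && e.tag_filter.as_deref() == tag
            && cosine(&e.embedding, query) >= threshold
    })
}
```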
Key implementation details for `insert()`:

- Acquire a write lock on `entries`.
- Call `evict_expired()` to remove stale entries.
- If `entries.len() >= config.max_entries`, remove the oldest entry by `created_at` and increment the eviction counter.
- Push the new `CacheEntry`.
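The capacity check can be sketched as follows (simplified, no locking; `Timestamped` and `insert_with_capacity` are illustrative names, not the plan's API):

```rust
use std::time::Instant;

/// Minimal stand-in for a cache entry's TTL bookkeeping.
struct Timestamped {
    created_at: Instant,
}

/// Evict the oldest entry (by `created_at`) when at capacity, then push.
/// Returns true if an eviction happened, so the caller can bump the counter.
fn insert_with_capacity(
    entries: &mut Vec<Timestamped>,
    new_entry: Timestamped,
    max_entries: usize,
) -> bool {
    let mut evicted = false;
    if entries.len() >= max_entries {
        // Find the index of the oldest entry and remove it.
        if let Some(oldest) = entries
            .iter()
            .enumerate()
            .min_by_key(|(_, e)| e.created_at)
            .map(|(i, _)| i)
        {
            entries.remove(oldest);
            evicted = true;
        }
    }
    entries.push(new_entry);
    evicted
}
```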
Key implementation details for `invalidate_by_memory_id()`:

- Acquire a write lock on `entries`.
- Retain only entries whose `result_memory_ids` do not contain the given `memory_id`.
- For each removed entry, increment the invalidation counter.
Metrics snapshot type:

```rust
/// A point-in-time snapshot of cache metrics.
#[derive(Debug, Clone)]
pub struct CacheMetricsSnapshot {
    pub hits: u64,
    pub misses: u64,
    pub evictions: u64,
    pub invalidations: u64,
    pub current_size: usize,
    /// Hit rate as a percentage (0.0-100.0). 0.0 if no lookups have been performed.
    pub hit_rate: f64,
}
```
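The `hit_rate` field can be derived from the counters with a zero-lookup guard — a small sketch:

```rust
/// Hit rate as a percentage in [0.0, 100.0]; 0.0 when no lookups occurred.
fn hit_rate(hits: u64, misses: u64) -> f64 {
    let total = hits + misses;
    if total == 0 {
        0.0
    } else {
        hits as f64 / total as f64 * 100.0
    }
}
```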
3. gRPC Handler Wiring
Update `services/memory/src/service.rs` — add the cache to `MemoryServiceImpl` and integrate it into `query_memory`:

```rust
pub struct MemoryServiceImpl {
    db: Arc<DuckDbManager>,
    embedding_client: Option<Arc<Mutex<EmbeddingClient>>>,
    extraction_client: Option<Arc<Mutex<ExtractionClient>>>,
    retrieval_config: RetrievalConfig,
    extraction_config: ExtractionConfig,
    cache: Arc<SemanticCache>,
}
```
Update `MemoryServiceImpl::new()` to accept `CacheConfig` and construct the `SemanticCache`:

```rust
pub fn new(
    db: Arc<DuckDbManager>,
    retrieval_config: RetrievalConfig,
    extraction_config: ExtractionConfig,
    cache_config: CacheConfig,
) -> Self {
    Self {
        db,
        embedding_client: None,
        extraction_client: None,
        retrieval_config,
        extraction_config,
        cache: Arc::new(SemanticCache::new(cache_config)),
    }
}
```
Updated `query_memory` handler flow:

1. Validate the request (existing code).
2. Generate the query embedding (existing code at lines 139-145).
3. NEW — Cache lookup: if the cache is enabled, call `self.cache.lookup(&query_embedding, tag_filter)`. On a hit, stream the cached results with `is_cached = true` and return immediately.
4. Run the retrieval pipeline (existing code at lines 148-163).
5. Run extraction (existing code at lines 166-180).
6. NEW — Cache population: build a `CacheEntry` from the retrieval + extraction results and insert it into the cache.
7. Stream results with `is_cached = false` (existing code).
```rust
// After generating query_embedding and before pipeline:
let tag_filter_ref = if req.memory_type.is_empty() {
    None
} else {
    Some(req.memory_type.as_str())
};

if self.cache.config().enabled {
    if let Some(cached_results) = self.cache.lookup(&query_vector, tag_filter_ref).await {
        tracing::debug!(
            session_id = %ctx.session_id,
            query = %req.query,
            "Cache hit for query"
        );
        // Stream cached results
        let (tx, rx) = tokio::sync::mpsc::channel(cached_results.len().max(1));
        tokio::spawn(async move {
            for result in cached_results {
                let response = QueryMemoryResponse {
                    rank: result.rank,
                    entry: Some(result.entry),
                    cosine_similarity: result.cosine_similarity,
                    is_cached: true,
                    cached_extracted_segment: result.extracted_segment,
                    extraction_confidence: result.extraction_confidence,
                };
                if tx.send(Ok(response)).await.is_err() {
                    break;
                }
            }
        });
        return Ok(Response::new(ReceiverStream::new(rx)));
    }
}

// ... existing pipeline and extraction code ...

// After extraction, before streaming:
// Populate cache with results
if self.cache.config().enabled {
    let cached_results: Vec<CachedResult> = /* build from candidates + extraction_results */;
    let result_memory_ids: Vec<String> = candidates.iter()
        .map(|c| c.memory_id.clone())
        .collect();
    let cache_entry = CacheEntry {
        query_embedding: query_vector.clone(),
        query_text: req.query.clone(),
        tag_filter: params.tag_filter.clone(),
        results: cached_results,
        result_memory_ids,
        created_at: Instant::now(),
    };
    self.cache.insert(cache_entry).await;
}
```
Update the `write_memory` handler (future-proofing):

The `write_memory` handler is currently `Unimplemented`, but the invalidation hook should be documented and placed at the logical location. When `write_memory` is implemented, after successfully writing a memory, it must call:

```rust
self.cache.invalidate_by_memory_id(&memory_id).await;
```

For now, add a comment in the `write_memory` handler noting this requirement:

```rust
// TODO(#32): After write succeeds, call self.cache.invalidate_by_memory_id(&memory_id).await;
```
4. Service Integration
Update `services/memory/src/main.rs` — pass the cache config:

```rust
let cache_config = config.cache.clone();
let mut memory_service = MemoryServiceImpl::new(
    db,
    retrieval_config,
    extraction_config.clone(),
    cache_config,
);
```
Metrics exposure: The cache metrics are accessible via `self.cache.metrics()`. For now, metrics are logged periodically or on demand; a dedicated metrics endpoint or gRPC health check can be added in a future issue. Add periodic logging in the service startup:

```rust
// Log cache metrics every 60 seconds
let cache_ref = memory_service.cache().clone();
tokio::spawn(async move {
    let mut interval = tokio::time::interval(std::time::Duration::from_secs(60));
    loop {
        interval.tick().await;
        let m = cache_ref.metrics();
        tracing::info!(
            hits = m.hits,
            misses = m.misses,
            hit_rate = format!("{:.1}%", m.hit_rate),
            size = m.current_size,
            evictions = m.evictions,
            invalidations = m.invalidations,
            "Cache metrics"
        );
    }
});
```
Error mapping: The cache layer introduces no new error types that propagate to gRPC — cache misses simply fall through to the pipeline, and cache insertion failures are logged but do not block the response.
5. Tests
Unit tests in `services/memory/src/cache/similarity.rs`:

| Test Case | Description |
|---|---|
| `test_cosine_similarity_identical_vectors` | Two identical vectors return 1.0 |
| `test_cosine_similarity_orthogonal_vectors` | Two orthogonal vectors return 0.0 |
| `test_cosine_similarity_opposite_vectors` | Two opposite vectors return -1.0 |
| `test_cosine_similarity_zero_vector` | A zero vector returns 0.0 (no division by zero) |
| `test_cosine_similarity_different_magnitudes` | Vectors with the same direction but different magnitudes return 1.0 |
| `test_cosine_similarity_known_value` | A known pair of vectors produces the expected similarity |
Unit tests in `services/memory/src/cache/mod.rs`:

| Test Case | Description |
|---|---|
| `test_cache_new_creates_empty_cache` | A new cache has 0 entries and zeroed metrics |
| `test_cache_insert_and_lookup_hit` | Insert an entry; lookup with the same embedding returns a hit |
| `test_cache_lookup_miss_below_threshold` | Lookup with a dissimilar embedding returns a miss |
| `test_cache_lookup_miss_empty_cache` | Lookup on an empty cache returns `None` |
| `test_cache_ttl_expiration` | Insert an entry, wait past the TTL; lookup returns `None` |
| `test_cache_invalidate_by_memory_id` | Insert an entry with a memory ID, invalidate; lookup returns `None` |
| `test_cache_invalidate_by_memory_id_partial` | Two entries; invalidating one memory ID leaves the other entry intact |
| `test_cache_invalidate_all` | Insert entries, invalidate all; lookup returns `None` |
| `test_cache_max_entries_eviction` | Inserting beyond `max_entries` evicts the oldest entry |
| `test_cache_metrics_hit_count` | After hits, the hit counter is incremented |
| `test_cache_metrics_miss_count` | After misses, the miss counter is incremented |
| `test_cache_metrics_hit_rate` | After a mix of hits and misses, the hit rate is correct |
| `test_cache_metrics_eviction_count` | After an eviction, the eviction counter is incremented |
| `test_cache_metrics_invalidation_count` | After an invalidation, the invalidation counter is incremented |
| `test_cache_tag_filter_scoping` | An entry cached with tag "A" misses for a lookup with tag "B" |
| `test_cache_tag_filter_none_matches_none` | An entry cached without a tag hits for a lookup without a tag |
| `test_cache_disabled_returns_none` | With `enabled=false`, lookup always returns `None` |
| `test_cache_concurrent_read_write` | Spawn multiple readers and writers; verify no panics or deadlocks |
Service-level tests in `services/memory/src/service.rs`:

| Test Case | Description |
|---|---|
| `test_query_memory_cache_hit` | The first query populates the cache; a second identical query returns `is_cached=true` |
| `test_query_memory_cache_miss_different_query` | Two dissimilar queries both return `is_cached=false` |
| `test_query_memory_cache_disabled` | With the cache disabled in config, all queries return `is_cached=false` |
Config tests in `services/memory/src/config.rs`:

| Test Case | Description |
|---|---|
| `test_cache_config_defaults` | Default config has `enabled=true`, `similarity_threshold=0.95`, `ttl_secs=300`, `max_entries=1000` |
| `test_cache_config_from_toml` | Custom values are loaded from TOML |
| `test_cache_config_uses_defaults_when_omitted` | A config without a `[cache]` section uses defaults |
Mocking strategy:

- Use `DuckDbManager::in_memory()` for all DB operations.
- Use the existing mock Model Gateway server pattern from `services/memory/src/service.rs:469-713` for embedding and extraction clients.
- For cache-specific tests, construct `CacheEntry` directly without needing the full pipeline.
- For TTL tests, use a very short TTL (e.g., 1 second) and `tokio::time::sleep()`.
Cargo Dependencies
No new crate dependencies required. All functionality is available via:

- `tokio` (async `RwLock`, `mpsc` channels, `time` utilities for the metrics-logging interval)
- `std::sync::atomic` (lock-free metrics counters)
- `std::time::Instant` (TTL tracking)
Trait Implementations
No new trait implementations required. The `SemanticCache` is a concrete struct used directly by the service layer.
Error Types
No new error types required. Cache operations are non-fatal:
- Cache lookup miss: falls through to the pipeline.
- Cache insertion failure: logged as warning, response still returned.
- Cache invalidation: best-effort, logged.
Files to Create/Modify
| File | Action | Purpose |
|---|---|---|
| `services/memory/src/cache/mod.rs` | Create | `SemanticCache`, `CacheEntry`, `CachedResult`, `CacheMetrics`, `CacheMetricsSnapshot` — cache manager with lookup, insert, invalidation, eviction, and metrics |
| `services/memory/src/cache/similarity.rs` | Create | `cosine_similarity()` — pure-Rust cosine similarity for in-memory embedding comparison |
| `services/memory/src/config.rs` | Modify | Add `CacheConfig` struct with `enabled`, `similarity_threshold`, `ttl_secs`, `max_entries`; add `cache` field to `Config` |
| `services/memory/src/lib.rs` | Modify | Add `pub mod cache;` |
| `services/memory/src/service.rs` | Modify | Add `cache: Arc<SemanticCache>` to `MemoryServiceImpl`; update the constructor to accept `CacheConfig`; integrate cache lookup before the pipeline and cache population after it in `query_memory`; add a cache-invalidation comment to `write_memory` |
| `services/memory/src/main.rs` | Modify | Pass `CacheConfig` to `MemoryServiceImpl::new()`; add a periodic cache-metrics logging task |
Risks and Edge Cases
- Cache key collision with different tag filters: Two queries with the same text but different `memory_type` tag filters must not share cache entries. Mitigation: the cache lookup filters by `tag_filter` match in addition to embedding similarity — an entry is only a hit if both the similarity threshold is met and the tag filter matches exactly.
- Similarity threshold tuning: A threshold of 0.95 is aggressive — semantically similar but not identical queries may miss. A lower threshold (e.g., 0.90) increases the hit rate but risks returning stale or irrelevant results. Mitigation: make the threshold configurable and start with 0.95 as the safe default.
- Cache size and memory pressure: Each cache entry stores the query embedding (768 floats ≈ 3 KB), the full `MemoryEntry` proto messages (variable size), and extraction results. With 1000 entries and an average of 5 results per entry, memory usage is roughly 1000 * (3 KB + 5 * ~2 KB) ≈ 13 MB, which is acceptable for the target hardware. The `max_entries` cap prevents unbounded growth.
- TTL granularity: TTL is checked lazily during `lookup` and `insert`, not by a background sweeper, so expired entries may linger until the next operation. For the expected query rate this is acceptable; a background sweeper can be added if memory pressure becomes an issue.
- Write-through invalidation for unimplemented `write_memory`: The `write_memory` handler is currently `Unimplemented`, so the invalidation hook is documented as a TODO comment. When `write_memory` is implemented (issue #34 or similar), the cache invalidation must be wired in. Risk: if forgotten, stale cache entries will be served. Mitigation: the TODO comment references issue #32 for traceability.
- Concurrent access patterns: The cache uses `tokio::sync::RwLock`, which allows multiple concurrent readers (cache lookups) with exclusive writer access (inserts, invalidations). This suits a read-heavy workload (many queries, fewer writes); the `RwLock` will not be a bottleneck unless the cache is invalidated very frequently.
- Embedding client required for cache: The cache lookup requires a query embedding, which is generated by the embedding client. If no embedding client is configured, the cache cannot be used. This is already handled by the existing check that returns `failed_precondition` when no embedding client is present — the cache lookup code path is only reached after the embedding is successfully generated.
- Cache coherence with extraction toggle: If the first query runs with `skip_extraction=false` (extraction results cached) and a subsequent semantically similar query has `skip_extraction=true`, the cache hit will return extraction results even though the caller didn't want them. Mitigation: the caller can ignore the extraction fields; alternatively, the cache lookup could also match on the `skip_extraction` flag. Start with the simpler approach (the cache does not differentiate by extraction toggle), since extracted results are strictly more informative.
- Linear scan performance: The cache lookup iterates over all entries computing cosine similarity. For 1000 entries with 768-dim vectors, this is ~768K floating-point multiply-adds, which completes in microseconds on modern hardware — negligible compared to retrieval pipeline latency. No indexing is needed at this scale.
Deviation Log
| Deviation | Reason |
|---|---|
| Merged `feature/issue-31-extraction-step` into the feature branch (fast-forward) | Issue #32 depends on #31 (extraction step), which is completed but not yet merged to main. The extraction types, client, and `ExtractionConfig` are required by the cache integration in `service.rs`. |
| `SemanticCache::metrics()` is async (acquires the read lock to get `current_size`) | The plan showed it as a sync method, but reading `entries.len()` requires the `RwLock`. Made async for correctness. |
| Used `f64` intermediates in the cosine similarity computation | The plan specified `f32` only. Using `f64` for dot-product and magnitude accumulation avoids precision loss with large vectors; the result is cast back to `f32`. |