llm-multiverse/implementation-plans/issue-032.md

Implementation Plan — Issue #32: Implement semantic cache

Metadata

Field          Value
Issue          #32
Title          Implement semantic cache
Milestone      Phase 4: Memory Service
Labels         (none)
Status         COMPLETED
Language       Rust
Related Plans  issue-028.md, issue-029.md, issue-030.md, issue-031.md
Blocked by     #31 (completed)

Acceptance Criteria

  • Cache keyed by query embedding similarity (not exact match)
  • Configurable similarity threshold for cache hits
  • TTL-based cache expiration
  • Cache invalidation on new memory writes
  • Metrics: cache hit/miss rate tracking

Architecture Analysis

Service Context

This issue belongs to the Memory Service (Rust). It implements the semantic cache layer described in the architecture document:

Cache: Keyed on semantic similarity of query (embedding-based, not exact string match). Cache entry stores extracted relevant segment + provenance. TTL configurable per memory type. Invalidated on write to any memory in the result set.

The cache sits between the query_memory gRPC handler and the retrieval pipeline. When a query arrives, the cache is checked first by computing the query's embedding and comparing it against cached query embeddings using cosine similarity. If a cache hit is found (similarity exceeds a configurable threshold), the cached results are returned directly, bypassing the full retrieval pipeline and extraction step.

Affected gRPC endpoints:

  • QueryMemory (server-streaming) — cache lookup is inserted before the retrieval pipeline; cache population is inserted after retrieval + extraction completes.
  • WriteMemory — on write, the cache must be invalidated for any cached entry whose result set includes the written memory ID.

Proto messages used:

  • QueryMemoryRequest — the query text is embedded and used as the cache key.
  • QueryMemoryResponse — cached responses set is_cached = true. The cached_extracted_segment and extraction_confidence fields are populated from the cache entry.
  • No proto changes required — the existing is_cached field on QueryMemoryResponse (field 4) already exists for this purpose.

Existing Patterns

  • Config pattern: RetrievalConfig and ExtractionConfig in services/memory/src/config.rs use #[derive(Debug, Clone, Deserialize)] with #[serde(default)] and named default functions. The cache config should follow the same pattern.
  • Builder pattern for service: MemoryServiceImpl uses builder methods like with_embedding_client() and with_extraction_client() at services/memory/src/service.rs:54-67. The cache can be built into the service directly (always present, not optional) since it is purely in-memory.
  • Embedding generation: The EmbeddingClient wrapped in Arc<Mutex<EmbeddingClient>> at services/memory/src/service.rs:26 provides generate() to produce query embeddings. The cache will reuse the query embedding already generated for the retrieval pipeline (at services/memory/src/service.rs:139-145).
  • DuckDB connection pattern: DuckDbManager::with_connection() at services/memory/src/db/mod.rs:84-90 uses Mutex<Connection>. The cache is in-memory (not in DuckDB) to avoid database lock contention.
  • Module organization: Each feature area has its own module directory (embedding/, extraction/, retrieval/). The cache should follow the same pattern as a cache/ module.

Dependencies

  • No new external crate dependencies — the cache is an in-memory structure (a Vec of entries) protected by tokio::sync::RwLock. Cosine similarity cannot reuse the array_cosine_similarity logic from the retrieval pipeline because the cache operates outside DuckDB, so a small pure-Rust cosine similarity function is needed.
  • Embedding client — the cache requires query embeddings. The query_memory handler already generates the query embedding before the pipeline runs (at services/memory/src/service.rs:139-145). The cache lookup and population reuse this embedding.
  • std::time::Instant — for TTL tracking.
  • std::sync::atomic — for lock-free cache metrics counters.

Implementation Steps

1. Types & Configuration

Add cache configuration to services/memory/src/config.rs:

/// Configuration for the semantic query cache.
#[derive(Debug, Clone, Deserialize)]
pub struct CacheConfig {
    /// Whether the cache is enabled (default: true).
    #[serde(default = "default_cache_enabled")]
    pub enabled: bool,

    /// Cosine similarity threshold for cache hits (default: 0.95).
    /// A cached query embedding must have cosine similarity >= this value
    /// with the incoming query embedding to be considered a hit.
    #[serde(default = "default_cache_similarity_threshold")]
    pub similarity_threshold: f32,

    /// Time-to-live for cache entries in seconds (default: 300 = 5 minutes).
    #[serde(default = "default_cache_ttl_secs")]
    pub ttl_secs: u64,

    /// Maximum number of entries in the cache (default: 1000).
    /// When exceeded, the oldest entry is evicted.
    #[serde(default = "default_cache_max_entries")]
    pub max_entries: usize,
}

Add cache: CacheConfig field to the Config struct with #[serde(default)].
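Following the named-default pattern already used by RetrievalConfig and ExtractionConfig, the serde default functions referenced above could look like this (a sketch; the values mirror the documented defaults):

```rust
// Named serde default functions for CacheConfig, following the existing
// config pattern. Values match the defaults documented on each field.
fn default_cache_enabled() -> bool {
    true
}

fn default_cache_similarity_threshold() -> f32 {
    0.95
}

fn default_cache_ttl_secs() -> u64 {
    300 // 5 minutes
}

fn default_cache_max_entries() -> usize {
    1000
}
```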

Define cache types in a new services/memory/src/cache/mod.rs:

use std::time::Instant;

/// A single entry in the semantic cache.
///
/// Stores the query embedding (as the cache key), the retrieval results,
/// and metadata for TTL tracking and invalidation.
#[derive(Debug, Clone)]
pub struct CacheEntry {
    /// The embedding of the cached query (used for similarity matching).
    pub query_embedding: Vec<f32>,
    /// The original query text (for debugging/logging).
    pub query_text: String,
    /// The tag filter used in the original query (cache entries are scoped by tag).
    pub tag_filter: Option<String>,
    /// Cached response data (rank, entry, scores, extraction results).
    pub results: Vec<CachedResult>,
    /// Memory IDs in the result set (for invalidation on write).
    pub result_memory_ids: Vec<String>,
    /// When this entry was created (for TTL expiration).
    pub created_at: Instant,
}

/// A single cached retrieval result.
#[derive(Debug, Clone)]
pub struct CachedResult {
    /// The rank of the result in the original retrieval.
    pub rank: u32,
    /// The memory entry (proto format).
    pub entry: llm_multiverse_proto::llm_multiverse::v1::MemoryEntry,
    /// Cosine similarity score from the retrieval pipeline.
    pub cosine_similarity: f32,
    /// Extracted segment (if extraction was performed).
    pub extracted_segment: Option<String>,
    /// Extraction confidence (if extraction was performed).
    pub extraction_confidence: Option<f32>,
}

/// Cache hit/miss metrics.
#[derive(Debug)]
pub struct CacheMetrics {
    /// Total cache hit count.
    pub hits: std::sync::atomic::AtomicU64,
    /// Total cache miss count.
    pub misses: std::sync::atomic::AtomicU64,
    /// Total cache evictions (TTL or capacity).
    pub evictions: std::sync::atomic::AtomicU64,
    /// Total cache invalidations (due to writes).
    pub invalidations: std::sync::atomic::AtomicU64,
}
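For the counters, plain increments with Ordering::Relaxed are sufficient, since no ordering guarantees across counters are needed. A self-contained sketch (the record_hit/record_miss helper names are illustrative, not part of the plan):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Self-contained restatement of the metrics struct with update helpers.
// Relaxed ordering is enough: each counter is an independent statistic.
#[derive(Debug, Default)]
pub struct CacheMetrics {
    pub hits: AtomicU64,
    pub misses: AtomicU64,
    pub evictions: AtomicU64,
    pub invalidations: AtomicU64,
}

impl CacheMetrics {
    /// Increment the hit counter (called on a successful lookup).
    pub fn record_hit(&self) {
        self.hits.fetch_add(1, Ordering::Relaxed);
    }

    /// Increment the miss counter (called when no entry matches).
    pub fn record_miss(&self) {
        self.misses.fetch_add(1, Ordering::Relaxed);
    }
}
```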

2. Core Logic

In the same services/memory/src/cache/mod.rs, add the semantic cache manager:

use tokio::sync::RwLock;
use crate::config::CacheConfig;

/// Semantic cache for deduplicating similar queries.
///
/// Keyed by query embedding similarity rather than exact string match.
/// Thread-safe via `RwLock` for concurrent read access (cache lookups)
/// with exclusive write access (cache population, invalidation, eviction).
pub struct SemanticCache {
    config: CacheConfig,
    entries: RwLock<Vec<CacheEntry>>,
    metrics: CacheMetrics,
}

impl SemanticCache {
    /// Create a new semantic cache with the given configuration.
    pub fn new(config: CacheConfig) -> Self;

    /// Look up a cache entry by query embedding similarity.
    ///
    /// Computes cosine similarity between the `query_embedding` and each
    /// cached entry's embedding. Returns the first entry whose similarity
    /// exceeds `config.similarity_threshold` and whose TTL has not expired.
    ///
    /// Also filters by `tag_filter` — a cached entry is only a hit if its
    /// tag filter matches the incoming query's tag filter.
    ///
    /// Updates metrics (hit or miss counter).
    ///
    /// Returns `None` on miss.
    pub async fn lookup(
        &self,
        query_embedding: &[f32],
        tag_filter: Option<&str>,
    ) -> Option<Vec<CachedResult>>;

    /// Insert a new cache entry.
    ///
    /// If the cache is at `max_entries` capacity, evicts the oldest entry
    /// (by `created_at`). Also removes any expired entries during insertion.
    pub async fn insert(&self, entry: CacheEntry);

    /// Invalidate all cache entries whose result set includes the given memory ID.
    ///
    /// Called when a memory is written or updated. This ensures stale cached
    /// results are not served after the underlying data changes.
    ///
    /// Updates the invalidation counter.
    pub async fn invalidate_by_memory_id(&self, memory_id: &str);

    /// Invalidate all cache entries (full cache flush).
    pub async fn invalidate_all(&self);

    /// Remove expired entries (TTL check).
    ///
    /// Called during `lookup` and `insert` to lazily clean up expired entries.
    async fn evict_expired(&self, entries: &mut Vec<CacheEntry>);

    /// Access the cache configuration (the handler checks `enabled` via
    /// `self.cache.config().enabled` before lookup and population).
    pub fn config(&self) -> &CacheConfig;

    /// Get a snapshot of cache metrics.
    pub fn metrics(&self) -> CacheMetricsSnapshot;
}

Create services/memory/src/cache/similarity.rs — Pure-Rust cosine similarity:

/// Compute cosine similarity between two vectors.
///
/// Returns a value in [-1.0, 1.0]. Returns 0.0 if either vector has zero magnitude.
///
/// This is a pure-Rust implementation (not DuckDB's `array_cosine_similarity`)
/// because the cache operates in-memory, outside of DuckDB.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32;
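A possible body for this function, following the deviation log's choice of f64 accumulation (a sketch, not the final implementation):

```rust
/// Compute cosine similarity between two vectors.
///
/// Accumulates the dot product and magnitudes in f64 to limit precision
/// loss with long vectors, then casts back to f32. Returns 0.0 if either
/// vector has zero magnitude (avoids division by zero).
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    debug_assert_eq!(a.len(), b.len(), "embedding dimensions must match");
    let mut dot = 0.0f64;
    let mut norm_a = 0.0f64;
    let mut norm_b = 0.0f64;
    for (&x, &y) in a.iter().zip(b.iter()) {
        dot += x as f64 * y as f64;
        norm_a += (x as f64) * (x as f64);
        norm_b += (y as f64) * (y as f64);
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    (dot / (norm_a.sqrt() * norm_b.sqrt())) as f32
}
```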

Key implementation details for lookup():

  1. Acquire read lock on entries.
  2. Get current time for TTL check.
  3. Iterate over entries, skip expired ones.
  4. For each non-expired entry, check tag filter match.
  5. Compute cosine_similarity(query_embedding, entry.query_embedding).
  6. If similarity >= config.similarity_threshold, increment hit counter, return cloned results.
  7. If no hit found, increment miss counter, return None.
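The steps above can be sketched with simplified synchronous types: CacheEntry is trimmed to the fields the matching logic needs, Vec<String> stands in for Vec<CachedResult>, and the real method would hold the entries behind the RwLock and bump the atomic counters.

```rust
use std::time::{Duration, Instant};

// Trimmed-down entry: only the fields the matching logic touches.
struct CacheEntry {
    query_embedding: Vec<f32>,
    tag_filter: Option<String>,
    results: Vec<String>, // stand-in for Vec<CachedResult>
    created_at: Instant,
}

fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b.iter()) {
        dot += x as f64 * y as f64;
        na += (x as f64).powi(2);
        nb += (y as f64).powi(2);
    }
    if na == 0.0 || nb == 0.0 { 0.0 } else { (dot / (na.sqrt() * nb.sqrt())) as f32 }
}

// Steps 2-7 of the lookup flow, minus the RwLock and metrics counters.
fn lookup(
    entries: &[CacheEntry],
    query_embedding: &[f32],
    tag_filter: Option<&str>,
    threshold: f32,
    ttl: Duration,
) -> Option<Vec<String>> {
    let now = Instant::now(); // step 2
    entries
        .iter()
        .filter(|e| now.duration_since(e.created_at) < ttl) // step 3: skip expired
        .filter(|e| e.tag_filter.as_deref() == tag_filter) // step 4: tag scoping
        .find(|e| cosine_similarity(query_embedding, &e.query_embedding) >= threshold) // steps 5-6
        .map(|e| e.results.clone()) // hit: return cloned results
    // fall-through to None = miss (step 7)
}
```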

Key implementation details for insert():

  1. Acquire write lock on entries.
  2. Call evict_expired() to remove stale entries.
  3. If entries.len() >= config.max_entries, remove the oldest entry by created_at. Increment eviction counter.
  4. Push the new CacheEntry.
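A minimal sketch of these steps with a trimmed entry type (the real method takes the write lock first and increments the eviction counter):

```rust
use std::time::{Duration, Instant};

// Trimmed entry: only `created_at` matters for TTL cleanup and eviction.
struct CacheEntry {
    created_at: Instant,
}

// Steps 2-4 of the insert flow.
fn insert(entries: &mut Vec<CacheEntry>, new_entry: CacheEntry, max_entries: usize, ttl: Duration) {
    let now = Instant::now();
    // Step 2: lazily drop expired entries.
    entries.retain(|e| now.duration_since(e.created_at) < ttl);
    // Step 3: at capacity, evict the oldest entry by created_at.
    if entries.len() >= max_entries {
        if let Some(oldest) = entries
            .iter()
            .enumerate()
            .min_by_key(|(_, e)| e.created_at)
            .map(|(i, _)| i)
        {
            entries.remove(oldest);
        }
    }
    // Step 4: store the new entry.
    entries.push(new_entry);
}
```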

Key implementation details for invalidate_by_memory_id():

  1. Acquire write lock on entries.
  2. Retain only entries whose result_memory_ids do not contain the given memory_id.
  3. For each removed entry, increment invalidation counter.
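Sketched with a trimmed entry type, the retain-based invalidation might look like this; returning the number of dropped entries lets the caller bump the invalidation counter:

```rust
// Trimmed entry: only the result-set memory IDs matter for invalidation.
struct CacheEntry {
    result_memory_ids: Vec<String>,
}

// Steps 2-3 of the invalidation flow: keep entries that do not reference
// the written memory ID; return how many entries were dropped.
fn invalidate_by_memory_id(entries: &mut Vec<CacheEntry>, memory_id: &str) -> u64 {
    let before = entries.len();
    entries.retain(|e| !e.result_memory_ids.iter().any(|id| id == memory_id));
    (before - entries.len()) as u64
}
```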

Metrics snapshot type:

/// A point-in-time snapshot of cache metrics.
#[derive(Debug, Clone)]
pub struct CacheMetricsSnapshot {
    pub hits: u64,
    pub misses: u64,
    pub evictions: u64,
    pub invalidations: u64,
    pub current_size: usize,
    /// Hit rate as a percentage (0.0-100.0). Returns 0.0 if no lookups performed.
    pub hit_rate: f64,
}
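The hit_rate field can be derived from the hit and miss counters, guarding the no-lookup case (a sketch; the free function here is for illustration only):

```rust
/// Derive the hit rate (as a percentage) from the hit and miss counters.
/// Returns 0.0 when no lookups have been performed, avoiding division by zero.
fn hit_rate(hits: u64, misses: u64) -> f64 {
    let total = hits + misses;
    if total == 0 {
        0.0
    } else {
        hits as f64 / total as f64 * 100.0
    }
}
```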

3. gRPC Handler Wiring

Update services/memory/src/service.rs — Add cache to MemoryServiceImpl and integrate into query_memory:

pub struct MemoryServiceImpl {
    db: Arc<DuckDbManager>,
    embedding_client: Option<Arc<Mutex<EmbeddingClient>>>,
    extraction_client: Option<Arc<Mutex<ExtractionClient>>>,
    retrieval_config: RetrievalConfig,
    extraction_config: ExtractionConfig,
    cache: Arc<SemanticCache>,
}

Update MemoryServiceImpl::new() to accept CacheConfig and construct the SemanticCache:

pub fn new(
    db: Arc<DuckDbManager>,
    retrieval_config: RetrievalConfig,
    extraction_config: ExtractionConfig,
    cache_config: CacheConfig,
) -> Self {
    Self {
        db,
        embedding_client: None,
        extraction_client: None,
        retrieval_config,
        extraction_config,
        cache: Arc::new(SemanticCache::new(cache_config)),
    }
}

Updated query_memory handler flow:

  1. Validate request (existing code).
  2. Generate query embedding (existing code at lines 139-145).
  3. NEW — Cache lookup: If cache is enabled, call self.cache.lookup(&query_embedding, tag_filter). If hit, stream cached results with is_cached = true and return immediately.
  4. Run retrieval pipeline (existing code at lines 148-163).
  5. Run extraction (existing code at lines 166-180).
  6. NEW — Cache population: Build a CacheEntry from the retrieval + extraction results and insert into the cache.
  7. Stream results with is_cached = false (existing code).

In code, the new steps look like this:

// After generating query_embedding and before pipeline:
let tag_filter_ref = if req.memory_type.is_empty() {
    None
} else {
    Some(req.memory_type.as_str())
};

if self.cache.config().enabled {
    if let Some(cached_results) = self.cache.lookup(&query_vector, tag_filter_ref).await {
        tracing::debug!(
            session_id = %ctx.session_id,
            query = %req.query,
            "Cache hit for query"
        );
        // Stream cached results
        let (tx, rx) = tokio::sync::mpsc::channel(cached_results.len().max(1));
        tokio::spawn(async move {
            for result in cached_results {
                let response = QueryMemoryResponse {
                    rank: result.rank,
                    entry: Some(result.entry),
                    cosine_similarity: result.cosine_similarity,
                    is_cached: true,
                    cached_extracted_segment: result.extracted_segment,
                    extraction_confidence: result.extraction_confidence,
                };
                if tx.send(Ok(response)).await.is_err() {
                    break;
                }
            }
        });
        return Ok(Response::new(ReceiverStream::new(rx)));
    }
}

// ... existing pipeline and extraction code ...

// After extraction, before streaming:
// Populate cache with results
if self.cache.config().enabled {
    let cached_results: Vec<CachedResult> = /* build from candidates + extraction_results */;
    let result_memory_ids: Vec<String> = candidates.iter()
        .map(|c| c.memory_id.clone())
        .collect();
    let cache_entry = CacheEntry {
        query_embedding: query_vector.clone(),
        query_text: req.query.clone(),
        tag_filter: params.tag_filter.clone(),
        results: cached_results,
        result_memory_ids,
        created_at: Instant::now(),
    };
    self.cache.insert(cache_entry).await;
}

Update write_memory handler (future-proofing):

The write_memory handler is currently Unimplemented, but the invalidation hook should be documented and placed at the logical location. When write_memory is implemented, after successfully writing a memory, it must call:

self.cache.invalidate_by_memory_id(&memory_id).await;

For now, add a comment in the write_memory handler noting this requirement:

// TODO(#32): After write succeeds, call self.cache.invalidate_by_memory_id(&memory_id).await;

4. Service Integration

Update services/memory/src/main.rs — Pass cache config:

let cache_config = config.cache.clone();
let mut memory_service = MemoryServiceImpl::new(
    db,
    retrieval_config,
    extraction_config.clone(),
    cache_config,
);

Metrics exposure: The cache metrics are accessible via self.cache.metrics(). For now, metrics are logged periodically or on demand. A dedicated metrics endpoint or gRPC health check can be added in a future issue. Add periodic logging in the service startup:

// Log cache metrics every 60 seconds
let cache_ref = memory_service.cache().clone();
tokio::spawn(async move {
    let mut interval = tokio::time::interval(std::time::Duration::from_secs(60));
    loop {
        interval.tick().await;
        let m = cache_ref.metrics();
        tracing::info!(
            hits = m.hits,
            misses = m.misses,
            hit_rate = %format!("{:.1}%", m.hit_rate),
            size = m.current_size,
            evictions = m.evictions,
            invalidations = m.invalidations,
            "Cache metrics"
        );
    }
});

Error mapping: The cache layer introduces no new error types that propagate to gRPC — cache misses simply fall through to the pipeline, and cache insertion failures are logged but do not block the response.

5. Tests

Unit tests in services/memory/src/cache/similarity.rs:

  • test_cosine_similarity_identical_vectors — two identical vectors return 1.0
  • test_cosine_similarity_orthogonal_vectors — two orthogonal vectors return 0.0
  • test_cosine_similarity_opposite_vectors — two opposite vectors return -1.0
  • test_cosine_similarity_zero_vector — a zero vector returns 0.0 (no division by zero)
  • test_cosine_similarity_different_magnitudes — vectors with the same direction but different magnitudes return 1.0
  • test_cosine_similarity_known_value — a known pair of vectors produces the expected similarity

Unit tests in services/memory/src/cache/mod.rs:

  • test_cache_new_creates_empty_cache — a new cache has 0 entries and 0 metrics
  • test_cache_insert_and_lookup_hit — insert an entry; lookup with the same embedding returns a hit
  • test_cache_lookup_miss_below_threshold — lookup with a dissimilar embedding returns a miss
  • test_cache_lookup_miss_empty_cache — lookup on an empty cache returns None
  • test_cache_ttl_expiration — insert an entry, wait past the TTL; lookup returns None
  • test_cache_invalidate_by_memory_id — insert an entry with a memory ID, invalidate it; lookup returns None
  • test_cache_invalidate_by_memory_id_partial — two entries; invalidating one memory ID leaves the other entry intact
  • test_cache_invalidate_all — insert entries, invalidate all; lookup returns None
  • test_cache_max_entries_eviction — inserting beyond max_entries evicts the oldest entry
  • test_cache_metrics_hit_count — after hits, the hit counter is incremented
  • test_cache_metrics_miss_count — after misses, the miss counter is incremented
  • test_cache_metrics_hit_rate — after a mix of hits and misses, the hit rate is correct
  • test_cache_metrics_eviction_count — after an eviction, the eviction counter is incremented
  • test_cache_metrics_invalidation_count — after an invalidation, the invalidation counter is incremented
  • test_cache_tag_filter_scoping — an entry cached with tag "A" misses a lookup with tag "B"
  • test_cache_tag_filter_none_matches_none — an entry cached without a tag hits a lookup without a tag
  • test_cache_disabled_returns_none — with enabled=false, lookup always returns None
  • test_cache_concurrent_read_write — spawn multiple readers and writers; verify no panics or deadlocks

Service-level tests in services/memory/src/service.rs:

  • test_query_memory_cache_hit — the first query populates the cache; a second identical query returns is_cached=true
  • test_query_memory_cache_miss_different_query — two dissimilar queries both return is_cached=false
  • test_query_memory_cache_disabled — with the cache disabled in config, all queries return is_cached=false

Config tests in services/memory/src/config.rs:

  • test_cache_config_defaults — the default config has enabled=true, similarity_threshold=0.95, ttl_secs=300, max_entries=1000
  • test_cache_config_from_toml — custom values are loaded from TOML
  • test_cache_config_uses_defaults_when_omitted — a config without a [cache] section uses defaults

Mocking strategy:

  • Use DuckDbManager::in_memory() for all DB operations.
  • Use the existing mock Model Gateway server pattern from services/memory/src/service.rs:469-713 for embedding and extraction clients.
  • For cache-specific tests, construct CacheEntry directly without needing the full pipeline.
  • For TTL tests, use a very short TTL (e.g., 1 second) and tokio::time::sleep().
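As an alternative to short TTLs plus real sleeps, a TTL test can be made deterministic by backdating created_at. A minimal sketch of the expiry predicate (is_expired is an illustrative helper, not part of the planned API):

```rust
use std::time::{Duration, Instant};

/// Expiry predicate as lookup would apply it: an entry is expired once its
/// age meets or exceeds the TTL. Taking `now` as a parameter makes the
/// check testable without sleeping.
fn is_expired(created_at: Instant, ttl: Duration, now: Instant) -> bool {
    now.duration_since(created_at) >= ttl
}
```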

Cargo Dependencies

No new crate dependencies required. All functionality is available via:

  • tokio (async RwLock, mpsc channels, time::Instant)
  • std::sync::atomic (lock-free metrics counters)
  • std::time::Instant (TTL tracking)

Trait Implementations

No new trait implementations required. The SemanticCache is a concrete struct used directly by the service layer.

Error Types

No new error types required. Cache operations are non-fatal:

  • Cache lookup miss: falls through to the pipeline.
  • Cache insertion failure: logged as warning, response still returned.
  • Cache invalidation: best-effort, logged.

Files to Create/Modify

  • services/memory/src/cache/mod.rs (create) — SemanticCache, CacheEntry, CachedResult, CacheMetrics, CacheMetricsSnapshot; cache manager with lookup, insert, invalidation, eviction, and metrics
  • services/memory/src/cache/similarity.rs (create) — cosine_similarity(); pure-Rust cosine similarity for in-memory embedding comparison
  • services/memory/src/config.rs (modify) — add CacheConfig struct with enabled, similarity_threshold, ttl_secs, max_entries; add cache field to Config
  • services/memory/src/lib.rs (modify) — add pub mod cache;
  • services/memory/src/service.rs (modify) — add cache: Arc<SemanticCache> to MemoryServiceImpl; update the constructor to accept CacheConfig; integrate cache lookup before the pipeline and cache population after it in query_memory; add the cache invalidation comment to write_memory
  • services/memory/src/main.rs (modify) — pass CacheConfig to MemoryServiceImpl::new(); add the periodic cache metrics logging task

Risks and Edge Cases

  • Cache key collision with different tag filters: Two queries with the same text but different memory_type tag filters should not share cache entries. Mitigation: the cache lookup filters by tag_filter match in addition to embedding similarity. A cache entry is only a hit if both the embedding similarity threshold is met AND the tag filter matches exactly.
  • Similarity threshold tuning: A threshold of 0.95 is aggressive — semantically similar but not identical queries may miss. A lower threshold (e.g., 0.90) increases hit rate but risks returning stale/irrelevant results. Mitigation: make the threshold configurable and start with 0.95 as the safe default.
  • Cache size and memory pressure: Each cache entry stores the query embedding (768 floats = 3KB), the full MemoryEntry proto messages (variable size), and extraction results. With 1000 entries and average 5 results per entry, memory usage is roughly 1000 * (3KB + 5 * ~2KB) = ~13MB. This is acceptable for the target hardware. The max_entries cap prevents unbounded growth.
  • TTL granularity: TTL is checked lazily during lookup and insert, not by a background sweeper. This means expired entries may linger until the next operation. For the expected query rate, this is acceptable. A background sweeper can be added if memory pressure becomes an issue.
  • Write-through invalidation for unimplemented write_memory: The write_memory handler is currently Unimplemented. The invalidation hook is documented as a TODO comment. When write_memory is implemented (issue #34 or similar), the cache invalidation must be wired in. Risk: if forgotten, stale cache entries will be served. Mitigation: the TODO comment references issue #32 for traceability.
  • Concurrent access patterns: The cache uses tokio::sync::RwLock which allows multiple concurrent readers (cache lookups) with exclusive writer access (inserts, invalidations). This is appropriate for a read-heavy workload (many queries, fewer writes). The RwLock will not be a bottleneck unless the cache is invalidated very frequently.
  • Embedding client required for cache: The cache lookup requires a query embedding, which is generated by the embedding client. If no embedding client is configured, the cache cannot be used. This is already handled by the existing check that returns failed_precondition when no embedding client is present — the cache lookup code path is only reached after the embedding is successfully generated.
  • Cache coherence with extraction toggle: If the first query runs with skip_extraction=false (extraction results cached) and a subsequent semantically similar query has skip_extraction=true, the cache hit will return extraction results even though the caller didn't want them. Mitigation: the caller can ignore the extraction fields; alternatively, the cache lookup could also match on skip_extraction flag. Start with the simpler approach (cache does not differentiate by extraction toggle) since extracted results are strictly more informative.
  • Linear scan performance: The cache lookup iterates over all entries computing cosine similarity. For 1000 entries with 768-dim vectors, this is ~1000 * 768 multiply-adds = ~768K floating point ops, which completes in microseconds on modern hardware. This is negligible compared to the retrieval pipeline latency. No indexing needed at this scale.

Deviation Log

  • Merged feature/issue-31-extraction-step into the feature branch (fast-forward). Reason: issue #32 depends on #31 (extraction step), which is completed but not yet merged to main; the extraction types, client, and ExtractionConfig are required by the cache integration in service.rs.
  • SemanticCache::metrics() is async (acquires the read lock to get current_size). Reason: the plan showed it as a sync method, but reading entries.len() requires the RwLock; made async for correctness.
  • Used f64 intermediates in the cosine similarity computation. Reason: the plan specified f32 only; accumulating the dot product and magnitudes in f64 avoids precision loss with large vectors, and the result is cast back to f32.