Implementation Plan — Issue #32: Implement semantic cache
Metadata
| Field | Value |
|---|---|
| Issue | #32 |
| Title | Implement semantic cache |
| Milestone | Phase 4: Memory Service |
| Labels | |
| Status | DONE |
| Language | Rust |
| Related Plans | issue-028.md, issue-029.md, issue-030.md, issue-031.md |
| Blocked by | #31 (completed) |
Acceptance Criteria
- Cache keyed by query embedding similarity (not exact match)
- Configurable similarity threshold for cache hits
- TTL-based cache expiration
- Cache invalidation on new memory writes
- Metrics: cache hit/miss rate tracking
Architecture Analysis
Service Context
This issue belongs to the Memory Service (Rust). It implements the semantic cache layer described in the architecture document:
> Cache: Keyed on semantic similarity of query (embedding-based, not exact string match). Cache entry stores extracted relevant segment + provenance. TTL configurable per memory type. Invalidated on write to any memory in the result set.
The cache sits between the query_memory gRPC handler and the retrieval pipeline. When a query arrives, the cache is checked first by computing the query's embedding and comparing it against cached query embeddings using cosine similarity. If a cache hit is found (similarity exceeds a configurable threshold), the cached results are returned directly, bypassing the full retrieval pipeline and extraction step.
Affected gRPC endpoints:
- `QueryMemory` (server-streaming) — cache lookup is inserted before the retrieval pipeline; cache population is inserted after retrieval + extraction completes.
- `WriteMemory` — on write, the cache must be invalidated for any cached entry whose result set includes the written memory ID.
Proto messages used:
- `QueryMemoryRequest` — the query text is embedded and used as the cache key.
- `QueryMemoryResponse` — cached responses set `is_cached = true`. The `cached_extracted_segment` and `extraction_confidence` fields are populated from the cache entry.
- No proto changes required — the existing `is_cached` field on `QueryMemoryResponse` (field 4) already exists for this purpose.
Existing Patterns
- Config pattern: `RetrievalConfig` and `ExtractionConfig` in `services/memory/src/config.rs` use `#[derive(Debug, Clone, Deserialize)]` with `#[serde(default)]` and named default functions. The cache config should follow the same pattern.
- Builder pattern for service: `MemoryServiceImpl` uses builder methods like `with_embedding_client()` and `with_extraction_client()` at `services/memory/src/service.rs:54-67`. The cache can be built into the service directly (always present, not optional) since it is purely in-memory.
- Embedding generation: The `EmbeddingClient` wrapped in `Arc<Mutex<EmbeddingClient>>` at `services/memory/src/service.rs:26` provides `generate()` to produce query embeddings. The cache will reuse the query embedding already generated for the retrieval pipeline (at `services/memory/src/service.rs:139-145`).
- DuckDB connection pattern: `DuckDbManager::with_connection()` at `services/memory/src/db/mod.rs:84-90` uses `Mutex<Connection>`. The cache is in-memory (not in DuckDB) to avoid database lock contention.
- Module organization: Each feature area has its own module directory (`embedding/`, `extraction/`, `retrieval/`). The cache should follow the same pattern as a `cache/` module.
Dependencies
- No new external crate dependencies — the cache is an in-memory `Vec`-backed structure protected by `tokio::sync::RwLock`. Cosine similarity cannot reuse DuckDB's `array_cosine_similarity` because the cache operates outside DuckDB, so a pure-Rust cosine similarity function is needed.
- Embedding client — the cache requires query embeddings. The `query_memory` handler already generates the query embedding before the pipeline runs (at `services/memory/src/service.rs:139-145`). The cache lookup and population reuse this embedding.
- `std::time::Instant` — for TTL tracking.
- `std::sync::atomic` — for lock-free cache metrics counters.
Implementation Steps
1. Types & Configuration
Add cache configuration to `services/memory/src/config.rs`:

```rust
/// Configuration for the semantic query cache.
#[derive(Debug, Clone, Deserialize)]
pub struct CacheConfig {
    /// Whether the cache is enabled (default: true).
    #[serde(default = "default_cache_enabled")]
    pub enabled: bool,
    /// Cosine similarity threshold for cache hits (default: 0.95).
    /// A cached query embedding must have cosine similarity >= this value
    /// with the incoming query embedding to be considered a hit.
    #[serde(default = "default_cache_similarity_threshold")]
    pub similarity_threshold: f32,
    /// Time-to-live for cache entries in seconds (default: 300 = 5 minutes).
    #[serde(default = "default_cache_ttl_secs")]
    pub ttl_secs: u64,
    /// Maximum number of entries in the cache (default: 1000).
    /// When exceeded, the oldest entry is evicted.
    #[serde(default = "default_cache_max_entries")]
    pub max_entries: usize,
}
```
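The named default functions referenced by the `serde` attributes above would mirror the existing config pattern. A minimal sketch — the function names come from the attributes and the values from the documented defaults:

```rust
/// Default functions backing the `#[serde(default = "...")]` attributes
/// on `CacheConfig`. Values match the documented defaults.
fn default_cache_enabled() -> bool {
    true
}

fn default_cache_similarity_threshold() -> f32 {
    0.95
}

fn default_cache_ttl_secs() -> u64 {
    300 // 5 minutes
}

fn default_cache_max_entries() -> usize {
    1000
}
```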
Add a `cache: CacheConfig` field to the `Config` struct with `#[serde(default)]`.
Define cache types in a new `services/memory/src/cache/mod.rs`:

```rust
use std::time::Instant;

/// A single entry in the semantic cache.
///
/// Stores the query embedding (as the cache key), the retrieval results,
/// and metadata for TTL tracking and invalidation.
#[derive(Debug, Clone)]
pub struct CacheEntry {
    /// The embedding of the cached query (used for similarity matching).
    pub query_embedding: Vec<f32>,
    /// The original query text (for debugging/logging).
    pub query_text: String,
    /// The tag filter used in the original query (cache entries are scoped by tag).
    pub tag_filter: Option<String>,
    /// Cached response data (rank, entry, scores, extraction results).
    pub results: Vec<CachedResult>,
    /// Memory IDs in the result set (for invalidation on write).
    pub result_memory_ids: Vec<String>,
    /// When this entry was created (for TTL expiration).
    pub created_at: Instant,
}

/// A single cached retrieval result.
#[derive(Debug, Clone)]
pub struct CachedResult {
    /// The rank of the result in the original retrieval.
    pub rank: u32,
    /// The memory entry (proto format).
    pub entry: llm_multiverse_proto::llm_multiverse::v1::MemoryEntry,
    /// Cosine similarity score from the retrieval pipeline.
    pub cosine_similarity: f32,
    /// Extracted segment (if extraction was performed).
    pub extracted_segment: Option<String>,
    /// Extraction confidence (if extraction was performed).
    pub extraction_confidence: Option<f32>,
}

/// Cache hit/miss metrics.
#[derive(Debug)]
pub struct CacheMetrics {
    /// Total cache hit count.
    pub hits: std::sync::atomic::AtomicU64,
    /// Total cache miss count.
    pub misses: std::sync::atomic::AtomicU64,
    /// Total cache evictions (TTL or capacity).
    pub evictions: std::sync::atomic::AtomicU64,
    /// Total cache invalidations (due to writes).
    pub invalidations: std::sync::atomic::AtomicU64,
}
```
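Since these counters are independent statistics rather than synchronization points, `Ordering::Relaxed` suffices. A brief sketch (the helper names here are illustrative, not part of the plan's API):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Record a cache hit; Relaxed ordering is enough for a pure statistic.
fn record_hit(hits: &AtomicU64) {
    hits.fetch_add(1, Ordering::Relaxed);
}

/// Read the current hit count for a metrics snapshot.
fn read_hits(hits: &AtomicU64) -> u64 {
    hits.load(Ordering::Relaxed)
}
```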
2. Core Logic
Add the semantic cache manager to the same `services/memory/src/cache/mod.rs`:

```rust
use tokio::sync::RwLock;

use crate::config::CacheConfig;

/// Semantic cache for deduplicating similar queries.
///
/// Keyed by query embedding similarity rather than exact string match.
/// Thread-safe via `RwLock` for concurrent read access (cache lookups)
/// with exclusive write access (cache population, invalidation, eviction).
pub struct SemanticCache {
    config: CacheConfig,
    entries: RwLock<Vec<CacheEntry>>,
    metrics: CacheMetrics,
}

impl SemanticCache {
    /// Create a new semantic cache with the given configuration.
    pub fn new(config: CacheConfig) -> Self;

    /// Look up a cache entry by query embedding similarity.
    ///
    /// Computes cosine similarity between the `query_embedding` and each
    /// cached entry's embedding. Returns the first entry whose similarity
    /// meets or exceeds `config.similarity_threshold` and whose TTL has
    /// not expired.
    ///
    /// Also filters by `tag_filter` — a cached entry is only a hit if its
    /// tag filter matches the incoming query's tag filter.
    ///
    /// Updates metrics (hit or miss counter).
    ///
    /// Returns `None` on miss.
    pub async fn lookup(
        &self,
        query_embedding: &[f32],
        tag_filter: Option<&str>,
    ) -> Option<Vec<CachedResult>>;

    /// Insert a new cache entry.
    ///
    /// If the cache is at `max_entries` capacity, evicts the oldest entry
    /// (by `created_at`). Also removes any expired entries during insertion.
    pub async fn insert(&self, entry: CacheEntry);

    /// Invalidate all cache entries whose result set includes the given memory ID.
    ///
    /// Called when a memory is written or updated. This ensures stale cached
    /// results are not served after the underlying data changes.
    ///
    /// Updates the invalidation counter.
    pub async fn invalidate_by_memory_id(&self, memory_id: &str);

    /// Invalidate all cache entries (full cache flush).
    pub async fn invalidate_all(&self);

    /// Remove expired entries (TTL check).
    ///
    /// Called during `insert` to lazily clean up expired entries; `lookup`
    /// holds only the read lock and merely skips expired entries.
    async fn evict_expired(&self, entries: &mut Vec<CacheEntry>);

    /// Get a snapshot of cache metrics.
    pub fn metrics(&self) -> CacheMetricsSnapshot;
}
```
Create `services/memory/src/cache/similarity.rs` — pure-Rust cosine similarity:

```rust
/// Compute cosine similarity between two vectors.
///
/// Returns a value in [-1.0, 1.0]. Returns 0.0 if either vector has zero magnitude.
///
/// This is a pure-Rust implementation (not DuckDB's `array_cosine_similarity`)
/// because the cache operates in-memory, outside of DuckDB.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32;
```
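A possible body for this function, using `f64` accumulators to limit precision loss on long (e.g. 768-dim) vectors — a sketch, not the final implementation; the length-mismatch guard is an added assumption:

```rust
/// Compute cosine similarity between two vectors.
/// Returns 0.0 if either vector has zero magnitude or the lengths differ.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() || a.is_empty() {
        return 0.0;
    }
    // Accumulate in f64 to reduce rounding error, then cast back to f32.
    let mut dot = 0.0f64;
    let mut norm_a = 0.0f64;
    let mut norm_b = 0.0f64;
    for (&x, &y) in a.iter().zip(b.iter()) {
        dot += x as f64 * y as f64;
        norm_a += (x as f64) * (x as f64);
        norm_b += (y as f64) * (y as f64);
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    (dot / (norm_a.sqrt() * norm_b.sqrt())) as f32
}
```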
Key implementation details for `lookup()`:

- Acquire a read lock on `entries`.
- Get the current time for the TTL check.
- Iterate over entries, skipping expired ones.
- For each non-expired entry, check the tag filter match.
- Compute `cosine_similarity(query_embedding, entry.query_embedding)`.
- If similarity >= `config.similarity_threshold`, increment the hit counter and return the cloned results.
- If no hit is found, increment the miss counter and return `None`.
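Stripped of locking and metrics, the matching step reduces to a single predicate over the entries. A simplified sketch with a pared-down entry type (the real `CacheEntry` carries results and a `created_at` timestamp instead of the `expired` flag):

```rust
/// Pared-down cache entry for illustrating the match predicate.
struct Entry {
    embedding: Vec<f32>,
    tag_filter: Option<String>,
    expired: bool, // stand-in for the `created_at` + TTL check
}

/// Plain f32 cosine similarity, sufficient for this illustration.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Return the index of the first non-expired entry whose tag filter matches
/// and whose embedding similarity meets the threshold.
fn find_hit(
    entries: &[Entry],
    query: &[f32],
    tag: Option<&str>,
    threshold: f32,
) -> Option<usize> {
    entries.iter().position(|e| {
        !e.expired
            && e.tag_filter.as_deref() == tag
            && cosine(&e.embedding, query) >= threshold
    })
}
```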
Key implementation details for `insert()`:

- Acquire a write lock on `entries`.
- Call `evict_expired()` to remove stale entries.
- If `entries.len() >= config.max_entries`, remove the oldest entry by `created_at` and increment the eviction counter.
- Push the new `CacheEntry`.
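The capacity check can be sketched as follows (simplified, no locking; `Timestamped` and `insert_with_capacity` are illustrative names, not the plan's API):

```rust
use std::time::Instant;

/// Minimal stand-in for a cache entry's TTL bookkeeping.
struct Timestamped {
    created_at: Instant,
}

/// Evict the oldest entry (by `created_at`) when at capacity, then push.
/// Returns true if an eviction happened, so the caller can bump the counter.
fn insert_with_capacity(
    entries: &mut Vec<Timestamped>,
    new_entry: Timestamped,
    max_entries: usize,
) -> bool {
    let mut evicted = false;
    if entries.len() >= max_entries {
        // Find the index of the oldest entry and remove it.
        if let Some(oldest) = entries
            .iter()
            .enumerate()
            .min_by_key(|(_, e)| e.created_at)
            .map(|(i, _)| i)
        {
            entries.remove(oldest);
            evicted = true;
        }
    }
    entries.push(new_entry);
    evicted
}
```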
Key implementation details for `invalidate_by_memory_id()`:

- Acquire a write lock on `entries`.
- Retain only entries whose `result_memory_ids` do not contain the given `memory_id`.
- For each removed entry, increment the invalidation counter.
Metrics snapshot type:

```rust
/// A point-in-time snapshot of cache metrics.
#[derive(Debug, Clone)]
pub struct CacheMetricsSnapshot {
    pub hits: u64,
    pub misses: u64,
    pub evictions: u64,
    pub invalidations: u64,
    pub current_size: usize,
    /// Hit rate as a percentage (0.0-100.0). 0.0 if no lookups have been performed.
    pub hit_rate: f64,
}
```
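The `hit_rate` field can be derived from the counters with a zero-lookup guard — a small sketch:

```rust
/// Hit rate as a percentage in [0.0, 100.0]; 0.0 when no lookups occurred.
fn hit_rate(hits: u64, misses: u64) -> f64 {
    let total = hits + misses;
    if total == 0 {
        0.0
    } else {
        hits as f64 / total as f64 * 100.0
    }
}
```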
3. gRPC Handler Wiring
Update `services/memory/src/service.rs` — add the cache to `MemoryServiceImpl` and integrate it into `query_memory`:

```rust
pub struct MemoryServiceImpl {
    db: Arc<DuckDbManager>,
    embedding_client: Option<Arc<Mutex<EmbeddingClient>>>,
    extraction_client: Option<Arc<Mutex<ExtractionClient>>>,
    retrieval_config: RetrievalConfig,
    extraction_config: ExtractionConfig,
    cache: Arc<SemanticCache>,
}
```
Update `MemoryServiceImpl::new()` to accept `CacheConfig` and construct the `SemanticCache`:

```rust
pub fn new(
    db: Arc<DuckDbManager>,
    retrieval_config: RetrievalConfig,
    extraction_config: ExtractionConfig,
    cache_config: CacheConfig,
) -> Self {
    Self {
        db,
        embedding_client: None,
        extraction_client: None,
        retrieval_config,
        extraction_config,
        cache: Arc::new(SemanticCache::new(cache_config)),
    }
}
```
Updated `query_memory` handler flow:

1. Validate the request (existing code).
2. Generate the query embedding (existing code at lines 139-145).
3. NEW — Cache lookup: if the cache is enabled, call `self.cache.lookup(&query_embedding, tag_filter)`. On a hit, stream the cached results with `is_cached = true` and return immediately.
4. Run the retrieval pipeline (existing code at lines 148-163).
5. Run extraction (existing code at lines 166-180).
6. NEW — Cache population: build a `CacheEntry` from the retrieval + extraction results and insert it into the cache.
7. Stream results with `is_cached = false` (existing code).
```rust
// After generating query_embedding and before pipeline:
let tag_filter_ref = if req.memory_type.is_empty() {
    None
} else {
    Some(req.memory_type.as_str())
};

if self.cache.config().enabled {
    if let Some(cached_results) = self.cache.lookup(&query_vector, tag_filter_ref).await {
        tracing::debug!(
            session_id = %ctx.session_id,
            query = %req.query,
            "Cache hit for query"
        );
        // Stream cached results
        let (tx, rx) = tokio::sync::mpsc::channel(cached_results.len().max(1));
        tokio::spawn(async move {
            for result in cached_results {
                let response = QueryMemoryResponse {
                    rank: result.rank,
                    entry: Some(result.entry),
                    cosine_similarity: result.cosine_similarity,
                    is_cached: true,
                    cached_extracted_segment: result.extracted_segment,
                    extraction_confidence: result.extraction_confidence,
                };
                if tx.send(Ok(response)).await.is_err() {
                    break;
                }
            }
        });
        return Ok(Response::new(ReceiverStream::new(rx)));
    }
}

// ... existing pipeline and extraction code ...

// After extraction, before streaming:
// Populate cache with results
if self.cache.config().enabled {
    let cached_results: Vec<CachedResult> = /* build from candidates + extraction_results */;
    let result_memory_ids: Vec<String> = candidates.iter()
        .map(|c| c.memory_id.clone())
        .collect();
    let cache_entry = CacheEntry {
        query_embedding: query_vector.clone(),
        query_text: req.query.clone(),
        tag_filter: params.tag_filter.clone(),
        results: cached_results,
        result_memory_ids,
        created_at: Instant::now(),
    };
    self.cache.insert(cache_entry).await;
}
```
Update the `write_memory` handler (future-proofing):

The `write_memory` handler is currently `Unimplemented`, but the invalidation hook should be documented and placed at the logical location. When `write_memory` is implemented, after successfully writing a memory, it must call:

```rust
self.cache.invalidate_by_memory_id(&memory_id).await;
```

For now, add a comment in the `write_memory` handler noting this requirement:

```rust
// TODO(#32): After write succeeds, call self.cache.invalidate_by_memory_id(&memory_id).await;
```
4. Service Integration
Update `services/memory/src/main.rs` — pass the cache config:

```rust
let cache_config = config.cache.clone();
let mut memory_service = MemoryServiceImpl::new(
    db,
    retrieval_config,
    extraction_config.clone(),
    cache_config,
);
```
Metrics exposure: The cache metrics are accessible via `self.cache.metrics()`. For now, metrics are logged periodically or on demand; a dedicated metrics endpoint or gRPC health check can be added in a future issue. Add periodic logging in the service startup:

```rust
// Log cache metrics every 60 seconds
let cache_ref = memory_service.cache().clone();
tokio::spawn(async move {
    let mut interval = tokio::time::interval(std::time::Duration::from_secs(60));
    loop {
        interval.tick().await;
        let m = cache_ref.metrics();
        tracing::info!(
            hits = m.hits,
            misses = m.misses,
            hit_rate = format!("{:.1}%", m.hit_rate),
            size = m.current_size,
            evictions = m.evictions,
            invalidations = m.invalidations,
            "Cache metrics"
        );
    }
});
```
Error mapping: The cache layer introduces no new error types that propagate to gRPC — cache misses simply fall through to the pipeline, and cache insertion failures are logged but do not block the response.
5. Tests
Unit tests in `services/memory/src/cache/similarity.rs`:

| Test Case | Description |
|---|---|
| `test_cosine_similarity_identical_vectors` | Two identical vectors return 1.0 |
| `test_cosine_similarity_orthogonal_vectors` | Two orthogonal vectors return 0.0 |
| `test_cosine_similarity_opposite_vectors` | Two opposite vectors return -1.0 |
| `test_cosine_similarity_zero_vector` | A zero vector returns 0.0 (no division by zero) |
| `test_cosine_similarity_different_magnitudes` | Vectors with the same direction but different magnitudes return 1.0 |
| `test_cosine_similarity_known_value` | A known pair of vectors produces the expected similarity |
Unit tests in `services/memory/src/cache/mod.rs`:

| Test Case | Description |
|---|---|
| `test_cache_new_creates_empty_cache` | A new cache has 0 entries and zeroed metrics |
| `test_cache_insert_and_lookup_hit` | Insert an entry; lookup with the same embedding returns a hit |
| `test_cache_lookup_miss_below_threshold` | Lookup with a dissimilar embedding returns a miss |
| `test_cache_lookup_miss_empty_cache` | Lookup on an empty cache returns `None` |
| `test_cache_ttl_expiration` | Insert an entry, wait past the TTL; lookup returns `None` |
| `test_cache_invalidate_by_memory_id` | Insert an entry with a memory ID, invalidate; lookup returns `None` |
| `test_cache_invalidate_by_memory_id_partial` | Two entries; invalidating one memory ID leaves the other entry intact |
| `test_cache_invalidate_all` | Insert entries, invalidate all; lookup returns `None` |
| `test_cache_max_entries_eviction` | Inserting beyond `max_entries` evicts the oldest entry |
| `test_cache_metrics_hit_count` | After hits, the hit counter is incremented |
| `test_cache_metrics_miss_count` | After misses, the miss counter is incremented |
| `test_cache_metrics_hit_rate` | After a mix of hits and misses, the hit rate is correct |
| `test_cache_metrics_eviction_count` | After an eviction, the eviction counter is incremented |
| `test_cache_metrics_invalidation_count` | After an invalidation, the invalidation counter is incremented |
| `test_cache_tag_filter_scoping` | An entry cached with tag "A" misses for a lookup with tag "B" |
| `test_cache_tag_filter_none_matches_none` | An entry cached without a tag hits for a lookup without a tag |
| `test_cache_disabled_returns_none` | With `enabled=false`, lookup always returns `None` |
| `test_cache_concurrent_read_write` | Spawn multiple readers and writers; verify no panics or deadlocks |
Service-level tests in `services/memory/src/service.rs`:

| Test Case | Description |
|---|---|
| `test_query_memory_cache_hit` | The first query populates the cache; a second identical query returns `is_cached=true` |
| `test_query_memory_cache_miss_different_query` | Two dissimilar queries both return `is_cached=false` |
| `test_query_memory_cache_disabled` | With the cache disabled in config, all queries return `is_cached=false` |
Config tests in `services/memory/src/config.rs`:

| Test Case | Description |
|---|---|
| `test_cache_config_defaults` | Default config has `enabled=true`, `similarity_threshold=0.95`, `ttl_secs=300`, `max_entries=1000` |
| `test_cache_config_from_toml` | Custom values are loaded from TOML |
| `test_cache_config_uses_defaults_when_omitted` | A config without a `[cache]` section uses defaults |
Mocking strategy:

- Use `DuckDbManager::in_memory()` for all DB operations.
- Use the existing mock Model Gateway server pattern from `services/memory/src/service.rs:469-713` for embedding and extraction clients.
- For cache-specific tests, construct `CacheEntry` directly without needing the full pipeline.
- For TTL tests, use a very short TTL (e.g., 1 second) and `tokio::time::sleep()`.
Cargo Dependencies
No new crate dependencies required. All functionality is available via:

- `tokio` (async `RwLock`, `mpsc` channels, `time` utilities for the metrics-logging interval)
- `std::sync::atomic` (lock-free metrics counters)
- `std::time::Instant` (TTL tracking)
Trait Implementations
No new trait implementations required. The `SemanticCache` is a concrete struct used directly by the service layer.
Error Types
No new error types required. Cache operations are non-fatal:
- Cache lookup miss: falls through to the pipeline.
- Cache insertion failure: logged as warning, response still returned.
- Cache invalidation: best-effort, logged.
Files to Create/Modify
| File | Action | Purpose |
|---|---|---|
| `services/memory/src/cache/mod.rs` | Create | `SemanticCache`, `CacheEntry`, `CachedResult`, `CacheMetrics`, `CacheMetricsSnapshot` — cache manager with lookup, insert, invalidation, eviction, and metrics |
| `services/memory/src/cache/similarity.rs` | Create | `cosine_similarity()` — pure-Rust cosine similarity for in-memory embedding comparison |
| `services/memory/src/config.rs` | Modify | Add `CacheConfig` struct with `enabled`, `similarity_threshold`, `ttl_secs`, `max_entries`; add `cache` field to `Config` |
| `services/memory/src/lib.rs` | Modify | Add `pub mod cache;` |
| `services/memory/src/service.rs` | Modify | Add `cache: Arc<SemanticCache>` to `MemoryServiceImpl`; update the constructor to accept `CacheConfig`; integrate cache lookup before the pipeline and cache population after it in `query_memory`; add a cache-invalidation comment to `write_memory` |
| `services/memory/src/main.rs` | Modify | Pass `CacheConfig` to `MemoryServiceImpl::new()`; add a periodic cache-metrics logging task |
Risks and Edge Cases
- Cache key collision with different tag filters: Two queries with the same text but different `memory_type` tag filters must not share cache entries. Mitigation: the cache lookup filters by `tag_filter` match in addition to embedding similarity — an entry is only a hit if both the similarity threshold is met and the tag filter matches exactly.
- Similarity threshold tuning: A threshold of 0.95 is aggressive — semantically similar but not identical queries may miss. A lower threshold (e.g., 0.90) increases the hit rate but risks returning stale or irrelevant results. Mitigation: make the threshold configurable and start with 0.95 as the safe default.
- Cache size and memory pressure: Each cache entry stores the query embedding (768 floats ≈ 3 KB), the full `MemoryEntry` proto messages (variable size), and extraction results. With 1000 entries and an average of 5 results per entry, memory usage is roughly 1000 * (3 KB + 5 * ~2 KB) ≈ 13 MB, which is acceptable for the target hardware. The `max_entries` cap prevents unbounded growth.
- TTL granularity: TTL is checked lazily during `lookup` and `insert`, not by a background sweeper, so expired entries may linger until the next operation. For the expected query rate this is acceptable; a background sweeper can be added if memory pressure becomes an issue.
- Write-through invalidation for unimplemented `write_memory`: The `write_memory` handler is currently `Unimplemented`, so the invalidation hook is documented as a TODO comment. When `write_memory` is implemented (issue #34 or similar), the cache invalidation must be wired in. Risk: if forgotten, stale cache entries will be served. Mitigation: the TODO comment references issue #32 for traceability.
- Concurrent access patterns: The cache uses `tokio::sync::RwLock`, which allows multiple concurrent readers (cache lookups) with exclusive writer access (inserts, invalidations). This suits a read-heavy workload (many queries, fewer writes); the `RwLock` will not be a bottleneck unless the cache is invalidated very frequently.
- Embedding client required for cache: The cache lookup requires a query embedding, which is generated by the embedding client. If no embedding client is configured, the cache cannot be used. This is already handled by the existing check that returns `failed_precondition` when no embedding client is present — the cache lookup code path is only reached after the embedding is successfully generated.
- Cache coherence with extraction toggle: If the first query runs with `skip_extraction=false` (extraction results cached) and a subsequent semantically similar query has `skip_extraction=true`, the cache hit will return extraction results even though the caller didn't want them. Mitigation: the caller can ignore the extraction fields; alternatively, the cache lookup could also match on the `skip_extraction` flag. Start with the simpler approach (the cache does not differentiate by extraction toggle), since extracted results are strictly more informative.
- Linear scan performance: The cache lookup iterates over all entries computing cosine similarity. For 1000 entries with 768-dim vectors, this is ~768K floating-point multiply-adds, which completes in microseconds on modern hardware — negligible compared to retrieval pipeline latency. No indexing is needed at this scale.
Deviation Log
| Deviation | Reason |
|---|---|
| Merged `feature/issue-31-extraction-step` into the feature branch (fast-forward) | Issue #32 depends on #31 (extraction step), which is completed but not yet merged to main. The extraction types, client, and `ExtractionConfig` are required by the cache integration in `service.rs`. |
| `SemanticCache::metrics()` is async (acquires the read lock to get `current_size`) | The plan showed it as a sync method, but reading `entries.len()` requires the `RwLock`. Made async for correctness. |
| Used `f64` intermediates in the cosine similarity computation | The plan specified `f32` only. Using `f64` for dot-product and magnitude accumulation avoids precision loss with large vectors; the result is cast back to `f32`. |