feat: semantic cache (#32): issue-032.md

2026-03-09 23:45:48 +01:00
parent 13cebc225f
commit d6e5902a59

# Implementation Plan — Issue #32: Implement semantic cache
## Metadata
| Field | Value |
|---|---|
| Issue | [#32](https://git.shahondin1624.de/llm-multiverse/llm-multiverse/issues/32) |
| Title | Implement semantic cache |
| Milestone | Phase 4: Memory Service |
| Labels | |
| Status | `COMPLETED` |
| Language | Rust |
| Related Plans | [issue-028.md](issue-028.md), [issue-029.md](issue-029.md), [issue-030.md](issue-030.md), [issue-031.md](issue-031.md) |
| Blocked by | #31 (completed) |
## Acceptance Criteria
- [ ] Cache keyed by query embedding similarity (not exact match)
- [ ] Configurable similarity threshold for cache hits
- [ ] TTL-based cache expiration
- [ ] Cache invalidation on new memory writes
- [ ] Metrics: cache hit/miss rate tracking
## Architecture Analysis
### Service Context
This issue belongs to the **Memory Service** (Rust). It implements the semantic cache layer described in the architecture document:
> **Cache:** Keyed on semantic similarity of query (embedding-based, not exact string match). Cache entry stores extracted relevant segment + provenance. TTL configurable per memory type. Invalidated on write to any memory in the result set.
The cache sits between the `query_memory` gRPC handler and the retrieval pipeline. When a query arrives, the cache is checked first by computing the query's embedding and comparing it against cached query embeddings using cosine similarity. On a cache hit (similarity meets or exceeds a configurable threshold), the cached results are returned directly, bypassing the full retrieval pipeline and the extraction step.
**Affected gRPC endpoints:**
- `QueryMemory` (server-streaming) — cache lookup is inserted before the retrieval pipeline; cache population is inserted after retrieval + extraction completes.
- `WriteMemory` — on write, the cache must be invalidated for any cached entry whose result set includes the written memory ID.
**Proto messages used:**
- `QueryMemoryRequest` — the query text is embedded and used as the cache key.
- `QueryMemoryResponse` — cached responses set `is_cached = true`. The `cached_extracted_segment` and `extraction_confidence` fields are populated from the cache entry.
- No proto changes required — the existing `is_cached` field on `QueryMemoryResponse` (field 4) already exists for this purpose.
### Existing Patterns
- **Config pattern:** `RetrievalConfig` and `ExtractionConfig` in `services/memory/src/config.rs` use `#[derive(Debug, Clone, Deserialize)]` with `#[serde(default)]` and named default functions. The cache config should follow the same pattern.
- **Builder pattern for service:** `MemoryServiceImpl` uses builder methods like `with_embedding_client()` and `with_extraction_client()` at `services/memory/src/service.rs:54-67`. The cache can be built into the service directly (always present, not optional) since it is purely in-memory.
- **Embedding generation:** The `EmbeddingClient` wrapped in `Arc<Mutex<EmbeddingClient>>` at `services/memory/src/service.rs:26` provides `generate()` to produce query embeddings. The cache will reuse the query embedding already generated for the retrieval pipeline (at `services/memory/src/service.rs:139-145`).
- **DuckDB connection pattern:** `DuckDbManager::with_connection()` at `services/memory/src/db/mod.rs:84-90` uses `Mutex<Connection>`. The cache is in-memory (not in DuckDB) to avoid database lock contention.
- **Module organization:** Each feature area has its own module directory (`embedding/`, `extraction/`, `retrieval/`). The cache should follow the same pattern as a `cache/` module.
### Dependencies
- **No new external crate dependencies** — the cache is an in-memory `HashMap`-based structure protected by `tokio::sync::RwLock`. Cosine similarity computation can reuse the `array_cosine_similarity` logic from the retrieval pipeline, but since the cache operates outside DuckDB, a pure-Rust cosine similarity function is needed.
- **Embedding client** — the cache requires query embeddings. The `query_memory` handler already generates the query embedding before the pipeline runs (at `services/memory/src/service.rs:139-145`). The cache lookup and population reuse this embedding.
- **`std::time::Instant`** — for TTL tracking.
- **`std::sync::atomic`** — for lock-free cache metrics counters.
## Implementation Steps
### 1. Types & Configuration
**Add cache configuration to `services/memory/src/config.rs`:**
```rust
/// Configuration for the semantic query cache.
#[derive(Debug, Clone, Deserialize)]
pub struct CacheConfig {
    /// Whether the cache is enabled (default: true).
    #[serde(default = "default_cache_enabled")]
    pub enabled: bool,

    /// Cosine similarity threshold for cache hits (default: 0.95).
    /// A cached query embedding must have cosine similarity >= this value
    /// with the incoming query embedding to be considered a hit.
    #[serde(default = "default_cache_similarity_threshold")]
    pub similarity_threshold: f32,

    /// Time-to-live for cache entries in seconds (default: 300 = 5 minutes).
    #[serde(default = "default_cache_ttl_secs")]
    pub ttl_secs: u64,

    /// Maximum number of entries in the cache (default: 1000).
    /// When exceeded, the oldest entry is evicted.
    #[serde(default = "default_cache_max_entries")]
    pub max_entries: usize,
}
```
Add `cache: CacheConfig` field to the `Config` struct with `#[serde(default)]`.
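Following the named-default-function pattern used by `RetrievalConfig` and `ExtractionConfig`, the defaults referenced by the `#[serde(default = "...")]` attributes might be sketched as follows (function names match the attributes above; the bodies are illustrative, not the final code):

```rust
// Named default functions for CacheConfig's #[serde(default = "...")] attributes.
// Values match the documented defaults.
fn default_cache_enabled() -> bool {
    true
}

fn default_cache_similarity_threshold() -> f32 {
    0.95
}

fn default_cache_ttl_secs() -> u64 {
    300 // 5 minutes
}

fn default_cache_max_entries() -> usize {
    1000
}
```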
**Define cache types in a new `services/memory/src/cache/mod.rs`:**
```rust
use std::time::Instant;

/// A single entry in the semantic cache.
///
/// Stores the query embedding (as the cache key), the retrieval results,
/// and metadata for TTL tracking and invalidation.
#[derive(Debug, Clone)]
pub struct CacheEntry {
    /// The embedding of the cached query (used for similarity matching).
    pub query_embedding: Vec<f32>,
    /// The original query text (for debugging/logging).
    pub query_text: String,
    /// The tag filter used in the original query (cache entries are scoped by tag).
    pub tag_filter: Option<String>,
    /// Cached response data (rank, entry, scores, extraction results).
    pub results: Vec<CachedResult>,
    /// Memory IDs in the result set (for invalidation on write).
    pub result_memory_ids: Vec<String>,
    /// When this entry was created (for TTL expiration).
    pub created_at: Instant,
}

/// A single cached retrieval result.
#[derive(Debug, Clone)]
pub struct CachedResult {
    /// The rank of the result in the original retrieval.
    pub rank: u32,
    /// The memory entry (proto format).
    pub entry: llm_multiverse_proto::llm_multiverse::v1::MemoryEntry,
    /// Cosine similarity score from the retrieval pipeline.
    pub cosine_similarity: f32,
    /// Extracted segment (if extraction was performed).
    pub extracted_segment: Option<String>,
    /// Extraction confidence (if extraction was performed).
    pub extraction_confidence: Option<f32>,
}

/// Cache hit/miss metrics.
#[derive(Debug)]
pub struct CacheMetrics {
    /// Total cache hit count.
    pub hits: std::sync::atomic::AtomicU64,
    /// Total cache miss count.
    pub misses: std::sync::atomic::AtomicU64,
    /// Total cache evictions (TTL or capacity).
    pub evictions: std::sync::atomic::AtomicU64,
    /// Total cache invalidations (due to writes).
    pub invalidations: std::sync::atomic::AtomicU64,
}
```
### 2. Core Logic
**Create `services/memory/src/cache/mod.rs` — Semantic cache manager:**
```rust
use tokio::sync::RwLock;

use crate::config::CacheConfig;

/// Semantic cache for deduplicating similar queries.
///
/// Keyed by query embedding similarity rather than exact string match.
/// Thread-safe via `RwLock` for concurrent read access (cache lookups)
/// with exclusive write access (cache population, invalidation, eviction).
pub struct SemanticCache {
    config: CacheConfig,
    entries: RwLock<Vec<CacheEntry>>,
    metrics: CacheMetrics,
}

impl SemanticCache {
    /// Create a new semantic cache with the given configuration.
    pub fn new(config: CacheConfig) -> Self;

    /// Access the cache configuration (used by the handler to check `enabled`).
    pub fn config(&self) -> &CacheConfig;

    /// Look up a cache entry by query embedding similarity.
    ///
    /// Computes cosine similarity between the `query_embedding` and each
    /// cached entry's embedding. Returns the first entry whose similarity
    /// meets or exceeds `config.similarity_threshold` and whose TTL has
    /// not expired.
    ///
    /// Also filters by `tag_filter` — a cached entry is only a hit if its
    /// tag filter matches the incoming query's tag filter.
    ///
    /// Updates metrics (hit or miss counter).
    ///
    /// Returns `None` on miss.
    pub async fn lookup(
        &self,
        query_embedding: &[f32],
        tag_filter: Option<&str>,
    ) -> Option<Vec<CachedResult>>;

    /// Insert a new cache entry.
    ///
    /// If the cache is at `max_entries` capacity, evicts the oldest entry
    /// (by `created_at`). Also removes any expired entries during insertion.
    pub async fn insert(&self, entry: CacheEntry);

    /// Invalidate all cache entries whose result set includes the given memory ID.
    ///
    /// Called when a memory is written or updated. This ensures stale cached
    /// results are not served after the underlying data changes.
    ///
    /// Updates the invalidation counter.
    pub async fn invalidate_by_memory_id(&self, memory_id: &str);

    /// Invalidate all cache entries (full cache flush).
    pub async fn invalidate_all(&self);

    /// Remove expired entries (TTL check).
    ///
    /// Called during `lookup` and `insert` to lazily clean up expired entries.
    async fn evict_expired(&self, entries: &mut Vec<CacheEntry>);

    /// Get a snapshot of cache metrics.
    pub fn metrics(&self) -> CacheMetricsSnapshot;
}
```
**Create `services/memory/src/cache/similarity.rs` — Pure-Rust cosine similarity:**
```rust
/// Compute cosine similarity between two vectors.
///
/// Returns a value in [-1.0, 1.0]. Returns 0.0 if either vector has zero magnitude.
///
/// This is a pure-Rust implementation (not DuckDB's `array_cosine_similarity`)
/// because the cache operates in-memory, outside of DuckDB.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32;
```
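A possible body for this function — accumulating in `f64` per the deviation log, then casting back to `f32` — could look like this (a sketch, not the final implementation):

```rust
/// Compute cosine similarity between two equal-length vectors.
///
/// Accumulates the dot product and magnitudes in f64 to limit precision
/// loss on long vectors, then casts the result back to f32.
/// Returns 0.0 if either vector has zero magnitude (no division by zero).
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "embedding dimensions must match");
    let mut dot = 0.0f64;
    let mut norm_a = 0.0f64;
    let mut norm_b = 0.0f64;
    for (&x, &y) in a.iter().zip(b.iter()) {
        dot += x as f64 * y as f64;
        norm_a += (x as f64) * (x as f64);
        norm_b += (y as f64) * (y as f64);
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    (dot / (norm_a.sqrt() * norm_b.sqrt())) as f32
}
```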
**Key implementation details for `lookup()`:**
1. Acquire read lock on `entries`.
2. Get current time for TTL check.
3. Iterate over entries, skip expired ones.
4. For each non-expired entry, check tag filter match.
5. Compute `cosine_similarity(query_embedding, entry.query_embedding)`.
6. If similarity >= `config.similarity_threshold`, increment hit counter, return cloned results.
7. If no hit found, increment miss counter, return `None`.
**Key implementation details for `insert()`:**
1. Acquire write lock on `entries`.
2. Call `evict_expired()` to remove stale entries.
3. If `entries.len() >= config.max_entries`, remove the oldest entry by `created_at`. Increment eviction counter.
4. Push the new `CacheEntry`.
**Key implementation details for `invalidate_by_memory_id()`:**
1. Acquire write lock on `entries`.
2. Retain only entries whose `result_memory_ids` do not contain the given `memory_id`.
3. For each removed entry, increment invalidation counter.
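The steps above can be sketched as a synchronous, self-contained example. This deliberately simplifies the plan: it uses a plain `Vec` without the `tokio::sync::RwLock`, a `String` payload standing in for `Vec<CachedResult>`, and an inline cosine helper — an illustration of the lookup/insert/invalidate logic, not the final code.

```rust
use std::time::{Duration, Instant};

// Simplified stand-in for CacheEntry (String payload instead of Vec<CachedResult>).
pub struct Entry {
    pub query_embedding: Vec<f32>,
    pub tag_filter: Option<String>,
    pub result_memory_ids: Vec<String>,
    pub created_at: Instant,
    pub payload: String,
}

pub struct Cache {
    pub entries: Vec<Entry>,
    pub ttl: Duration,
    pub similarity_threshold: f32,
    pub max_entries: usize,
}

// Inline cosine similarity (the real cache calls cache::similarity::cosine_similarity).
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

impl Cache {
    /// lookup() steps: skip expired entries, require an exact tag-filter
    /// match, then return the first entry meeting the similarity threshold.
    pub fn lookup(&self, q: &[f32], tag: Option<&str>) -> Option<String> {
        let now = Instant::now();
        self.entries
            .iter()
            .filter(|e| now.duration_since(e.created_at) < self.ttl)
            .filter(|e| e.tag_filter.as_deref() == tag)
            .find(|e| cosine(q, &e.query_embedding) >= self.similarity_threshold)
            .map(|e| e.payload.clone())
    }

    /// insert() steps: evict expired entries, evict the oldest entry when at
    /// capacity, then push the new entry.
    pub fn insert(&mut self, entry: Entry) {
        let now = Instant::now();
        self.entries.retain(|e| now.duration_since(e.created_at) < self.ttl);
        if self.entries.len() >= self.max_entries {
            if let Some(oldest) = self
                .entries
                .iter()
                .enumerate()
                .min_by_key(|(_, e)| e.created_at)
                .map(|(i, _)| i)
            {
                self.entries.remove(oldest);
            }
        }
        self.entries.push(entry);
    }

    /// invalidate_by_memory_id() steps: drop every entry whose result set
    /// contains the written memory ID.
    pub fn invalidate_by_memory_id(&mut self, memory_id: &str) {
        self.entries
            .retain(|e| !e.result_memory_ids.iter().any(|m| m == memory_id));
    }
}
```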
**Metrics snapshot type:**
```rust
/// A point-in-time snapshot of cache metrics.
#[derive(Debug, Clone)]
pub struct CacheMetricsSnapshot {
    pub hits: u64,
    pub misses: u64,
    pub evictions: u64,
    pub invalidations: u64,
    pub current_size: usize,
    /// Hit rate as a percentage (0.0-100.0). 0.0 if no lookups performed.
    pub hit_rate: f64,
}
```
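The `hit_rate` derivation can be sketched with a hypothetical `snapshot_hit_rate` helper over just the two relevant counters (the real snapshot also reads the eviction/invalidation counters and `entries.len()`):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Sketch: deriving hit_rate from the atomic counters. Relaxed ordering is
// sufficient — the counters are independent statistics, not synchronization.
pub struct Counters {
    pub hits: AtomicU64,
    pub misses: AtomicU64,
}

/// Hit rate as a percentage (0.0-100.0); 0.0 when no lookups have happened.
pub fn snapshot_hit_rate(c: &Counters) -> f64 {
    let hits = c.hits.load(Ordering::Relaxed);
    let total = hits + c.misses.load(Ordering::Relaxed);
    if total == 0 {
        0.0
    } else {
        hits as f64 / total as f64 * 100.0
    }
}
```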
### 3. gRPC Handler Wiring
**Update `services/memory/src/service.rs` — Add cache to `MemoryServiceImpl` and integrate into `query_memory`:**
```rust
pub struct MemoryServiceImpl {
    db: Arc<DuckDbManager>,
    embedding_client: Option<Arc<Mutex<EmbeddingClient>>>,
    extraction_client: Option<Arc<Mutex<ExtractionClient>>>,
    retrieval_config: RetrievalConfig,
    extraction_config: ExtractionConfig,
    cache: Arc<SemanticCache>,
}
```
Update `MemoryServiceImpl::new()` to accept `CacheConfig` and construct the `SemanticCache`:
```rust
pub fn new(
    db: Arc<DuckDbManager>,
    retrieval_config: RetrievalConfig,
    extraction_config: ExtractionConfig,
    cache_config: CacheConfig,
) -> Self {
    Self {
        db,
        embedding_client: None,
        extraction_client: None,
        retrieval_config,
        extraction_config,
        cache: Arc::new(SemanticCache::new(cache_config)),
    }
}
```
**Updated `query_memory` handler flow:**
1. Validate request (existing code).
2. Generate query embedding (existing code at lines 139-145).
3. **NEW — Cache lookup:** If cache is enabled, call `self.cache.lookup(&query_embedding, tag_filter)`. If hit, stream cached results with `is_cached = true` and return immediately.
4. Run retrieval pipeline (existing code at lines 148-163).
5. Run extraction (existing code at lines 166-180).
6. **NEW — Cache population:** Build a `CacheEntry` from the retrieval + extraction results and insert into the cache.
7. Stream results with `is_cached = false` (existing code).
```rust
// After generating query_embedding and before pipeline:
let tag_filter_ref = if req.memory_type.is_empty() {
    None
} else {
    Some(req.memory_type.as_str())
};

if self.cache.config().enabled {
    if let Some(cached_results) = self.cache.lookup(&query_vector, tag_filter_ref).await {
        tracing::debug!(
            session_id = %ctx.session_id,
            query = %req.query,
            "Cache hit for query"
        );
        // Stream cached results
        let (tx, rx) = tokio::sync::mpsc::channel(cached_results.len().max(1));
        tokio::spawn(async move {
            for result in cached_results {
                let response = QueryMemoryResponse {
                    rank: result.rank,
                    entry: Some(result.entry),
                    cosine_similarity: result.cosine_similarity,
                    is_cached: true,
                    cached_extracted_segment: result.extracted_segment,
                    extraction_confidence: result.extraction_confidence,
                };
                if tx.send(Ok(response)).await.is_err() {
                    break;
                }
            }
        });
        return Ok(Response::new(ReceiverStream::new(rx)));
    }
}

// ... existing pipeline and extraction code ...

// After extraction, before streaming: populate cache with results.
if self.cache.config().enabled {
    let cached_results: Vec<CachedResult> = /* build from candidates + extraction_results */;
    let result_memory_ids: Vec<String> = candidates
        .iter()
        .map(|c| c.memory_id.clone())
        .collect();
    let cache_entry = CacheEntry {
        query_embedding: query_vector.clone(),
        query_text: req.query.clone(),
        tag_filter: params.tag_filter.clone(),
        results: cached_results,
        result_memory_ids,
        created_at: Instant::now(),
    };
    self.cache.insert(cache_entry).await;
}
```
**Update `write_memory` handler (future-proofing):**
The `write_memory` handler is currently `Unimplemented`, but the invalidation hook should be documented and placed at the logical location. When `write_memory` is implemented, after successfully writing a memory, it must call:
```rust
self.cache.invalidate_by_memory_id(&memory_id).await;
```
For now, add a comment in the `write_memory` handler noting this requirement:
```rust
// TODO(#32): After write succeeds, call self.cache.invalidate_by_memory_id(&memory_id).await;
```
### 4. Service Integration
**Update `services/memory/src/main.rs` — Pass cache config:**
```rust
let cache_config = config.cache.clone();
let mut memory_service = MemoryServiceImpl::new(
    db,
    retrieval_config,
    extraction_config.clone(),
    cache_config,
);
```
**Metrics exposure:** Cache metrics are accessible via `self.cache.metrics()`. For now, metrics are logged periodically or on demand; a dedicated metrics endpoint or gRPC health check can be added in a future issue. Add a periodic logging task at service startup:
```rust
// Log cache metrics every 60 seconds
let cache_ref = memory_service.cache().clone();
tokio::spawn(async move {
    let mut interval = tokio::time::interval(std::time::Duration::from_secs(60));
    loop {
        interval.tick().await;
        let m = cache_ref.metrics();
        tracing::info!(
            hits = m.hits,
            misses = m.misses,
            hit_rate = format!("{:.1}%", m.hit_rate),
            size = m.current_size,
            evictions = m.evictions,
            invalidations = m.invalidations,
            "Cache metrics"
        );
    }
});
```
**Error mapping:** The cache layer introduces no new error types that propagate to gRPC — cache misses simply fall through to the pipeline, and cache insertion failures are logged but do not block the response.
### 5. Tests
**Unit tests in `services/memory/src/cache/similarity.rs`:**
| Test Case | Description |
|---|---|
| `test_cosine_similarity_identical_vectors` | Two identical vectors return 1.0 |
| `test_cosine_similarity_orthogonal_vectors` | Two orthogonal vectors return 0.0 |
| `test_cosine_similarity_opposite_vectors` | Two opposite vectors return -1.0 |
| `test_cosine_similarity_zero_vector` | A zero vector returns 0.0 (no division by zero) |
| `test_cosine_similarity_different_magnitudes` | Vectors with same direction but different magnitudes return 1.0 |
| `test_cosine_similarity_known_value` | Known pair of vectors produces expected similarity |
**Unit tests in `services/memory/src/cache/mod.rs`:**
| Test Case | Description |
|---|---|
| `test_cache_new_creates_empty_cache` | New cache has 0 entries and 0 metrics |
| `test_cache_insert_and_lookup_hit` | Insert an entry, lookup with same embedding returns hit |
| `test_cache_lookup_miss_below_threshold` | Lookup with dissimilar embedding returns miss |
| `test_cache_lookup_miss_empty_cache` | Lookup on empty cache returns None |
| `test_cache_ttl_expiration` | Insert entry, wait past TTL, lookup returns None |
| `test_cache_invalidate_by_memory_id` | Insert entry with memory ID, invalidate, lookup returns None |
| `test_cache_invalidate_by_memory_id_partial` | Two entries, invalidate one memory ID, other entry survives |
| `test_cache_invalidate_all` | Insert entries, invalidate all, lookup returns None |
| `test_cache_max_entries_eviction` | Insert entries beyond max_entries, oldest is evicted |
| `test_cache_metrics_hit_count` | After hits, hit counter is incremented |
| `test_cache_metrics_miss_count` | After misses, miss counter is incremented |
| `test_cache_metrics_hit_rate` | After mix of hits and misses, hit rate is correct |
| `test_cache_metrics_eviction_count` | After eviction, eviction counter is incremented |
| `test_cache_metrics_invalidation_count` | After invalidation, invalidation counter is incremented |
| `test_cache_tag_filter_scoping` | Entry cached with tag "A", lookup with tag "B" misses |
| `test_cache_tag_filter_none_matches_none` | Entry cached without tag, lookup without tag hits |
| `test_cache_disabled_returns_none` | Cache with `enabled=false`, lookup always returns None |
| `test_cache_concurrent_read_write` | Spawn multiple readers and writers, verify no panics/deadlocks |
**Service-level tests in `services/memory/src/service.rs`:**
| Test Case | Description |
|---|---|
| `test_query_memory_cache_hit` | First query populates cache, second identical query returns `is_cached=true` |
| `test_query_memory_cache_miss_different_query` | Two dissimilar queries both return `is_cached=false` |
| `test_query_memory_cache_disabled` | Cache disabled in config, all queries return `is_cached=false` |
**Config tests in `services/memory/src/config.rs`:**
| Test Case | Description |
|---|---|
| `test_cache_config_defaults` | Default config has `enabled=true`, `similarity_threshold=0.95`, `ttl_secs=300`, `max_entries=1000` |
| `test_cache_config_from_toml` | Custom values loaded from TOML |
| `test_cache_config_uses_defaults_when_omitted` | Config without `[cache]` section uses defaults |
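For reference, a `[cache]` section with custom values — a hypothetical fragment of the kind `test_cache_config_from_toml` would load — might look like:

```toml
# Hypothetical [cache] section; key names match CacheConfig.
[cache]
enabled = true
similarity_threshold = 0.90
ttl_secs = 600
max_entries = 500
```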
**Mocking strategy:**
- Use `DuckDbManager::in_memory()` for all DB operations.
- Use the existing mock Model Gateway server pattern from `services/memory/src/service.rs:469-713` for embedding and extraction clients.
- For cache-specific tests, construct `CacheEntry` directly without needing the full pipeline.
- For TTL tests, use a very short TTL (e.g., 1 second) and `tokio::time::sleep()`.
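The TTL-test strategy above can be sketched synchronously, with `std::thread::sleep` standing in for `tokio::time::sleep` and a hypothetical `is_expired` helper mirroring the lazy check in `evict_expired()`:

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical helper mirroring the lazy TTL check: an entry is expired
/// once its age meets or exceeds the configured TTL.
fn is_expired(created_at: Instant, ttl: Duration) -> bool {
    created_at.elapsed() >= ttl
}
```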
### Cargo Dependencies
No new crate dependencies required. All functionality is available via:
- `tokio` (async `RwLock`, `mpsc` channels, `time::interval` for periodic metrics logging)
- `std::sync::atomic` (lock-free metrics counters)
- `std::time::Instant` (TTL tracking)
### Trait Implementations
No new trait implementations required. The `SemanticCache` is a concrete struct used directly by the service layer.
### Error Types
No new error types required. Cache operations are non-fatal:
- Cache lookup miss: falls through to the pipeline.
- Cache insertion failure: logged as warning, response still returned.
- Cache invalidation: best-effort, logged.
## Files to Create/Modify
| File | Action | Purpose |
|---|---|---|
| `services/memory/src/cache/mod.rs` | Create | `SemanticCache`, `CacheEntry`, `CachedResult`, `CacheMetrics`, `CacheMetricsSnapshot` — cache manager with lookup, insert, invalidation, eviction, and metrics |
| `services/memory/src/cache/similarity.rs` | Create | `cosine_similarity()` — pure-Rust cosine similarity for in-memory embedding comparison |
| `services/memory/src/config.rs` | Modify | Add `CacheConfig` struct with `enabled`, `similarity_threshold`, `ttl_secs`, `max_entries`; add `cache` field to `Config` |
| `services/memory/src/lib.rs` | Modify | Add `pub mod cache;` |
| `services/memory/src/service.rs` | Modify | Add `cache: Arc<SemanticCache>` to `MemoryServiceImpl`; update constructor to accept `CacheConfig`; integrate cache lookup before pipeline and cache population after pipeline in `query_memory`; add cache invalidation comment to `write_memory` |
| `services/memory/src/main.rs` | Modify | Pass `CacheConfig` to `MemoryServiceImpl::new()`; add periodic cache metrics logging task |
## Risks and Edge Cases
- **Cache key collision with different tag filters:** Two queries with the same text but different `memory_type` tag filters should not share cache entries. Mitigation: the cache lookup filters by `tag_filter` match in addition to embedding similarity. A cache entry is only a hit if both the embedding similarity threshold is met AND the tag filter matches exactly.
- **Similarity threshold tuning:** A threshold of 0.95 is strict — semantically similar but not identical queries may miss. A lower threshold (e.g., 0.90) increases the hit rate but risks returning stale or irrelevant results. Mitigation: make the threshold configurable and start with 0.95 as the safe default.
- **Cache size and memory pressure:** Each cache entry stores the query embedding (768 floats = 3KB), the full `MemoryEntry` proto messages (variable size), and extraction results. With 1000 entries and average 5 results per entry, memory usage is roughly 1000 * (3KB + 5 * ~2KB) = ~13MB. This is acceptable for the target hardware. The `max_entries` cap prevents unbounded growth.
- **TTL granularity:** TTL is checked lazily during `lookup` and `insert`, not by a background sweeper. This means expired entries may linger until the next operation. For the expected query rate, this is acceptable. A background sweeper can be added if memory pressure becomes an issue.
- **Write-through invalidation for unimplemented `write_memory`:** The `write_memory` handler is currently `Unimplemented`. The invalidation hook is documented as a TODO comment. When `write_memory` is implemented (issue #34 or similar), the cache invalidation must be wired in. Risk: if forgotten, stale cache entries will be served. Mitigation: the TODO comment references issue #32 for traceability.
- **Concurrent access patterns:** The cache uses `tokio::sync::RwLock` which allows multiple concurrent readers (cache lookups) with exclusive writer access (inserts, invalidations). This is appropriate for a read-heavy workload (many queries, fewer writes). The `RwLock` will not be a bottleneck unless the cache is invalidated very frequently.
- **Embedding client required for cache:** The cache lookup requires a query embedding, which is generated by the embedding client. If no embedding client is configured, the cache cannot be used. This is already handled by the existing check that returns `failed_precondition` when no embedding client is present — the cache lookup code path is only reached after the embedding is successfully generated.
- **Cache coherence with extraction toggle:** If the first query runs with `skip_extraction=false` (extraction results cached) and a subsequent semantically similar query has `skip_extraction=true`, the cache hit will return extraction results even though the caller didn't want them. Mitigation: the caller can ignore the extraction fields; alternatively, the cache lookup could also match on `skip_extraction` flag. Start with the simpler approach (cache does not differentiate by extraction toggle) since extracted results are strictly more informative.
- **Linear scan performance:** The cache lookup iterates over all entries computing cosine similarity. For 1000 entries with 768-dim vectors, this is ~1000 * 768 multiply-adds = ~768K floating point ops, which completes in microseconds on modern hardware. This is negligible compared to the retrieval pipeline latency. No indexing needed at this scale.
## Deviation Log
| Deviation | Reason |
|---|---|
| Merged `feature/issue-31-extraction-step` into feature branch (fast-forward) | Issue #32 depends on #31 (extraction step) which is completed but not yet merged to `main`. The extraction types, client, and `ExtractionConfig` are required by the cache integration in `service.rs`. |
| `SemanticCache::metrics()` is `async` (acquires read lock to get `current_size`) | Plan showed it as a sync method, but reading `entries.len()` requires the RwLock. Made async for correctness. |
| Used `f64` intermediates in cosine similarity computation | Plan specified `f32` only. Using `f64` for dot product and magnitude accumulation avoids precision loss with large vectors. Cast result back to `f32`. |