feat: semantic cache (#32): issue-032.md

2026-03-09 23:45:48 +01:00
parent 13cebc225f
commit d6e5902a59

# Implementation Plan — Issue #32: Implement semantic cache
## Metadata
| Field | Value |
|---|---|
| Issue | [#32](https://git.shahondin1624.de/llm-multiverse/llm-multiverse/issues/32) |
| Title | Implement semantic cache |
| Milestone | Phase 4: Memory Service |
| Labels | |
| Status | `COMPLETED` |
| Language | Rust |
| Related Plans | [issue-028.md](issue-028.md), [issue-029.md](issue-029.md), [issue-030.md](issue-030.md), [issue-031.md](issue-031.md) |
| Blocked by | #31 (completed) |
## Acceptance Criteria
- [ ] Cache keyed by query embedding similarity (not exact match)
- [ ] Configurable similarity threshold for cache hits
- [ ] TTL-based cache expiration
- [ ] Cache invalidation on new memory writes
- [ ] Metrics: cache hit/miss rate tracking
## Architecture Analysis
### Service Context
This issue belongs to the **Memory Service** (Rust). It implements the semantic cache layer described in the architecture document:
> **Cache:** Keyed on semantic similarity of query (embedding-based, not exact string match). Cache entry stores extracted relevant segment + provenance. TTL configurable per memory type. Invalidated on write to any memory in the result set.
The cache sits between the `query_memory` gRPC handler and the retrieval pipeline. When a query arrives, the cache is checked first by computing the query's embedding and comparing it against cached query embeddings using cosine similarity. On a cache hit (similarity meets or exceeds a configurable threshold), the cached results are returned directly, bypassing the full retrieval pipeline and the extraction step.
**Affected gRPC endpoints:**
- `QueryMemory` (server-streaming) — cache lookup is inserted before the retrieval pipeline; cache population is inserted after retrieval + extraction completes.
- `WriteMemory` — on write, the cache must be invalidated for any cached entry whose result set includes the written memory ID.
**Proto messages used:**
- `QueryMemoryRequest` — the query text is embedded and used as the cache key.
- `QueryMemoryResponse` — cached responses set `is_cached = true`. The `cached_extracted_segment` and `extraction_confidence` fields are populated from the cache entry.
- No proto changes required — the existing `is_cached` field on `QueryMemoryResponse` (field 4) already exists for this purpose.
### Existing Patterns
- **Config pattern:** `RetrievalConfig` and `ExtractionConfig` in `services/memory/src/config.rs` use `#[derive(Debug, Clone, Deserialize)]` with `#[serde(default)]` and named default functions. The cache config should follow the same pattern.
- **Builder pattern for service:** `MemoryServiceImpl` uses builder methods like `with_embedding_client()` and `with_extraction_client()` at `services/memory/src/service.rs:54-67`. The cache can be built into the service directly (always present, not optional) since it is purely in-memory.
- **Embedding generation:** The `EmbeddingClient` wrapped in `Arc<Mutex<EmbeddingClient>>` at `services/memory/src/service.rs:26` provides `generate()` to produce query embeddings. The cache will reuse the query embedding already generated for the retrieval pipeline (at `services/memory/src/service.rs:139-145`).
- **DuckDB connection pattern:** `DuckDbManager::with_connection()` at `services/memory/src/db/mod.rs:84-90` uses `Mutex<Connection>`. The cache is in-memory (not in DuckDB) to avoid database lock contention.
- **Module organization:** Each feature area has its own module directory (`embedding/`, `extraction/`, `retrieval/`). The cache should follow the same pattern as a `cache/` module.
### Dependencies
- **No new external crate dependencies** — the cache is an in-memory `HashMap`-based structure protected by `tokio::sync::RwLock`. Cosine similarity computation can reuse the `array_cosine_similarity` logic from the retrieval pipeline, but since the cache operates outside DuckDB, a pure-Rust cosine similarity function is needed.
- **Embedding client** — the cache requires query embeddings. The `query_memory` handler already generates the query embedding before the pipeline runs (at `services/memory/src/service.rs:139-145`). The cache lookup and population reuse this embedding.
- **`std::time::Instant`** — for TTL tracking.
- **`std::sync::atomic`** — for lock-free cache metrics counters.
## Implementation Steps
### 1. Types & Configuration
**Add cache configuration to `services/memory/src/config.rs`:**
```rust
/// Configuration for the semantic query cache.
#[derive(Debug, Clone, Deserialize)]
pub struct CacheConfig {
    /// Whether the cache is enabled (default: true).
    #[serde(default = "default_cache_enabled")]
    pub enabled: bool,

    /// Cosine similarity threshold for cache hits (default: 0.95).
    /// A cached query embedding must have cosine similarity >= this value
    /// with the incoming query embedding to be considered a hit.
    #[serde(default = "default_cache_similarity_threshold")]
    pub similarity_threshold: f32,

    /// Time-to-live for cache entries in seconds (default: 300 = 5 minutes).
    #[serde(default = "default_cache_ttl_secs")]
    pub ttl_secs: u64,

    /// Maximum number of entries in the cache (default: 1000).
    /// When exceeded, the oldest entry is evicted.
    #[serde(default = "default_cache_max_entries")]
    pub max_entries: usize,
}
```
Add `cache: CacheConfig` field to the `Config` struct with `#[serde(default)]`.
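Following the named-default-function pattern used by `RetrievalConfig` and `ExtractionConfig`, the defaults referenced by the `#[serde(default = "...")]` attributes might be sketched as follows (function names match the attributes above; the bodies are illustrative, not the final code):

```rust
// Named default functions for CacheConfig's #[serde(default = "...")] attributes.
// Values match the documented defaults.
fn default_cache_enabled() -> bool {
    true
}

fn default_cache_similarity_threshold() -> f32 {
    0.95
}

fn default_cache_ttl_secs() -> u64 {
    300 // 5 minutes
}

fn default_cache_max_entries() -> usize {
    1000
}
```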
**Define cache types in a new `services/memory/src/cache/mod.rs`:**
```rust
use std::time::Instant;

/// A single entry in the semantic cache.
///
/// Stores the query embedding (as the cache key), the retrieval results,
/// and metadata for TTL tracking and invalidation.
#[derive(Debug, Clone)]
pub struct CacheEntry {
    /// The embedding of the cached query (used for similarity matching).
    pub query_embedding: Vec<f32>,
    /// The original query text (for debugging/logging).
    pub query_text: String,
    /// The tag filter used in the original query (cache entries are scoped by tag).
    pub tag_filter: Option<String>,
    /// Cached response data (rank, entry, scores, extraction results).
    pub results: Vec<CachedResult>,
    /// Memory IDs in the result set (for invalidation on write).
    pub result_memory_ids: Vec<String>,
    /// When this entry was created (for TTL expiration).
    pub created_at: Instant,
}

/// A single cached retrieval result.
#[derive(Debug, Clone)]
pub struct CachedResult {
    /// The rank of the result in the original retrieval.
    pub rank: u32,
    /// The memory entry (proto format).
    pub entry: llm_multiverse_proto::llm_multiverse::v1::MemoryEntry,
    /// Cosine similarity score from the retrieval pipeline.
    pub cosine_similarity: f32,
    /// Extracted segment (if extraction was performed).
    pub extracted_segment: Option<String>,
    /// Extraction confidence (if extraction was performed).
    pub extraction_confidence: Option<f32>,
}

/// Cache hit/miss metrics.
#[derive(Debug)]
pub struct CacheMetrics {
    /// Total cache hit count.
    pub hits: std::sync::atomic::AtomicU64,
    /// Total cache miss count.
    pub misses: std::sync::atomic::AtomicU64,
    /// Total cache evictions (TTL or capacity).
    pub evictions: std::sync::atomic::AtomicU64,
    /// Total cache invalidations (due to writes).
    pub invalidations: std::sync::atomic::AtomicU64,
}
```
### 2. Core Logic
**Create `services/memory/src/cache/mod.rs` — Semantic cache manager:**
```rust
use tokio::sync::RwLock;

use crate::config::CacheConfig;

/// Semantic cache for deduplicating similar queries.
///
/// Keyed by query embedding similarity rather than exact string match.
/// Thread-safe via `RwLock` for concurrent read access (cache lookups)
/// with exclusive write access (cache population, invalidation, eviction).
pub struct SemanticCache {
    config: CacheConfig,
    entries: RwLock<Vec<CacheEntry>>,
    metrics: CacheMetrics,
}

impl SemanticCache {
    /// Create a new semantic cache with the given configuration.
    pub fn new(config: CacheConfig) -> Self;

    /// Access the cache configuration (used by the handler to check `enabled`).
    pub fn config(&self) -> &CacheConfig;

    /// Look up a cache entry by query embedding similarity.
    ///
    /// Computes cosine similarity between the `query_embedding` and each
    /// cached entry's embedding. Returns the first entry whose similarity
    /// meets or exceeds `config.similarity_threshold` and whose TTL has
    /// not expired.
    ///
    /// Also filters by `tag_filter` — a cached entry is only a hit if its
    /// tag filter matches the incoming query's tag filter.
    ///
    /// Updates metrics (hit or miss counter).
    ///
    /// Returns `None` on miss.
    pub async fn lookup(
        &self,
        query_embedding: &[f32],
        tag_filter: Option<&str>,
    ) -> Option<Vec<CachedResult>>;

    /// Insert a new cache entry.
    ///
    /// If the cache is at `max_entries` capacity, evicts the oldest entry
    /// (by `created_at`). Also removes any expired entries during insertion.
    pub async fn insert(&self, entry: CacheEntry);

    /// Invalidate all cache entries whose result set includes the given memory ID.
    ///
    /// Called when a memory is written or updated. This ensures stale cached
    /// results are not served after the underlying data changes.
    ///
    /// Updates the invalidation counter.
    pub async fn invalidate_by_memory_id(&self, memory_id: &str);

    /// Invalidate all cache entries (full cache flush).
    pub async fn invalidate_all(&self);

    /// Remove expired entries (TTL check).
    ///
    /// Called during `lookup` and `insert` to lazily clean up expired entries.
    async fn evict_expired(&self, entries: &mut Vec<CacheEntry>);

    /// Get a snapshot of cache metrics.
    pub fn metrics(&self) -> CacheMetricsSnapshot;
}
```
**Create `services/memory/src/cache/similarity.rs` — Pure-Rust cosine similarity:**
```rust
/// Compute cosine similarity between two vectors.
///
/// Returns a value in [-1.0, 1.0]. Returns 0.0 if either vector has zero magnitude.
///
/// This is a pure-Rust implementation (not DuckDB's `array_cosine_similarity`)
/// because the cache operates in-memory, outside of DuckDB.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32;
```
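A possible body for this function — accumulating in `f64` per the deviation log, then casting back to `f32` — could look like this (a sketch, not the final implementation):

```rust
/// Compute cosine similarity between two equal-length vectors.
///
/// Accumulates the dot product and magnitudes in f64 to limit precision
/// loss on long vectors, then casts the result back to f32.
/// Returns 0.0 if either vector has zero magnitude (no division by zero).
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "embedding dimensions must match");
    let mut dot = 0.0f64;
    let mut norm_a = 0.0f64;
    let mut norm_b = 0.0f64;
    for (&x, &y) in a.iter().zip(b.iter()) {
        dot += x as f64 * y as f64;
        norm_a += (x as f64) * (x as f64);
        norm_b += (y as f64) * (y as f64);
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    (dot / (norm_a.sqrt() * norm_b.sqrt())) as f32
}
```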
**Key implementation details for `lookup()`:**
1. Acquire read lock on `entries`.
2. Get current time for TTL check.
3. Iterate over entries, skip expired ones.
4. For each non-expired entry, check tag filter match.
5. Compute `cosine_similarity(query_embedding, entry.query_embedding)`.
6. If similarity >= `config.similarity_threshold`, increment hit counter, return cloned results.
7. If no hit found, increment miss counter, return `None`.
**Key implementation details for `insert()`:**
1. Acquire write lock on `entries`.
2. Call `evict_expired()` to remove stale entries.
3. If `entries.len() >= config.max_entries`, remove the oldest entry by `created_at`. Increment eviction counter.
4. Push the new `CacheEntry`.
**Key implementation details for `invalidate_by_memory_id()`:**
1. Acquire write lock on `entries`.
2. Retain only entries whose `result_memory_ids` do not contain the given `memory_id`.
3. For each removed entry, increment invalidation counter.
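The steps above can be sketched as a synchronous, self-contained example. This deliberately simplifies the plan: it uses a plain `Vec` without the `tokio::sync::RwLock`, a `String` payload standing in for `Vec<CachedResult>`, and an inline cosine helper — an illustration of the lookup/insert/invalidate logic, not the final code.

```rust
use std::time::{Duration, Instant};

// Simplified stand-in for CacheEntry (String payload instead of Vec<CachedResult>).
pub struct Entry {
    pub query_embedding: Vec<f32>,
    pub tag_filter: Option<String>,
    pub result_memory_ids: Vec<String>,
    pub created_at: Instant,
    pub payload: String,
}

pub struct Cache {
    pub entries: Vec<Entry>,
    pub ttl: Duration,
    pub similarity_threshold: f32,
    pub max_entries: usize,
}

// Inline cosine similarity (the real cache calls cache::similarity::cosine_similarity).
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

impl Cache {
    /// lookup() steps: skip expired entries, require an exact tag-filter
    /// match, then return the first entry meeting the similarity threshold.
    pub fn lookup(&self, q: &[f32], tag: Option<&str>) -> Option<String> {
        let now = Instant::now();
        self.entries
            .iter()
            .filter(|e| now.duration_since(e.created_at) < self.ttl)
            .filter(|e| e.tag_filter.as_deref() == tag)
            .find(|e| cosine(q, &e.query_embedding) >= self.similarity_threshold)
            .map(|e| e.payload.clone())
    }

    /// insert() steps: evict expired entries, evict the oldest entry when at
    /// capacity, then push the new entry.
    pub fn insert(&mut self, entry: Entry) {
        let now = Instant::now();
        self.entries.retain(|e| now.duration_since(e.created_at) < self.ttl);
        if self.entries.len() >= self.max_entries {
            if let Some(oldest) = self
                .entries
                .iter()
                .enumerate()
                .min_by_key(|(_, e)| e.created_at)
                .map(|(i, _)| i)
            {
                self.entries.remove(oldest);
            }
        }
        self.entries.push(entry);
    }

    /// invalidate_by_memory_id() steps: drop every entry whose result set
    /// contains the written memory ID.
    pub fn invalidate_by_memory_id(&mut self, memory_id: &str) {
        self.entries
            .retain(|e| !e.result_memory_ids.iter().any(|m| m == memory_id));
    }
}
```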
**Metrics snapshot type:**
```rust
/// A point-in-time snapshot of cache metrics.
#[derive(Debug, Clone)]
pub struct CacheMetricsSnapshot {
    pub hits: u64,
    pub misses: u64,
    pub evictions: u64,
    pub invalidations: u64,
    pub current_size: usize,
    /// Hit rate as a percentage (0.0-100.0). 0.0 if no lookups performed.
    pub hit_rate: f64,
}
```
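The `hit_rate` derivation can be sketched with a hypothetical `snapshot_hit_rate` helper over just the two relevant counters (the real snapshot also reads the eviction/invalidation counters and `entries.len()`):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Sketch: deriving hit_rate from the atomic counters. Relaxed ordering is
// sufficient — the counters are independent statistics, not synchronization.
pub struct Counters {
    pub hits: AtomicU64,
    pub misses: AtomicU64,
}

/// Hit rate as a percentage (0.0-100.0); 0.0 when no lookups have happened.
pub fn snapshot_hit_rate(c: &Counters) -> f64 {
    let hits = c.hits.load(Ordering::Relaxed);
    let total = hits + c.misses.load(Ordering::Relaxed);
    if total == 0 {
        0.0
    } else {
        hits as f64 / total as f64 * 100.0
    }
}
```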
### 3. gRPC Handler Wiring
**Update `services/memory/src/service.rs` — Add cache to `MemoryServiceImpl` and integrate into `query_memory`:**
```rust
pub struct MemoryServiceImpl {
    db: Arc<DuckDbManager>,
    embedding_client: Option<Arc<Mutex<EmbeddingClient>>>,
    extraction_client: Option<Arc<Mutex<ExtractionClient>>>,
    retrieval_config: RetrievalConfig,
    extraction_config: ExtractionConfig,
    cache: Arc<SemanticCache>,
}
```
Update `MemoryServiceImpl::new()` to accept `CacheConfig` and construct the `SemanticCache`:
```rust
pub fn new(
    db: Arc<DuckDbManager>,
    retrieval_config: RetrievalConfig,
    extraction_config: ExtractionConfig,
    cache_config: CacheConfig,
) -> Self {
    Self {
        db,
        embedding_client: None,
        extraction_client: None,
        retrieval_config,
        extraction_config,
        cache: Arc::new(SemanticCache::new(cache_config)),
    }
}
```
**Updated `query_memory` handler flow:**
1. Validate request (existing code).
2. Generate query embedding (existing code at lines 139-145).
3. **NEW — Cache lookup:** If cache is enabled, call `self.cache.lookup(&query_embedding, tag_filter)`. If hit, stream cached results with `is_cached = true` and return immediately.
4. Run retrieval pipeline (existing code at lines 148-163).
5. Run extraction (existing code at lines 166-180).
6. **NEW — Cache population:** Build a `CacheEntry` from the retrieval + extraction results and insert into the cache.
7. Stream results with `is_cached = false` (existing code).
```rust
// After generating query_embedding and before pipeline:
let tag_filter_ref = if req.memory_type.is_empty() {
    None
} else {
    Some(req.memory_type.as_str())
};

if self.cache.config().enabled {
    if let Some(cached_results) = self.cache.lookup(&query_vector, tag_filter_ref).await {
        tracing::debug!(
            session_id = %ctx.session_id,
            query = %req.query,
            "Cache hit for query"
        );
        // Stream cached results
        let (tx, rx) = tokio::sync::mpsc::channel(cached_results.len().max(1));
        tokio::spawn(async move {
            for result in cached_results {
                let response = QueryMemoryResponse {
                    rank: result.rank,
                    entry: Some(result.entry),
                    cosine_similarity: result.cosine_similarity,
                    is_cached: true,
                    cached_extracted_segment: result.extracted_segment,
                    extraction_confidence: result.extraction_confidence,
                };
                if tx.send(Ok(response)).await.is_err() {
                    break;
                }
            }
        });
        return Ok(Response::new(ReceiverStream::new(rx)));
    }
}

// ... existing pipeline and extraction code ...

// After extraction, before streaming: populate cache with results.
if self.cache.config().enabled {
    let cached_results: Vec<CachedResult> = /* build from candidates + extraction_results */;
    let result_memory_ids: Vec<String> = candidates
        .iter()
        .map(|c| c.memory_id.clone())
        .collect();
    let cache_entry = CacheEntry {
        query_embedding: query_vector.clone(),
        query_text: req.query.clone(),
        tag_filter: params.tag_filter.clone(),
        results: cached_results,
        result_memory_ids,
        created_at: Instant::now(),
    };
    self.cache.insert(cache_entry).await;
}
```
**Update `write_memory` handler (future-proofing):**
The `write_memory` handler is currently `Unimplemented`, but the invalidation hook should be documented and placed at the logical location. When `write_memory` is implemented, after successfully writing a memory, it must call:
```rust
self.cache.invalidate_by_memory_id(&memory_id).await;
```
For now, add a comment in the `write_memory` handler noting this requirement:
```rust
// TODO(#32): After write succeeds, call self.cache.invalidate_by_memory_id(&memory_id).await;
```
### 4. Service Integration
**Update `services/memory/src/main.rs` — Pass cache config:**
```rust
let cache_config = config.cache.clone();
let mut memory_service = MemoryServiceImpl::new(
    db,
    retrieval_config,
    extraction_config.clone(),
    cache_config,
);
```
**Metrics exposure:** Cache metrics are accessible via `self.cache.metrics()`. For now, metrics are logged periodically or on demand; a dedicated metrics endpoint or gRPC health check can be added in a future issue. Add a periodic logging task at service startup:
```rust
// Log cache metrics every 60 seconds
let cache_ref = memory_service.cache().clone();
tokio::spawn(async move {
    let mut interval = tokio::time::interval(std::time::Duration::from_secs(60));
    loop {
        interval.tick().await;
        let m = cache_ref.metrics();
        tracing::info!(
            hits = m.hits,
            misses = m.misses,
            hit_rate = format!("{:.1}%", m.hit_rate),
            size = m.current_size,
            evictions = m.evictions,
            invalidations = m.invalidations,
            "Cache metrics"
        );
    }
});
```
**Error mapping:** The cache layer introduces no new error types that propagate to gRPC — cache misses simply fall through to the pipeline, and cache insertion failures are logged but do not block the response.
### 5. Tests
**Unit tests in `services/memory/src/cache/similarity.rs`:**
| Test Case | Description |
|---|---|
| `test_cosine_similarity_identical_vectors` | Two identical vectors return 1.0 |
| `test_cosine_similarity_orthogonal_vectors` | Two orthogonal vectors return 0.0 |
| `test_cosine_similarity_opposite_vectors` | Two opposite vectors return -1.0 |
| `test_cosine_similarity_zero_vector` | A zero vector returns 0.0 (no division by zero) |
| `test_cosine_similarity_different_magnitudes` | Vectors with same direction but different magnitudes return 1.0 |
| `test_cosine_similarity_known_value` | Known pair of vectors produces expected similarity |
**Unit tests in `services/memory/src/cache/mod.rs`:**
| Test Case | Description |
|---|---|
| `test_cache_new_creates_empty_cache` | New cache has 0 entries and 0 metrics |
| `test_cache_insert_and_lookup_hit` | Insert an entry, lookup with same embedding returns hit |
| `test_cache_lookup_miss_below_threshold` | Lookup with dissimilar embedding returns miss |
| `test_cache_lookup_miss_empty_cache` | Lookup on empty cache returns None |
| `test_cache_ttl_expiration` | Insert entry, wait past TTL, lookup returns None |
| `test_cache_invalidate_by_memory_id` | Insert entry with memory ID, invalidate, lookup returns None |
| `test_cache_invalidate_by_memory_id_partial` | Two entries, invalidate one memory ID, other entry survives |
| `test_cache_invalidate_all` | Insert entries, invalidate all, lookup returns None |
| `test_cache_max_entries_eviction` | Insert entries beyond max_entries, oldest is evicted |
| `test_cache_metrics_hit_count` | After hits, hit counter is incremented |
| `test_cache_metrics_miss_count` | After misses, miss counter is incremented |
| `test_cache_metrics_hit_rate` | After mix of hits and misses, hit rate is correct |
| `test_cache_metrics_eviction_count` | After eviction, eviction counter is incremented |
| `test_cache_metrics_invalidation_count` | After invalidation, invalidation counter is incremented |
| `test_cache_tag_filter_scoping` | Entry cached with tag "A", lookup with tag "B" misses |
| `test_cache_tag_filter_none_matches_none` | Entry cached without tag, lookup without tag hits |
| `test_cache_disabled_returns_none` | Cache with `enabled=false`, lookup always returns None |
| `test_cache_concurrent_read_write` | Spawn multiple readers and writers, verify no panics/deadlocks |
**Service-level tests in `services/memory/src/service.rs`:**
| Test Case | Description |
|---|---|
| `test_query_memory_cache_hit` | First query populates cache, second identical query returns `is_cached=true` |
| `test_query_memory_cache_miss_different_query` | Two dissimilar queries both return `is_cached=false` |
| `test_query_memory_cache_disabled` | Cache disabled in config, all queries return `is_cached=false` |
**Config tests in `services/memory/src/config.rs`:**
| Test Case | Description |
|---|---|
| `test_cache_config_defaults` | Default config has `enabled=true`, `similarity_threshold=0.95`, `ttl_secs=300`, `max_entries=1000` |
| `test_cache_config_from_toml` | Custom values loaded from TOML |
| `test_cache_config_uses_defaults_when_omitted` | Config without `[cache]` section uses defaults |
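For reference, a `[cache]` section with custom values — a hypothetical fragment of the kind `test_cache_config_from_toml` would load — might look like:

```toml
# Hypothetical [cache] section; key names match CacheConfig.
[cache]
enabled = true
similarity_threshold = 0.90
ttl_secs = 600
max_entries = 500
```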
**Mocking strategy:**
- Use `DuckDbManager::in_memory()` for all DB operations.
- Use the existing mock Model Gateway server pattern from `services/memory/src/service.rs:469-713` for embedding and extraction clients.
- For cache-specific tests, construct `CacheEntry` directly without needing the full pipeline.
- For TTL tests, use a very short TTL (e.g., 1 second) and `tokio::time::sleep()`.
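The TTL-test strategy above can be sketched synchronously, with `std::thread::sleep` standing in for `tokio::time::sleep` and a hypothetical `is_expired` helper mirroring the lazy check in `evict_expired()`:

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical helper mirroring the lazy TTL check: an entry is expired
/// once its age meets or exceeds the configured TTL.
fn is_expired(created_at: Instant, ttl: Duration) -> bool {
    created_at.elapsed() >= ttl
}
```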
### Cargo Dependencies
No new crate dependencies required. All functionality is available via:
- `tokio` (async `RwLock`, `mpsc` channels, `time::interval` for periodic metrics logging)
- `std::sync::atomic` (lock-free metrics counters)
- `std::time::Instant` (TTL tracking)
### Trait Implementations
No new trait implementations required. The `SemanticCache` is a concrete struct used directly by the service layer.
### Error Types
No new error types required. Cache operations are non-fatal:
- Cache lookup miss: falls through to the pipeline.
- Cache insertion failure: logged as warning, response still returned.
- Cache invalidation: best-effort, logged.
## Files to Create/Modify
| File | Action | Purpose |
|---|---|---|
| `services/memory/src/cache/mod.rs` | Create | `SemanticCache`, `CacheEntry`, `CachedResult`, `CacheMetrics`, `CacheMetricsSnapshot` — cache manager with lookup, insert, invalidation, eviction, and metrics |
| `services/memory/src/cache/similarity.rs` | Create | `cosine_similarity()` — pure-Rust cosine similarity for in-memory embedding comparison |
| `services/memory/src/config.rs` | Modify | Add `CacheConfig` struct with `enabled`, `similarity_threshold`, `ttl_secs`, `max_entries`; add `cache` field to `Config` |
| `services/memory/src/lib.rs` | Modify | Add `pub mod cache;` |
| `services/memory/src/service.rs` | Modify | Add `cache: Arc<SemanticCache>` to `MemoryServiceImpl`; update constructor to accept `CacheConfig`; integrate cache lookup before pipeline and cache population after pipeline in `query_memory`; add cache invalidation comment to `write_memory` |
| `services/memory/src/main.rs` | Modify | Pass `CacheConfig` to `MemoryServiceImpl::new()`; add periodic cache metrics logging task |
## Risks and Edge Cases
- **Cache key collision with different tag filters:** Two queries with the same text but different `memory_type` tag filters should not share cache entries. Mitigation: the cache lookup filters by `tag_filter` match in addition to embedding similarity. A cache entry is only a hit if both the embedding similarity threshold is met AND the tag filter matches exactly.
- **Similarity threshold tuning:** A threshold of 0.95 is strict — semantically similar but not identical queries may miss. A lower threshold (e.g., 0.90) increases the hit rate but risks returning stale or irrelevant results. Mitigation: make the threshold configurable and start with 0.95 as the safe default.
- **Cache size and memory pressure:** Each cache entry stores the query embedding (768 floats = 3KB), the full `MemoryEntry` proto messages (variable size), and extraction results. With 1000 entries and average 5 results per entry, memory usage is roughly 1000 * (3KB + 5 * ~2KB) = ~13MB. This is acceptable for the target hardware. The `max_entries` cap prevents unbounded growth.
- **TTL granularity:** TTL is checked lazily during `lookup` and `insert`, not by a background sweeper. This means expired entries may linger until the next operation. For the expected query rate, this is acceptable. A background sweeper can be added if memory pressure becomes an issue.
- **Write-through invalidation for unimplemented `write_memory`:** The `write_memory` handler is currently `Unimplemented`. The invalidation hook is documented as a TODO comment. When `write_memory` is implemented (issue #34 or similar), the cache invalidation must be wired in. Risk: if forgotten, stale cache entries will be served. Mitigation: the TODO comment references issue #32 for traceability.
- **Concurrent access patterns:** The cache uses `tokio::sync::RwLock` which allows multiple concurrent readers (cache lookups) with exclusive writer access (inserts, invalidations). This is appropriate for a read-heavy workload (many queries, fewer writes). The `RwLock` will not be a bottleneck unless the cache is invalidated very frequently.
- **Embedding client required for cache:** The cache lookup requires a query embedding, which is generated by the embedding client. If no embedding client is configured, the cache cannot be used. This is already handled by the existing check that returns `failed_precondition` when no embedding client is present — the cache lookup code path is only reached after the embedding is successfully generated.
- **Cache coherence with extraction toggle:** If the first query runs with `skip_extraction=false` (extraction results cached) and a subsequent semantically similar query has `skip_extraction=true`, the cache hit will return extraction results even though the caller didn't want them. Mitigation: the caller can ignore the extraction fields; alternatively, the cache lookup could also match on `skip_extraction` flag. Start with the simpler approach (cache does not differentiate by extraction toggle) since extracted results are strictly more informative.
- **Linear scan performance:** The cache lookup iterates over all entries computing cosine similarity. For 1000 entries with 768-dim vectors, this is ~1000 * 768 multiply-adds = ~768K floating point ops, which completes in microseconds on modern hardware. This is negligible compared to the retrieval pipeline latency. No indexing needed at this scale.
## Deviation Log
| Deviation | Reason |
|---|---|
| Merged `feature/issue-31-extraction-step` into feature branch (fast-forward) | Issue #32 depends on #31 (extraction step) which is completed but not yet merged to `main`. The extraction types, client, and `ExtractionConfig` are required by the cache integration in `service.rs`. |
| `SemanticCache::metrics()` is `async` (acquires read lock to get `current_size`) | Plan showed it as a sync method, but reading `entries.len()` requires the RwLock. Made async for correctness. |
| Used `f64` intermediates in cosine similarity computation | Plan specified `f32` only. Using `f64` for dot product and magnitude accumulation avoids precision loss with large vectors. Cast result back to `f32`. |