feat: semantic cache (#32): issue-032.md
# Implementation Plan — Issue #32: Implement semantic cache

## Metadata

| Field | Value |
|---|---|
| Issue | [#32](https://git.shahondin1624.de/llm-multiverse/llm-multiverse/issues/32) |
| Title | Implement semantic cache |
| Milestone | Phase 4: Memory Service |
| Labels | |
| Status | `COMPLETED` |
| Language | Rust |
| Related Plans | [issue-028.md](issue-028.md), [issue-029.md](issue-029.md), [issue-030.md](issue-030.md), [issue-031.md](issue-031.md) |
| Blocked by | #31 (completed) |

## Acceptance Criteria

- [ ] Cache keyed by query embedding similarity (not exact match)
- [ ] Configurable similarity threshold for cache hits
- [ ] TTL-based cache expiration
- [ ] Cache invalidation on new memory writes
- [ ] Metrics: cache hit/miss rate tracking

## Architecture Analysis

### Service Context

This issue belongs to the **Memory Service** (Rust). It implements the semantic cache layer described in the architecture document:

> **Cache:** Keyed on semantic similarity of query (embedding-based, not exact string match). Cache entry stores extracted relevant segment + provenance. TTL configurable per memory type. Invalidated on write to any memory in the result set.

The cache sits between the `query_memory` gRPC handler and the retrieval pipeline. When a query arrives, the cache is checked first by computing the query's embedding and comparing it against cached query embeddings using cosine similarity. On a cache hit (similarity meets or exceeds a configurable threshold), the cached results are returned directly, bypassing the full retrieval pipeline and extraction step.

**Affected gRPC endpoints:**
- `QueryMemory` (server-streaming) — cache lookup is inserted before the retrieval pipeline; cache population is inserted after retrieval + extraction completes.
- `WriteMemory` — on write, the cache must be invalidated for any cached entry whose result set includes the written memory ID.

**Proto messages used:**
- `QueryMemoryRequest` — the query text is embedded and used as the cache key.
- `QueryMemoryResponse` — cached responses set `is_cached = true`. The `cached_extracted_segment` and `extraction_confidence` fields are populated from the cache entry.
- No proto changes required — `QueryMemoryResponse` already carries an `is_cached` field (field 4) for exactly this purpose.

### Existing Patterns

- **Config pattern:** `RetrievalConfig` and `ExtractionConfig` in `services/memory/src/config.rs` use `#[derive(Debug, Clone, Deserialize)]` with `#[serde(default)]` and named default functions. The cache config should follow the same pattern.
- **Builder pattern for service:** `MemoryServiceImpl` uses builder methods like `with_embedding_client()` and `with_extraction_client()` at `services/memory/src/service.rs:54-67`. The cache can be built into the service directly (always present, not optional) since it is purely in-memory.
- **Embedding generation:** The `EmbeddingClient` wrapped in `Arc<Mutex<EmbeddingClient>>` at `services/memory/src/service.rs:26` provides `generate()` to produce query embeddings. The cache reuses the query embedding already generated for the retrieval pipeline (at `services/memory/src/service.rs:139-145`).
- **DuckDB connection pattern:** `DuckDbManager::with_connection()` at `services/memory/src/db/mod.rs:84-90` uses `Mutex<Connection>`. The cache is kept in-memory (not in DuckDB) to avoid database lock contention.
- **Module organization:** Each feature area has its own module directory (`embedding/`, `extraction/`, `retrieval/`). The cache should follow the same pattern as a `cache/` module.

### Dependencies

- **No new external crate dependencies** — the cache is an in-memory structure protected by `tokio::sync::RwLock`. Cosine similarity computation could in principle reuse the `array_cosine_similarity` logic from the retrieval pipeline, but since the cache operates outside DuckDB, a pure-Rust cosine similarity function is needed.
- **Embedding client** — the cache requires query embeddings. The `query_memory` handler already generates the query embedding before the pipeline runs (at `services/memory/src/service.rs:139-145`). The cache lookup and population reuse this embedding.
- **`std::time::Instant`** — for TTL tracking.
- **`std::sync::atomic`** — for lock-free cache metrics counters.

## Implementation Steps

### 1. Types & Configuration

**Add cache configuration to `services/memory/src/config.rs`:**

```rust
/// Configuration for the semantic query cache.
#[derive(Debug, Clone, Deserialize)]
pub struct CacheConfig {
    /// Whether the cache is enabled (default: true).
    #[serde(default = "default_cache_enabled")]
    pub enabled: bool,

    /// Cosine similarity threshold for cache hits (default: 0.95).
    /// A cached query embedding must have cosine similarity >= this value
    /// with the incoming query embedding to be considered a hit.
    #[serde(default = "default_cache_similarity_threshold")]
    pub similarity_threshold: f32,

    /// Time-to-live for cache entries in seconds (default: 300 = 5 minutes).
    #[serde(default = "default_cache_ttl_secs")]
    pub ttl_secs: u64,

    /// Maximum number of entries in the cache (default: 1000).
    /// When exceeded, the oldest entry is evicted.
    #[serde(default = "default_cache_max_entries")]
    pub max_entries: usize,
}
```
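
The named default functions referenced by the `#[serde(default = "...")]` attributes would follow the existing convention in `config.rs`; a sketch (function names as given in the attributes above, values from the doc comments):

```rust
// Named defaults for `CacheConfig`, matching the documented default values.
fn default_cache_enabled() -> bool {
    true
}

fn default_cache_similarity_threshold() -> f32 {
    0.95
}

fn default_cache_ttl_secs() -> u64 {
    300 // 5 minutes
}

fn default_cache_max_entries() -> usize {
    1000
}
```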

Add a `cache: CacheConfig` field to the `Config` struct with `#[serde(default)]`.

**Define cache types in a new `services/memory/src/cache/mod.rs`:**

```rust
use std::time::Instant;

/// A single entry in the semantic cache.
///
/// Stores the query embedding (as the cache key), the retrieval results,
/// and metadata for TTL tracking and invalidation.
#[derive(Debug, Clone)]
pub struct CacheEntry {
    /// The embedding of the cached query (used for similarity matching).
    pub query_embedding: Vec<f32>,
    /// The original query text (for debugging/logging).
    pub query_text: String,
    /// The tag filter used in the original query (cache entries are scoped by tag).
    pub tag_filter: Option<String>,
    /// Cached response data (rank, entry, scores, extraction results).
    pub results: Vec<CachedResult>,
    /// Memory IDs in the result set (for invalidation on write).
    pub result_memory_ids: Vec<String>,
    /// When this entry was created (for TTL expiration).
    pub created_at: Instant,
}

/// A single cached retrieval result.
#[derive(Debug, Clone)]
pub struct CachedResult {
    /// The rank of the result in the original retrieval.
    pub rank: u32,
    /// The memory entry (proto format).
    pub entry: llm_multiverse_proto::llm_multiverse::v1::MemoryEntry,
    /// Cosine similarity score from the retrieval pipeline.
    pub cosine_similarity: f32,
    /// Extracted segment (if extraction was performed).
    pub extracted_segment: Option<String>,
    /// Extraction confidence (if extraction was performed).
    pub extraction_confidence: Option<f32>,
}

/// Cache hit/miss metrics.
#[derive(Debug)]
pub struct CacheMetrics {
    /// Total cache hit count.
    pub hits: std::sync::atomic::AtomicU64,
    /// Total cache miss count.
    pub misses: std::sync::atomic::AtomicU64,
    /// Total cache evictions (TTL or capacity).
    pub evictions: std::sync::atomic::AtomicU64,
    /// Total cache invalidations (due to writes).
    pub invalidations: std::sync::atomic::AtomicU64,
}
```

### 2. Core Logic

**Add the semantic cache manager to `services/memory/src/cache/mod.rs` (skeleton; method signatures only):**

```rust
use tokio::sync::RwLock;

use crate::config::CacheConfig;

/// Semantic cache for deduplicating similar queries.
///
/// Keyed by query embedding similarity rather than exact string match.
/// Thread-safe via `RwLock` for concurrent read access (cache lookups)
/// with exclusive write access (cache population, invalidation, eviction).
pub struct SemanticCache {
    config: CacheConfig,
    entries: RwLock<Vec<CacheEntry>>,
    metrics: CacheMetrics,
}

impl SemanticCache {
    /// Create a new semantic cache with the given configuration.
    pub fn new(config: CacheConfig) -> Self;

    /// Look up a cache entry by query embedding similarity.
    ///
    /// Computes cosine similarity between the `query_embedding` and each
    /// cached entry's embedding. Returns the first entry whose similarity
    /// meets or exceeds `config.similarity_threshold` and whose TTL has
    /// not expired.
    ///
    /// Also filters by `tag_filter` — a cached entry is only a hit if its
    /// tag filter matches the incoming query's tag filter.
    ///
    /// Updates metrics (hit or miss counter).
    ///
    /// Returns `None` on miss.
    pub async fn lookup(
        &self,
        query_embedding: &[f32],
        tag_filter: Option<&str>,
    ) -> Option<Vec<CachedResult>>;

    /// Insert a new cache entry.
    ///
    /// If the cache is at `max_entries` capacity, evicts the oldest entry
    /// (by `created_at`). Also removes any expired entries during insertion.
    pub async fn insert(&self, entry: CacheEntry);

    /// Invalidate all cache entries whose result set includes the given memory ID.
    ///
    /// Called when a memory is written or updated. This ensures stale cached
    /// results are not served after the underlying data changes.
    ///
    /// Updates the invalidation counter.
    pub async fn invalidate_by_memory_id(&self, memory_id: &str);

    /// Invalidate all cache entries (full cache flush).
    pub async fn invalidate_all(&self);

    /// Remove expired entries (TTL check).
    ///
    /// Called during `lookup` and `insert` to lazily clean up expired entries.
    async fn evict_expired(&self, entries: &mut Vec<CacheEntry>);

    /// Get a snapshot of cache metrics.
    pub fn metrics(&self) -> CacheMetricsSnapshot;
}
```

**Create `services/memory/src/cache/similarity.rs` — Pure-Rust cosine similarity:**

```rust
/// Compute cosine similarity between two vectors.
///
/// Returns a value in [-1.0, 1.0]. Returns 0.0 if either vector has zero magnitude.
///
/// This is a pure-Rust implementation (not DuckDB's `array_cosine_similarity`)
/// because the cache operates in-memory, outside of DuckDB.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32;
```
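
Given the signature above and the Deviation Log's note on `f64` intermediates, the body might look like the following sketch (the length-mismatch guard is an added defensive assumption, not from the plan):

```rust
/// Compute cosine similarity between two vectors.
///
/// Accumulates the dot product and squared magnitudes in `f64` to limit
/// precision loss on long vectors, then casts the result back to `f32`.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    // Defensive: mismatched or empty inputs cannot be meaningfully compared.
    if a.len() != b.len() || a.is_empty() {
        return 0.0;
    }
    let mut dot = 0.0_f64;
    let mut norm_a = 0.0_f64;
    let mut norm_b = 0.0_f64;
    for (&x, &y) in a.iter().zip(b.iter()) {
        dot += f64::from(x) * f64::from(y);
        norm_a += f64::from(x) * f64::from(x);
        norm_b += f64::from(y) * f64::from(y);
    }
    // Zero-magnitude vectors have no direction; avoid division by zero.
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    (dot / (norm_a.sqrt() * norm_b.sqrt())) as f32
}
```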

**Key implementation details for `lookup()`:**

1. Acquire read lock on `entries`.
2. Get current time for TTL check.
3. Iterate over entries, skip expired ones.
4. For each non-expired entry, check tag filter match.
5. Compute `cosine_similarity(query_embedding, entry.query_embedding)`.
6. If similarity >= `config.similarity_threshold`, increment hit counter, return cloned results.
7. If no hit found, increment miss counter, return `None`.
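
Steps 3-6 reduce to a predicate over the entry list. As an illustration only (`Entry` and `find_hit` are simplified, synchronous stand-ins for `CacheEntry` and the real method, which runs under the `RwLock` and updates the atomic counters):

```rust
use std::time::{Duration, Instant};

// Simplified stand-in for `CacheEntry` (illustration only).
struct Entry {
    query_embedding: Vec<f32>,
    tag_filter: Option<String>,
    created_at: Instant,
}

fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f64 = a.iter().zip(b).map(|(&x, &y)| f64::from(x) * f64::from(y)).sum();
    let na = a.iter().map(|&x| f64::from(x) * f64::from(x)).sum::<f64>().sqrt();
    let nb = b.iter().map(|&y| f64::from(y) * f64::from(y)).sum::<f64>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { (dot / (na * nb)) as f32 }
}

/// Return the index of the first entry that is not expired, has a matching
/// tag filter, and clears the similarity threshold; `None` on a miss.
fn find_hit(
    entries: &[Entry],
    query_embedding: &[f32],
    tag_filter: Option<&str>,
    threshold: f32,
    ttl: Duration,
) -> Option<usize> {
    entries.iter().position(|e| {
        e.created_at.elapsed() < ttl                 // step 3: skip expired
            && e.tag_filter.as_deref() == tag_filter // step 4: tag scoping
            && cosine_similarity(query_embedding, &e.query_embedding) >= threshold // steps 5-6
    })
}
```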

**Key implementation details for `insert()`:**

1. Acquire write lock on `entries`.
2. Call `evict_expired()` to remove stale entries.
3. If `entries.len() >= config.max_entries`, remove the oldest entry by `created_at`. Increment eviction counter.
4. Push the new `CacheEntry`.
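
Steps 2-3 can be sketched with a simplified entry type (`make_room` is a hypothetical helper name; the real `insert` also increments the eviction counter):

```rust
use std::time::{Duration, Instant};

// Simplified stand-in for `CacheEntry` (illustration only).
struct Entry {
    created_at: Instant,
}

/// Lazily drop expired entries, then evict the oldest entry (by `created_at`)
/// while the cache is at or above capacity, so one more entry can be pushed.
fn make_room(entries: &mut Vec<Entry>, ttl: Duration, max_entries: usize) {
    // Step 2: lazy TTL cleanup.
    entries.retain(|e| e.created_at.elapsed() < ttl);
    // Step 3: capacity eviction, oldest first.
    while entries.len() >= max_entries {
        let oldest = entries
            .iter()
            .enumerate()
            .min_by_key(|(_, e)| e.created_at)
            .map(|(i, _)| i);
        match oldest {
            Some(i) => { entries.remove(i); }
            None => break,
        }
    }
}
```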

**Key implementation details for `invalidate_by_memory_id()`:**

1. Acquire write lock on `entries`.
2. Retain only entries whose `result_memory_ids` do not contain the given `memory_id`.
3. For each removed entry, increment invalidation counter.
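
Step 2 maps directly onto `Vec::retain`; a simplified sketch (the removed-entry count is returned here instead of being added to the atomic invalidation counter):

```rust
// Simplified stand-in for `CacheEntry` (illustration only).
struct Entry {
    result_memory_ids: Vec<String>,
}

/// Drop every entry whose result set references `memory_id`.
/// Returns the number of entries invalidated.
fn invalidate_by_memory_id(entries: &mut Vec<Entry>, memory_id: &str) -> usize {
    let before = entries.len();
    entries.retain(|e| !e.result_memory_ids.iter().any(|id| id.as_str() == memory_id));
    before - entries.len()
}
```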

**Metrics snapshot type:**

```rust
/// A point-in-time snapshot of cache metrics.
#[derive(Debug, Clone)]
pub struct CacheMetricsSnapshot {
    pub hits: u64,
    pub misses: u64,
    pub evictions: u64,
    pub invalidations: u64,
    pub current_size: usize,
    /// Hit rate as a percentage (0.0-100.0); 0.0 if no lookups have been performed.
    pub hit_rate: f64,
}
```
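
The `hit_rate` field is derived from the counters at snapshot time; a sketch of the zero-guarded computation described in the doc comment:

```rust
/// Hit rate as a percentage (0.0-100.0); 0.0 when no lookups have occurred.
fn hit_rate(hits: u64, misses: u64) -> f64 {
    let total = hits + misses;
    if total == 0 {
        return 0.0; // Guard: avoid 0/0 when the cache has never been queried.
    }
    hits as f64 / total as f64 * 100.0
}
```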

### 3. gRPC Handler Wiring

**Update `services/memory/src/service.rs` — Add cache to `MemoryServiceImpl` and integrate into `query_memory`:**

```rust
pub struct MemoryServiceImpl {
    db: Arc<DuckDbManager>,
    embedding_client: Option<Arc<Mutex<EmbeddingClient>>>,
    extraction_client: Option<Arc<Mutex<ExtractionClient>>>,
    retrieval_config: RetrievalConfig,
    extraction_config: ExtractionConfig,
    cache: Arc<SemanticCache>,
}
```

Update `MemoryServiceImpl::new()` to accept `CacheConfig` and construct the `SemanticCache`:

```rust
pub fn new(
    db: Arc<DuckDbManager>,
    retrieval_config: RetrievalConfig,
    extraction_config: ExtractionConfig,
    cache_config: CacheConfig,
) -> Self {
    Self {
        db,
        embedding_client: None,
        extraction_client: None,
        retrieval_config,
        extraction_config,
        cache: Arc::new(SemanticCache::new(cache_config)),
    }
}
```

**Updated `query_memory` handler flow:**

1. Validate request (existing code).
2. Generate query embedding (existing code at lines 139-145).
3. **NEW — Cache lookup:** If the cache is enabled, call `self.cache.lookup(&query_embedding, tag_filter)`. If hit, stream cached results with `is_cached = true` and return immediately.
4. Run retrieval pipeline (existing code at lines 148-163).
5. Run extraction (existing code at lines 166-180).
6. **NEW — Cache population:** Build a `CacheEntry` from the retrieval + extraction results and insert it into the cache.
7. Stream results with `is_cached = false` (existing code).

```rust
// After generating query_embedding and before the pipeline:
let tag_filter_ref = if req.memory_type.is_empty() {
    None
} else {
    Some(req.memory_type.as_str())
};

if self.cache.config().enabled {
    if let Some(cached_results) = self.cache.lookup(&query_vector, tag_filter_ref).await {
        tracing::debug!(
            session_id = %ctx.session_id,
            query = %req.query,
            "Cache hit for query"
        );
        // Stream cached results
        let (tx, rx) = tokio::sync::mpsc::channel(cached_results.len().max(1));
        tokio::spawn(async move {
            for result in cached_results {
                let response = QueryMemoryResponse {
                    rank: result.rank,
                    entry: Some(result.entry),
                    cosine_similarity: result.cosine_similarity,
                    is_cached: true,
                    cached_extracted_segment: result.extracted_segment,
                    extraction_confidence: result.extraction_confidence,
                };
                if tx.send(Ok(response)).await.is_err() {
                    break;
                }
            }
        });
        return Ok(Response::new(ReceiverStream::new(rx)));
    }
}

// ... existing pipeline and extraction code ...

// After extraction, before streaming:
// Populate cache with results
if self.cache.config().enabled {
    let cached_results: Vec<CachedResult> = /* build from candidates + extraction_results */;
    let result_memory_ids: Vec<String> = candidates.iter()
        .map(|c| c.memory_id.clone())
        .collect();
    let cache_entry = CacheEntry {
        query_embedding: query_vector.clone(),
        query_text: req.query.clone(),
        tag_filter: tag_filter_ref.map(|s| s.to_string()),
        results: cached_results,
        result_memory_ids,
        created_at: Instant::now(),
    };
    self.cache.insert(cache_entry).await;
}
```

**Update `write_memory` handler (future-proofing):**

The `write_memory` handler is currently `Unimplemented`, but the invalidation hook should be documented and placed at the logical location. When `write_memory` is implemented, after successfully writing a memory, it must call:

```rust
self.cache.invalidate_by_memory_id(&memory_id).await;
```

For now, add a comment in the `write_memory` handler noting this requirement:

```rust
// TODO(#32): After write succeeds, call self.cache.invalidate_by_memory_id(&memory_id).await;
```

### 4. Service Integration

**Update `services/memory/src/main.rs` — Pass cache config:**

```rust
let cache_config = config.cache.clone();
let mut memory_service = MemoryServiceImpl::new(
    db,
    retrieval_config,
    extraction_config.clone(),
    cache_config,
);
```

**Metrics exposure:** The cache metrics are accessible via `self.cache.metrics()`. For now, metrics are logged periodically; a dedicated metrics endpoint or gRPC health check can be added in a future issue. Add periodic logging in the service startup:

```rust
// Log cache metrics every 60 seconds
let cache_ref = memory_service.cache().clone();
tokio::spawn(async move {
    let mut interval = tokio::time::interval(std::time::Duration::from_secs(60));
    loop {
        interval.tick().await;
        let m = cache_ref.metrics();
        tracing::info!(
            hits = m.hits,
            misses = m.misses,
            hit_rate = format!("{:.1}%", m.hit_rate),
            size = m.current_size,
            evictions = m.evictions,
            invalidations = m.invalidations,
            "Cache metrics"
        );
    }
});
```

**Error mapping:** The cache layer introduces no new error types that propagate to gRPC — cache misses simply fall through to the pipeline, and cache insertion failures are logged but do not block the response.

### 5. Tests

**Unit tests in `services/memory/src/cache/similarity.rs`:**

| Test Case | Description |
|---|---|
| `test_cosine_similarity_identical_vectors` | Two identical vectors return 1.0 |
| `test_cosine_similarity_orthogonal_vectors` | Two orthogonal vectors return 0.0 |
| `test_cosine_similarity_opposite_vectors` | Two opposite vectors return -1.0 |
| `test_cosine_similarity_zero_vector` | A zero vector returns 0.0 (no division by zero) |
| `test_cosine_similarity_different_magnitudes` | Vectors with same direction but different magnitudes return 1.0 |
| `test_cosine_similarity_known_value` | Known pair of vectors produces expected similarity |

**Unit tests in `services/memory/src/cache/mod.rs`:**

| Test Case | Description |
|---|---|
| `test_cache_new_creates_empty_cache` | New cache has 0 entries and 0 metrics |
| `test_cache_insert_and_lookup_hit` | Insert an entry, lookup with same embedding returns hit |
| `test_cache_lookup_miss_below_threshold` | Lookup with dissimilar embedding returns miss |
| `test_cache_lookup_miss_empty_cache` | Lookup on empty cache returns None |
| `test_cache_ttl_expiration` | Insert entry, wait past TTL, lookup returns None |
| `test_cache_invalidate_by_memory_id` | Insert entry with memory ID, invalidate, lookup returns None |
| `test_cache_invalidate_by_memory_id_partial` | Two entries, invalidate one memory ID, other entry survives |
| `test_cache_invalidate_all` | Insert entries, invalidate all, lookup returns None |
| `test_cache_max_entries_eviction` | Insert entries beyond max_entries, oldest is evicted |
| `test_cache_metrics_hit_count` | After hits, hit counter is incremented |
| `test_cache_metrics_miss_count` | After misses, miss counter is incremented |
| `test_cache_metrics_hit_rate` | After mix of hits and misses, hit rate is correct |
| `test_cache_metrics_eviction_count` | After eviction, eviction counter is incremented |
| `test_cache_metrics_invalidation_count` | After invalidation, invalidation counter is incremented |
| `test_cache_tag_filter_scoping` | Entry cached with tag "A", lookup with tag "B" misses |
| `test_cache_tag_filter_none_matches_none` | Entry cached without tag, lookup without tag hits |
| `test_cache_disabled_returns_none` | Cache with `enabled=false`, lookup always returns None |
| `test_cache_concurrent_read_write` | Spawn multiple readers and writers, verify no panics/deadlocks |

**Service-level tests in `services/memory/src/service.rs`:**

| Test Case | Description |
|---|---|
| `test_query_memory_cache_hit` | First query populates cache, second identical query returns `is_cached=true` |
| `test_query_memory_cache_miss_different_query` | Two dissimilar queries both return `is_cached=false` |
| `test_query_memory_cache_disabled` | Cache disabled in config, all queries return `is_cached=false` |

**Config tests in `services/memory/src/config.rs`:**

| Test Case | Description |
|---|---|
| `test_cache_config_defaults` | Default config has `enabled=true`, `similarity_threshold=0.95`, `ttl_secs=300`, `max_entries=1000` |
| `test_cache_config_from_toml` | Custom values loaded from TOML |
| `test_cache_config_uses_defaults_when_omitted` | Config without `[cache]` section uses defaults |

**Mocking strategy:**

- Use `DuckDbManager::in_memory()` for all DB operations.
- Use the existing mock Model Gateway server pattern from `services/memory/src/service.rs:469-713` for embedding and extraction clients.
- For cache-specific tests, construct `CacheEntry` directly without needing the full pipeline.
- For TTL tests, use a very short TTL (e.g., 1 second) and `tokio::time::sleep()`.

### Cargo Dependencies

No new crate dependencies required. All functionality is available via:

- `tokio` (async `RwLock`, `mpsc` channels)
- `std::sync::atomic` (lock-free metrics counters)
- `std::time::Instant` (TTL tracking)

### Trait Implementations

No new trait implementations required. The `SemanticCache` is a concrete struct used directly by the service layer.

### Error Types

No new error types required. Cache operations are non-fatal:

- Cache lookup miss: falls through to the pipeline.
- Cache insertion failure: logged as a warning; the response is still returned.
- Cache invalidation: best-effort, logged.

## Files to Create/Modify

| File | Action | Purpose |
|---|---|---|
| `services/memory/src/cache/mod.rs` | Create | `SemanticCache`, `CacheEntry`, `CachedResult`, `CacheMetrics`, `CacheMetricsSnapshot` — cache manager with lookup, insert, invalidation, eviction, and metrics |
| `services/memory/src/cache/similarity.rs` | Create | `cosine_similarity()` — pure-Rust cosine similarity for in-memory embedding comparison |
| `services/memory/src/config.rs` | Modify | Add `CacheConfig` struct with `enabled`, `similarity_threshold`, `ttl_secs`, `max_entries`; add `cache` field to `Config` |
| `services/memory/src/lib.rs` | Modify | Add `pub mod cache;` |
| `services/memory/src/service.rs` | Modify | Add `cache: Arc<SemanticCache>` to `MemoryServiceImpl`; update constructor to accept `CacheConfig`; integrate cache lookup before pipeline and cache population after pipeline in `query_memory`; add cache invalidation comment to `write_memory` |
| `services/memory/src/main.rs` | Modify | Pass `CacheConfig` to `MemoryServiceImpl::new()`; add periodic cache metrics logging task |

## Risks and Edge Cases

- **Cache key collision with different tag filters:** Two queries with the same text but different `memory_type` tag filters should not share cache entries. Mitigation: the cache lookup filters by `tag_filter` match in addition to embedding similarity. A cache entry is only a hit if both the embedding similarity threshold is met AND the tag filter matches exactly.
- **Similarity threshold tuning:** A threshold of 0.95 is aggressive — semantically similar but not identical queries may miss. A lower threshold (e.g., 0.90) increases hit rate but risks returning stale/irrelevant results. Mitigation: make the threshold configurable and start with 0.95 as the safe default.
- **Cache size and memory pressure:** Each cache entry stores the query embedding (768 floats = 3KB), the full `MemoryEntry` proto messages (variable size), and extraction results. With 1000 entries and average 5 results per entry, memory usage is roughly 1000 * (3KB + 5 * ~2KB) = ~13MB. This is acceptable for the target hardware. The `max_entries` cap prevents unbounded growth.
- **TTL granularity:** TTL is checked lazily during `lookup` and `insert`, not by a background sweeper. This means expired entries may linger until the next operation. For the expected query rate, this is acceptable. A background sweeper can be added if memory pressure becomes an issue.
- **Write-through invalidation for unimplemented `write_memory`:** The `write_memory` handler is currently `Unimplemented`. The invalidation hook is documented as a TODO comment. When `write_memory` is implemented (issue #34 or similar), the cache invalidation must be wired in. Risk: if forgotten, stale cache entries will be served. Mitigation: the TODO comment references issue #32 for traceability.
- **Concurrent access patterns:** The cache uses `tokio::sync::RwLock` which allows multiple concurrent readers (cache lookups) with exclusive writer access (inserts, invalidations). This is appropriate for a read-heavy workload (many queries, fewer writes). The `RwLock` will not be a bottleneck unless the cache is invalidated very frequently.
- **Embedding client required for cache:** The cache lookup requires a query embedding, which is generated by the embedding client. If no embedding client is configured, the cache cannot be used. This is already handled by the existing check that returns `failed_precondition` when no embedding client is present — the cache lookup code path is only reached after the embedding is successfully generated.
- **Cache coherence with extraction toggle:** If the first query runs with `skip_extraction=false` (extraction results cached) and a subsequent semantically similar query has `skip_extraction=true`, the cache hit will return extraction results even though the caller didn't want them. Mitigation: the caller can ignore the extraction fields; alternatively, the cache lookup could also match on `skip_extraction` flag. Start with the simpler approach (cache does not differentiate by extraction toggle) since extracted results are strictly more informative.
- **Linear scan performance:** The cache lookup iterates over all entries computing cosine similarity. For 1000 entries with 768-dim vectors, this is ~1000 * 768 multiply-adds = ~768K floating point ops, which completes in microseconds on modern hardware. This is negligible compared to the retrieval pipeline latency. No indexing needed at this scale.

## Deviation Log

| Deviation | Reason |
|---|---|
| Merged `feature/issue-31-extraction-step` into feature branch (fast-forward) | Issue #32 depends on #31 (extraction step), which is completed but not yet merged to `main`. The extraction types, client, and `ExtractionConfig` are required by the cache integration in `service.rs`. |
| `SemanticCache::metrics()` is `async` (acquires the read lock to get `current_size`) | The plan showed it as a sync method, but reading `entries.len()` requires the RwLock. Made async for correctness. |
| Used `f64` intermediates in the cosine similarity computation | The plan specified `f32` only. Using `f64` for dot product and magnitude accumulation avoids precision loss with large vectors; the result is cast back to `f32`. |