llm-multiverse/implementation-plans/issue-040.md
2026-03-10 14:05:38 +01:00


Implementation Plan — Issue #40: Implement model routing logic

Metadata

| Field | Value |
|---|---|
| Issue | #40 |
| Title | Implement model routing logic |
| Milestone | Phase 5: Model Gateway |
| Labels | |
| Status | COMPLETED |
| Language | Rust / Protobuf |
| Related Plans | issue-038.md, issue-039.md |
| Blocked by | #39, #20 |

Acceptance Criteria

  • Routing table: task type → model name (e.g., "code" → qwen2.5-coder:14b)
  • Model hint override from request
  • Default model fallback
  • Configuration-driven routing (not hardcoded)
  • Audit logging of every inference request via Audit Service

Architecture Analysis

Service Context

  • Belongs to Model Gateway service (services/model-gateway/).
  • Affects the Inference, StreamInference, and GenerateEmbedding gRPC endpoints (currently stubbed as Unimplemented). This issue creates the routing logic that those endpoints will call; the endpoint implementations themselves are in #41/#42.
  • Proto messages involved: InferenceParams (has TaskComplexity), InferenceRequest, StreamInferenceRequest, GenerateEmbeddingRequest (has optional model field), AppendRequest/AuditEntry for audit logging.

Existing Patterns

  • Config: ModelRoutingConfig in services/model-gateway/src/config.rs already has default_model, simple_model, complex_model, embedding_model, and aliases: HashMap<String, String>. No new config fields are needed.
  • Proto: TaskComplexity enum has UNSPECIFIED, SIMPLE, COMPLEX. The InferenceParams message carries task_complexity but currently has no explicit model hint/override field.
  • Audit pattern: services/memory/src/service.rs wraps AuditServiceClient<Channel> in Arc<Mutex<...>>, attaches it via a with_audit_client() builder method, logs on a best-effort basis with tracing::warn! on failure, and uses AUDIT_ACTION_MEMORY_WRITE (int value 4). Model Gateway will use AUDIT_ACTION_INFERENCE_REQUEST (int value 7).
  • Service struct: ModelGatewayServiceImpl in service.rs uses constructor new(config) pattern. Builder methods (e.g., with_audit_client) will follow the memory service convention.

Dependencies

  • Audit Service: gRPC client (audit_service_client::AuditServiceClient) from the llm-multiverse-proto crate, already usable through the existing tonic dependency in Cargo.toml.
  • sha2 crate: needed to compute the params_hash field of AuditEntry, following the memory service pattern.
  • Proto change: Add optional string model_hint field to InferenceParams in model_gateway.proto so callers can explicitly request a model by name or alias. The GenerateEmbeddingRequest already has an optional string model field for this purpose.

Implementation Steps

1. Proto Update — Add model_hint to InferenceParams

Add an optional string model_hint field to the InferenceParams message in proto/llm_multiverse/v1/model_gateway.proto:

```protobuf
message InferenceParams {
  SessionContext context = 1;
  string prompt = 2;
  TaskComplexity task_complexity = 3;
  uint32 max_tokens = 4;
  optional float temperature = 5;
  optional float top_p = 6;
  repeated string stop_sequences = 7;
  // Explicit model name or alias override. If set, bypasses task_complexity routing.
  optional string model_hint = 8;
}
```

After editing the proto, regenerate Rust stubs (run the existing buf generate / build.rs flow).
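prost (the default code generator behind tonic-build) generates a proto3 `optional string` field as `Option<String>`. The sketch below uses a hand-written stand-in for the generated InferenceParams (only two fields shown) and a hypothetical `effective_hint()` helper to show how a handler could borrow the hint as `Option<&str>` and drop an explicit empty string before calling the router. The router is specified to treat `Some("")` like `None` anyway, so this is an optional convenience, not a requirement.

```rust
/// Hand-written stand-in for the prost-generated InferenceParams.
/// Only the routing-relevant fields are modeled here.
#[derive(Debug, Default, Clone)]
pub struct InferenceParams {
    pub task_complexity: i32,
    pub model_hint: Option<String>,
}

/// Hypothetical helper: borrow the hint as Option<&str> and treat an
/// explicitly-set empty string as "no hint".
pub fn effective_hint(params: &InferenceParams) -> Option<&str> {
    params
        .model_hint
        .as_deref()
        .filter(|hint| !hint.is_empty())
}
```

A handler would then call something like `router.resolve_model(params.task_complexity, effective_hint(&params))`.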

2. Core Logic — ModelRouter in routing.rs

Create services/model-gateway/src/routing.rs with a ModelRouter struct.

ModelRouter struct:

```rust
pub struct ModelRouter {
    config: ModelRoutingConfig,
}
```

ModelRouter::new() takes an owned ModelRoutingConfig; the service passes a clone of its routing configuration (config.routing.clone()).

resolve_model() method — the primary routing entry point:

```rust
pub fn resolve_model(
    &self,
    task_complexity: i32,       // proto enum as i32
    model_hint: Option<&str>,   // from InferenceParams.model_hint
) -> String
```

Resolution order:

  1. If model_hint is Some(hint) and non-empty → use hint as the candidate name.
  2. Else, map task_complexity to a config field:
    • TASK_COMPLEXITY_SIMPLE (1) → self.config.simple_model
    • TASK_COMPLEXITY_COMPLEX (2) → self.config.complex_model
    • TASK_COMPLEXITY_UNSPECIFIED (0) or any other value → self.config.default_model
  3. Apply alias expansion: if the candidate name exists as a key in self.config.aliases, replace it with the alias value.
  4. Return the resolved model name.
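The resolution order above can be sketched as follows. This is a minimal, std-only sketch, not the committed routing.rs: ModelRoutingConfig is a local stand-in using the field names listed in this plan (the real struct in config.rs may carry more), and the two constants mirror the TaskComplexity wire values 1 and 2.

```rust
use std::collections::HashMap;

/// Stand-in for ModelRoutingConfig from config.rs (field names from this plan).
#[derive(Debug, Clone, Default)]
pub struct ModelRoutingConfig {
    pub default_model: String,
    pub simple_model: String,
    pub complex_model: String,
    pub embedding_model: String,
    pub aliases: HashMap<String, String>,
}

// TaskComplexity values as they arrive on the wire (proto enum as i32).
const TASK_COMPLEXITY_SIMPLE: i32 = 1;
const TASK_COMPLEXITY_COMPLEX: i32 = 2;

pub struct ModelRouter {
    config: ModelRoutingConfig,
}

impl ModelRouter {
    pub fn new(config: ModelRoutingConfig) -> Self {
        Self { config }
    }

    pub fn resolve_model(&self, task_complexity: i32, model_hint: Option<&str>) -> String {
        let candidate = match model_hint {
            // 1. A non-empty hint bypasses complexity routing.
            Some(hint) if !hint.is_empty() => hint,
            // 2. Otherwise map complexity to a config field; UNSPECIFIED (0)
            //    and unknown values fall back to the default model.
            _ => match task_complexity {
                TASK_COMPLEXITY_SIMPLE => self.config.simple_model.as_str(),
                TASK_COMPLEXITY_COMPLEX => self.config.complex_model.as_str(),
                _ => self.config.default_model.as_str(),
            },
        };
        // 3-4. Expand one alias level and return an owned name.
        self.resolve_alias(candidate)
    }

    /// Single-level lookup: returns the alias target, or the name unchanged.
    fn resolve_alias(&self, name: &str) -> String {
        self.config
            .aliases
            .get(name)
            .cloned()
            .unwrap_or_else(|| name.to_string())
    }
}
```

With simple_model = "fast" and an alias "fast" → "phi3:mini", resolve_model(1, None) returns "phi3:mini", while resolve_model(2, Some("")) ignores the empty hint and returns complex_model. The model names here are example values only.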

resolve_embedding_model() method:

```rust
pub fn resolve_embedding_model(
    &self,
    model_override: Option<&str>,
) -> String
```

Resolution order:

  1. If model_override is Some(name) and non-empty → use name as candidate.
  2. Else → self.config.embedding_model.
  3. Apply alias expansion (same as above).
  4. Return resolved name.
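Embedding resolution follows the same pattern. To keep this sketch short and self-contained, it is written as a hypothetical free function that takes the two relevant config values directly instead of `&self`; inside ModelRouter the arguments would come from self.config.embedding_model and self.config.aliases.

```rust
use std::collections::HashMap;

/// Free-function sketch of ModelRouter::resolve_embedding_model().
/// `embedding_model` and `aliases` stand in for the matching config fields.
pub fn resolve_embedding_model(
    embedding_model: &str,
    aliases: &HashMap<String, String>,
    model_override: Option<&str>,
) -> String {
    // 1-2. A non-empty override (GenerateEmbeddingRequest.model) wins;
    //      otherwise use the configured embedding model.
    let candidate = model_override
        .filter(|name| !name.is_empty())
        .unwrap_or(embedding_model);
    // 3-4. Single-level alias expansion, same rule as resolve_model().
    aliases
        .get(candidate)
        .cloned()
        .unwrap_or_else(|| candidate.to_string())
}
```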

resolve_alias() private helper:

```rust
fn resolve_alias(&self, name: &str) -> String
```

Looks up name in self.config.aliases; returns the mapped value if found, otherwise returns name unchanged. Single-level resolution only (no recursive alias chains) to avoid cycles.
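A standalone version of the single-level lookup (a hypothetical counterpart to the private resolve_alias() method) makes the trade-off concrete. Because the map is consulted exactly once, a cycle such as "a" → "b", "b" → "a" terminates immediately; the flip side is that chains are not followed, so with "x" → "y" and "y" → "z", resolving "x" yields "y", not "z".

```rust
use std::collections::HashMap;

/// Exactly one map lookup: cycles cannot loop, and chains are not followed.
pub fn resolve_alias(aliases: &HashMap<String, String>, name: &str) -> String {
    match aliases.get(name) {
        Some(target) => target.clone(), // alias found: return its target as-is
        None => name.to_string(),       // not an alias: pass through unchanged
    }
}
```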

3. Audit Logging — audit_log_inference() helper

Add an audit_log_inference() free function in service.rs (or a separate audit.rs module), following the exact pattern from the memory service.

Function signature:

```rust
async fn audit_log_inference(
    audit_client: &Arc<Mutex<AuditServiceClient<Channel>>>,
    ctx: &SessionContext,
    model_name: &str,
    prompt_length: usize,
    task_complexity: i32,
    rpc_name: &str,        // "Inference", "StreamInference", or "GenerateEmbedding"
    result_status: &str,   // "success" or "failure"
)
```

AuditEntry construction:

  • action: 7 (= AUDIT_ACTION_INFERENCE_REQUEST)
  • tool_name: the rpc_name parameter
  • params_hash: SHA-256 of "{rpc_name}:{model_name}:{prompt_length}:{task_complexity}"
  • result_status: passed through
  • metadata: include {"model": model_name, "prompt_length": prompt_length.to_string(), "task_complexity": task_complexity.to_string()}
  • session_id, agent_id: extracted from SessionContext (same pattern as memory service)

Best-effort semantics: lock the shared client, call append(), and on Err(e) log with tracing::warn! instead of propagating the error. An audit failure must never fail the inference request.
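The entry fields and the best-effort rule can be sketched without tonic or sha2 as below. AuditFields is a hypothetical stand-in for the generated AuditEntry (session_id and agent_id are omitted), and params_hash_input is the pre-image the service would pass to `sha2::Sha256::digest` before encoding it the way the memory service does. `send_best_effort` models the `if let Err(e)` pattern synchronously, with `eprintln!` standing in for `tracing::warn!`.

```rust
use std::collections::HashMap;

/// AUDIT_ACTION_INFERENCE_REQUEST as stated in this plan.
pub const AUDIT_ACTION_INFERENCE_REQUEST: i32 = 7;

/// Hypothetical stand-in for the AuditEntry fields this plan sets.
#[derive(Debug)]
pub struct AuditFields {
    pub action: i32,
    pub tool_name: String,
    /// Pre-image that would be SHA-256 hashed into params_hash.
    pub params_hash_input: String,
    pub result_status: String,
    pub metadata: HashMap<String, String>,
}

/// Parameter order follows audit_log_inference() minus the client and context.
pub fn build_audit_fields(
    model_name: &str,
    prompt_length: usize,
    task_complexity: i32,
    rpc_name: &str,
    result_status: &str,
) -> AuditFields {
    let mut metadata = HashMap::new();
    metadata.insert("model".to_string(), model_name.to_string());
    metadata.insert("prompt_length".to_string(), prompt_length.to_string());
    metadata.insert("task_complexity".to_string(), task_complexity.to_string());

    AuditFields {
        action: AUDIT_ACTION_INFERENCE_REQUEST,
        tool_name: rpc_name.to_string(),
        params_hash_input: format!("{rpc_name}:{model_name}:{prompt_length}:{task_complexity}"),
        result_status: result_status.to_string(),
        metadata,
    }
}

/// Best-effort delivery: errors are logged and swallowed so the inference
/// request continues. Returns whether the send succeeded.
pub fn send_best_effort<F>(send: F) -> bool
where
    F: FnOnce() -> Result<(), String>,
{
    if let Err(e) = send() {
        // Stand-in for tracing::warn! so the sketch needs no external crates.
        eprintln!("audit log failed (continuing): {e}");
        return false;
    }
    true
}
```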

4. Service Integration — Wire into ModelGatewayServiceImpl

Add fields to ModelGatewayServiceImpl:

```rust
pub struct ModelGatewayServiceImpl {
    config: Config,
    ollama: OllamaClient,
    router: ModelRouter,
    audit_client: Option<Arc<Mutex<AuditServiceClient<Channel>>>>,
}
```

Update new() constructor:

```rust
pub fn new(config: Config) -> Result<Self, anyhow::Error> {
    let ollama = OllamaClient::new(&config)?;
    let router = ModelRouter::new(config.routing.clone());
    Ok(Self { config, ollama, router, audit_client: None })
}
```

Add builder method:

```rust
pub fn with_audit_client(mut self, client: AuditServiceClient<Channel>) -> Self {
    self.audit_client = Some(Arc::new(Mutex::new(client)));
    self
}
```

Update main.rs: If config.audit_addr is Some(addr), connect the AuditServiceClient and attach it via with_audit_client(). Use the same pattern as the memory service's main.
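The builder and the conditional wiring in main.rs can be sketched without tonic as below. StubAuditClient and build_service() are hypothetical placeholders; in the real main.rs the `Some(addr)` arm is where the tonic-generated `AuditServiceClient::connect(addr).await` would run. std::sync::Mutex is used only so the sketch has no dependencies; if the service holds the lock across an `.await` (as in `lock().await.append(...).await`), tokio::sync::Mutex is the usual choice.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for AuditServiceClient<Channel>.
#[derive(Debug)]
pub struct StubAuditClient {
    pub addr: String,
}

/// Only the audit-related field of ModelGatewayServiceImpl is modeled.
#[derive(Default)]
pub struct ModelGatewayServiceImpl {
    audit_client: Option<Arc<Mutex<StubAuditClient>>>,
}

impl ModelGatewayServiceImpl {
    /// Auditing is off by default.
    pub fn new() -> Self {
        Self::default()
    }

    /// Builder: consumes and returns `self` so it chains off `new()`.
    pub fn with_audit_client(mut self, client: StubAuditClient) -> Self {
        self.audit_client = Some(Arc::new(Mutex::new(client)));
        self
    }

    pub fn has_audit_client(&self) -> bool {
        self.audit_client.is_some()
    }
}

/// Mirrors the main.rs step: attach a client only when audit_addr is set.
pub fn build_service(audit_addr: Option<&str>) -> ModelGatewayServiceImpl {
    let service = ModelGatewayServiceImpl::new();
    match audit_addr {
        Some(addr) => service.with_audit_client(StubAuditClient { addr: addr.to_string() }),
        None => service,
    }
}
```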

Add sha2 dependency to services/model-gateway/Cargo.toml:

```toml
sha2 = "0.10"
```

Update lib.rs to expose the new module:

```rust
pub mod routing;
```

5. Tests

Unit tests in routing.rs (#[cfg(test)] mod tests)

| Test | Description |
|---|---|
| test_resolve_simple_task | TaskComplexity::SIMPLE (1) with no hint → simple_model |
| test_resolve_complex_task | TaskComplexity::COMPLEX (2) with no hint → complex_model |
| test_resolve_unspecified_task | TaskComplexity::UNSPECIFIED (0) with no hint → default_model |
| test_resolve_unknown_value | Unknown integer (99) with no hint → default_model |
| test_model_hint_overrides_complexity | model_hint = Some("custom:7b") ignores task_complexity |
| test_model_hint_empty_string_ignored | model_hint = Some("") falls through to complexity routing |
| test_alias_expansion | Alias "code" → "codellama:7b"; resolve with hint "code" → "codellama:7b" |
| test_alias_expansion_via_complexity | simple_model = "fast", alias "fast" → "phi3:mini"; resolve SIMPLE → "phi3:mini" |
| test_no_alias_passthrough | Model name not in aliases → returned unchanged |
| test_resolve_embedding_default | No override → embedding_model |
| test_resolve_embedding_override | Override "custom-embed" → "custom-embed" (with alias check) |
| test_resolve_embedding_alias | Override matches alias key → expanded |

Unit tests in service.rs (extend existing test module)

| Test | Description |
|---|---|
| test_service_has_router | Verify ModelGatewayServiceImpl::new() initializes the router field |
| test_service_works_without_audit_client | Confirm audit_client is None by default and existing endpoints still work |
| test_with_audit_client_builder | Verify with_audit_client() sets the field to Some(...) |

No integration tests are needed in this issue — those belong to #41/#42 which implement the actual gRPC endpoints that call the router and audit functions.

Files to Create/Modify

| File | Action | Purpose |
|---|---|---|
| proto/llm_multiverse/v1/model_gateway.proto | Modify | Add optional string model_hint = 8 to InferenceParams |
| services/model-gateway/src/routing.rs | Create | ModelRouter struct with resolve_model(), resolve_embedding_model(), alias expansion |
| services/model-gateway/src/lib.rs | Modify | Add pub mod routing; |
| services/model-gateway/src/service.rs | Modify | Add router: ModelRouter and audit_client fields, with_audit_client() builder, audit_log_inference() helper |
| services/model-gateway/src/main.rs | Modify | Connect AuditServiceClient when audit_addr is configured, attach via builder |
| services/model-gateway/Cargo.toml | Modify | Add sha2 = "0.10" dependency |

Risks and Edge Cases

  • Proto regeneration: Adding field 8 to InferenceParams is backward compatible (optional field, no renumbering). The Rust stubs must be regenerated before the routing code compiles. If the build.rs / buf generate step is not run, compilation will fail with a missing field error.
  • Alias cycles: If aliases contain A → B and B → A, single-level resolution prevents infinite loops. This is the intended design — only one alias lookup is performed.
  • Empty model_hint string: prost maps optional string to Option<String>, so a caller that explicitly sets model_hint to "" produces Some("") rather than None. The router must treat an empty string the same as None and fall through to complexity routing.
  • Audit client unavailability: If the Audit Service is down, the tracing::warn! will fire but inference requests will proceed normally. This matches the memory service behavior.
  • Concurrent audit writes: Arc<Mutex<AuditServiceClient>> serializes audit calls per service instance. Under high load, audit logging could become a bottleneck. This is acceptable for the current single-instance design and consistent with the memory service pattern.

Deviation Log

(Filled during implementation if deviations from plan occur)

| Deviation | Reason |
|---|---|