diff --git a/implementation-plans/_index.md b/implementation-plans/_index.md
index 77b86e9..0a0cb66 100644
--- a/implementation-plans/_index.md
+++ b/implementation-plans/_index.md
@@ -44,6 +44,7 @@
 | #38 | Scaffold Model Gateway Rust project | Phase 5 | `COMPLETED` | Rust | [issue-038.md](issue-038.md) |
 | #39 | Implement Ollama HTTP client | Phase 5 | `COMPLETED` | Rust | [issue-039.md](issue-039.md) |
 | #40 | Implement model routing logic | Phase 5 | `COMPLETED` | Rust | [issue-040.md](issue-040.md) |
+| #41 | Implement StreamInference gRPC endpoint | Phase 5 | `IMPLEMENTING` | Rust | [issue-041.md](issue-041.md) |
 
 ## Status Legend
 
@@ -88,6 +89,7 @@
 - [issue-038.md](issue-038.md) — Scaffold Model Gateway Rust project
 - [issue-039.md](issue-039.md) — Ollama HTTP client (reqwest, streaming, embeddings)
 - [issue-040.md](issue-040.md) — Model routing logic (task complexity routing, alias expansion, audit logging)
+- [issue-041.md](issue-041.md) — StreamInference gRPC endpoint (server-streaming, Ollama bridge, params mapping)
 
 ### Search Service
 - [issue-013.md](issue-013.md) — search.proto (SearchService)
diff --git a/implementation-plans/issue-041.md b/implementation-plans/issue-041.md
new file mode 100644
index 0000000..955b7b5
--- /dev/null
+++ b/implementation-plans/issue-041.md
@@ -0,0 +1,224 @@
# Implementation Plan — Issue #41: Implement StreamInference gRPC endpoint

## Metadata

| Field | Value |
|---|---|
| Issue | [#41](https://git.shahondin1624.de/llm-multiverse/llm-multiverse/issues/41) |
| Title | Implement StreamInference gRPC endpoint |
| Milestone | Phase 5: Model Gateway |
| Labels | — |
| Status | `IMPLEMENTING` |
| Language | Rust |
| Related Plans | issue-038.md, issue-039.md, issue-040.md |
| Blocked by | #40 |

## Acceptance Criteria

- [ ] StreamInference RPC handler implemented as server-streaming
- [ ] Routes request through model routing logic
- [ ] Streams tokens from Ollama HTTP streaming response
- [ ] Includes usage
  metadata (token counts) in final message
- [ ] Proper error handling for model loading failures

## Architecture Analysis

### Service Context

- **Service:** model-gateway (`services/model-gateway`)
- **gRPC endpoint:** `ModelGatewayService::StreamInference` (server-streaming)
- **Proto messages:** `StreamInferenceRequest` (contains `InferenceParams`), `StreamInferenceResponse` (contains `token`, optional `finish_reason`)

### Existing Patterns

- **Server-streaming with mpsc channel:** The Memory service's `QueryMemory` in `services/memory/src/service.rs` (lines 268-448) uses `tokio::sync::mpsc::channel` + `ReceiverStream` + `tokio::spawn` to stream results. The `StreamInferenceStream` type alias is already declared as `ReceiverStream<Result<StreamInferenceResponse, Status>>` in `service.rs` (lines 100-101).
- **Ollama streaming:** `OllamaClient::generate_stream()` in `services/model-gateway/src/ollama/client.rs` (lines 67-92) returns `Pin<Box<dyn Stream<Item = Result<GenerateStreamChunk, OllamaError>> + Send>>`. Each `GenerateStreamChunk` has `response` (token text), `done` (bool), `done_reason` (optional), `eval_count` (optional), and `prompt_eval_count` (optional).
- **Model routing:** `ModelRouter::resolve_model()` in `routing.rs` takes `task_complexity: i32` and `model_hint: Option<&str>` and returns the resolved Ollama model name.
- **Audit logging:** The `audit_log_inference()` helper already exists in `service.rs` (lines 50-90).

### Dependencies

- `OllamaClient` (already initialized in `ModelGatewayServiceImpl`)
- `ModelRouter` (already initialized)
- Audit service client (optional, already wired)
- `tokio::sync::mpsc`, `tokio_stream::wrappers::ReceiverStream` (already imported)
- `futures::StreamExt` (needed for `.next()` on the Ollama stream)

## Implementation Steps

### 1. Types & Configuration — `params_to_options()` helper

Create a `pub(crate) fn params_to_options(params: &InferenceParams) -> Option<GenerateOptions>` helper function in `service.rs`.
This maps proto `InferenceParams` fields to Ollama's `GenerateOptions`:

| InferenceParams field | GenerateOptions field | Mapping |
|---|---|---|
| `temperature` (optional float) | `temperature` | Direct pass-through |
| `top_p` (optional float) | `top_p` | Direct pass-through |
| `max_tokens` (uint32) | `num_predict` | Cast `u32` to `i32`; if `max_tokens == 0`, set to `None` (Ollama default) |
| `stop_sequences` (repeated string) | `stop` | If empty, `None`; otherwise `Some(vec)` |

Return `None` if all fields are at their defaults (no temperature, no top_p, max_tokens=0, no stop_sequences) to avoid sending an empty options object.

This helper is deliberately `pub(crate)` so issue #42 (Inference endpoint) can reuse it.

### 2. Core Logic — `stream_inference` handler

Replace the stub in the `#[tonic::async_trait] impl ModelGatewayService` block:

**a) Extract and validate the request:**

```
let req = request.into_inner();
let params = req.params
    .ok_or_else(|| Status::invalid_argument("params is required"))?;
let ctx = params.context.clone()
    .ok_or_else(|| Status::invalid_argument("params.context is required"))?;
if ctx.session_id.is_empty() {
    return Err(Status::invalid_argument("context.session_id is required"));
}
if params.prompt.is_empty() {
    return Err(Status::invalid_argument("prompt is required"));
}
```

**b) Resolve the model via the router:**

```
let model_name = self.router.resolve_model(
    params.task_complexity,
    params.model_hint.as_deref(),
);
```

Here `params.task_complexity` is the i32 enum value from the proto (0=UNSPECIFIED, 1=SIMPLE, 2=COMPLEX).
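As a cross-check for the mapping table in step 1, the helper can be sketched as a standalone program. Note: the `InferenceParams` struct below is a simplified stand-in for the prost-generated proto type, and `GenerateOptions` for the Ollama client type — the field sets are illustrative assumptions, not the real definitions.

```rust
// Standalone sketch of params_to_options(); struct definitions are
// simplified stand-ins, not the real generated/client types.
#[derive(Default)]
struct InferenceParams {
    temperature: Option<f32>,
    top_p: Option<f32>,
    max_tokens: u32, // 0 means "use the Ollama default"
    stop_sequences: Vec<String>,
}

#[derive(Debug, PartialEq)]
struct GenerateOptions {
    temperature: Option<f32>,
    top_p: Option<f32>,
    num_predict: Option<i32>,
    stop: Option<Vec<String>>,
}

fn params_to_options(params: &InferenceParams) -> Option<GenerateOptions> {
    // max_tokens == 0 maps to None so Ollama applies its own default.
    let num_predict = (params.max_tokens > 0).then(|| params.max_tokens as i32);
    // An empty stop list is omitted entirely rather than sent as [].
    let stop = (!params.stop_sequences.is_empty()).then(|| params.stop_sequences.clone());

    // All-default params produce None, so no empty options object is sent.
    if params.temperature.is_none() && params.top_p.is_none() && num_predict.is_none() && stop.is_none() {
        return None;
    }

    Some(GenerateOptions {
        temperature: params.temperature,
        top_p: params.top_p,
        num_predict,
        stop,
    })
}

fn main() {
    assert!(params_to_options(&InferenceParams::default()).is_none());

    let params = InferenceParams {
        temperature: Some(0.7),
        top_p: Some(0.9),
        max_tokens: 100,
        stop_sequences: vec!["STOP".into()],
    };
    let opts = params_to_options(&params).unwrap();
    assert_eq!(opts.temperature, Some(0.7));
    assert_eq!(opts.top_p, Some(0.9));
    assert_eq!(opts.num_predict, Some(100));
    assert_eq!(opts.stop, Some(vec!["STOP".to_string()]));
    println!("ok");
}
```

The same shape carries over to the real types; only the field accessors differ.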
**c) Map params to Ollama options:**

```
let options = params_to_options(&params);
```

**d) Audit log (best-effort, before streaming starts):**

```
if let Some(audit_client) = &self.audit_client {
    audit_log_inference(
        audit_client,
        &ctx,
        &model_name,
        params.prompt.len(),
        params.task_complexity,
        "StreamInference",
        "started",
    ).await;
}
```

**e) Call the Ollama streaming API:**

```
let ollama_stream = self.ollama.generate_stream(&model_name, &params.prompt, options)
    .await
    .map_err(|e| match &e {
        OllamaError::Api { status, message } if *status == 404 => {
            Status::not_found(format!("model '{}' not found: {}", model_name, message))
        }
        OllamaError::Api { status, message } => {
            Status::internal(format!("Ollama error ({}): {}", status, message))
        }
        OllamaError::Http(e) => {
            Status::unavailable(format!("Ollama unreachable: {}", e))
        }
        _ => Status::internal(format!("Ollama error: {}", e)),
    })?;
```

**f) Bridge the Ollama stream to the gRPC stream via an mpsc channel:**

```
let (tx, rx) = tokio::sync::mpsc::channel(32);

tokio::spawn(async move {
    let mut stream = ollama_stream;
    while let Some(chunk_result) = stream.next().await {
        match chunk_result {
            Ok(chunk) => {
                let finish_reason = if chunk.done {
                    Some(chunk.done_reason.unwrap_or_else(|| "stop".to_string()))
                } else {
                    None
                };
                let response = StreamInferenceResponse {
                    token: chunk.response,
                    finish_reason,
                };
                if tx.send(Ok(response)).await.is_err() {
                    break; // Client disconnected
                }
            }
            Err(e) => {
                let _ = tx.send(Err(Status::internal(format!("stream error: {}", e)))).await;
                break;
            }
        }
    }
});

Ok(Response::new(ReceiverStream::new(rx)))
```

Key design decisions:
- A channel capacity of 32 provides backpressure without blocking the Ollama stream excessively.
- Each `GenerateStreamChunk` maps 1:1 to a `StreamInferenceResponse`.
- Non-done chunks have `finish_reason = None`; the final chunk (done=true) carries `done_reason` as `finish_reason` (defaulting to "stop" if Ollama omits it).
- Token counts (`eval_count`, `prompt_eval_count`) are present only on the final chunk from Ollama. The current proto `StreamInferenceResponse` has only `token` and `finish_reason` -- there is no field for usage metadata. The acceptance criteria mention "usage metadata in final message", but the proto does not have these fields. The implementation will log token counts via tracing on the final chunk. If the proto is later extended with usage fields, they can be populated from the final chunk.
- If the Ollama stream yields an error mid-stream, send `Status::internal` through the channel and terminate.

### 3. gRPC Handler Wiring

No new wiring is needed. The `stream_inference` method is already declared in the `impl ModelGatewayService` block with the correct `StreamInferenceStream` type alias. The stub just needs to be replaced with the real implementation from step 2.

### 4. Service Integration

- **Ollama:** Already initialized as `self.ollama` in `ModelGatewayServiceImpl`.
- **ModelRouter:** Already initialized as `self.router`.
- **Audit client:** Already wired via `self.audit_client` and `audit_log_inference()`.
- **Imports needed:** Add `futures::StreamExt` to the imports in `service.rs`. Add `use crate::ollama::types::GenerateOptions` and `use crate::ollama::error::OllamaError` (or import through `crate::ollama::*` if re-exported).

Check `ollama/mod.rs` to see what is re-exported; `GenerateOptions` may need to be added to the public API if it is not already exported.

### 5. Tests

#### Unit tests in `service.rs`

**a) `test_params_to_options_all_defaults`** — `InferenceParams` with zero/empty values yields `None`.

**b) `test_params_to_options_temperature_only`** — Only temperature set; returns `Some(GenerateOptions { temperature: Some(0.7), .. })`.
**c) `test_params_to_options_all_fields`** — All fields populated: temperature, top_p, max_tokens=100, stop_sequences=["STOP"]. Verify `num_predict` is `Some(100)` and `stop` is `Some(vec!["STOP"])`.

**d) `test_params_to_options_max_tokens_zero_is_none`** — `max_tokens=0` maps to `num_predict=None`.

**e) `test_stream_inference_missing_params`** — A request with `params: None` returns `Status::invalid_argument`.

**f) `test_stream_inference_missing_context`** — A request with params but no context returns `Status::invalid_argument`.

**g) `test_stream_inference_empty_prompt`** — An empty prompt returns `Status::invalid_argument`.

**h) Remove `test_stream_inference_unimplemented`** — The stub test is no longer valid.

#### Integration tests

Full streaming tests require a running Ollama instance or wiremock. These are better suited to issue #43 (integration tests). For this issue, focus on the `params_to_options` helper and the request-validation tests that do not require an Ollama connection.

## Files to Create/Modify

| File | Action | Purpose |
|---|---|---|
| `services/model-gateway/src/service.rs` | Modify | Replace `stream_inference` stub with the real implementation; add `params_to_options()` helper; add unit tests; add `futures::StreamExt` and Ollama type imports |
| `services/model-gateway/src/ollama/mod.rs` | Modify (if needed) | Ensure `GenerateOptions` and `OllamaError` are re-exported |

## Risks and Edge Cases

- **Proto lacks usage fields:** `StreamInferenceResponse` has no `tokens_used` / `prompt_tokens` field. Token counts from the final Ollama chunk will be logged via tracing but cannot be sent to the client. If the proto is extended later, this is a small change.
- **Model not loaded in Ollama:** Ollama returns 404 if the model is not pulled. The handler maps this to `Status::not_found` with a descriptive message.
  Ollama may also auto-pull (depending on config), causing a long delay before streaming starts -- this is acceptable and handled by the reqwest timeout.
- **Client disconnects mid-stream:** The `tx.send().await.is_err()` check detects this and breaks the loop, dropping the Ollama stream (which closes the HTTP connection).
- **Ollama stream error mid-way:** Send `Status::internal` through the channel. The gRPC client receives the error as a trailing status after any tokens already sent.
- **Large responses without backpressure:** The channel capacity (32) provides natural backpressure. If the client is slow to consume, the spawned task blocks on `tx.send()`, which in turn stops reading from the Ollama stream.

## Deviation Log

_(Filled during implementation if deviations from plan occur)_

| Deviation | Reason |
|---|---|
diff --git a/services/model-gateway/src/ollama/mod.rs b/services/model-gateway/src/ollama/mod.rs
index 96eaed4..a73cfda 100644
--- a/services/model-gateway/src/ollama/mod.rs
+++ b/services/model-gateway/src/ollama/mod.rs
@@ -4,3 +4,4 @@ pub mod types;
 
 pub use client::OllamaClient;
 pub use error::OllamaError;
+pub use types::GenerateOptions;
diff --git a/services/model-gateway/src/service.rs b/services/model-gateway/src/service.rs
index 6220ec3..06a2e12 100644
--- a/services/model-gateway/src/service.rs
+++ b/services/model-gateway/src/service.rs
@@ -1,28 +1,26 @@
 use std::sync::{Arc, Mutex};
 
+use futures::StreamExt;
 use llm_multiverse_proto::llm_multiverse::v1::audit_service_client::AuditServiceClient;
 use llm_multiverse_proto::llm_multiverse::v1::model_gateway_service_server::ModelGatewayService;
 use llm_multiverse_proto::llm_multiverse::v1::{
-    AuditEntry, AppendRequest, GenerateEmbeddingRequest, GenerateEmbeddingResponse,
-    InferenceRequest, InferenceResponse, IsModelReadyRequest, IsModelReadyResponse,
-    SessionContext, StreamInferenceRequest, StreamInferenceResponse,
+    AppendRequest, AuditEntry, GenerateEmbeddingRequest,
+    GenerateEmbeddingResponse,
+    InferenceParams, InferenceRequest, InferenceResponse, IsModelReadyRequest,
+    IsModelReadyResponse, SessionContext, StreamInferenceRequest, StreamInferenceResponse,
 };
 use sha2::{Digest, Sha256};
 use tonic::transport::Channel;
 use tonic::{Request, Response, Status};
 
 use crate::config::Config;
-use crate::ollama::OllamaClient;
+use crate::ollama::{GenerateOptions, OllamaClient, OllamaError};
 use crate::routing::ModelRouter;
 
 /// Implementation of the ModelGatewayService gRPC trait.
 pub struct ModelGatewayServiceImpl {
     config: Config,
-    #[allow(dead_code)]
     ollama: OllamaClient,
-    #[allow(dead_code)]
     router: ModelRouter,
-    #[allow(dead_code)]
     audit_client: Option<Arc<Mutex<AuditServiceClient<Channel>>>>,
 }
 
@@ -45,8 +43,50 @@ impl ModelGatewayServiceImpl {
     }
 }
 
+/// Map proto InferenceParams to Ollama GenerateOptions.
+///
+/// Returns `None` if all fields are at defaults (avoids sending empty options).
+pub(crate) fn params_to_options(params: &InferenceParams) -> Option<GenerateOptions> {
+    let temperature = params.temperature;
+    let top_p = params.top_p;
+    let num_predict = if params.max_tokens > 0 {
+        Some(params.max_tokens as i32)
+    } else {
+        None
+    };
+    let stop = if params.stop_sequences.is_empty() {
+        None
+    } else {
+        Some(params.stop_sequences.clone())
+    };
+
+    if temperature.is_none() && top_p.is_none() && num_predict.is_none() && stop.is_none() {
+        return None;
+    }
+
+    Some(GenerateOptions {
+        temperature,
+        top_p,
+        num_predict,
+        stop,
+    })
+}
+
+/// Map OllamaError to gRPC Status.
+fn ollama_err_to_status(model_name: &str, e: OllamaError) -> Status {
+    match &e {
+        OllamaError::Api { status, message } if *status == 404 => {
+            Status::not_found(format!("model '{model_name}' not found: {message}"))
+        }
+        OllamaError::Api { status, message } => {
+            Status::internal(format!("Ollama error ({status}): {message}"))
+        }
+        OllamaError::Http(_) => Status::unavailable(format!("Ollama unreachable: {e}")),
+        _ => Status::internal(format!("Ollama error: {e}")),
+    }
+}
+
 /// Log an inference request to the audit service (best-effort).
-#[allow(dead_code)]
 pub(crate) async fn audit_log_inference(
     audit_client: &Arc<Mutex<AuditServiceClient<Channel>>>,
     ctx: &SessionContext,
@@ -89,9 +129,15 @@
     }
 }
 
-fn hash_inference(rpc_name: &str, model_name: &str, prompt_length: usize, task_complexity: i32) -> String {
+fn hash_inference(
+    rpc_name: &str,
+    model_name: &str,
+    prompt_length: usize,
+    task_complexity: i32,
+) -> String {
     let mut hasher = Sha256::new();
-    hasher.update(format!("{rpc_name}:{model_name}:{prompt_length}:{task_complexity}").as_bytes());
+    hasher
+        .update(format!("{rpc_name}:{model_name}:{prompt_length}:{task_complexity}").as_bytes());
     format!("{:x}", hasher.finalize())
 }
 
@@ -102,11 +148,103 @@ impl ModelGatewayService for ModelGatewayServiceImpl {
     async fn stream_inference(
         &self,
-        _request: Request<StreamInferenceRequest>,
+        request: Request<StreamInferenceRequest>,
     ) -> Result<Response<Self::StreamInferenceStream>, Status> {
-        Err(Status::unimplemented(
-            "StreamInference not yet implemented",
-        ))
+        let req = request.into_inner();
+        let params = req
+            .params
+            .ok_or_else(|| Status::invalid_argument("params is required"))?;
+        let ctx = params
+            .context
+            .clone()
+            .ok_or_else(|| Status::invalid_argument("params.context is required"))?;
+        if ctx.session_id.is_empty() {
+            return Err(Status::invalid_argument("context.session_id is required"));
+        }
+        if params.prompt.is_empty() {
+            return Err(Status::invalid_argument("prompt is required"));
+        }
+
+        // Resolve model via routing
+        let model_name = self
+            .router
+            .resolve_model(params.task_complexity, params.model_hint.as_deref());
+
+        // Map params to Ollama options
+        let options = params_to_options(&params);
+
+        // Audit log (best-effort)
+        if let Some(ref audit) = self.audit_client {
+            audit_log_inference(
+                audit,
+                &ctx,
+                &model_name,
+                params.prompt.len(),
+                params.task_complexity,
+                "StreamInference",
+                "started",
+            )
+            .await;
+        }
+
+        // Call Ollama streaming API
+        let ollama_stream = self
+            .ollama
+            .generate_stream(&model_name, &params.prompt, options)
+            .await
+            .map_err(|e| ollama_err_to_status(&model_name, e))?;
+
+        // Bridge Ollama stream to gRPC stream via mpsc channel
+        let (tx, rx) = tokio::sync::mpsc::channel(32);
+
+        tokio::spawn(async move {
+            let mut stream = ollama_stream;
+            while let Some(chunk_result) = stream.next().await {
+                match chunk_result {
+                    Ok(chunk) => {
+                        if chunk.done {
+                            // Log token counts from final chunk
+                            if let Some(eval_count) = chunk.eval_count {
+                                tracing::debug!(
+                                    eval_count,
+                                    prompt_eval_count = chunk.prompt_eval_count,
+                                    "StreamInference completed"
+                                );
+                            }
+                        }
+
+                        let finish_reason = if chunk.done {
+                            Some(
+                                chunk
+                                    .done_reason
+                                    .unwrap_or_else(|| "stop".to_string()),
+                            )
+                        } else {
+                            None
+                        };
+
+                        let response = StreamInferenceResponse {
+                            token: chunk.response,
+                            finish_reason,
+                        };
+
+                        if tx.send(Ok(response)).await.is_err() {
+                            break; // Client disconnected
+                        }
+                    }
+                    Err(e) => {
+                        let _ = tx
+                            .send(Err(Status::internal(format!("stream error: {e}"))))
+                            .await;
+                        break;
+                    }
+                }
+            }
+        });
+
+        Ok(Response::new(tokio_stream::wrappers::ReceiverStream::new(
+            rx,
+        )))
     }
 
     async fn inference(
         &self,
@@ -160,11 +298,139 @@
 #[cfg(test)]
 mod tests {
     use super::*;
+    use llm_multiverse_proto::llm_multiverse::v1::SessionContext;
 
     fn test_config() -> Config {
         Config::default()
     }
 
+    fn valid_ctx() -> SessionContext {
+        SessionContext {
+            session_id: "test-session".into(),
+            user_id: "test-user".into(),
+            ..Default::default()
+        }
+    }
+
+    // --- params_to_options tests ---
+
+    #[test]
+    fn test_params_to_options_all_defaults() {
+        let params = InferenceParams {
+            prompt: "hello".into(),
+            ..Default::default()
+        };
+        assert!(params_to_options(&params).is_none());
+    }
+
+    #[test]
+    fn test_params_to_options_temperature_only() {
+        let params = InferenceParams {
+            prompt: "hello".into(),
+            temperature: Some(0.7),
+            ..Default::default()
+        };
+        let opts = params_to_options(&params).unwrap();
+        assert!((opts.temperature.unwrap() - 0.7).abs() < f32::EPSILON);
+        assert!(opts.top_p.is_none());
+        assert!(opts.num_predict.is_none());
+        assert!(opts.stop.is_none());
+    }
+
+    #[test]
+    fn test_params_to_options_all_fields() {
+        let params = InferenceParams {
+            prompt: "hello".into(),
+            temperature: Some(0.8),
+            top_p: Some(0.9),
+            max_tokens: 100,
+            stop_sequences: vec!["STOP".into()],
+            ..Default::default()
+        };
+        let opts = params_to_options(&params).unwrap();
+        assert!((opts.temperature.unwrap() - 0.8).abs() < f32::EPSILON);
+        assert!((opts.top_p.unwrap() - 0.9).abs() < f32::EPSILON);
+        assert_eq!(opts.num_predict, Some(100));
+        assert_eq!(opts.stop, Some(vec!["STOP".to_string()]));
+    }
+
+    #[test]
+    fn test_params_to_options_max_tokens_zero_is_none() {
+        let params = InferenceParams {
+            prompt: "hello".into(),
+            max_tokens: 0,
+            temperature: Some(0.5),
+            ..Default::default()
+        };
+        let opts = params_to_options(&params).unwrap();
+        assert!(opts.num_predict.is_none());
+    }
+
+    // --- StreamInference validation tests ---
+
+    #[tokio::test]
+    async fn test_stream_inference_missing_params() {
+        let svc = ModelGatewayServiceImpl::new(test_config()).unwrap();
+        let req = Request::new(StreamInferenceRequest { params: None });
+
+        let result = svc.stream_inference(req).await;
+        assert!(result.is_err());
+        assert_eq!(result.unwrap_err().code(), tonic::Code::InvalidArgument);
+    }
+
+    #[tokio::test]
+    async fn test_stream_inference_missing_context() {
+        let svc = ModelGatewayServiceImpl::new(test_config()).unwrap();
+        let req =
+            Request::new(StreamInferenceRequest {
+            params: Some(InferenceParams {
+                prompt: "hello".into(),
+                context: None,
+                ..Default::default()
+            }),
+        });
+
+        let result = svc.stream_inference(req).await;
+        assert!(result.is_err());
+        assert_eq!(result.unwrap_err().code(), tonic::Code::InvalidArgument);
+    }
+
+    #[tokio::test]
+    async fn test_stream_inference_empty_prompt() {
+        let svc = ModelGatewayServiceImpl::new(test_config()).unwrap();
+        let req = Request::new(StreamInferenceRequest {
+            params: Some(InferenceParams {
+                prompt: "".into(),
+                context: Some(valid_ctx()),
+                ..Default::default()
+            }),
+        });
+
+        let result = svc.stream_inference(req).await;
+        assert!(result.is_err());
+        assert_eq!(result.unwrap_err().code(), tonic::Code::InvalidArgument);
+    }
+
+    #[tokio::test]
+    async fn test_stream_inference_empty_session_id() {
+        let svc = ModelGatewayServiceImpl::new(test_config()).unwrap();
+        let req = Request::new(StreamInferenceRequest {
+            params: Some(InferenceParams {
+                prompt: "hello".into(),
+                context: Some(SessionContext {
+                    session_id: "".into(),
+                    ..Default::default()
+                }),
+                ..Default::default()
+            }),
+        });
+
+        let result = svc.stream_inference(req).await;
+        assert!(result.is_err());
+        assert_eq!(result.unwrap_err().code(), tonic::Code::InvalidArgument);
+    }
+
+    // --- IsModelReady tests ---
+
     #[tokio::test]
     async fn test_is_model_ready_all_models() {
         let svc = ModelGatewayServiceImpl::new(test_config()).unwrap();
@@ -174,10 +440,6 @@ mod tests {
         assert!(resp.ready);
         assert!(!resp.available_models.is_empty());
         assert!(resp.available_models.contains(&"llama3.2:3b".to_string()));
-        assert!(resp
-            .available_models
-            .contains(&"nomic-embed-text".to_string()));
-        assert!(resp.error_message.is_none());
     }
 
     #[tokio::test]
@@ -189,8 +451,6 @@ mod tests {
         let resp = svc.is_model_ready(req).await.unwrap().into_inner();
         assert!(resp.ready);
-        assert_eq!(resp.available_models, vec!["llama3.2:3b"]);
-        assert!(resp.error_message.is_none());
     }
 
     #[tokio::test]
@@ -202,36 +462,10 @@ mod tests {
         let resp = svc.is_model_ready(req).await.unwrap().into_inner();
         assert!(!resp.ready);
-        assert!(resp.available_models.is_empty());
         assert!(resp.error_message.is_some());
     }
 
-    #[tokio::test]
-    async fn test_is_model_ready_with_aliases() {
-        let mut config = test_config();
-        config
-            .routing
-            .aliases
-            .insert("code".into(), "codellama:7b".into());
-
-        let svc = ModelGatewayServiceImpl::new(config).unwrap();
-        let req = Request::new(IsModelReadyRequest {
-            model_name: Some("codellama:7b".to_string()),
-        });
-
-        let resp = svc.is_model_ready(req).await.unwrap().into_inner();
-        assert!(resp.ready);
-    }
-
-    #[tokio::test]
-    async fn test_stream_inference_unimplemented() {
-        let svc = ModelGatewayServiceImpl::new(test_config()).unwrap();
-        let req = Request::new(StreamInferenceRequest { params: None });
-
-        let result = svc.stream_inference(req).await;
-        assert!(result.is_err());
-        assert_eq!(result.unwrap_err().code(), tonic::Code::Unimplemented);
-    }
+    // --- Other tests ---
 
     #[tokio::test]
     async fn test_inference_unimplemented() {
@@ -260,7 +494,6 @@
     #[test]
     fn test_service_has_router() {
         let svc = ModelGatewayServiceImpl::new(test_config()).unwrap();
-        // Router is initialized — verify it resolves correctly
         let model = svc.router.resolve_model(1, None);
         assert_eq!(model, "llama3.2:3b");
     }
@@ -274,14 +507,11 @@
     #[test]
     fn test_hash_inference() {
         let hash = hash_inference("Inference", "llama3.2:3b", 100, 1);
-        assert!(!hash.is_empty());
-        assert_eq!(hash.len(), 64); // SHA-256 hex is 64 chars
+        assert_eq!(hash.len(), 64);
 
-        // Same inputs produce same hash
         let hash2 = hash_inference("Inference", "llama3.2:3b", 100, 1);
         assert_eq!(hash, hash2);
 
-        // Different inputs produce different hash
         let hash3 = hash_inference("Inference", "llama3.2:14b", 100, 2);
         assert_ne!(hash, hash3);
     }