Implementation Plan — Issue #39: Implement Ollama HTTP client
Metadata
| Field | Value |
|---|---|
| Issue | #39 |
| Title | Implement Ollama HTTP client |
| Milestone | Phase 5: Model Gateway |
| Labels | |
| Status | COMPLETED |
| Language | Rust |
| Related Plans | issue-012.md, issue-038.md |
| Blocked by | #38 (completed) |
Acceptance Criteria
- Async HTTP client using reqwest
- Support for /api/generate (streaming and non-streaming)
- Support for /api/chat with message history
- Support for /api/embed (embeddings)
- Connection pooling and timeout configuration
- Error handling for Ollama-specific error responses
Architecture Analysis
Service Context
This issue belongs to the Model Gateway service (services/model-gateway/). The Ollama HTTP client is the core backend layer that the gRPC service handlers (service.rs) will call to fulfil Inference, StreamInference, GenerateEmbedding, and IsModelReady RPCs.
The client wraps the Ollama REST API (default http://localhost:11434) and exposes typed Rust methods that the gRPC handlers can call directly. The gRPC handlers (issue #40+) will translate proto request/response types to/from the Ollama client types defined here.
gRPC endpoints affected (consumers of this client):
- `Inference` — will call `OllamaClient::generate()` (non-streaming)
- `StreamInference` — will call `OllamaClient::generate_stream()` (streaming)
- `GenerateEmbedding` — will call `OllamaClient::embed()`
- `IsModelReady` — will call `OllamaClient::list_models()` to check actual Ollama availability
Proto messages involved:
- `InferenceParams` — carries prompt, model routing hints (`TaskComplexity`), temperature, top_p, max_tokens, stop_sequences
- `InferenceResponse` — text, finish_reason, tokens_used
- `StreamInferenceResponse` — token, finish_reason
- `GenerateEmbeddingRequest` — text, model
- `GenerateEmbeddingResponse` — embedding vector, dimensions
Existing Patterns
- Config: `services/model-gateway/src/config.rs` already defines `Config` with `ollama_url: String` (default `http://localhost:11434`) and `ModelRoutingConfig` for model name resolution.
- Service struct: `services/model-gateway/src/service.rs` defines `ModelGatewayServiceImpl` holding `Config`. The `OllamaClient` will be added here as a field.
- Error types: Other services use `thiserror` for module-level error enums (e.g., `DbError`, `EmbeddingError`, `ProvenanceError`). The model-gateway `Cargo.toml` already includes `thiserror = "2"`.
- Async runtime: `tokio` with `features = ["full"]` is already a dependency. `tokio-stream = "0.1"` is also present.
- Serde: `serde = { version = "1", features = ["derive"] }` is already a dependency for config deserialization.
Dependencies
- reqwest (new) — HTTP client with connection pooling, async support, JSON serialization, and streaming response bodies. Features needed: `json`, `stream`.
- futures (new) — For the `Stream` trait and stream combinators (`futures::Stream`, `futures::StreamExt`). Needed to expose streaming generate responses as a `Stream` type.
- serde_json (new) — For parsing newline-delimited JSON (NDJSON) from Ollama streaming responses. While `reqwest` can deserialize full JSON responses, streaming requires manual line-by-line parsing.
- No proto changes required — the Ollama client is an internal HTTP layer; the proto definitions are already complete from issue #12.
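These dependency additions might look like the following `Cargo.toml` fragment (version numbers are illustrative assumptions, not pinned requirements from this plan):

```toml
[dependencies]
reqwest = { version = "0.12", features = ["json", "stream"] }
futures = "0.3"
serde_json = "1"

[dev-dependencies]
wiremock = "0.6"
```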
Implementation Steps
1. Types & Configuration
Add Ollama-specific configuration to services/model-gateway/src/config.rs:
/// Configuration for the Ollama HTTP client.
#[derive(Debug, Clone, Deserialize)]
pub struct OllamaClientConfig {
/// Request timeout in seconds (default: 300 — generous for large model inference).
#[serde(default = "default_request_timeout_secs")]
pub request_timeout_secs: u64,
/// Connection timeout in seconds (default: 10).
#[serde(default = "default_connect_timeout_secs")]
pub connect_timeout_secs: u64,
/// Maximum idle connections in the pool (default: 10).
#[serde(default = "default_pool_max_idle")]
pub pool_max_idle: usize,
/// Idle connection timeout in seconds (default: 60).
#[serde(default = "default_pool_idle_timeout_secs")]
pub pool_idle_timeout_secs: u64,
}

fn default_request_timeout_secs() -> u64 { 300 }
fn default_connect_timeout_secs() -> u64 { 10 }
fn default_pool_max_idle() -> usize { 10 }
fn default_pool_idle_timeout_secs() -> u64 { 60 }

// `Default` is required by the `#[serde(default)]` on `Config.client` and by
// the planned `test_client_config_defaults` test.
impl Default for OllamaClientConfig {
    fn default() -> Self {
        Self {
            request_timeout_secs: default_request_timeout_secs(),
            connect_timeout_secs: default_connect_timeout_secs(),
            pool_max_idle: default_pool_max_idle(),
            pool_idle_timeout_secs: default_pool_idle_timeout_secs(),
        }
    }
}
Add #[serde(default)] pub client: OllamaClientConfig field to the existing Config struct.
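With the `#[serde(default)]` attributes above, an operator's config file only needs to override the values it cares about. A sketch of the `[client]` section (section name per the planned `Config.client` field; values are examples, not recommendations):

```toml
# All fields are optional; anything unspecified falls back to the serde defaults.
[client]
request_timeout_secs = 600   # allow longer inference runs on large models
pool_max_idle = 20
```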
Define Ollama API request/response types in services/model-gateway/src/ollama/types.rs:
These are serde structs matching the Ollama REST API JSON schema.
use serde::{Deserialize, Serialize};
// --- /api/generate ---
#[derive(Debug, Serialize)]
pub struct GenerateRequest {
pub model: String,
pub prompt: String,
pub stream: bool,
#[serde(skip_serializing_if = "Option::is_none")]
pub options: Option<GenerateOptions>,
}
#[derive(Debug, Serialize)]
pub struct GenerateOptions {
#[serde(skip_serializing_if = "Option::is_none")]
pub temperature: Option<f32>,
#[serde(skip_serializing_if = "Option::is_none")]
pub top_p: Option<f32>,
#[serde(skip_serializing_if = "Option::is_none")]
pub num_predict: Option<i32>,
#[serde(skip_serializing_if = "Option::is_none")]
pub stop: Option<Vec<String>>,
}
/// Full response from /api/generate with stream:false.
#[derive(Debug, Deserialize)]
pub struct GenerateResponse {
pub model: String,
pub response: String,
pub done: bool,
#[serde(default)]
pub done_reason: Option<String>,
#[serde(default)]
pub total_duration: Option<u64>,
#[serde(default)]
pub eval_count: Option<u32>,
#[serde(default)]
pub prompt_eval_count: Option<u32>,
}
/// Single chunk from /api/generate with stream:true (NDJSON).
#[derive(Debug, Deserialize)]
pub struct GenerateStreamChunk {
pub model: String,
pub response: String,
pub done: bool,
#[serde(default)]
pub done_reason: Option<String>,
#[serde(default)]
pub eval_count: Option<u32>,
#[serde(default)]
pub prompt_eval_count: Option<u32>,
}
// --- /api/chat ---
#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(rename_all = "lowercase")]
pub enum ChatRole {
System,
User,
Assistant,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ChatMessage {
pub role: ChatRole,
pub content: String,
}
#[derive(Debug, Serialize)]
pub struct ChatRequest {
pub model: String,
pub messages: Vec<ChatMessage>,
pub stream: bool,
#[serde(skip_serializing_if = "Option::is_none")]
pub options: Option<GenerateOptions>,
}
#[derive(Debug, Deserialize)]
pub struct ChatResponse {
pub model: String,
pub message: ChatMessage,
pub done: bool,
#[serde(default)]
pub done_reason: Option<String>,
#[serde(default)]
pub total_duration: Option<u64>,
#[serde(default)]
pub eval_count: Option<u32>,
#[serde(default)]
pub prompt_eval_count: Option<u32>,
}
// --- /api/embed ---
#[derive(Debug, Serialize)]
pub struct EmbedRequest {
pub model: String,
pub input: Vec<String>,
}
#[derive(Debug, Deserialize)]
pub struct EmbedResponse {
pub model: String,
pub embeddings: Vec<Vec<f32>>,
}
// --- /api/tags (list models) ---
#[derive(Debug, Deserialize)]
pub struct ListModelsResponse {
pub models: Vec<ModelInfo>,
}
#[derive(Debug, Deserialize)]
pub struct ModelInfo {
pub name: String,
pub model: String,
#[serde(default)]
pub size: u64,
#[serde(default)]
pub digest: Option<String>,
}
// --- /api/show (model details) ---
#[derive(Debug, Serialize)]
pub struct ShowModelRequest {
pub model: String,
}
#[derive(Debug, Deserialize)]
pub struct ShowModelResponse {
pub modelfile: Option<String>,
pub parameters: Option<String>,
pub template: Option<String>,
}
2. Core Logic
Create services/model-gateway/src/ollama/error.rs — Error types:
use thiserror::Error;
#[derive(Debug, Error)]
pub enum OllamaError {
/// HTTP-level error (connection refused, timeout, DNS, TLS, etc.).
#[error("HTTP error: {0}")]
Http(#[from] reqwest::Error),
/// Ollama returned a non-2xx status code.
#[error("Ollama API error (status {status}): {message}")]
Api {
status: u16,
message: String,
},
/// Failed to deserialize Ollama JSON response.
#[error("deserialization error: {0}")]
Deserialization(String),
/// Stream terminated unexpectedly without a done:true chunk.
#[error("stream ended unexpectedly")]
StreamIncomplete,
}
Create services/model-gateway/src/ollama/client.rs — OllamaClient:
use std::time::Duration;
use futures::Stream;
use reqwest::Client;
use crate::config::{Config, OllamaClientConfig};
use super::error::OllamaError;
use super::types::*;
/// Async HTTP client for the Ollama REST API.
///
/// Wraps `reqwest::Client` with connection pooling, timeouts, and
/// typed methods for each Ollama endpoint.
pub struct OllamaClient {
client: Client,
base_url: String,
}
impl OllamaClient {
/// Create a new client from the service configuration.
///
/// Configures connection pooling, timeouts, and the base URL
/// from `Config.ollama_url` and `Config.client`.
pub fn new(config: &Config) -> Result<Self, OllamaError> {
let client_config = &config.client;
let client = Client::builder()
.timeout(Duration::from_secs(client_config.request_timeout_secs))
.connect_timeout(Duration::from_secs(client_config.connect_timeout_secs))
.pool_max_idle_per_host(client_config.pool_max_idle)
.pool_idle_timeout(Duration::from_secs(client_config.pool_idle_timeout_secs))
.build()?;
let base_url = config.ollama_url.trim_end_matches('/').to_string();
Ok(Self { client, base_url })
}
/// POST /api/generate (non-streaming).
///
/// Sends a prompt to the specified model and returns the complete response.
pub async fn generate(
&self,
model: &str,
prompt: &str,
options: Option<GenerateOptions>,
) -> Result<GenerateResponse, OllamaError> {
let request = GenerateRequest {
model: model.to_string(),
prompt: prompt.to_string(),
stream: false,
options,
};
let resp = self.client
.post(format!("{}/api/generate", self.base_url))
.json(&request)
.send()
.await?;
self.handle_error_response(resp)
.await?
.json::<GenerateResponse>()
.await
.map_err(|e| OllamaError::Deserialization(e.to_string()))
}
/// POST /api/generate (streaming).
///
/// Returns a `Stream` of `GenerateStreamChunk` items. Each chunk
/// contains a partial token. The final chunk has `done: true`.
///
/// Ollama streams NDJSON (one JSON object per line). This method
/// reads the response body as a byte stream, splits on newlines,
/// and deserializes each line.
pub async fn generate_stream(
&self,
model: &str,
prompt: &str,
options: Option<GenerateOptions>,
) -> Result<
impl Stream<Item = Result<GenerateStreamChunk, OllamaError>>,
OllamaError,
> {
let request = GenerateRequest {
model: model.to_string(),
prompt: prompt.to_string(),
stream: true,
options,
};
let resp = self.client
.post(format!("{}/api/generate", self.base_url))
.json(&request)
.send()
.await?;
let resp = self.handle_error_response(resp).await?;
Ok(Self::ndjson_stream::<GenerateStreamChunk>(resp))
}
/// POST /api/chat (non-streaming).
///
/// Sends a chat conversation (message history) to the model.
pub async fn chat(
&self,
model: &str,
messages: Vec<ChatMessage>,
options: Option<GenerateOptions>,
) -> Result<ChatResponse, OllamaError> {
let request = ChatRequest {
model: model.to_string(),
messages,
stream: false,
options,
};
let resp = self.client
.post(format!("{}/api/chat", self.base_url))
.json(&request)
.send()
.await?;
self.handle_error_response(resp)
.await?
.json::<ChatResponse>()
.await
.map_err(|e| OllamaError::Deserialization(e.to_string()))
}
/// POST /api/embed.
///
/// Generates embedding vectors for the given input texts.
/// Returns one embedding vector per input string.
pub async fn embed(
&self,
model: &str,
input: Vec<String>,
) -> Result<EmbedResponse, OllamaError> {
let request = EmbedRequest {
model: model.to_string(),
input,
};
let resp = self.client
.post(format!("{}/api/embed", self.base_url))
.json(&request)
.send()
.await?;
self.handle_error_response(resp)
.await?
.json::<EmbedResponse>()
.await
.map_err(|e| OllamaError::Deserialization(e.to_string()))
}
/// GET /api/tags.
///
/// Lists all models available on the Ollama instance.
pub async fn list_models(&self) -> Result<ListModelsResponse, OllamaError> {
let resp = self.client
.get(format!("{}/api/tags", self.base_url))
.send()
.await?;
self.handle_error_response(resp)
.await?
.json::<ListModelsResponse>()
.await
.map_err(|e| OllamaError::Deserialization(e.to_string()))
}
/// Check if Ollama is reachable by hitting GET /api/tags.
/// Returns true if the request succeeds, false otherwise.
pub async fn is_healthy(&self) -> bool {
self.list_models().await.is_ok()
}
/// Parse NDJSON streaming response into a Stream of typed chunks.
///
/// Ollama streams responses as newline-delimited JSON. Each line
/// is a complete JSON object. This method uses `bytes_stream()`
/// from reqwest and buffers bytes until a newline is found,
/// then deserializes each complete line.
fn ndjson_stream<T: serde::de::DeserializeOwned>(
resp: reqwest::Response,
) -> impl Stream<Item = Result<T, OllamaError>> {
use futures::StreamExt;
let byte_stream = resp.bytes_stream();
let buffer: Vec<u8> = Vec::new();
futures::stream::unfold(
(byte_stream, buffer),
|(mut stream, mut buf)| async move {
// Implementation: accumulate bytes, split on \n,
// deserialize each complete line as T.
// Return None when stream ends.
// ...
},
)
}
/// Check response status and extract error message for non-2xx responses.
async fn handle_error_response(
&self,
resp: reqwest::Response,
) -> Result<reqwest::Response, OllamaError> {
if resp.status().is_success() {
return Ok(resp);
}
let status = resp.status().as_u16();
let message = resp
.text()
.await
.unwrap_or_else(|_| "unknown error".to_string());
Err(OllamaError::Api { status, message })
}
}
NDJSON stream implementation detail:
The ndjson_stream method will use reqwest::Response::bytes_stream() (requires the stream feature) and futures::stream::unfold to:
- Accumulate bytes from the HTTP response body into a buffer.
- On each newline boundary, extract the complete line.
- Deserialize the line as `T` using `serde_json::from_slice`.
- Yield `Ok(T)` or `Err(OllamaError::Deserialization(...))`.
- Return `None` when the byte stream is exhausted.
This approach handles partial JSON objects that span multiple TCP chunks correctly.
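The buffering step is independent of reqwest and serde, so it can be sketched (and unit-tested) with plain `std`. The helper name `drain_complete_lines` is illustrative; in the real `ndjson_stream`, each returned line would be fed to `serde_json::from_slice::<T>`:

```rust
/// Append a network chunk to `buf` and drain every complete
/// newline-terminated line, leaving any partial trailing line in `buf`.
/// (Illustrative helper, not the final implementation.)
fn drain_complete_lines(buf: &mut Vec<u8>, chunk: &[u8]) -> Vec<Vec<u8>> {
    buf.extend_from_slice(chunk);
    let mut lines = Vec::new();
    // Split off complete lines while a newline remains in the buffer.
    while let Some(pos) = buf.iter().position(|&b| b == b'\n') {
        let mut line: Vec<u8> = buf.drain(..=pos).collect();
        line.pop(); // drop the trailing '\n'
        if !line.is_empty() {
            lines.push(line);
        }
    }
    lines
}

fn main() {
    let mut buf = Vec::new();
    // One TCP chunk can carry a partial JSON object...
    assert!(drain_complete_lines(&mut buf, b"{\"response\":\"he").is_empty());
    // ...completed (plus a second whole object) by the next chunk.
    let lines = drain_complete_lines(&mut buf, b"llo\"}\n{\"done\":true}\n");
    assert_eq!(lines[0], b"{\"response\":\"hello\"}");
    assert_eq!(lines[1], b"{\"done\":true}");
    assert!(buf.is_empty());
}
```

Keeping this splitting logic in a small pure function also makes the NDJSON edge cases (partial line, multiple lines per chunk) testable without a mock HTTP server.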
3. gRPC Handler Wiring
This issue does not implement the gRPC handler wiring — that is deferred to subsequent issues. However, the OllamaClient must be integrated into ModelGatewayServiceImpl so that future handler implementations can use it.
Update services/model-gateway/src/service.rs:
Add OllamaClient as a field on ModelGatewayServiceImpl:
use crate::ollama::OllamaClient;
pub struct ModelGatewayServiceImpl {
config: Config,
ollama: OllamaClient,
}
impl ModelGatewayServiceImpl {
pub fn new(config: Config) -> anyhow::Result<Self> {
let ollama = OllamaClient::new(&config)?;
Ok(Self { config, ollama })
}
}
Note: The constructor changes from infallible to Result since reqwest::Client::builder().build() can fail. Update main.rs accordingly to use ?.
Update services/model-gateway/src/main.rs:
Change ModelGatewayServiceImpl::new(config) to ModelGatewayServiceImpl::new(config)?.
4. Service Integration
No cross-service integration is needed for this issue. The OllamaClient is a standalone HTTP client that talks to the local Ollama instance. Integration with gRPC handlers will happen in follow-up issues.
5. Tests
Unit tests for serde types in services/model-gateway/src/ollama/types.rs:
| Test Case | Description |
|---|---|
| `test_generate_request_serialization` | `GenerateRequest` serializes to expected JSON with `stream: false` |
| `test_generate_request_serialization_with_options` | Options fields are included when `Some`, omitted when `None` |
| `test_generate_response_deserialization` | Deserialize a complete Ollama generate response JSON |
| `test_generate_response_missing_optional_fields` | Optional fields default to `None` when absent |
| `test_generate_stream_chunk_deserialization` | Deserialize a streaming chunk (partial token, `done: false`) |
| `test_generate_stream_chunk_final` | Deserialize final chunk with `done: true` and `done_reason` |
| `test_chat_request_serialization` | `ChatRequest` with multiple messages serializes correctly |
| `test_chat_role_serialization` | `ChatRole` variants serialize as lowercase strings |
| `test_chat_response_deserialization` | Deserialize a complete chat response |
| `test_embed_request_serialization` | `EmbedRequest` with multiple inputs serializes correctly |
| `test_embed_response_deserialization` | Deserialize embedding response with vector data |
| `test_list_models_response_deserialization` | Deserialize model listing with multiple models |
| `test_model_info_optional_fields` | `ModelInfo` handles missing digest gracefully |
Unit tests for error handling in services/model-gateway/src/ollama/error.rs:
| Test Case | Description |
|---|---|
| `test_error_display_http` | `OllamaError::Http` formats with the reqwest message |
| `test_error_display_api` | `OllamaError::Api` includes status code and message |
| `test_error_display_deserialization` | `OllamaError::Deserialization` includes detail |
Integration-style tests for OllamaClient in services/model-gateway/src/ollama/client.rs:
Use a mock HTTP server (either mockito or wiremock) to simulate Ollama API responses:
| Test Case | Description |
|---|---|
| `test_generate_success` | Mock `/api/generate` returns valid JSON, verify parsed response |
| `test_generate_with_options` | Verify `temperature`, `top_p`, `num_predict`, `stop` are sent in request body |
| `test_generate_stream_success` | Mock returns NDJSON with 3 chunks + final, verify all chunks yielded |
| `test_generate_stream_empty_response` | Mock returns single `done: true` chunk |
| `test_chat_success` | Mock `/api/chat` returns valid response, verify message parsing |
| `test_chat_with_history` | Send multi-message conversation, verify all messages in request body |
| `test_embed_success` | Mock `/api/embed` returns embedding vectors, verify dimensions |
| `test_embed_multiple_inputs` | Send multiple texts, verify multiple embeddings returned |
| `test_list_models_success` | Mock `/api/tags` returns model list |
| `test_list_models_empty` | Mock returns empty model list |
| `test_is_healthy_success` | Mock `/api/tags` returns 200, `is_healthy()` returns true |
| `test_is_healthy_failure` | Mock returns 500, `is_healthy()` returns false |
| `test_api_error_404` | Mock returns 404 with error message, verify `OllamaError::Api` |
| `test_api_error_500` | Mock returns 500 with error body, verify error extraction |
| `test_connection_timeout` | Client configured with very short timeout, verify `OllamaError::Http` |
| `test_base_url_trailing_slash` | Config URL with trailing slash is normalized |
Mocking strategy:
Use wiremock crate as a dev-dependency. It provides a MockServer that binds to a random port, allowing parallel test execution without port conflicts. Each test creates its own MockServer, configures expected requests/responses, then creates an OllamaClient pointed at the mock server URL.
For streaming tests, the mock server returns a response body containing multiple NDJSON lines separated by \n.
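For instance, the mocked body for `test_generate_stream_success` might look like this (model name and token values are illustrative; the field names match `GenerateStreamChunk` above):

```json
{"model":"some-model","response":"Hel","done":false}
{"model":"some-model","response":"lo ","done":false}
{"model":"some-model","response":"world","done":false}
{"model":"some-model","response":"","done":true,"done_reason":"stop","eval_count":4}
```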
Configuration tests in services/model-gateway/src/config.rs:
| Test Case | Description |
|---|---|
| `test_client_config_defaults` | `OllamaClientConfig::default()` returns expected timeout/pool values |
| `test_client_config_from_toml` | Custom client config loads from TOML |
| `test_config_with_client_section` | Full `Config` with `[client]` section parses correctly |
Files to Create/Modify
| File | Action | Purpose |
|---|---|---|
| `services/model-gateway/Cargo.toml` | Modify | Add `reqwest` (with `json`, `stream` features), `futures`, `serde_json` dependencies; add `wiremock` dev-dependency |
| `services/model-gateway/src/config.rs` | Modify | Add `OllamaClientConfig` struct with timeout/pool settings; add `client` field to `Config` |
| `services/model-gateway/src/ollama/mod.rs` | Create | Module declaration; re-exports of `OllamaClient`, `OllamaError`, and types |
| `services/model-gateway/src/ollama/types.rs` | Create | Serde request/response structs for all Ollama API endpoints |
| `services/model-gateway/src/ollama/error.rs` | Create | `OllamaError` enum with `Http`, `Api`, `Deserialization`, `StreamIncomplete` variants |
| `services/model-gateway/src/ollama/client.rs` | Create | `OllamaClient` struct with generate, generate_stream, chat, embed, list_models, is_healthy methods and NDJSON stream parser |
| `services/model-gateway/src/lib.rs` | Modify | Add `pub mod ollama;` |
| `services/model-gateway/src/service.rs` | Modify | Add `OllamaClient` field to `ModelGatewayServiceImpl`; change constructor to return `Result` |
| `services/model-gateway/src/main.rs` | Modify | Update the `ModelGatewayServiceImpl::new(config)` call to handle `Result` with `?` |
Risks and Edge Cases
- Streaming NDJSON parsing: Ollama sends newline-delimited JSON. TCP chunks may not align with JSON object boundaries — a single chunk could contain a partial JSON line or multiple lines. The buffer-based `ndjson_stream` implementation must handle both cases. Mitigation: accumulate bytes until a `\n` is found, and only parse complete lines.
- Large model response times: Inference on large models (14B+) can take minutes. The default request timeout of 300 seconds should be sufficient, but it is configurable. Streaming mitigates perceived latency by yielding tokens incrementally.
- Ollama API version compatibility: The `/api/embed` endpoint (with an `input` array) was introduced in Ollama 0.1.44+. Older Ollama versions use `/api/embeddings` with a different request shape. Mitigation: target the newer API and document the minimum Ollama version requirement.
- Concurrency under load: If many concurrent gRPC requests hit the gateway simultaneously, reqwest opens additional connections as needed — `pool_max_idle` only caps how many idle connections are kept alive; it does not bound concurrency. The default of 10 idle connections is reasonable for a single-node setup. Consider adding a semaphore for true concurrency limiting in a future issue if needed.
- `wiremock` test isolation: Each test creates its own `MockServer` on a random port, so tests can run in parallel safely. However, `wiremock` adds to dev-dependency compile time.
- Constructor change breaks existing tests: Changing `ModelGatewayServiceImpl::new()` from infallible to `Result` will break existing tests in `service.rs`. Mitigation: update the test helper `test_config()` to also construct the `OllamaClient`, or keep a separate `new_with_client()` constructor that accepts a pre-built client, for testability and dependency injection.
- reqwest TLS: The default reqwest build pulls in `rustls` or `native-tls`. Since Ollama runs locally over plain HTTP, TLS is not strictly needed, and `default-features = false` with only the required features would reduce compile time and binary size. However, a user may run Ollama behind a TLS reverse proxy. Mitigation: keep the default features (which include TLS) for now; optimize later if compile time becomes a concern.
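If the leaner build is ever pursued, the TLS risk above could translate to a dependency line like the following (feature names taken from reqwest's documented feature set; treat this as a sketch, not a decision):

```toml
# Opt out of the default TLS backend; keep only the features this client
# uses, plus rustls for the Ollama-behind-a-TLS-reverse-proxy case.
reqwest = { version = "0.12", default-features = false, features = ["json", "stream", "rustls-tls"] }
```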
Deviation Log
(Filled during implementation if deviations from plan occur)
| Deviation | Reason |
|---|---|