llm-multiverse/implementation-plans/issue-039.md
2026-03-10 13:57:40 +01:00

Implementation Plan — Issue #39: Implement Ollama HTTP client

Metadata

Issue: #39
Title: Implement Ollama HTTP client
Milestone: Phase 5: Model Gateway
Labels:
Status: COMPLETED
Language: Rust
Related Plans: issue-012.md, issue-038.md
Blocked by: #38 (completed)

Acceptance Criteria

  • Async HTTP client using reqwest
  • Support for /api/generate (streaming and non-streaming)
  • Support for /api/chat with message history
  • Support for /api/embed (embeddings)
  • Connection pooling and timeout configuration
  • Error handling for Ollama-specific error responses

Architecture Analysis

Service Context

This issue belongs to the Model Gateway service (services/model-gateway/). The Ollama HTTP client is the core backend layer that the gRPC service handlers (service.rs) will call to fulfil Inference, StreamInference, GenerateEmbedding, and IsModelReady RPCs.

The client wraps the Ollama REST API (default http://localhost:11434) and exposes typed Rust methods that the gRPC handlers can call directly. The gRPC handlers (issue #40+) will translate proto request/response types to/from the Ollama client types defined here.

gRPC endpoints affected (consumers of this client):

  • Inference — will call OllamaClient::generate() (non-streaming)
  • StreamInference — will call OllamaClient::generate_stream() (streaming)
  • GenerateEmbedding — will call OllamaClient::embed()
  • IsModelReady — will call OllamaClient::list_models() to check actual Ollama availability

Proto messages involved:

  • InferenceParams — carries prompt, model routing hints (TaskComplexity), temperature, top_p, max_tokens, stop_sequences
  • InferenceResponse — text, finish_reason, tokens_used
  • StreamInferenceResponse — token, finish_reason
  • GenerateEmbeddingRequest — text, model
  • GenerateEmbeddingResponse — embedding vector, dimensions

Existing Patterns

  • Config: services/model-gateway/src/config.rs already defines Config with ollama_url: String (default http://localhost:11434) and ModelRoutingConfig for model name resolution.
  • Service struct: services/model-gateway/src/service.rs defines ModelGatewayServiceImpl holding Config. The OllamaClient will be added here as a field.
  • Error types: Other services use thiserror for module-level error enums (e.g., DbError, EmbeddingError, ProvenanceError). The model-gateway Cargo.toml already includes thiserror = "2".
  • Async runtime: tokio with features = ["full"] is already a dependency. tokio-stream = "0.1" is also present.
  • Serde: serde = { version = "1", features = ["derive"] } is already a dependency for config deserialization.

Dependencies

  • reqwest (new) — HTTP client with connection pooling, async support, JSON serialization, and streaming response bodies. Features needed: json, stream.
  • futures (new) — For Stream trait and stream combinators (futures::Stream, futures::StreamExt). Needed to expose streaming generate responses as a Stream type.
  • serde_json (new) — For parsing newline-delimited JSON (NDJSON) from Ollama streaming responses. While reqwest can deserialize full JSON responses, streaming requires manual line-by-line parsing.
  • No proto changes required — the Ollama client is an internal HTTP layer; the proto definitions are already complete from issue #12.
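
For reference, the Cargo.toml additions might look like the following (version numbers are illustrative — the plan does not pin versions; wiremock is the dev-dependency suggested in the test section below):

```toml
[dependencies]
reqwest = { version = "0.12", features = ["json", "stream"] }
futures = "0.3"
serde_json = "1"

[dev-dependencies]
wiremock = "0.6"
```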

Implementation Steps

1. Types & Configuration

Add Ollama-specific configuration to services/model-gateway/src/config.rs:

/// Configuration for the Ollama HTTP client.
#[derive(Debug, Clone, Deserialize)]
pub struct OllamaClientConfig {
    /// Request timeout in seconds (default: 300 — generous for large model inference).
    #[serde(default = "default_request_timeout_secs")]
    pub request_timeout_secs: u64,

    /// Connection timeout in seconds (default: 10).
    #[serde(default = "default_connect_timeout_secs")]
    pub connect_timeout_secs: u64,

    /// Maximum idle connections in the pool (default: 10).
    #[serde(default = "default_pool_max_idle")]
    pub pool_max_idle: usize,

    /// Idle connection timeout in seconds (default: 60).
    #[serde(default = "default_pool_idle_timeout_secs")]
    pub pool_idle_timeout_secs: u64,
}

Add #[serde(default)] pub client: OllamaClientConfig field to the existing Config struct.
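The #[serde(default = "...")] attributes above reference helper functions, and the #[serde(default)] on the client field requires a Default impl. A minimal sketch of both follows; the struct is repeated here without its serde derives so the snippet compiles standalone, and the default values match the doc comments above:

```rust
// Sketch: default helpers referenced by the `#[serde(default = "...")]`
// attributes, plus the `Default` impl needed by `#[serde(default)] pub client`.
pub struct OllamaClientConfig {
    pub request_timeout_secs: u64,
    pub connect_timeout_secs: u64,
    pub pool_max_idle: usize,
    pub pool_idle_timeout_secs: u64,
}

fn default_request_timeout_secs() -> u64 { 300 }
fn default_connect_timeout_secs() -> u64 { 10 }
fn default_pool_max_idle() -> usize { 10 }
fn default_pool_idle_timeout_secs() -> u64 { 60 }

impl Default for OllamaClientConfig {
    fn default() -> Self {
        Self {
            request_timeout_secs: default_request_timeout_secs(),
            connect_timeout_secs: default_connect_timeout_secs(),
            pool_max_idle: default_pool_max_idle(),
            pool_idle_timeout_secs: default_pool_idle_timeout_secs(),
        }
    }
}

fn main() {
    let cfg = OllamaClientConfig::default();
    assert_eq!(cfg.request_timeout_secs, 300);
    assert_eq!(cfg.pool_max_idle, 10);
    println!("defaults ok");
}
```

Reusing the same helper functions in the Default impl keeps the serde-path defaults and OllamaClientConfig::default() in sync, which the config tests below rely on.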

Define Ollama API request/response types in services/model-gateway/src/ollama/types.rs:

These are serde structs matching the Ollama REST API JSON schema.

use serde::{Deserialize, Serialize};

// --- /api/generate ---

#[derive(Debug, Serialize)]
pub struct GenerateRequest {
    pub model: String,
    pub prompt: String,
    pub stream: bool,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub options: Option<GenerateOptions>,
}

#[derive(Debug, Serialize)]
pub struct GenerateOptions {
    #[serde(skip_serializing_if = "Option::is_none")]
    pub temperature: Option<f32>,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub top_p: Option<f32>,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub num_predict: Option<i32>,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub stop: Option<Vec<String>>,
}

/// Full response from /api/generate with stream:false.
#[derive(Debug, Deserialize)]
pub struct GenerateResponse {
    pub model: String,
    pub response: String,
    pub done: bool,
    #[serde(default)]
    pub done_reason: Option<String>,
    #[serde(default)]
    pub total_duration: Option<u64>,
    #[serde(default)]
    pub eval_count: Option<u32>,
    #[serde(default)]
    pub prompt_eval_count: Option<u32>,
}

/// Single chunk from /api/generate with stream:true (NDJSON).
#[derive(Debug, Deserialize)]
pub struct GenerateStreamChunk {
    pub model: String,
    pub response: String,
    pub done: bool,
    #[serde(default)]
    pub done_reason: Option<String>,
    #[serde(default)]
    pub eval_count: Option<u32>,
    #[serde(default)]
    pub prompt_eval_count: Option<u32>,
}

// --- /api/chat ---

#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(rename_all = "lowercase")]
pub enum ChatRole {
    System,
    User,
    Assistant,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ChatMessage {
    pub role: ChatRole,
    pub content: String,
}

#[derive(Debug, Serialize)]
pub struct ChatRequest {
    pub model: String,
    pub messages: Vec<ChatMessage>,
    pub stream: bool,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub options: Option<GenerateOptions>,
}

#[derive(Debug, Deserialize)]
pub struct ChatResponse {
    pub model: String,
    pub message: ChatMessage,
    pub done: bool,
    #[serde(default)]
    pub done_reason: Option<String>,
    #[serde(default)]
    pub total_duration: Option<u64>,
    #[serde(default)]
    pub eval_count: Option<u32>,
    #[serde(default)]
    pub prompt_eval_count: Option<u32>,
}

// --- /api/embed ---

#[derive(Debug, Serialize)]
pub struct EmbedRequest {
    pub model: String,
    pub input: Vec<String>,
}

#[derive(Debug, Deserialize)]
pub struct EmbedResponse {
    pub model: String,
    pub embeddings: Vec<Vec<f32>>,
}

// --- /api/tags (list models) ---

#[derive(Debug, Deserialize)]
pub struct ListModelsResponse {
    pub models: Vec<ModelInfo>,
}

#[derive(Debug, Deserialize)]
pub struct ModelInfo {
    pub name: String,
    pub model: String,
    #[serde(default)]
    pub size: u64,
    #[serde(default)]
    pub digest: Option<String>,
}

// --- /api/show (model details) ---

#[derive(Debug, Serialize)]
pub struct ShowModelRequest {
    pub model: String,
}

#[derive(Debug, Deserialize)]
pub struct ShowModelResponse {
    pub modelfile: Option<String>,
    pub parameters: Option<String>,
    pub template: Option<String>,
}
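
As a concrete reference for the structs above, a non-streaming /api/generate response body looks roughly like this (field values are illustrative, based on the public Ollama API docs):

```json
{
  "model": "llama3.1:8b",
  "response": "Hello! How can I help?",
  "done": true,
  "done_reason": "stop",
  "total_duration": 4935886791,
  "prompt_eval_count": 26,
  "eval_count": 282
}
```

With stream:true the same fields arrive as one JSON object per line, with "done": false and a partial "response" token on every chunk until the final one.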

2. Core Logic

Create services/model-gateway/src/ollama/error.rs — Error types:

use thiserror::Error;

#[derive(Debug, Error)]
pub enum OllamaError {
    /// HTTP-level error (connection refused, timeout, DNS, TLS, etc.).
    #[error("HTTP error: {0}")]
    Http(#[from] reqwest::Error),

    /// Ollama returned a non-2xx status code.
    #[error("Ollama API error (status {status}): {message}")]
    Api {
        status: u16,
        message: String,
    },

    /// Failed to deserialize Ollama JSON response.
    #[error("deserialization error: {0}")]
    Deserialization(String),

    /// Stream terminated unexpectedly without a done:true chunk.
    #[error("stream ended unexpectedly")]
    StreamIncomplete,
}

Create services/model-gateway/src/ollama/client.rs — OllamaClient:

use std::time::Duration;
use futures::Stream;
use reqwest::Client;

use crate::config::{Config, OllamaClientConfig};
use super::error::OllamaError;
use super::types::*;

/// Async HTTP client for the Ollama REST API.
///
/// Wraps `reqwest::Client` with connection pooling, timeouts, and
/// typed methods for each Ollama endpoint.
pub struct OllamaClient {
    client: Client,
    base_url: String,
}

impl OllamaClient {
    /// Create a new client from the service configuration.
    ///
    /// Configures connection pooling, timeouts, and the base URL
    /// from `Config.ollama_url` and `Config.client`.
    pub fn new(config: &Config) -> Result<Self, OllamaError> {
        let client_config = &config.client;
        let client = Client::builder()
            .timeout(Duration::from_secs(client_config.request_timeout_secs))
            .connect_timeout(Duration::from_secs(client_config.connect_timeout_secs))
            .pool_max_idle_per_host(client_config.pool_max_idle)
            .pool_idle_timeout(Duration::from_secs(client_config.pool_idle_timeout_secs))
            .build()?;

        let base_url = config.ollama_url.trim_end_matches('/').to_string();

        Ok(Self { client, base_url })
    }

    /// POST /api/generate (non-streaming).
    ///
    /// Sends a prompt to the specified model and returns the complete response.
    pub async fn generate(
        &self,
        model: &str,
        prompt: &str,
        options: Option<GenerateOptions>,
    ) -> Result<GenerateResponse, OllamaError> {
        let request = GenerateRequest {
            model: model.to_string(),
            prompt: prompt.to_string(),
            stream: false,
            options,
        };

        let resp = self.client
            .post(format!("{}/api/generate", self.base_url))
            .json(&request)
            .send()
            .await?;

        self.handle_error_response(resp)
            .await?
            .json::<GenerateResponse>()
            .await
            .map_err(|e| OllamaError::Deserialization(e.to_string()))
    }

    /// POST /api/generate (streaming).
    ///
    /// Returns a `Stream` of `GenerateStreamChunk` items. Each chunk
    /// contains a partial token. The final chunk has `done: true`.
    ///
    /// Ollama streams NDJSON (one JSON object per line). This method
    /// reads the response body as a byte stream, splits on newlines,
    /// and deserializes each line.
    pub async fn generate_stream(
        &self,
        model: &str,
        prompt: &str,
        options: Option<GenerateOptions>,
    ) -> Result<
        impl Stream<Item = Result<GenerateStreamChunk, OllamaError>>,
        OllamaError,
    > {
        let request = GenerateRequest {
            model: model.to_string(),
            prompt: prompt.to_string(),
            stream: true,
            options,
        };

        let resp = self.client
            .post(format!("{}/api/generate", self.base_url))
            .json(&request)
            .send()
            .await?;

        let resp = self.handle_error_response(resp).await?;
        Ok(Self::ndjson_stream::<GenerateStreamChunk>(resp))
    }

    /// POST /api/chat (non-streaming).
    ///
    /// Sends a chat conversation (message history) to the model.
    pub async fn chat(
        &self,
        model: &str,
        messages: Vec<ChatMessage>,
        options: Option<GenerateOptions>,
    ) -> Result<ChatResponse, OllamaError> {
        let request = ChatRequest {
            model: model.to_string(),
            messages,
            stream: false,
            options,
        };

        let resp = self.client
            .post(format!("{}/api/chat", self.base_url))
            .json(&request)
            .send()
            .await?;

        self.handle_error_response(resp)
            .await?
            .json::<ChatResponse>()
            .await
            .map_err(|e| OllamaError::Deserialization(e.to_string()))
    }

    /// POST /api/embed.
    ///
    /// Generates embedding vectors for the given input texts.
    /// Returns one embedding vector per input string.
    pub async fn embed(
        &self,
        model: &str,
        input: Vec<String>,
    ) -> Result<EmbedResponse, OllamaError> {
        let request = EmbedRequest {
            model: model.to_string(),
            input,
        };

        let resp = self.client
            .post(format!("{}/api/embed", self.base_url))
            .json(&request)
            .send()
            .await?;

        self.handle_error_response(resp)
            .await?
            .json::<EmbedResponse>()
            .await
            .map_err(|e| OllamaError::Deserialization(e.to_string()))
    }

    /// GET /api/tags.
    ///
    /// Lists all models available on the Ollama instance.
    pub async fn list_models(&self) -> Result<ListModelsResponse, OllamaError> {
        let resp = self.client
            .get(format!("{}/api/tags", self.base_url))
            .send()
            .await?;

        self.handle_error_response(resp)
            .await?
            .json::<ListModelsResponse>()
            .await
            .map_err(|e| OllamaError::Deserialization(e.to_string()))
    }

    /// Check if Ollama is reachable by hitting GET /api/tags.
    /// Returns true if the request succeeds, false otherwise.
    pub async fn is_healthy(&self) -> bool {
        self.list_models().await.is_ok()
    }

    /// Parse NDJSON streaming response into a Stream of typed chunks.
    ///
    /// Ollama streams responses as newline-delimited JSON. Each line
    /// is a complete JSON object. This method uses `bytes_stream()`
    /// from reqwest and buffers bytes until a newline is found,
    /// then deserializes each complete line.
    fn ndjson_stream<T: serde::de::DeserializeOwned>(
        resp: reqwest::Response,
    ) -> impl Stream<Item = Result<T, OllamaError>> {
        use futures::StreamExt;

        // Box::pin so the byte stream is Unpin and `.next()` can be called.
        let byte_stream = Box::pin(resp.bytes_stream());

        futures::stream::unfold(
            (byte_stream, Vec::new()),
            |(mut stream, mut buf)| async move {
                loop {
                    // Yield the next complete line if one is buffered.
                    if let Some(pos) = buf.iter().position(|&b| b == b'\n') {
                        let line: Vec<u8> = buf.drain(..=pos).collect();
                        let line = &line[..line.len() - 1]; // strip the '\n'
                        if line.is_empty() {
                            continue; // tolerate blank lines
                        }
                        let item = serde_json::from_slice::<T>(line)
                            .map_err(|e| OllamaError::Deserialization(e.to_string()));
                        return Some((item, (stream, buf)));
                    }
                    // No complete line yet: pull more bytes from the body.
                    match stream.next().await {
                        Some(Ok(bytes)) => buf.extend_from_slice(&bytes),
                        Some(Err(e)) => {
                            return Some((Err(OllamaError::Http(e)), (stream, buf)))
                        }
                        None => return None, // body exhausted
                    }
                }
            },
        )
    }

    /// Check response status and extract error message for non-2xx responses.
    async fn handle_error_response(
        &self,
        resp: reqwest::Response,
    ) -> Result<reqwest::Response, OllamaError> {
        if resp.status().is_success() {
            return Ok(resp);
        }

        let status = resp.status().as_u16();
        let message = resp
            .text()
            .await
            .unwrap_or_else(|_| "unknown error".to_string());

        Err(OllamaError::Api { status, message })
    }
}

NDJSON stream implementation detail:

The ndjson_stream method will use reqwest::Response::bytes_stream() (requires the stream feature) and futures::stream::unfold to:

  1. Accumulate bytes from the HTTP response body into a buffer.
  2. On each newline boundary, extract the complete line.
  3. Deserialize the line as T using serde_json::from_slice.
  4. Yield Ok(T) or Err(OllamaError::Deserialization(...)).
  5. Return None when the byte stream is exhausted.

This approach handles partial JSON objects that span multiple TCP chunks correctly.
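
The partial-chunk handling can be exercised in isolation. Below is a stdlib-only sketch with plain byte slices standing in for TCP chunks; the serde_json step is left out so it runs without external crates:

```rust
// Accumulate byte chunks and emit each complete newline-terminated line,
// mirroring the buffering described in steps 1–5 above.
fn split_ndjson(chunks: &[&[u8]]) -> Vec<String> {
    let mut buf: Vec<u8> = Vec::new();
    let mut lines = Vec::new();
    for chunk in chunks {
        buf.extend_from_slice(chunk);
        // Drain every complete line currently sitting in the buffer.
        while let Some(pos) = buf.iter().position(|&b| b == b'\n') {
            let line: Vec<u8> = buf.drain(..=pos).collect();
            let text = String::from_utf8_lossy(&line[..line.len() - 1]).into_owned();
            if !text.is_empty() {
                lines.push(text);
            }
        }
    }
    lines
}

fn main() {
    // One JSON object split across two "TCP chunks", then a complete final one.
    let chunks: [&[u8]; 3] = [
        b"{\"response\":\"Hel",
        b"lo\",\"done\":false}\n",
        b"{\"response\":\"\",\"done\":true}\n",
    ];
    let lines = split_ndjson(&chunks);
    assert_eq!(lines.len(), 2);
    assert_eq!(lines[0], "{\"response\":\"Hello\",\"done\":false}");
    println!("parsed {} lines", lines.len());
}
```

The first chunk ends mid-object, so nothing is emitted until the second chunk supplies the newline — exactly the case the streaming risk section below calls out.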

3. gRPC Handler Wiring

This issue does not implement the gRPC handler wiring — that is deferred to subsequent issues. However, the OllamaClient must be integrated into ModelGatewayServiceImpl so that future handler implementations can use it.

Update services/model-gateway/src/service.rs:

Add OllamaClient as a field on ModelGatewayServiceImpl:

use crate::ollama::OllamaClient;

pub struct ModelGatewayServiceImpl {
    config: Config,
    ollama: OllamaClient,
}

impl ModelGatewayServiceImpl {
    pub fn new(config: Config) -> anyhow::Result<Self> {
        let ollama = OllamaClient::new(&config)?;
        Ok(Self { config, ollama })
    }
}

Note: The constructor changes from infallible to Result since reqwest::Client::builder().build() can fail. Update main.rs accordingly to use ?.

Update services/model-gateway/src/main.rs:

Change ModelGatewayServiceImpl::new(config) to ModelGatewayServiceImpl::new(config)?.

4. Service Integration

No cross-service integration is needed for this issue. The OllamaClient is a standalone HTTP client that talks to the local Ollama instance. Integration with gRPC handlers will happen in follow-up issues.

5. Tests

Unit tests for serde types in services/model-gateway/src/ollama/types.rs:

  • test_generate_request_serialization: GenerateRequest serializes to expected JSON with stream: false
  • test_generate_request_serialization_with_options: Options fields are included when Some, omitted when None
  • test_generate_response_deserialization: Deserialize a complete Ollama generate response JSON
  • test_generate_response_missing_optional_fields: Optional fields default to None when absent
  • test_generate_stream_chunk_deserialization: Deserialize a streaming chunk (partial token, done: false)
  • test_generate_stream_chunk_final: Deserialize final chunk with done: true and done_reason
  • test_chat_request_serialization: ChatRequest with multiple messages serializes correctly
  • test_chat_role_serialization: ChatRole variants serialize as lowercase strings
  • test_chat_response_deserialization: Deserialize a complete chat response
  • test_embed_request_serialization: EmbedRequest with multiple inputs serializes correctly
  • test_embed_response_deserialization: Deserialize embedding response with vector data
  • test_list_models_response_deserialization: Deserialize model listing with multiple models
  • test_model_info_optional_fields: ModelInfo handles missing digest gracefully

Unit tests for error handling in services/model-gateway/src/ollama/error.rs:

  • test_error_display_http: OllamaError::Http formats with reqwest message
  • test_error_display_api: OllamaError::Api includes status code and message
  • test_error_display_deserialization: OllamaError::Deserialization includes detail

Integration-style tests for OllamaClient in services/model-gateway/src/ollama/client.rs:

Use a mock HTTP server (either mockito or wiremock) to simulate Ollama API responses:

  • test_generate_success: Mock /api/generate returns valid JSON, verify parsed response
  • test_generate_with_options: Verify temperature, top_p, num_predict, stop are sent in request body
  • test_generate_stream_success: Mock returns NDJSON with 3 chunks + final, verify all chunks yielded
  • test_generate_stream_empty_response: Mock returns single done: true chunk
  • test_chat_success: Mock /api/chat returns valid response, verify message parsing
  • test_chat_with_history: Send multi-message conversation, verify all messages in request body
  • test_embed_success: Mock /api/embed returns embedding vectors, verify dimensions
  • test_embed_multiple_inputs: Send multiple texts, verify multiple embeddings returned
  • test_list_models_success: Mock /api/tags returns model list
  • test_list_models_empty: Mock returns empty model list
  • test_is_healthy_success: Mock /api/tags returns 200, is_healthy() returns true
  • test_is_healthy_failure: Mock returns 500, is_healthy() returns false
  • test_api_error_404: Mock returns 404 with error message, verify OllamaError::Api
  • test_api_error_500: Mock returns 500 with error body, verify error extraction
  • test_connection_timeout: Client configured with very short timeout, verify OllamaError::Http
  • test_base_url_trailing_slash: Config URL with trailing slash is normalized

Mocking strategy:

Use wiremock crate as a dev-dependency. It provides a MockServer that binds to a random port, allowing parallel test execution without port conflicts. Each test creates its own MockServer, configures expected requests/responses, then creates an OllamaClient pointed at the mock server URL.

For streaming tests, the mock server returns a response body containing multiple NDJSON lines separated by \n.
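
For instance, a mocked streaming body for three token chunks plus the final chunk might look like this (token text invented for the example):

```
{"model":"llama3.1:8b","response":"Hel","done":false}
{"model":"llama3.1:8b","response":"lo","done":false}
{"model":"llama3.1:8b","response":"!","done":false}
{"model":"llama3.1:8b","response":"","done":true,"done_reason":"stop","eval_count":3}
```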

Configuration tests in services/model-gateway/src/config.rs:

  • test_client_config_defaults: OllamaClientConfig::default() returns expected timeout/pool values
  • test_client_config_from_toml: Custom client config loads from TOML
  • test_config_with_client_section: Full Config with [client] section parses correctly

Files to Create/Modify

  • services/model-gateway/Cargo.toml (modify): Add reqwest (with json, stream features), futures, serde_json dependencies; add wiremock dev-dependency
  • services/model-gateway/src/config.rs (modify): Add OllamaClientConfig struct with timeout/pool settings; add client field to Config
  • services/model-gateway/src/ollama/mod.rs (create): Module declaration, re-exports of OllamaClient, OllamaError, and types
  • services/model-gateway/src/ollama/types.rs (create): Serde request/response structs for all Ollama API endpoints
  • services/model-gateway/src/ollama/error.rs (create): OllamaError enum with Http, Api, Deserialization, StreamIncomplete variants
  • services/model-gateway/src/ollama/client.rs (create): OllamaClient struct with generate, generate_stream, chat, embed, list_models, is_healthy methods and NDJSON stream parser
  • services/model-gateway/src/lib.rs (modify): Add pub mod ollama;
  • services/model-gateway/src/service.rs (modify): Add OllamaClient field to ModelGatewayServiceImpl; change constructor to return Result
  • services/model-gateway/src/main.rs (modify): Update ModelGatewayServiceImpl::new(config) call to handle Result with ?

Risks and Edge Cases

  • Streaming NDJSON parsing: Ollama sends newline-delimited JSON. TCP chunks may not align with JSON object boundaries — a single chunk could contain a partial JSON line or multiple lines. The buffer-based ndjson_stream implementation must handle both cases. Mitigation: accumulate bytes until \n is found, only parse complete lines.
  • Large model response times: Inference on large models (14B+) can take minutes. The default request timeout of 300 seconds should be sufficient, but this is configurable. Streaming mitigates perceived latency by yielding tokens incrementally.
  • Ollama API version compatibility: The /api/embed endpoint (with input array) was introduced in Ollama 0.1.44+. Older Ollama versions use /api/embeddings with a different request shape. Mitigation: target the newer API. Document the minimum Ollama version requirement.
  • Connection pool exhaustion: If many concurrent gRPC requests hit the gateway simultaneously, the reqwest connection pool could be exhausted. Mitigation: pool_max_idle is configurable; the default of 10 is reasonable for a single-node setup. Consider adding a semaphore for concurrency limiting in a future issue if needed.
  • wiremock test isolation: Each test creates its own MockServer on a random port, so tests can run in parallel safely. However, wiremock adds to dev-dependency compile time.
  • Constructor change breaks existing tests: Changing ModelGatewayServiceImpl::new() from infallible to Result will break existing tests in service.rs. Mitigation: update the test helper test_config() to also construct the OllamaClient, or use a test-only constructor that accepts a pre-built client. Alternatively, keep a separate new_with_client() constructor for testability and dependency injection.
  • reqwest TLS: The default reqwest build pulls in rustls or native-tls. Since Ollama runs locally over plain HTTP, TLS is not needed. Consider using default-features = false with just the required features to minimize compile time and binary size. However, if a user runs Ollama behind a TLS reverse proxy, TLS support is needed. Mitigation: use default features (includes TLS) for now; optimize later if compile time is a concern.
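
The /api/embed vs legacy /api/embeddings difference noted above, sketched for comparison (shapes paraphrased from the public Ollama docs; treat as illustrative):

```
POST /api/embed         {"model": "nomic-embed-text", "input": ["first", "second"]}
  → {"model": "...", "embeddings": [[0.1, ...], [0.2, ...]]}

POST /api/embeddings    {"model": "nomic-embed-text", "prompt": "first"}        (legacy)
  → {"embedding": [0.1, ...]}
```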

Deviation Log

(Filled during implementation if deviations from plan occur)

Deviation Reason