llm-multiverse/implementation-plans/issue-039.md
2026-03-10 13:57:40 +01:00

Implementation Plan — Issue #39: Implement Ollama HTTP client

Metadata

Issue: #39
Title: Implement Ollama HTTP client
Milestone: Phase 5: Model Gateway
Labels:
Status: COMPLETED
Language: Rust
Related Plans: issue-012.md, issue-038.md
Blocked by: #38 (completed)

Acceptance Criteria

  • Async HTTP client using reqwest
  • Support for /api/generate (streaming and non-streaming)
  • Support for /api/chat with message history
  • Support for /api/embed (embeddings)
  • Connection pooling and timeout configuration
  • Error handling for Ollama-specific error responses

Architecture Analysis

Service Context

This issue belongs to the Model Gateway service (services/model-gateway/). The Ollama HTTP client is the core backend layer that the gRPC service handlers (service.rs) will call to fulfil Inference, StreamInference, GenerateEmbedding, and IsModelReady RPCs.

The client wraps the Ollama REST API (default http://localhost:11434) and exposes typed Rust methods that the gRPC handlers can call directly. The gRPC handlers (issue #40+) will translate proto request/response types to/from the Ollama client types defined here.

gRPC endpoints affected (consumers of this client):

  • Inference — will call OllamaClient::generate() (non-streaming)
  • StreamInference — will call OllamaClient::generate_stream() (streaming)
  • GenerateEmbedding — will call OllamaClient::embed()
  • IsModelReady — will call OllamaClient::list_models() to check actual Ollama availability

Proto messages involved:

  • InferenceParams — carries prompt, model routing hints (TaskComplexity), temperature, top_p, max_tokens, stop_sequences
  • InferenceResponse — text, finish_reason, tokens_used
  • StreamInferenceResponse — token, finish_reason
  • GenerateEmbeddingRequest — text, model
  • GenerateEmbeddingResponse — embedding vector, dimensions

Existing Patterns

  • Config: services/model-gateway/src/config.rs already defines Config with ollama_url: String (default http://localhost:11434) and ModelRoutingConfig for model name resolution.
  • Service struct: services/model-gateway/src/service.rs defines ModelGatewayServiceImpl holding Config. The OllamaClient will be added here as a field.
  • Error types: Other services use thiserror for module-level error enums (e.g., DbError, EmbeddingError, ProvenanceError). The model-gateway Cargo.toml already includes thiserror = "2".
  • Async runtime: tokio with features = ["full"] is already a dependency. tokio-stream = "0.1" is also present.
  • Serde: serde = { version = "1", features = ["derive"] } is already a dependency for config deserialization.

Dependencies

  • reqwest (new) — HTTP client with connection pooling, async support, JSON serialization, and streaming response bodies. Features needed: json, stream.
  • futures (new) — For Stream trait and stream combinators (futures::Stream, futures::StreamExt). Needed to expose streaming generate responses as a Stream type.
  • serde_json (new) — For parsing newline-delimited JSON (NDJSON) from Ollama streaming responses. While reqwest can deserialize full JSON responses, streaming requires manual line-by-line parsing.
  • No proto changes required — the Ollama client is an internal HTTP layer; the proto definitions are already complete from issue #12.
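
For reference, the Cargo.toml additions might look like the following (version numbers are illustrative — the plan does not pin versions; wiremock is the dev-dependency suggested in the test section below):

```toml
[dependencies]
reqwest = { version = "0.12", features = ["json", "stream"] }
futures = "0.3"
serde_json = "1"

[dev-dependencies]
wiremock = "0.6"
```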

Implementation Steps

1. Types & Configuration

Add Ollama-specific configuration to services/model-gateway/src/config.rs:

/// Configuration for the Ollama HTTP client.
#[derive(Debug, Clone, Deserialize)]
pub struct OllamaClientConfig {
    /// Request timeout in seconds (default: 300 — generous for large model inference).
    #[serde(default = "default_request_timeout_secs")]
    pub request_timeout_secs: u64,

    /// Connection timeout in seconds (default: 10).
    #[serde(default = "default_connect_timeout_secs")]
    pub connect_timeout_secs: u64,

    /// Maximum idle connections in the pool (default: 10).
    #[serde(default = "default_pool_max_idle")]
    pub pool_max_idle: usize,

    /// Idle connection timeout in seconds (default: 60).
    #[serde(default = "default_pool_idle_timeout_secs")]
    pub pool_idle_timeout_secs: u64,
}

Add #[serde(default)] pub client: OllamaClientConfig field to the existing Config struct.
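The #[serde(default = "...")] attributes above reference helper functions, and the #[serde(default)] on the client field requires a Default impl. A minimal sketch of both follows; the struct is repeated here without its serde derives so the snippet compiles standalone, and the default values match the doc comments above:

```rust
// Sketch: default helpers referenced by the `#[serde(default = "...")]`
// attributes, plus the `Default` impl needed by `#[serde(default)] pub client`.
pub struct OllamaClientConfig {
    pub request_timeout_secs: u64,
    pub connect_timeout_secs: u64,
    pub pool_max_idle: usize,
    pub pool_idle_timeout_secs: u64,
}

fn default_request_timeout_secs() -> u64 { 300 }
fn default_connect_timeout_secs() -> u64 { 10 }
fn default_pool_max_idle() -> usize { 10 }
fn default_pool_idle_timeout_secs() -> u64 { 60 }

impl Default for OllamaClientConfig {
    fn default() -> Self {
        Self {
            request_timeout_secs: default_request_timeout_secs(),
            connect_timeout_secs: default_connect_timeout_secs(),
            pool_max_idle: default_pool_max_idle(),
            pool_idle_timeout_secs: default_pool_idle_timeout_secs(),
        }
    }
}

fn main() {
    let cfg = OllamaClientConfig::default();
    assert_eq!(cfg.request_timeout_secs, 300);
    assert_eq!(cfg.pool_max_idle, 10);
    println!("defaults ok");
}
```

Reusing the same helper functions in the Default impl keeps the serde-path defaults and OllamaClientConfig::default() in sync, which the config tests below rely on.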

Define Ollama API request/response types in services/model-gateway/src/ollama/types.rs:

These are serde structs matching the Ollama REST API JSON schema.

use serde::{Deserialize, Serialize};

// --- /api/generate ---

#[derive(Debug, Serialize)]
pub struct GenerateRequest {
    pub model: String,
    pub prompt: String,
    pub stream: bool,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub options: Option<GenerateOptions>,
}

#[derive(Debug, Serialize)]
pub struct GenerateOptions {
    #[serde(skip_serializing_if = "Option::is_none")]
    pub temperature: Option<f32>,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub top_p: Option<f32>,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub num_predict: Option<i32>,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub stop: Option<Vec<String>>,
}

/// Full response from /api/generate with stream:false.
#[derive(Debug, Deserialize)]
pub struct GenerateResponse {
    pub model: String,
    pub response: String,
    pub done: bool,
    #[serde(default)]
    pub done_reason: Option<String>,
    #[serde(default)]
    pub total_duration: Option<u64>,
    #[serde(default)]
    pub eval_count: Option<u32>,
    #[serde(default)]
    pub prompt_eval_count: Option<u32>,
}

/// Single chunk from /api/generate with stream:true (NDJSON).
#[derive(Debug, Deserialize)]
pub struct GenerateStreamChunk {
    pub model: String,
    pub response: String,
    pub done: bool,
    #[serde(default)]
    pub done_reason: Option<String>,
    #[serde(default)]
    pub eval_count: Option<u32>,
    #[serde(default)]
    pub prompt_eval_count: Option<u32>,
}

// --- /api/chat ---

#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(rename_all = "lowercase")]
pub enum ChatRole {
    System,
    User,
    Assistant,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ChatMessage {
    pub role: ChatRole,
    pub content: String,
}

#[derive(Debug, Serialize)]
pub struct ChatRequest {
    pub model: String,
    pub messages: Vec<ChatMessage>,
    pub stream: bool,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub options: Option<GenerateOptions>,
}

#[derive(Debug, Deserialize)]
pub struct ChatResponse {
    pub model: String,
    pub message: ChatMessage,
    pub done: bool,
    #[serde(default)]
    pub done_reason: Option<String>,
    #[serde(default)]
    pub total_duration: Option<u64>,
    #[serde(default)]
    pub eval_count: Option<u32>,
    #[serde(default)]
    pub prompt_eval_count: Option<u32>,
}

// --- /api/embed ---

#[derive(Debug, Serialize)]
pub struct EmbedRequest {
    pub model: String,
    pub input: Vec<String>,
}

#[derive(Debug, Deserialize)]
pub struct EmbedResponse {
    pub model: String,
    pub embeddings: Vec<Vec<f32>>,
}

// --- /api/tags (list models) ---

#[derive(Debug, Deserialize)]
pub struct ListModelsResponse {
    pub models: Vec<ModelInfo>,
}

#[derive(Debug, Deserialize)]
pub struct ModelInfo {
    pub name: String,
    pub model: String,
    #[serde(default)]
    pub size: u64,
    #[serde(default)]
    pub digest: Option<String>,
}

// --- /api/show (model details) ---

#[derive(Debug, Serialize)]
pub struct ShowModelRequest {
    pub model: String,
}

#[derive(Debug, Deserialize)]
pub struct ShowModelResponse {
    pub modelfile: Option<String>,
    pub parameters: Option<String>,
    pub template: Option<String>,
}
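
As a concrete reference for the structs above, a non-streaming /api/generate response body looks roughly like this (field values are illustrative, based on the public Ollama API docs):

```json
{
  "model": "llama3.1:8b",
  "response": "Hello! How can I help?",
  "done": true,
  "done_reason": "stop",
  "total_duration": 4935886791,
  "prompt_eval_count": 26,
  "eval_count": 282
}
```

With stream:true the same fields arrive as one JSON object per line, with "done": false and a partial "response" token on every chunk until the final one.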

2. Core Logic

Create services/model-gateway/src/ollama/error.rs — Error types:

use thiserror::Error;

#[derive(Debug, Error)]
pub enum OllamaError {
    /// HTTP-level error (connection refused, timeout, DNS, TLS, etc.).
    #[error("HTTP error: {0}")]
    Http(#[from] reqwest::Error),

    /// Ollama returned a non-2xx status code.
    #[error("Ollama API error (status {status}): {message}")]
    Api {
        status: u16,
        message: String,
    },

    /// Failed to deserialize Ollama JSON response.
    #[error("deserialization error: {0}")]
    Deserialization(String),

    /// Stream terminated unexpectedly without a done:true chunk.
    #[error("stream ended unexpectedly")]
    StreamIncomplete,
}

Create services/model-gateway/src/ollama/client.rs — OllamaClient:

use std::time::Duration;
use futures::Stream;
use reqwest::Client;

use crate::config::{Config, OllamaClientConfig};
use super::error::OllamaError;
use super::types::*;

/// Async HTTP client for the Ollama REST API.
///
/// Wraps `reqwest::Client` with connection pooling, timeouts, and
/// typed methods for each Ollama endpoint.
pub struct OllamaClient {
    client: Client,
    base_url: String,
}

impl OllamaClient {
    /// Create a new client from the service configuration.
    ///
    /// Configures connection pooling, timeouts, and the base URL
    /// from `Config.ollama_url` and `Config.client`.
    pub fn new(config: &Config) -> Result<Self, OllamaError> {
        let client_config = &config.client;
        let client = Client::builder()
            .timeout(Duration::from_secs(client_config.request_timeout_secs))
            .connect_timeout(Duration::from_secs(client_config.connect_timeout_secs))
            .pool_max_idle_per_host(client_config.pool_max_idle)
            .pool_idle_timeout(Duration::from_secs(client_config.pool_idle_timeout_secs))
            .build()?;

        let base_url = config.ollama_url.trim_end_matches('/').to_string();

        Ok(Self { client, base_url })
    }

    /// POST /api/generate (non-streaming).
    ///
    /// Sends a prompt to the specified model and returns the complete response.
    pub async fn generate(
        &self,
        model: &str,
        prompt: &str,
        options: Option<GenerateOptions>,
    ) -> Result<GenerateResponse, OllamaError> {
        let request = GenerateRequest {
            model: model.to_string(),
            prompt: prompt.to_string(),
            stream: false,
            options,
        };

        let resp = self.client
            .post(format!("{}/api/generate", self.base_url))
            .json(&request)
            .send()
            .await?;

        self.handle_error_response(resp)
            .await?
            .json::<GenerateResponse>()
            .await
            .map_err(|e| OllamaError::Deserialization(e.to_string()))
    }

    /// POST /api/generate (streaming).
    ///
    /// Returns a `Stream` of `GenerateStreamChunk` items. Each chunk
    /// contains a partial token. The final chunk has `done: true`.
    ///
    /// Ollama streams NDJSON (one JSON object per line). This method
    /// reads the response body as a byte stream, splits on newlines,
    /// and deserializes each line.
    pub async fn generate_stream(
        &self,
        model: &str,
        prompt: &str,
        options: Option<GenerateOptions>,
    ) -> Result<
        impl Stream<Item = Result<GenerateStreamChunk, OllamaError>>,
        OllamaError,
    > {
        let request = GenerateRequest {
            model: model.to_string(),
            prompt: prompt.to_string(),
            stream: true,
            options,
        };

        let resp = self.client
            .post(format!("{}/api/generate", self.base_url))
            .json(&request)
            .send()
            .await?;

        let resp = self.handle_error_response(resp).await?;
        Ok(Self::ndjson_stream::<GenerateStreamChunk>(resp))
    }

    /// POST /api/chat (non-streaming).
    ///
    /// Sends a chat conversation (message history) to the model.
    pub async fn chat(
        &self,
        model: &str,
        messages: Vec<ChatMessage>,
        options: Option<GenerateOptions>,
    ) -> Result<ChatResponse, OllamaError> {
        let request = ChatRequest {
            model: model.to_string(),
            messages,
            stream: false,
            options,
        };

        let resp = self.client
            .post(format!("{}/api/chat", self.base_url))
            .json(&request)
            .send()
            .await?;

        self.handle_error_response(resp)
            .await?
            .json::<ChatResponse>()
            .await
            .map_err(|e| OllamaError::Deserialization(e.to_string()))
    }

    /// POST /api/embed.
    ///
    /// Generates embedding vectors for the given input texts.
    /// Returns one embedding vector per input string.
    pub async fn embed(
        &self,
        model: &str,
        input: Vec<String>,
    ) -> Result<EmbedResponse, OllamaError> {
        let request = EmbedRequest {
            model: model.to_string(),
            input,
        };

        let resp = self.client
            .post(format!("{}/api/embed", self.base_url))
            .json(&request)
            .send()
            .await?;

        self.handle_error_response(resp)
            .await?
            .json::<EmbedResponse>()
            .await
            .map_err(|e| OllamaError::Deserialization(e.to_string()))
    }

    /// GET /api/tags.
    ///
    /// Lists all models available on the Ollama instance.
    pub async fn list_models(&self) -> Result<ListModelsResponse, OllamaError> {
        let resp = self.client
            .get(format!("{}/api/tags", self.base_url))
            .send()
            .await?;

        self.handle_error_response(resp)
            .await?
            .json::<ListModelsResponse>()
            .await
            .map_err(|e| OllamaError::Deserialization(e.to_string()))
    }

    /// Check if Ollama is reachable by hitting GET /api/tags.
    /// Returns true if the request succeeds, false otherwise.
    pub async fn is_healthy(&self) -> bool {
        self.list_models().await.is_ok()
    }

    /// Parse NDJSON streaming response into a Stream of typed chunks.
    ///
    /// Ollama streams responses as newline-delimited JSON. Each line
    /// is a complete JSON object. This method uses `bytes_stream()`
    /// from reqwest and buffers bytes until a newline is found,
    /// then deserializes each complete line.
    fn ndjson_stream<T: serde::de::DeserializeOwned>(
        resp: reqwest::Response,
    ) -> impl Stream<Item = Result<T, OllamaError>> {
        use futures::StreamExt;

        // Box::pin so the byte stream is Unpin and `.next()` can be called.
        let byte_stream = Box::pin(resp.bytes_stream());

        futures::stream::unfold(
            (byte_stream, Vec::new()),
            |(mut stream, mut buf)| async move {
                loop {
                    // Yield the next complete line if one is buffered.
                    if let Some(pos) = buf.iter().position(|&b| b == b'\n') {
                        let line: Vec<u8> = buf.drain(..=pos).collect();
                        let line = &line[..line.len() - 1]; // strip the '\n'
                        if line.is_empty() {
                            continue; // tolerate blank lines
                        }
                        let item = serde_json::from_slice::<T>(line)
                            .map_err(|e| OllamaError::Deserialization(e.to_string()));
                        return Some((item, (stream, buf)));
                    }
                    // No complete line yet: pull more bytes from the body.
                    match stream.next().await {
                        Some(Ok(bytes)) => buf.extend_from_slice(&bytes),
                        Some(Err(e)) => {
                            return Some((Err(OllamaError::Http(e)), (stream, buf)))
                        }
                        None => return None, // body exhausted
                    }
                }
            },
        )
    }

    /// Check response status and extract error message for non-2xx responses.
    async fn handle_error_response(
        &self,
        resp: reqwest::Response,
    ) -> Result<reqwest::Response, OllamaError> {
        if resp.status().is_success() {
            return Ok(resp);
        }

        let status = resp.status().as_u16();
        let message = resp
            .text()
            .await
            .unwrap_or_else(|_| "unknown error".to_string());

        Err(OllamaError::Api { status, message })
    }
}

NDJSON stream implementation detail:

The ndjson_stream method will use reqwest::Response::bytes_stream() (requires the stream feature) and futures::stream::unfold to:

  1. Accumulate bytes from the HTTP response body into a buffer.
  2. On each newline boundary, extract the complete line.
  3. Deserialize the line as T using serde_json::from_slice.
  4. Yield Ok(T) or Err(OllamaError::Deserialization(...)).
  5. Return None when the byte stream is exhausted.

This approach handles partial JSON objects that span multiple TCP chunks correctly.
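
The partial-chunk handling can be exercised in isolation. Below is a stdlib-only sketch with plain byte slices standing in for TCP chunks; the serde_json step is left out so it runs without external crates:

```rust
// Accumulate byte chunks and emit each complete newline-terminated line,
// mirroring the buffering described in steps 1–5 above.
fn split_ndjson(chunks: &[&[u8]]) -> Vec<String> {
    let mut buf: Vec<u8> = Vec::new();
    let mut lines = Vec::new();
    for chunk in chunks {
        buf.extend_from_slice(chunk);
        // Drain every complete line currently sitting in the buffer.
        while let Some(pos) = buf.iter().position(|&b| b == b'\n') {
            let line: Vec<u8> = buf.drain(..=pos).collect();
            let text = String::from_utf8_lossy(&line[..line.len() - 1]).into_owned();
            if !text.is_empty() {
                lines.push(text);
            }
        }
    }
    lines
}

fn main() {
    // One JSON object split across two "TCP chunks", then a complete final one.
    let chunks: [&[u8]; 3] = [
        b"{\"response\":\"Hel",
        b"lo\",\"done\":false}\n",
        b"{\"response\":\"\",\"done\":true}\n",
    ];
    let lines = split_ndjson(&chunks);
    assert_eq!(lines.len(), 2);
    assert_eq!(lines[0], "{\"response\":\"Hello\",\"done\":false}");
    println!("parsed {} lines", lines.len());
}
```

The first chunk ends mid-object, so nothing is emitted until the second chunk supplies the newline — exactly the case the streaming risk section below calls out.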

3. gRPC Handler Wiring

This issue does not implement the gRPC handler wiring — that is deferred to subsequent issues. However, the OllamaClient must be integrated into ModelGatewayServiceImpl so that future handler implementations can use it.

Update services/model-gateway/src/service.rs:

Add OllamaClient as a field on ModelGatewayServiceImpl:

use crate::ollama::OllamaClient;

pub struct ModelGatewayServiceImpl {
    config: Config,
    ollama: OllamaClient,
}

impl ModelGatewayServiceImpl {
    pub fn new(config: Config) -> anyhow::Result<Self> {
        let ollama = OllamaClient::new(&config)?;
        Ok(Self { config, ollama })
    }
}

Note: The constructor changes from infallible to Result since reqwest::Client::builder().build() can fail. Update main.rs accordingly to use ?.

Update services/model-gateway/src/main.rs:

Change ModelGatewayServiceImpl::new(config) to ModelGatewayServiceImpl::new(config)?.

4. Service Integration

No cross-service integration is needed for this issue. The OllamaClient is a standalone HTTP client that talks to the local Ollama instance. Integration with gRPC handlers will happen in follow-up issues.

5. Tests

Unit tests for serde types in services/model-gateway/src/ollama/types.rs:

  • test_generate_request_serialization: GenerateRequest serializes to expected JSON with stream: false
  • test_generate_request_serialization_with_options: Options fields are included when Some, omitted when None
  • test_generate_response_deserialization: Deserialize a complete Ollama generate response JSON
  • test_generate_response_missing_optional_fields: Optional fields default to None when absent
  • test_generate_stream_chunk_deserialization: Deserialize a streaming chunk (partial token, done: false)
  • test_generate_stream_chunk_final: Deserialize final chunk with done: true and done_reason
  • test_chat_request_serialization: ChatRequest with multiple messages serializes correctly
  • test_chat_role_serialization: ChatRole variants serialize as lowercase strings
  • test_chat_response_deserialization: Deserialize a complete chat response
  • test_embed_request_serialization: EmbedRequest with multiple inputs serializes correctly
  • test_embed_response_deserialization: Deserialize embedding response with vector data
  • test_list_models_response_deserialization: Deserialize model listing with multiple models
  • test_model_info_optional_fields: ModelInfo handles missing digest gracefully

Unit tests for error handling in services/model-gateway/src/ollama/error.rs:

  • test_error_display_http: OllamaError::Http formats with reqwest message
  • test_error_display_api: OllamaError::Api includes status code and message
  • test_error_display_deserialization: OllamaError::Deserialization includes detail

Integration-style tests for OllamaClient in services/model-gateway/src/ollama/client.rs:

Use a mock HTTP server (either mockito or wiremock) to simulate Ollama API responses:

  • test_generate_success: Mock /api/generate returns valid JSON, verify parsed response
  • test_generate_with_options: Verify temperature, top_p, num_predict, stop are sent in request body
  • test_generate_stream_success: Mock returns NDJSON with 3 chunks + final, verify all chunks yielded
  • test_generate_stream_empty_response: Mock returns single done: true chunk
  • test_chat_success: Mock /api/chat returns valid response, verify message parsing
  • test_chat_with_history: Send multi-message conversation, verify all messages in request body
  • test_embed_success: Mock /api/embed returns embedding vectors, verify dimensions
  • test_embed_multiple_inputs: Send multiple texts, verify multiple embeddings returned
  • test_list_models_success: Mock /api/tags returns model list
  • test_list_models_empty: Mock returns empty model list
  • test_is_healthy_success: Mock /api/tags returns 200, is_healthy() returns true
  • test_is_healthy_failure: Mock returns 500, is_healthy() returns false
  • test_api_error_404: Mock returns 404 with error message, verify OllamaError::Api
  • test_api_error_500: Mock returns 500 with error body, verify error extraction
  • test_connection_timeout: Client configured with very short timeout, verify OllamaError::Http
  • test_base_url_trailing_slash: Config URL with trailing slash is normalized

Mocking strategy:

Use wiremock crate as a dev-dependency. It provides a MockServer that binds to a random port, allowing parallel test execution without port conflicts. Each test creates its own MockServer, configures expected requests/responses, then creates an OllamaClient pointed at the mock server URL.

For streaming tests, the mock server returns a response body containing multiple NDJSON lines separated by \n.
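
For instance, a mocked streaming body for three token chunks plus the final chunk might look like this (token text invented for the example):

```
{"model":"llama3.1:8b","response":"Hel","done":false}
{"model":"llama3.1:8b","response":"lo","done":false}
{"model":"llama3.1:8b","response":"!","done":false}
{"model":"llama3.1:8b","response":"","done":true,"done_reason":"stop","eval_count":3}
```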

Configuration tests in services/model-gateway/src/config.rs:

  • test_client_config_defaults: OllamaClientConfig::default() returns expected timeout/pool values
  • test_client_config_from_toml: Custom client config loads from TOML
  • test_config_with_client_section: Full Config with [client] section parses correctly

Files to Create/Modify

  • services/model-gateway/Cargo.toml (modify): Add reqwest (with json, stream features), futures, serde_json dependencies; add wiremock dev-dependency
  • services/model-gateway/src/config.rs (modify): Add OllamaClientConfig struct with timeout/pool settings; add client field to Config
  • services/model-gateway/src/ollama/mod.rs (create): Module declaration, re-exports of OllamaClient, OllamaError, and types
  • services/model-gateway/src/ollama/types.rs (create): Serde request/response structs for all Ollama API endpoints
  • services/model-gateway/src/ollama/error.rs (create): OllamaError enum with Http, Api, Deserialization, StreamIncomplete variants
  • services/model-gateway/src/ollama/client.rs (create): OllamaClient struct with generate, generate_stream, chat, embed, list_models, is_healthy methods and NDJSON stream parser
  • services/model-gateway/src/lib.rs (modify): Add pub mod ollama;
  • services/model-gateway/src/service.rs (modify): Add OllamaClient field to ModelGatewayServiceImpl; change constructor to return Result
  • services/model-gateway/src/main.rs (modify): Update ModelGatewayServiceImpl::new(config) call to handle Result with ?

Risks and Edge Cases

  • Streaming NDJSON parsing: Ollama sends newline-delimited JSON. TCP chunks may not align with JSON object boundaries — a single chunk could contain a partial JSON line or multiple lines. The buffer-based ndjson_stream implementation must handle both cases. Mitigation: accumulate bytes until \n is found, only parse complete lines.
  • Large model response times: Inference on large models (14B+) can take minutes. The default request timeout of 300 seconds should be sufficient, but this is configurable. Streaming mitigates perceived latency by yielding tokens incrementally.
  • Ollama API version compatibility: The /api/embed endpoint (with input array) was introduced in Ollama 0.1.44+. Older Ollama versions use /api/embeddings with a different request shape. Mitigation: target the newer API. Document the minimum Ollama version requirement.
  • Connection pool exhaustion: If many concurrent gRPC requests hit the gateway simultaneously, the reqwest connection pool could be exhausted. Mitigation: pool_max_idle is configurable; the default of 10 is reasonable for a single-node setup. Consider adding a semaphore for concurrency limiting in a future issue if needed.
  • wiremock test isolation: Each test creates its own MockServer on a random port, so tests can run in parallel safely. However, wiremock adds to dev-dependency compile time.
  • Constructor change breaks existing tests: Changing ModelGatewayServiceImpl::new() from infallible to Result will break existing tests in service.rs. Mitigation: update the test helper test_config() to also construct the OllamaClient, or use a test-only constructor that accepts a pre-built client. Alternatively, keep a separate new_with_client() constructor for testability and dependency injection.
  • reqwest TLS: The default reqwest build pulls in rustls or native-tls. Since Ollama runs locally over plain HTTP, TLS is not needed. Consider using default-features = false with just the required features to minimize compile time and binary size. However, if a user runs Ollama behind a TLS reverse proxy, TLS support is needed. Mitigation: use default features (includes TLS) for now; optimize later if compile time is a concern.
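
The /api/embed vs legacy /api/embeddings difference noted above, sketched for comparison (shapes paraphrased from the public Ollama docs; treat as illustrative):

```
POST /api/embed         {"model": "nomic-embed-text", "input": ["first", "second"]}
  → {"model": "...", "embeddings": [[0.1, ...], [0.2, ...]]}

POST /api/embeddings    {"model": "nomic-embed-text", "prompt": "first"}        (legacy)
  → {"embedding": [0.1, ...]}
```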

Deviation Log

(Filled during implementation if deviations from plan occur)

Deviation Reason