Merge pull request 'docs: multi-machine deployment guide (#99)' (#209) from feature/issue-99-deployment-guide into main
This commit was merged in pull request #209.
This commit is contained in:
361
docker/DEPLOYMENT.md
Normal file
@@ -0,0 +1,361 @@
# Multi-Machine Deployment Guide

This guide covers deploying llm-multiverse across multiple machines using
Docker Swarm with encrypted overlay networking.

## Prerequisites

### Hardware

| Node | Role | Specs | Purpose |
|---|---|---|---|
| GPU machine | Manager | AMD Ryzen 7 2700X, 64GB DDR4, AMD RX 9070 XT (16GB VRAM) | Model inference, Ollama, D-Bus/keyring |
| Server | Worker | 32GB DDR3, CPU only | Orchestration, memory, audit, search |

### Software

Both machines need:

- Docker Engine 24.0+ with Swarm support
- Docker Compose v2 (for local testing)
- Open ports between nodes: TCP 2377 (cluster management), TCP/UDP 7946 (node communication), UDP 4789 (overlay traffic)

The GPU machine additionally needs:

- Ollama installed and running on the host (not in Docker)
- D-Bus session bus accessible (for GNOME Keyring / KeePassXC)

### Network

- Both machines must be on the same network (or have routable IPs)
- Firewall rules must allow the Docker Swarm ports listed above
- DNS or static IPs for node addressing
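
With ufw, for example, the rules above might look like this (an illustrative sketch; substitute your firewall tooling, and note that restricting to the peer node's IP is optional but tighter):

```bash
# Docker Swarm ports, allowed only from the other node (assumes ufw)
sudo ufw allow from <other-node-ip> to any port 2377 proto tcp   # cluster management
sudo ufw allow from <other-node-ip> to any port 7946 proto tcp   # node communication
sudo ufw allow from <other-node-ip> to any port 7946 proto udp   # node communication
sudo ufw allow from <other-node-ip> to any port 4789 proto udp   # overlay (VXLAN) traffic
```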

## Deployment Steps

### 1. Install Docker on Both Machines

```bash
# Debian/Ubuntu
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
# Log out and back in for the group membership to take effect
```

### 2. Install Ollama on the GPU Machine

```bash
curl -fsSL https://ollama.com/install.sh | sh
ollama pull mistral            # or your preferred model
ollama pull nomic-embed-text   # for embeddings
```

Verify Ollama is running:

```bash
curl http://localhost:11434/api/tags
```

### 3. Build and Push Images

On the build machine (or CI):

```bash
# Build all images
docker compose -f docker/docker-compose.yml build

# Tag and push for your registry
REGISTRY=your-registry.example.com/
for svc in audit secrets memory model-gateway tool-broker search orchestrator; do
  docker tag "docker-${svc}" "${REGISTRY}llm-multiverse/${svc}:latest"
  docker push "${REGISTRY}llm-multiverse/${svc}:latest"
done
```

Alternatively, build on each node, or use `docker save`/`docker load` for
air-gapped environments.
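
The save/load path can be sketched like this (hypothetical paths and hostnames; repeat per service):

```bash
# On the build machine
docker save docker-orchestrator -o orchestrator.tar
scp orchestrator.tar <node>:/tmp/

# On the target node
docker load -i /tmp/orchestrator.tar
```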

### 4. Initialize Docker Swarm

On the GPU machine (manager):

```bash
bash docker/scripts/swarm-init.sh init
```

This initializes the swarm and creates the encrypted overlay network
(`llm-internal`). Note the join command printed in the output.
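
For reference, the script is roughly equivalent to the following raw commands (a sketch; check `swarm-init.sh` for the exact flags it uses):

```bash
docker swarm init --advertise-addr <manager-ip>
docker network create --driver overlay --opt encrypted --attachable llm-internal
```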

### 5. Join Worker Node

On the server machine:

```bash
bash docker/scripts/swarm-init.sh join <manager-ip> <join-token>
```

To retrieve the join token later (run on the manager):

```bash
bash docker/scripts/swarm-init.sh token
```

### 6. Label Nodes

On the manager:

```bash
# Label the GPU machine
bash docker/scripts/label-nodes.sh gpu $(hostname)

# Label the server (use its hostname as shown in `docker node ls`)
bash docker/scripts/label-nodes.sh server <server-hostname>

# Verify labels
bash docker/scripts/label-nodes.sh show
```
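
Under the hood this amounts to setting Swarm node labels, roughly as follows (a sketch; the actual label key is defined in `label-nodes.sh`, so verify it there):

```bash
docker node update --label-add role=gpu <gpu-hostname>
docker node update --label-add role=server <server-hostname>

# Inspect labels on every node
docker node ls -q | xargs docker node inspect \
  --format '{{ .Description.Hostname }}: {{ .Spec.Labels }}'
```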

### 7. Deploy the Stack

```bash
# Set your registry prefix (if using a registry)
export REGISTRY=your-registry.example.com/
export IMAGE_TAG=latest

# Deploy
docker stack deploy -c docker/docker-stack.yml llm
```

### 8. Verify Deployment

```bash
# Check all services are running
docker stack services llm

# Check service placement
docker stack ps llm

# Verify swarm and network
bash docker/scripts/swarm-init.sh verify
```

Wait for all services to reach `Running` state, then test connectivity:

```bash
# From the manager node
bash docker/scripts/verify-connectivity.sh
```
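
Reaching `Running` can take a minute while images pull; a small retry wrapper like the following (a sketch, not something the repo ships) keeps scripts from racing ahead:

```bash
# Retry a command until it succeeds or the attempt budget runs out.
retry_until() {
  # usage: retry_until <max_tries> <command...>
  max=$1; shift
  i=0
  until "$@"; do
    i=$((i + 1))
    if [ "$i" -ge "$max" ]; then
      return 1
    fi
    sleep "${RETRY_SLEEP:-5}"   # interval between attempts, default 5s
  done
  return 0
}

# Example (hypothetical): retry the connectivity check until it passes
# retry_until 20 bash docker/scripts/verify-connectivity.sh
```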

### 9. Validate Zero Code Changes

Confirm that no service source code was modified for multi-machine deployment:

```bash
bash docker/scripts/validate-zero-changes.sh
```

## Service Placement

Services are placed on nodes based on hardware requirements:

| Service | Node | Reason |
|---|---|---|
| Model Gateway | GPU machine | Needs Ollama via `host.docker.internal` |
| Secrets | GPU machine | Needs D-Bus socket for host keyring |
| Audit | Server | Persistent storage for append-only logs |
| Memory | Server | Persistent storage for DuckDB |
| Orchestrator | Any | CPU-only coordination |
| Tool Broker | Any | CPU-only enforcement |
| Search | Any | CPU-only, I/O-bound |
| SearXNG | Any | CPU-only |
| Caddy | Any | Edge proxy |

See [SERVICE_TOPOLOGY.md](SERVICE_TOPOLOGY.md) for the full connection matrix.

## Configuration

### Environment Variables

Set these before running `docker stack deploy`:

| Variable | Default | Purpose |
|---|---|---|
| `REGISTRY` | _(empty)_ | Image registry prefix (e.g., `registry.example.com/`) |
| `IMAGE_TAG` | `latest` | Image tag for all services |
| `DOMAIN` | `localhost` | Caddy domain (a real domain enables Let's Encrypt) |
| `TLS_MODE` | `internal` | Caddy TLS mode (`internal` = self-signed) |
| `HTTPS_PORT` | `443` | Host HTTPS port |
| `HTTP_PORT` | `80` | Host HTTP port |
| `DBUS_SESSION_SOCKET` | `/run/user/1000/bus` | Host D-Bus session socket |
| `SEARXNG_SECRET` | `dev-secret-...` | SearXNG instance secret |
| `*_REPLICAS` | `1` | Per-service replica count |
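
A typical pre-deploy environment block might look like this (hypothetical registry and domain values; `openssl` is used here only to generate a random secret):

```bash
# Hypothetical values; adjust registry, domain, and TLS mode for your setup
export REGISTRY=registry.example.com/
export IMAGE_TAG=latest
export DOMAIN=llm.example.com
export TLS_MODE=internal
export DBUS_SESSION_SOCKET="/run/user/$(id -u)/bus"
export SEARXNG_SECRET="$(openssl rand -hex 32)"   # never keep the dev default
```

Then deploy with `docker stack deploy -c docker/docker-stack.yml llm` as in step 7.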

### Scaling Services

Unconstrained services can be scaled:

```bash
# Scale the orchestrator to 2 replicas
docker service scale llm_orchestrator=2

# Or set via environment before deploying
export ORCHESTRATOR_REPLICAS=2
docker stack deploy -c docker/docker-stack.yml llm
```

Services with placement constraints (model-gateway, secrets, audit, memory)
should remain at 1 replica unless you configure shared storage.

## Monitoring and Logs

### View Service Logs

```bash
# All services
docker stack services llm
docker service logs llm_orchestrator --follow

# A specific service
docker service logs llm_model-gateway --tail 100
```

### Aggregate Logs

For production, consider adding a log driver to the stack:

```yaml
# In docker-stack.yml, add to each service:
logging:
  driver: json-file
  options:
    max-size: "10m"
    max-file: "3"
```

Or use a centralized logging stack (Loki + Promtail, ELK, etc.).

### Health Monitoring

- Caddy health: `curl -sk https://<domain>/healthz`
- SearXNG health: check via `docker service logs llm_searxng`
- All services have Docker health checks — check with `docker stack ps llm`

## Security Considerations

### Encrypted Overlay Network

All inter-node traffic is encrypted via IPsec (Docker Swarm's `--opt encrypted`).
This replaces mTLS between services — Docker handles key exchange automatically.

Verify encryption is enabled:

```bash
docker network inspect llm-internal --format '{{.Options}}'
# Should show: map[encrypted:]
```

### D-Bus Socket Exposure

Only the Secrets service container has access to the host D-Bus socket.
This is necessary for GNOME Keyring / KeePassXC integration, and the mount
is read-only.

If the D-Bus socket is unavailable, the Secrets service falls back to
the Linux kernel keyring.
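
A quick way to check which path applies on a given host (the default socket path here matches the `DBUS_SESSION_SOCKET` default above):

```bash
# Check whether the D-Bus session socket exists on this host
sock="${DBUS_SESSION_SOCKET:-/run/user/$(id -u)/bus}"
if [ -S "$sock" ]; then
  echo "D-Bus session socket present: $sock (keyring backend available)"
else
  echo "No socket at $sock; Secrets will use the kernel keyring fallback"
fi
```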

### External Access

- Only Caddy is exposed externally (ports 80/443)
- All internal services live on the overlay network only
- Caddy terminates TLS at the edge; internal traffic uses h2c (HTTP/2 cleartext)
- For production, set `DOMAIN` to your real domain for automatic Let's Encrypt

### Secrets Management

- Never commit secrets to the repository
- Set the `SEARXNG_SECRET` environment variable (change it from the default in production)
- Service API keys are managed by the Secrets service via the host keyring

## Troubleshooting

### Services Not Starting

```bash
# Check service status
docker stack ps llm --no-trunc

# Common issues:
# - "no suitable node" → check that node labels match placement constraints
# - "image not found" → ensure images are pushed to the registry or loaded on all nodes
# - "port already in use" → another service is using port 443/80
```

### Node Labels Missing

```bash
# Check labels
bash docker/scripts/label-nodes.sh show

# Re-apply if needed
bash docker/scripts/label-nodes.sh gpu <node>
bash docker/scripts/label-nodes.sh server <node>
```

### Overlay Network Issues

```bash
# Verify the network exists and is encrypted
docker network inspect llm-internal

# If missing, recreate it
bash docker/scripts/swarm-init.sh network
```

### Ollama Not Reachable

The Model Gateway connects to Ollama via `host.docker.internal:11434`.
This requires:

1. Ollama is running on the GPU host: `systemctl status ollama`
2. Ollama is listening on an address the container can reach (a localhost-only bind may not be)
3. The Model Gateway is scheduled on the GPU node (check placement)

```bash
# Verify from inside the container
docker exec $(docker ps -q -f name=llm_model-gateway) \
  wget -qO- http://host.docker.internal:11434/api/tags
```
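
If Ollama is bound to `127.0.0.1` only, requests arriving via `host.docker.internal` may be refused, because they reach the host on its Docker bridge address. One common remedy (assuming a systemd install of Ollama) is a drop-in that widens the bind address:

```ini
# /etc/systemd/system/ollama.service.d/override.conf
# (create via `sudo systemctl edit ollama`, then `sudo systemctl restart ollama`)
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
```

If you do this, firewall port 11434 so only Docker's networks can reach it.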

### Cross-Node Communication Fails

1. Check that firewall rules allow the Swarm ports (2377, 7946, 4789)
2. Check the overlay network is encrypted: `docker network inspect llm-internal`
3. Run the connectivity verification: `bash docker/scripts/verify-connectivity.sh`

### Removing the Stack

```bash
# Remove all services
docker stack rm llm

# Leave the swarm (on the worker)
bash docker/scripts/swarm-init.sh leave

# Leave the swarm (on the manager; this destroys the swarm)
bash docker/scripts/swarm-init.sh leave
```

## Single-Machine Fallback

To run on a single machine without Swarm (development/testing):

```bash
docker compose -f docker/docker-compose.yml build
docker compose -f docker/docker-compose.yml up -d
bash docker/scripts/verify-connectivity.sh
```

This uses the bridge network instead of the overlay and does not require
swarm initialization or node labeling.

@@ -102,6 +102,7 @@
| #96 | Define service placement constraints | Phase 12 | `COMPLETED` | Shell / Markdown | [issue-096.md](issue-096.md) |
| #97 | Convert docker-compose.yml to Swarm stack | Phase 12 | `COMPLETED` | Docker / YAML | [issue-097.md](issue-097.md) |
| #98 | Validate zero service code changes | Phase 12 | `COMPLETED` | Shell | [issue-098.md](issue-098.md) |
| #99 | Document multi-machine deployment guide | Phase 12 | `COMPLETED` | Markdown | [issue-099.md](issue-099.md) |

## Status Legend

38
implementation-plans/issue-099.md
Normal file
@@ -0,0 +1,38 @@

# Issue #99: Document multi-machine deployment guide

## Metadata

| Field | Value |
|---|---|
| Issue | #99 |
| Title | Document multi-machine deployment guide |
| Milestone | Phase 12: Multi-Machine Extension |
| Status | `COMPLETED` |
| Language | Markdown |
| Related Plans | issue-095.md, issue-096.md, issue-097.md, issue-098.md |
| Blocked by | #98 |

## Acceptance Criteria

- [x] Prerequisites: hardware requirements, OS setup, Docker installation
- [x] Swarm initialization walkthrough
- [x] Node labeling guide
- [x] Stack deployment step-by-step
- [x] Monitoring and log aggregation setup
- [x] Troubleshooting common issues
- [x] Network topology diagram
- [x] Security considerations (encrypted overlay, D-Bus exposure, secrets)

## Files Created/Modified

| File | Action | Purpose |
|---|---|---|
| `docker/DEPLOYMENT.md` | Create | Comprehensive multi-machine deployment guide |
| `implementation-plans/issue-099.md` | Create | Plan |
| `implementation-plans/_index.md` | Modify | Index entry |

## Deviation Log

| Deviation | Reason |
|---|---|
| None | — |