Compare commits

...

2 Commits

Author SHA1 Message Date
a578fa3c5b Merge pull request 'feat: implement SearXNG query client + snippet filter (#46)' (#144) from feature/issue-46-searxng-client into main 2026-03-10 15:31:34 +01:00
Pi Agent
c1aff33eb1 feat: implement SearXNG query client with snippet filter (issue #46)
- SearXNGClient: async HTTP client wrapping SearXNG JSON API
- Query param construction (categories, engines, language)
- Response parsing: extract title, URL, snippet, engine, score
- URL-based deduplication keeping highest-scoring entry
- HTML tag stripping and entity decoding for clean text
- Configurable max_results with per-call override
- 14 unit tests with aioresponses mocking
- Added aiohttp and aioresponses dependencies

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 15:31:18 +01:00
5 changed files with 496 additions and 0 deletions

View File

@@ -49,6 +49,7 @@
| #43 | Integration tests for Model Gateway | Phase 5 | `COMPLETED` | Rust | [issue-043.md](issue-043.md) |
| #44 | Set up SearXNG Docker container | Phase 6 | `COMPLETED` | Docker / YAML | [issue-044.md](issue-044.md) |
| #45 | Scaffold Search Service Python project | Phase 6 | `COMPLETED` | Python | [issue-045.md](issue-045.md) |
| #46 | Implement SearXNG query + snippet filter | Phase 6 | `COMPLETED` | Python | [issue-046.md](issue-046.md) |
## Status Legend

View File

@@ -0,0 +1,144 @@
# Implementation Plan — Issue #46: Implement SearXNG query + snippet filter
## Metadata
| Field | Value |
|---|---|
| Issue | [#46](https://git.shahondin1624.de/llm-multiverse/llm-multiverse/issues/46) |
| Title | Implement SearXNG query + snippet filter |
| Milestone | Phase 6: Search Service |
| Labels | — |
| Status | `COMPLETED` |
| Language | Python |
| Related Plans | issue-044.md, issue-045.md |
| Blocked by | #44, #45 |
## Acceptance Criteria
- [ ] HTTP client for SearXNG JSON API
- [ ] Query parameter construction (categories, engines, language)
- [ ] Response parsing: extract title, URL, snippet, engine source
- [ ] Snippet deduplication across engines
- [ ] Configurable max results and engine selection
## Architecture Analysis
### Service Context
- **Service:** Search Service (`services/search/`)
- **Component:** SearXNG client module — HTTP client wrapping the SearXNG JSON API
- SearXNG runs at `config.searxng_url` (default `http://localhost:8888`)
- This module is called by the Search gRPC endpoint (implemented in #49)
### Existing Patterns
- **Config:** `Config.searxng_url` already available
- **HTTP client:** Use `aiohttp` for async HTTP requests (consistent with async gRPC server)
- **SearXNG JSON API:** `GET /search?q=<query>&format=json&categories=<cat>&engines=<eng>&language=<lang>`
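The query pattern above can be sketched by building the request URL with the standard library (hypothetical query values; the base URL is the `Config.searxng_url` default):

```python
from urllib.parse import urlencode

base_url = "http://localhost:8888"  # default Config.searxng_url per this plan

# Hypothetical example parameters; lists are comma-joined before encoding
params = {
    "q": "rust programming",
    "format": "json",
    "categories": ",".join(["general", "it"]),
    "engines": ",".join(["google", "duckduckgo"]),
    "language": "en",
}
url = f"{base_url}/search?{urlencode(params)}"
print(url)
# → http://localhost:8888/search?q=rust+programming&format=json&categories=general%2Cit&engines=google%2Cduckduckgo&language=en
```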
### Dependencies
- `aiohttp` — async HTTP client (add to pyproject.toml)
- SearXNG JSON API response format
## Implementation Steps
### 1. Add `aiohttp` dependency
Add to `services/search/pyproject.toml`:
```toml
"aiohttp>=3.10",
```
### 2. Create `searxng.py` module
`services/search/src/search_service/searxng.py`:
**Dataclasses:**
```python
@dataclass
class SearchSnippet:
    title: str
    url: str
    snippet: str
    engine: str
    score: float  # SearXNG relevance score (higher is better; may exceed 1.0)
```
**SearXNGClient class:**
```python
class SearXNGClient:
    def __init__(self, base_url: str, max_results: int = 10, language: str = "en"):
        self.base_url = base_url.rstrip("/")
        self.max_results = max_results
        self.language = language

    async def search(
        self,
        query: str,
        categories: list[str] | None = None,
        engines: list[str] | None = None,
        num_results: int | None = None,
    ) -> list[SearchSnippet]:
        ...
```
**search() implementation:**
1. Build query params: `q`, `format=json`, `categories` (comma-separated), `engines` (comma-separated), `language`
2. Send GET request to `{base_url}/search`
3. Parse JSON response — extract `results` array
4. Map each result to `SearchSnippet`:
   - `title` = result["title"]
   - `url` = result["url"]
   - `snippet` = result["content"] (SearXNG uses "content" field)
   - `engine` = result["engine"] (or first engine if multiple)
   - `score` = result.get("score", 0.0)
5. Deduplicate by URL (keep highest-scoring entry per URL)
6. Sort by score descending
7. Truncate to `num_results` (or `self.max_results`)
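Steps 4 through 7 can be sketched end to end on hypothetical raw results (field names follow the SearXNG response format described above):

```python
from dataclasses import dataclass

@dataclass
class SearchSnippet:
    title: str
    url: str
    snippet: str
    engine: str
    score: float

# Hypothetical entries as they would appear in the "results" array
raw = [
    {"title": "A", "url": "https://a.com", "content": "first", "engine": "google", "score": 1.0},
    {"title": "A dup", "url": "https://a.com", "content": "second", "engine": "bing", "score": 2.0},
    {"title": "B", "url": "https://b.com", "content": "third", "engine": "google", "score": 0.5},
]

# Step 4: map raw dicts to SearchSnippet
snippets = [
    SearchSnippet(r["title"], r["url"], r.get("content", ""), r["engine"], r.get("score", 0.0))
    for r in raw
]

# Step 5: deduplicate by URL, keeping the highest-scoring entry
best: dict[str, SearchSnippet] = {}
for s in snippets:
    if s.url not in best or s.score > best[s.url].score:
        best[s.url] = s

# Steps 6-7: sort by score descending, truncate to max_results
result = sorted(best.values(), key=lambda s: s.score, reverse=True)[:10]
print([(s.url, s.score) for s in result])
# → [('https://a.com', 2.0), ('https://b.com', 0.5)]
```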
**Deduplication logic:**
```python
def _deduplicate(snippets: list[SearchSnippet]) -> list[SearchSnippet]:
    seen: dict[str, SearchSnippet] = {}
    for s in snippets:
        if s.url not in seen or s.score > seen[s.url].score:
            seen[s.url] = s
    return sorted(seen.values(), key=lambda s: s.score, reverse=True)
```
### 3. Tests
`services/search/tests/test_searxng.py`:
| Test | Description |
|---|---|
| `test_search_success` | Mock SearXNG response → verify SearchSnippet list |
| `test_search_with_categories` | Verify categories param in request URL |
| `test_search_with_engines` | Verify engines param in request URL |
| `test_deduplication` | Duplicate URLs → keep highest score |
| `test_max_results` | 20 results, max_results=5 → returns 5 |
| `test_empty_results` | Empty results array → returns [] |
| `test_searxng_error` | Non-200 response → raises exception |
| `test_missing_content_field` | Result without "content" → uses empty string |
Use `aiohttp` test utilities or `aioresponses` for mocking HTTP responses.
## Files to Create/Modify
| File | Action | Purpose |
|---|---|---|
| `services/search/src/search_service/searxng.py` | Create | SearXNG HTTP client with deduplication |
| `services/search/pyproject.toml` | Modify | Add aiohttp dependency |
| `services/search/tests/test_searxng.py` | Create | Unit tests with mocked HTTP |
## Risks and Edge Cases
- **SearXNG response format changes:** Pin to the specific SearXNG version from #44. The JSON format is stable.
- **Timeout on slow search engines:** aiohttp timeout handles this. SearXNG itself has engine-level timeouts.
- **HTML in snippet content:** SearXNG may return HTML fragments in the "content" field. Strip HTML tags for clean text.
- **Unicode handling:** Query and results may contain non-ASCII characters. aiohttp handles this natively.
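The HTML-stripping mitigation can be sketched with a regex plus stdlib entity decoding (a minimal sketch; the `strip_html` name here is illustrative):

```python
import html
import re

_HTML_TAG_RE = re.compile(r"<[^>]+>")

def strip_html(text: str) -> str:
    """Remove tags first, then decode entities such as &amp; and &lt;."""
    return html.unescape(_HTML_TAG_RE.sub("", text))

print(strip_html("<em>fast</em> &amp; <b>relevant</b>"))
# → fast & relevant
```

Stripping tags before unescaping matters: decoding first could turn `&lt;b&gt;` into a literal `<b>` that the regex would then remove, losing text the engine meant to display.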
## Deviation Log
_(Filled during implementation if deviations from plan occur)_
| Deviation | Reason |
|---|---|

View File

@@ -8,6 +8,7 @@ dependencies = [
"grpcio>=1.69",
"protobuf>=7.34",
"pyyaml>=6.0",
"aiohttp>=3.10",
]
[project.scripts]
@@ -29,5 +30,6 @@ testpaths = ["tests"]
dev = [
"pytest>=8.0",
"pytest-asyncio>=0.24",
"aioresponses>=0.7",
"ruff>=0.8",
]

View File

@@ -0,0 +1,133 @@
"""SearXNG HTTP client for querying the meta-search engine."""

from __future__ import annotations

import html
import logging
import re
from dataclasses import dataclass

import aiohttp

logger = logging.getLogger(__name__)


@dataclass
class SearchSnippet:
    """A single search result snippet from SearXNG."""

    title: str
    url: str
    snippet: str
    engine: str
    score: float


class SearXNGError(Exception):
    """Raised when SearXNG returns a non-200 response."""

    def __init__(self, status: int, message: str) -> None:
        self.status = status
        self.message = message
        super().__init__(f"SearXNG error ({status}): {message}")


_HTML_TAG_RE = re.compile(r"<[^>]+>")


def _strip_html(text: str) -> str:
    """Remove HTML tags and decode entities."""
    return html.unescape(_HTML_TAG_RE.sub("", text))


def _deduplicate(snippets: list[SearchSnippet]) -> list[SearchSnippet]:
    """Deduplicate snippets by URL, keeping the highest-scoring entry."""
    seen: dict[str, SearchSnippet] = {}
    for s in snippets:
        if s.url not in seen or s.score > seen[s.url].score:
            seen[s.url] = s
    return sorted(seen.values(), key=lambda s: s.score, reverse=True)


class SearXNGClient:
    """Async HTTP client for the SearXNG JSON API."""

    def __init__(
        self,
        base_url: str,
        max_results: int = 10,
        language: str = "en",
        timeout: float = 15.0,
    ) -> None:
        self.base_url = base_url.rstrip("/")
        self.max_results = max_results
        self.language = language
        self.timeout = aiohttp.ClientTimeout(total=timeout)

    async def search(
        self,
        query: str,
        categories: list[str] | None = None,
        engines: list[str] | None = None,
        num_results: int | None = None,
    ) -> list[SearchSnippet]:
        """Query SearXNG and return deduplicated, scored snippets.

        Args:
            query: Search query string.
            categories: SearXNG categories (e.g., ["general", "it"]).
            engines: Specific engines to query (e.g., ["google", "duckduckgo"]).
            num_results: Max results to return (overrides self.max_results).

        Returns:
            List of SearchSnippet sorted by score descending.

        Raises:
            SearXNGError: If SearXNG returns a non-200 response.
        """
        params: dict[str, str] = {
            "q": query,
            "format": "json",
            "language": self.language,
        }
        if categories:
            params["categories"] = ",".join(categories)
        if engines:
            params["engines"] = ",".join(engines)

        url = f"{self.base_url}/search"
        async with aiohttp.ClientSession(timeout=self.timeout) as session:
            async with session.get(url, params=params) as resp:
                if resp.status != 200:
                    body = await resp.text()
                    raise SearXNGError(resp.status, body[:500])
                data = await resp.json()

        results = data.get("results", [])
        snippets = []
        for r in results:
            title = _strip_html(r.get("title", ""))
            result_url = r.get("url", "")
            content = _strip_html(r.get("content", ""))
            engine = r.get("engine", "unknown")
            if isinstance(r.get("engines"), list) and r["engines"]:
                engine = r["engines"][0]
            score = float(r.get("score", 0.0))
            if not result_url:
                continue
            snippets.append(
                SearchSnippet(
                    title=title,
                    url=result_url,
                    snippet=content,
                    engine=engine,
                    score=score,
                )
            )

        deduped = _deduplicate(snippets)
        limit = num_results or self.max_results
        return deduped[:limit]

View File

@@ -0,0 +1,216 @@
"""Tests for the SearXNG client."""

from __future__ import annotations

import re

import pytest
from aioresponses import aioresponses

from search_service.searxng import (
    SearchSnippet,
    SearXNGClient,
    SearXNGError,
    _deduplicate,
    _strip_html,
)

BASE_URL = "http://searxng:8080"
SEARCH_PATTERN = re.compile(r"^http://searxng:8080/search\?.*$")


def make_result(
    title: str = "Test",
    url: str = "https://example.com",
    content: str = "snippet text",
    engine: str = "google",
    score: float = 1.0,
    engines: list[str] | None = None,
) -> dict:
    r: dict = {
        "title": title,
        "url": url,
        "content": content,
        "engine": engine,
        "score": score,
    }
    if engines is not None:
        r["engines"] = engines
    return r


def searxng_response(results: list[dict]) -> dict:
    return {"results": results}


@pytest.fixture
def client() -> SearXNGClient:
    return SearXNGClient(BASE_URL, max_results=10)


@pytest.mark.asyncio
async def test_search_success(client: SearXNGClient) -> None:
    results = [
        make_result(title="Rust Lang", url="https://rust-lang.org", content="Systems programming", score=2.0),
        make_result(title="Crates", url="https://crates.io", content="Package registry", score=1.5),
    ]
    with aioresponses() as m:
        m.get(SEARCH_PATTERN, payload=searxng_response(results))
        snippets = await client.search("rust programming")
    assert len(snippets) == 2
    assert snippets[0].title == "Rust Lang"
    assert snippets[0].url == "https://rust-lang.org"
    assert snippets[0].snippet == "Systems programming"
    assert snippets[0].score == 2.0
    assert snippets[1].title == "Crates"


@pytest.mark.asyncio
async def test_search_with_categories(client: SearXNGClient) -> None:
    with aioresponses() as m:
        m.get(SEARCH_PATTERN, payload=searxng_response([]))
        await client.search("test", categories=["general", "it"])
    # Verify the request included categories param
    assert len(m.requests) == 1
    req_url = str(list(m.requests.keys())[0][1])
    assert "categories=" in req_url


@pytest.mark.asyncio
async def test_search_with_engines(client: SearXNGClient) -> None:
    with aioresponses() as m:
        m.get(SEARCH_PATTERN, payload=searxng_response([]))
        await client.search("test", engines=["google", "duckduckgo"])
    assert len(m.requests) == 1
    req_url = str(list(m.requests.keys())[0][1])
    assert "engines=" in req_url


@pytest.mark.asyncio
async def test_deduplication() -> None:
    snippets = [
        SearchSnippet("A", "https://a.com", "text a", "google", 1.0),
        SearchSnippet("A dup", "https://a.com", "text a dup", "bing", 2.0),
        SearchSnippet("B", "https://b.com", "text b", "google", 0.5),
    ]
    deduped = _deduplicate(snippets)
    assert len(deduped) == 2
    a_result = next(s for s in deduped if s.url == "https://a.com")
    assert a_result.score == 2.0
    assert a_result.title == "A dup"
    assert deduped[0].score >= deduped[1].score


@pytest.mark.asyncio
async def test_max_results(client: SearXNGClient) -> None:
    results = [
        make_result(url=f"https://example.com/{i}", score=float(20 - i))
        for i in range(20)
    ]
    with aioresponses() as m:
        m.get(SEARCH_PATTERN, payload=searxng_response(results))
        snippets = await client.search("test")
    assert len(snippets) == 10


@pytest.mark.asyncio
async def test_num_results_override(client: SearXNGClient) -> None:
    results = [
        make_result(url=f"https://example.com/{i}", score=float(20 - i))
        for i in range(20)
    ]
    with aioresponses() as m:
        m.get(SEARCH_PATTERN, payload=searxng_response(results))
        snippets = await client.search("test", num_results=5)
    assert len(snippets) == 5


@pytest.mark.asyncio
async def test_empty_results(client: SearXNGClient) -> None:
    with aioresponses() as m:
        m.get(SEARCH_PATTERN, payload=searxng_response([]))
        snippets = await client.search("nothing")
    assert snippets == []


@pytest.mark.asyncio
async def test_searxng_error(client: SearXNGClient) -> None:
    with aioresponses() as m:
        m.get(SEARCH_PATTERN, status=500, body="Internal Error")
        with pytest.raises(SearXNGError) as exc_info:
            await client.search("test")
    assert exc_info.value.status == 500


@pytest.mark.asyncio
async def test_missing_content_field(client: SearXNGClient) -> None:
    results = [{"title": "No Content", "url": "https://example.com", "score": 1.0}]
    with aioresponses() as m:
        m.get(SEARCH_PATTERN, payload=searxng_response(results))
        snippets = await client.search("test")
    assert len(snippets) == 1
    assert snippets[0].snippet == ""


@pytest.mark.asyncio
async def test_html_stripping(client: SearXNGClient) -> None:
    results = [
        make_result(
            title="<b>Bold</b> Title",
            content="<em>emphasized</em> &amp; <strong>strong</strong>",
            url="https://example.com",
        )
    ]
    with aioresponses() as m:
        m.get(SEARCH_PATTERN, payload=searxng_response(results))
        snippets = await client.search("test")
    assert snippets[0].title == "Bold Title"
    assert snippets[0].snippet == "emphasized & strong"


def test_strip_html_basic() -> None:
    assert _strip_html("<b>bold</b>") == "bold"
    assert _strip_html("no tags") == "no tags"
    assert _strip_html("&lt;escaped&gt;") == "<escaped>"
    assert _strip_html("") == ""


@pytest.mark.asyncio
async def test_engines_field_preferred(client: SearXNGClient) -> None:
    results = [
        make_result(engine="fallback", engines=["primary_engine"], url="https://a.com")
    ]
    with aioresponses() as m:
        m.get(SEARCH_PATTERN, payload=searxng_response(results))
        snippets = await client.search("test")
    assert snippets[0].engine == "primary_engine"


@pytest.mark.asyncio
async def test_results_without_url_skipped(client: SearXNGClient) -> None:
    results = [
        make_result(url=""),
        make_result(url="https://valid.com"),
    ]
    with aioresponses() as m:
        m.get(SEARCH_PATTERN, payload=searxng_response(results))
        snippets = await client.search("test")
    assert len(snippets) == 1
    assert snippets[0].url == "https://valid.com"


@pytest.mark.asyncio
async def test_trailing_slash_in_base_url() -> None:
    client = SearXNGClient("http://searxng:8080/")
    assert client.base_url == "http://searxng:8080"