Compare commits

2 Commits: 014e2f2d04 ... a578fa3c5b

| Author | SHA1 | Date |
|---|---|---|
| | a578fa3c5b | |
| | c1aff33eb1 | |
@@ -49,6 +49,7 @@
 | #43 | Integration tests for Model Gateway | Phase 5 | `COMPLETED` | Rust | [issue-043.md](issue-043.md) |
 | #44 | Set up SearXNG Docker container | Phase 6 | `COMPLETED` | Docker / YAML | [issue-044.md](issue-044.md) |
 | #45 | Scaffold Search Service Python project | Phase 6 | `COMPLETED` | Python | [issue-045.md](issue-045.md) |
+| #46 | Implement SearXNG query + snippet filter | Phase 6 | `COMPLETED` | Python | [issue-046.md](issue-046.md) |
## Status Legend
implementation-plans/issue-046.md (Normal file, 144 lines)
@@ -0,0 +1,144 @@
# Implementation Plan — Issue #46: Implement SearXNG query + snippet filter

## Metadata

| Field | Value |
|---|---|
| Issue | [#46](https://git.shahondin1624.de/llm-multiverse/llm-multiverse/issues/46) |
| Title | Implement SearXNG query + snippet filter |
| Milestone | Phase 6: Search Service |
| Labels | — |
| Status | `COMPLETED` |
| Language | Python |
| Related Plans | issue-044.md, issue-045.md |
| Blocked by | #44, #45 |

## Acceptance Criteria

- [ ] HTTP client for SearXNG JSON API
- [ ] Query parameter construction (categories, engines, language)
- [ ] Response parsing: extract title, URL, snippet, engine source
- [ ] Snippet deduplication across engines
- [ ] Configurable max results and engine selection
## Architecture Analysis

### Service Context

- **Service:** Search Service (`services/search/`)
- **Component:** SearXNG client module — HTTP client wrapping the SearXNG JSON API
- SearXNG runs at `config.searxng_url` (default `http://localhost:8888`)
- This module is called by the Search gRPC endpoint (implemented in #49)

### Existing Patterns

- **Config:** `Config.searxng_url` already available
- **HTTP client:** Use `aiohttp` for async HTTP requests (consistent with async gRPC server)
- **SearXNG JSON API:** `GET /search?q=<query>&format=json&categories=<cat>&engines=<eng>&language=<lang>`
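The API pattern above can be illustrated with a short sketch of the request URL and of the response shape this plan relies on. The concrete values here are made-up examples; the field names match the ones the parsing steps in this plan read.

```python
from urllib.parse import urlencode

# Build the query string for GET {base_url}/search (illustrative values).
params = {
    "q": "rust programming",
    "format": "json",
    "language": "en",
    "categories": "general,it",      # comma-separated
    "engines": "google,duckduckgo",  # comma-separated
}
url = "http://localhost:8888/search?" + urlencode(params)

# Shape of the JSON body SearXNG returns: the client reads the "results"
# array and each entry's "title", "url", "content", "engine"/"engines",
# and "score" fields.
sample_response = {
    "results": [
        {
            "title": "Rust Programming Language",
            "url": "https://rust-lang.org",
            "content": "A language empowering everyone ...",
            "engine": "duckduckgo",
            "score": 2.0,
        }
    ]
}
```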

### Dependencies

- `aiohttp` — async HTTP client (add to pyproject.toml)
- SearXNG JSON API response format
## Implementation Steps

### 1. Add `aiohttp` dependency

Add to `services/search/pyproject.toml`:

```toml
"aiohttp>=3.10",
```

### 2. Create `searxng.py` module

`services/search/src/search_service/searxng.py`:

**Dataclasses:**

```python
@dataclass
class SearchSnippet:
    title: str
    url: str
    snippet: str
    engine: str
    score: float  # SearXNG relevance score
```

**SearXNGClient class:**

```python
class SearXNGClient:
    def __init__(self, base_url: str, max_results: int = 10, language: str = "en"):
        self.base_url = base_url.rstrip("/")
        self.max_results = max_results
        self.language = language

    async def search(
        self,
        query: str,
        categories: list[str] | None = None,
        engines: list[str] | None = None,
        num_results: int | None = None,
    ) -> list[SearchSnippet]:
        ...
```

**search() implementation:**

1. Build query params: `q`, `format=json`, `categories` (comma-separated), `engines` (comma-separated), `language`
2. Send GET request to `{base_url}/search`
3. Parse JSON response — extract the `results` array
4. Map each result to `SearchSnippet`:
   - `title` = `result["title"]`
   - `url` = `result["url"]`
   - `snippet` = `result["content"]` (SearXNG uses the "content" field)
   - `engine` = `result["engine"]` (or the first entry of `result["engines"]` if present)
   - `score` = `result.get("score", 0.0)`
5. Deduplicate by URL (keep highest-scoring entry per URL)
6. Sort by score descending
7. Truncate to `num_results` (or `self.max_results`)
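The numbered steps can be condensed into a sketch of the parse/deduplicate/truncate pipeline. This is illustrative only: it reuses the `SearchSnippet` shape above, takes the already-decoded JSON dict as input, and leaves the aiohttp request out so it stays self-contained.

```python
from dataclasses import dataclass


@dataclass
class SearchSnippet:
    title: str
    url: str
    snippet: str
    engine: str
    score: float


def map_results(raw: dict, max_results: int) -> list[SearchSnippet]:
    """Steps 3-7: parse results, map fields, dedupe by URL, sort, truncate."""
    snippets = [
        SearchSnippet(
            title=r.get("title", ""),
            url=r.get("url", ""),
            snippet=r.get("content", ""),  # SearXNG's "content" field
            engine=r.get("engine", "unknown"),
            score=float(r.get("score", 0.0)),
        )
        for r in raw.get("results", [])
        if r.get("url")  # skip results without a URL
    ]
    best: dict[str, SearchSnippet] = {}
    for s in snippets:  # step 5: keep the highest-scoring entry per URL
        if s.url not in best or s.score > best[s.url].score:
            best[s.url] = s
    ranked = sorted(best.values(), key=lambda s: s.score, reverse=True)
    return ranked[:max_results]  # steps 6-7
```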

**Deduplication logic:**

```python
def _deduplicate(snippets: list[SearchSnippet]) -> list[SearchSnippet]:
    seen: dict[str, SearchSnippet] = {}
    for s in snippets:
        if s.url not in seen or s.score > seen[s.url].score:
            seen[s.url] = s
    return sorted(seen.values(), key=lambda s: s.score, reverse=True)
```
### 3. Tests

`services/search/tests/test_searxng.py`:

| Test | Description |
|---|---|
| `test_search_success` | Mock SearXNG response → verify SearchSnippet list |
| `test_search_with_categories` | Verify categories param in request URL |
| `test_search_with_engines` | Verify engines param in request URL |
| `test_deduplication` | Duplicate URLs → keep highest score |
| `test_max_results` | 20 results, max_results=5 → returns 5 |
| `test_empty_results` | Empty results array → returns [] |
| `test_searxng_error` | Non-200 response → raises exception |
| `test_missing_content_field` | Result without "content" → uses empty string |

Use `aiohttp` test utilities or `aioresponses` for mocking HTTP responses.

## Files to Create/Modify

| File | Action | Purpose |
|---|---|---|
| `services/search/src/search_service/searxng.py` | Create | SearXNG HTTP client with deduplication |
| `services/search/pyproject.toml` | Modify | Add aiohttp dependency |
| `services/search/tests/test_searxng.py` | Create | Unit tests with mocked HTTP |
## Risks and Edge Cases

- **SearXNG response format changes:** Pin to the specific SearXNG version from #44. The JSON format is stable.
- **Timeout on slow search engines:** An aiohttp client timeout handles this; SearXNG itself has engine-level timeouts.
- **HTML in snippet content:** SearXNG may return HTML fragments in the "content" field. Strip HTML tags for clean text.
- **Unicode handling:** Query and results may contain non-ASCII characters. aiohttp handles this natively.

## Deviation Log

_(Filled during implementation if deviations from plan occur)_

| Deviation | Reason |
|---|---|
@@ -8,6 +8,7 @@ dependencies = [
     "grpcio>=1.69",
     "protobuf>=7.34",
     "pyyaml>=6.0",
+    "aiohttp>=3.10",
 ]

 [project.scripts]
@@ -29,5 +30,6 @@ testpaths = ["tests"]
 dev = [
     "pytest>=8.0",
     "pytest-asyncio>=0.24",
+    "aioresponses>=0.7",
     "ruff>=0.8",
 ]
services/search/src/search_service/searxng.py (Normal file, 133 lines)
@@ -0,0 +1,133 @@
"""SearXNG HTTP client for querying the meta-search engine."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import html
|
||||
import logging
|
||||
import re
|
||||
from dataclasses import dataclass
|
||||
|
||||
import aiohttp
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
@dataclass
|
||||
class SearchSnippet:
|
||||
"""A single search result snippet from SearXNG."""
|
||||
|
||||
title: str
|
||||
url: str
|
||||
snippet: str
|
||||
engine: str
|
||||
score: float
|
||||
|
||||
|
||||
class SearXNGError(Exception):
|
||||
"""Raised when SearXNG returns a non-200 response."""
|
||||
|
||||
def __init__(self, status: int, message: str) -> None:
|
||||
self.status = status
|
||||
self.message = message
|
||||
super().__init__(f"SearXNG error ({status}): {message}")
|
||||
|
||||
|
||||
_HTML_TAG_RE = re.compile(r"<[^>]+>")
|
||||
|
||||
|
||||
def _strip_html(text: str) -> str:
|
||||
"""Remove HTML tags and decode entities."""
|
||||
return html.unescape(_HTML_TAG_RE.sub("", text))
|
||||
|
||||
|
||||
def _deduplicate(snippets: list[SearchSnippet]) -> list[SearchSnippet]:
|
||||
"""Deduplicate snippets by URL, keeping the highest-scoring entry."""
|
||||
seen: dict[str, SearchSnippet] = {}
|
||||
for s in snippets:
|
||||
if s.url not in seen or s.score > seen[s.url].score:
|
||||
seen[s.url] = s
|
||||
return sorted(seen.values(), key=lambda s: s.score, reverse=True)
|
||||
|
||||
|
||||
class SearXNGClient:
|
||||
"""Async HTTP client for the SearXNG JSON API."""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
base_url: str,
|
||||
max_results: int = 10,
|
||||
language: str = "en",
|
||||
timeout: float = 15.0,
|
||||
) -> None:
|
||||
self.base_url = base_url.rstrip("/")
|
||||
self.max_results = max_results
|
||||
self.language = language
|
||||
self.timeout = aiohttp.ClientTimeout(total=timeout)
|
||||
|
||||
async def search(
|
||||
self,
|
||||
query: str,
|
||||
categories: list[str] | None = None,
|
||||
engines: list[str] | None = None,
|
||||
num_results: int | None = None,
|
||||
) -> list[SearchSnippet]:
|
||||
"""Query SearXNG and return deduplicated, scored snippets.
|
||||
|
||||
Args:
|
||||
query: Search query string.
|
||||
categories: SearXNG categories (e.g., ["general", "it"]).
|
||||
engines: Specific engines to query (e.g., ["google", "duckduckgo"]).
|
||||
num_results: Max results to return (overrides self.max_results).
|
||||
|
||||
Returns:
|
||||
List of SearchSnippet sorted by score descending.
|
||||
|
||||
Raises:
|
||||
SearXNGError: If SearXNG returns a non-200 response.
|
||||
"""
|
||||
params: dict[str, str] = {
|
||||
"q": query,
|
||||
"format": "json",
|
||||
"language": self.language,
|
||||
}
|
||||
if categories:
|
||||
params["categories"] = ",".join(categories)
|
||||
if engines:
|
||||
params["engines"] = ",".join(engines)
|
||||
|
||||
url = f"{self.base_url}/search"
|
||||
|
||||
async with aiohttp.ClientSession(timeout=self.timeout) as session:
|
||||
async with session.get(url, params=params) as resp:
|
||||
if resp.status != 200:
|
||||
body = await resp.text()
|
||||
raise SearXNGError(resp.status, body[:500])
|
||||
data = await resp.json()
|
||||
|
||||
results = data.get("results", [])
|
||||
snippets = []
|
||||
for r in results:
|
||||
title = _strip_html(r.get("title", ""))
|
||||
result_url = r.get("url", "")
|
||||
content = _strip_html(r.get("content", ""))
|
||||
engine = r.get("engine", "unknown")
|
||||
if isinstance(r.get("engines"), list) and r["engines"]:
|
||||
engine = r["engines"][0]
|
||||
score = float(r.get("score", 0.0))
|
||||
|
||||
if not result_url:
|
||||
continue
|
||||
|
||||
snippets.append(
|
||||
SearchSnippet(
|
||||
title=title,
|
||||
url=result_url,
|
||||
snippet=content,
|
||||
engine=engine,
|
||||
score=score,
|
||||
)
|
||||
)
|
||||
|
||||
deduped = _deduplicate(snippets)
|
||||
limit = num_results or self.max_results
|
||||
return deduped[:limit]
|
||||
services/search/tests/test_searxng.py (Normal file, 216 lines)
@@ -0,0 +1,216 @@
"""Tests for the SearXNG client."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import re
|
||||
|
||||
import pytest
|
||||
from aioresponses import aioresponses
|
||||
|
||||
from search_service.searxng import (
|
||||
SearchSnippet,
|
||||
SearXNGClient,
|
||||
SearXNGError,
|
||||
_deduplicate,
|
||||
_strip_html,
|
||||
)
|
||||
|
||||
BASE_URL = "http://searxng:8080"
|
||||
SEARCH_PATTERN = re.compile(r"^http://searxng:8080/search\?.*$")
|
||||
|
||||
|
||||
def make_result(
|
||||
title: str = "Test",
|
||||
url: str = "https://example.com",
|
||||
content: str = "snippet text",
|
||||
engine: str = "google",
|
||||
score: float = 1.0,
|
||||
engines: list[str] | None = None,
|
||||
) -> dict:
|
||||
r: dict = {
|
||||
"title": title,
|
||||
"url": url,
|
||||
"content": content,
|
||||
"engine": engine,
|
||||
"score": score,
|
||||
}
|
||||
if engines is not None:
|
||||
r["engines"] = engines
|
||||
return r
|
||||
|
||||
|
||||
def searxng_response(results: list[dict]) -> dict:
|
||||
return {"results": results}
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def client() -> SearXNGClient:
|
||||
return SearXNGClient(BASE_URL, max_results=10)
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_search_success(client: SearXNGClient) -> None:
|
||||
results = [
|
||||
make_result(title="Rust Lang", url="https://rust-lang.org", content="Systems programming", score=2.0),
|
||||
make_result(title="Crates", url="https://crates.io", content="Package registry", score=1.5),
|
||||
]
|
||||
with aioresponses() as m:
|
||||
m.get(SEARCH_PATTERN, payload=searxng_response(results))
|
||||
snippets = await client.search("rust programming")
|
||||
|
||||
assert len(snippets) == 2
|
||||
assert snippets[0].title == "Rust Lang"
|
||||
assert snippets[0].url == "https://rust-lang.org"
|
||||
assert snippets[0].snippet == "Systems programming"
|
||||
assert snippets[0].score == 2.0
|
||||
assert snippets[1].title == "Crates"
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_search_with_categories(client: SearXNGClient) -> None:
|
||||
with aioresponses() as m:
|
||||
m.get(SEARCH_PATTERN, payload=searxng_response([]))
|
||||
await client.search("test", categories=["general", "it"])
|
||||
|
||||
# Verify the request included categories param
|
||||
assert len(m.requests) == 1
|
||||
req_url = str(list(m.requests.keys())[0][1])
|
||||
assert "categories=" in req_url
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_search_with_engines(client: SearXNGClient) -> None:
|
||||
with aioresponses() as m:
|
||||
m.get(SEARCH_PATTERN, payload=searxng_response([]))
|
||||
await client.search("test", engines=["google", "duckduckgo"])
|
||||
|
||||
assert len(m.requests) == 1
|
||||
req_url = str(list(m.requests.keys())[0][1])
|
||||
assert "engines=" in req_url
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_deduplication() -> None:
|
||||
snippets = [
|
||||
SearchSnippet("A", "https://a.com", "text a", "google", 1.0),
|
||||
SearchSnippet("A dup", "https://a.com", "text a dup", "bing", 2.0),
|
||||
SearchSnippet("B", "https://b.com", "text b", "google", 0.5),
|
||||
]
|
||||
deduped = _deduplicate(snippets)
|
||||
assert len(deduped) == 2
|
||||
a_result = next(s for s in deduped if s.url == "https://a.com")
|
||||
assert a_result.score == 2.0
|
||||
assert a_result.title == "A dup"
|
||||
assert deduped[0].score >= deduped[1].score
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_max_results(client: SearXNGClient) -> None:
|
||||
results = [
|
||||
make_result(url=f"https://example.com/{i}", score=float(20 - i))
|
||||
for i in range(20)
|
||||
]
|
||||
with aioresponses() as m:
|
||||
m.get(SEARCH_PATTERN, payload=searxng_response(results))
|
||||
snippets = await client.search("test")
|
||||
|
||||
assert len(snippets) == 10
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_num_results_override(client: SearXNGClient) -> None:
|
||||
results = [
|
||||
make_result(url=f"https://example.com/{i}", score=float(20 - i))
|
||||
for i in range(20)
|
||||
]
|
||||
with aioresponses() as m:
|
||||
m.get(SEARCH_PATTERN, payload=searxng_response(results))
|
||||
snippets = await client.search("test", num_results=5)
|
||||
|
||||
assert len(snippets) == 5
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_empty_results(client: SearXNGClient) -> None:
|
||||
with aioresponses() as m:
|
||||
m.get(SEARCH_PATTERN, payload=searxng_response([]))
|
||||
snippets = await client.search("nothing")
|
||||
|
||||
assert snippets == []
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_searxng_error(client: SearXNGClient) -> None:
|
||||
with aioresponses() as m:
|
||||
m.get(SEARCH_PATTERN, status=500, body="Internal Error")
|
||||
with pytest.raises(SearXNGError) as exc_info:
|
||||
await client.search("test")
|
||||
|
||||
assert exc_info.value.status == 500
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_missing_content_field(client: SearXNGClient) -> None:
|
||||
results = [{"title": "No Content", "url": "https://example.com", "score": 1.0}]
|
||||
with aioresponses() as m:
|
||||
m.get(SEARCH_PATTERN, payload=searxng_response(results))
|
||||
snippets = await client.search("test")
|
||||
|
||||
assert len(snippets) == 1
|
||||
assert snippets[0].snippet == ""
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_html_stripping(client: SearXNGClient) -> None:
|
||||
results = [
|
||||
make_result(
|
||||
title="<b>Bold</b> Title",
|
||||
content="<em>emphasized</em> & <strong>strong</strong>",
|
||||
url="https://example.com",
|
||||
)
|
||||
]
|
||||
with aioresponses() as m:
|
||||
m.get(SEARCH_PATTERN, payload=searxng_response(results))
|
||||
snippets = await client.search("test")
|
||||
|
||||
assert snippets[0].title == "Bold Title"
|
||||
assert snippets[0].snippet == "emphasized & strong"
|
||||
|
||||
|
||||
def test_strip_html_basic() -> None:
|
||||
assert _strip_html("<b>bold</b>") == "bold"
|
||||
assert _strip_html("no tags") == "no tags"
|
||||
assert _strip_html("<escaped>") == "<escaped>"
|
||||
assert _strip_html("") == ""
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_engines_field_preferred(client: SearXNGClient) -> None:
|
||||
results = [
|
||||
make_result(engine="fallback", engines=["primary_engine"], url="https://a.com")
|
||||
]
|
||||
with aioresponses() as m:
|
||||
m.get(SEARCH_PATTERN, payload=searxng_response(results))
|
||||
snippets = await client.search("test")
|
||||
|
||||
assert snippets[0].engine == "primary_engine"
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_results_without_url_skipped(client: SearXNGClient) -> None:
|
||||
results = [
|
||||
make_result(url=""),
|
||||
make_result(url="https://valid.com"),
|
||||
]
|
||||
with aioresponses() as m:
|
||||
m.get(SEARCH_PATTERN, payload=searxng_response(results))
|
||||
snippets = await client.search("test")
|
||||
|
||||
assert len(snippets) == 1
|
||||
assert snippets[0].url == "https://valid.com"
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_trailing_slash_in_base_url() -> None:
|
||||
client = SearXNGClient("http://searxng:8080/")
|
||||
assert client.base_url == "http://searxng:8080"
|
||||
Reference in New Issue
Block a user