This file (`AGENTS.md`) is the single source of truth for the project's architecture, design decisions, and data contracts. Update it whenever relevant architectural changes, new endpoints, or structural modifications are made during development. Also update the Hugo documentation site under `docs/` with relevant user-facing information (setup instructions, provider behavior, MCP client setup, configuration, deployment notes, API behavior, observability, etc.) and keep `README.md` as the concise repository landing page. Always consult this file to understand the system context.
A self-hosted, highly efficient, privacy-focused Search Engine designed explicitly for AI Agents and LLMs. It bypasses the need for paid search APIs (like Tavily or Bing) and avoids the rate-limiting and formatting issues of standard SearXNG instances.
It provides clean, token-optimized Markdown ready for LLM context windows, and natively supports the Model Context Protocol (MCP) for plug-and-play integration with modern LLM UIs (OpenWebUI, Claude Desktop, Cursor, Opencode, Neovim CodeCompanion, etc.).
The system uses a Hybrid Microservice Architecture designed for Kubernetes:
The `search-gateway` service is the front-facing orchestrator. It is built in Go for high concurrency, low memory footprint, and fast routing.
- Role: API Gateway, Search Scraper, Orchestrator, MCP Server.
- Responsibilities:
  - Expose standard REST API (`/api/v1/search` and `/api/v1/fetch`).
  - Expose Swagger UI documentation (`/docs`).
  - Expose MCP Server over SSE (`/mcp/sse` and `/mcp/message`) and Streamable HTTP (`/mcp/http`).
  - Auto-generate OpenAPI specs.
  - Scrape DuckDuckGo (HTML version: `html.duckduckgo.com`) to extract top URLs when using the native `searchbase_ddg` provider.
  - Call official provider APIs when configured, including Brave (`SEARCHBASE_SEARCH_PROVIDER=brave` and `SEARCHBASE_BRAVE_API_TOKEN`) and Mojeek (`SEARCHBASE_SEARCH_PROVIDER=mojeek` and `SEARCHBASE_MOJEEK_API_KEY`).
  - Return compact search results (`title`, `url`, `snippet`) without automatically fetching page content.
  - Fetch page content only through `/api/v1/fetch` or the MCP `fetch_url` tool when the client/LLM explicitly requests it.
- Configuration: Uses the Viper library with a centralized `settings` package (`internal/settings`) enforcing a strict `SEARCHBASE_` prefix for all environment variables (e.g., `SEARCHBASE_PORT`, `SEARCHBASE_CRAWL_WORKER_ADDRESS`, `SEARCHBASE_LOG_LEVEL`). Supports defaults and automatic env var binding.
- Design Pattern: Uses Interfaces (e.g., `SearchProvider`) to allow easy plugging of search APIs and metasearch providers without changing core logic. Current providers include native DuckDuckGo HTML (`searchbase_ddg`), DDGS (`ddgs`), SearXNG (`searxng`), Brave Search API (`brave`), and Mojeek Search API (`mojeek`).
Privacy-first structured logging designed to prevent user tracking and correlation.
- Implementation: Gin middleware at `internal/middlewares/ginslogger.go` using Go's standard `log/slog` package.
- Configuration: Environment variables `SEARCHBASE_LOG_LEVEL` (error/warn/info/debug) and `SEARCHBASE_LOG_FORMAT` (json/text).
- Privacy Guarantees:
  - No IP Logging: Raw IP addresses are never logged; only regional data from edge proxies (e.g., Cloudflare's `CF-IPCountry` header).
  - Minimal Context: Only essential request metadata (method, path, status code, latency) is captured.
Distributed tracing support for observability and debugging. Not fully tested yet.
- Implementation: `internal/telemetry/otel.go` using the Go OTEL SDK with the OTLP HTTP exporter.
- Configuration: Environment variables `SEARCHBASE_TRACING_ENABLED` (bool) and `SEARCHBASE_TRACING_ENDPOINT` (string).
- Features: Trace propagation via W3C TraceContext/Baggage, service name attribution.
Metrics are planned but not implemented yet.
The `crawl-worker` service is the internal heavy-lifter. It is completely hidden from the outside world.
- Role: Headless browser, DOM cleaner, Markdown optimizer.
- Tech Stack: Python, FastAPI, `crawl4ai`.
- Responsibilities:
  - Expose internal endpoint `POST /extract`.
  - Fetch URLs concurrently.
  - Execute JavaScript (if requested via the `js_render` flag).
  - Strip boilerplate (navbars, footers, ads) and extract LLM-optimized Markdown.
  - Return the Markdown string to the Go Gateway.
Endpoint: POST /api/v1/search
Request:
```json
{
  "query": "kubernetes 1.30 release notes",
  "engine": "auto",
  "region": "wt-wt",
  "timelimit": "d",
  "safesearch": "moderate",
  "page": 1
}
```
Response:
```json
[
  {
    "title": "Kubernetes 1.30: ...",
    "url": "https://kubernetes.io/...",
    "snippet": "Short description from the search engine..."
  }
]
```
Errors: `400` for invalid request payload or missing `query`; `502` for upstream provider failures with safe provider-specific messages that do not include search queries, request bodies, tokens, or raw upstream URLs; `500` for unexpected gateway failures.
Endpoint: POST /api/v1/fetch
Request:
```json
{
  "url": "https://kubernetes.io/blog/...",
  "js_render": false
}
```
Response:
```json
{
  "title": "",
  "url": "https://kubernetes.io/blog/...",
  "snippet": "",
  "markdown": "# Kubernetes 1.30\n\nThe new release features..."
}
```
MCP tools:
- `web_search`: Searches the web using the configured gateway search provider and returns the top results. Takes `query` and optional `limit`, `engine` (requires the `ddgs` or `searxng` provider), `region`, `timelimit`, `safesearch`, and `page` arguments. `limit` is an optional upper bound; omitted or `0` uses the provider or search engine default. Brave maps supported country-language style `region` values such as `us-en` to Brave `country=US` and `search_lang=en`; unsupported country or language parts are omitted. Mojeek maps `limit` to `t`, maps `timelimit` values `d`, `m`, and `y` to `since`, intentionally does not support `w`, and intentionally does not support `page` yet. Provider failures are returned as safe provider-specific messages without exposing search queries, request bodies, tokens, or raw upstream URLs.
- `fetch_url`: Fetches the content of a single URL directly and extracts optimized Markdown. Takes `url` and `js_render` arguments.
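The Brave `region` mapping described above can be sketched as a small function. The two allowlists are deliberately tiny and illustrative; the country and language sets Brave actually supports are larger, and the name `BraveRegionParams` is an assumption for this example.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// Illustrative allowlists only; they are not Brave's real supported sets.
var supportedCountries = map[string]bool{"US": true, "GB": true, "DE": true, "FR": true}
var supportedLanguages = map[string]bool{"en": true, "de": true, "fr": true}

// BraveRegionParams converts a country-language region such as "us-en" into
// country and search_lang query parameters. Unsupported parts are omitted.
func BraveRegionParams(region string) url.Values {
	params := url.Values{}
	country, lang, ok := strings.Cut(strings.TrimSpace(region), "-")
	if !ok {
		return params // not in country-language form
	}
	if c := strings.ToUpper(country); supportedCountries[c] {
		params.Set("country", c)
	}
	if l := strings.ToLower(lang); supportedLanguages[l] {
		params.Set("search_lang", l)
	}
	return params
}

func main() {
	fmt.Println(BraveRegionParams("us-en").Encode()) // country=US&search_lang=en
	fmt.Println(BraveRegionParams("us-xx").Encode()) // country=US
	fmt.Println(BraveRegionParams("wt-wt").Encode() == "")
}
```

Treating each half independently means a region with a known country but an unknown language still narrows results by country instead of being rejected outright.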
Endpoint: POST /extract
Request:
```json
{
  "url": "https://kubernetes.io/...",
  "js_render": false
}
```
Response:
```json
{
  "markdown": "...",
  "success": true,
  "error": ""
}
```
Standard YAML manifests (no Helm/Kustomize for now to avoid premature complexity).
- `search-gateway`: Deployment (1 replica), Service (ClusterIP or LoadBalancer/Ingress for external access).
- `crawl-worker`: Deployment (scalable replicas for heavy lifting), Service (ClusterIP, strictly internal).
Compose files are demonstration-only examples for local testing and learning service wiring. They are not production deployment templates. Backend-specific examples live under `examples/compose/` for `searchbase_ddg`, `ddgs`, and `searxng`; the root `docker-compose.yml` is for development and can start more services than a normal deployment requires.
User-facing project documentation lives in the Hugo site under `docs/`. The site uses custom local layouts and styling, not an external theme. Keep it updated whenever setup, configuration, provider behavior, MCP usage, deployment notes, observability, or API behavior changes. Use `task docs:hugo` to preview locally and `hugo --destination /tmp/searchbase-docs-build` from `docs/` to verify builds without writing generated output into the repository.
All external contributors must accept `CLA.md` before a pull request can be merged. `.github/workflows/cla.yaml` uses CLA Assistant Lite to require a signature comment and stores accepted signatures on the `cla-signatures` branch at `.github/cla/signatures/v1/cla.json`. Keep `CLA.md`, `CONTRIBUTING.md`, the PR template, the CLA workflow, and the docs Contributing page in sync if the contribution process changes.
The project utilizes GitHub Actions to automatically version, build, and publish Docker images to the GitHub Container Registry (GHCR) using go-semantic-release. Taskfiles are used within these workflows to orchestrate the build and push processes.
- Trigger: Pushes to the `main` branch or manual workflow dispatch.
- Images:
  - `ghcr.io/coolapso/searchbase/search-gateway`
  - `ghcr.io/coolapso/searchbase/crawl-worker`
  - `ghcr.io/coolapso/searchbase/ddgs`
- Workflow Location: `.github/workflows/release.yaml`
To ensure reliable and fast multi-architecture builds, the project strictly adheres to the following pattern:
- Go Services (e.g., `search-gateway`): Use standard Docker Buildx cross-compilation on a single `ubuntu-latest` runner (AMD64) targeting `platforms: linux/amd64,linux/arm64`. Go's native cross-compiler handles this extremely efficiently.
- Python Services (e.g., `crawl-worker`, `ddgs`): DO NOT use QEMU for cross-compiling Python images. It is painfully slow and often fails when compiling C extensions. Instead, use a Matrix Strategy with native runners:
  - Spawn one job on `ubuntu-latest` (AMD64).
  - Spawn a second job on `ubuntu-24.04-arm` (ARM64).
  - Build and push architecture-specific images appended with a flavor tag (e.g., `-amd64`, `-arm64`).
  - Use a final aggregation job (Manifest Job) to stitch the `-amd64` and `-arm64` tags together into the final multi-arch tags (`latest`, `v1.2.3`) using `docker buildx imagetools create`.
- Directory Scaffold: Set up the Go module and Python environment.
- Worker Implementation: Build the Python FastAPI wrapper around `crawl4ai`.
- Search Provider: Implement the DuckDuckGo HTML scraper in Go.
- Worker Client: Implement the internal HTTP client in Go to talk to the Python worker.
- API Gateway: Build the REST API in Go.
- MCP Integration: Implement MCP over SSE in Go.
- Kubernetes Manifests: Write the deployment YAMLs.
- Testing & Refinement: E2E testing and Dockerfile optimization.
- Consolidate Search Request: Unify search parameters into a single `search.Request` struct.
- Settings Centralization: Centralize environment variable parsing and configuration management.
- DDGS Microservice: Integrate `ddgs` as an internal Python API for fallback and multi-engine search.
- Brave Search API Provider: Implement native Go support for the official Brave Search API. Configure with `SEARCHBASE_SEARCH_PROVIDER=brave` and `SEARCHBASE_BRAVE_API_TOKEN`.
- Mojeek Search API Provider: Implement native Go support for the official Mojeek Search API. Configure with `SEARCHBASE_SEARCH_PROVIDER=mojeek` and `SEARCHBASE_MOJEEK_API_KEY`.
- Additional Native Cloud Providers: Implement native Go support for other major search engine APIs when viable (Google Search API, Bing, etc.).
- Engine Selection: Allow clients to select specific search engines via the API.
- Result Aggregation: Support searching from multiple search engines concurrently and aggregating/deduplicating the results.
- Structured Logging: Implemented privacy-focused structured JSON logging via Gin middleware at `internal/middlewares/ginslogger.go`. Logs exclude all personal data (IP addresses, user agents, query parameters) and prevent user correlation.
- Authentication: Secure the Gateway API/MCP. The gateway will act strictly as a validator (verifying API keys/tokens), delegating user management, billing, and key generation to a separate, dedicated Authentication Service.
- Deep Crawling: Extend the `/fetch` endpoint to allow LLMs to follow links and crawl deeper into sites autonomously.
- Instrumentation: Add OpenTelemetry tracing, spans, metrics, and response time tracking.
- CLI Tool: Create a CLI utility (`search` and `get`) for interacting with the API natively from the terminal.