chore: bump v0.4.3 — citation coherence, BM25 pipeline refinements, admin valves

x-hannibal · x-hannibal · commit a2a3efa850f6 · 2026-04-24T03:01:04.000+02:00
- Citation labels unified to [N] format (1:1 with inline marker)
- BM25 zero-score drop on both fetched sources and snippet pool
- BM25 budget computed against pre-drop count (preserves char allowance)
- redistribute_budget hard-caps zero-score sources at floor
- Reasoning models report visible reply length (strips &lt;think&gt; + &lt;details type="reasoning"&gt;)
- Prompt restructured: task framing before search context, output rules after
- New admin valve inject_snippet_pool (default ON)
- Pipeline stats line gated by debug valve (default hidden)
- User-Agent rotation pool expanded from 20 to 40 unique strings

See CHANGELOG.md for full details.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,3 +1,15 @@
+* 2026-04-23: v0.4.3 - Citation Coherence, BM25 Pipeline Refinements & Admin Valves (Hannibal)
+  * **fix:** Source label format unified to `--- [N] Title ---` so the block label maps 1:1 to the inline `[N]` citation marker the model is asked to emit. Removes the cognitive jump `Source N → [N]` that small models (Gemma3, smaller Mistral, Phi) were skipping, producing inline markers that mapped to nothing. The `<search_results>` XML wrapper and the SECURITY instruction from v0.3.5 are preserved.
+  * **fix:** `redistribute_budget` was adding score=0.0 sources to the "hungry" list, causing them to receive surplus budget via the equal-share fallback — a gocsgo.com and csgoskins.gg page were getting 5937/6080 chars instead of the 200-char floor. Sources with `score=0.0` are now excluded from surplus redistribution entirely; they stay hard-capped at `BM25_FLOOR_CHARS` regardless of available surplus.
+  * **fix:** Zero-score fetched sources (no BM25 overlap with the user query) are now dropped entirely in Phase B instead of being kept at the 200-char floor. Even floor-capped blocks still produced a `[N]` context block and a citation event, so the LLM would try to weave off-topic content into the answer (e.g. Zhihu "Attention is All You Need" threads and Python `self` tutorials appearing in a reply about autonomous vehicles). Degenerate case where *all* scores are zero falls through to the existing flat equal-share branch so the model still receives some context.
+  * **fix:** BM25 budget is now computed against the **pre-drop** source count, so the surviving sources inherit the budget that would have been spent on the dropped noise. Previously dropping zero-score sources silently shrunk `total_budget = max_len * len(sources)` proportional to the number of survivors, leaving the good sources with less context window than the user's `max_result_length` valve implied.
+  * **fix:** Snippet-only pool (`remaining_pool`) is now passed through the same BM25 + zero-score filter as the fetched sources before being injected into the LLM context. Previously the pool reached the prompt unfiltered, leaking ~40+ off-topic results per query (Python `self` tutorials, Arabic Hamza pages, random topic news) — both into the model's reasoning material and into the UI's citation panel as ~50 noisy pills. Survivors are rendered as `--- [N] Title (snippet only) ---`, contiguous numbering after fetched sources, each with its own `emit_citation`. No hard cap — the pool self-limits via lexical pertinence.
+  * **fix:** Reply-length stat (`📊 ... → Nk reply`) now strips `<think>`, `<thinking>` and OWUI's `<details type="reasoning" done="true">…</details>` wrappers before counting, so reasoning models (Qwen3 thinking, DeepSeek-R1, Phi-reasoning) report the visible answer length, not the visible answer + hidden chain-of-thought.
+  * **changed:** Prompt structure split by function. Task-framing sections (`INSTRUCTION`, `CRITICAL`, `RELIABILITY`) move **before** `<search_results>` so the model reads the context already knowing the goal; output-format sections (`CITATIONS`, `SECURITY`) move **after** `<search_results>` so they are the last tokens before generation, fighting "lost in the middle" attention loss on 40k+ token contexts. `INSTRUCTION` was also strengthened with "Provide a comprehensive, well-structured response that synthesises the key findings", which restored verbosity that was being throttled when all rules sat at the head of the prompt.
+  * **added:** Admin valve `inject_snippet_pool` (default `True`) — when OFF, the BM25-ranked snippet-only pool is dropped from the LLM context and only fully-fetched pages reach the model. Useful for short factual lookups where pool clutter outweighs grounding signal; recommended ON for analytical/exploratory queries.
+  * **changed:** User-Agent rotation pool expanded from 20 to 40 unique strings, doubling fingerprint diversity to further reduce per-domain rotation collisions on heavy `??:N` queries.
+  * **changed:** The `📊 ... raw → lxml → clean → ctx → reply` stats line is now appended to the model response **only when the admin `debug` valve is ON**. With debug OFF (the default) it goes only to the EasySearch debug log. Removes operator-only telemetry from the end-user view without adding a dedicated valve — the existing debug toggle already gates it.
+
 * 2026-04-23: v0.4.2 - OWUI Compatibility & Citation Fixes (Hannibal)
   * **fix:** `Users.get_user_by_id()` is synchronous in OWUI ≤0.8.12 and asynchronous in OWUI ≥0.9.x. Using a bare `await` on the sync version raises `object UserModel can't be used in 'await' expression`. Replaced all three call sites with a `_get_user()` helper that uses `inspect.isawaitable()` to branch at runtime — no minimum OWUI version requirement.
   * **fix:** Snippet-only `remaining_pool` entries were assigned `[N]` IDs in the search context but had no corresponding `emit_citation` call. OWUI silently drops inline `[N]` markers with no registered citation, so any fact the model cited from a snippet-only source would produce a dangling marker. All pool entries now emit a citation event.
diff --git a/PLAN.md b/PLAN.md
@@ -162,5 +162,12 @@ Context: BM25 is keyword-only and structurally blind to semantic matches (synony
 
 Candidate follow-ups evaluated but explicitly deferred:
 
-- **Snippet-pool reranking** — rerank `remaining_pool` (snippet-only) entries. Low ROI because the pool is secondary signal already; revisit if users report snippet-pool order complaints.
 - **Round-robin sub-query interleaving** — replicate `mcp-webgate/tools/query.py:110-128` to avoid one sub-query dominating the candidate pool. Requires N separate `process_web_search` calls (one per sub-query) instead of OWUI's batch call — significant refactor for moderate ROI.
+
+---
+
+## Resolved Out-of-Plan Work
+
+Items that were not scheduled milestones but landed during a fix cycle.
+
+- **Snippet-pool BM25 reranking** (shipped v0.4.3) — originally listed as low-ROI backlog, but the v0.4.3 citation-coherence work made it necessary: unfiltered pool entries were leaking off-topic citations into the UI and confusing models into citing slugs (`[REF]…[/REF]`). The pool now passes through the same BM25 + zero-score filter as the fetched sources before injection, with `emit_citation` per surviving entry. Admin valve `inject_snippet_pool` controls whether the pool reaches the LLM context at all.
diff --git a/README.md b/README.md
@@ -1,4 +1,4 @@
-## 🌐 EasySearch v0.4.0: High-Performance Web Search Filter
+## 🌐 EasySearch v0.4.3: High-Performance Web Search Filter
 
 An intelligent, context-aware web search filter for Open WebUI. EasySearch bypasses noisy standard web scrapers, utilizing parallel fetching, structural HTML cleaning, and dynamic context-awareness to feed your LLM only the highest quality data.
 
@@ -8,6 +8,40 @@ An intelligent, context-aware web search filter for Open WebUI. EasySearch bypas
 
 ---
 
+### 🆕 What's New in v0.4.3 — Citation Coherence & Pipeline Refinements
+
+The v0.4.x line introduced relevance-based budget allocation; v0.4.3 closes the
+loop on what reaches the model and what shows up in the UI.
+
+- **🎯 Inline `[N]` citations now reliably map to UI sources.** Source blocks
+  switched to `[N] Title` so the label matches the inline marker 1:1, and the
+  full snippet pool is now BM25-filtered and individually citation-registered —
+  no more dangling `[3]` or `[REF]…[/REF]` artefacts from models that improvise
+  their own citation format.
+- **🧹 Off-topic snippets are dropped before they reach the LLM.** Both the
+  fetched sources and the snippet-only oversampling pool are filtered through
+  the same BM25 zero-score gate, so Python `self` tutorials, Arabic Hamza
+  pages, and random topic news no longer leak into a query about autonomous
+  vehicles. Surviving sources inherit the budget the noise would have wasted.
+- **🧠 Prompt restructured around model attention.** Task framing
+  (`INSTRUCTION`, language anchor, snippet-priority hint) sits *before* the
+  search context; output rules (`CITATIONS`, `SECURITY`) sit *after*, where
+  recency bias makes them the last thing the model reads before generating.
+  This produced noticeably more verbose, well-structured answers across
+  Mistral, Gemma3, and Qwen3 thinking.
+- **📊 Reasoning models report honest reply lengths.** Qwen3 thinking,
+  DeepSeek-R1, Phi-reasoning chain-of-thought is now stripped before counting
+  the "reply" stat, so a 2k visible answer no longer reports as 14k.
+- **🎛️ New admin valve `inject_snippet_pool`** (default ON) — flip it OFF to
+  keep the LLM context tight to fully-fetched pages only, useful for short
+  factual lookups where pool clutter outweighs grounding signal.
+- **🔇 Pipeline stats are quiet by default.** The `📊 src · raw → lxml → clean
+  → ctx → reply` line is now appended to the response only when the admin
+  `debug` valve is ON; otherwise it goes only to the EasySearch debug log,
+  removing operator-only telemetry from the end-user view.
+
+---
+
 ### 🆕 What's New in v0.4.0 — Smart Context Allocation
 
 Answers are only as good as the context the model receives. Until now, EasySearch
@@ -41,7 +75,7 @@ The answer you get back is noticeably more focused.
 - **Deep Contextual Awareness:** Automatically analyzes your recent conversation history to infer exactly what you want to search for, allowing for zero-prompt searches using just the `??` trigger.
 - **Multi-Modifier Syntax:** Chain modifiers effortlessly to dictate search behavior. Force specific languages, context depth, and result limits on the fly (e.g., `??:en:10:c3`).
 - **Pure Text Extraction:** Utilizes `lxml` for surgical HTML cleaning. It strips away useless navigation menus, cookie banners, and footers, feeding the LLM only the pure, relevant article text to save tokens and improve accuracy.
-- **Anti-Scraping Stealth & Resilience:** Concurrently fetches pages while rotating through 20 unique browser User-Agents. If a website blocks the request (403 Forbidden), the "Gap-Filler" mechanism automatically fetches backup links in the background.
+- **Anti-Scraping Stealth & Resilience:** Concurrently fetches pages while rotating through 40 unique browser User-Agents. If a website blocks the request (403 Forbidden), the "Gap-Filler" mechanism automatically fetches backup links in the background.
 - **Smart Context Allocation (BM25 + Dynamic Budget):** Automatically ranks fetched pages by relevance and gives the best ones more context budget while shrinking marginal ones. Same total size, far more signal per token. Deterministic and zero-cost.
 - **RAG & Context Lockdown:** Temporarily disables native document retrieval (RAG) and standard searches during its execution round to prevent Open WebUI from polluting the prompt with conflicting background data.
 
@@ -120,6 +154,8 @@ EasySearch is highly customizable. Administrators can set global safety limits,
 | **Oversampling Factor** | `2` | Multiplier for search requests. If set to 2, and the target is 10 pages, EasySearch fetches 20 links from the search engine, deduplicates them, and only downloads the top 10 valid ones. |
 | **Max Results Per Query** | `20` | Hard cap on results requested per query to the search API. Default 20 is safe for all backends; raise up to 100 if your engine (e.g. SerpAPI, Exa, Serper) supports it. |
 | **Enable BM25 Rerank** | `True` | Rerank fetched sources by BM25 keyword relevance before building the LLM context. Deterministic, zero-cost. |
+| **Inject Snippet Pool** | `True` | Inject the snippet-only pool of unread search results into the LLM context, in addition to the fully-fetched and BM25-ranked sources. ON gives the model more grounding material at the cost of more citation pills in the UI; OFF keeps the context tight to fully-fetched pages only. Recommended ON for analytical/exploratory queries, OFF for short factual lookups. |
+| **Debug** | `False` | Enables verbose logging to the backend console *and* surfaces the `📊 src · raw → lxml → clean → ctx → reply` pipeline-stats line at the bottom of each response. Leave OFF for end users; turn ON when tuning or troubleshooting. |
 
 ---
 
@@ -140,7 +176,7 @@ Once the search context is built, EasySearch dynamically forces `body["features"
 Websites increasingly block bots with `403 Forbidden` errors. EasySearch combats this by:
 * Stripping tracking parameters (`utm_source`, `gclid`) from URLs to improve deduplication.
 * Utilizing `httpx` to fetch URLs concurrently, drastically reducing wait times.
-* Rotating through a carefully curated list of 20 unique, modern browser User-Agents (Windows, macOS, Linux, iOS, Android) per request.
+* Rotating through a carefully curated list of 40 unique, modern browser User-Agents (Windows, macOS, Linux, iOS, Android) per request.
 
 #### 4. The Gap-Filler (Auto-Recovery)
 If the user requests 10 pages, but 3 of them result in timeouts or 403s, traditional scrapers return only 7 results. If `Auto Recovery Fetch` is enabled, EasySearch detects the gap and dynamically executes a secondary parallel fetch utilizing the "leftovers" from the Oversampling pool, guaranteeing the requested payload size.
diff --git a/easysearch.py b/easysearch.py