[feature] Index-first Lucene search: ft:query-scope (live nodes) and ft:search-scope (ES-shaped map) by joewiz · Pull Request #6455 · eXist-db/exist

joewiz · 2026-06-08T20:34:06Z

[This PR was co-authored with Claude Code. -Joe]

Names are provisional. ft:query-scope / ft:search-scope are working names — happy to rename before merge.

Summary

Two new Lucene functions for index-first, name-independent full-text search over a collection scope:

ft:query-scope($scope, $query, $options?) → live nodes. The index-first sibling of ft:query: it searches the Lucene index directly over $scope (collection/document URIs, recursive) and returns every matching indexed element — of any element type — as a live node carrying its score and matches. ft:score/ft:field/ft:facets/ft:highlight-field-matches compose on the result as usual.
ft:search-scope($scope, $query, $options?) → one Elasticsearch _search-shaped map(*). The detached, map-returning companion for API builders: { total, max-score, hits[], facets }, each hit { uri, node-id, score, source, highlight }. Plain result data, no node-set to walk and re-serialize.

Motivation

Two recurring needs:

Index-first, name-independent search. To search "all indexed content under a collection regardless of element name" today, you write collection(...)//*[ft:query(., 'field:(…)')]. The descendant-wildcard form loses ft:score for field queries (a known artifact), and the alternative of naming the elements forces callers to enumerate and union the contributing element names. ft:query-scope queries the index directly over the scope and returns every matching indexed node with correct relevance: no wildcard, no element-name union to maintain.
An Elasticsearch _search-style result for API builders. For something like existdb-openapi's /api/search, what you want is plain result data — total, hits with fields, facets — not a node-set to walk and re-serialize. ft:search-scope returns exactly that, assembled natively.

Why two functions (and the naming)

It helps to place these against the existing functions on two axes — detached vs live result, and XML vs map output:

| | XML output | map output |
|---|---|---|
| detached (a snapshot, no live nodes) | `ft:search` (legacy report) | `ft:search-scope` (new) |
| live nodes | `ft:query` (context-scoped), `ft:query-scope` (scope-scoped, new) | (n/a: flattening live nodes to a map defeats their purpose) |

The two new functions are not variants of one another and shouldn't be folded into one: they have divergent return contracts (live node-set vs detached map) and sit on different LuceneIndexWorker methods. The query / search pair mirrors eXist's existing split (ft:query returns live nodes; the legacy ft:search returns a detached report).
The -scope suffix signals "scoped by a collection, not by an XPath context node-set." I avoided ft:search-index: from an Elasticsearch mindset "index" is the corpus you search (a collection here), not part of a verb, so it reads ambiguously.

Hit granularity (the one decision worth calling out)

"Document" means three things in this stack, and conflating them produces real count bugs:

an eXist collection — the storage container $scope names;
an eXist document — an XML resource (intro.xml), an XmldbURI;
an indexed element — eXist creates one Lucene document per indexed element occurrence (per <text qname="…">), so one document with two <para> and one <caption> yields three Lucene documents.

So $scope filters at eXist-document granularity, but hits, counts, and facets are at indexed-element granularity. Elasticsearch maps each ES document to one Lucene document (1:1, nested fields aside); eXist does not. ft:search-scope therefore defaults to indexed-element granularity (honest to the index, sub-document precision), with a collapse option for the ES-faithful one-hit-per-document view (group by document, best-scoring element, total = distinct documents), the analog of ES field-collapse / top_hits.
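The difference between the two granularities can be sketched in a few lines of Python. This is only a model of the behaviour described above, not the Java in SearchScope.java; the `Hit` class, the document paths, node ids, and scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    uri: str       # eXist document the indexed element belongs to
    node_id: str   # identifies the element inside that document
    score: float


def collapse(hits):
    """Keep only the best-scoring element per document, ranked by score."""
    best = {}
    for hit in hits:
        current = best.get(hit.uri)
        if current is None or hit.score > current.score:
            best[hit.uri] = hit
    return sorted(best.values(), key=lambda h: h.score, reverse=True)


# Hypothetical data: intro.xml contributes three indexed elements
# (two <para>, one <caption>), guide.xml contributes one.
element_hits = [
    Hit("/db/docs/intro.xml", "1.2", 0.8),
    Hit("/db/docs/intro.xml", "1.3", 0.5),
    Hit("/db/docs/intro.xml", "1.4", 0.3),
    Hit("/db/docs/guide.xml", "1.1", 0.6),
]
collapsed = collapse(element_hits)

print(len(element_hits))  # element granularity: one hit per indexed element
print(len(collapsed))     # collapsed: one hit per document
```

With element granularity the total is 4; collapsed, it is 2, which is the element-vs-document count discrepancy the collapse option is meant to resolve.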

`ft:search-scope` options

$options carries a filter that restricts the query, plus keys that shape the result:

| key | type | meaning |
|---|---|---|
| `"filter"` | map(*) | facet drill-down `{ dimension: value(s) }` restricting the search (ES post-filter analog; keeps `total`/paging consistent) |
| `"fields"` | xs:string* | stored fields to include in each hit's `source` (`_source`) |
| `"highlight"` | xs:string* | fields to highlight; adds a per-hit `highlight` map of `exist:field`/`exist:match` nodes |
| `"facets"` | xs:string* | facet dimensions to aggregate over the full match set |
| `"collapse"` | xs:boolean | group hits to one-per-document (best-scoring element; `total` = distinct documents) |
| `"offset"` / `"limit"` | xs:integer | page the ranked hits |

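Note: ft:search-scope returns a single envelope map, so count(ft:search-scope(...)) is always 1. Read the total match count from ?total; count(?hits?*) gives the number of hits actually returned, which can be smaller than ?total when offset/limit page the results.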
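As a rough illustration of the envelope and its paging, here is a Python model (not the native implementation). The `envelope` helper and the sample hits are hypothetical, facets are omitted, and returning `None` for `max-score` on an empty result is an assumption rather than documented behaviour.

```python
def envelope(ranked, offset, limit):
    """Build an ES-style envelope from hits already sorted by score."""
    page = ranked[offset:offset + limit]  # ranked[offset, offset+limit)
    return {
        # total is the full match count, independent of the page size
        "total": len(ranked),
        # assumption: None when nothing matched
        "max-score": max((h["score"] for h in ranked), default=None),
        "hits": page,
    }


# Hypothetical ranked hits, best score first.
ranked = [
    {"uri": "/db/docs/a.xml", "score": 1.0},
    {"uri": "/db/docs/b.xml", "score": 0.9},
    {"uri": "/db/docs/c.xml", "score": 0.8},
    {"uri": "/db/docs/d.xml", "score": 0.7},
    {"uri": "/db/docs/e.xml", "score": 0.6},
]

page_two = envelope(ranked, offset=2, limit=2)
print(page_two["total"])      # full match count
print(len(page_two["hits"]))  # hits on this page only
```

The envelope itself is one value; the match count lives in `total`, and the length of `hits` only tells you how many hits the current page carries.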

What changed (`extensions/indexes/lucene`)

QueryScope.java (ft:query-scope) and SearchScope.java (ft:search-scope) — the two functions; SearchScope assembles the ES map natively from the same index-first query, reading scores/fields/facets the way ft:score/ft:field/ft:facets do, and reusing the ft:highlight-field-matches engine for highlight.
LuceneScope.java — shared scope-resolution + index-first query execution, used by both functions (no duplication).
LuceneModule.java — registers the four signatures.
Field.java — highlightMatches made package-private static so ft:search-scope can reuse it on the live node it materializes; no behavior change to ft:highlight-field-matches.
Tests: ft-query-scope.xqm, ft-search-scope.xqm, scope-dls.xqm, and the LuceneQueryScopeTests runner.

Companion fix (faceted highlighting)

ft:search-scope's highlight works on its own for unfaceted queries. Combining filter (facet drill-down) with highlight additionally requires a small fix to a pre-existing defect: a facet DrillDownQuery silently disabled ft:highlight-field-matches term extraction. That fix is submitted as a separate bugfix PR, #6454. It affects ft:query faceted highlighting independently of these functions, so it is based on develop, not this branch. This PR depends on it only for the combined filter + highlight path.

Test plan

42 XQSuite tests across the suite: ft:query-scope (name-independence, nested-element scoring, composition with ft:score/ft:field/ft:facets/ft:highlight-field-matches, live nodes, sortable-by-score); ft:search-scope (envelope shape, element vs collapse granularity, source/facets/highlight/filter/offset/limit).
Document-level security (scope-dls.xqm): a guest never gets nodes or hits from documents it cannot read (verified via system:as-user against mixed read permissions) — the restricted document is absent from both hits and total, so the count does not leak its existence.
Validated against a real corpus (eXist's own function docs) behind an /api/search-style endpoint: identical hit set to the collection(...)//el[ft:query(...)] approach with ft:score preserved; faceted highlighting end-to-end (with bugfix PR #6454, "Preserve ft:highlight-field-matches under facet drill-down", in place).
Codacy/PMD clean on the new files.

ft:search-index($scope, $query, $options?) queries the Lucene index directly over the documents in $scope and returns ALL matching nodes, of any indexed element type, with their Lucene scores and match highlighting attached, exactly as ft:query results carry them. Unlike ft:query it does not evaluate relative to an XPath context node set, so:

- relevance is correct for every hit regardless of how deeply the matched element is nested (it avoids the //* descendant-wildcard ft:score-loss artifact by never using an XPath node set as the query unit), and
- it is element-name independent: no need to enumerate or union the contributing element types, so content producers stay decoupled from the search aggregator.

The result is an ordinary node set, so ft:score, ft:facets, ft:field and ft:highlight-field-matches compose on it as usual. This is the focused native primitive underpinning the field-first ("eXlasticSearch") search design; the ES _search-style result map (hits/fields/facets/highlights/live-node) is assembled in XQuery on top of this node set.

Implementation reuses the existing scored XML-field search path: it builds a DocumentSet from the scope collections and calls LuceneIndexWorker.query(...) with a null contextSet (index-first, no descendant-of constraint) and null qnames (all defined indexes).

Tests (ft-search-index.xqm): searchable content in NESTED elements (para/caption), the case where //* loses ft:score, proving search-index finds them across element types, scores each > 0, is name-independent, composes with ft:facets/ft:field/ft:score/ft:highlight-field-matches, returns live nodes, sorts by score, and matches all on an empty query.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Address review feedback on the ft:search-index draft:

- Add the missing LGPL license header to LuceneSearchIndexTests.java so the org CI RAT/license check passes (the sibling LuceneAnalyzersTests has it).
- Cover the 3-argument $options form, which was advertised but untested: facet drill-down (OPTION_FACETS, restricting "content:(array)" hits to the para vs caption facet value) and default-operator (flipping eXist's AND default to OR widens "array map" from 2 hits to 3, proving the options arg passes through). A 2-arg control documents the AND default.
- Comment SearchIndex.eval to explain that options is positionally the 3rd argument and parseOptions short-circuits to defaults when argCount < 3, so the 2-arg form never dereferences a missing argument.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
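The default-operator test above relies on the difference between requiring all terms and requiring any term. A toy Python model of that difference, assuming simple whitespace tokenization (this is not Lucene's query parser, and the sample texts are hypothetical):

```python
def matches(text, query, default_operator="and"):
    """Whitespace-token match; a toy stand-in for a real query parser."""
    tokens = set(text.lower().split())
    terms = query.lower().split()
    combine = all if default_operator == "and" else any
    return combine(term in tokens for term in terms)


# Hypothetical indexed texts: two mention both terms, one only "array".
texts = ["an array and a map", "map over an array", "a plain array"]

and_hits = [t for t in texts if matches(t, "array map")]
or_hits = [t for t in texts if matches(t, "array map", default_operator="or")]

print(len(and_hits), len(or_hits))  # 2 3
```

Switching the operator from AND to OR widens the result from two hits to three, the same shape of change the test uses to prove that the options argument is passed through.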

… companion

Rename the index-first live-node function ft:search-index -> ft:query-scope (class SearchIndex -> QueryScope, with the test module/runner renamed to match). The name places it in the ft:query family it actually belongs to: same LuceneIndexWorker.query() path, live nodes, composes with ft:score/ft:field/ft:facets/ft:highlight-field-matches. "search-index" misread from an Elasticsearch mindset, where "index" is the corpus, not part of the verb.

Add ft-search-scope-map.xqm: the executable spec for an ES _search-shaped, map-returning companion (proposed native ft:search-scope), assembled in XQuery over ft:query-scope. It returns total/max-score/hits[]/facets, where each hit carries uri, node-id, score, a "source" map (requested stored fields), and an optional "highlight" snippet. Hit granularity defaults to the indexed element (honest to the index); a collapse option gives the ES-faithful one-hit-per-document view (group by document URI, best-scoring element), modeling the element-vs-document count discrepancy seen in /api/search.

10 tests pin the shape and both granularities; 23 tests total across the query-scope suite, all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…panion

Replace the XQuery reference module with a native ft:search-scope function, so the ES _search-shaped, map-returning companion lives in the ft: namespace alongside ft:query-scope. It returns map { total, max-score, hits[], facets }, where each hit carries uri, node-id, score, and a "source" map of requested stored fields. The $options map shapes the result: fields, facets (dimensions to aggregate), collapse, limit. Hit granularity defaults to the indexed element; collapse=true() groups to one-hit-per-document (best-scoring element, total = distinct documents), modeling the element-vs-document count discrepancy.

Score is summed from the node's Lucene matches (as ft:score does); fields come from the worker's stored-field lookup; facets from each match's FacetsCollector, merged across queries (as ft:facets does). Highlighting and a stored-fields-only fast path (no node materialization) are noted follow-ups.

Factor the shared scope-resolution and index-first query execution out of QueryScope into LuceneScope, used by both functions.

26 XQSuite tests across the suite (13 query-scope + 13 search-scope), all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Address the two blockers from the existdb-openapi trial against a real corpus (223 docs / 2637 indexed function elements):

- "highlight" option (xs:string*): adds a per-hit "highlight" map whose values are the exist:field/exist:match nodes produced by the existing ft:highlight-field-matches engine. ft:search-scope already materializes the live node internally, so it highlights before detaching to the map. Field.highlightMatches is made package-private static for reuse.
- "offset" option (alias "from"): pages the ranked hits as ranked[offset, offset+limit). limit alone capped only the first page; total still reports the full count, so APIs can page past page 1.

Naming and element-default granularity were confirmed by the same trial. The stored-fields-only fast path (the map form is currently the slowest of the three options) remains the documented follow-up.

31 XQSuite tests across the suite (13 query-scope + 18 search-scope), all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…tion

Thread a facet drill-down into the search so callers can restrict by a facet value (e.g. a "section" within an app). The "filter" option is a map { dimension: value(s) } that becomes a Lucene DrillDownQuery on the search -- the ES post-filter analog. This must live in the query rather than be applied caller-side: filtering here keeps total/limit/paging consistent, which post-hoc filtering of the hit list cannot.

"filter" restricts the query; the other options (fields/highlight/facets/collapse/offset/limit) still shape the result. Facet aggregation continues to run over the (now filtered) match set. Other Lucene query options (default-operator, ...) are not yet threaded -- a follow-up.

Tests: drill-down restricts total and the hits array (kind=para drops the caption hit, 3 -> 2; kind=caption keeps 1). 34 XQSuite tests across the suite.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
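Why the filter has to run inside the query rather than caller-side can be seen with a small Python model. The hit list, the `kind` facet values, and the helper names are hypothetical; the real drill-down is a Lucene DrillDownQuery, which this does not reproduce.

```python
def page(items, offset, limit):
    """Return the slice items[offset, offset+limit)."""
    return items[offset:offset + limit]


# Hypothetical ranked hits, each tagged with a "kind" facet value.
kinds = ["para", "caption", "para", "para", "caption", "para"]
ranked = [{"id": i, "kind": k} for i, k in enumerate(kinds)]

# Filter inside the query: restrict first, then count and page.
filtered = [h for h in ranked if h["kind"] == "para"]
in_query = {"total": len(filtered), "hits": page(filtered, 0, 2)}

# Filter caller-side: page first, then drop non-matching hits.
first_page = page(ranked, 0, 2)
post_hoc = {
    "total": len(ranked),  # still counts the caption hits
    "hits": [h for h in first_page if h["kind"] == "para"],
}

print(in_query["total"], len(in_query["hits"]))  # 4 2
print(post_hoc["total"], len(post_hoc["hits"]))  # 6 1
```

In the caller-side variant the reported total includes hits the filter would reject, and the first page comes back short, so page boundaries no longer line up with the filtered result.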

…issions

Pin the document-level security guarantee that existdb-openapi's field-permission model relies on: neither scope function may return nodes or hits from documents the caller cannot read. They resolve scope through broker.allDocs(...) and materialize hits as persistent nodes through the broker, both of which enforce read permissions -- the same guarantee any collection()//x query honors.

scope-dls.xqm stores a public doc (world-readable) and a secret doc (rw-------) both matching a shared term, then queries as guest vs admin via system:as-user: a guest gets only the public hit (count 1, total 1), admin gets both (2); a term indexed only in the secret doc is unreachable to the guest (0) but visible to admin (1); the guest's single search-scope hit is always the public document. Mirrors the visibility checks ft-search-binary.xqm makes for legacy ft:search.

43 XQSuite tests across the suite, all green -- DLS confirmed, not assumed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
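The guest/admin scenario in that commit can be modelled in Python as follows. This is a simplification under stated assumptions: a set of allowed readers stands in for eXist's permission bits, and the `Doc` class, paths, and terms are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    uri: str
    readers: frozenset  # users allowed to read (stands in for mode bits)
    terms: frozenset    # indexed terms


def search(docs, user, term):
    """Drop unreadable documents before matching, so they never reach
    either the hits or the total."""
    readable = [d for d in docs if user in d.readers]
    hits = [d.uri for d in readable if term in d.terms]
    return {"total": len(hits), "hits": hits}


DOCS = [
    Doc("/db/dls/public.xml", frozenset({"guest", "admin"}),
        frozenset({"shared"})),
    Doc("/db/dls/secret.xml", frozenset({"admin"}),
        frozenset({"shared", "classified"})),
]

print(search(DOCS, "guest", "shared")["total"])      # 1
print(search(DOCS, "admin", "shared")["total"])      # 2
print(search(DOCS, "guest", "classified")["total"])  # 0
```

Because the permission check happens before counting, the guest's total does not reveal that a second matching document exists.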

joewiz and others added 7 commits June 7, 2026 03:31

joewiz requested a review from a team as a code owner June 8, 2026 20:34

joewiz added the enhancement label (new features, suggestions, etc.) Jun 8, 2026

joewiz mentioned this pull request Jun 9, 2026

[feature] ft:fields: introspect configured Lucene fields/facets in a scope #6459

Open

5 tasks

duncdrum added the Lucene label (issue is related to Lucene or its integration) Jun 9, 2026

joewiz mentioned this pull request Jun 10, 2026

[feature] Search: discover searchable fields via /api/search/fields (Search-in picker) joewiz/existdb-oxygen-plugin#37

Merged

4 tasks

joewiz wants to merge 7 commits into eXist-db:develop from joewiz:feature/lucene-search-index

2 participants
