Arnej/add field splitter 10#36143
Conversation
… instances After view resolution, query terms may reference multiple index fields. Downstream blueprint building prefers single-field terms. FieldSplitter bridges this gap by splitting multi-field terms into OR of per-field copies early in the query processing pipeline. Handles all query node types: simple terms, Phrase, Equiv, SameElement, NEAR/ONEAR, and multi-term nodes (WeightedSet, DotProduct, WandTerm, InTerm, WordAlternatives). Falls back to the original tree on errors. Add debug logging in BlueprintBuilder for empty blueprint cases. Includes comprehensive unit tests for all node types and edge cases.
Phrase children (individual words) don't carry field metadata - they inherit the field from their parent Phrase node. When a Phrase is nested inside an Equiv, Near, ONear, or SameElement, _force_field_id is set, and visiting phrase children with it set would cause field lookup to fail. Fix by saving/clearing/restoring _force_field_id around the child visit, consistent with the same pattern used elsewhere in the visitor.
When a SameElement has multiple fields but children have different field sets, split the SameElement into per-field copies without forcing the children, letting each child handle its own field splitting independently.
Wire up FieldSplitter::split_terms in the query build step, placed after field resolution (ResolveViewVisitor) and before ranking term extraction (NeedsRankingVisitor). This ensures multi-field query terms are normalized into per-field instances before the rest of the matching pipeline sees them, enabling field-specific optimizations.
There was a problem hiding this comment.
Pull request overview
This PR adds a FieldSplitter utility to the Vespa search query processing pipeline. It transforms query trees so that multi-field terms (terms that can match multiple fields, e.g., via view resolution) are split into separate per-field term instances connected with OR nodes. The splitter runs after ResolveViewVisitor resolves views to fields and before match data allocation, ensuring each term operates on exactly one field for simplified query execution.
Changes:
- Introduces
FieldSplitterwith a visitor-based tree transformation that splits multi-field query terms into per-field instances connected by OR nodes, handling various node types including terms, phrases, NEAR/ONEAR, Equiv, SameElement, and multi-term nodes. - Integrates the field splitter into the query processing pipeline in
Query::buildTree()after view resolution. - Adds debug logging to the blueprint builder for
EmptyBlueprintcreation cases.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
searchcore/src/vespa/searchcore/proton/matching/field_splitter.h |
New header declaring the FieldSplitter struct with a split_terms static method |
searchcore/src/vespa/searchcore/proton/matching/field_splitter.cpp |
New 824-line implementation with FieldSplitterVisitor handling all query node types |
searchcore/src/vespa/searchcore/proton/matching/query.cpp |
Integrates FieldSplitter::split_terms into Query::buildTree after view resolution |
searchcore/src/vespa/searchcore/proton/matching/blueprintbuilder.cpp |
Adds debug log statements for EmptyBlueprint creation cases |
searchcore/src/vespa/searchcore/proton/matching/CMakeLists.txt |
Adds field_splitter.cpp to the build |
searchcore/src/tests/proton/matching/field_splitter_test.cpp |
New comprehensive test file with 20+ test cases |
searchcore/src/tests/proton/matching/CMakeLists.txt |
Adds new test executable to the build |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| void splitSameElementByFieldsNoForce(ProtonSameElement &node, const std::set<uint32_t> &fields) { | ||
| _builder.addOr(fields.size()); | ||
| for ([[maybe_unused]] uint32_t field_id : fields) { | ||
| createSameElementReplica(node); | ||
| visitNodes(node.getChildren()); | ||
| } | ||
| } |
There was a problem hiding this comment.
splitSameElementByFieldsNoForce creates fields.size() identical SameElement copies under an OR, each visiting the same children without any field filtering or forcing. Since field_id is [[maybe_unused]], each iteration produces an identical subtree. This is semantically incorrect for variant (c) — the intent is presumably to create per-field SameElements, but the children aren't constrained to the corresponding field. If the intent is indeed to produce identical copies, this is redundant and wasteful. If the intent is per-field splitting, the field_id should be used (e.g., passed to splitAndVisitChildrenForField). Consider either: (1) forcing children like variant (b), or (2) collapsing this to a single SameElement without OR if truly no per-field distinction is needed.
| ProtonSameElement& createSameElementReplica(ProtonSameElement &node) { | ||
| auto &replica = _builder.addSameElement(node.getChildren().size(), node.getView(), | ||
| node.getId(), node.getWeight()); | ||
| replica.set_expensive(node.is_expensive()); | ||
| return replica; | ||
| } |
There was a problem hiding this comment.
createSameElementReplica is missing two properties from the original SameElement node:
-
element_filter: TheaddSameElementbuilder method accepts an optionalelement_filterparameter (seequerybuilder.h:305), but the original node'sget_element_filter()is not passed. If the original SameElement has a non-empty element_filter, the replicas will lose this property. -
expose_match_data_for_same_element: TheSameElementModifierruns before the field splitter (seequery.cpp:196-202) and may set this flag tofalseon the original node. The replica always gets the default valuetrue, which could cause incorrect match data exposure.
Both should be copied from the original node to the replica.
| void visit(ProtonAndNot &node) override { | ||
| _builder.addAndNot(node.getChildren().size()); | ||
| visitNodes(node.getChildren()); | ||
| } |
There was a problem hiding this comment.
visit(ProtonAndNot) does not copy the elementwise flag from the original node to the replica. The SameElementModifier (which runs before the field splitter, see query.cpp:196-197) sets elementwise = true on ProtonAndNot nodes inside SameElement (see sameelementmodifier.cpp:76). The replica ProtonAndNot will always have elementwise = false (the default), leading to incorrect behavior for AndNot nodes inside SameElement after splitting.
| } | ||
| } | ||
|
|
||
| // Helper to build field-to-children mapping for Equiv nodes |
There was a problem hiding this comment.
The comment says "Helper to build field-to-children mapping for Equiv nodes" but buildFieldToChildrenMap is also used for Near and ONear nodes. The comment should be more general, e.g., "Helper to build field-to-children mapping for nodes that group children by field".
- Copy element_filter and expose_match_data_for_same_element in createSameElementReplica - Copy elementwise flag in visit(ProtonAndNot) - Generalize comment on buildFieldToChildrenMap
…ring For SameElement variant (c) - multiple fields with mixed-field children - create one SameElement replica per parent field and filter each child's fields to only those whose names match the parent field name (exact or dot-separated sub-field prefix). Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ld filtering" This reverts commit 0b69f90.
Two bugs fixed in FieldSplitter's SameElement handling: 1. splitSameElementByFields was creating per-field SameElement replicas without copying the field entry (ProtonTermData) from the original node. Added the missing copyProtonTermDataForField call so each replica gets its correct field. The method now derives the field list from node.numFields() directly instead of requiring a separate fields set parameter. 2. The variant-(c) path (splitSameElementByFieldsNoForce) was structurally wrong: it split the SameElement into per-field replicas but then visited children without forcing a field, so children would be split across all their own fields independently -- producing query trees where the per-field SameElements contained children bound to the wrong fields. Since this case (mixed child field sets inside a multi-field SameElement) should not arise in valid queries, it is now treated as an error with a safe pass-through fallback.
Note that I chose different semantics for NEAR with asymmetric children, but otherwise I don’t know any reason why we can’t use this.