Skip to content

Arnej/add field splitter 10#36143

Open
arnej27959 wants to merge 9 commits into
masterfrom
arnej/add-field-splitter-10
Open

Arnej/add field splitter 10#36143
arnej27959 wants to merge 9 commits into
masterfrom
arnej/add-field-splitter-10

Conversation

@arnej27959

Copy link
Copy Markdown
Member

Note that I chose different semantics for NEAR with asymmetric children, but otherwise I don’t know any reason why we can’t use this.

… instances

After view resolution, query terms may reference multiple index fields.
Downstream blueprint building prefers single-field terms. FieldSplitter
bridges this gap by splitting multi-field terms into OR of per-field
copies early in the query processing pipeline.

Handles all query node types: simple terms, Phrase, Equiv, SameElement,
NEAR/ONEAR, and multi-term nodes (WeightedSet, DotProduct, WandTerm,
InTerm, WordAlternatives). Falls back to the original tree on errors.

Add debug logging in BlueprintBuilder for empty blueprint cases.
Includes comprehensive unit tests for all node types and edge cases.
Phrase children (individual words) don't carry field metadata - they
inherit the field from their parent Phrase node. When a Phrase is nested
inside an Equiv, Near, ONear, or SameElement, _force_field_id is set,
and visiting phrase children with it set would cause field lookup to fail.
Fix by saving/clearing/restoring _force_field_id around the child visit,
consistent with the same pattern used elsewhere in the visitor.
When a SameElement has multiple fields but children have different
field sets, split the SameElement into per-field copies without
forcing the children, letting each child handle its own field
splitting independently.
Wire up FieldSplitter::split_terms in the query build step, placed
after field resolution (ResolveViewVisitor) and before ranking term
extraction (NeedsRankingVisitor). This ensures multi-field query
terms are normalized into per-field instances before the rest of
the matching pipeline sees them, enabling field-specific optimizations.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a FieldSplitter utility to the Vespa search query processing pipeline. It transforms query trees so that multi-field terms (terms that can match multiple fields, e.g., via view resolution) are split into separate per-field term instances connected with OR nodes. The splitter runs after ResolveViewVisitor resolves views to fields and before match data allocation, ensuring each term operates on exactly one field for simplified query execution.

Changes:

  • Introduces FieldSplitter with a visitor-based tree transformation that splits multi-field query terms into per-field instances connected by OR nodes, handling various node types including terms, phrases, NEAR/ONEAR, Equiv, SameElement, and multi-term nodes.
  • Integrates the field splitter into the query processing pipeline in Query::buildTree() after view resolution.
  • Adds debug logging to the blueprint builder for EmptyBlueprint creation cases.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
searchcore/src/vespa/searchcore/proton/matching/field_splitter.h New header declaring the FieldSplitter struct with a split_terms static method
searchcore/src/vespa/searchcore/proton/matching/field_splitter.cpp New 824-line implementation with FieldSplitterVisitor handling all query node types
searchcore/src/vespa/searchcore/proton/matching/query.cpp Integrates FieldSplitter::split_terms into Query::buildTree after view resolution
searchcore/src/vespa/searchcore/proton/matching/blueprintbuilder.cpp Adds debug log statements for EmptyBlueprint creation cases
searchcore/src/vespa/searchcore/proton/matching/CMakeLists.txt Adds field_splitter.cpp to the build
searchcore/src/tests/proton/matching/field_splitter_test.cpp New comprehensive test file with 20+ test cases
searchcore/src/tests/proton/matching/CMakeLists.txt Adds new test executable to the build

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +245 to +251
void splitSameElementByFieldsNoForce(ProtonSameElement &node, const std::set<uint32_t> &fields) {
_builder.addOr(fields.size());
for ([[maybe_unused]] uint32_t field_id : fields) {
createSameElementReplica(node);
visitNodes(node.getChildren());
}
}

Copilot AI Mar 10, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

splitSameElementByFieldsNoForce creates fields.size() identical SameElement copies under an OR, each visiting the same children without any field filtering or forcing. Since field_id is [[maybe_unused]], each iteration produces an identical subtree. This is semantically incorrect for variant (c) — the intent is presumably to create per-field SameElements, but the children aren't constrained to the corresponding field. If the intent is indeed to produce identical copies, this is redundant and wasteful. If the intent is per-field splitting, the field_id should be used (e.g., passed to splitAndVisitChildrenForField). Consider either: (1) forcing children like variant (b), or (2) collapsing this to a single SameElement without OR if truly no per-field distinction is needed.

Copilot uses AI. Check for mistakes.
Comment on lines +215 to +220
ProtonSameElement& createSameElementReplica(ProtonSameElement &node) {
auto &replica = _builder.addSameElement(node.getChildren().size(), node.getView(),
node.getId(), node.getWeight());
replica.set_expensive(node.is_expensive());
return replica;
}

Copilot AI Mar 10, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

createSameElementReplica is missing two properties from the original SameElement node:

  1. element_filter: The addSameElement builder method accepts an optional element_filter parameter (see querybuilder.h:305), but the original node's get_element_filter() is not passed. If the original SameElement has a non-empty element_filter, the replicas will lose this property.

  2. expose_match_data_for_same_element: The SameElementModifier runs before the field splitter (see query.cpp:196-202) and may set this flag to false on the original node. The replica always gets the default value true, which could cause incorrect match data exposure.

Both should be copied from the original node to the replica.

Copilot uses AI. Check for mistakes.
Comment on lines +405 to +408
void visit(ProtonAndNot &node) override {
_builder.addAndNot(node.getChildren().size());
visitNodes(node.getChildren());
}

Copilot AI Mar 10, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

visit(ProtonAndNot) does not copy the elementwise flag from the original node to the replica. The SameElementModifier (which runs before the field splitter, see query.cpp:196-197) sets elementwise = true on ProtonAndNot nodes inside SameElement (see sameelementmodifier.cpp:76). The replica ProtonAndNot will always have elementwise = false (the default), leading to incorrect behavior for AndNot nodes inside SameElement after splitting.

Copilot uses AI. Check for mistakes.
}
}

// Helper to build field-to-children mapping for Equiv nodes

Copilot AI Mar 10, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says "Helper to build field-to-children mapping for Equiv nodes" but buildFieldToChildrenMap is also used for Near and ONear nodes. The comment should be more general, e.g., "Helper to build field-to-children mapping for nodes that group children by field".

Copilot uses AI. Check for mistakes.
arnej27959 and others added 5 commits March 11, 2026 10:55
- Copy element_filter and expose_match_data_for_same_element in createSameElementReplica
- Copy elementwise flag in visit(ProtonAndNot)
- Generalize comment on buildFieldToChildrenMap
…ring

For SameElement variant (c) - multiple fields with mixed-field children -
create one SameElement replica per parent field and filter each child's
fields to only those whose names match the parent field name (exact or
dot-separated sub-field prefix).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Two bugs fixed in FieldSplitter's SameElement handling:

1. splitSameElementByFields was creating per-field SameElement replicas without
   copying the field entry (ProtonTermData) from the original node. Added the
   missing copyProtonTermDataForField call so each replica gets its correct field.
   The method now derives the field list from node.numFields() directly instead of
   requiring a separate fields set parameter.

2. The variant-(c) path (splitSameElementByFieldsNoForce) was structurally wrong:
   it split the SameElement into per-field replicas but then visited children
   without forcing a field, so children would be split across all their own fields
   independently -- producing query trees where the per-field SameElements
   contained children bound to the wrong fields. Since this case (mixed child
   field sets inside a multi-field SameElement) should not arise in valid queries,
   it is now treated as an error with a safe pass-through fallback.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants