Fix GROUPBY to implicitly load group key fields without explicit LOAD #997

Open
cnuthalapati wants to merge 2 commits into valkey-io:main from cnuthalapati:bugfix/groupby-implicit-load-keys
@cnuthalapati

Summary

  • Fix FT.AGGREGATE GROUPBY to implicitly load group key fields into the output when no explicit LOAD clause covers them
  • Only implicitly loads fields that exist in the index schema, so chained GROUPBYs using derived fields (reducer outputs) don't break
  • Without this fix, GROUPBY results omit the group key fields unless the user redundantly specifies them in LOAD

Test plan

  • Run existing compatibility tests with regenerated answers from the issue's test additions
  • Verify FT.AGGREGATE idx * GROUPBY 1 @field REDUCE COUNT 0 AS count returns both field and count
  • Verify partial LOAD case: FT.AGGREGATE idx * LOAD 1 @price GROUPBY 1 @category REDUCE COUNT 0 AS count returns all three fields
  • Verify chained GROUPBY: FT.AGGREGATE idx * GROUPBY 1 @category REDUCE COUNT 0 AS count GROUPBY 1 @count REDUCE COUNT 0 AS num doesn't error
  • Verify explicit LOAD of GROUPBY key still works (no duplication)

Fixes #919

@allenss-amazon

Member

I wonder if this is a more generic problem. Will fields used in an expression (like in an APPLY) also require being in the LOAD?

@cnuthalapati

Author

Yes, APPLY expressions referencing fields not in LOAD will evaluate against empty values. However, the two cases warrant different treatment.

GROUPBY keys are required to interpret the data. They define what each output row represents. Without the grouping key in output, results are uninterpretable: you get aggregates with no labels. Implicitly loading them is the correct behavior because the user cannot make sense of the output otherwise.

APPLY inputs are only computational. The user asked for the computed result (the AS field), not the source fields. Whether source fields also appear in output is an explicit choice via LOAD.

This PR fixes the GROUPBY case. The APPLY behavior (requiring explicit LOAD for expression inputs) is by design. We should not implicitly load computation fields.

When no LOAD clause covers GROUPBY key fields, the serializer skips them
because ManipulateReturnsClause sets no_content=true. This adds GROUPBY
key fields to the load list automatically, matching expected behavior.

Only schema fields are implicitly loaded, so chained GROUPBYs using
derived fields (reducer outputs) are handled safely.

Fixes valkey-io#919

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Chaitanya Nuthalapati <cnu@amazon.com>
cnuthalapati force-pushed the bugfix/groupby-implicit-load-keys branch from 3311b48 to 70ba8f4 on April 29, 2026 22:22
@greptile-apps

greptile-apps Bot commented May 11, 2026

Greptile Summary

This PR fixes a bug where FT.AGGREGATE ... GROUPBY 1 @field would silently drop the group key field from the output unless the user also specified an explicit LOAD @field clause. The fix pre-populates loads_to_process with any GROUPBY group key that (a) exists in the index schema and (b) is not already covered by an explicit LOAD.

  • Iterates all pipeline stages, collects group key names from every GroupBy stage, and appends them to the existing loads_ vector (with deduplication via std::find) before the attribute-loading loop runs.
  • Guards against chained-GROUPBY reducer outputs (e.g., AS count) by checking index_schema->GetIndex(name).ok() — derived fields not in the schema are correctly skipped, preventing parse errors on the second GROUPBY.
  • Explicit LOAD of a GROUPBY key is handled without duplication by the std::find deduplication check.
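
The collection step summarized above can be sketched in isolation. This is a minimal illustration, not the code from the PR: `Stage`, `BuildLoadsToProcess`, and the `std::set` standing in for the index schema are hypothetical, and the `__key` and score-alias exclusions are omitted for brevity.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a pipeline stage; only the shape matters here.
struct Stage {
  bool is_group_by;
  std::vector<std::string> group_keys;  // field names without the leading '@'
};

// Returns the explicit loads followed by every GROUPBY key that is a real
// schema field and is not already present in the list.
std::vector<std::string> BuildLoadsToProcess(
    const std::vector<std::string>& explicit_loads,
    const std::vector<Stage>& stages,
    const std::set<std::string>& schema_fields) {
  std::vector<std::string> loads_to_process = explicit_loads;
  for (const Stage& stage : stages) {
    if (!stage.is_group_by) continue;
    for (const std::string& name : stage.group_keys) {
      // Reducer outputs (e.g. "count") are not schema fields; skip them so a
      // chained GROUPBY does not fail on a name the schema cannot resolve.
      if (schema_fields.count(name) == 0) continue;
      // Linear scan keeps explicit LOAD order and avoids duplicates.
      if (std::find(loads_to_process.begin(), loads_to_process.end(), name) ==
          loads_to_process.end()) {
        loads_to_process.push_back(name);
      }
    }
  }
  return loads_to_process;
}
```

With a schema containing `category` and `price`, an explicit load of `price`, and stages grouping first by `category` and then by the derived `count`, the sketch yields `price` followed by `category`; `count` is skipped because it is not a schema field.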

Confidence Score: 5/5

Safe to merge — the change is narrowly scoped to ManipulateReturnsClause and correctly handles all described edge cases.

The fix is logically sound: deduplication via std::find prevents double-processing when a field appears in both LOAD and GROUPBY; the GetIndex guard correctly skips reducer outputs in chained GROUPBYs; and AddRecordAttribute's idempotency check ensures fields already registered during parsing are not re-registered with conflicting state. The existing processing loop at line 90 already calls GetIndex unconditionally on every entry in loads_to_process, so the implicit entries added by the new code go through the same validation path as explicit LOAD entries.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/commands/ft_aggregate.cc | Adds implicit loading of GROUPBY group key fields in ManipulateReturnsClause; logic is correct, deduplication and schema-guard work as intended. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[ManipulateReturnsClause called] --> B{loadall_?}
    B -->|yes| C[Return OK — all fields fetched]
    B -->|no| D[loads_to_process = copy of params.loads_]
    D --> E[Iterate all pipeline stages]
    E --> F{Stage is GroupBy?}
    F -->|no| G[Skip stage]
    G --> E
    F -->|yes| H[Iterate group key attributes]
    H --> I{name == __key or score_as?}
    I -->|yes| J[Skip attribute]
    J --> H
    I -->|no| K{index_schema→GetIndex ok?}
    K -->|no — derived/reducer field| L[Skip attribute]
    L --> H
    K -->|yes — real indexed field| M{Already in loads_to_process?}
    M -->|yes| N[Skip — no duplicate]
    N --> H
    M -->|no| O[Append name to loads_to_process]
    O --> H
    H -->|done| E
    E -->|done| P[Process loads_to_process loop]
    P --> Q[For each load: GetIndex + build return_attributes]
    Q --> R[Set params.no_content accordingly]
    R --> S[Return OK]

Reviews (2): Last reviewed commit: "Merge branch 'main' into bugfix/groupby-..."

Comment on lines +70 to +74
      if (!params.index_schema->GetIndex(name).ok()) continue;
      if (std::find(loads_to_process.begin(), loads_to_process.end(), name) ==
          loads_to_process.end()) {
        loads_to_process.push_back(name);
      }

P2 Redundant GetIndex lookup per implicitly-added field

GetIndex(name) is called here to guard against non-schema fields, but it is called a second time unconditionally at line 90 (VMSDK_ASSIGN_OR_RETURN(auto indexer, params.index_schema->GetIndex(load))) for the same field. Because every name added to loads_to_process in this block has already passed the ok() check, the second lookup is guaranteed to succeed and is wasted work. Consider caching the result from the first call or restructuring so the lookup is performed only once.
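
One way to perform the lookup only once is to keep the resolved handle next to the field name. The sketch below is illustrative only: `Schema`, `Indexer`, `Find`, and `ResolveOnce` are hypothetical stand-ins rather than the project's API, and it covers only implicitly added keys. Explicit LOAD entries would still need to return an error for unknown fields, as the existing loop does.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the handle a schema lookup returns.
struct Indexer {
  int id;
};

// Hypothetical schema that counts lookups so the saving is observable.
struct Schema {
  std::map<std::string, Indexer> fields;
  mutable int lookups = 0;

  // Returns nullptr when the name is not a schema field.
  const Indexer* Find(const std::string& name) const {
    ++lookups;
    auto it = fields.find(name);
    return it == fields.end() ? nullptr : &it->second;
  }
};

using ResolvedLoad = std::pair<std::string, const Indexer*>;

// Resolves each candidate exactly once and stores the handle with the name,
// so a later processing loop can reuse it instead of looking it up again.
std::vector<ResolvedLoad> ResolveOnce(
    const Schema& schema, const std::vector<std::string>& candidates) {
  std::vector<ResolvedLoad> resolved;
  for (const std::string& name : candidates) {
    const Indexer* indexer = schema.Find(name);
    if (indexer == nullptr) continue;  // derived field, not in the schema
    resolved.emplace_back(name, indexer);
  }
  return resolved;
}
```

The trade-off is that the load list stops being a plain vector of names, which would touch the signature of whatever consumes it; whether that is worth one saved lookup per implicit key is a judgment call.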

@KarthikSubbarao

Member

Can we add a test in test_non_vector.py with the commands we want to run here? @cnuthalapati

@coderabbitai

coderabbitai Bot commented May 17, 2026

No actionable comments were generated in the recent review. 🎉

📥 Commits

Reviewing files that changed from the base of the PR and between da3872f and 5ddb98c.

📒 Files selected for processing (1)
  • src/commands/ft_aggregate.cc

Walkthrough

This PR fixes a bug where FT.AGGREGATE ... GROUPBY without an explicit LOAD clause omits group key fields from output. The fix augments ManipulateReturnsClause to implicitly load GROUPBY key attributes by scanning aggregation stages and adding group field names to the return set, matching Redis Stack behavior.

Changes

GROUPBY without LOAD returns

| Layer / File(s) | Summary |
| --- | --- |
| Implicit GROUPBY key field loading (src/commands/ft_aggregate.cc) | Added <algorithm> and <vector> headers. Enhanced ManipulateReturnsClause to construct a loads_to_process list from params.loads_ and augment it with group attributes from GroupBy stages (excluding __key, score alias, and attributes without available indices), then deduplicate and use it for return attribute population. |

Pre-merge checks: ✅ 5 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: fixing GROUPBY to implicitly load group key fields without explicit LOAD, which directly matches the code modification in the raw summary. |
| Description check | ✅ Passed | The description is directly related to the changeset, providing clear context about what was fixed, test scenarios, and the referenced issue #919. |
| Linked Issues check | ✅ Passed | The code changes fully implement the proposed fix from issue #919: detecting GROUPBY stages, extracting group key field names, and adding them to loads_to_process when no LOAD clause exists, with proper exclusions for __key and score fields. |
| Out of Scope Changes check | ✅ Passed | All changes are within scope of issue #919: adding headers for container operations, extracting group keys from GROUPBY stages, and deduplicating the loads list. No unrelated modifications detected. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

Development

Successfully merging this pull request may close these issues.

[BUG] FT.AGGREGATE GROUPBY Without LOAD Produces Missing Group Key Fields

3 participants