
[zephyr] Honor num_output_shards when input has more shards#5166

Merged
ravwojdyla merged 2 commits into main from agent/20260424-fix-5162 on Apr 25, 2026
Conversation

@ravwojdyla-agent
Contributor

Scatter routes records into exactly num_output_shards buckets via hash(key) % num_output_shards. The reduce stage previously spawned max(input_shards, num_output_shards) tasks; with M input shards and N output shards where M > N, the extra reduce tasks ran on a shard_idx no record hashes to and emitted empty output files. For coderforge normalize this produced 643 empty parquet shards alongside the 29 real ones. Adds a regression test.

Fixes #5162

For marin#5162: coderforge normalize had 672 input files and a configured 29 output shards, producing 643 empty parquet shards alongside the 29 real ones.
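A minimal sketch of the failure mode described above, using the shard counts from the issue. The keys are made up for illustration; only the modulo routing and the max(...) task count come from the PR description.

```python
# Scatter routes every record into one of num_output_shards buckets,
# but the old reduce stage spawned max(input_shards, num_output_shards)
# tasks. Shard counts match issue #5162; keys are illustrative.
input_shards = 672
num_output_shards = 29

keys = [f"record-{i}" for i in range(10_000)]
populated = {hash(key) % num_output_shards for key in keys}

reduce_tasks = max(input_shards, num_output_shards)  # old behavior
empty_tasks = [i for i in range(reduce_tasks) if i not in populated]

# Every record hashes into [0, 29), so tasks 29..671 can never receive
# a record and emit empty output files.
print(len(empty_tasks))  # 643 empty output shards
```

The fix spawns exactly num_output_shards reduce tasks for scatter stages, so no task is created for a shard_idx that the hash can never produce.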
Comment thread on lib/zephyr/src/zephyr/execution.py (Outdated)
else:
    num_output = max(max(result_refs.keys(), default=0) + 1, input_shard_count)
    if output_shard_count is not None:
        num_output = max(num_output, output_shard_count)
Collaborator


The behavior here is a little weird for non-scatter cases. It seems like for non-scatter we should just ignore output shards? If a user wants a different shard count we should be going through a reshard?

Contributor

@ravwojdyla Apr 25, 2026


good point - that's cleaner and more direct - will update

stage.output_shards is only ever set on Scatter (via group_by) and
Reshard stages, and Reshard short-circuits before reaching
_regroup_result_refs. The non-scatter branch's max-with-output_shard_count
was dead code suggesting semantics that don't exist; resharding belongs
to ReshardOp.
@ravwojdyla ravwojdyla enabled auto-merge (squash) April 25, 2026 00:25
@ravwojdyla ravwojdyla merged commit ad5927e into main Apr 25, 2026
36 checks passed
@ravwojdyla ravwojdyla deleted the agent/20260424-fix-5162 branch April 25, 2026 00:29

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[datakit] Normalize produces empty parquet shards for 5 sources

3 participants