
[zephyr] Honor num_output_shards when input has more shards#5166

Merged
ravwojdyla merged 2 commits into main from agent/20260424-fix-5162 on Apr 25, 2026
Conversation

@ravwojdyla-agent
Contributor

Scatter routes records into exactly num_output_shards buckets via hash(key) % num_output_shards. The reduce stage previously spawned max(input_shards, num_output_shards) tasks; with M input shards and N output shards where M > N, the extra reduce tasks ran on a shard_idx no record hashes to and emitted empty output files. For coderforge normalize this produced 643 empty parquet shards alongside the 29 real ones. Adds a regression test.

Fixes #5162

For marin#5162: coderforge normalize had 672 input files and a configured 29 output shards, producing 643 empty parquet shards alongside the 29 real ones.
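A minimal sketch of the failure mode described above, using the shard counts from the issue. The keys are made up for illustration; only the modulo routing and the max(...) task count come from the PR description.

```python
# Scatter routes every record into one of num_output_shards buckets,
# but the old reduce stage spawned max(input_shards, num_output_shards)
# tasks. Shard counts match issue #5162; keys are illustrative.
input_shards = 672
num_output_shards = 29

keys = [f"record-{i}" for i in range(10_000)]
populated = {hash(key) % num_output_shards for key in keys}

reduce_tasks = max(input_shards, num_output_shards)  # old behavior
empty_tasks = [i for i in range(reduce_tasks) if i not in populated]

# Every record hashes into [0, 29), so tasks 29..671 can never receive
# a record and emit empty output files.
print(len(empty_tasks))  # 643 empty output shards
```

The fix spawns exactly num_output_shards reduce tasks for scatter stages, so no task is created for a shard_idx that the hash can never produce.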
Comment thread on lib/zephyr/src/zephyr/execution.py (Outdated)
else:
    num_output = max(max(result_refs.keys(), default=0) + 1, input_shard_count)
    if output_shard_count is not None:
        num_output = max(num_output, output_shard_count)
Collaborator


The behavior here is a little weird for non-scatter cases. It seems like for non-scatter we should just ignore output shards? If a user wants a different shard count we should be going through a reshard?

Contributor

@ravwojdyla Apr 25, 2026


good point - that's cleaner and more direct - will update

stage.output_shards is only ever set on Scatter (via group_by) and
Reshard stages, and Reshard short-circuits before reaching
_regroup_result_refs. The non-scatter branch's max-with-output_shard_count
was dead code suggesting semantics that don't exist; resharding belongs
to ReshardOp.
@ravwojdyla ravwojdyla enabled auto-merge (squash) April 25, 2026 00:25
@ravwojdyla ravwojdyla merged commit ad5927e into main Apr 25, 2026
36 checks passed
@ravwojdyla ravwojdyla deleted the agent/20260424-fix-5162 branch April 25, 2026 00:29

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[datakit] Normalize produces empty parquet shards for 5 sources

3 participants