Skip to content

[dedup] Replace side-effect mutation with reduce in fuzzy dedup map#3938

Merged
rjpower merged 4 commits intomainfrom
fix/fuzzy-dedup-local-state
Mar 23, 2026
Merged

[dedup] Replace side-effect mutation with reduce in fuzzy dedup map#3938
rjpower merged 4 commits intomainfrom
fix/fuzzy-dedup-local-state

Conversation

@rjpower
Copy link
Copy Markdown
Collaborator

@rjpower rjpower commented Mar 21, 2026

_load_fuzzy_dupe_map_shard built a dict by mutating a closure variable
via .map() side effect, which only works with LocalClient (shared memory).
Replace with .reduce() so the dict flows through the Zephyr pipeline as a
proper return value. Also removes a stale comment from _compute_fuzzy_dedup_stats.

…th reduce

Use Zephyr's reduce() to build the duplicate map dict as a pipeline
return value instead of mutating a closure dict via map() side effect.
@rjpower rjpower added the agent-generated Created by automation/agent label Mar 21, 2026
@rjpower rjpower requested a review from yonromai March 21, 2026 01:31
@chatgpt-codex-connector
Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Mar 21, 2026

Claude finished @rjpower's task in 55s —— View job


PR Review: [dedup] Replace side-effect mutation with reduce in fuzzy dedup map

  • Read the changed file and diff
  • Analyze correctness
  • Post review

Replaces a .map() side-effect pattern (mutating a closure dict) in _load_fuzzy_dupe_map_shard with .reduce(local_reducer, global_reducer), making the data flow explicit. Also removes a stale comment from _compute_fuzzy_dedup_stats. The change fulfills its stated objective.

Correctness: The reduce API (dataset.py:857) takes local_reducer: Callable[[Iterator[T]], R] and global_reducer: Callable[[Iterator[R]], R]. The PR's lambdas match these signatures — local_reducer builds a dict from an iterator of records, global_reducer merges an iterator of dicts. The [0] indexing to unwrap the single-element result list matches the existing pattern in _compute_fuzzy_dedup_stats (line 49).

No issues found.

Copy link
Copy Markdown
Contributor

@yonromai yonromai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noice, I'll rebase the other PR.

@rjpower rjpower merged commit 18f02a4 into main Mar 23, 2026
38 checks passed
@rjpower rjpower deleted the fix/fuzzy-dedup-local-state branch March 23, 2026 22:28
yonromai added a commit that referenced this pull request Mar 23, 2026
…4076)

## Summary

Four zephyr tests relied on closure mutation (CallCounter, nonlocal
counters) to verify execution via side effects. This pattern only works
when the pipeline runs in-process with no serialization boundary — it
breaks under any cloudpickle round-trip (distributed backends, or
config-to-disk as in #3910).

Replace with assertions on output file contents and modification times.

Companion to #3938 which fixed the same pattern in production code
(`_load_fuzzy_dupe_map_shard`).

## Test plan

- [ ] `uv run --package zephyr pytest lib/zephyr/tests/test_dataset.py
-k "test_lazy_evaluation or test_skip_existing"` — 4 passed


🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: yoblin <268258002+yoblin@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Helw150 pushed a commit that referenced this pull request Apr 8, 2026
…3938)

_load_fuzzy_dupe_map_shard built a dict by mutating a closure variable
via .map() side effect, which only works with LocalClient (shared
memory).
Replace with .reduce() so the dict flows through the Zephyr pipeline as
a
proper return value. Also removes a stale comment from
_compute_fuzzy_dedup_stats.

Co-authored-by: Romain Yon <yonromai@users.noreply.github.com>
Helw150 pushed a commit that referenced this pull request Apr 8, 2026
…4076)

## Summary

Four zephyr tests relied on closure mutation (CallCounter, nonlocal
counters) to verify execution via side effects. This pattern only works
when the pipeline runs in-process with no serialization boundary — it
breaks under any cloudpickle round-trip (distributed backends, or
config-to-disk as in #3910).

Replace with assertions on output file contents and modification times.

Companion to #3938 which fixed the same pattern in production code
(`_load_fuzzy_dupe_map_shard`).

## Test plan

- [ ] `uv run --package zephyr pytest lib/zephyr/tests/test_dataset.py
-k "test_lazy_evaluation or test_skip_existing"` — 4 passed


🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: yoblin <268258002+yoblin@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants