[dedup] Replace side-effect mutation with reduce in fuzzy dedup map#3938
…th reduce
Use Zephyr's reduce() to build the duplicate map dict as a pipeline return value instead of mutating a closure dict via a map() side effect.
Claude finished @rjpower's task in 55s
PR Review: [dedup] Replace side-effect mutation with reduce in fuzzy dedup map
No issues found.
yonromai
left a comment
Noice, I'll rebase the other PR.
…4076)

## Summary

Four zephyr tests relied on closure mutation (CallCounter, nonlocal counters) to verify execution via side effects. This pattern only works when the pipeline runs in-process with no serialization boundary; it breaks under any cloudpickle round-trip (distributed backends, or config-to-disk as in #3910). Replace with assertions on output file contents and modification times.

Companion to #3938, which fixed the same pattern in production code (`_load_fuzzy_dupe_map_shard`).

## Test plan

- [ ] `uv run --package zephyr pytest lib/zephyr/tests/test_dataset.py -k "test_lazy_evaluation or test_skip_existing"` (4 passed)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: yoblin <268258002+yoblin@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…3938) _load_fuzzy_dupe_map_shard built a dict by mutating a closure variable via .map() side effect, which only works with LocalClient (shared memory). Replace with .reduce() so the dict flows through the Zephyr pipeline as a proper return value. Also removes a stale comment from _compute_fuzzy_dedup_stats. Co-authored-by: Romain Yon <yonromai@users.noreply.github.com>
_load_fuzzy_dupe_map_shard built a dict by mutating a closure variable
via .map() side effect, which only works with LocalClient (shared memory).
Replace with .reduce() so the dict flows through the Zephyr pipeline as a
proper return value. Also removes a stale comment from _compute_fuzzy_dedup_stats.
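
For readers unfamiliar with the failure mode, here is a minimal sketch in plain Python, not the actual Zephyr code. It assumes that a non-local backend hands each worker a deserialized copy of the mapped callable; `copy.deepcopy` stands in for that round-trip, and `functools.reduce` stands in for the pipeline's reduce step. The `Collector` class, the `merge` function, and the `id` / `canonical_id` record fields are hypothetical names chosen for illustration.

```python
import copy
from functools import reduce


class Collector:
    """Hypothetical stand-in for a closure that mutates captured state."""

    def __init__(self):
        self.seen = {}

    def __call__(self, record):
        # Side effect: write into state owned by the callable.
        self.seen[record["id"]] = record["canonical_id"]
        return record


def merge(acc, record):
    """Pure reducer: returns a new dict instead of mutating shared state."""
    return {**acc, record["id"]: record["canonical_id"]}


# Hypothetical records; the field names are assumptions for this sketch.
records = [
    {"id": "a", "canonical_id": "a"},
    {"id": "b", "canonical_id": "a"},
]

# Side-effect style. deepcopy simulates the serialize/deserialize boundary:
# the "worker" mutates its own copy, so the caller's dict stays empty.
collector = Collector()
worker_copy = copy.deepcopy(collector)
for record in records:
    worker_copy(record)
side_effect_result = collector.seen

# Reduce style. The mapping is the return value, so no shared state is needed.
reduce_result = reduce(merge, records, {})
```

In this sketch `side_effect_result` is empty while `reduce_result` holds both entries, which is the difference the description above attributes to running with and without shared memory.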