[zephyr/iris] Fix ConfigMap size limit for large pipelines#3910
Merged
[zephyr/iris] Fix ConfigMap size limit for large pipelines#3910
Conversation
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
4c0ab0f to
e7fd726
Compare
9536e33 to
1af1f9e
Compare
e7fd726 to
4ca8892
Compare
6cf2cd8 to
dfdc21f
Compare
yonromai
commented
Mar 21, 2026
| # state (e.g. _load_fuzzy_dupe_map_shard). | ||
| from fray.v2.local_backend import LocalClient | ||
|
|
||
| if isinstance(self.client, LocalClient): |
Contributor
Author
There was a problem hiding this comment.
@rjpower this part sucks, didn't see it coming :/
Collaborator
There was a problem hiding this comment.
oh man, let's file an issue to fix load_fuzzy_dupe_map_shard.
we shouldn't support things that break the zephyr abstraction.
rjpower
approved these changes
Mar 21, 2026
Collaborator
rjpower
left a comment
There was a problem hiding this comment.
lgtm, let's fix the fuzzy dedup that seems not good to support in general
Collaborator
|
🤖 Opened #3938 to fix the local-state mutation in _load_fuzzy_dupe_map_shard — replaces the side-effect .map() with .reduce() so the dict flows through Zephyr properly. |
dfdc21f to
c28ae61
Compare
1 task
c28ae61 to
3268aad
Compare
yonromai
added a commit
that referenced
this pull request
Mar 23, 2026
…4076) ## Summary Four zephyr tests relied on closure mutation (CallCounter, nonlocal counters) to verify execution via side effects. This pattern only works when the pipeline runs in-process with no serialization boundary — it breaks under any cloudpickle round-trip (distributed backends, or config-to-disk as in #3910). Replace with assertions on output file contents and modification times. Companion to #3938 which fixed the same pattern in production code (`_load_fuzzy_dupe_map_shard`). ## Test plan - [ ] `uv run --package zephyr pytest lib/zephyr/tests/test_dataset.py -k "test_lazy_evaluation or test_skip_existing"` — 4 passed 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: yoblin <268258002+yoblin@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… pickle The entire _CoordinatorJobConfig (including the full PhysicalPlan) is uploaded to object storage as job-config.pkl. The Entrypoint pickle contains only two string URLs (config_path, result_path), keeping the K8s ConfigMap payload trivially small regardless of dataset size. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When kubectl apply fails (e.g. ConfigMap exceeds 3 MiB), catch the KubectlError and return it as a TASK_STATE_FAILED update with the real error message. Previously the error propagated out of sync(), crashed the cycle, and the task later surfaced as misleading "Pod not found". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3268aad to
21f34e1
Compare
Helw150
pushed a commit
that referenced
this pull request
Apr 8, 2026
…4076) ## Summary Four zephyr tests relied on closure mutation (CallCounter, nonlocal counters) to verify execution via side effects. This pattern only works when the pipeline runs in-process with no serialization boundary — it breaks under any cloudpickle round-trip (distributed backends, or config-to-disk as in #3910). Replace with assertions on output file contents and modification times. Companion to #3938 which fixed the same pattern in production code (`_load_fuzzy_dupe_map_shard`). ## Test plan - [ ] `uv run --package zephyr pytest lib/zephyr/tests/test_dataset.py -k "test_lazy_evaluation or test_skip_existing"` — 4 passed 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: yoblin <268258002+yoblin@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Helw150
pushed a commit
that referenced
this pull request
Apr 8, 2026
## Summary Fixes #3908 — coordinator-as-job fails with misleading "Pod not found" when pipelines have >~20K source items, because the serialized `PhysicalPlan` exceeds the K8s ConfigMap 3 MiB limit. Two commits: 1. **[zephyr] Upload coordinator config to object storage.** The full `_CoordinatorJobConfig` is written to `{chunk_storage_prefix}/{execution_id}/job-config.pkl`. The `Entrypoint` pickle contains only two string URLs (`config_path`, `result_path`), keeping the ConfigMap payload trivially small regardless of dataset size. 2. **[iris] Propagate `KubectlError` from pod apply as a `TaskUpdate`** instead of crashing the sync cycle and later surfacing as "Pod not found". Includes a transition test verifying the ASSIGNED → FAILED path through the real `TransitionManager`. ## Test plan - [ ] `uv run --package iris pytest lib/iris/tests/cluster/providers/k8s/test_provider.py` — new test verifies `KubectlError` produces a `TASK_STATE_FAILED` update with the real error string - [ ] `uv run --package iris pytest lib/iris/tests/cluster/controller/test_direct_controller.py::test_apply_failed_directly_from_assigned` — verifies ASSIGNED → FAILED transition - [ ] Full zephyr test suite 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: yoblin <268258002+yoblin@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #3908 — coordinator-as-job fails with misleading "Pod not found" when pipelines have >~20K source items, because the serialized
PhysicalPlanexceeds the K8s ConfigMap 3 MiB limit.Two commits:
_CoordinatorJobConfigis written to{chunk_storage_prefix}/{execution_id}/job-config.pkl. TheEntrypointpickle contains only two string URLs (config_path,result_path), keeping the ConfigMap payload trivially small regardless of dataset size.KubectlErrorfrom pod apply as aTaskUpdateinstead of crashing the sync cycle and later surfacing as "Pod not found". Includes a transition test verifying the ASSIGNED → FAILED path through the realTransitionManager.Test plan
uv run --package iris pytest lib/iris/tests/cluster/providers/k8s/test_provider.py— new test verifiesKubectlErrorproduces aTASK_STATE_FAILEDupdate with the real error stringuv run --package iris pytest lib/iris/tests/cluster/controller/test_direct_controller.py::test_apply_failed_directly_from_assigned— verifies ASSIGNED → FAILED transition🤖 Generated with Claude Code