[zephyr/iris] Fix ConfigMap size limit for large pipelines#3910

Merged: yonromai merged 2 commits into main from configmap-limit on Mar 23, 2026

@yonromai (Contributor) commented Mar 20, 2026

Summary

Fixes #3908 — coordinator-as-job fails with misleading "Pod not found" when pipelines have >~20K source items, because the serialized PhysicalPlan exceeds the K8s ConfigMap 3 MiB limit.

Two commits:

  1. [zephyr] Upload coordinator config to object storage. The full _CoordinatorJobConfig is written to {chunk_storage_prefix}/{execution_id}/job-config.pkl. The Entrypoint pickle contains only two string URLs (config_path, result_path), keeping the ConfigMap payload trivially small regardless of dataset size.
  2. [iris] Propagate KubectlError from pod apply as a TaskUpdate instead of crashing the sync cycle and later surfacing as "Pod not found". Includes a transition test verifying the ASSIGNED → FAILED path through the real TransitionManager.
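
A minimal sketch of the indirection described in commit 1, for readers who want to see why the entrypoint payload stops growing with the dataset. It is not the zephyr implementation: a local temporary directory stands in for object storage, plain dicts stand in for `_CoordinatorJobConfig` and the `Entrypoint` pickle, and `result.pkl`, `upload_config`, `load_config`, and `payload_sizes` are hypothetical names.

```python
import pickle
import tempfile
from pathlib import Path


def upload_config(config: dict, storage_prefix: str) -> dict:
    """Write the full config under {prefix}/{execution_id}/ and return a small pointer."""
    job_dir = Path(storage_prefix) / config["execution_id"]
    job_dir.mkdir(parents=True, exist_ok=True)
    config_path = job_dir / "job-config.pkl"
    config_path.write_bytes(pickle.dumps(config))
    # Only two strings travel with the entrypoint, whatever the plan size.
    # "result.pkl" is an assumed file name for illustration.
    return {"config_path": str(config_path), "result_path": str(job_dir / "result.pkl")}


def load_config(pointer: dict) -> dict:
    """What the coordinator job would do on startup: fetch and unpickle the config."""
    return pickle.loads(Path(pointer["config_path"]).read_bytes())


def payload_sizes(num_items: int) -> tuple[int, int]:
    """Return (bytes if the config were inlined, bytes of the pointer payload)."""
    plan = [f"gs://example-bucket/shard-{i:06d}.parquet" for i in range(num_items)]
    config = {"execution_id": "exec-1", "plan": plan}
    with tempfile.TemporaryDirectory() as prefix:
        pointer = upload_config(config, prefix)
        assert load_config(pointer) == config  # round trip preserves the config
        return len(pickle.dumps(config)), len(pickle.dumps(pointer))


if __name__ == "__main__":
    inline_size, pointer_size = payload_sizes(20_000)
    print(f"inline: {inline_size} bytes, pointer: {pointer_size} bytes")
```

The inlined size scales with the number of source items, while the pointer stays at a few hundred bytes because it only carries the two paths.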

Test plan

  • uv run --package iris pytest lib/iris/tests/cluster/providers/k8s/test_provider.py — new test verifies KubectlError produces a TASK_STATE_FAILED update with the real error string
  • uv run --package iris pytest lib/iris/tests/cluster/controller/test_direct_controller.py::test_apply_failed_directly_from_assigned — verifies ASSIGNED → FAILED transition
  • Full zephyr test suite
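
The ASSIGNED → FAILED check in the second bullet can be pictured with a toy transition table. This is only a sketch under assumptions: the real `TransitionManager` API is not shown on this page, the states other than ASSIGNED and FAILED are guesses, and `ALLOWED` and `transition` are hypothetical names.

```python
from enum import Enum, auto


class State(Enum):
    PENDING = auto()   # assumed state, for illustration
    ASSIGNED = auto()
    RUNNING = auto()   # assumed state, for illustration
    FAILED = auto()


# Assumed transition table: the point is that FAILED is reachable
# directly from ASSIGNED, without passing through RUNNING.
ALLOWED = {
    State.PENDING: {State.ASSIGNED},
    State.ASSIGNED: {State.RUNNING, State.FAILED},
    State.RUNNING: {State.FAILED},
    State.FAILED: set(),
}


def transition(current: State, target: State) -> State:
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


def test_failed_directly_from_assigned() -> None:
    assert transition(State.ASSIGNED, State.FAILED) is State.FAILED
```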

🤖 Generated with Claude Code

@yonromai added the bug (Something isn't working) and agent-generated (Created by automation/agent) labels Mar 20, 2026
@yonromai yonromai requested review from rjpower and removed request for rjpower March 20, 2026 18:00
@rjpower (Collaborator) left a comment

:shipit:

@yonromai yonromai changed the base branch from main to rav/fix-fuzzy-dedup-ray March 20, 2026 19:24
@ravwojdyla ravwojdyla force-pushed the rav/fix-fuzzy-dedup-ray branch 2 times, most recently from 9536e33 to 1af1f9e Compare March 20, 2026 20:06
Base automatically changed from rav/fix-fuzzy-dedup-ray to main March 20, 2026 20:24
@yonromai yonromai force-pushed the configmap-limit branch 4 times, most recently from 6cf2cd8 to dfdc21f Compare March 21, 2026 00:47
Comment thread on lib/zephyr/src/zephyr/execution.py (outdated)
# state (e.g. _load_fuzzy_dupe_map_shard).
from fray.v2.local_backend import LocalClient

if isinstance(self.client, LocalClient):
@yonromai (Contributor, Author) commented

@rjpower this part sucks, didn't see it coming :/

Collaborator

oh man, let's file an issue to fix _load_fuzzy_dupe_map_shard.

we shouldn't support things that break the zephyr abstraction.

@yonromai (Contributor, Author) commented Mar 23, 2026

@ravwojdyla Thoughts ^

Edit: See #3938

@yonromai yonromai requested a review from rjpower March 21, 2026 01:10
@rjpower (Collaborator) left a comment

lgtm, let's fix the fuzzy dedup that seems not good to support in general

@rjpower (Collaborator) commented Mar 21, 2026

🤖 Opened #3938 to fix the local-state mutation in _load_fuzzy_dupe_map_shard — replaces the side-effect .map() with .reduce() so the dict flows through Zephyr properly.
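
To make the problem concrete, here is a rough sketch of why a side-effect map loses its state across a serialization boundary while a reduce does not. It uses `pickle` and `functools.reduce` from the standard library rather than Zephyr's own `.map()` and `.reduce()`, whose signatures are not shown in this thread; `side_effect_map`, `load_shard`, and `merge` are hypothetical names.

```python
import pickle
from functools import reduce


def side_effect_map(driver_state: dict, shard: list[tuple[str, int]]) -> None:
    """Simulate a worker that mutates state it received via serialization."""
    worker_state = pickle.loads(pickle.dumps(driver_state))  # worker gets a copy
    for key, value in shard:
        worker_state[key] = value
    # worker_state is dropped here; the driver's dict never sees the writes.


def load_shard(shard: list[tuple[str, int]]) -> dict:
    """Return the shard's mapping as a value instead of mutating shared state."""
    return dict(shard)


def merge(left: dict, right: dict) -> dict:
    return {**left, **right}


shards = [[("doc-a", 1), ("doc-b", 1)], [("doc-c", 2)]]

# Side-effect version: the driver-side dict stays empty.
driver_state: dict = {}
for shard in shards:
    side_effect_map(driver_state, shard)

# Reduce version: each partial result crosses the boundary as data and is merged.
partials = (pickle.loads(pickle.dumps(load_shard(s))) for s in shards)
dupe_map = reduce(merge, partials, {})
```

After running this, `driver_state` is still empty while `dupe_map` holds all three entries, which is the behavior difference the fix relies on.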

@yonromai yonromai changed the base branch from main to fix-closure-tests March 23, 2026 22:55
Base automatically changed from fix-closure-tests to main March 23, 2026 22:56
yonromai added a commit that referenced this pull request Mar 23, 2026
…4076)

## Summary

Four zephyr tests relied on closure mutation (CallCounter, nonlocal
counters) to verify execution via side effects. This pattern only works
when the pipeline runs in-process with no serialization boundary — it
breaks under any cloudpickle round-trip (distributed backends, or
config-to-disk as in #3910).

Replace with assertions on output file contents and modification times.

Companion to #3938 which fixed the same pattern in production code
(`_load_fuzzy_dupe_map_shard`).

## Test plan

- [ ] `uv run --package zephyr pytest lib/zephyr/tests/test_dataset.py
-k "test_lazy_evaluation or test_skip_existing"` — 4 passed


🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: yoblin <268258002+yoblin@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
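
For illustration, a skip-existing check that asserts on file contents and modification times instead of a call counter might look like the sketch below. It is not the code from `test_dataset.py`; `write_output` and `check_skip_existing` are hypothetical helpers written only for this example.

```python
import os
import tempfile
from pathlib import Path


def write_output(path: Path, rows: list[str], skip_existing: bool = True) -> None:
    """Write rows to path unless the file exists and skip_existing is set."""
    if skip_existing and path.exists():
        return
    path.write_text("\n".join(rows))


def check_skip_existing() -> bool:
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "part-0.txt"
        write_output(out, ["a", "b"])

        # Pin the mtime to an old value so any rewrite is detectable.
        os.utime(out, ns=(1_000_000_000, 1_000_000_000))
        pinned = out.stat().st_mtime_ns

        # Second run should be skipped: same contents, same mtime.
        write_output(out, ["changed"])
        skipped = out.read_text() == "a\nb" and out.stat().st_mtime_ns == pinned

        # A forced run should rewrite the file and bump the mtime.
        write_output(out, ["changed"], skip_existing=False)
        rewritten = out.read_text() == "changed" and out.stat().st_mtime_ns > pinned

        return skipped and rewritten
```

Because the assertions only look at files on disk, the check gives the same answer whether the pipeline ran in-process or on the far side of a cloudpickle round trip.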
yoblin and others added 2 commits March 23, 2026 22:57
… pickle

The entire _CoordinatorJobConfig (including the full PhysicalPlan) is
uploaded to object storage as job-config.pkl. The Entrypoint pickle
contains only two string URLs (config_path, result_path), keeping the
K8s ConfigMap payload trivially small regardless of dataset size.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

When kubectl apply fails (e.g. ConfigMap exceeds 3 MiB), catch the
KubectlError and return it as a TASK_STATE_FAILED update with the real
error message. Previously the error propagated out of sync(), crashed
the cycle, and the task later surfaced as misleading "Pod not found".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
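
The shape of that change can be sketched as follows. This is an assumption-laden illustration, not the iris provider code: `KubectlError`, `TaskUpdate`, and `sync` here are local stand-ins with made-up fields, and the error text is simulated.

```python
from dataclasses import dataclass
from typing import Callable


class KubectlError(Exception):
    """Local stand-in for the provider's kubectl failure type."""


@dataclass(frozen=True)
class TaskUpdate:
    task_id: str
    state: str
    error: str = ""


def sync(task_ids: list[str], apply_pod: Callable[[str], None]) -> list[TaskUpdate]:
    """Apply each pod; turn apply failures into updates instead of raising."""
    updates = []
    for task_id in task_ids:
        try:
            apply_pod(task_id)
        except KubectlError as exc:
            # Report the real error on the task and keep the cycle going.
            updates.append(TaskUpdate(task_id, "TASK_STATE_FAILED", error=str(exc)))
    return updates


applied: list[str] = []


def fake_apply(task_id: str) -> None:
    """Test double: fails for one task, records the others."""
    if task_id == "too-big":
        raise KubectlError("simulated apply failure")
    applied.append(task_id)


updates = sync(["ok-1", "too-big", "ok-2"], fake_apply)
```

Here the failing task yields a single `TASK_STATE_FAILED` update carrying the original message, and `ok-2` is still applied because the exception no longer escapes the loop.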
@yonromai yonromai merged commit ab588aa into main Mar 23, 2026
38 checks passed
@yonromai yonromai deleted the configmap-limit branch March 23, 2026 23:10
Helw150 pushed a commit that referenced this pull request Apr 8, 2026
…4076)
Helw150 pushed a commit that referenced this pull request Apr 8, 2026
## Summary

Fixes #3908 — coordinator-as-job fails with misleading "Pod not found"
when pipelines have >~20K source items, because the serialized
`PhysicalPlan` exceeds the K8s ConfigMap 3 MiB limit.

Two commits:

1. **[zephyr] Upload coordinator config to object storage.** The full
`_CoordinatorJobConfig` is written to
`{chunk_storage_prefix}/{execution_id}/job-config.pkl`. The `Entrypoint`
pickle contains only two string URLs (`config_path`, `result_path`),
keeping the ConfigMap payload trivially small regardless of dataset
size.
2. **[iris] Propagate `KubectlError` from pod apply as a `TaskUpdate`**
instead of crashing the sync cycle and later surfacing as "Pod not
found". Includes a transition test verifying the ASSIGNED → FAILED path
through the real `TransitionManager`.

## Test plan

- [ ] `uv run --package iris pytest
lib/iris/tests/cluster/providers/k8s/test_provider.py` — new test
verifies `KubectlError` produces a `TASK_STATE_FAILED` update with the
real error string
- [ ] `uv run --package iris pytest
lib/iris/tests/cluster/controller/test_direct_controller.py::test_apply_failed_directly_from_assigned`
— verifies ASSIGNED → FAILED transition
- [ ] Full zephyr test suite


🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: yoblin <268258002+yoblin@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

[zephyr/iris] Coordinator-as-job fails with 'Pod not found' for >~20K source items (ConfigMap 3MB limit)

3 participants