Skip to content

[iris] Ignore unknown inherited constraint fields#4168

Closed
dlwh wants to merge 23 commits intomainfrom
codex/20260326-fix-4167
Closed

[iris] Ignore unknown inherited constraint fields#4168
dlwh wants to merge 23 commits intomainfrom
codex/20260326-fix-4167

Conversation

@dlwh
Copy link
Copy Markdown
Member

@dlwh dlwh commented Mar 26, 2026

Ignore unknown fields when reconstructing inherited constraints in iris.cluster.client.get_job_info(), so inherited IRIS_JOB_CONSTRAINTS entries with newer schema fields like mode do not crash startup. Add a regression test for the environment parsing path.

Fixes #4167

@dlwh dlwh added bug Something isn't working agent-generated Created by automation/agent labels Mar 26, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Mar 26, 2026

Claude finished @dlwh's task in 1m 39s —— View job


PR Review: [iris] Ignore unknown inherited constraint fields

  • Gather context and understand the PR
  • Review the core fix (job_info.py)
  • Review the test changes
  • Check for unrelated changes
  • Post review feedback

The Fix (commit 39d1c86)

The actual fix is clean and correct. It adds ignore_unknown_fields=True to the json_format.ParseDict call when deserializing IRIS_JOB_CONSTRAINTS from the environment. This prevents child jobs from crashing when the parent's serialized constraints contain newer schema fields (like mode) that the child's protobuf definition doesn't know about yet.

lib/iris/src/iris/cluster/client/job_info.py:112-116 — The change is minimal and well-targeted:

constraint_proto = json_format.ParseDict(
    item,
    cluster_pb2.Constraint(),
    ignore_unknown_fields=True,  # ← the fix
)

The regression test at lib/iris/tests/cluster/client/test_job_info.py:82-96 correctly validates the scenario: a constraint JSON with an unknown "mode" field parses without error, and the known fields (key, op) are preserved.

Issue: Unrelated Changes in the PR

This branch was created on top of codex/research/resilient-tpu-training, which pulls in ~22 unrelated commits and ~5,000 lines of elastic training / resilient TPU research code:

  • .agents/logbooks/resilient-tpu-training.md (+926)
  • .agents/projects/resilient-tpu-training.md (+205)
  • lib/levanter/src/levanter/elastic.py (+1167)
  • lib/levanter/src/levanter/trainer.py (+167/-2)
  • lib/levanter/src/levanter/main/train_lm.py (+1/-0)
  • lib/levanter/tests/test_elastic.py (+414)
  • lib/levanter/tests/test_elastic_jax_transfer.py (+96)
  • lib/marin/src/marin/training/elastic_*.py (4 new files, ~1500+ lines)
  • lib/marin/src/marin/training/training.py (+234/-7)
  • lib/marin/src/marin/training/validation_sets.py (+15)
  • tests/test_training.py (+260/-1)

The actual fix is just 2 files, 23 insertions, 1 deletion. The PR should be rebased onto origin/main to drop the unrelated research branch changes before merging. Otherwise this will merge all the resilient-tpu-training work as a side-effect.

Verdict

The fix itself is correct and well-tested. The only action needed is rebasing onto main to exclude the unrelated elastic/resilient-training changes.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 39d1c86213

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

outer_learning_rate=config.outer_learning_rate,
outer_optimizer=config.outer_optimizer,
outer_momentum=config.outer_momentum,
model_sync_mix=config.model_sync_mix,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Remove unsupported DiLoCo config argument

Passing model_sync_mix into DiLoCoSyncConfig(...) will raise TypeError at runtime because that dataclass does not define a model_sync_mix field. Any run_elastic_budget_compare execution that reaches the elastic branch will fail before launching training, so the new benchmark path is currently unusable.

Useful? React with 👍 / 👎.

Comment on lines +256 to +260
config.train_config.trainer,
id=worker_run_id,
elastic=elastic,
tracker=_tracker_for_elastic_worker(config.train_config.trainer.tracker, logical_run_id, worker_id),
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Isolate per-worker checkpoint paths in elastic workers

This worker assignment rewrites trainer.id per worker but reuses the same checkpointer settings for every worker. With the default launch flow (impute_run_id_from_output_path=True), workers keep a shared checkpoint base path, so separate jobs can write the same checkpoint directories/steps concurrently; that creates nondeterministic overwrites and can corrupt resume state for multi-worker elastic runs.

Useful? React with 👍 / 👎.

@rjpower rjpower closed this Mar 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[iris] get_job_info crashes on constraint JSON with mode field

2 participants