[iris] Ignore unknown inherited constraint fields by dlwh · Pull Request #4168 · marin-community/marin

dlwh · 2026-03-26T07:11:20Z

Ignore unknown fields when reconstructing inherited constraints in iris.cluster.client.get_job_info(), so inherited IRIS_JOB_CONSTRAINTS entries with newer schema fields like mode do not crash startup. Add a regression test for the environment parsing path.

Fixes #4167


claude · 2026-03-26T07:11:37Z

Claude finished @dlwh's task in 1m 39s

PR Review: [iris] Ignore unknown inherited constraint fields

Gather context and understand the PR
Review the core fix (job_info.py)
Review the test changes
Check for unrelated changes
Post review feedback

The Fix (commit `39d1c86`)

The actual fix is clean and correct. It adds ignore_unknown_fields=True to the json_format.ParseDict call when deserializing IRIS_JOB_CONSTRAINTS from the environment. This prevents child jobs from crashing when the parent's serialized constraints contain newer schema fields (like mode) that the child's protobuf definition doesn't know about yet.

lib/iris/src/iris/cluster/client/job_info.py:112-116 — The change is minimal and well-targeted:

constraint_proto = json_format.ParseDict(
    item,
    cluster_pb2.Constraint(),
    ignore_unknown_fields=True,  # ← the fix
)

The regression test at lib/iris/tests/cluster/client/test_job_info.py:82-96 correctly validates the scenario: a constraint JSON with an unknown "mode" field parses without error, and the known fields (key, op) are preserved.
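As a rough, stdlib-only illustration of the same tolerant-parsing idea (this is not the PR's code, which relies on protobuf's `json_format.ParseDict`), the sketch below rebuilds constraints from an environment mapping and drops any field the local schema does not define. `ConstraintSketch`, `parse_inherited_constraints`, the sample values, and the assumption that `IRIS_JOB_CONSTRAINTS` holds a JSON list of objects are all hypothetical.

```python
import json
from dataclasses import dataclass, fields

ENV_VAR = "IRIS_JOB_CONSTRAINTS"  # variable name taken from the PR description


@dataclass(frozen=True)
class ConstraintSketch:
    """Hypothetical stand-in for cluster_pb2.Constraint; only key and op are modeled."""

    key: str = ""
    op: str = ""


def parse_inherited_constraints(env: dict[str, str]) -> list[ConstraintSketch]:
    """Rebuild constraints from env, silently dropping fields this schema lacks."""
    raw = env.get(ENV_VAR, "")
    if not raw:
        return []
    known = {f.name for f in fields(ConstraintSketch)}
    # Assumption: the variable holds a JSON list of constraint objects.
    return [
        ConstraintSketch(**{k: v for k, v in item.items() if k in known})
        for item in json.loads(raw)
    ]


# A parent on a newer schema serializes an extra "mode" field (sample values are made up).
parent_env = {ENV_VAR: json.dumps([{"key": "region", "op": "EQ", "mode": "preferred"}])}
constraints = parse_inherited_constraints(parent_env)
```

Without the key filter, the constructor call would fail on `mode`, which mirrors the startup crash the PR describes; with it, the known fields survive and the unknown one is ignored.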

Issue: Unrelated Changes in the PR

This branch was created on top of codex/research/resilient-tpu-training, which pulls in ~22 unrelated commits and ~5,000 lines of elastic training / resilient TPU research code:

.agents/logbooks/resilient-tpu-training.md (+926)
.agents/projects/resilient-tpu-training.md (+205)
lib/levanter/src/levanter/elastic.py (+1167)
lib/levanter/src/levanter/trainer.py (+167/-2)
lib/levanter/src/levanter/main/train_lm.py (+1/-0)
lib/levanter/tests/test_elastic.py (+414)
lib/levanter/tests/test_elastic_jax_transfer.py (+96)
lib/marin/src/marin/training/elastic_*.py (4 new files, ~1500+ lines)
lib/marin/src/marin/training/training.py (+234/-7)
lib/marin/src/marin/training/validation_sets.py (+15)
tests/test_training.py (+260/-1)

The actual fix is just 2 files, 23 insertions, 1 deletion. The PR should be rebased onto origin/main to drop the unrelated research branch changes before merging. Otherwise this will merge all the resilient-tpu-training work as a side-effect.

Verdict

The fix itself is correct and well-tested. The only action needed is rebasing onto main to exclude the unrelated elastic/resilient-training changes.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 39d1c86213


chatgpt-codex-connector · 2026-03-26T07:19:55Z

+                        outer_learning_rate=config.outer_learning_rate,
+                        outer_optimizer=config.outer_optimizer,
+                        outer_momentum=config.outer_momentum,
+                        model_sync_mix=config.model_sync_mix,


Remove unsupported DiLoCo config argument

Passing model_sync_mix into DiLoCoSyncConfig(...) will raise TypeError at runtime because that dataclass does not define a model_sync_mix field. Any run_elastic_budget_compare execution that reaches the elastic branch will fail before launching training, so the new benchmark path is currently unusable.
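The failure mode described here is ordinary dataclass behavior and can be reproduced in isolation. The sketch below uses a hypothetical `SyncConfigSketch` in place of `DiLoCoSyncConfig`; the field defaults are invented for illustration, and the only property that matters is that no `model_sync_mix` field is declared.

```python
from dataclasses import dataclass


@dataclass
class SyncConfigSketch:
    """Hypothetical stand-in for DiLoCoSyncConfig; it declares no model_sync_mix field."""

    outer_learning_rate: float = 0.7  # illustrative default, not from the PR
    outer_optimizer: str = "nesterov"  # illustrative default, not from the PR
    outer_momentum: float = 0.9  # illustrative default, not from the PR


def try_build(**kwargs) -> str:
    """Return "ok" if the config can be constructed, else the TypeError text."""
    try:
        SyncConfigSketch(**kwargs)
    except TypeError as exc:
        return f"TypeError: {exc}"
    return "ok"


accepted = try_build(outer_learning_rate=0.5)
# The generated __init__ rejects any keyword that is not a declared field.
rejected = try_build(outer_learning_rate=0.5, model_sync_mix=0.25)
```

Because the error is raised at construction time, it would surface before any training is launched, which matches the review's concern.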


chatgpt-codex-connector · 2026-03-26T07:19:55Z

+        config.train_config.trainer,
+        id=worker_run_id,
+        elastic=elastic,
+        tracker=_tracker_for_elastic_worker(config.train_config.trainer.tracker, logical_run_id, worker_id),
+    )


Isolate per-worker checkpoint paths in elastic workers

This worker assignment rewrites trainer.id per worker but reuses the same checkpointer settings for every worker. With the default launch flow (impute_run_id_from_output_path=True), workers keep a shared checkpoint base path, so separate jobs can write the same checkpoint directories/steps concurrently; that creates nondeterministic overwrites and can corrupt resume state for multi-worker elastic runs.
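One possible way to avoid the collision, sketched under assumptions: derive a distinct checkpoint directory for each worker from the shared base path. `worker_checkpoint_path`, the bucket name, and the `worker-N` naming scheme are hypothetical and do not come from the PR.

```python
def worker_checkpoint_path(base_path: str, worker_id: str) -> str:
    """Append the worker id to the shared base so each worker writes to its own directory."""
    # Plain string joining keeps URL-style prefixes such as gs:// intact.
    return f"{base_path.rstrip('/')}/{worker_id}"


# Hypothetical shared base path that every worker would otherwise reuse.
shared_base = "gs://example-bucket/checkpoints/run-a/"
paths = [worker_checkpoint_path(shared_base, f"worker-{i}") for i in range(3)]
```

Each worker then checkpoints under its own prefix, so concurrent jobs no longer write the same step directories. Whether resume logic should later merge or select among these per-worker directories is a separate design question that this sketch does not address.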


dlwh added 23 commits March 11, 2026 22:08

Add elastic TPU training transport and benchmarks

b06589a

Add DiLoCo-style elastic sync mode

6158b4e

Fix Iris nested job context detection

1148c15

Harden elastic DiLoCo rerun instrumentation

00cba33

Use standard validation mixes in elastic benchmarks

142d72d

Document elastic validation rerun

86d53b8

Launch elastic budget compare via executor

798c7bb

Log executor-dispatched elastic rerun

e47d75c

Treat deleted transfer buffers as transient publish misses

1679185

Log deleted-array elastic publish fix

db8c017

Stage elastic transfer payloads before publish

e21fc76

Log staged-publish rollout to 0312g

46d6063

Decouple DiLoCo anchor from donated model buffers

7f644cd

Log DiLoCo anchor fix and 0312h relaunch

fe55611

Stop sharing DiLoCo outer state across peers

0c378ba

Stabilize DiLoCo sync with staleness gating and update clipping

c33afb0

Add MaxText-style Nesterov DiLoCo outer optimizer

d63014a

Log elastic Adam vs Nesterov A/B launch

58dc350

Fix elastic compare launch path for default validation

ea0d0b4

Merge remote-tracking branch 'origin/main' into codex/research/resilient-tpu-training

8211557

Merge remote-tracking branch 'origin/main' into codex/research/resilient-tpu-training

4195b6b

Merge remote-tracking branch 'origin/main' into codex/research/resilient-tpu-training

1729aeb

[iris] Ignore unknown inherited constraint fields

39d1c86

dlwh added the bug and agent-generated labels Mar 26, 2026

chatgpt-codex-connector Bot reviewed Mar 26, 2026


rjpower closed this Mar 30, 2026

dlwh wants to merge 23 commits into main from codex/20260326-fix-4167
