Describe the bug
iris.cluster.client.get_job_info() crashes when IRIS_JOB_CONSTRAINTS contains a constraint JSON object with a mode field. This breaks executor-managed child jobs before user code runs. In this thread it caused the baseline arm of resilient-1e18-0325a-dpequiv to fail before training started.
To Reproduce
- Launch an executor-managed TPU step that calls iris.cluster.client.get_job_info() during startup, for example run_elastic_budget_compare_baseline_direct in lib/marin/src/marin/training/elastic_budget_compare.py:326.
- Ensure the inherited IRIS_JOB_CONSTRAINTS environment contains a constraint JSON object with a mode field.
- Observe startup fail before training begins.
Expected behavior
get_job_info() should tolerate the current serialized constraint shape and return JobInfo instead of crashing on protobuf JSON parsing.
Additional context
Failure is in lib/iris/src/iris/cluster/client/job_info.py:101:
google.protobuf.json_format.ParseError: Message type "iris.cluster.Constraint" has no field named "mode" at "Constraint".
Available Fields(except extensions): "['key', 'op', 'value', 'values']"
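The current code still calls json_format.ParseDict(item, cluster_pb2.Constraint()) on each inherited item, so parsing fails whenever the serialized form of IRIS_JOB_CONSTRAINTS carries a field the current Constraint schema does not define (for example, one written by an older or newer version). A minimal fix is to normalize the items or ignore unknown fields when reconstructing inherited constraints in get_job_info().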
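The "normalize" option could look roughly like the sketch below. It is not the actual job_info.py code: normalize_constraints is a hypothetical helper, the assumption that IRIS_JOB_CONSTRAINTS holds a JSON list of objects is inferred from the per-item ParseDict call, and the sample values are made up. The allowed field names are the ones listed in the ParseError above.

```python
import json

# Field names copied from the "Available Fields" list in the ParseError.
KNOWN_CONSTRAINT_FIELDS = frozenset({"key", "op", "value", "values"})


def normalize_constraints(raw):
    """Hypothetical helper: decode the env value and drop unknown keys.

    Assumes the value is a JSON list of constraint objects.
    """
    if not raw:
        return []
    items = json.loads(raw)
    if not isinstance(items, list):
        raise ValueError("expected a JSON list of constraint objects")
    # Keep only keys the current Constraint schema knows about.
    return [
        {k: v for k, v in item.items() if k in KNOWN_CONSTRAINT_FIELDS}
        for item in items
    ]


# Made-up sample value containing the offending "mode" field.
sample = '[{"key": "region", "op": "EQ", "value": "us-central1", "mode": "preferred"}]'
cleaned = normalize_constraints(sample)
```

Each dict in cleaned could then be passed to json_format.ParseDict(item, cluster_pb2.Constraint()) without hitting the unknown-field error. Alternatively, json_format.ParseDict accepts ignore_unknown_fields=True, which skips unknown fields without a separate filtering step. Either way the mode value is silently dropped, so this is only appropriate if child jobs do not need it.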