[iris] get_job_info crashes on constraint JSON with mode field #4167

@dlwh

Describe the bug
iris.cluster.client.get_job_info() crashes when IRIS_JOB_CONSTRAINTS contains a constraint JSON object with a mode field. This breaks executor-managed child jobs before user code runs. In this thread it caused the baseline arm of resilient-1e18-0325a-dpequiv to fail before training started.

To Reproduce

  1. Launch an executor-managed TPU step that calls iris.cluster.client.get_job_info() during startup, for example run_elastic_budget_compare_baseline_direct in lib/marin/src/marin/training/elastic_budget_compare.py:326.
  2. Ensure the inherited IRIS_JOB_CONSTRAINTS environment contains a constraint JSON object with a mode field.
  3. Observe startup fail before training begins.

Expected behavior

get_job_info() should tolerate the current serialized constraint shape and return JobInfo instead of crashing on protobuf JSON parsing.

Additional context

Failure is in lib/iris/src/iris/cluster/client/job_info.py:101:

google.protobuf.json_format.ParseError: Message type "iris.cluster.Constraint" has no field named "mode" at "Constraint".
Available Fields(except extensions): "['key', 'op', 'value', 'values']"

The current code still calls json_format.ParseDict(item, cluster_pb2.Constraint()), which rejects any key the local Constraint message does not define, so older or newer serialized forms of IRIS_JOB_CONSTRAINTS are not schema-tolerant. A minimal fix is to normalize or ignore unknown fields when reconstructing inherited constraints in get_job_info().
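
A minimal sketch of the "normalize" option, assuming IRIS_JOB_CONSTRAINTS holds a JSON list of constraint objects (the issue does not show the exact payload). The helper name strip_unknown_constraint_fields and the sample values are hypothetical; only the four field names come from the ParseError above. This is not the actual job_info.py code.

```python
import json

# Field names copied from the "Available Fields" list in the ParseError.
KNOWN_CONSTRAINT_FIELDS = frozenset({"key", "op", "value", "values"})


def strip_unknown_constraint_fields(raw: str) -> list[dict]:
    """Parse a JSON list of constraint objects and drop unrecognized keys.

    Hypothetical helper: the result is meant to be handed to
    json_format.ParseDict one item at a time, as the current code does.
    """
    items = json.loads(raw) if raw else []
    return [
        {k: v for k, v in item.items() if k in KNOWN_CONSTRAINT_FIELDS}
        for item in items
    ]


# Illustrative payload only; the real key/op/value/mode strings may differ.
raw = '[{"key": "region", "op": "EQ", "value": "us-central1", "mode": "REQUIRED"}]'
cleaned = strip_unknown_constraint_fields(raw)
# The "mode" key is gone, so each remaining dict matches the known schema.
```

For the "ignore" option, json_format.ParseDict also accepts ignore_unknown_fields=True, which skips unknown keys at parse time without a separate filtering step. Either way the mode value is silently discarded, so this only suits cases where the child job does not need it.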

Metadata

Labels: agent-generated (created by automation/agent), bug (something isn't working)
