Skip to content

[iris] Use lightweight GetJobState for Fray actor polling#5021

Merged
rjpower merged 2 commits intomainfrom
iris-getjobstate-polling
Apr 21, 2026
Merged

[iris] Use lightweight GetJobState for Fray actor polling#5021
rjpower merged 2 commits intomainfrom
iris-getjobstate-polling

Conversation

@rjpower
Copy link
Copy Markdown
Collaborator

@rjpower rjpower commented Apr 21, 2026

FrayActorJob.wait_ready and is_done were calling the heavy GetJobStatus RPC on every 0.5s tick just to read .state, pairing 1:1 with the GetJobState polls from wait_for_job. Switch both to the lightweight state RPC, back off from 0.1s to 5s between polls, and add IrisClient.job_state(job_id) plus wire Job.state through it. The full JobStatus is now fetched only on the terminal-error path where the error message is actually needed.

FrayActorJob.wait_ready and is_done were calling the heavy GetJobStatus RPC on
every 0.5s tick just to read .state, pairing 1:1 with the GetJobState polls from
wait_for_job. Switch both to the lightweight state RPC, back off from 0.1s to
5s between polls, and add IrisClient.job_state(job_id) plus wire Job.state
through it. Only fetches the full JobStatus on the terminal-error path where
the error message is actually needed.
@rjpower rjpower added the agent-generated Created by automation/agent label Apr 21, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 21, 2026

Claude finished @rjpower's task in 2m 31s —— View job


Code review

No issues found. Checked for bugs and CLAUDE.md/AGENTS.md compliance.

What I verified

  • rigging.timing.ExponentialBackoff(initial=0.1, maximum=5.0).next_interval() matches the actual API in lib/rigging/src/rigging/timing.py. Using rigging.timing aligns with the Iris AGENTS.md rule (lib/iris/AGENTS.md L56).
  • ClusterClient.get_job_states(list[JobName]) -> dict[str, int] exists in protocol.py; the new IrisClient.job_state is a thin wrapper that mirrors the existing Job.state_only().
  • is_job_finished(state: int) accepts the raw int returned by the lightweight RPC, so both wait_ready and is_done continue to work.
  • Job.state now returns cast(job_pb2.JobState, self.state_only()). Proto enums are ints in generated Python, so callers like list_jobs' job.state in states comparisons keep working.
  • Full status() is only fetched on the terminal error branch in wait_ready, which matches the PR description; the extra RPC there is a one-shot on a terminal transition, not per-poll.

Minor observation (not flagged)

  • state and status().error are two separate RPCs on the error path; in practice the job is already in a terminal state so the window is benign.
    · branch: iris-getjobstate-polling

@rjpower rjpower requested a review from yonromai April 21, 2026 23:01
@rjpower rjpower enabled auto-merge (squash) April 21, 2026 23:01
Copy link
Copy Markdown
Contributor

@yonromai yonromai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Requesting changes. The GetJobState polling change itself looks reasonable, but I found two blocking issues that should be fixed before auto-merge lands this PR:

  • the new lightweight client path changes missing-job failures from ConnectError(Code.NOT_FOUND) to KeyError, which is a public API regression for Job.state / state_only();
  • the PR also carries an unrelated uv.lock refresh that bumps wandb to 0.26.0.

Validation note: test_job_wait_with_stream_logs timed out both on this PR and on main, so I did not treat that failure as caused by this change.

Generated with Codex.

states = self._cluster_client.get_job_states([job_id])
wire_id = job_id.to_wire()
if wire_id not in states:
raise KeyError(f"Job {wire_id} not found")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 P1: Job.state and state_only() now route through IrisClient.job_state(), but the new helper raises bare KeyError when the job is missing. Before this PR, the same call path surfaced ConnectError(Code.NOT_FOUND) via get_job_status(), which is also what RemoteClusterClient._poll_job_state still does. That makes the lightweight path a behavioral API regression for callers that already handle missing jobs as Connect errors.

Recommended fix: raise ConnectError(Code.NOT_FOUND, ...) from IrisClient.job_state() (mirroring _poll_job_state) so Job.state, state_only(), and status() stay consistent on missing jobs.

Generated with Codex.

Comment thread uv.lock
[[package]]
name = "wandb"
version = "0.24.0"
version = "0.26.0"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 P1: This hunk refreshes the repo lockfile and bumps wandb to 0.26.0, but the resolver inputs for that dependency are already wandb>0.24.0 in lib/marin/pyproject.toml and lib/levanter/pyproject.toml, and neither file changed in this PR. Keeping this diff lands an unrelated dependency update alongside the Iris polling change.

Recommended fix: drop the uv.lock hunk from this PR, or split the lock refresh into its own dependency-update PR and validate it separately.

Generated with Codex.

@rjpower
Copy link
Copy Markdown
Collaborator Author

rjpower commented Apr 21, 2026

@claude fix the KeyError -> ConnectError regression mentioned in the review. the uv.lock change is fine for now

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 21, 2026

Claude finished @rjpower's task in 7m 10s —— View job


Fixed KeyError → ConnectError regression

  • Read IrisClient.job_state and the RemoteClusterClient._poll_job_state pattern
  • Apply fix: raise ConnectError(Code.NOT_FOUND, ...) from IrisClient.job_state when the job is missing
  • Run ./infra/pre-commit.py --all-files --fix — OK
  • Run relevant tests
  • Commit & push

Change

lib/iris/src/iris/client/client.pyIrisClient.job_state() now raises ConnectError(Code.NOT_FOUND, ...) instead of bare KeyError when the wire id is missing from the lightweight states response. This matches RemoteClusterClient._poll_job_state and restores the pre-PR public API for Job.state / state_only(). Added a Raises: note to the docstring.

Per your instructions, the uv.lock hunk is left as-is.

Test results

  • lib/iris/tests/e2e/test_local_client.py: 8 passed (full job-lifecycle coverage that exercises Job.state / state_only via the lightweight RPC).
  • lib/iris/tests/e2e/test_smoke.py: 22 passed, 2 skipped, only Playwright-related failures (test_dashboard_job_expand, test_dashboard_login_flow — missing browser binary), unrelated to this change.
  • lib/fray/tests/test_v2_iris.py + tests/test_v2_actor.py: 32 passed (covers FrayActorJob.wait_ready / is_done paths that consume job_state).
  • ./infra/pre-commit.py --all-files --fix: clean.

I did not run test_job_wait_with_stream_logs (the test you noted times out on main as well).

Pushed as bdffee123 to iris-getjobstate-polling.
· branch: iris-getjobstate-polling

@rjpower rjpower merged commit 28f2170 into main Apr 21, 2026
50 of 51 checks passed
@rjpower rjpower deleted the iris-getjobstate-polling branch April 21, 2026 23:31
@yonromai
Copy link
Copy Markdown
Contributor

(dang, forgot about the unsafe automerge)

@rjpower
Copy link
Copy Markdown
Collaborator Author

rjpower commented Apr 21, 2026

Oh weird, I was confused, that's fine, we'll follow up in a separte PR

rjpower added a commit that referenced this pull request Apr 21, 2026
Restore the pre-#5021 public API: Job.state / state_only() now surface
ConnectError(Code.NOT_FOUND) when the job is missing, matching
RemoteClusterClient._poll_job_state. The bare KeyError introduced by the
lightweight polling path was a regression for callers that rely on
ConnectError semantics.

Follow-up to #5021.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants