[iris] Use lightweight GetJobState for Fray actor polling #5021
Conversation
FrayActorJob.wait_ready and is_done were calling the heavy GetJobStatus RPC on every 0.5s tick just to read .state, pairing 1:1 with the GetJobState polls from wait_for_job. Switch both to the lightweight state RPC, back off from 0.1s to 5s between polls, and add IrisClient.job_state(job_id) plus wire Job.state through it. Only fetches the full JobStatus on the terminal-error path where the error message is actually needed.
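A minimal sketch of the polling change described above. The client and names here (FakeIrisClient, JobState, wait_ready) are illustrative stand-ins, not the real Iris API: job_state() models the lightweight GetJobState call, job_status() models the heavier GetJobStatus RPC that carries the error message.

```python
import time
from enum import Enum


class JobState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    FAILED = "failed"


class FakeIrisClient:
    """Stand-in client: job_state() is the cheap state lookup; job_status()
    stands for the full-status RPC that also carries error details."""

    def __init__(self, states):
        self._states = states

    def job_state(self, job_id):
        return self._states[job_id]

    def job_status(self, job_id):
        return {"state": self._states[job_id], "error": "worker crashed"}


POLL_INTERVAL_S = 5.0  # backed off from the old 0.1s tick


def wait_ready(client, job_id, poll_interval_s=POLL_INTERVAL_S):
    """Poll the cheap state RPC; fetch the full status only on failure."""
    while True:
        state = client.job_state(job_id)  # lightweight, no status payload
        if state is JobState.RUNNING:
            return state
        if state is JobState.FAILED:
            # Terminal-error path: only here do we pay for the full status
            # fetch, because the error message lives on the JobStatus.
            status = client.job_status(job_id)
            raise RuntimeError(f"job {job_id} failed: {status['error']}")
        time.sleep(poll_interval_s)
```

The key point is the asymmetry: the hot loop touches only the state RPC, and the expensive status RPC is paid exactly once, on the error path.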
Claude finished @rjpower's task in 2m 31s

Code review: No issues found. Checked for bugs and CLAUDE.md/AGENTS.md compliance.
yonromai left a comment:
🤖 Requesting changes. The GetJobState polling change itself looks reasonable, but I found two blocking issues that should be fixed before auto-merge lands this PR:
- the new lightweight client path changes missing-job failures from ConnectError(Code.NOT_FOUND) to KeyError, which is a public API regression for Job.state / state_only();
- the PR also carries an unrelated uv.lock refresh that bumps wandb to 0.26.0.
Validation note: test_job_wait_with_stream_logs timed out both on this PR and on main, so I did not treat that failure as caused by this change.
Generated with Codex.
    states = self._cluster_client.get_job_states([job_id])
    wire_id = job_id.to_wire()
    if wire_id not in states:
        raise KeyError(f"Job {wire_id} not found")
🤖 P1: Job.state and state_only() now route through IrisClient.job_state(), but the new helper raises bare KeyError when the job is missing. Before this PR, the same call path surfaced ConnectError(Code.NOT_FOUND) via get_job_status(), which is also what RemoteClusterClient._poll_job_state still does. That makes the lightweight path a behavioral API regression for callers that already handle missing jobs as Connect errors.
Recommended fix: raise ConnectError(Code.NOT_FOUND, ...) from IrisClient.job_state() (mirroring _poll_job_state) so Job.state, state_only(), and status() stay consistent on missing jobs.
Generated with Codex.
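The recommended fix can be sketched as follows. Code and ConnectError are minimal stand-ins for the Connect RPC client library's types, and job_state is written as a free function taking the state-lookup callable; the real IrisClient.job_state method and its signature are not shown in this thread.

```python
from enum import Enum


class Code(Enum):
    """Stand-in for the Connect RPC error-code enum."""
    NOT_FOUND = "not_found"


class ConnectError(Exception):
    """Stand-in for the Connect RPC client error type."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code


def job_state(get_job_states, job_id):
    """Look up a single job's state via the lightweight batch RPC.

    Raises ConnectError(Code.NOT_FOUND) when the job is missing, matching
    the semantics that get_job_status() and _poll_job_state already expose,
    instead of leaking a bare KeyError to callers.
    """
    states = get_job_states([job_id])
    if job_id not in states:
        raise ConnectError(Code.NOT_FOUND, f"Job {job_id} not found")
    return states[job_id]
```

With this shape, callers that already catch ConnectError and inspect its code keep working whether they go through the heavy or the lightweight path.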
     [[package]]
     name = "wandb"
    -version = "0.24.0"
    +version = "0.26.0"
🤖 P1: This hunk refreshes the repo lockfile and bumps wandb to 0.26.0, but the resolver inputs for that dependency are already wandb>0.24.0 in lib/marin/pyproject.toml and lib/levanter/pyproject.toml, and neither file changed in this PR. Keeping this diff lands an unrelated dependency update alongside the Iris polling change.
Recommended fix: drop the uv.lock hunk from this PR, or split the lock refresh into its own dependency-update PR and validate it separately.
Generated with Codex.
@claude fix the KeyError -> ConnectError regression mentioned in the review. the uv.lock change is fine for now
Claude finished @rjpower's task in 7m 10s

Fixed the KeyError → ConnectError regression per your instructions and pushed the change to this branch.
(dang, forgot about the unsafe automerge)
Oh weird, I was confused, that's fine, we'll follow up in a separate PR
Restore the pre-#5021 public API: Job.state / state_only() now surface ConnectError(Code.NOT_FOUND) when the job is missing, matching RemoteClusterClient._poll_job_state. The bare KeyError introduced by the lightweight polling path was a regression for callers that rely on ConnectError semantics. Follow-up to #5021.