[iris] Use lightweight GetJobState for Fray actor polling #5021
Conversation
FrayActorJob.wait_ready and is_done were calling the heavy GetJobStatus RPC on every 0.5s tick just to read .state, pairing 1:1 with the GetJobState polls from wait_for_job. Switch both to the lightweight state RPC, back off from 0.1s to 5s between polls, and add IrisClient.job_state(job_id) plus wire Job.state through it. Only fetches the full JobStatus on the terminal-error path where the error message is actually needed.
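A minimal sketch of the polling change described above. The client and names here (FakeIrisClient, JobState, wait_ready) are illustrative stand-ins, not the real Iris API: job_state() models the lightweight GetJobState call, job_status() models the heavier GetJobStatus RPC that carries the error message.

```python
import time
from enum import Enum


class JobState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    FAILED = "failed"


class FakeIrisClient:
    """Stand-in client: job_state() is the cheap state lookup; job_status()
    stands for the full-status RPC that also carries error details."""

    def __init__(self, states):
        self._states = states

    def job_state(self, job_id):
        return self._states[job_id]

    def job_status(self, job_id):
        return {"state": self._states[job_id], "error": "worker crashed"}


POLL_INTERVAL_S = 5.0  # backed off from the old 0.1s tick


def wait_ready(client, job_id, poll_interval_s=POLL_INTERVAL_S):
    """Poll the cheap state RPC; fetch the full status only on failure."""
    while True:
        state = client.job_state(job_id)  # lightweight, no status payload
        if state is JobState.RUNNING:
            return state
        if state is JobState.FAILED:
            # Terminal-error path: only here do we pay for the full status
            # fetch, because the error message lives on the JobStatus.
            status = client.job_status(job_id)
            raise RuntimeError(f"job {job_id} failed: {status['error']}")
        time.sleep(poll_interval_s)
```

The key point is the asymmetry: the hot loop touches only the state RPC, and the expensive status RPC is paid exactly once, on the error path.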
Claude finished @rjpower's task in 2m 31s

Code review: No issues found. Checked for bugs and CLAUDE.md/AGENTS.md compliance.
yonromai left a comment:
🤖 Requesting changes. The GetJobState polling change itself looks reasonable, but I found two blocking issues that should be fixed before auto-merge lands this PR:
- the new lightweight client path changes missing-job failures from ConnectError(Code.NOT_FOUND) to KeyError, which is a public API regression for Job.state / state_only();
- the PR also carries an unrelated uv.lock refresh that bumps wandb to 0.26.0.
Validation note: test_job_wait_with_stream_logs timed out both on this PR and on main, so I did not treat that failure as caused by this change.
Generated with Codex.
    states = self._cluster_client.get_job_states([job_id])
    wire_id = job_id.to_wire()
    if wire_id not in states:
        raise KeyError(f"Job {wire_id} not found")
🤖 P1: Job.state and state_only() now route through IrisClient.job_state(), but the new helper raises bare KeyError when the job is missing. Before this PR, the same call path surfaced ConnectError(Code.NOT_FOUND) via get_job_status(), which is also what RemoteClusterClient._poll_job_state still does. That makes the lightweight path a behavioral API regression for callers that already handle missing jobs as Connect errors.
Recommended fix: raise ConnectError(Code.NOT_FOUND, ...) from IrisClient.job_state() (mirroring _poll_job_state) so Job.state, state_only(), and status() stay consistent on missing jobs.
Generated with Codex.
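The recommended fix can be sketched as follows. Code and ConnectError are minimal stand-ins for the Connect RPC client library's types, and job_state is written as a free function taking the state-lookup callable; the real IrisClient.job_state method and its signature are not shown in this thread.

```python
from enum import Enum


class Code(Enum):
    """Stand-in for the Connect RPC error-code enum."""
    NOT_FOUND = "not_found"


class ConnectError(Exception):
    """Stand-in for the Connect RPC client error type."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code


def job_state(get_job_states, job_id):
    """Look up a single job's state via the lightweight batch RPC.

    Raises ConnectError(Code.NOT_FOUND) when the job is missing, matching
    the semantics that get_job_status() and _poll_job_state already expose,
    instead of leaking a bare KeyError to callers.
    """
    states = get_job_states([job_id])
    if job_id not in states:
        raise ConnectError(Code.NOT_FOUND, f"Job {job_id} not found")
    return states[job_id]
```

With this shape, callers that already catch ConnectError and inspect its code keep working whether they go through the heavy or the lightweight path.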
     [[package]]
     name = "wandb"
    -version = "0.24.0"
    +version = "0.26.0"
🤖 P1: This hunk refreshes the repo lockfile and bumps wandb to 0.26.0, but the resolver inputs for that dependency are already wandb>0.24.0 in lib/marin/pyproject.toml and lib/levanter/pyproject.toml, and neither file changed in this PR. Keeping this diff lands an unrelated dependency update alongside the Iris polling change.
Recommended fix: drop the uv.lock hunk from this PR, or split the lock refresh into its own dependency-update PR and validate it separately.
Generated with Codex.
@claude fix the KeyError -> ConnectError regression mentioned in the review. the uv.lock change is fine for now
Claude finished @rjpower's task in 7m 10s

Fixed the KeyError → ConnectError regression per your instructions and pushed the change to this branch.
(dang, forgot about the unsafe automerge)
Oh weird, I was confused, that's fine, we'll follow up in a separate PR
Restore the pre-#5021 public API: Job.state / state_only() now surface ConnectError(Code.NOT_FOUND) when the job is missing, matching RemoteClusterClient._poll_job_state. The bare KeyError introduced by the lightweight polling path was a regression for callers that rely on ConnectError semantics. Follow-up to #5021.