Skip to content

[iris] Fix _list_descendant_jobs to use ListJobs instead of deprecated GetTaskLogs#5037

Merged
rjpower merged 1 commit intomainfrom
rjpower/20260422-migrate-bug-report-to-list-jobs
Apr 22, 2026
Merged

[iris] Fix _list_descendant_jobs to use ListJobs instead of deprecated GetTaskLogs#5037
rjpower merged 1 commit intomainfrom
rjpower/20260422-migrate-bug-report-to-list-jobs

Conversation

@rjpower
Copy link
Copy Markdown
Collaborator

@rjpower rjpower commented Apr 22, 2026

GetTaskLogs never populated child_job_statuses on the server side, so _list_descendant_jobs in bug_report.py always returned an empty list. Replace with ListJobs(scope=CHILDREN, parent_job_id=...) which correctly fetches child job statuses and unblocks removal of the deprecated GetTaskLogs RPC (scheduled for 2026-05-01).

…d GetTaskLogs

GetTaskLogs never populated child_job_statuses, so _list_descendant_jobs
always returned an empty list. Replace with ListJobs(scope=CHILDREN) which
actually returns the child job statuses needed for the bug report.
@rjpower rjpower added the agent-generated Created by automation/agent label Apr 22, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 22, 2026

Claude finished @rjpower's task in 2m 11s —— View job


Code Review — PR #5037

Summary: Narrowly-scoped bug fix that correctly replaces a broken GetTaskLogs-based lookup with the equivalent ListJobs(scope=CHILDREN) call. The replacement matches the server contract and is a reasonable unblock for retiring GetTaskLogs.

  • Read the changed file (lib/iris/src/iris/cli/bug_report.py)
  • Diff against origin/main
  • Examine ListJobs / JobQuery semantics on server and client
  • Verify the replacement preserves prior behavior
  • Post review feedback

Correctness

  • ControllerServiceClientSync.list_jobs(ListJobsRequest) is the correct RPC entry point (lib/iris/src/iris/rpc/controller_connect.py:1762), and the server honors scope=JOB_QUERY_SCOPE_CHILDREN + parent_job_id exactly as used here (lib/iris/src/iris/cluster/controller/service.py:608-615).
  • The try/except Exception + warn-and-return-empty behavior is preserved, so failure semantics don't regress.
  • Unfiltered state_filter on the server means all USER_JOB_STATES are returned (service.py:144-153) — i.e. the same set a human would see in the dashboard. This matches the intent of the bug-report dump.
  • _build_descendant_job_report reads job.job_id, state, exit_code, error, finished_at — all populated on the JobStatus rows that ListJobs returns. No field-level regression.

Issue: silent truncation at 500 children

_query_from_list_jobs_request clamps limit to MAX_LIST_JOBS_LIMIT = 500 when unset (service.py:703-704), and this call passes no limit. The response's has_more / total_count are discarded.

Result: if a parent job has more than 500 direct children, the bug report silently shows only the 500 most recent ones (default sort is JOB_SORT_FIELD_DATE DESC). The old broken code returned [] in all cases, so this is still a strict improvement — but the new code can under-report for high-fan-out jobs without any indication in the report.

Options, roughly in order of cost:

  1. Read list_resp.has_more / list_resp.total_count and surface a "showing N of M children" note in the report when truncated.
  2. Page through until has_more == False (see the reference helper at lib/iris/src/iris/cluster/client/remote_client.py:376-407, which already does this for RemoteControllerClient; the logic would need to be duplicated locally since bug_report.py uses the raw ControllerServiceClientSync).
  3. Accept the 500 cap and add an inline comment here noting it.

For a diagnostic bug-report CLI, option (1) is probably the right balance — cheap and keeps the report honest. Fix this →

Nits (non-blocking)

  • Sort order not explicit. Server currently defaults to JOB_SORT_FIELD_DATE DESC, which is fine, but making it explicit in the query would pin the ordering of the bug report against future server default changes. One line change.
  • No test coverage added. The original bug (always-empty list) was silent — there's no existing test for _list_descendant_jobs, and nothing was added here. An integration-style test that creates a parent + child and asserts _list_descendant_jobs returns the child would have caught the original regression and would guard this path going forward. Not strictly blocking given the fix itself is mechanical, but worth considering since the bug was silent for a while.

Verdict

LGTM once the truncation-awareness question is addressed (either option 1 above or an explicit "accept 500 cap" decision with a comment). Everything else is minor polish.
· Branch: rjpower/20260422-migrate-bug-report-to-list-jobs

@rjpower rjpower merged commit a8fc469 into main Apr 22, 2026
37 checks passed
@rjpower rjpower deleted the rjpower/20260422-migrate-bug-report-to-list-jobs branch April 22, 2026 03:41
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0197a02f2e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

return []

return [_build_descendant_job_report(job) for job in log_resp.child_job_statuses]
return [_build_descendant_job_report(job) for job in list_resp.jobs]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Paginate ListJobs when collecting descendant jobs

ListJobs is a paginated API, but this helper returns only list_resp.jobs from the first response and never checks list_resp.has_more. In controllers with high fan-out child-job trees (more than the server page cap), bug reports will silently omit later children and can miss failed descendants in the summary. This is reachable because the server clamps JobQuery.limit to MAX_LIST_JOBS_LIMIT (500) in service._query_from_list_jobs_request, so a single call cannot represent all children.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant