[ci] Implement workflow_scripts: extract logic, rename workflows #5354
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 497f08e3bc
74e62d0 to 3a82909
yonromai left a comment
Approved. I left one net-new comment on the Iris monitor state model; the existing review comments about the dupekit auto-PR branch flow and CoreWeave label sanitization still look relevant, so I did not repost them.
Validation notes: the new monitor script compiles, its help path runs under uv --package marin-iris, and all workflow/action YAML files load. actionlint only reported shellcheck/custom-runner warnings. The current Iris - Smoke - GCP check is failing from a pytest timeout in the live smoke test, with diagnostics uploaded.
Generated with Codex.
@claude pls rebase to main and resolve conflicts
Claude finished @rjpower's task in 5m 57s
Rebase resolved locally; push blocked by GitHub App permissions
Conflicts (both in
Move per-workflow logic into scripts/workflows/ (changes.py, github_actions.py, iris_monitor.py with status/wait/collect), replace peter-evans/create-pull-request and dorny/paths-filter, SHA-pin non-trusted actions, and rename all workflows to domain-type[-variant] with matching job ids. Audit goes from 38 to 0 failures. Branch protection swap is documented in .agents/projects/workflow_scripts/branch-protection-rollout.md and must be applied before merge.
Fixes #5067
…collapse iris_monitor

Replace scripts/workflows/changes.py with SHA-pinned dorny/paths-filter@d1c1ffe across the 8 unit/docs workflows. The path-filter logic was never invoked outside CI, so a third-party action with a SHA pin gets the same security posture without 248 LoC + 404 LoC of tests to maintain.

Delete scripts/workflows/github_actions.py: the audit was never wired into pre-commit or any GHA workflow. Of its 5 checks, only "non-trusted actions must be SHA-pinned" has ongoing value, and that's a 5-line grep. Verified all 4 conditions still hold on the live repo before deleting.

Collapse the four iris_monitor files into one: inline _iris_cli.py, _iris_diagnostics_gcp.py, _iris_diagnostics_coreweave.py and drop the sys.path.insert hack. Strip Args/Returns/Raises blocks on self-documenting helpers, banner comments, defensive try/except wrappers around CLI commands (Click prints uncaught exceptions and exits non-zero already), the unused signal handler, the dead -v/--verbose flag and logger setup, and the test-only sleep/monotonic/run injections.

Net: 2354 deletions, ~140 LoC reduction in iris_monitor.py.
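For reference, the one surviving check ("non-trusted actions must be SHA-pinned") could be approximated in a few lines of Python. This is a minimal sketch, not the deleted github_actions.py: the TRUSTED_OWNERS set, the unpinned_actions helper, and the requirement of a full 40-character SHA are all assumptions made for illustration.

```python
import re

# Assumption: which owners count as "trusted" is repo policy; these are placeholders.
TRUSTED_OWNERS = {"actions", "github"}

# Matches `uses: owner/repo@ref` with or without a leading list dash or quotes.
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*['\"]?([^\s'\"#]+)")
# Assumption: "SHA-pinned" means a full 40-character lowercase hex commit SHA.
SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list:
    """Return `uses:` references to non-trusted actions that are not pinned to a SHA."""
    offenders = []
    for line in workflow_text.splitlines():
        match = USES_RE.match(line)
        if not match:
            continue
        ref = match.group(1)
        # Local composite actions and docker images are out of scope for this check.
        if ref.startswith(("./", "docker://")):
            continue
        action, _, version = ref.partition("@")
        owner = action.split("/", 1)[0]
        if owner in TRUSTED_OWNERS:
            continue
        if not SHA_RE.match(version):
            offenders.append(ref)
    return offenders
```

Running this over each file in .github/workflows/ and failing on a non-empty result would give roughly the same signal as the grep described above.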
iris_monitor.py collect now SSHes into every managed GCE VM (not just the controller) and dumps controller+worker pod logs/describes/events on CoreWeave, with --include-cluster-context for nodepools+autoscaler+scheduler-state. Adds a port-forward command that replaces the inline rollout/wait/forward/health-probe loop in iris-smoke-coreweave and marin-canary-ferry-coreweave. notify-slack and claude-triage move into composite actions so their four and three call sites, respectively, stop drifting.
Long-lived kubectl port-forward dies on CW konnectivity ~4-30min into runs. Each iris call now establishes its own ephemeral tunnel via the iris CLI, surviving any number of API-server stream drops.
Same per-call-tunnel fix as the CW canary ferry. The pytest steps still consume --controller-url against the long-lived port-forward; making those resilient would touch the test conftest and is a follow-up.
When auto/update-dupekit-wheels already exists, the pyproject.toml write from rust_package.py was being clobbered by the subsequent 'git checkout SHA -- .' that reset the working tree to the trigger SHA. Reorder: branch checkout + reset first, then run rust_package on top, then commit and push.
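The ordering bug can be illustrated with a toy model in which the working tree is a dict. This is only a sketch of the reasoning: checkout_tree and bump_wheels are hypothetical stand-ins for 'git checkout SHA -- .' and rust_package.py, and the file contents are made up.

```python
def checkout_tree(worktree: dict, snapshot: dict) -> None:
    """Model `git checkout SHA -- .`: overwrite tracked files with the snapshot's contents."""
    worktree.update(snapshot)


def bump_wheels(worktree: dict) -> None:
    """Model the pyproject.toml write (hypothetical stand-in for rust_package.py)."""
    worktree["pyproject.toml"] = "dupekit==new"


# Made-up contents of the tree at the trigger SHA.
trigger_sha = {"pyproject.toml": "dupekit==old"}

# Old order: write first, then reset to the trigger SHA, so the write is lost.
old_order = dict(trigger_sha)
bump_wheels(old_order)
checkout_tree(old_order, trigger_sha)

# New order: reset first, then write on top, so the write survives to the commit.
new_order = dict(trigger_sha)
checkout_tree(new_order, trigger_sha)
bump_wheels(new_order)

print(old_order["pyproject.toml"], new_order["pyproject.toml"])
```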
Use iris.cluster.types.is_job_finished + iris.rpc.job_pb2 directly instead of mirroring a subset of JobState values (which was missing JOB_STATE_KILLED / WORKER_FAILED / UNSCHEDULABLE and listed a non-existent JOB_STATE_CANCELLED). For the kubectl iris.job_id selector, reuse iris's controller-side _sanitize_label_value so interior '/' in hierarchical job IDs maps to '.' the same way the controller writes the label. The previous code only stripped a leading '/' and mangled '_' to '-', so failure diagnostics on CW silently returned zero pods.
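To see why the selector matched nothing, compare the two mappings on a hierarchical job ID. This is a rough sketch: both functions are hypothetical approximations written from the description above, not iris's _sanitize_label_value, and stripping the outer '/' in the second one is an assumption.

```python
def old_label_value(job_id: str) -> str:
    """Approximate the previous behavior: drop one leading '/', turn '_' into '-'."""
    return job_id.removeprefix("/").replace("_", "-")


def controller_style_label_value(job_id: str) -> str:
    """Hypothetical stand-in for the controller-side sanitizer: interior '/' becomes '.'."""
    # Assumption: leading/trailing '/' are dropped; only the '/' -> '.' mapping is from the PR.
    return job_id.strip("/").replace("/", ".")


job_id = "/alice/smoke_test/0"  # made-up hierarchical job ID
print(old_label_value(job_id))               # alice/smoke-test/0
print(controller_style_label_value(job_id))  # alice.smoke_test.0
```

The old value still contains '/', which is not a valid character in a Kubernetes label value, and it differs from what the controller wrote, so a selector built from it cannot match any pod.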
The iris CLI already establishes a tunnel via the provider bundle (kubectl port-forward on K8s, IAP/SSH on GCP) and prints the URL. Replace the bespoke kubectl-port-forward + rollout-wait logic in iris_monitor.py with a thin wrapper that runs `iris cluster dashboard`, parses the URL, and probes /health. Drop the now-unneeded port-forward step from marin-canary-ferry-coreweave: every iris call there uses --config which lazy-tunnels per-call. Strip slop comments per AGENTS.md across the script and workflows.
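The "parse the URL, probe /health" part of such a wrapper might look like the sketch below. The helper names are hypothetical, and the exact text that `iris cluster dashboard` prints is not shown in this PR, so the parser simply takes the first http(s) URL it finds in the output.

```python
import re
import urllib.error
import urllib.request
from typing import Optional

# Assumption: the CLI prints the dashboard URL somewhere in its output.
URL_RE = re.compile(r"https?://[^\s'\"<>]+")


def parse_dashboard_url(cli_output: str) -> Optional[str]:
    """Return the first URL in the CLI output, without trailing punctuation, or None."""
    match = URL_RE.search(cli_output)
    return match.group(0).rstrip("/.,") if match else None


def is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Probe <base_url>/health and report whether it answered with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP errors all count as unhealthy.
        return False
```

A caller would feed the captured stdout of the CLI process to parse_dashboard_url and retry is_healthy until it returns True or a deadline passes.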
87ff4e9 to 959605b
* follow up from #5354

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>