Skip to content

[ci] Implement workflow_scripts: extract logic, rename workflows#5354

Merged
rjpower merged 9 commits into
mainfrom
workflow-scripts-impl
May 5, 2026
Merged

[ci] Implement workflow_scripts: extract logic, rename workflows#5354
rjpower merged 9 commits into
mainfrom
workflow-scripts-impl

Conversation

@rjpower
Copy link
Copy Markdown
Collaborator

@rjpower rjpower commented May 1, 2026

Move per-workflow logic into scripts/workflows/ (changes.py, github_actions.py, iris_monitor.py with status/wait/collect), replace peter-evans/create-pull-request and dorny/paths-filter, SHA-pin non-trusted actions, and rename all workflows to domain-type[-variant] with matching job ids. Audit goes from 38 to 0 failures. Branch protection swap is documented in .agents/projects/workflow_scripts/branch-protection-rollout.md and must be applied before merge.

Fixes #5067

@rjpower rjpower added the agent-generated Created by automation/agent label May 1, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 497f08e3bc

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread .github/workflows/dupekit-release-wheels.yaml Outdated
Comment thread scripts/workflows/_iris_diagnostics_coreweave.py Outdated
@rjpower rjpower requested a review from yonromai May 1, 2026 22:53
@rjpower rjpower force-pushed the workflow-scripts-impl branch from 74e62d0 to 3a82909 Compare May 1, 2026 23:25
Copy link
Copy Markdown
Contributor

@yonromai yonromai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. I left one net-new comment on the Iris monitor state model; the existing review comments about the dupekit auto-PR branch flow and CoreWeave label sanitization still look relevant, so I did not repost them.

Validation notes: the new monitor script compiles, its help path runs under uv --package marin-iris, and all workflow/action YAML files load. actionlint only reported shellcheck/custom-runner warnings. The current Iris - Smoke - GCP check is failing from a pytest timeout in the live smoke test, with diagnostics uploaded.

Generated with Codex.

Comment thread scripts/workflows/iris_monitor.py Outdated
@rjpower
Copy link
Copy Markdown
Collaborator Author

rjpower commented May 5, 2026

@claude pls rebase to main and resolve conflicts

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 5, 2026

Claude finished @rjpower's task in 5m 57s —— View job


Rebase resolved locally — push blocked by GitHub App permissions

  • Fetched origin/main and unshallowed (29 commits ahead of branch base, 8 branch commits)
  • Ran git rebase origin/main — all 8 commits replayed
  • Resolved both conflicts in .github/workflows/marin-canary-ferry-coreweave.yaml
  • ./infra/pre-commit.py --all-files --fix → clean (ruff + black + pyrefly + license headers all pass)
  • Smoke-checked scripts/workflows/iris_monitor.py (imports clean, --help works, all four subcommands exposed); YAML-loaded the five rewritten workflows
  • Cannot push — see below

Conflicts (both in marin-canary-ferry-coreweave.yaml)

1. Port-forward to iris-ci controller step. main (a92b2a6) had grown an inline kubectl rollout status + port-forward + retry-loop + curl /health block. The branch already replaces all of that with a single call into scripts/workflows/iris_monitor.py port-forward, which (per iris_monitor.py:703-768) does the equivalent rollout-wait, spawns the forwarder, polls /health, and writes IRIS_CONTROLLER_URL / LOCAL_PORT / PF_PID to $GITHUB_ENV. → kept the branch version, dropped the inline block.

2. Submit canary ferry step. main added echo "Multi-host canary: ${CANARY_MULTI_HOST:-0}"; the branch's commit 6c9126f317 switched iris --controller-url="$IRIS_CONTROLLER_URL" to iris --config="$IRIS_CONFIG" (per-call tunnel rather than a long-lived port-forward URL). → kept both changes.

Resulting block (lines 109-145):

      - name: Port-forward to iris-ci controller
        run: |
          uv run python scripts/workflows/iris_monitor.py port-forward \
            --namespace "$IRIS_NAMESPACE" \
            --service iris-ci-controller-svc \
            --kubeconfig "$HOME/.kube/coreweave-iris"

      - name: Submit canary ferry
        id: submit
        shell: bash -l {0}
        run: |
          echo "Multi-host canary: ${CANARY_MULTI_HOST:-0}"
          JOB_ID=$(.venv/bin/iris --config="$IRIS_CONFIG" \
            job run --no-wait \
            ...

Why I can't push

The Claude Code GitHub App's installation token does not carry the workflows permission, so git push --force-with-lease is rejected by the GitHub API:

! [remote rejected]  workflow-scripts-impl -> workflow-scripts-impl
  (refusing to allow a GitHub App to create or update workflow
   `.github/workflows/marin-canary-ferry-coreweave.yaml` without `workflows` permission)

The wrapper git-push.sh available on the runner doesn't accept --force-with-lease (it allows only origin <ref> for security), so a non-fast-forward push isn't possible through it either. This is a Claude Code Action / App-permission limitation, not a repo-side branch protection — see the FAQ.

How to apply this rebase locally

The conflict resolution is mechanical; run this on your laptop:

git fetch origin
git checkout workflow-scripts-impl
git rebase origin/main
# Two conflicts in .github/workflows/marin-canary-ferry-coreweave.yaml.
# Conflict 1 (Port-forward step): keep the `iris_monitor.py port-forward` call,
#   drop the inline kubectl rollout/port-forward/curl block from main.
# Conflict 2 (Submit canary ferry step): keep BOTH the new
#   `echo "Multi-host canary: ${CANARY_MULTI_HOST:-0}"` line AND
#   the `iris --config="$IRIS_CONFIG"` invocation.
git add .github/workflows/marin-canary-ferry-coreweave.yaml
git rebase --continue
./infra/pre-commit.py --all-files --fix
git push --force-with-lease origin workflow-scripts-impl

• Branch: workflow-scripts-impl

rjpower added 9 commits May 5, 2026 12:02
Move per-workflow logic into scripts/workflows/ (changes.py, github_actions.py,
iris_monitor.py with status/wait/collect), replace peter-evans/create-pull-request
and dorny/paths-filter, SHA-pin non-trusted actions, and rename all workflows
to domain-type[-variant] with matching job ids. Audit goes from 38 to 0
failures. Branch protection swap is documented in
.agents/projects/workflow_scripts/branch-protection-rollout.md and must be
applied before merge.

Fixes #5067
…collapse iris_monitor

Replace scripts/workflows/changes.py with SHA-pinned dorny/paths-filter@d1c1ffe
across the 8 unit/docs workflows. The path-filter logic was never invoked
outside CI, so a third-party action with a SHA pin gets the same security
posture without 248 LoC + 404 LoC of tests to maintain.

Delete scripts/workflows/github_actions.py: the audit was never wired into
pre-commit or any GHA workflow. Of its 5 checks, only "non-trusted actions
must be SHA-pinned" has ongoing value, and that's a 5-line grep. Verified all
4 conditions still hold on the live repo before deleting.

Collapse the four iris_monitor files into one: inline _iris_cli.py,
_iris_diagnostics_gcp.py, _iris_diagnostics_coreweave.py and drop the
sys.path.insert hack. Strip Args/Returns/Raises blocks on self-documenting
helpers, banner comments, defensive try/except wrappers around CLI commands
(Click prints uncaught exceptions and exits non-zero already), the unused
signal handler, the dead -v/--verbose flag and logger setup, and the test-only
sleep/monotonic/run injections.

Net: 2354 deletions, ~140 LoC reduction in iris_monitor.py.
iris_monitor.py collect now SSHes every managed GCE VM (not just the
controller) and dumps controller+worker pod logs/describes/events on
CoreWeave, with --include-cluster-context for nodepools+autoscaler+
scheduler-state. Adds a port-forward command that replaces the inline
rollout/wait/forward/health-probe loop in iris-smoke-coreweave and
marin-canary-ferry-coreweave. notify-slack and claude-triage move into
composite actions so the four/three call sites stop drifting.
Long-lived kubectl port-forward dies on CW konnectivity ~4-30min into
runs. Each iris call now establishes its own ephemeral tunnel via the
iris CLI, surviving any number of API-server stream drops.
Same per-call-tunnel fix as the CW canary ferry. The pytest steps still
consume --controller-url against the long-lived port-forward; making
those resilient would touch the test conftest and is a follow-up.
When auto/update-dupekit-wheels already exists, the pyproject.toml
write from rust_package.py was being clobbered by the subsequent
'git checkout SHA -- .' that reset the working tree to the trigger
SHA. Reorder: branch checkout + reset first, then run rust_package
on top, then commit and push.
Use iris.cluster.types.is_job_finished + iris.rpc.job_pb2 directly
instead of mirroring a subset of JobState values (which was missing
JOB_STATE_KILLED / WORKER_FAILED / UNSCHEDULABLE and listed a
non-existent JOB_STATE_CANCELLED).

For the kubectl iris.job_id selector, reuse iris's controller-side
_sanitize_label_value so interior '/' in hierarchical job IDs maps
to '.' the same way the controller writes the label. The previous
code only stripped a leading '/' and mangled '_' to '-', so failure
diagnostics on CW silently returned zero pods.
The iris CLI already establishes a tunnel via the provider bundle
(kubectl port-forward on K8s, IAP/SSH on GCP) and prints the URL.
Replace the bespoke kubectl-port-forward + rollout-wait logic in
iris_monitor.py with a thin wrapper that runs `iris cluster dashboard`,
parses the URL, and probes /health.

Drop the now-unneeded port-forward step from marin-canary-ferry-coreweave:
every iris call there uses --config which lazy-tunnels per-call.

Strip slop comments per AGENTS.md across the script and workflows.
@rjpower rjpower force-pushed the workflow-scripts-impl branch from 87ff4e9 to 959605b Compare May 5, 2026 19:16
@rjpower rjpower merged commit 5fd3d1b into main May 5, 2026
44 of 45 checks passed
@rjpower rjpower deleted the workflow-scripts-impl branch May 5, 2026 19:39
ravwojdyla added a commit that referenced this pull request May 5, 2026
* follow up from #5354

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Github workflow cleanup

2 participants