
Add release prebuilt artifact reuse mode #5095

Open
marbre wants to merge 10 commits into main from users/marbre/release-prebuilt-artifact-reuse

@marbre marbre commented May 6, 2026

Motivation

In rare cases, a release re-run needs to reuse already-built artifacts, either because they were manually patched or because the re-run is restricted to a subset of targets and only repackages that subset. In this mode, the release workflows build tarballs, Python packages, and native Linux packages without starting a source build.

Technical Details

Test Plan

Test Result

Submission Checklist

marbre and others added 10 commits May 6, 2026 23:32
## Motivation

Some release reruns need to reuse already-built TheRock artifacts,
patch them manually, and reupload them to a fixed S3 prefix. The
release workflow should then point at that prefix and build tarballs,
Python packages, and native packages from those artifacts without
starting TheRock source builds.

## Changes

artifact_manager.py copy:
- Add --source-prefix-only that builds the source backend with
  lookup_workflow_run=False, reusing the dest backend's env-based
  bucket selection. Use this when the source artifacts live under a
  manually populated prefix in the active run's artifact namespace.
- Add --stage=all that unions every build stage's produced artifacts.
- Add --require-matches that fails when a requested family yields no
  matching source artifact. With --expand-family-to-targets, a family
  is satisfied by either a family-named artifact or any of its
  expanded target-named artifacts.
- Accept either ',' or ';' in --amdgpu-families so callers passing
  CSV workflow inputs can pass them through unchanged.
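A minimal sketch of the delimiter handling described in the last bullet, assuming the value arrives as a single string. The function name `split_families` is hypothetical and is not taken from artifact_manager.py.

```python
import re


def split_families(raw: str) -> list[str]:
    """Split a family list on ',' or ';', dropping blanks and whitespace.

    Hypothetical helper: illustrates accepting both delimiters so a CSV
    workflow input can be passed through unchanged.
    """
    return [part.strip() for part in re.split(r"[,;]", raw) if part.strip()]


if __name__ == "__main__":
    print(split_families("gfx942,gfx1100; gfx1201"))
```

Normalizing at the argument boundary keeps the rest of the copy logic delimiter-agnostic.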

build_tools/github_actions/verify_artifacts_ready.py:
- New helper that picks between the source-build-result and
  prebuilt-copy-result based on the prebuilt_prefix value and exits
  non-zero when the active producer failed. Encapsulates the gate
  logic so it is testable and reusable across Linux and Windows
  release workflows.
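The gate can be sketched roughly as follows. This is an illustration only: the function names and argument order are assumptions, and the real script's CLI is not shown in this message. The result strings follow the values GitHub Actions reports for `needs.<job>.result` (`success`, `failure`, `cancelled`, `skipped`).

```python
def active_producer_result(
    prebuilt_prefix: str, source_build_result: str, prebuilt_copy_result: str
) -> str:
    """Pick the result of whichever producer is active for this run.

    A non-empty prebuilt_prefix means prebuilt-copy mode; an empty one
    means source-build mode.
    """
    return prebuilt_copy_result if prebuilt_prefix else source_build_result


def verify(
    prebuilt_prefix: str, source_build_result: str, prebuilt_copy_result: str
) -> int:
    """Return a process exit code: 0 if the active producer succeeded, else 1."""
    result = active_producer_result(
        prebuilt_prefix, source_build_result, prebuilt_copy_result
    )
    return 0 if result == "success" else 1


if __name__ == "__main__":
    # Source-build mode: the skipped copy job must not mask a build failure.
    raise SystemExit(verify("", "failure", "skipped"))
```

Only the active producer is consulted, so a skipped inactive producer never fails the gate and never hides a failure of the active one.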

.github/workflows/copy_prebuilt_artifacts.yml:
- New reusable + workflow_dispatch workflow that invokes
  artifact_manager.py copy. Real copy steps are gated on
  prebuilt_prefix != '' so the reusable workflow is a successful
  no-op in source-build mode and downstream jobs can keep plain
  needs: wiring.

multi_arch_release.yml, multi_arch_release_linux.yml,
multi_arch_release_windows.yml:
- Plumb prebuilt_prefix through both triggers and into the Linux
  and Windows sub-workflows.
- Split source-build vs prebuilt-copy producers with a
  build_artifacts verifier gate that runs `if: always()` so
  producer failures are not masked by skipped jobs.
- Skip the PyTorch dispatch in prebuilt mode.

release_portable_linux_packages.yml, release_windows_packages.yml:
- Add prebuilt_prefix input on workflow_call and workflow_dispatch.
- Add a top-level always-runs reusable copy job.
- Per-family job depends on the copy job.
- Source-build steps gated to source mode.
- Prebuilt mode runs two artifact_manager fetches sharing a download
  cache: one per-artifact layout into BUILD_DIR/artifacts for
  build_python_packages.py, one flattened into BUILD_DIR/dist/rocm
  for the dist tarball, then tars the flattened tree so the existing
  upload step works in both modes.
- PyTorch (and JAX, on Linux) dispatches gated to source mode.
- Native package dispatches kept in both modes.
- TODO note flagging that empty `inputs.families` (defaults from
  fetch_package_targets.py) does not work in prebuilt mode yet.

## Test Plan

- python -m pytest
  build_tools/tests/artifact_manager_tool_test.py
  build_tools/github_actions/tests/verify_artifacts_ready_test.py
- YAML parse for the six touched/added workflows.
- git diff --check.

## Test Result

- 48 tests pass (29 existing + 8 new copy + 11 new verifier tests).
- All workflows parse cleanly.
- No whitespace issues.

Co-Authored-By: Claude <noreply@anthropic.com>
The single-family release workflows passed `inputs.families` directly
to the copy_prebuilt job. When a caller leaves `families` empty and
relies on fetch_package_targets.py to fill in the default list, the
copy job receives an empty amdgpu_families and copy --require-matches
exits 1.

Resolve the family list from the already-computed package_targets
matrix output instead, which honors the same defaults the matrix
itself uses. Removes the corresponding TODO comments.

Changes:
- release_portable_linux_packages.yml and release_windows_packages.yml:
  amdgpu_families is now `join(fromJSON(needs.setup_metadata.outputs
  .package_targets).*.amdgpu_family, ';')`.
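For readers less familiar with Actions expressions, a Python equivalent of that expression might look like the sketch below. It assumes `package_targets` is a JSON array of objects that each carry an `amdgpu_family` key, which is what the expression implies; the function name and the sample family names are hypothetical.

```python
import json


def families_from_package_targets(package_targets_json: str) -> str:
    """Mirror join(fromJSON(x).*.amdgpu_family, ';') in plain Python."""
    targets = json.loads(package_targets_json)
    return ";".join(target["amdgpu_family"] for target in targets)


if __name__ == "__main__":
    # Illustrative matrix output; real entries carry more fields.
    sample = '[{"amdgpu_family": "family-a"}, {"amdgpu_family": "family-b"}]'
    print(families_from_package_targets(sample))
```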

Co-Authored-By: Claude <noreply@anthropic.com>
The empty-copy_requests early return in do_copy short-circuited
before checking --require-matches. The flag's contract is "fail if
the source delivers no matching artifacts," and that contract has to
hold whether the empty result comes from a per-family miss or from
an empty source prefix. Without this, a prebuilt copy job can
succeed while delivering nothing - exactly the silent regression
--require-matches exists to prevent.

Move the require_matches check before the early return.

Test: new test covers --require-matches + no families + empty
source prefix, expecting SystemExit(1).
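The ordering fix, together with the family-satisfaction rule from the original `--require-matches` change, can be sketched as follows. All names here are hypothetical stand-ins rather than the actual `do_copy` code, and artifact matching is reduced to plain name sets.

```python
def unmatched_families(families, expanded_targets, matched_names):
    """Return families matched neither by their own name nor by any expanded target."""
    missing = []
    for family in families:
        candidates = {family, *expanded_targets.get(family, ())}
        if candidates.isdisjoint(matched_names):
            missing.append(family)
    return missing


def copy_with_require_matches(
    copy_requests, families, expanded_targets, matched_names, require_matches
):
    """Run the require-matches check BEFORE the empty-request early return."""
    if require_matches:
        missing = unmatched_families(families, expanded_targets, matched_names)
        if missing or not copy_requests:
            # Fail on a per-family miss and on an entirely empty source prefix.
            raise SystemExit(1)
    if not copy_requests:
        return 0  # Nothing to copy, and the caller did not demand matches.
    return len(copy_requests)  # Stand-in for performing the copies.
```

With the check placed first, an empty source prefix fails the job instead of returning early with a success.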

Co-Authored-By: Claude <noreply@anthropic.com>
`fetch --stage=all` returns set(topology.artifacts.keys()) - every
artifact in the topology. `copy --stage=all` was looping over every
build stage and unioning topology.get_produced_artifacts(stage), so
any artifact present in the topology but not produced by any stage
was silently skipped on copy.

Make `copy --stage=all` use the same direct topology-artifacts
assignment as fetch.

Test: extend the test topology with an orphan-group whose artifact
no build stage produces, and assert it is included in `copy --stage=
all` results.
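The difference between the two selection strategies can be illustrated with plain dictionaries standing in for the topology. The data shapes and function names below are assumptions; only `topology.artifacts` and `get_produced_artifacts` are named in this message.

```python
def artifacts_via_stage_union(stage_outputs: dict[str, set[str]]) -> set[str]:
    """Old copy behavior: union what each build stage produces."""
    result: set[str] = set()
    for produced in stage_outputs.values():
        result |= produced
    return result


def artifacts_via_topology(all_artifacts: dict[str, object]) -> set[str]:
    """New behavior, matching fetch: every artifact known to the topology."""
    return set(all_artifacts.keys())


if __name__ == "__main__":
    # "orphan" exists in the topology but no stage produces it.
    artifacts = {"base": None, "core": None, "orphan": None}
    stages = {"stage-1": {"base"}, "stage-2": {"core"}}
    print(artifacts_via_topology(artifacts) - artifacts_via_stage_union(stages))
```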

Co-Authored-By: Claude <noreply@anthropic.com>
The prebuilt-mode "Fetch prebuilt artifacts" and "Build dist tarball
from prebuilt artifacts" steps were guarded with both
`prebuilt_prefix != ''` and `github.repository_owner == 'ROCm'`.
Source-build steps are guarded only on `prebuilt_prefix == ''`, so
in any non-ROCm-owned context (forks, manual reruns from a fork)
prebuilt mode silently broke: source build was skipped, fetch was
skipped, then "Build Python Packages" ran against an empty
${BUILD_DIR}/artifacts.

Make the prebuilt gates symmetric with the source-build gates: only
prebuilt_prefix == '' / != '' decides which producer runs.

Co-Authored-By: Claude <noreply@anthropic.com>
The Linux release workflow correctly skips the PyTorch wheel
dispatch when prebuilt_prefix is set (framework wheel builds do not
yet accept the prebuilt artifact source). Windows was missed - it
checked only github.repository_owner and expect_pytorch_failure.
Add the prebuilt_prefix == '' guard so the Windows gate matches.

Co-Authored-By: Claude <noreply@anthropic.com>
_create_source_backend in prefix-only mode previously re-derived the
source bucket via WorkflowOutputRoot.from_workflow_run with
lookup_workflow_run=False - i.e. the same env-only path. That was
equivalent to dest today only because both went through env. If
do_copy ever passes a CLI --run-github-repo override into
create_backend_from_env, dest would pick it up and source would not,
silently diverging.

Have prefix-only construct WorkflowOutputRoot directly from dest's
resolved bucket and external_repo (with the source's own run_id and
platform). Construct dest first in do_copy and pass its output_root
in. Bake the same invariant into FailingBackend so existing tests
that go through do_copy keep working.

Test: replace the lookup_workflow_run=False assertion with one that
gives dest a deliberately non-env bucket/external_repo and verifies
the source S3Backend is constructed with those exact values plus
the source run_id.
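A rough sketch of the invariant, using a stand-in dataclass rather than the real `WorkflowOutputRoot` (whose constructor is not shown here). The field names follow the ones mentioned above; everything else is hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class OutputRootSketch:
    """Stand-in for WorkflowOutputRoot; the fields are assumptions."""

    bucket: str
    external_repo: str
    run_id: str
    platform: str


def source_root_from_dest(
    dest: OutputRootSketch, source_run_id: str, source_platform: str
) -> OutputRootSketch:
    """Reuse dest's resolved bucket and external_repo, swapping in the
    source's own run_id and platform."""
    return replace(dest, run_id=source_run_id, platform=source_platform)
```

Deriving the source from the already-resolved dest means any override that affects dest automatically applies to source, so the two cannot diverge.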

Co-Authored-By: Claude <noreply@anthropic.com>
The single-family release workflows had a static one-line run-name
that did not surface whether a run was in source-build or prebuilt-
copy mode, making it hard to spot prebuilt reruns in the run list.
Convert both run-names to folded scalars that append " | prebuilt:
<prefix>" when inputs.prebuilt_prefix is set, matching the pattern
already used by multi_arch_release.yml and the rockrel wrappers.

Co-Authored-By: Claude <noreply@anthropic.com>
Without an explicit run-name, workflow_dispatch runs of this
reusable workflow show only the workflow name in the run list, so
parameters (target platform, release type, source prefix) only
become visible after opening the run. Add a run-name that surfaces
those at the top level for easier triage.

Co-Authored-By: Claude <noreply@anthropic.com>
Two stale comments lied about behavior after the recent fixes, and
the Windows PyTorch dispatch was missing the clarifying comment its
Linux sibling has. Update artifact_manager.py copy --stage help to
say "all" copies every artifact in the topology (mirrors
fetch --stage=all) instead of "unions every build stage". Rewrite
the Linux release comment that claimed native package dispatches are
skipped in prebuilt mode - only PyTorch and JAX are skipped; native
packages still run from the copied artifacts. Add the same
explanation on the Windows PyTorch dispatch.

Co-Authored-By: Claude <noreply@anthropic.com>

@ScottTodd ScottTodd left a comment

We discussed this PR offline. Some ideas:

  • Frame as "release repackaging" or "release promotion" instead of "prebuilt artifacts"
  • Allow CI workflows to mix prebuilt and built-from-source artifacts, but draw a harder line in release workflows: use either all prebuilt artifacts (for repackaging) or all built-from-source artifacts
  • Move to https://github.com/ROCm/rockrel if focused on promotion

Comment on lines +63 to +65
  copy_prebuilt:
    name: Copy Prebuilt Artifacts
    uses: ./.github/workflows/copy_prebuilt_artifacts.yml

@ScottTodd ScottTodd May 6, 2026

Another thing to watch for:

base_lib_generic.tar.zst contains a dist_info.json with contents like

{
  "dist_amdgpu_targets" : "gfx942;gfx1100;gfx1101;gfx1102;gfx1103;gfx1151;gfx1200;gfx1201;gfx950"
}

Repackaging only a subset of existing artifacts with a changed version (e.g. releasing a subset of the targets built for a nightly release as a stable release) will carry over the original list, so tools like `rocm-sdk targets` may return a list that does not match expectations (see also #4687).
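One way to detect this mismatch before publishing could look like the sketch below. It assumes the `dist_info.json` format shown above (a semicolon-separated `dist_amdgpu_targets` string); the function name is hypothetical and nothing like it is part of this PR.

```python
import json


def stale_targets(dist_info_text: str, released_targets: set[str]) -> set[str]:
    """Return targets listed in dist_info.json that are not being released."""
    info = json.loads(dist_info_text)
    listed = {t for t in info["dist_amdgpu_targets"].split(";") if t}
    return listed - released_targets


if __name__ == "__main__":
    dist_info = '{"dist_amdgpu_targets": "gfx942;gfx1100;gfx950"}'
    # Releasing only gfx942 leaves the other two as stale entries.
    print(sorted(stale_targets(dist_info, {"gfx942"})))
```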
