Fix jax plugin build #1226
Conversation
marbre
left a comment
Drive-by: I noticed jax_rocm7_pjrt-0.6.0-py3-none-manylinux_2_28_x86_64.whl and jax_rocm7_plugin-0.6.0-cp310-cp310-manylinux_2_28_x86_64.whl got uploaded to v2/gfx942 in the dev bucket but they must go into v2/gfx94X-dcgpu instead.
Gotcha. Looks like the
Force-pushed from 5e6ffc9 to 61b1565 (Compare)
Other workflows incorporate scripts that use https://github.com/ROCm/TheRock/blob/main/build_tools/github_actions/amdgpu_family_matrix.py. @geomin12 has more context here, but normally we try to avoid having too many inputs.
hi! yes, for inputs, we figured that the abbreviation would be easier. I would recommend creating a setup job/step and using GitHub outputs.
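A minimal sketch of that pattern, under assumptions (job/output names are illustrative, and amdgpu_family_matrix.py is assumed here to print a JSON list of families): a setup job computes the value once and later jobs read it from the job's outputs.

```yaml
jobs:
  setup:
    runs-on: ubuntu-latest
    outputs:
      # Hypothetical output name; computed once, consumed by later jobs.
      amdgpu_families: ${{ steps.families.outputs.amdgpu_families }}
    steps:
      - uses: actions/checkout@v4
      - id: families
        run: |
          # Assumption: the script prints a JSON list like ["gfx94X-dcgpu", ...]
          echo "amdgpu_families=$(python build_tools/github_actions/amdgpu_family_matrix.py)" >> "$GITHUB_OUTPUT"

  build:
    needs: setup
    runs-on: ubuntu-latest
    strategy:
      matrix:
        amdgpu_family: ${{ fromJSON(needs.setup.outputs.amdgpu_families) }}
    steps:
      - run: echo "Building for ${{ matrix.amdgpu_family }}"
```

This keeps the workflow's `inputs` surface small: callers never pass the family list; it is derived in one place.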
Any progress needed here? I understand JAX was critical to enable a specific set of users.
Force-pushed from 702870b to 8943ecd (Compare)
Do you guys want me to set this up to run on a schedule? Nightly? Weekly? @gabeweisz
Nightly please |
Force-pushed from 96dce9d to 73095e2 (Compare)
@@ -247,6 +247,20 @@ jobs:
      "rocm_version": "${{ needs.setup_metadata.outputs.version }}"
    }

  - name: Trigger build JAX wheels
@marbre I think this is the right place to stick the trigger for the nightly build. Is there a way to get the URL of the latest TheRock .tar release though? I didn't see a simple way to do it through this workflow.
The base URL right now is always https://therock-nightly-tarball.s3.us-east-2.amazonaws.com/ (no CloudFront distribution planned). ROCm version is known by ${{ needs.setup_metadata.outputs.version }} and GPU family via ${{ matrix.target_bundle.amdgpu_family }}. Thus the URL should be
https://therock-nightly-tarball.s3.us-east-2.amazonaws.com/therock-dist-linux-${{ matrix.target_bundle.amdgpu_family }}-${{ needs.setup_metadata.outputs.version }}.tar.gz
See https://github.com/ROCm/TheRock/blob/bb4372b9b502177915ade4d0d9f97397283212fd/.github/workflows/release_portable_linux_packages.yml#L117C53-L117C193 and maybe even better use
therock-dist-linux-${{ matrix.target_bundle.amdgpu_family }}${{ inputs.package_suffix }}-${{ needs.setup_metadata.outputs.version }}.tar.gz
for the filename to make it more robust.
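Put together, the trigger step might compose the URL like this (a sketch; the step layout and the `TAG_URL` env var name are illustrative, while the expressions come from this thread):

```yaml
- name: Trigger build JAX wheels
  env:
    # inputs.package_suffix is empty for scheduled builds by default, so the
    # suffix only shows up in manually triggered builds that set it.
    TAG_URL: >-
      https://therock-nightly-tarball.s3.us-east-2.amazonaws.com/therock-dist-linux-${{ matrix.target_bundle.amdgpu_family }}${{ inputs.package_suffix }}-${{ needs.setup_metadata.outputs.version }}.tar.gz
  run: echo "Would pass tag_url=$TAG_URL to the JAX wheel build workflow"
```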
If I understand correctly, that's the tarball that will sit on the worker's local filesystem? I'm not super familiar with that specific workflow dispatch action, but is it guaranteed to run the wheel build workflow on the same runner as the workflow that called it? If not, we should keep using the URL.
Oh, I was only referring to where the variable is composed. The tarball behind it is uploaded to S3 here: https://github.com/ROCm/TheRock/blob/bb4372b9b502177915ade4d0d9f97397283212fd/.github/workflows/release_portable_linux_packages.yml#L210
I wanted to point out that the file name can have the package_suffix in it if set. This does not apply to scheduled builds by default but can be the case for manually triggered builds.
Ahh, gotcha. Also, right now this points to
https://therock-nightly-tarball.s3.us-east-2.amazonaws.com/therock-dist-linux-${{ matrix.target_bundle.amdgpu_family }}${{ inputs.package_suffix }}-${{ needs.setup_metadata.outputs.version }}.tar.gz
Could we also have
https://therock-dev-tarball.s3.us-east-2.amazonaws.com/therock-dist-linux-${{ matrix.target_bundle.amdgpu_family }}${{ inputs.package_suffix }}-${{ needs.setup_metadata.outputs.version }}.tar.gz
for dev builds? (Notice the change in the first part of the URL)
I would encode this as a build/release type variable so that it can create either URL.
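One way to sketch that (the bucket names come from this thread; the step layout, step id, and the assumption that `RELEASE_TYPE` is either `nightly` or `dev` are hypothetical): pick the bucket from the release type, then reuse the resulting base URL when composing `tag_url`.

```yaml
- name: Compute tarball base URL
  id: tarball
  run: |
    # Assumption: env.RELEASE_TYPE is "nightly" or "dev".
    if [[ "${{ env.RELEASE_TYPE }}" == "nightly" ]]; then
      bucket="therock-nightly-tarball"
    else
      bucket="therock-dev-tarball"
    fi
    echo "base_url=https://${bucket}.s3.us-east-2.amazonaws.com" >> "$GITHUB_OUTPUT"
```

Later steps would then reference `${{ steps.tarball.outputs.base_url }}` instead of hardcoding either bucket.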
- name: Install AWS CLI
  if: always()
  run: |
    curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
    unzip awscliv2.zip
    ./aws/install -i ./aws-cli-files -b ./aws-bin
    sudo ./aws/install
if you use this container, it already comes with the AWS CLI installed! I would recommend replacing this and just using the container :)
We cannot use the container that has AWS pre-installed. The build/ci_build script that does the JAX wheel build creates a docker image and does the actual build inside of that so that we're manylinux compliant.
in that case, we do have this script (https://github.com/ROCm/TheRock/blob/main/dockerfiles/install_awscli.sh) that does the same thing! Less duplication!
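The inline install step above could then shrink to something like this sketch (assuming the checkout already contains the repo and the script needs no arguments):

```yaml
- name: Install AWS CLI
  if: always()
  # Reuses the shared installer instead of duplicating the curl/unzip/install steps.
  run: bash dockerfiles/install_awscli.sh
```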
I'll give that a shot
@@ -104,5 +119,6 @@ jobs:
  - name: (Re-)Generate Python package release index
    if: ${{ github.repository_owner == 'ROCm' }}
    run: |
      pip install boto3 packaging
      python ./build_tools/third_party/s3_management/manage.py ${{ inputs.s3_subdir }}/${{ inputs.amdgpu_family }}
      sudo apt install python3-venv python3-pip -y
can be removed if the container is used! You can also use TheRock/.github/actions/setup_test_environment/action.yml (lines 23 to 26 in 7dff9ee) to help with Python versioning!
We can't use the container, but I'll switch it to using the actions step
"release_type": "${{ env.RELEASE_TYPE }}",
"s3_subdir": "${{ env.S3_SUBDIR }}",
"rocm_version": "${{ needs.setup_metadata.outputs.version }}",
"tag_url": "${{ ??? }}"
CI fails with invalid workflow on this.
- name: "Setting up Python"
  uses: actions/setup-python@42375524e23c412d93fb67b49958b491fce71c38 # v5.4.0
  with:
    python-version: 3.11
Want to use the input here (inputs.python_versions)?
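That suggestion would look roughly like this sketch (assuming inputs.python_versions holds a single version string rather than a list; if it is a list, it would need a matrix instead):

```yaml
- name: "Setting up Python"
  uses: actions/setup-python@42375524e23c412d93fb67b49958b491fce71c38 # v5.4.0
  with:
    # Assumption: inputs.python_versions is a single version like "3.11".
    python-version: ${{ inputs.python_versions }}
```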
"release_type": "${{ env.RELEASE_TYPE }}",
"s3_subdir": "${{ env.S3_SUBDIR }}",
"rocm_version": "${{ needs.setup_metadata.outputs.version }}",
"tag_url": "https://therock-nightly-tarball.s3.us-east-2.amazonaws.com/therock-dist-linux-${{ matrix.target_bundle.amdgpu_family }}-${{ needs.setup_metadata.outputs.version }}.tar.gz"
nit: pretty sure you can remove the us-east-2 part
Yeah, should be possible as well.
geomin12
left a comment
lgtm! although, can we test this by triggering release_portable_linux_packages using workflow_dispatch? just to make sure everything works?
Force-pushed from beec084 to 107c6bf (Compare)
I kicked off a nightly build with this branch: https://github.com/ROCm/TheRock/actions/runs/17479247009
Co-authored-by: Scott Todd <scott.todd0@gmail.com>
Force-pushed from fa9b009 to d95caf7 (Compare)
- name: Upload wheels to S3
  if: ${{ github.repository_owner == 'ROCm' }}
  run: |
    aws s3 cp ${{ env.PACKAGE_DIST_DIR }}/ s3://${{ env.S3_BUCKET_PY }}/${{ inputs.s3_subdir }}/${{ inputs.amdgpu_family }}/ \
      --recursive --exclude "*" --include "*.whl"
How close are these wheels to being ready to advertise? If they're not ready yet, should we upload them to s3_staging_subdir instead of s3_subdir? We started uploading pytorch wheels to "staging" first, then we copy them out once tests complete.
Once ready, we should advertise JAX support here:
- https://github.com/ROCm/TheRock?tab=readme-ov-file#features (change "JAX ... builds are in the works" language)
- https://github.com/ROCm/TheRock?tab=readme-ov-file#nightly-release-status (with a badge for workflow status)
- Add a section to https://github.com/ROCm/TheRock/blob/main/RELEASES.md for JAX like the pytorch sections
- Maybe add dedicated docs for how to build JAX (or link to a repo with docs). Could maybe add a page like https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/README.md for JAX, with just a README but no code/scripts?
These are not ready. I don't have the automated testing in place yet to make sure that they're good. Will switch to using the staging directory.
Thanks. You might want to put a TODO in there to
- Plumb through both the staging and "prod" URL
- First upload to staging, then run tests, then copy to "prod" if tests passed (matching what we do for torch)
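That staged flow might eventually look something like this sketch (the step names, the `s3_staging_subdir` input wiring, and the placement of the test step are assumptions; the upload command mirrors the existing one):

```yaml
- name: Upload wheels to S3 staging
  if: ${{ github.repository_owner == 'ROCm' }}
  run: |
    aws s3 cp ${{ env.PACKAGE_DIST_DIR }}/ s3://${{ env.S3_BUCKET_PY }}/${{ inputs.s3_staging_subdir }}/${{ inputs.amdgpu_family }}/ \
      --recursive --exclude "*" --include "*.whl"

# TODO: run wheel smoke tests between these two steps.

- name: Promote wheels from staging
  if: ${{ success() && github.repository_owner == 'ROCm' }}
  # S3-to-S3 copy; only promote once the test step above has passed.
  run: |
    aws s3 cp s3://${{ env.S3_BUCKET_PY }}/${{ inputs.s3_staging_subdir }}/${{ inputs.amdgpu_family }}/ \
      s3://${{ env.S3_BUCKET_PY }}/${{ inputs.s3_subdir }}/${{ inputs.amdgpu_family }}/ \
      --recursive --exclude "*" --include "*.whl"
```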
Motivation
Fixes problems with the JAX CI workflow that was created in #1033
Technical Details
Fixes minor problems that made the workflow crash, and ensures the command-line tools we need are installed.
Test Plan
Make sure the workflow works when run with Actions: https://github.com/ROCm/TheRock/actions/workflows/build_linux_jax_wheels.yml