forked from pytorch/pytorch
CI: Add ROCm nightly docker workflow #3115
Open: leo-automation wants to merge 53 commits into develop from rocm-nightly-gha
+185
−0
Changes from 18 commits
53 commits
507fd86 First draft (leo-automation)
59590af Pass artfiact and test before pushing image (leo-automation)
ef2aefd Matrix (leo-automation)
fc2c16d Remove notifications for now (leo-automation)
2a6a9e6 Temporary trigger (leo-automation)
a7223b2 Self hosted runners smoke test (leo-automation)
c818f73 Delete smokes (leo-automation)
454a5fa Verboose and runner change (leo-automation)
03393e9 Updae dockerfile (leo-automation)
551548d Remove tty (leo-automation)
e53f83c Use older buildx with better build logging (leo-automation)
2e68950 Verboose image build troublshooting (leo-automation)
1784746 Debug (leo-automation)
28cbb19 More logging (leo-automation)
ffdf09a FIx permissions and have main jib disable sccache (leo-automation)
865f60d Debug (leo-automation)
8711232 Fix debug script (leo-automation)
9b82418 Debug script fix (leo-automation)
1fbfc09 Implement Jithun's suggestions (leo-automation)
d74afa3 Updated timeout (leo-automation)
717a478 Remove debug (leo-automation)
2cd758b pin sscache version (leo-automation)
c28afaa Debug (leo-automation)
66ffb00 buildx fix (leo-automation)
554e8f5 Debug buildx (leo-automation)
bb8a72d sscache version change (leo-automation)
a3fb579 Pin upstream commit (leo-automation)
cd7374c sed on build and docker commit fix (leo-automation)
3b901b6 cmake deps (leo-automation)
f9c83ca Disable rocSHMEM (leo-automation)
2c7f9b9 Remove push (leo-automation)
3525232 Remove some debugging (leo-automation)
8fc34b8 Enable for debug (leo-automation)
1324872 Disable USE_NVSHMEM (leo-automation)
eb32e63 Enable image push (leo-automation)
cd940fe failed to read dockerfile (leo-automation)
6df0761 path fix (leo-automation)
cd81668 path fix (leo-automation)
7fd94cb Bypass sccache on torch_rocshmem (leo-automation)
e74bf12 Upgrade actioms versions (leo-automation)
8c25b4c Trivy vuln image scan (leo-automation)
758f32b All in one job (leo-automation)
af18af1 try 7.2.0 (leo-automation)
807c7a1 7.2 (leo-automation)
391c1d3 Bypass sschache on rochsmem torch target (leo-automation)
fb1c009 Remove cherry pick (leo-automation)
20df855 sscacge workaround (leo-automation)
077f47c Address comments (leo-automation)
012f035 Trivy increase context size (leo-automation)
88ec330 Try removing use_preprocessor_cache_mode from sccache (leo-automation)
7b8dd18 Cleanup (leo-automation)
0fe733e Add a FIXME (jithunnair-amd)
202d0c5 Address comments (leo-automation)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,37 @@ | ||
| ARG BASE_IMAGE=rocm/pytorch-autobuild:base-latest | ||
| FROM ${BASE_IMAGE} | ||
| WORKDIR /tmp | ||
| USER root | ||
|
|
||
| ENV CI=1 | ||
| ENV PYTORCH_TEST_WITH_ROCM=1 | ||
| ENV PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda" | ||
|
|
||
| RUN git clone https://github.com/pytorch/pytorch --recursive \ | ||
| && cd pytorch \ | ||
| && pip install -r requirements.txt \ | ||
| && git config --local user.name "AMD AMD" \ | ||
| && git config --local user.email "amd@amd.com" \ | ||
| && git remote add rocm https://github.com/ROCm/pytorch.git \ | ||
| && git fetch rocm \ | ||
| && git cherry-pick 519160d466782f5a62365be051fcb3ef90fa0b00 \ | ||
|
Collaborator (review comment): @leo-automation Do we need this as well? |
||
| && if ! .ci/pytorch/build.sh; then \ | ||
| echo "PyTorch build failed. Re-running likely failing HIP test targets with serial verbose Ninja output."; \ | ||
| if [ -d build ]; then \ | ||
| ninja -C build -t clean hip_half_test hip_distributions_test || true; \ | ||
| ninja -C build -j1 -v hip_half_test || true; \ | ||
| ninja -C build -j1 -v hip_distributions_test || true; \ | ||
| else \ | ||
| echo "Expected build directory 'build' was not found after failure."; \ | ||
| fi; \ | ||
| exit 1; \ | ||
| fi \ | ||
| && rm -rf /tmp/pytorch/.git | ||
| RUN git clone https://github.com/pytorch/vision \ | ||
|
leo-automation marked this conversation as resolved.
|
||
| && cd vision \ | ||
| && FORCE_CUDA=1 python setup.py install \ | ||
| && rm -rf /tmp/vision/.git | ||
| RUN git clone https://github.com/pytorch/audio \ | ||
| && cd audio \ | ||
| && python setup.py install \ | ||
| && rm -rf /tmp/audio/.git | ||
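
The first `RUN` step above builds PyTorch and, if the build fails, re-runs two HIP test targets serially with verbose output before exiting non-zero. A minimal sketch of that control flow, runnable outside Docker, follows. The helper name `build_or_diagnose` and the commands passed to it are hypothetical stand-ins, not part of this PR; the sketch only illustrates that diagnostics run best-effort (`|| true`) and that the original failure is still reported.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): run a build command; if it fails,
# run each diagnostic command best-effort, then still report the failure.
build_or_diagnose() {
  local build_cmd="$1"
  shift
  if sh -c "$build_cmd"; then
    return 0
  fi
  echo "Build failed; running diagnostics." >&2
  local diag
  for diag in "$@"; do
    sh -c "$diag" || true   # a failing diagnostic must not hide the build failure
  done
  return 1
}

# Stand-in commands: 'exit 1' plays the role of a failing .ci/pytorch/build.sh.
build_or_diagnose 'exit 1' 'echo "diagnostic 1"' 'false' 'echo "diagnostic 2"' \
  || echo "build step reported failure"
```

In the Dockerfile the same shape is expressed inline with `if ! .ci/pytorch/build.sh; then ...; exit 1; fi`, so the image build still fails but the layer log carries the serial, verbose Ninja output.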
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,131 @@ | ||
| #!/usr/bin/env bash | ||
|
|
||
| set -euxo pipefail | ||
|
|
||
| ARTIFACT_DIR="${ARTIFACT_DIR:-/debug-artifacts}" | ||
| WORKDIR=/tmp/pytorch | ||
| PATCH_SHA=519160d466782f5a62365be051fcb3ef90fa0b00 | ||
| LOG_HELPER="${LOG_HELPER:-/workspace/rocm-nightly-workflow/.github/scripts/run_with_log_heartbeat.sh}" | ||
| PYTORCH_SOURCE_SHA="${PYTORCH_SOURCE_SHA:-8a6524408a49ab2293f694b43131d0fc17e45a32}" | ||
| TARGET_NINJA="${TARGET_NINJA:-auto}" | ||
|
|
||
| detect_failed_target() { | ||
| local log_file=$1 | ||
| local failed_line | ||
| local target | ||
| local -a outputs | ||
|
|
||
| failed_line=$(grep -E '^FAILED: ' "$log_file" | tail -n 1 || true) | ||
| if [[ -z "$failed_line" ]]; then | ||
| return 1 | ||
| fi | ||
|
|
||
| failed_line=${failed_line#FAILED: } | ||
| read -r -a outputs <<< "$failed_line" | ||
| if [[ ${#outputs[@]} -eq 0 ]]; then | ||
| return 1 | ||
| fi | ||
|
|
||
| for target in "${outputs[@]}"; do | ||
| if [[ $target == "$WORKDIR/build/"* ]]; then | ||
| printf '%s\n' "${target#"$WORKDIR/build/"}" | ||
| return 0 | ||
| fi | ||
| if [[ $target != /* ]]; then | ||
| printf '%s\n' "$target" | ||
| return 0 | ||
| fi | ||
| done | ||
|
|
||
| printf '%s\n' "${outputs[0]}" | ||
| } | ||
|
|
||
| mkdir -p "$ARTIFACT_DIR" | ||
| if ! touch "$ARTIFACT_DIR/.write-test" 2>/dev/null; then | ||
| echo "Artifact directory '$ARTIFACT_DIR' is not writable by uid $(id -u)." >&2 | ||
| exit 1 | ||
| fi | ||
| rm -f "$ARTIFACT_DIR/.write-test" | ||
| rm -rf "$WORKDIR" | ||
|
|
||
| git clone https://github.com/pytorch/pytorch --recursive "$WORKDIR" | ||
| cd "$WORKDIR" | ||
| git checkout "$PYTORCH_SOURCE_SHA" | ||
| git submodule sync --recursive | ||
| git submodule update --init --recursive | ||
|
|
||
| pip install -r requirements.txt | ||
| git config --local user.name "AMD AMD" | ||
| git config --local user.email "amd@amd.com" | ||
| git remote add rocm https://github.com/ROCm/pytorch.git | ||
| git fetch rocm | ||
| git cherry-pick "$PATCH_SHA" | ||
|
|
||
| if bash "$LOG_HELPER" "$ARTIFACT_DIR/build.log" -- .ci/pytorch/build.sh; then | ||
| if [[ -f build/.ninja_log ]]; then | ||
| cp build/.ninja_log "$ARTIFACT_DIR"/ | ||
| fi | ||
| exit 0 | ||
| fi | ||
|
|
||
| if [[ -f build/.ninja_log ]]; then | ||
| cp build/.ninja_log "$ARTIFACT_DIR"/ | ||
| fi | ||
|
|
||
| if [[ ! -d build ]]; then | ||
| echo "Expected build directory 'build' was not found after the failed build." | tee -a "$ARTIFACT_DIR/build.log" | ||
| exit 1 | ||
| fi | ||
|
|
||
| rerun_target=$TARGET_NINJA | ||
| if [[ $rerun_target == auto ]]; then | ||
| rerun_target=$(detect_failed_target "$ARTIFACT_DIR/build.log" || true) | ||
| fi | ||
|
|
||
| if [[ -z "$rerun_target" ]]; then | ||
| echo "Unable to determine the failed Ninja target from build.log. Set TARGET_NINJA to override auto detection." | tee -a "$ARTIFACT_DIR/build.log" | ||
| exit 1 | ||
| fi | ||
|
|
||
| target_log_name="${rerun_target//[^A-Za-z0-9_.-]/_}.log" | ||
|
|
||
| # Capture the real error context from the original build.log. The main build | ||
| # runs with high parallelism, so the `FAILED:` line is typically buried under | ||
| # hundreds of lines of unrelated warnings from sibling targets that were | ||
| # compiling concurrently. Dump the window around it so the error is visible. | ||
| { | ||
| echo "=== Error context around FAILED: line in build.log ===" | ||
| awk ' | ||
| { buf[NR]=$0 } | ||
| /^FAILED: / && !printing { | ||
| start = NR-80; if (start<1) start=1 | ||
| for (i=start; i<NR; i++) if (i in buf) print buf[i] | ||
| printing=1; lines=0 | ||
| } | ||
| printing { print; lines++; if (lines>=120) exit } | ||
| ' "$ARTIFACT_DIR/build.log" || true | ||
| echo "=== End error context ===" | ||
| } | tee -a "$ARTIFACT_DIR/build.log" | ||
|
|
||
| echo "PyTorch build failed at source SHA ${PYTORCH_SOURCE_SHA}. Re-running detected target ${rerun_target} with serial verbose Ninja output." | tee -a "$ARTIFACT_DIR/build.log" | ||
|
|
||
| # Do NOT `ninja -t clean <target>` here: that is transitive and wipes every | ||
| # dependency of the target (often ~all of libtorch), forcing a multi-hour | ||
| # cold rebuild at -j1. The failing target's output does not exist because | ||
| # the build failed, so ninja will naturally re-run only the failing command. | ||
|
|
||
| # The .ci build epilogue stops sccache; restart it so the rerun can still | ||
| # hit whatever objects were cached during the main build. | ||
| if command -v sccache >/dev/null 2>&1; then | ||
| sccache --start-server || true | ||
| fi | ||
|
|
||
| if ! bash "$LOG_HELPER" "$ARTIFACT_DIR/$target_log_name" -- \ | ||
| ninja -C build -j1 -v "$rerun_target"; then | ||
| { | ||
| echo "Focused rerun of ${rerun_target} failed. Last 200 lines from ${target_log_name}:" | ||
| tail -n 200 "$ARTIFACT_DIR/$target_log_name" || true | ||
| } | tee -a "$ARTIFACT_DIR/build.log" | ||
| fi | ||
|
|
||
| exit 1 |
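
The awk program in the script above prints a window around the first `FAILED:` line: up to 80 lines before it, then the line itself and what follows, 120 lines in total from the match. The sketch below wraps the same logic in a function so it can be tried on a small log. The function name `print_failure_window`, its parameters, and the sample log lines are assumptions made for illustration, and the window sizes are shrunk to 2 and 2 so the effect is easy to see.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR):
#   print_failure_window LOG_FILE LINES_BEFORE LINES_FROM_MATCH
# Prints up to LINES_BEFORE lines preceding the first "FAILED: " line, then the
# matching line and the lines after it, LINES_FROM_MATCH lines in total.
print_failure_window() {
  awk -v before="$2" -v after="$3" '
    { buf[NR] = $0 }                      # remember every line seen so far
    /^FAILED: / && !printing {
      start = NR - before; if (start < 1) start = 1
      for (i = start; i < NR; i++) print buf[i]
      printing = 1; lines = 0
    }
    printing { print; lines++; if (lines >= after) exit }
  ' "$1"
}

# Made-up log content standing in for a parallel Ninja build log.
demo_log=$(mktemp)
printf '%s\n' 'warn 1' 'warn 2' 'warn 3' 'FAILED: foo.o' 'error: bad' 'warn 4' 'warn 5' >"$demo_log"
print_failure_window "$demo_log" 2 2
rm -f "$demo_log"
```

On the sample log this prints `warn 2`, `warn 3`, `FAILED: foo.o`, and `error: bad`. Like the original, the sketch keeps every line in `buf`, which is simple but grows with the log; a ring buffer indexed by `NR % before` would bound memory if that ever mattered.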
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,69 @@ | ||
| #!/usr/bin/env bash | ||
|
|
||
| set -euo pipefail | ||
|
|
||
| usage() { | ||
| echo "Usage: $0 LOG_FILE -- COMMAND [ARGS...]" >&2 | ||
| exit 2 | ||
| } | ||
|
|
||
| if [[ $# -lt 3 ]]; then | ||
| usage | ||
| fi | ||
|
|
||
| log_file=$1 | ||
| shift | ||
|
|
||
| if [[ $1 != "--" ]]; then | ||
| usage | ||
| fi | ||
| shift | ||
|
|
||
| heartbeat_seconds="${HEARTBEAT_SECONDS:-300}" | ||
| tail_lines="${TAIL_LINES:-200}" | ||
| check_interval=5 | ||
|
|
||
| mkdir -p "$(dirname "$log_file")" | ||
| : >"$log_file" | ||
|
|
||
| "$@" >"$log_file" 2>&1 & | ||
| cmd_pid=$! | ||
|
|
||
| cleanup() { | ||
| if kill -0 "$cmd_pid" 2>/dev/null; then | ||
| kill "$cmd_pid" 2>/dev/null || true | ||
| wait "$cmd_pid" 2>/dev/null || true | ||
| fi | ||
| } | ||
| trap cleanup EXIT | ||
|
|
||
| command_str=$(printf '%q ' "$@") | ||
| command_str=${command_str% } | ||
|
|
||
| next_heartbeat=0 | ||
| while kill -0 "$cmd_pid" 2>/dev/null; do | ||
| now=$(date +%s) | ||
| if (( now >= next_heartbeat )); then | ||
| echo "[$(date -u +%FT%TZ)] Command still running: ${command_str}" | ||
| echo "[$(date -u +%FT%TZ)] Log file: ${log_file} ($(du -h "$log_file" | cut -f1))" | ||
| next_heartbeat=$((now + heartbeat_seconds)) | ||
| fi | ||
| sleep "$check_interval" | ||
| done | ||
|
|
||
| if wait "$cmd_pid"; then | ||
| status=0 | ||
| else | ||
| status=$? | ||
| fi | ||
|
|
||
| trap - EXIT | ||
|
|
||
| if [[ $status -eq 0 ]]; then | ||
| echo "Command completed successfully. Full log saved to ${log_file}" | ||
| exit 0 | ||
| fi | ||
|
|
||
| echo "Command failed with exit code ${status}. Last ${tail_lines} lines from ${log_file}:" | ||
| tail -n "$tail_lines" "$log_file" || true | ||
| exit "$status" |
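
The wrapper above sends the command's output to a log file, prints a periodic heartbeat while the command runs, and propagates the command's exit status. A condensed sketch of that loop follows, written as a function so it can be exercised directly. The name `run_quietly`, the fixed one-second poll, and the 20-line tail are assumptions for illustration; the script in this PR uses `HEARTBEAT_SECONDS`, `TAIL_LINES`, and a 5-second check interval instead, and also installs an `EXIT` trap that this sketch omits.

```shell
#!/usr/bin/env bash
# Hypothetical, condensed variant of the heartbeat wrapper (not part of this PR):
#   run_quietly LOG_FILE HEARTBEAT_SECONDS COMMAND [ARGS...]
# HEARTBEAT_SECONDS is assumed to be a positive integer.
run_quietly() {
  local log_file="$1"
  local heartbeat_seconds="$2"
  shift 2
  local elapsed=0 status=0 cmd_pid

  # Run the command in the background with all output going to the log.
  "$@" >"$log_file" 2>&1 &
  cmd_pid=$!

  # Poll once per second; emit a heartbeat every heartbeat_seconds.
  while kill -0 "$cmd_pid" 2>/dev/null; do
    if [ "$elapsed" -gt 0 ] && [ $((elapsed % heartbeat_seconds)) -eq 0 ]; then
      echo "Command still running after ${elapsed}s: $*"
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done

  # wait reports the exit status of the finished background command.
  wait "$cmd_pid" || status=$?
  if [ "$status" -ne 0 ]; then
    echo "Command failed with exit code ${status}. Last lines from ${log_file}:"
    tail -n 20 "$log_file" || true
  fi
  return "$status"
}
```

A call such as `run_quietly /tmp/build.log 300 make -j8` (an illustrative command, not one used in this PR) would keep the CI console quiet apart from a heartbeat every five minutes, then show the tail of the log only if the command fails.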