Skip to content
Open
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions .github/workflows/test_pytorch_wheels_full.yml
Original file line number Diff line number Diff line change
Expand Up @@ -240,6 +240,11 @@ jobs:
echo "CC=${ROCM_HOME}/lib/llvm/bin/clang" >> "${GITHUB_ENV}"
echo "CXX=${ROCM_HOME}/lib/llvm/bin/clang++" >> "${GITHUB_ENV}"

- name: Install C++ build tools for PyTorch tests
run: |
apt-get update
apt-get install -y --no-install-recommends build-essential
Comment on lines +278 to +281
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This workflow makes an attempt to be portable to Windows, see the if: ${{ runner.os == 'Windows' steps below. These commands that will only ever work on Linux should either be

  1. Moved into a test dockerfile used on Linux so they aren't downloaded from apt every workflow run (canonical's servers have gone offline a few times recently, leading to workflow instability that would be avoided if all deps were in our own dockerfiles)
  2. Gated by a if: ${{ runner.os == 'Linux' }} condition

That being said, we may also want to segment the tests that need build-essential somehow, similar to tests that need rocm development headers/files. Users should get a functional install of pytorch with just pip install torch, and anything that needs to be separately installed at the operating system level should be treated with care. If only 1% of the software/tests need additional packages we don't want to miss that suddenly growing to 50%+ since all of our test workflows preinstall packages.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree. my old comment (which i never sent) was "why is this needed?".

definitely needs to be flagged behind linux and should have some explanation why it is needed in the first plac.e


# export_test_times.py must run from the PyTorch repo root with
# PYTHONPATH=. because it imports internal PyTorch modules (e.g.
# tools.stats.*) that expect the repo root on sys.path.
Expand Down
89 changes: 89 additions & 0 deletions external-builds/pytorch/skip_tests/pytorch_2.11.py
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these test failures/skips unique to torch version 2.11? What about 2.12: https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/skip_tests/pytorch_2.12.py ?

Original file line number Diff line number Diff line change
Expand Up @@ -67,6 +67,95 @@
"torch": [
"test_cpp_warnings_have_python_context_cuda",
],
"utils": [
# ROCm devel/runtime-dependent UT. Skip in the PyTorch full-suite lane;
# this is expected to run in the separate ROCm devel UT step.
"test_load_standalone",
],
"multiprocessing": [
# ROCm devel/runtime-dependent UTs. Skip in the PyTorch full-suite
# lane; these are expected to run in the separate ROCm devel UT step.
"(test_fs and not test_fs_)",
"test_fs_is_shared",
"test_fs_pool",
"test_fs_preserve_sharing",
"test_fs_sharing",
],
"serialization": [
# TestSerialization - NJT weights_only import check
# TestOldSerialization - CI env assertion
"test_debug_set_in_ci",
],
"modules": [
# TestModuleCUDA - CTCLoss cpu/gpu parity scalar mismatch
"test_cpu_gpu_parity_nn_CTCLoss_cuda_float32",
# TestModuleCUDA - CTCLoss forward scalar mismatch
"test_forward_nn_CTCLoss_cuda_float32",
],
"export": [
# TestExportOnFakeCudaCUDA - subprocess import fails: missing librocm_sysdeps_liblzma.so.5
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This rather sounds like a bug that should be fixed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree librocm_sysdebs_liblzma must be available otherwise it is a bug

"test_fake_export___getitem___cuda_float32",
"test_fake_export_nn_functional_batch_norm_cuda_float32",
"test_fake_export_nn_functional_batch_norm_without_cudnn_cuda_float32",
"test_fake_export_nn_functional_conv2d_cuda_float32",
"test_fake_export_nn_functional_instance_norm_cuda_float32",
"test_fake_export_nn_functional_multi_margin_loss_cuda_float32",
"test_fake_export_nn_functional_scaled_dot_product_attention_cuda_float32",
"test_fake_export_nonzero_cuda_float32",
"test_preserve_original_behavior_cuda",
],
"inductor": [
# inductor/test_aot_inductor_package: AOTI C++ package tests need
# more complete CMake/runtime library-path handling in the wheel CI lane.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the comment already says this might need proper handling in the wheel, why are tests skipped instead of aiming for a fix?

"test_compile_after_package_multi_arch",
"test_compile_after_package_static",
"test_compile_standalone_cos",
"test_compile_with_exporter",
"test_compile_with_exporter_weights",
#Also failed on https://github.com/ROCm/TheRock/actions/runs/24898379109
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logs will expire, so this needs an issue to track.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and please add a proper error message here - so we know why it was added to the skip list

"test_flex_attention_logging_cuda",
"test_linalg_eig_stride_consistency_cuda",
"test_linalg_eig_stride_consistency_cuda",
"test_linalg_eig_stride_consistency_dynamic_shapes_cuda",
"test_repeated_calling_cuda",
#new test
"test_run2run_determinism_model_name_DistillGPT2_training_or_inference_inference_precision_amp",
#Passed in https://github.com/ROCm/TheRock/actions/runs/24898379109
"test_return_aux_deprecation_warnings_cuda_float16",
Comment on lines +123 to +124
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is is passing, why is it on the exclude list? As above, logs will expire and this might need a proper issue.

],
"functorch": [
#passed on https://github.com/ROCm/TheRock/actions/runs/24898379109
"test_torch_return_types_returns_cuda",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so are those tests flaky or why are the passing in one but are here added to the skip list?

],
"jit_fuser_te": [
#passed on https://github.com/ROCm/TheRock/actions/runs/24898379109
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

"test_binary_div_ops",
"test_binary_ops",
"test_binary_tensor_scalar_ops",
"test_ternary_norm_ops",
"test_ternary_ops",
"test_unary_ops",
"test_where_ops",
"test_binary_div_ops",
"test_binary_ops",
"test_binary_tensor_scalar_ops",
"test_ternary_norm_ops",
"test_ternary_ops",
"test_unary_ops",
"test_where_ops",
],
"torch_config_hash_determinism": [
#passed on https://github.com/ROCm/TheRock/actions/runs/24898379109
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

"test_inductor_config_hash_portable_deterministic",
"test_inductor_config_hash_portable_without_ignore",
],
"spectral_ops": [
#new failures
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new failures - why? rocm or pytorch? would be good to triage a bit if it is a new failure to know where the regresison comes from

and please again: more error messages as comments here so we can better categorize the test failure

#skipping these tests for a cleaner run
#these tests passed in https://github.com/ROCm/TheRock/actions/runs/25360031025
"test_istft_against_librosa_cuda_float64",
"test_stft_cuda_float64",
],
},
"gfx942": {
"cuda": [
Expand Down
Loading