
Fix gpu builds #340

Merged
merged 7 commits into conda-forge:main from fix_gpu
Feb 1, 2025
Conversation

h-vetinari
Member

Fall-out from #318, especially the torchinductor tests.

@conda-forge-admin
Contributor

conda-forge-admin commented Jan 30, 2025

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found it was in an excellent condition.

I do have some suggestions for making it better though...

For recipe/meta.yaml:

  • ℹ️ The recipe is not parsable by parser conda-souschef (grayskull). This parser is not currently used by conda-forge, but may be in the future. We are collecting information to see which recipes are compatible with grayskull.
  • ℹ️ The recipe is not parsable by parser conda-recipe-manager. The recipe can only be automatically migrated to the new v1 format if it is parsable by conda-recipe-manager.

This message was generated by GitHub Actions workflow run https://github.com/conda-forge/conda-forge-webservices/actions/runs/13080209035. Examine the logs at this URL for more detail.

@h-vetinari
Member Author

> Fall-out from #318, especially the torchinductor tests.

By adding these tests, the pytorch test suite run on windows went from

= 7486 passed, 1432 skipped, 43 deselected, 31 xfailed, 75940 warnings in 1067.19s (0:17:47) =

to

= 8122 passed, 1494 skipped, 45 deselected, 31 xfailed, 75969 warnings in 3534.84s (0:58:54) =

I don't have timings for linux yet, but I assume it will also increase (though perhaps less so). In any case, I think I'll constrain these torchinductor tests to run for only one python version.
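As an illustration of what such a constraint could look like, here is a minimal sketch of a gate in the recipe's test script. The version pin (3.12) matches the logs further down, but the variable handling and file list are assumptions, not the feedstock's actual code:

    # sketch: only add the slow torchinductor suite for a single python version
    # (PY_VER is the version string conda-build exposes to the test environment)
    tests="test/test_autograd.py test/test_nn.py test/test_torch.py"
    if [[ "${PY_VER}" == "3.12" ]]; then
        tests="${tests} test/inductor/test_torchinductor.py"
    fi
    python -m pytest -n 2 ${tests} -m "not hypothesis" --durations=50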

@hmaarrfk
Contributor

Would it help if I built the linux-gpu stuff?

@hmaarrfk
Contributor

I guess what I'm asking is: are you ready for me to start some jobs locally?

@h-vetinari
Member Author

> Would it help if I built the linux-gpu stuff?

Thanks for the kind offer! The server has been working well lately but right now the resources are occupied by conda-forge/libmagma-feedstock#24. The torchinductor tests have really been a PITA to fix, but I feel we're finally close now.

linux-aarch64 + CUDA is passing already, and so is windows (aside from the fact that it ran into a timeout, which I've addressed with the last commits). So the only thing I still need to see here is whether the full test suite can pass on linux-64 + CUDA, because even though I tried to look for needles in the haystack, it's possible that some singleton test failures still need to be skipped.

The whole situation is complicated by the fact that we also need some fixes for the CMake metadata to make it usable at all (i.e. compiling against CUDA-enabled libtorch on a CPU agent - which is the situation we're in for all the dependent packages compiling against pytorch). I have a hack for that more or less ready in #339, but I want to separate these concerns, because it's always possible that I still overlooked something.

In my ideal scenario, the CI for this PR and #339 could run (CUDA-only; no aarch) essentially at full speed, if I had the entire server available (hence my request to cancel the builds in conda-forge/libmagma-feedstock#24 for now). Then I could decide at the end of my day (in ~14h) which one to merge, based on whether the CMake stuff is passing or not.

So it depends on how much capacity you have. You could try a build for one of the CUDA variants in #339 and tell me if it passes. If so, we could use the artefact from that for publication right away.

@h-vetinari
Member Author

Update: now running at full steam, so this should be good for now, and we should have at least one passing PR in ~12h that we can merge. If you do have spare cycles for building things locally, I think the much bigger impact would be if we could unblock tensorflow... 😅

@h-vetinari
Member Author

So finally, the MKL+CUDA build passed. There's something strange with the tests though - the run for py311 collects a whole bunch more tests and takes longer than elsewhere (even longer than py312, which is the only version where we run the inductor tests).

TEST START: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py311_h3846359_311.conda
+ python -m pytest -n 2 test/test_autograd.py test/test_autograd_fallback.py test/test_custom_ops.py test/test_linalg.py test/test_mkldnn.py test/test_modules.py test/test_nn.py test/test_torch.py test/test_xnnpack_integration.py -k 'not ((TestTorch and test_print) or (TestAutograd and test_profiler_seq_nr) or (TestAutograd and test_profiler_propagation) or test_mutable_custom_op_fixed_layout or test_BCELoss_weights_no_reduce_cuda or test_ctc_loss_cudnn_tensor_cuda  or (TestTorch and test_index_add_correctness) or test_sdpa_inference_mode_aot_compile or (TestNN and test_grid_sample) or test_indirect_device_assert or (GPUTests and test_scatter_reduce2) or (TestLinalgCPU and test_inverse_errors_large_cpu) or test_base_does_not_require_grad_mode_nothing or test_base_does_not_require_grad_mode_warn or test_composite_registered_to_cpu_mode_nothing)' -m 'not hypothesis' --durations=50
= 13171 passed, 2586 skipped, 91 xfailed, 143216 warnings in 2916.32s (0:48:36) =
TEST END: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py311_h3846359_311.conda
TEST START: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py39_hdffab68_311.conda
+ python -m pytest -n 2 test/test_autograd.py test/test_autograd_fallback.py test/test_custom_ops.py test/test_linalg.py test/test_mkldnn.py test/test_modules.py test/test_nn.py test/test_torch.py test/test_xnnpack_integration.py -k 'not ((TestTorch and test_print) or (TestAutograd and test_profiler_seq_nr) or (TestAutograd and test_profiler_propagation) or test_mutable_custom_op_fixed_layout or test_BCELoss_weights_no_reduce_cuda or test_ctc_loss_cudnn_tensor_cuda  or (TestTorch and test_index_add_correctness) or test_sdpa_inference_mode_aot_compile or (TestNN and test_grid_sample) or test_indirect_device_assert or (GPUTests and test_scatter_reduce2) or (TestLinalgCPU and test_inverse_errors_large_cpu) or test_base_does_not_require_grad_mode_nothing or test_base_does_not_require_grad_mode_warn or test_composite_registered_to_cpu_mode_nothing)' -m 'not hypothesis' --durations=50
== 7552 passed, 1375 skipped, 31 xfailed, 75701 warnings in 458.74s (0:07:38) ==
TEST END: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py39_hdffab68_311.conda
TEST START: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py313_h33c0e77_311.conda
+ python -m pytest -n 2 test/test_autograd.py test/test_autograd_fallback.py test/test_custom_ops.py test/test_linalg.py test/test_mkldnn.py test/test_modules.py test/test_nn.py test/test_torch.py test/test_xnnpack_integration.py -k 'not ((TestTorch and test_print) or (TestCustomOp and test_data_dependent_compile) or (TestCustomOp and test_functionalize_error) or (TestCustomOpAPI and test_compile) or (TestCustomOpAPI and test_fake) or test_compile_int4_mm or test_compile_int8_mm or (TestAutograd and test_profiler_seq_nr) or (TestAutograd and test_profiler_propagation) or test_mutable_custom_op_fixed_layout or test_BCELoss_weights_no_reduce_cuda or test_ctc_loss_cudnn_tensor_cuda  or (TestTorch and test_index_add_correctness) or test_sdpa_inference_mode_aot_compile or (TestNN and test_grid_sample) or test_indirect_device_assert or (GPUTests and test_scatter_reduce2) or (TestLinalgCPU and test_inverse_errors_large_cpu) or test_base_does_not_require_grad_mode_nothing or test_base_does_not_require_grad_mode_warn or test_composite_registered_to_cpu_mode_nothing)' -m 'not hypothesis' --durations=50
== 7532 passed, 1375 skipped, 31 xfailed, 75718 warnings in 455.21s (0:07:35) ==
TEST END: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py313_h33c0e77_311.conda
TEST START: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py310_hca309f4_311.conda
+ python -m pytest -n 2 test/test_autograd.py test/test_autograd_fallback.py test/test_custom_ops.py test/test_linalg.py test/test_mkldnn.py test/test_modules.py test/test_nn.py test/test_torch.py test/test_xnnpack_integration.py -k 'not ((TestTorch and test_print) or (TestAutograd and test_profiler_seq_nr) or (TestAutograd and test_profiler_propagation) or test_mutable_custom_op_fixed_layout or test_BCELoss_weights_no_reduce_cuda or test_ctc_loss_cudnn_tensor_cuda  or (TestTorch and test_index_add_correctness) or test_sdpa_inference_mode_aot_compile or (TestNN and test_grid_sample) or test_indirect_device_assert or (GPUTests and test_scatter_reduce2) or (TestLinalgCPU and test_inverse_errors_large_cpu) or test_base_does_not_require_grad_mode_nothing or test_base_does_not_require_grad_mode_warn or test_composite_registered_to_cpu_mode_nothing)' -m 'not hypothesis' --durations=50
== 7552 passed, 1375 skipped, 31 xfailed, 75701 warnings in 459.08s (0:07:39) ==
TEST END: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py310_hca309f4_311.conda
TEST START: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py312_hdbe889e_311.conda
+ python -m pytest -n 2 test/test_autograd.py test/test_autograd_fallback.py test/test_custom_ops.py test/test_linalg.py test/test_mkldnn.py test/test_modules.py test/test_nn.py test/test_torch.py test/test_xnnpack_integration.py test/inductor/test_torchinductor.py -k 'not ((TestTorch and test_print) or (TestAutograd and test_profiler_seq_nr) or (TestAutograd and test_profiler_propagation) or test_mutable_custom_op_fixed_layout or test_BCELoss_weights_no_reduce_cuda or test_ctc_loss_cudnn_tensor_cuda  or (TestTorch and test_index_add_correctness) or test_sdpa_inference_mode_aot_compile or (TestNN and test_grid_sample) or test_indirect_device_assert or (GPUTests and test_scatter_reduce2) or (TestLinalgCPU and test_inverse_errors_large_cpu) or test_base_does_not_require_grad_mode_nothing or test_base_does_not_require_grad_mode_warn or test_composite_registered_to_cpu_mode_nothing)' -m 'not hypothesis' --durations=50
= 8196 passed, 1429 skipped, 31 xfailed, 76339 warnings in 2177.80s (0:36:17) ==
TEST END: /home/conda/feedstock_root/build_artifacts/linux-64/pytorch-2.5.1-cuda126_mkl_py312_hdbe889e_311.conda

The set of modules and skips is exactly the same as on 3.9 or 3.10, so I don't know what would explain this difference in test collection. Perhaps there are some tests upstream that are only run for 3.11? 🤔
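One way to narrow this down would be to diff the collected test IDs between two interpreter versions using pytest's standard --collect-only flag; a sketch (the file names and the shortened module list are illustrative):

    # list test IDs without running anything, once per interpreter
    python -m pytest --collect-only -q test/test_nn.py test/test_torch.py > collected_py310.txt
    # ...repeat the same command in the py3.11 environment...
    diff collected_py310.txt collected_py311.txt | head -50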

@h-vetinari
Member Author

OK, three extra test failures for the generic blas run (not even from the torchinductor module, so I have no idea why this triggers), but I'm skipping them now and merging this. The CMake stuff needed at least another round, and I want to get the first round of fixes out (esp. a working win+CUDA build with the right bin/lib split). The rest will hopefully follow soon after in #339.

@h-vetinari h-vetinari marked this pull request as ready for review February 1, 2025 10:56
h-vetinari added a commit that referenced this pull request Feb 1, 2025
@h-vetinari h-vetinari merged commit d04bba8 into conda-forge:main Feb 1, 2025
25 of 27 checks passed
@h-vetinari h-vetinari deleted the fix_gpu branch February 1, 2025 10:57
@h-vetinari
Member Author

Ah, the pleasure of flaky tests. After passing here, the MKL build failed on main with:

=========================== short test summary info ============================
FAILED [0.2937s] test/test_autograd.py::TestAutogradDeviceTypeCUDA::test_reentrant_parent_error_on_cpu_cuda - AssertionError: "Simulate error" does not match "grad can be implicitly created only for scalar outputs"

To execute this test, run the following from the base repo dir:
    python test/test_autograd.py TestAutogradDeviceTypeCUDA.test_reentrant_parent_error_on_cpu_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
= 1 failed, 14514 passed, 2663 skipped, 91 xfailed, 143956 warnings in 4192.22s (1:09:52) =

Also, each test run now collected 13k+ tests and took ~50min. Very weird how that happens.
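If this flake keeps recurring, one option would be to add it to the existing -k deselection. A sketch shown in isolation (the real invocation would extend the long filter in the logs above):

    # sketch: deselect the flaky reentrant-autograd test; the bare name also
    # matches the device-suffixed variant test_reentrant_parent_error_on_cpu_cuda
    python -m pytest -n 2 test/test_autograd.py \
        -k 'not test_reentrant_parent_error_on_cpu' \
        -m 'not hypothesis' --durations=50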

@hmaarrfk
Contributor

hmaarrfk commented Feb 2, 2025

> Also, each test run now collected 13k+ tests and took ~50min. Very weird how that happens.

Does this mean that, with 3.9, 3.10, 3.11, 3.12 and 3.13, the tests take an extra ~250 mins to run on each platform?

@h-vetinari
Member Author

h-vetinari commented Feb 2, 2025

Well, it's supposed to take <10 min per version, plus one run (for 3.12 only) that includes torchinductor and takes a bit longer. And that is exactly what happened for the openblas build on main, for example, but it seems to change randomly. I opened an issue about this: #343

@hmaarrfk
Contributor

hmaarrfk commented Feb 2, 2025

Ah OK, so ~50 mins in total out of the ~10 hours of the build. Got it. I was going to ask how long it takes!

Thanks!!!

@h-vetinari
Member Author

The normal case should be about 1:10h (4×8 min for the runs without inductor + ~40 min for the py3.12 run with it), but when we hit #343, it can suddenly take much longer (worst case almost 5h).
