Fix gpu builds #340
Conversation
Hi! This is the friendly automated conda-forge-linting service. I just wanted to let you know that I linted all conda-recipes in your PR. I do have some suggestions for making it better though... For recipe/meta.yaml:
This message was generated by GitHub Actions workflow run https://github.com/conda-forge/conda-forge-webservices/actions/runs/13080209035. Examine the logs at this URL for more detail.
By adding these tests, the pytorch test suite run on windows went from [screenshot: previous duration] to [screenshot: new duration]. I don't have timings for linux yet, but assume it will also increase (though perhaps less so). In any case, I think I'll constrain these torchinductor tests to run for only one python version.
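A minimal sketch of how such a per-version constraint could look in recipe/meta.yaml, using conda-build's preprocessing selectors; the test paths are placeholders rather than the feedstock's actual commands, and 3.12 is just the version later mentioned as the one running the inductor tests:

```yaml
test:
  requires:
    - pytest
  commands:
    # cheap smoke tests run for every python version
    - python -m pytest -q test/test_torch.py
    # the expensive torchinductor suite only runs for a single python version
    - python -m pytest -q test/inductor  # [py==312]
```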
Would it help if I built the linux-gpu stuff?
I guess what I'm asking is: are you ready for me to start some jobs locally?
Thanks for the kind offer! The server has been working well lately, but right now the resources are occupied by conda-forge/libmagma-feedstock#24. The torchinductor tests have really been a PITA to fix, but I feel we're finally close now.
The whole situation is complicated by the fact that we need some fixes for the CMake metadata as well to make it usable at all (i.e. compiling against CUDA-enabled libtorch on a CPU agent, which is the situation we're in for all the dependent packages compiling against pytorch). I have a hack for that +/- ready in #339, but I want to separate these concerns, because it's always possible that I still overlooked something.
In my ideal scenario, the CI for this PR and #339 could run (CUDA-only; no aarch) essentially at full speed, if I had the entire server available (hence my request to cancel the builds in conda-forge/libmagma-feedstock#24 for now). Then I could decide at the end of my day (~14h) which one to merge based on whether the CMake stuff is passing or not.
So it depends on how much capacity you have. You could try a build for one of the CUDA variants in #339 and tell me if it passes. If so, we could use the artefact from that for publication right away.
Update: now running at full steam, so this should be good for now, and we should have at least one passing PR in ~12h that we can merge. If you do have spare cycles for building things locally, I think the much bigger impact would be if we could unblock tensorflow... 😅
So finally, the MKL+CUDA build passed. There's something strange with the tests though: the run for py311 collects a whole bunch more tests and takes longer than elsewhere (even longer than py312, which is the only version where we run the inductor tests). The set of modules and skips is exactly the same as on 3.9 or 3.10, so I don't know what would explain this difference in test collection. Perhaps there are some tests upstream that are only run for 3.11? 🤔
OK, three extra test failures for the generic blas run (not even from the torchinductor module, so no idea why this triggers), but I'm skipping them now and merging this. The CMake stuff needed at least another round, and I want to get the first round of fixes out (esp. a working win+CUDA build with the right bin/lib split). The rest will hopefully follow soon after in #339.
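For illustration, skipping individual flaky tests from the recipe typically means deselecting them by pytest node ID in the test commands; a sketch with placeholder test names (the three actual failures are not named above):

```yaml
test:
  commands:
    # deselect known-flaky tests by their pytest node IDs; the names below are placeholders
    - python -m pytest -q test/test_ops.py --deselect "test/test_ops.py::TestCommon::test_flaky_a" --deselect "test/test_ops.py::TestCommon::test_flaky_b"
```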
Ah, the pleasure of flaky tests. After passing here, the MKL build failed on main with:
Also, each test run now collected 13k+ tests and took ~50min. Very weird how that happens.
Does this mean that with 3.9, 3.10, 3.11, 3.12 and 3.13, it takes an extra ~250 minutes for the tests to run on each platform?
Well, it's supposed to take <10 min per version, plus one run (for 3.12 only) that includes torchinductor and takes a bit longer. And this is exactly what happened for the openblas build on main, for example, but it seems to change randomly. I opened an issue about this: #343
Ah ok, so ~50 mins total out of the ~10 hours of the build. Got it. I was going to ask how long it takes! Thanks!!!
The normal case should be about 1:10h (4x8min + 40min), but when we hit #343, it can suddenly take much longer (worst case almost 5h).
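Spelled out, assuming five Python versions per platform and the ~50 min pathological runs mentioned above (rough estimates, not exact CI timings):

Normal case: $4 \times 8\,\text{min} + 40\,\text{min} = 72\,\text{min}$ (the "about 1:10h" above).

When #343 hits: $5 \times 50\,\text{min} = 250\,\text{min} \approx 4{:}10\,\text{h}$, plus overhead approaching the ~5h worst case.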
Fall-out from #318, especially the torchinductor tests.