[amd] fix out of resources issue related to turning on async_copy#808
Conversation
njriasan
left a comment
There was a problem hiding this comment.
One minor question regarding if we should check the actual Triton knob, but otherwise this looks good to me.
|
Can you help check the CI failure? It seems addmm, gemm, and swiglu are still failing. |
The problem in these three tests seem not related to out of shared memory, but it is related to async copy. They are the same problem and a little weird. Every time I re-compile triton, these errors can be reproduced, and I can see gpucore.**** dumps. Seems like it is a memory access violation problem. On the other hand, If I do NOT set async_copy, it can pass. Once it is passed, when I reran with async_copy enabled, it can still pass until I recompile triton. Still looking into the issue. will update you later. |
Here is the error message The GPU memory access violation happens after the kernel execution. seems like related to torch compile. Do you have any idea what it is wrong here? |
|
For the gemm_test, here is the output error: We can see the kernel name is |
|
Hi @scxiao, I believe |
some update: |
|
Thanks for looking into it, @scxiao ! cc @karthickai @njriasan , Triton AMD async copy update does not play well with torchinductor, should we consider adding |
We are working on that now and almost done for that. The problem is in the llvm backend side and the fixed has been landed. BTW I would expect adding |
|
Hi @xuzhao9, with this PR: triton-lang/triton#9431 merged, all failed tests (memory access violation) will pass. Let us wait this one landed, then we can resume this PR. Thanks |
|
FYI the PR has landed |
5ade1e0 to
fe0401a
Compare
|
Nvm, just realized that we need to update the testing Docker image. I am working on that. |
|
Hi @xuzhao9, is the CI build uses the latest upstream triton? If not, how can we set to that? Thanks |
|
@scxiao it updates upstream triton every week, but we can manually kick off run to update that. |
|
The Docker has updated and this PR has fixed all test failures. Thanks @scxiao a lot for your contribution! |
Turning on async copy will need more lds usage, so some configurations may cause lds out of resource. Specifically, lds usage increases from (num_stages - 1) * usage_per_iter to num_stages * usage_per_iter. This PR fixes that by decreasing the num_stages by 1 so avoid the out of resources.