WIP: add custom cuda kernel for GPU accumulation into output image#27
McHaillet wants to merge 1 commit into teamtomo:main
Conversation
Hahah very nice, love to see it and very cool that you got it working! I'm a bit surprised there isn't more of a speedup; I guess the real wins would come from not generating/manipulating the index arrays and doing it all in the kernel 🥲
Questions I have about adding compiled stuff into teamtomo packages
Yea I know... I think there are multiple reasons:
haha, no clue
Easiest is probably JIT compilation with caching on the machine (I believe this already does that). The first run of the function will require some compilation time, but subsequent runs can just use the cache.
Heh, right - there's some hidden complexity with the JIT too: compilation is only done for a specific shape, which leads to fighting the JIT overhead if you use it in code paths with dynamic shapes. Not discounting anything, just things to keep in mind :-)
Some answers:
I will make some measurements of the pure kernel as well to get a better sense of the performance without the boilerplate.
All useful info, thanks! I haven't thought too carefully, but my intuition is that if we're going to write a kernel, it might as well be for the whole backprojection rather than just this step...

@alisterburt I couldn't help myself - this is my naive implementation of a custom CUDA kernel for value accumulation (a replacement for index_put with accumulate=True).
It's a naive implementation that assumes idx_c, idx_z, etc... are all 1-dimensional tensors rather than (b, c, z, y, x)-shaped. I realised later that they are in fact (b, c, z, y, x)-shaped, which will probably make the indexing in the kernels a bit more complicated as you would have to deal with broadcasting.
For now I'm not gonna keep fixating on finishing this, but feel free to continue it (or anyone else).
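The semantics such a kernel has to reproduce can be sketched in plain Python (a hypothetical `scatter_accumulate` helper, not the code from this PR): values at duplicate indices must add up rather than overwrite, which is exactly why the GPU version needs atomicAdd when multiple threads collide on the same output voxel. Flattening 3-D indices to linear offsets is one way a kernel typically addresses the output buffer.

```python
# Hypothetical pure-Python sketch of the accumulation semantics that
# index_put with accumulate=True provides: values landing on the same
# index must sum, not overwrite. A CUDA kernel reproduces the `+=`
# with atomicAdd, since threads may hit the same output voxel.
def scatter_accumulate(out_flat, shape_zyx, idx_z, idx_y, idx_x, values):
    d, h, w = shape_zyx
    for z, y, x, v in zip(idx_z, idx_y, idx_x, values):
        # flatten the 3-D index to a linear offset into the output
        out_flat[z * h * w + y * w + x] += v
    return out_flat
```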
When using it in torch-fourier-slice with this exact code:
I now get these timings, which are a speed-up over what I reported before in teamtomo/torch-fourier-slice#27:
Projected time on cuda:0: 0.53 sec.
Projected time on cpu: 1.93 sec. (multithreaded CPU)