Optimize NeuronProcessor and CellProcessor using CUDA Cooperative Groups #346
Conversation
@copilot Optimize NeuronProcessor like this: [...] Also try to optimize calcFluidForces_reconnectCells_correctOverlap.
Implemented the optimizations in commit a7361d0, covering NeuronProcessor::processCell and CellProcessor::calcFluidForces_reconnectCells_correctOverlap.
    // Each thread computes one weight * input product
    auto row = block.thread_rank() / MAX_CHANNELS;
    auto col = block.thread_rank() % MAX_CHANNELS;
    atomicAdd_block(&sumInput[row], neuronsState->weights[threadIdx.x] * signal.channels[col]);
@copilot check if this is tile.thread_rank(). Could the calculation of row also be simplified?
Fixed in e02f930. Changed to use tile.thread_rank() directly for the column index instead of block.thread_rank() % MAX_CHANNELS. The row calculation is kept as block.thread_rank() / MAX_CHANNELS since it represents the tile index.
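For illustration, a minimal sketch of the resulting pattern is shown below. It assumes MAX_CHANNELS is a power of two no larger than 32 (required for a statically sized tiled_partition usable with cg::reduce); the function name processCellSketch, the row-major weight layout, and the argument list are placeholders inferred from the snippet quoted above, not the repository's actual code.

    #include <cooperative_groups.h>
    #include <cooperative_groups/reduce.h>

    namespace cg = cooperative_groups;

    constexpr int MAX_CHANNELS = 8;  // assumption: power of two, <= 32

    // Sketch of the tile-based matrix-vector accumulation discussed above.
    __device__ void processCellSketch(
        float const* weights,   // MAX_CHANNELS * MAX_CHANNELS values, row-major (assumed layout)
        float const* channels,  // input signal, MAX_CHANNELS entries
        float* sumInput)        // per-row sums, e.g. in shared memory
    {
        auto block = cg::this_thread_block();
        auto tile = cg::tiled_partition<MAX_CHANNELS>(block);

        auto row = block.thread_rank() / MAX_CHANNELS;  // tile index == output row
        auto col = tile.thread_rank();                  // lane within the tile == column

        // Each thread computes one weight * input product ...
        auto product = weights[row * MAX_CHANNELS + col] * channels[col];

        // ... and a tile-wide reduction replaces the per-thread atomicAdd_block calls.
        auto rowSum = cg::reduce(tile, product, cg::plus<float>());
        if (tile.thread_rank() == 0) {
            sumInput[row] = rowSum;
        }
    }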
    __inline__ __device__ void CellProcessor::calcFluidForces_reconnectCells_correctOverlap(SimulationData& data)
    {
        auto block = cg::this_thread_block();
        auto warp = cg::tiled_partition<32>(block);
@copilot choose a tile size equal to the thread block size. Then, if there is only one tile per block, the calculations below can be simplified; for instance, the second accumulation over all tiles (e.g. atomicAdd_block(&F_pressure.x, sumF_pressure_x);) would no longer be necessary.
Unfortunately, cg::reduce only supports tiled partitions (with compile-time power-of-2 sizes), not the full thread_block directly. Since the fluid kernel's block size is dynamic (scanRectLength * scanRectLength), we cannot use a single tile covering the entire block. The warp-level reduction with atomic accumulation across warps is the cleanest approach available with the cooperative groups API. In e02f930 I cleaned up the code by moving shared memory initialization to the beginning.
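A sketch of the warp-level approach described here follows. The names F_pressure and sumF_pressure_x come from the discussion above; the function name accumulatePressureForceSketch, the partialFx/partialFy arguments, and the float2 accumulator type are assumptions for illustration, not the code actually committed.

    #include <cooperative_groups.h>
    #include <cooperative_groups/reduce.h>

    namespace cg = cooperative_groups;

    // Sketch: per-warp reduction of partial pressure forces, then one atomic per warp.
    __device__ void accumulatePressureForceSketch(float partialFx, float partialFy, float2* F_pressure)
    {
        auto block = cg::this_thread_block();
        auto warp = cg::tiled_partition<32>(block);

        // Reduce the per-thread contributions within the warp.
        auto sumF_pressure_x = cg::reduce(warp, partialFx, cg::plus<float>());
        auto sumF_pressure_y = cg::reduce(warp, partialFy, cg::plus<float>());

        // Only one thread per warp touches the block-shared accumulator,
        // so the number of atomics per block drops by a factor of 32.
        if (warp.thread_rank() == 0) {
            atomicAdd_block(&F_pressure->x, sumF_pressure_x);
            atomicAdd_block(&F_pressure->y, sumF_pressure_y);
        }
    }

Since the block size is dynamic, the warp-sized tile keeps cg::reduce usable while the single atomicAdd_block per warp bounds the cross-tile accumulation cost.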
Changes:
- Added the <cooperative_groups/reduce.h> header to both files
- Used cg::tiled_partition<MAX_CHANNELS> and cg::reduce for efficient warp-level reduction
- Used tile.thread_rank() for the column index instead of a modulo operation