
Conversation

Copilot AI (Contributor) commented Jan 6, 2026

  • Add <cooperative_groups/reduce.h> header to both files
  • Optimize NeuronProcessor::processCell using cg::tiled_partition<MAX_CHANNELS> and cg::reduce for efficient warp-level reduction (a standalone sketch of this pattern follows the list)
  • Simplify NeuronProcessor to use tile.thread_rank() for column index instead of modulo operation
  • Optimize CellProcessor::calcFluidForces_reconnectCells_correctOverlap using warp-level reduction (block-level reduction not supported by cg::reduce on thread_block)
  • Build and verify changes compile correctly
  • Run tests to verify functionality is preserved
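
For reference, a minimal standalone sketch of the tile-reduction pattern these items refer to (a toy kernel, not the project code; it assumes MAX_CHANNELS = 8 and a block size that is a multiple of 8):

```cpp
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>

namespace cg = cooperative_groups;

constexpr int MAX_CHANNELS = 8;  // assumption: 8 channels per tile, as in the PR

// Toy kernel: each MAX_CHANNELS-thread tile sums its inputs via cg::reduce.
__global__ void tileSumKernel(float const* input, float* output)
{
    auto block = cg::this_thread_block();
    auto tile = cg::tiled_partition<MAX_CHANNELS>(block);

    float myValue = input[blockIdx.x * blockDim.x + block.thread_rank()];

    // Hardware-accelerated reduction across the tile, no shared-memory atomics
    float sum = cg::reduce(tile, myValue, cg::plus<float>());

    // One write per tile; the tile index is block.thread_rank() / MAX_CHANNELS
    if (tile.thread_rank() == 0) {
        output[blockIdx.x * (blockDim.x / MAX_CHANNELS) + block.thread_rank() / MAX_CHANNELS] = sum;
    }
}
```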
Original prompt

Optimize NeuronProcessor: use Cooperative Groups for flexible thread synchronization



Copilot AI changed the title from "[WIP] Optimize NeuronProcessor using Cooperative Groups for synchronization" to "Optimize NeuronProcessor and CellProcessor using CUDA Cooperative Groups" on Jan 6, 2026
Copilot AI requested a review from chrxh January 6, 2026 13:54
chrxh (Owner) commented Jan 6, 2026

@copilot Optimize NeuronProcessor like this:

```cpp
__device__ __inline__ void NeuronProcessor::processCell(SimulationData& data,
    SimulationStatistics& statistics,
    Cell* cell)
{
    auto block = cg::this_thread_block();
    auto tile = cg::tiled_partition<MAX_CHANNELS>(block);

    // Use tile-level reduction for sumInput
    float myWeight = cell->neuralNetwork->weights[threadIdx.x];
    float myInput = myWeight * signal.channels[threadIdx.x % MAX_CHANNELS];

    // Efficient warp-level reduction
    float sum = cg::reduce(tile, myInput, cg::plus<float>());

    if (tile.thread_rank() == 0) {
        sumInput[threadIdx.x / MAX_CHANNELS] = sum + biases[threadIdx.x / MAX_CHANNELS];
    }
    block.sync();

    [...]
```

Also try to optimize calcFluidForces_reconnectCells_correctOverlap

Copilot AI (Contributor, Author) commented Jan 6, 2026

> @copilot Optimize NeuronProcessor like this:
> __device__ __inline__ void NeuronProcessor::processCell(SimulationData& data,
> SimulationStatisti...

Implemented the optimizations in commit a7361d0:

NeuronProcessor::processCell:

  • Uses cg::tiled_partition<MAX_CHANNELS> (8-thread tiles) to partition the 64-thread block
  • Each tile handles one row (output channel) of the weight matrix
  • Replaced atomicAdd_block with cg::reduce(tile, myInput, cg::plus<float>()) for efficient warp-level reduction

CellProcessor::calcFluidForces_reconnectCells_correctOverlap:

  • Uses cg::tiled_partition<32> for warp-level operations
  • Per-thread local accumulators replace all per-cell atomicAdd_block calls
  • Warp-level reduction via cg::reduce before writing to shared memory
  • Only warp leaders (1 per 32 threads) write the reduced sums, minimizing atomic contention (a sketch of this pattern follows below)
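
A minimal sketch of that accumulation pattern (hypothetical toy kernel; the real kernel accumulates several force components per cell):

```cpp
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>

namespace cg = cooperative_groups;

// Toy kernel: per-thread partial forces are reduced warp-wide with cg::reduce,
// then only warp leaders issue a block-scoped atomic on the shared accumulator.
__global__ void accumulateForce(float const* partials, float* blockForce)
{
    auto block = cg::this_thread_block();
    auto warp = cg::tiled_partition<32>(block);

    __shared__ float F_pressure_x;
    if (block.thread_rank() == 0) {
        F_pressure_x = 0;
    }
    block.sync();

    float myF = partials[blockIdx.x * blockDim.x + block.thread_rank()];

    // Reduce within the warp first ...
    float warpSum = cg::reduce(warp, myF, cg::plus<float>());

    // ... then 1 atomic per 32 threads instead of 1 per thread
    if (warp.thread_rank() == 0) {
        atomicAdd_block(&F_pressure_x, warpSum);
    }
    block.sync();

    if (block.thread_rank() == 0) {
        blockForce[blockIdx.x] = F_pressure_x;
    }
}
```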

```cpp
// Each thread computes one weight * input product
auto row = block.thread_rank() / MAX_CHANNELS;
auto col = block.thread_rank() % MAX_CHANNELS;
atomicAdd_block(&sumInput[row], neuronsState->weights[threadIdx.x] * signal.channels[col]);
```

chrxh (Owner) commented:

@copilot check if this is tile.thread_rank(). Could the calculation of row also be simplified?

Copilot AI (Contributor, Author) commented:

Fixed in e02f930. Changed to use tile.thread_rank() directly for the column index instead of block.thread_rank() % MAX_CHANNELS. The row calculation is kept as block.thread_rank() / MAX_CHANNELS since it represents the tile index.
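
As an illustrative fragment (names as in the snippets above), the index computation now reads:

```cpp
auto block = cg::this_thread_block();
auto tile = cg::tiled_partition<MAX_CHANNELS>(block);

auto row = block.thread_rank() / MAX_CHANNELS;  // tile index = output channel (row)
auto col = tile.thread_rank();                  // rank within the tile = column,
                                                // equal to block.thread_rank() % MAX_CHANNELS
```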

```cpp
__inline__ __device__ void CellProcessor::calcFluidForces_reconnectCells_correctOverlap(SimulationData& data)
{
    auto block = cg::this_thread_block();
    auto warp = cg::tiled_partition<32>(block);
```

chrxh (Owner) commented:

@copilot choose a tile size equal to the thread block size. Then, if there is only one tile per block, the calculations below can be simplified: for instance, the second accumulation over all tiles (e.g. atomicAdd_block(&F_pressure.x, sumF_pressure_x);) is no longer necessary.

Copilot AI (Contributor, Author) commented:

Unfortunately, cg::reduce only supports tiled partitions (with compile-time power-of-2 sizes), not the full thread_block directly. Since the fluid kernel's block size is dynamic (scanRectLength * scanRectLength), we cannot use a single tile covering the entire block. The warp-level reduction with atomic accumulation across warps is the cleanest approach available with the cooperative groups API. In e02f930 I cleaned up the code by moving shared memory initialization to the beginning.
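
A sketch of the constraint (the commented-out line shows the unsupported call; everything else compiles as written, and output is assumed zero-initialized by the caller):

```cpp
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>

namespace cg = cooperative_groups;

__global__ void reduceConstraintDemo(float const* input, float* output)
{
    auto block = cg::this_thread_block();

    // OK: the tile size is a compile-time constant, so cg::reduce has an overload for it
    auto warp = cg::tiled_partition<32>(block);
    float warpSum = cg::reduce(warp, input[blockIdx.x * blockDim.x + block.thread_rank()], cg::plus<float>());

    // Not available: no cg::reduce overload takes a runtime-sized thread_block,
    // which is what a dynamic block size (scanRectLength * scanRectLength) would need.
    // float blockSum = cg::reduce(block, input[...], cg::plus<float>());

    // Hence: one atomic per warp to combine the warp sums across the block
    if (warp.thread_rank() == 0) {
        atomicAdd(&output[blockIdx.x], warpSum);
    }
}
```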

Copilot AI requested a review from chrxh January 6, 2026 16:41
@chrxh chrxh marked this pull request as ready for review January 6, 2026 21:01
@chrxh chrxh merged commit 794b8aa into develop Jan 6, 2026
1 check passed
@chrxh chrxh deleted the copilot/optimize-neuron-processor branch January 6, 2026 21:01