
Conversation

Copilot AI (Contributor) commented Jan 6, 2026

  • Add <cooperative_groups/reduce.h> header to both files
  • Optimize NeuronProcessor::processCell using cg::tiled_partition<MAX_CHANNELS> and cg::reduce for efficient warp-level reduction (a standalone sketch of this pattern follows the list)
  • Simplify NeuronProcessor to use tile.thread_rank() for column index instead of modulo operation
  • Optimize CellProcessor::calcFluidForces_reconnectCells_correctOverlap using warp-level reduction (block-level reduction not supported by cg::reduce on thread_block)
  • Build and verify changes compile correctly
  • Run tests to verify functionality is preserved
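
For reference, a minimal standalone sketch of the tile-reduction pattern these items refer to (a toy kernel, not the project code; it assumes MAX_CHANNELS = 8 and a block size that is a multiple of 8):

```cpp
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>

namespace cg = cooperative_groups;

constexpr int MAX_CHANNELS = 8;  // assumption: 8 channels per tile, as in the PR

// Toy kernel: each MAX_CHANNELS-thread tile sums its inputs via cg::reduce.
__global__ void tileSumKernel(float const* input, float* output)
{
    auto block = cg::this_thread_block();
    auto tile = cg::tiled_partition<MAX_CHANNELS>(block);

    float myValue = input[blockIdx.x * blockDim.x + block.thread_rank()];

    // Hardware-accelerated reduction across the tile, no shared-memory atomics
    float sum = cg::reduce(tile, myValue, cg::plus<float>());

    // One write per tile; the tile index is block.thread_rank() / MAX_CHANNELS
    if (tile.thread_rank() == 0) {
        output[blockIdx.x * (blockDim.x / MAX_CHANNELS) + block.thread_rank() / MAX_CHANNELS] = sum;
    }
}
```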
Original prompt

Optimize NeuronProcessor: use Cooperative Groups for flexible thread synchronization



Copilot AI changed the title from "[WIP] Optimize NeuronProcessor using Cooperative Groups for synchronization" to "Optimize NeuronProcessor and CellProcessor using CUDA Cooperative Groups" on Jan 6, 2026
Copilot AI requested a review from chrxh January 6, 2026 13:54
chrxh (Owner) commented Jan 6, 2026

@copilot Optimize NeuronProcessor like this:

```cpp
__device__ __inline__ void NeuronProcessor::processCell(SimulationData& data,
    SimulationStatistics& statistics,
    Cell* cell)
{
    auto block = cg::this_thread_block();
    auto tile = cg::tiled_partition<MAX_CHANNELS>(block);

    // Use tile-level reduction for sumInput
    float myWeight = cell->neuralNetwork->weights[threadIdx.x];
    float myInput = myWeight * signal.channels[threadIdx.x % MAX_CHANNELS];

    // Efficient warp-level reduction
    float sum = cg::reduce(tile, myInput, cg::plus<float>());

    if (tile.thread_rank() == 0) {
        sumInput[threadIdx.x / MAX_CHANNELS] = sum + biases[threadIdx.x / MAX_CHANNELS];
    }
    block.sync();

    [...]
```

Also try to optimize calcFluidForces_reconnectCells_correctOverlap

Copilot AI (Contributor, Author) commented Jan 6, 2026

> @copilot Optimize NeuronProcessor like this:
> __device__ __inline__ void NeuronProcessor::processCell(SimulationData& data,
> SimulationStatisti...

Implemented the optimizations in commit a7361d0:

NeuronProcessor::processCell:

  • Uses cg::tiled_partition<MAX_CHANNELS> (8-thread tiles) to partition the 64-thread block
  • Each tile handles one row (output channel) of the weight matrix
  • Replaced atomicAdd_block with cg::reduce(tile, myInput, cg::plus<float>()) for efficient warp-level reduction

CellProcessor::calcFluidForces_reconnectCells_correctOverlap:

  • Uses cg::tiled_partition<32> for warp-level operations
  • Per-thread local accumulators replace all per-cell atomicAdd_block calls
  • Warp-level reduction via cg::reduce before writing to shared memory
  • Only warp leaders (1 per 32 threads) write the reduced sums, minimizing atomic contention (a sketch of this pattern follows below)
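
A minimal sketch of that accumulation pattern (hypothetical toy kernel; the real kernel accumulates several force components per cell):

```cpp
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>

namespace cg = cooperative_groups;

// Toy kernel: per-thread partial forces are reduced warp-wide with cg::reduce,
// then only warp leaders issue a block-scoped atomic on the shared accumulator.
__global__ void accumulateForce(float const* partials, float* blockForce)
{
    auto block = cg::this_thread_block();
    auto warp = cg::tiled_partition<32>(block);

    __shared__ float F_pressure_x;
    if (block.thread_rank() == 0) {
        F_pressure_x = 0;
    }
    block.sync();

    float myF = partials[blockIdx.x * blockDim.x + block.thread_rank()];

    // Reduce within the warp first ...
    float warpSum = cg::reduce(warp, myF, cg::plus<float>());

    // ... then 1 atomic per 32 threads instead of 1 per thread
    if (warp.thread_rank() == 0) {
        atomicAdd_block(&F_pressure_x, warpSum);
    }
    block.sync();

    if (block.thread_rank() == 0) {
        blockForce[blockIdx.x] = F_pressure_x;
    }
}
```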

```cpp
// Each thread computes one weight * input product
auto row = block.thread_rank() / MAX_CHANNELS;
auto col = block.thread_rank() % MAX_CHANNELS;
atomicAdd_block(&sumInput[row], neuronsState->weights[threadIdx.x] * signal.channels[col]);
```

chrxh (Owner) commented:

@copilot check if this is tile.thread_rank(). Could the calculation of row also be simplified?

Copilot AI (Contributor, Author) commented:

Fixed in e02f930. Changed to use tile.thread_rank() directly for the column index instead of block.thread_rank() % MAX_CHANNELS. The row calculation is kept as block.thread_rank() / MAX_CHANNELS since it represents the tile index.
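
As an illustrative fragment (names as in the snippets above), the index computation now reads:

```cpp
auto block = cg::this_thread_block();
auto tile = cg::tiled_partition<MAX_CHANNELS>(block);

auto row = block.thread_rank() / MAX_CHANNELS;  // tile index = output channel (row)
auto col = tile.thread_rank();                  // rank within the tile = column,
                                                // equal to block.thread_rank() % MAX_CHANNELS
```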

```cpp
__inline__ __device__ void CellProcessor::calcFluidForces_reconnectCells_correctOverlap(SimulationData& data)
{
    auto block = cg::this_thread_block();
    auto warp = cg::tiled_partition<32>(block);
```

chrxh (Owner) commented:

@copilot choose a tile size equal to the thread block size. Then, if there is only one tile per block, the calculations below can be simplified: for instance, the second accumulation over all tiles (e.g. atomicAdd_block(&F_pressure.x, sumF_pressure_x);) is no longer necessary.

Copilot AI (Contributor, Author) commented:

Unfortunately, cg::reduce only supports tiled partitions (with compile-time power-of-2 sizes), not the full thread_block directly. Since the fluid kernel's block size is dynamic (scanRectLength * scanRectLength), we cannot use a single tile covering the entire block. The warp-level reduction with atomic accumulation across warps is the cleanest approach available with the cooperative groups API. In e02f930 I cleaned up the code by moving shared memory initialization to the beginning.
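
A sketch of the constraint (the commented-out line shows the unsupported call; everything else compiles as written, and output is assumed zero-initialized by the caller):

```cpp
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>

namespace cg = cooperative_groups;

__global__ void reduceConstraintDemo(float const* input, float* output)
{
    auto block = cg::this_thread_block();

    // OK: the tile size is a compile-time constant, so cg::reduce has an overload for it
    auto warp = cg::tiled_partition<32>(block);
    float warpSum = cg::reduce(warp, input[blockIdx.x * blockDim.x + block.thread_rank()], cg::plus<float>());

    // Not available: no cg::reduce overload takes a runtime-sized thread_block,
    // which is what a dynamic block size (scanRectLength * scanRectLength) would need.
    // float blockSum = cg::reduce(block, input[...], cg::plus<float>());

    // Hence: one atomic per warp to combine the warp sums across the block
    if (warp.thread_rank() == 0) {
        atomicAdd(&output[blockIdx.x], warpSum);
    }
}
```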

Copilot AI requested a review from chrxh January 6, 2026 16:41
@chrxh chrxh marked this pull request as ready for review January 6, 2026 21:01
@chrxh chrxh merged commit 794b8aa into develop Jan 6, 2026
1 check passed
@chrxh chrxh deleted the copilot/optimize-neuron-processor branch January 6, 2026 21:01