---
layout: default
title: Performance
parent: Documentation
nav_order: 5
---
Optimization strategies and performance tuning for the N-Body Particle Simulation System.
The repository now includes a dedicated `nbody_benchmarks` executable plus `./scripts/benchmark.sh` for non-interactive, headless benchmark runs:

```bash
./scripts/build.sh
./scripts/benchmark.sh
./scripts/benchmark.sh serialization.round_trip build/benchmark-results.json
```

With `-DNBODY_ENABLE_PROFILING=ON`, benchmark output also includes named phase timings such as `serialization.save`, `serialization.load`, `simulation.update`, or force-specific phases when those surfaces are compiled in.
Benchmark runs emit machine-readable JSON with:
- benchmark name
- force method
- particle count
- iteration count
- numeric metrics
- numeric tuning parameters
- optional phase timing samples
| Particles | Target FPS | Algorithm |
|---|---|---|
| 10,000 | 60+ | Direct N² |
| 100,000 | 60+ | Barnes-Hut |
| 1,000,000 | 30+ | Barnes-Hut / Spatial Hash |
Reference benchmark hardware:

| Component | Specification |
|---|---|
| GPU | NVIDIA RTX 3080 (10GB) |
| CUDA | 12.2 |
| CPU | AMD Ryzen 9 5900X |
| RAM | 32GB DDR4-3600 |
Measured frame rates (FPS):

| Particles | Direct N² | Barnes-Hut (θ=0.5) | Spatial Hash |
|---|---|---|---|
| 1,000 | 60+ | 60+ | 60+ |
| 10,000 | 60+ | 60+ | 60+ |
| 50,000 | ~30 | 60+ | 60+ |
| 100,000 | ~8 | 60+ | 60+ |
| 500,000 | <1 | ~45 | 60+ |
| 1,000,000 | N/A | ~25 | 60+ |
Approximate GPU memory usage:

| Particles | Particle Data | Barnes-Hut | Spatial Hash | Total |
|---|---|---|---|---|
| 100K | ~5 MB | ~10 MB | ~2 MB | ~17 MB |
| 1M | ~50 MB | ~100 MB | ~20 MB | ~170 MB |
| 10M | ~500 MB | ~1 GB | ~200 MB | ~1.7 GB |
Recommended block sizes by GPU architecture:
| Architecture | Series | Optimal Block Size |
|---|---|---|
| Ada Lovelace | RTX 40xx | 256 or 512 |
| Ampere | RTX 30xx | 256 |
| Turing | RTX 20xx | 256 |
| Volta | V100 | 256 or 512 |
```cpp
// Query optimal configuration
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int max_threads = prop.maxThreadsPerBlock;
int warp_size = prop.warpSize;
// Half the per-block thread limit, rounded down to a warp multiple
// (512 when maxThreadsPerBlock is 1024)
int optimal_block_size = (max_threads / warp_size / 2) * warp_size;
```

For the Direct N² kernel, shared memory reduces global memory traffic by ~50%:
```cpp
// Shared memory per block
size_t shared_size = block_size * 4 * sizeof(float); // pos_x, pos_y, pos_z, mass

// Check limits
if (shared_size > prop.sharedMemPerBlock) {
    // Reduce block size or disable shared memory tiling
}
```

Use the CUDA Occupancy API to find an optimal configuration:
```cpp
int min_grid_size, optimal_block_size;
cudaOccupancyMaxPotentialBlockSize(
    &min_grid_size, &optimal_block_size,
    computeForcesDirectKernel,
    0,  // dynamic shared memory
    0   // block size limit
);
```

Ensure coalesced memory access:
```cpp
// GOOD: Coalesced access
int i = blockIdx.x * blockDim.x + threadIdx.x;
float x = pos_x[i]; // Thread i accesses address i

// BAD: Strided access (uncoalesced)
float x = pos_x[i * stride]; // Threads access non-consecutive addresses
```

Minimize branch divergence within warps:
```cpp
// BAD: Warp divergence
if (threadIdx.x % 2 == 0) {
    // Path A (16 threads active)
} else {
    // Path B (16 threads active)
}

// GOOD: No divergence
float result = (threadIdx.x % 2 == 0) ? value_a : value_b;
```

Use Structure of Arrays (SoA) instead of Array of Structures (AoS):
```cpp
// BAD: AoS (poor for GPU)
struct Particle { float x, y, z, vx, vy, vz, mass; };
Particle particles[N];

// GOOD: SoA (coalesced access)
struct ParticleData {
    float* pos_x; float* pos_y; float* pos_z;
    float* vel_x; float* vel_y; float* vel_z;
    float* mass;
};
```

Avoid frequent allocations:
```cpp
#include <map>
#include <vector>
#include <cuda_runtime.h>

class MemoryPool {
public:
    void* allocate(size_t size) {
        // Reuse a previously freed block of the same size if available
        if (pools_.count(size) && !pools_[size].empty()) {
            void* ptr = pools_[size].back();
            pools_[size].pop_back();
            return ptr;
        }
        void* ptr;
        cudaMalloc(&ptr, size);
        return ptr;
    }

    void deallocate(void* ptr, size_t size) {
        // Return the block to the pool instead of calling cudaFree
        pools_[size].push_back(ptr);
    }

private:
    std::map<size_t, std::vector<void*>> pools_;
};
```

CUDA-OpenGL interop eliminates the CPU-GPU transfer:
```cpp
// Traditional (slow): round-trip through host memory
cudaMemcpy(h_positions, d_positions, size, cudaMemcpyDeviceToHost);
glBufferData(GL_ARRAY_BUFFER, size, h_positions, GL_DYNAMIC_DRAW);

// Zero-copy (fast): write directly into the mapped VBO
float* d_vbo = interop.mapPositionBuffer();
updatePositionsKernel<<<grid, block>>>(d_particles, d_vbo);
interop.unmapPositionBuffer();
// VBO ready for rendering
```

Balance accuracy vs. performance:
```cpp
// Adaptive theta based on FPS
void adaptTheta(float current_fps, float target_fps) {
    if (current_fps < target_fps * 0.9f) {
        theta = std::min(theta + 0.05f, 1.0f); // Faster, less accurate
    } else if (current_fps > target_fps * 1.1f) {
        theta = std::max(theta - 0.02f, 0.3f); // Slower, more accurate
    }
}
```

The optimal cell size equals the cutoff radius:
```cpp
float optimal_cell_size = cutoff_radius;

// For non-uniform distributions, slightly larger
float cell_size = cutoff_radius * 1.2f;
```

Estimate the maximum stable timestep:
```cpp
float estimateMaxTimeStep(const ParticleData* d_particles, float softening) {
    // Find maximum velocity and acceleration
    float v_max = computeMaxVelocity(d_particles);
    float a_max = computeMaxAcceleration(d_particles);

    // Stability conditions
    float dt_pos = softening / v_max;            // Position change limit
    float dt_vel = std::sqrt(softening / a_max); // Velocity change limit

    // Use safety factor
    return std::min(dt_pos, dt_vel) * 0.5f;
}
```

System-level performance analysis with Nsight Systems:
```bash
nsys profile --stats=true -o report ./nbody_sim 100000
nsys-ui report.nsys-rep  # Open in GUI
```

Key metrics:
- GPU utilization
- CPU-GPU synchronization overhead
- Kernel execution time
- Memory transfer time
Kernel-level detailed analysis with Nsight Compute:

```bash
ncu --set full -o report ./nbody_sim 100000   # ncu appends .ncu-rep
ncu-ui report.ncu-rep  # Open in GUI
```

Key metrics:
- Occupancy (%)
- Memory throughput
- Compute throughput
- Instruction mix
- Warp stall reasons
A CUDA-event-based frame timer for in-application measurement:

```cpp
#include <numeric>
#include <vector>
#include <cuda_runtime.h>

class PerformanceCounter {
public:
    PerformanceCounter() {
        cudaEventCreate(&start_);
        cudaEventCreate(&stop_);
    }

    ~PerformanceCounter() {
        cudaEventDestroy(start_);
        cudaEventDestroy(stop_);
    }

    void start() {
        cudaEventRecord(start_);
    }

    void stop() {
        cudaEventRecord(stop_);
        cudaEventSynchronize(stop_);
        float ms;
        cudaEventElapsedTime(&ms, start_, stop_);
        times_.push_back(ms);
    }

    float getAverageFPS() const {
        float avg = std::accumulate(times_.begin(), times_.end(), 0.0f)
                    / times_.size();
        return 1000.0f / avg;
    }

private:
    cudaEvent_t start_, stop_;
    std::vector<float> times_;
};
```

Symptoms: GPU utilization < 50%
Solutions:
- Increase block size
- Reduce synchronization points
- Use multiple CUDA streams
- Check for CPU bottlenecks
Symptoms: Low compute throughput, high memory throughput
Solutions:
- Use shared memory tiling
- Ensure coalesced access
- Consider algorithm change
- Reduce precision (float16)
Symptoms: Sudden performance drop (often register spilling to local memory)

Check with: `ncu --metrics launch_stats`

Solutions:
- Reduce local variables
- Use `__launch_bounds__`:

```cpp
__global__ void __launch_bounds__(256, 4) kernel(...) { }
```
Symptoms: Slower than Direct N² at small N

Solutions:

```cpp
// Auto-switch based on particle count
if (particle_count < 10000) {
    useDirectMethod();
} else {
    useBarnesHut();
}
```

Before profiling, ensure:
- Release build (`-O3 -DNDEBUG`)
- Fast math enabled (`-use_fast_math`)
- Correct architecture (`-arch=sm_86`)
- Latest GPU drivers
- No debug output in hot paths
Optimization priority:

1. Choose the correct algorithm
2. Optimize memory access patterns
3. Tune thread block size
4. Use shared memory
5. Reduce synchronization
6. Micro-optimizations