# Integrating Standard Parallelism Libraries with diffeq

This document shows how to use standard parallelism libraries with the diffeq library, following the feedback to avoid custom parallel classes and use proven standard libraries instead.

## Overview

Instead of creating custom parallel classes, we recommend using established standard libraries:

- **std::execution** - C++17/20 execution policies
- **OpenMP** - Cross-platform shared-memory multiprocessing
- **Intel TBB** - Threading Building Blocks for advanced parallel algorithms
- **NVIDIA Thrust** - GPU acceleration without writing CUDA kernels

## Quick Examples

### 1. Basic Parallel Integration with std::execution

```cpp
#include <examples/standard_parallelism.hpp>
#include <execution>
#include <algorithm>
#include <vector>

// Simple harmonic oscillator
struct HarmonicOscillator {
    void operator()(const std::vector<double>& y, std::vector<double>& dydt, double t) const {
        dydt[0] = y[1];   // dx/dt = v
        dydt[1] = -y[0];  // dv/dt = -x (ω=1)
    }
};

// Multiple initial conditions in parallel
std::vector<std::vector<double>> initial_conditions(1000);
for (std::size_t i = 0; i < initial_conditions.size(); ++i) {
    initial_conditions[i] = {1.0 + 0.001 * i, 0.0};  // e.g. slightly perturbed starting points
}

HarmonicOscillator system;
diffeq::examples::StandardParallelODE<std::vector<double>, double>::integrate_multiple_conditions(
    system, initial_conditions, 0.01, 100
);
```
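
Under the hood, a helper like this can be a thin wrapper over `std::for_each` with a parallel execution policy. If you prefer to skip the helper, the same pattern written directly (a minimal sketch; the helper's actual internals are an assumption, but the integrator calls match the ones used throughout this document):

```cpp
std::for_each(std::execution::par, initial_conditions.begin(), initial_conditions.end(),
    [&](std::vector<double>& state) {
        // One integrator per task: cheap to construct, and keeping it
        // thread-local avoids any shared mutable state
        auto integrator = diffeq::integrators::ode::RK4Integrator<std::vector<double>, double>(system);
        for (int step = 0; step < 100; ++step) {
            integrator.step(state, 0.01);  // dt = 0.01, 100 steps, as above
        }
    });
```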

### 2. Parameter Sweep (Beyond Initial Conditions)

```cpp
// Vary system parameters in parallel
std::vector<double> frequencies = {0.5, 1.0, 1.5, 2.0, 2.5};
std::vector<std::vector<double>> results;

diffeq::examples::StandardParallelODE<std::vector<double>, double>::parameter_sweep(
    [](const std::vector<double>& y, std::vector<double>& dydt, double t, double omega) {
        dydt[0] = y[1];
        dydt[1] = -omega*omega*y[0];  // Parameterized frequency
    },
    {1.0, 0.0}, frequencies, results, 0.01, 100
);
```
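
The same sweep can also be written directly against the standard library. A minimal sketch, assuming (as the functor example above suggests) that the integrator accepts any callable with the `(y, dydt, t)` signature:

```cpp
std::vector<std::vector<double>> sweep_results(frequencies.size());
std::transform(std::execution::par, frequencies.begin(), frequencies.end(), sweep_results.begin(),
    [](double omega) {
        std::vector<double> state = {1.0, 0.0};
        // Bind the swept parameter into the system via a capturing lambda
        auto system = [omega](const std::vector<double>& y, std::vector<double>& dydt, double t) {
            dydt[0] = y[1];
            dydt[1] = -omega * omega * y[0];
        };
        auto integrator = diffeq::integrators::ode::RK4Integrator<std::vector<double>, double>(system);
        for (int step = 0; step < 100; ++step) {
            integrator.step(state, 0.01);
        }
        return state;  // final state for this frequency
    });
```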

### 3. OpenMP for CPU Parallelism

```cpp
#include <omp.h>

std::vector<std::vector<double>> states(1000);
// ... initialize states ...

#pragma omp parallel for
for (size_t i = 0; i < states.size(); ++i) {
    // Each iteration constructs its own integrator, so no state is shared between threads
    auto integrator = diffeq::integrators::ode::RK4Integrator<std::vector<double>, double>(system);
    for (int step = 0; step < 100; ++step) {
        integrator.step(states[i], 0.01);
    }
}
```
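
If per-system cost varies (for example with adaptive step-size integrators), a `schedule(dynamic)` clause lets idle threads pick up remaining iterations; a small variant of the loop above:

```cpp
#pragma omp parallel for schedule(dynamic)
for (size_t i = 0; i < states.size(); ++i) {
    auto integrator = diffeq::integrators::ode::RK4Integrator<std::vector<double>, double>(system);
    for (int step = 0; step < 100; ++step) {
        integrator.step(states[i], 0.01);
    }
}
```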

### 4. GPU Acceleration with Thrust (No Custom Kernels)

Note that `std::vector` cannot live in GPU memory, so the per-system states go into one flat `thrust::device_vector<double>` (two doubles per oscillator here). Device lambdas also require compiling with nvcc and `--extended-lambda`.

```cpp
#include <thrust/for_each.h>
#include <thrust/execution_policy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>

// Copy to GPU: host_states is a flat std::vector<double>, 2 entries per system
thrust::device_vector<double> gpu_states = host_states;
const std::size_t num_systems = gpu_states.size() / 2;
double* states = thrust::raw_pointer_cast(gpu_states.data());

// GPU parallel execution without writing kernels!
thrust::for_each(thrust::device,
    thrust::counting_iterator<std::size_t>(0),
    thrust::counting_iterator<std::size_t>(num_systems),
    [=] __device__ (std::size_t i) {
        // Your ODE integration code here, operating on
        // states[2*i] and states[2*i + 1]
        // (the integrator needs to be GPU-compatible)
    });

// Copy back to host
thrust::copy(gpu_states.begin(), gpu_states.end(), host_states.begin());
```

### 5. Intel TBB for Advanced Parallelism

```cpp
#include <tbb/parallel_for_each.h>

tbb::parallel_for_each(states.begin(), states.end(),
    [&](std::vector<double>& state) {
        auto integrator = diffeq::integrators::ode::RK4Integrator<std::vector<double>, double>(system);
        for (int step = 0; step < 100; ++step) {
            integrator.step(state, 0.01);
        }
    });
```
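
For finer control over chunking, TBB's range-based `parallel_for` amortizes integrator construction over a whole chunk of systems instead of one per element. A sketch using the same `states` and `system` as above (and assuming the integrator keeps no per-trajectory state between `step` calls):

```cpp
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

tbb::parallel_for(tbb::blocked_range<std::size_t>(0, states.size()),
    [&](const tbb::blocked_range<std::size_t>& range) {
        // One integrator per chunk, reused for every system in the chunk
        auto integrator = diffeq::integrators::ode::RK4Integrator<std::vector<double>, double>(system);
        for (std::size_t i = range.begin(); i != range.end(); ++i) {
            for (int step = 0; step < 100; ++step) {
                integrator.step(states[i], 0.01);
            }
        }
    });
```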

## Benefits of This Approach

### ✅ Advantages
- **Proven Libraries**: Use optimized, well-tested standard libraries
- **No Learning Curve**: No custom classes to learn
- **Flexibility**: Vary parameters, integrators, callbacks, and devices
- **Hardware-Specific**: Choose the right tool for each use case
- **GPU Support**: Thrust provides GPU acceleration without hand-written kernels
- **Standard Detection**: Use the libraries' own functions for hardware detection

### ❌ What We Avoid
- Custom "facade" classes
- Reinventing parallel algorithms
- Custom hardware-detection code
- Restricting flexibility to only initial conditions

## Choosing the Right Library

| Use Case | Recommended Library | Why |
|----------|---------------------|-----|
| Simple parallel loops | `std::execution::par` | Built into C++17, no extra dependencies |
| CPU-intensive computation | OpenMP | Mature, cross-platform, great CPU scaling |
| Complex task dependencies | Intel TBB | Advanced algorithms, work-stealing scheduler |
| GPU acceleration | NVIDIA Thrust | GPU without custom kernels |
| Mixed workloads | Combination | Use the right tool for each part |

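When one program mixes these cases, the choice can simply be made at the call site. A minimal sketch of size-based dispatch (`kParallelThreshold` is a hypothetical tuning knob, not a measured recommendation):

```cpp
constexpr std::size_t kParallelThreshold = 64;  // illustrative; measure before committing

auto integrate_one = [&](std::vector<double>& state) {
    auto integrator = diffeq::integrators::ode::RK4Integrator<std::vector<double>, double>(system);
    for (int step = 0; step < 100; ++step) {
        integrator.step(state, 0.01);
    }
};

// Parallel scheduling has fixed overhead, so small batches run sequentially
if (states.size() >= kParallelThreshold) {
    std::for_each(std::execution::par, states.begin(), states.end(), integrate_one);
} else {
    std::for_each(states.begin(), states.end(), integrate_one);
}
```
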
## Real-World Examples

### Monte Carlo Simulations
```cpp
#include <numeric>
#include <random>

// Parameter sweep with different random seeds
std::vector<int> seeds(1000);
std::iota(seeds.begin(), seeds.end(), 1);

std::for_each(std::execution::par, seeds.begin(), seeds.end(),
    [&](int seed) {
        std::mt19937 rng(seed);  // Each task owns its RNG state
        // Run the simulation with this generator, e.g. draw a random
        // initial condition and integrate it as in the examples above
    });
```

### Robotics Control Systems
```cpp
// Real-time control with different controller parameters
#pragma omp parallel for schedule(static)
for (size_t i = 0; i < control_parameters.size(); ++i) {
    auto controller = create_controller(control_parameters[i]);
    auto integrator = diffeq::integrators::ode::RK4Integrator<State, double>(controller);

    // Simulate the control system
    for (int step = 0; step < simulation_steps; ++step) {
        integrator.step(robot_state[i], dt);
    }
}
```

### Multi-Physics Simulations
```cpp
// Different physics models running simultaneously
#pragma omp parallel sections
{
    #pragma omp section
    {
        // Fluid dynamics
        integrate_fluid_system();
    }

    #pragma omp section
    {
        // Structural mechanics
        integrate_structural_system();
    }

    #pragma omp section
    {
        // Heat transfer
        integrate_thermal_system();
    }
}
```

## Hardware Detection (Standard Way)

Instead of custom hardware detection, use the functions the libraries already provide:

```cpp
#include <omp.h>
#include <cuda_runtime.h>
#include <tbb/global_control.h>

// OpenMP thread count
int num_threads = omp_get_max_threads();

// CUDA device count
int device_count = 0;
cudaGetDeviceCount(&device_count);
bool gpu_available = (device_count > 0);

// TBB (oneTBB) initializes itself automatically; use global_control to cap
// the thread count (the older tbb::task_scheduler_init is deprecated)
tbb::global_control limit(tbb::global_control::max_allowed_parallelism, num_threads);
```

## Integration with Existing diffeq Code

The beauty of this approach is that **your existing diffeq code doesn't change**:

```cpp
// This works exactly as before
auto integrator = diffeq::integrators::ode::RK4Integrator<std::vector<double>, double>(system);
integrator.step(state, dt);

// Just wrap it in standard parallel constructs when you need parallelism
std::for_each(std::execution::par, states.begin(), states.end(),
    [&](auto& state) {
        auto integrator = diffeq::integrators::ode::RK4Integrator<std::vector<double>, double>(system);
        integrator.step(state, dt);
    });
```

## Building and Dependencies

### CMake Integration
```cmake
# For std::execution
# (note: GCC's libstdc++ implements the C++17 parallel algorithms on top of TBB,
# so TBB must also be linked there)
target_compile_features(your_target PRIVATE cxx_std_17)

# For OpenMP
find_package(OpenMP REQUIRED)
target_link_libraries(your_target OpenMP::OpenMP_CXX)

# For Intel TBB
find_package(TBB REQUIRED)
target_link_libraries(your_target TBB::tbb)

# For NVIDIA Thrust (ships with the CUDA Toolkit; requires CMake >= 3.17)
find_package(CUDAToolkit REQUIRED)
target_link_libraries(your_target CUDA::cudart)

# Device lambdas (as in the Thrust example) additionally need:
target_compile_options(your_target PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:--extended-lambda>)
```

This approach gives you maximum flexibility while leveraging proven, optimized libraries instead of reinventing the wheel.