A high-performance, multi-physics cosmological engine written in pure C (C17), designed to evolve billions of dark matter particles coupled to a baryonic SPH fluid under a cosmologically expanding spacetime, and to render the resulting cosmic web through physically-based volumetric path integration.
Macrocosm is an engine that simulates the gravitational and hydrodynamic evolution of the large-scale structure of the Universe, from near-homogeneous Gaussian initial conditions at redshift z ≈ 50 down to z = 0, reproducing the filamentary network of dark matter halos, baryonic gas, walls and voids that defines the cosmic web.
The core engineering commitments are:
- Pure C17 — no C++ runtime, no hidden allocations, deterministic control over memory layout and instruction scheduling.
- Extreme numerical rigor — every physical law is integrated with a documented, peer-reviewed numerical scheme; no heuristic shortcuts.
- Colossal scale — target budget of 10⁸–10⁹ particles on a single workstation (64-core CPU + discrete GPU), scaling to clusters via optional MPI.
- Photorealism — the output is not a diagnostic plot. It is a cinematic, physically-grounded image of cosmic structure in formation.
The engine decomposes into three physical layers (cosmology, gravity,
hydrodynamics), three engineering layers (memory, parallelism, rendering), and
a public C API defined under include/macrocosm/.
| Regime | Physical process | Numerical treatment |
|---|---|---|
| Cosmological background | FLRW expansion, a(t) from Friedmann equations | Tabulated integration + cubic spline |
| Initial conditions | Gaussian random field, CAMB-style P(k), growth D₊(a) | Zel'dovich approximation on a regular Lagrangian grid |
| Collisionless gravity | Vlasov–Poisson system of the dark matter fluid | TreePM: Barnes–Hut short-range + FFT long-range |
| Baryonic hydrodynamics | Euler equations with adiabatic index γ | Smoothed Particle Hydrodynamics (M₄ cubic-spline kernel) |
| Radiative processes | Optically thin cooling function Λ(T, Z) | Sub-cycled implicit step per particle |
| Halo identification | Non-linear collapse | Friends-of-Friends with linking length b = 0.2 · mean_sep |
| Visualization | Emission / absorption of the combined density–temperature field | Compute-shader ray marching with delta tracking |
Explicitly out of scope:
- Full general-relativistic gravity (the Newtonian limit in comoving coordinates is used instead).
- Ideal MHD of the intergalactic medium (reserved for a future Macrocosm-MHD branch).
- Neutrino transport and non-cold dark matter species.
- Radiative transfer beyond optically thin cooling.
All equations below are the load-bearing equations of the engine. Every one is implemented verbatim in a module listed in Section 4.
Spatial coordinates are comoving, x = r / a(t). The scale factor
satisfies the Friedmann equation:
H²(a) = H₀² [ Ωₘ a⁻³ + Ω_r a⁻⁴ + Ω_Λ + Ω_k a⁻² ]
with Planck 2018 fiducial parameters (Ωₘ=0.315, Ω_Λ=0.685, h=0.674). A
one-dimensional table of a(t) is precomputed at startup by 4th-order
Runge–Kutta and queried via binary search during evolution.
The collisionless Boltzmann equation for the phase-space distribution f(x,p,t), together with the Poisson equation for the perturbed potential φ:
dp/dt = −m ∇φ , dx/dt = p / (m a²)
∇² φ = 4 π G a² (ρ − ρ̄)
Symplectic integration by kick–drift–kick Leapfrog with a(t)-dependent
drift factors (Quinn et al. 1997).
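A minimal sketch of one KDK step, in one dimension, with a hedged simplification: the exact scheme uses the tabulated integrals ∫dt/a (kick) and ∫dt/a² (drift) of Quinn et al., which are approximated here by their midpoint values. For a_mid = 1 this reduces to the standard Leapfrog.

```c
#include <stddef.h>

/* One kick-drift-kick step in comoving coordinates (1-D sketch). */
typedef struct { double x, v, acc; } Particle1D;

static void kdk_step(Particle1D *p, size_t n, double dt, double a_mid,
                     void (*compute_accel)(Particle1D *, size_t)) {
    double kick  = 0.5 * dt / a_mid;          /* midpoint approx of ∫dt/a  */
    double drift = dt / (a_mid * a_mid);      /* midpoint approx of ∫dt/a² */
    for (size_t i = 0; i < n; ++i) p[i].v += kick * p[i].acc;   /* K */
    for (size_t i = 0; i < n; ++i) p[i].x += drift * p[i].v;    /* D */
    compute_accel(p, n);                      /* forces at new positions */
    for (size_t i = 0; i < n; ++i) p[i].v += kick * p[i].acc;   /* K */
}

/* Toy force for verification only: unit harmonic oscillator. */
static void harmonic_accel(Particle1D *p, size_t n) {
    for (size_t i = 0; i < n; ++i) p[i].acc = -p[i].x;
}
```

Being symplectic, the step keeps the oscillator's energy bounded over long runs rather than drifting.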
The gravitational potential is split in Fourier space by a Gaussian filter
of width r_s ≈ 1.25 · Δx_mesh:
φ(k) = φ_long(k) + φ_short(k)
φ_long(k) = φ(k) · exp(−k² r_s²) → solved on the PM grid via FFT
φ_short(r) = Σ_j G m_j / |r − r_j| · erfc(|r − r_j| / (2 r_s)) → tree walk
- Long range: real-to-complex FFT (FFTW3) on a 1024³ mesh; CIC mass assignment and force interpolation with the mandatory deconvolution of the CIC window function.
- Short range: Barnes–Hut octree with multipole acceptance criterion
θ ≤ 0.5, truncated beyond ~4.5r_s.
Density, pressure and force for each gas particle i:
ρ_i = Σ_j m_j W(|r_i − r_j|, h_i)
P_i = (γ − 1) ρ_i u_i
dv_i/dt = −Σ_j m_j [ P_i/ρ_i² + P_j/ρ_j² + Π_ij ] ∇_i W_ij
du_i/dt = ½ Σ_j m_j [ P_i/ρ_i² + P_j/ρ_j² + Π_ij ] (v_i − v_j) · ∇_i W_ij
using the M₄ cubic-spline kernel W(r,h) (Monaghan 1992) and the
Monaghan–Gingold artificial viscosity Π_ij with switches α=1, β=2.
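For reference, the M₄ cubic spline in the convention where h is the support radius (a sketch: the normalization 8/(π h³) and the kernel vanishing at r = h follow that convention; SPH codes differ on whether the support is h or 2h, so treat the constants as an assumption):

```c
#include <math.h>

#define MC_PI 3.14159265358979323846

/* M4 cubic-spline kernel in 3D, q = r/h, compact support r < h.
 * The two branches meet continuously at q = 1/2. */
static double w_cubic_spline(double r, double h) {
    double q = r / h;
    double norm = 8.0 / (MC_PI * h * h * h);
    if (q < 0.5) return norm * (1.0 + 6.0 * q * q * (q - 1.0));
    if (q < 1.0) { double t = 1.0 - q; return norm * 2.0 * t * t * t; }
    return 0.0;
}
```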
Timesteps follow the Courant–Friedrichs–Lewy condition
Δt_i = C_CFL h_i / (c_s,i + |v_i|) with C_CFL = 0.3.
Vector and tensor operations that appear in the PM solver, in SPH gradient
reconstruction and in the rendering matrix chain are all routed through a
single header, include/macrocosm/linalg.h, compiled with explicit SIMD
intrinsics (AVX2 baseline, AVX-512 where available):
- 3×3 symmetric tensors (stress, velocity gradient ∇v, deformation).
- Batched 4-wide / 8-wide dot products, cross products, fused-multiply-add over SoA arrays.
- 1024³ complex FFT wrapped around FFTW3 with pencil decomposition ready for future MPI parallelization.
Numerical references: Golub & Van Loan, Matrix Computations (4th ed.) for conditioning analysis of the Poisson inversion; Trefethen & Bau, Numerical Linear Algebra for the FFT accuracy bounds used in the convergence tests.
macrocosm/
├── include/macrocosm/ Public C API (opaque handles, PODs only)
│ ├── types.h Fixed-width types, SoA particle views, flags
│ ├── linalg.h SIMD vector/tensor primitives
│ ├── cosmology.h a(t), H(a), D+(a), Friedmann table
│ ├── particles.h Allocation, Morton sort, domain queries
│ ├── octree.h Barnes–Hut tree build & multipole walk
│ ├── pm_solver.h FFT-based long-range Poisson solver
│ ├── treepm.h Force combiner, force splitting
│ ├── sph.h Neighbour search, density, pressure, forces
│ ├── integrator.h KDK Leapfrog, adaptive timesteps
│ └── render.h Splatting, ray march, framebuffer output
├── src/
│ ├── core/ cosmology / particles / integrator
│ ├── gravity/ octree / pm_solver / treepm
│ ├── hydro/ sph
│ ├── parallel/ domain decomposition + thread pools
│ └── render/ volumetric pipeline (GPU dispatch + host-side staging)
└── shaders/ GLSL 4.30+ compute + composite vertex/fragment
Every particle array is a Struct-of-Arrays (SoA). No Array-of-Structures is permitted outside of debug I/O paths.
```c
typedef struct {
    /* hot, streamed in the integrator inner loop */
    double *restrict x,  *restrict y,  *restrict z;   /* comoving position */
    double *restrict vx, *restrict vy, *restrict vz;  /* canonical momentum / m·a² */
    float  *restrict ax, *restrict ay, *restrict az;  /* acceleration */
    /* warm, read by SPH / render */
    float *restrict mass, *restrict h, *restrict rho, *restrict u;
    /* cold, touched once per step */
    uint64_t *restrict morton;
    uint32_t *restrict id, *restrict type;
    size_t n;
    size_t capacity;
} ParticleSoA;
```

Key decisions:
- 64-byte alignment on every pointer returned by the allocator so that AVX-512 loads (`_mm512_load_pd`) are guaranteed aligned — no fallback path.
- Morton (Z-order) sorting of particles every N_sort ≈ 32 steps. This preserves spatial locality, pushes octree-neighbor pairs into the same L2 cache line, and turns the tree walk into an almost-linear streaming load.
- Huge pages (`MADV_HUGEPAGE` / `VirtualAlloc` with `MEM_LARGE_PAGES`) for arrays above 256 MiB to reduce TLB misses.
- Arena allocator per module. No `malloc` in the hot path. All scratch buffers (neighbor lists, FFT workspace, tree nodes) are preallocated at init time from a contiguous arena.
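A minimal sketch of the 64-byte aligned allocation policy, assuming C11 `aligned_alloc` (Windows would go through `_aligned_malloc` instead; the name `mc_alloc64` is illustrative, not the engine's actual API):

```c
#include <stdlib.h>
#include <stdint.h>

/* Round the request up to a multiple of 64, since aligned_alloc
 * requires the size to be a multiple of the alignment, then allocate. */
static void *mc_alloc64(size_t bytes) {
    size_t padded = (bytes + 63) & ~(size_t)63;
    return aligned_alloc(64, padded);
}
```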
Three-tier parallelism, composable and independent:
| Tier | Technology | Scope |
|---|---|---|
| Instruction-level | AVX2 / AVX-512 intrinsics | Inner loops of force, density, SPH gradients |
| Thread-level | OpenMP 5 `parallel for`, plus a custom work-stealing pool for the tree walk | All CPU loops over particles |
| Process-level (optional, Phase 4) | MPI 4 with Peano–Hilbert domain decomposition | Multi-node cluster runs |
The tree walk is the only kernel where the naïve OpenMP loop hits a load imbalance wall (dense halos vs. low-density voids); it uses an explicit work-stealing deque (one per hardware thread), with the queues seeded by a Peano–Hilbert space-filling curve partition of the particle set so that stolen work is cache-adjacent to locally owned work.
All floating-point behavior is deterministic across runs at fixed thread count — no reduction of forces in non-deterministic order. Reductions are performed via a tree of partial sums over fixed-width SIMD lanes.
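The fixed-order reduction can be sketched as a recursive pairwise sum (illustrative: the production version reduces fixed-width SIMD lane partials rather than recursing down to single elements, but the determinism argument is the same, since the association order is fixed by the recursion and not by thread scheduling):

```c
#include <stddef.h>

/* Deterministic pairwise (tree) summation: worst-case rounding error
 * grows like O(log2 n) instead of the O(n) of a left-to-right sum. */
static double tree_sum(const double *x, size_t n) {
    if (n == 1) return x[0];
    size_t half = n / 2;
    return tree_sum(x, half) + tree_sum(x + half, n - half);
}
```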
The visual output is a physically-grounded image of the density and temperature fields. It does not plot particles — particles are only the carriers of the underlying fluid fields.
A GPU compute shader projects every SPH particle into a sparse 3D texture
using its kernel W(r, h) as a footprint. Four channels are accumulated:
gas density ρ, internal energy u (proxy for temperature), mean metallicity
Z (reserved), and dark-matter projected density ρ_DM (one float per voxel,
separate texture).
Splat accumulation uses imageAtomicAdd on 32-bit float representations
(bit-cast), or on a pre-sorted particle list using segment reduction when
strict determinism is required.
A second compute shader ray-marches from the camera through the volume. The transport equation solved along each ray is the classical emission–absorption integral (Kajiya 1986; Wrenninge 2012, Production Volume Rendering):

L(ray) = ∫₀^∞ j(s) · exp(−∫₀^s κ(s') ds') ds

where j(s) is the emission coefficient derived from a physical temperature
mapping of u(s):
- Planck black-body emission at effective temperature T(u), integrated over the Johnson B, V, R bands and color-matched to sRGB with the CIE 1931 2° observer.
- Bremsstrahlung boost in regions with T > 10⁶ K (cluster cores), reproducing the X-ray appearance of intracluster gas.

κ(s) is dominated by Thomson scattering in hot phases and by dust extinction (Calzetti law) in cool high-density regions.
A fullscreen quad samples the HDR output, applies ACES tonemapping and a per-pixel bloom kernel for the brightest halo cores, and writes the final sRGB framebuffer. The fullscreen-quad + compute-texture pattern is inherited directly from the companion projects (see Section 8.1).
- Language: C17 (`-std=c17`), no GNU extensions in public headers.
- Compiler: GCC 13+, Clang 17+, or MSVC 19.37+.
- Build system: CMake ≥ 3.21, single `CMakeLists.txt` at the root.
- Required dependencies: FFTW3 (double precision), OpenGL 4.5, GLEW, GLFW 3, OpenMP.
- Optional dependencies: MPI (Phase 4), CUDA/HIP (Phase 5 — heterogeneous TreePM short-range kernel).
- Aggressive math flags, enabled by default in Release: `-O3 -march=native -ffast-math -funroll-loops -ftree-vectorize -fopenmp` (GCC/Clang) — `/O2 /fp:fast /arch:AVX2 /openmp:llvm` (MSVC).
The project is developed strictly in phases. No phase is allowed to begin before the previous one passes its verification test.
- Directory tree, `CMakeLists.txt`, architectural `README.md`.
- Zero code in hot paths. Placeholder `main.c` that links the toolchain.
Delivered in this phase:
- Public headers under `include/macrocosm/`: `types.h` (fixed-width types, MC_RESTRICT / MC_ALIGN / MC_STATIC_ASSERT, status enum, particle-species enum), `mem.h` (cross-platform 64-byte aligned allocator), `linalg.h` (scalar `vec3`/`mat3`/`sym3` primitives plus batched SoA routines), `particles.h` (SoA container), `morton.h` (21-bit-per-axis Z-order encoder), `radix_sort.h` (parallel LSD radix sort).
- Implementations under `src/core/`: `linalg.c` (axpy, dot, pairwise tree sum — forward error O(log₂ n · ε · Σ|xᵢ|) after Higham 2002 Thm 4.1; deterministic at fixed thread count), `particles.c` (aligned allocator + gather-based permutation over all 16 SoA fields), `morton.c` (BMI2 `pdep`/`pext` fast path with scalar fallback per Raman & Wise 2008), `radix_sort.c` (stable 8-bit-digit LSD sort with per-thread / per-bin exclusive scan per Wassenberg & Sanders 2011).
- Verification driver in `src/main.c`: allocates 2²⁰ particles, fills positions from a deterministic 64-bit LCG (Knuth TAoCP 3.2.1.3), encodes Morton keys, sorts, applies the permutation to every SoA field, and asserts (a) monotonicity of the sorted keys and (b) bit-exact round-trip of `morton` and `id` post-permutation. Exits non-zero on any failure.
- Deferred to Phase 6: huge-page allocation (`MADV_HUGEPAGE`/`MEM_LARGE_PAGES`) and the per-module arena allocator — both are memory-footprint optimizations that only become measurable at the 10⁸-particle production scale and are not required for Phase 1 correctness.
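The scalar fallback path of the Morton encoder can be sketched with the standard bit-spreading constants (the BMI2 `_pdep_u64` fast path produces the same interleaving in one instruction per axis; function names here are illustrative):

```c
#include <stdint.h>

/* Spread the low 21 bits of v so that input bit k lands at bit 3k. */
static uint64_t spread3(uint64_t v) {
    v &= 0x1FFFFFu;
    v = (v | v << 32) & 0x001F00000000FFFFull;
    v = (v | v << 16) & 0x001F0000FF0000FFull;
    v = (v | v <<  8) & 0x100F00F00F00F00Full;
    v = (v | v <<  4) & 0x10C30C30C30C30C3ull;
    v = (v | v <<  2) & 0x1249249249249249ull;
    return v;
}

/* 63-bit Morton key: x bits at 3k, y at 3k+1, z at 3k+2. */
static uint64_t morton3(uint32_t x, uint32_t y, uint32_t z) {
    return spread3(x) | spread3(y) << 1 | spread3(z) << 2;
}
```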
- `a(t)` integrator and `D₊(a)` table.
- Gaussian random field generator with a user-supplied `P(k)`.
- Zel'dovich displacement of a regular grid.
- Verification: the measured `P(k)` of the generated particles matches the input spectrum to within 2% in every bin.
- Barnes–Hut octree with multipole expansion up to quadrupole.
- PM solver on 512³ (dev) / 1024³ (production) mesh.
- KDK Leapfrog integrator with cosmological drift factors.
- Verification: Zel'dovich test (linear growth of a plane wave) matches the analytic `D₊(a)` to < 1% at z = 0.
- Neighbour search piggybacking on the tree.
- Density loop, pressure force, artificial viscosity, adiabatic energy.
- Radiative cooling sub-step.
- Verification: Sod shock tube and Evrard collapse test reproduce published reference solutions.
- Splat compute shader and 3D sparse texture.
- Physical ray-march compute shader with black-body + bremsstrahlung.
- ACES tonemap and bloom pass.
- Verification: render a Gaussian blob with known analytic column density and confirm radiometric correctness to < 5%.
- Peano–Hilbert domain decomposition across OpenMP work-stealing queues.
- Optional MPI transport layer.
- Optional CUDA/HIP short-range tree walk.
- Continuous regression on the Santa Barbara Cluster Comparison setup.
Macrocosm builds on two existing sibling projects in this workspace. Their design decisions and, in specific cases, their code are adapted here:
- `../atoms/` — Hydrogen Quantum Orbital Visualizer.
  - Reused concept: inverse-transform sampling via a precomputed CDF tabulated with `lower_bound`, applied in `atoms/src/atom_realtime.cpp` (`sampleR`, `sampleTheta`). In Macrocosm this pattern is redeployed in the initial-conditions generator to sample displacement amplitudes from an arbitrary `P(k)` without ever storing the full field in memory.
  - Reused concept: the inferno / fire colormap (`heatmap_fire` in `atom_realtime.cpp`) — adapted to temperature-based emission in the render pipeline as a fallback for the black-body path when physical color mapping is disabled.
  - Reused concept: the SSBO-backed fullscreen-quad render pattern used in `atoms/src/atom_raytracer.cpp` — mirrored by our compositing stage.
- `../black_hole/` — Schwarzschild Raytracer.
  - Reused concept: the GLSL compute-shader + UBO + image-store pipeline of `black_hole/geodesic.comp` and `black_hole.cpp` is the direct template for Macrocosm's render compute shaders. Binding points, dispatch strategy (16×16 local size, ceil-divided global size), and adaptive resolution while the camera is moving are all adopted.
  - Reused concept: the classical 4th-order Runge–Kutta integrator used to step rays through curved spacetime is structurally identical to the integrator we use for `a(t)` and for the optional volumetric-ray refraction term.
  - Reused concept: the CPU-side Newtonian N-body loop inside `black_hole.cpp` (pairwise `G·m₁·m₂/r²`) is our reference naïve baseline against which the TreePM force error is measured in Phase 3 verification.
Citations are listed here in the exact form they will appear in code comments at the site where each technique is applied.
- Springel, V. (2005). The cosmological simulation code GADGET-2. MNRAS 364, 1105. — applied in `src/gravity/treepm.c` for the short/long-range force split and in `src/core/integrator.c` for the cosmological Leapfrog drift/kick factors.
- Barnes, J. & Hut, P. (1986). A hierarchical O(N log N) force-calculation algorithm. Nature 324, 446. — applied in `src/gravity/octree.c` (tree construction and multipole acceptance criterion).
- Hockney, R. W. & Eastwood, J. W. (1988). Computer Simulation Using Particles. — applied in `src/gravity/pm_solver.c` (CIC mass assignment and CIC deconvolution in Fourier space).
- Monaghan, J. J. (1992). Smoothed Particle Hydrodynamics. ARAA 30, 543. — applied in `src/hydro/sph.c` (M₄ kernel, pressure force).
- Price, D. J. (2012). Smoothed Particle Hydrodynamics and Magnetohydrodynamics. J. Comp. Phys. 231, 759. — applied in `src/hydro/sph.c` (artificial viscosity switches and energy equation).
- Quinn, T., Katz, N., Stadel, J., Lake, G. (1997). Time stepping N-body simulations. arXiv:astro-ph/9710043. — applied in `src/core/integrator.c` (cosmological drift factor ∫ dt/a²).
- Zel'dovich, Ya. B. (1970). Gravitational instability: An approximate theory for large wavelength perturbations. A&A 5, 84. — applied in `src/core/cosmology.c` (initial displacement field).
- Sutherland, R. S. & Dopita, M. A. (1993). Cooling functions for low-density astrophysical plasmas. ApJS 88, 253. — applied in the `src/hydro/sph.c` cooling sub-step (tabulated Λ(T, Z)).
- Planck Collaboration (2018). Cosmological parameters. A&A 641, A6. — applied in `src/core/cosmology.c` (fiducial Ωₘ, Ω_Λ, h).
- Frigo, M. & Johnson, S. G. (2005). The design and implementation of FFTW3. Proc. IEEE 93, 216. — applied in `src/gravity/pm_solver.c` (FFT backend).
- Warren, M. S. & Salmon, J. K. (1993). A parallel hashed oct-tree N-body algorithm. Proc. Supercomputing '93. — applied in `src/parallel/domain.c` (Peano–Hilbert domain decomposition).
- Golub, G. H. & Van Loan, C. F. Matrix Computations, 4th ed. — conditioning and numerical stability of the discrete Poisson operator in `src/gravity/pm_solver.c`.
- Trefethen, L. N. & Bau, D. Numerical Linear Algebra. — error analysis of the batched SIMD reductions in `src/core/integrator.c`.
- Drepper, U. (2007). What Every Programmer Should Know About Memory. — cache blocking, alignment and NUMA strategies in `src/parallel/threads.c` and `include/macrocosm/particles.h`.
- Kajiya, J. T. (1986). The rendering equation. SIGGRAPH '86. — basis of the emission–absorption integral in `shaders/raymarch.comp`.
- Wrenninge, M. (2012). Production Volume Rendering. — delta-tracking and transfer-function design used in `shaders/raymarch.comp`.
This document is the architectural foundation of the project. All subsequent implementation is required to be justifiable in terms of a section above. Any deviation must be reflected here in the same commit.