Macrocosm — Large-Scale Structure & Cosmic Web Simulator

A high-performance, multi-physics cosmological engine written in pure C (C17), designed to evolve billions of dark matter particles coupled to a baryonic SPH fluid under a cosmologically expanding spacetime, and to render the resulting cosmic web through physically-based volumetric path integration.


1. Project Vision

Macrocosm is an engine that simulates the gravitational and hydrodynamic evolution of the large-scale structure of the Universe, from near-homogeneous Gaussian initial conditions at redshift z ≈ 50 down to z = 0, reproducing the filamentary network of dark matter halos, baryonic gas, walls and voids that defines the cosmic web.

The core engineering commitments are:

  1. Pure C17 — no C++ runtime, no hidden allocations, deterministic control over memory layout and instruction scheduling.
  2. Extreme numerical rigor — every physical law is integrated with a documented, peer-reviewed numerical scheme; no heuristic shortcuts.
  3. Colossal scale — target budget of 10⁸–10⁹ particles on a single workstation (64-core CPU + discrete GPU), scaling to clusters via optional MPI.
  4. Photorealism — the output is not a diagnostic plot. It is a cinematic, physically-grounded image of cosmic structure in formation.

The engine decomposes cleanly into three physical layers (cosmology, gravity, hydrodynamics), three engineering layers (memory, parallelism, rendering), and a public C API defined under include/macrocosm/.


2. Scientific Scope

2.1 Phenomena Modeled

| Regime | Physical process | Numerical treatment |
| --- | --- | --- |
| Cosmological background | FLRW expansion, a(t) from Friedmann equations | Tabulated integration + cubic spline |
| Initial conditions | Gaussian random field, CAMB-style P(k), growth D₊(a) | Zel'dovich approximation on a regular Lagrangian grid |
| Collisionless gravity | Vlasov–Poisson system of the dark matter fluid | TreePM: Barnes–Hut short-range + FFT long-range |
| Baryonic hydrodynamics | Euler equations with adiabatic index γ | Smoothed Particle Hydrodynamics (M₄ cubic-spline kernel) |
| Radiative processes | Optically thin cooling function Λ(T, Z) | Sub-cycled implicit step per particle |
| Halo identification | Non-linear collapse | Friends-of-Friends with linking length b = 0.2 · mean_sep |
| Visualization | Emission / absorption of the combined density–temperature field | Compute-shader ray marching with delta tracking |

2.2 Out of Scope (Phase 1)

  • Full general-relativistic gravity (Newtonian limit in comoving coordinates is used).
  • Ideal MHD of the intergalactic medium (reserved for a future Macrocosm-MHD branch).
  • Neutrino transport and non-cold dark matter species.
  • Radiative transfer beyond optically thin cooling.

3. Mathematical Foundation

The equations below are load-bearing: each is implemented verbatim in a module listed in Section 4.

3.1 Cosmological Background

Spatial coordinates are comoving, x = r / a(t). The scale factor satisfies the Friedmann equation:

H²(a) = H₀² [ Ωₘ a⁻³ + Ω_r a⁻⁴ + Ω_Λ + Ω_k a⁻² ]

with Planck 2018 fiducial parameters (Ωₘ=0.315, Ω_Λ=0.685, h=0.674). A one-dimensional table of a(t) is precomputed at startup by 4th-order Runge–Kutta and queried via binary search during evolution.
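
A minimal sketch of this table build and lookup path, assuming a flat matter + Λ background and with linear interpolation standing in for the cubic spline; the names (e_of_a, rk4_step_a, a_of_t) are illustrative, not the engine's actual API:

#include <math.h>
#include <stddef.h>

/* Planck 2018 fiducial parameters quoted above; H0 stays in whatever time unit
 * the table uses, everything else is dimensionless. */
#define MC_OMEGA_M 0.315
#define MC_OMEGA_L 0.685

/* E(a) = H(a)/H0 for a flat matter + Lambda background (radiation and curvature
 * dropped to keep the sketch short). */
static double e_of_a(double a)
{
    return sqrt(MC_OMEGA_M / (a * a * a) + MC_OMEGA_L);
}

/* One classical 4th-order Runge-Kutta step of da/dt = a H0 E(a), used to fill
 * the a(t) table at startup. */
static double rk4_step_a(double a, double dt, double H0)
{
    double k1 = a               * H0 * e_of_a(a);
    double k2 = (a + 0.5*dt*k1) * H0 * e_of_a(a + 0.5*dt*k1);
    double k3 = (a + 0.5*dt*k2) * H0 * e_of_a(a + 0.5*dt*k2);
    double k4 = (a +     dt*k3) * H0 * e_of_a(a +     dt*k3);
    return a + dt * (k1 + 2.0*k2 + 2.0*k3 + k4) / 6.0;
}

/* Binary-search lookup in the precomputed table (t[] strictly increasing,
 * n >= 2, tq inside the tabulated range). */
static double a_of_t(const double *t, const double *a, size_t n, double tq)
{
    size_t lo = 0, hi = n - 1;
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (t[mid] <= tq) lo = mid; else hi = mid;
    }
    double w = (tq - t[lo]) / (t[hi] - t[lo]);
    return a[lo] + w * (a[hi] - a[lo]);
}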

3.2 Gravity — Vlasov–Poisson in Comoving Coordinates

The collisionless Boltzmann equation for the phase-space distribution f(x,p,t) is solved along its characteristics, together with the Poisson equation for the perturbed potential φ:

dp/dt = −m ∇φ ,       dx/dt = p / (m a²)
∇² φ  = 4 π G a² (ρ − ρ̄)

Symplectic integration by kick–drift–kick Leapfrog with a(t)-dependent drift factors (Quinn et al. 1997).
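
A schematic KDK step over the ParticleSoA layout of Section 4.2. The Quinn et al. (1997) kick and drift factors are assumed to be precomputed from the a(t) table (they are integrals of dt/a and dt/a² over the respective sub-steps); only how they are applied is shown, and the function name and signature are illustrative:

#include <stddef.h>

/* One kick-drift-kick step. kick_half is the cosmological kick factor for half
 * a step, drift_full the drift factor for the full step; both are read from the
 * a(t) table, not computed here. ParticleSoA is defined in Section 4.2. */
static void kdk_step(ParticleSoA *p, double kick_half, double drift_full)
{
    for (size_t i = 0; i < p->n; ++i) {   /* first half kick */
        p->vx[i] += kick_half * p->ax[i];
        p->vy[i] += kick_half * p->ay[i];
        p->vz[i] += kick_half * p->az[i];
    }
    for (size_t i = 0; i < p->n; ++i) {   /* full drift in comoving coordinates */
        p->x[i] += drift_full * p->vx[i];
        p->y[i] += drift_full * p->vy[i];
        p->z[i] += drift_full * p->vz[i];
    }
    /* ... recompute accelerations (TreePM + SPH) at the new positions ... */
    for (size_t i = 0; i < p->n; ++i) {   /* second half kick */
        p->vx[i] += kick_half * p->ax[i];
        p->vy[i] += kick_half * p->ay[i];
        p->vz[i] += kick_half * p->az[i];
    }
}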

3.3 TreePM Force Decomposition

The gravitational potential is split in Fourier space by a Gaussian filter of width r_s ≈ 1.25 · Δx_mesh:

φ(k) = φ_long(k) + φ_short(k)
φ_long(k)  = φ(k) · exp(−k² r_s²)     → solved on the PM grid via FFT
φ_short(r) = Σ_j G m_j / |r − r_j| · erfc(|r − r_j| / 2 r_s)   → tree walk
  • Long range: real-to-complex FFT (FFTW3) on a 1024³ mesh; CIC mass assignment and force interpolation with the mandatory deconvolution of the CIC window function.
  • Short range: Barnes–Hut octree with multipole acceptance criterion θ ≤ 0.5, truncated beyond ~4.5 r_s.
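
For reference, the short-range sum written as a direct loop over all particles, truncated at r_cut ≈ 4.5 r_s. In the engine the same expression is evaluated against nodes and leaves produced by the Barnes–Hut walk; the loop below only makes the formula explicit, and G, soft and the argument names are illustrative:

#include <math.h>
#include <stddef.h>

static void short_range_accel(const double *x, const double *y, const double *z,
                              const float *mass, size_t n, size_t i,
                              double G, double r_s, double r_cut, double soft,
                              double *ax, double *ay, double *az)
{
    double axi = 0.0, ayi = 0.0, azi = 0.0;
    for (size_t j = 0; j < n; ++j) {
        if (j == i) continue;
        double dx = x[j] - x[i], dy = y[j] - y[i], dz = z[j] - z[i];
        double r2 = dx*dx + dy*dy + dz*dz + soft*soft;
        double r  = sqrt(r2);
        if (r > r_cut) continue;
        /* Newtonian pair force multiplied by the erfc cutoff that removes the
         * part already carried by the PM mesh. The complete split also carries
         * an exponential correction term from differentiating erfc (see
         * Springel 2005); it is omitted here for brevity. */
        double f = G * (double)mass[j] / (r2 * r) * erfc(r / (2.0 * r_s));
        axi += f * dx;  ayi += f * dy;  azi += f * dz;
    }
    *ax = axi;  *ay = ayi;  *az = azi;
}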

3.4 Hydrodynamics — SPH

Density, pressure and force for each gas particle i:

ρ_i = Σ_j m_j W(|r_i − r_j|, h_i)
P_i = (γ − 1) ρ_i u_i
dv_i/dt = −Σ_j m_j [ P_i/ρ_i² + P_j/ρ_j² + Π_ij ] ∇_i W_ij
du_i/dt = ½ Σ_j m_j [ P_i/ρ_i² + P_j/ρ_j² + Π_ij ] (v_i − v_j) · ∇_i W_ij

using the M₄ cubic-spline kernel W(r,h) (Monaghan 1992) and the Monaghan–Gingold artificial viscosity Π_ij with coefficients α=1, β=2. Timesteps follow the Courant–Friedrichs–Lewy condition Δt_i = C_CFL h_i / (c_s,i + |v_i|) with C_CFL = 0.3.
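
A minimal sketch of the M₄ kernel and the density estimate it feeds, assuming the neighbour list has already been produced by the tree-based search; the function and argument names are illustrative:

#include <math.h>
#include <stddef.h>

/* M4 cubic-spline kernel in 3D (Monaghan 1992): support radius 2h,
 * normalization 1/(pi h^3), q = r/h. */
static double sph_kernel_m4(double r, double h)
{
    const double pi = 3.14159265358979323846;
    double q     = r / h;
    double sigma = 1.0 / (pi * h * h * h);
    if (q < 1.0)
        return sigma * (1.0 - 1.5 * q * q + 0.75 * q * q * q);
    if (q < 2.0) {
        double t = 2.0 - q;
        return sigma * 0.25 * t * t * t;
    }
    return 0.0;
}

/* Density estimate for particle i over a precomputed neighbour list, using the
 * ParticleSoA layout of Section 4.2. */
static float sph_density(const ParticleSoA *p, size_t i, const size_t *neigh, size_t nn)
{
    double rho = 0.0;
    for (size_t k = 0; k < nn; ++k) {
        size_t j  = neigh[k];
        double dx = p->x[i] - p->x[j];
        double dy = p->y[i] - p->y[j];
        double dz = p->z[i] - p->z[j];
        double r  = sqrt(dx*dx + dy*dy + dz*dz);
        rho += (double)p->mass[j] * sph_kernel_m4(r, (double)p->h[i]);
    }
    return (float)rho;
}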

3.5 Linear Algebra Layer

Vector and tensor operations that appear in the PM solver, in SPH gradient reconstruction, and in the rendering matrix chain are all routed through a single header, include/macrocosm/linalg.h, compiled with explicit SIMD intrinsics (AVX2 baseline, AVX-512 where available); a sketch of the batched SoA path follows the list below:

  • 3×3 symmetric tensors (stress, velocity gradient ∇v, deformation).
  • Batched 4-wide / 8-wide dot products, cross products, fused-multiply-add over SoA arrays.
  • 1024³ complex FFT wrapper around FFTW3, with pencil decomposition ready for future MPI parallelization.
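
One batched SoA routine of the kind linalg.h is described as providing, restricted to the AVX2 + FMA path; dispatch, tail handling and the AVX-512 variant are omitted, and the function name is illustrative:

#include <immintrin.h>
#include <stddef.h>

/* out[i] = ax[i]*bx[i] + ay[i]*by[i] + az[i]*bz[i], four lanes per iteration.
 * Assumes n is a multiple of 4 and all pointers carry the allocator's 64-byte
 * alignment. Compile with -mavx2 -mfma (or -march=native). */
static void dot3_soa_avx2(const double *ax, const double *ay, const double *az,
                          const double *bx, const double *by, const double *bz,
                          double *out, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m256d r = _mm256_mul_pd(_mm256_load_pd(ax + i), _mm256_load_pd(bx + i));
        r = _mm256_fmadd_pd(_mm256_load_pd(ay + i), _mm256_load_pd(by + i), r);
        r = _mm256_fmadd_pd(_mm256_load_pd(az + i), _mm256_load_pd(bz + i), r);
        _mm256_store_pd(out + i, r);
    }
}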

Numerical references: Golub & Van Loan, Matrix Computations (4th ed.) for conditioning analysis of the Poisson inversion; Trefethen & Bau, Numerical Linear Algebra for the FFT accuracy bounds used in the convergence tests.


4. Engine Architecture

4.1 Module Map

macrocosm/
├── include/macrocosm/          Public C API (opaque handles, PODs only)
│   ├── types.h                 Fixed-width types, SoA particle views, flags
│   ├── linalg.h                SIMD vector/tensor primitives
│   ├── cosmology.h             a(t), H(a), D+(a), Friedmann table
│   ├── particles.h             Allocation, Morton sort, domain queries
│   ├── octree.h                Barnes–Hut tree build & multipole walk
│   ├── pm_solver.h             FFT-based long-range Poisson solver
│   ├── treepm.h                Force combiner, force splitting
│   ├── sph.h                   Neighbour search, density, pressure, forces
│   ├── integrator.h            KDK Leapfrog, adaptive timesteps
│   └── render.h                Splatting, ray march, framebuffer output
├── src/
│   ├── core/        cosmology / particles / integrator
│   ├── gravity/     octree / pm_solver / treepm
│   ├── hydro/       sph
│   ├── parallel/    domain decomposition + thread pools
│   └── render/      volumetric pipeline (GPU dispatch + host-side staging)
└── shaders/         GLSL 4.30+ compute + composite vertex/fragment

4.2 Memory Design

Every particle array is a Struct-of-Arrays (SoA). No Array-of-Structures is permitted outside of debug I/O paths.

typedef struct {
    /* hot, streamed in the integrator inner loop */
    double *restrict x, *restrict y, *restrict z;        /* comoving position */
    double *restrict vx, *restrict vy, *restrict vz;     /* canonical momentum / m·a² */
    float  *restrict ax, *restrict ay, *restrict az;     /* acceleration */

    /* warm, read by SPH / render */
    float  *restrict mass, *restrict h, *restrict rho, *restrict u;

    /* cold, touched once per step */
    uint64_t *restrict morton;
    uint32_t *restrict id, *restrict type;

    size_t   n;
    size_t   capacity;
} ParticleSoA;

Key decisions:

  • 64-byte alignment on every pointer returned by the allocator, so that AVX-512 loads (_mm512_load_pd) are guaranteed aligned — no fallback path (see the allocation sketch after this list).
  • Morton (Z-order) sorting of particles every N_sort ≈ 32 steps. This preserves spatial locality, keeps octree-neighbor pairs in nearby cache lines, and turns the tree walk into an almost-linear streaming load.
  • Huge pages (MADV_HUGEPAGE / VirtualAlloc with MEM_LARGE_PAGES) for arrays above 256 MiB to reduce TLB misses.
  • Arena allocator per module. No malloc in the hot path. All scratch buffers (neighbor lists, FFT workspace, tree nodes) are preallocated at init time from a contiguous arena.
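
A minimal sketch of the cross-platform 64-byte aligned allocation described above (the Phase 1 notes attribute this role to mem.h); the names are illustrative, and the huge-page and arena layers are omitted:

#include <stdlib.h>
#if defined(_MSC_VER)
#include <malloc.h>
#endif

#define MC_ALIGNMENT 64u

/* aligned_alloc (C11/C17) requires the size to be a multiple of the alignment,
 * so the request is padded up first. */
static void *mc_aligned_alloc(size_t bytes)
{
    size_t padded = (bytes + (MC_ALIGNMENT - 1)) & ~(size_t)(MC_ALIGNMENT - 1);
#if defined(_MSC_VER)
    return _aligned_malloc(padded, MC_ALIGNMENT);
#else
    return aligned_alloc(MC_ALIGNMENT, padded);
#endif
}

static void mc_aligned_free(void *p)
{
#if defined(_MSC_VER)
    _aligned_free(p);
#else
    free(p);
#endif
}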

4.3 Parallelism Model

Three-tier parallelism, composable and independent:

| Tier | Technology | Scope |
| --- | --- | --- |
| Instruction-level | AVX2 / AVX-512 intrinsics | Inner loops of force, density, SPH gradients |
| Thread-level | OpenMP 5 parallel for, plus a custom work-stealing pool for the tree walk | All CPU loops over particles |
| Process-level (optional, Phase 6) | MPI 4 with Peano–Hilbert domain decomposition | Multi-node cluster runs |

The tree walk is the only kernel where the naïve OpenMP loop hits a load imbalance wall (dense halos vs. low-density voids); it uses an explicit work-stealing deque (one per hardware thread), with the queues seeded by a Peano–Hilbert space-filling curve partition of the particle set so that stolen work is cache-adjacent to locally owned work.

All floating-point behavior is deterministic across runs at fixed thread count — no reduction of forces in non-deterministic order. Reductions are performed via a tree of partial sums over fixed-width SIMD lanes.
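
A serial sketch of that deterministic reduction: a pairwise (tree) sum whose evaluation order depends only on the array length, never on scheduling. In the engine each thread reduces its own fixed slice this way and the per-thread partials are combined in thread-index order; only the serial kernel is shown:

#include <stddef.h>

static double pairwise_sum(const double *x, size_t n)
{
    if (n <= 8) {                      /* fixed-order base case */
        double s = 0.0;
        for (size_t i = 0; i < n; ++i) s += x[i];
        return s;
    }
    size_t half = n / 2;
    return pairwise_sum(x, half) + pairwise_sum(x + half, n - half);
}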


5. Rendering Pipeline

The visual output is a physically-grounded image of the density and temperature fields. It does not plot particles — particles are only the carriers of the underlying fluid fields.

5.1 Stage 1 — Volumetric Splatting

A GPU compute shader projects every SPH particle into a sparse 3D texture using its kernel W(r, h) as a footprint. Four channels are accumulated: gas density ρ, internal energy u (proxy for temperature), mean metallicity Z (reserved), and dark-matter projected density ρ_DM (one float per voxel, separate texture).

Splat accumulation uses imageAtomicAdd on bit-cast 32-bit float values, or, when strict determinism is required, a segmented reduction over a pre-sorted particle list.

5.2 Stage 2 — Physical Ray Marching

A second compute shader ray-marches from the camera through the volume. The transport equation solved along each ray is the classical emission–absorption integral (Kajiya 1986; Wrenninge 2012, Production Volume Rendering):

L(ray) = ∫₀^∞ j(s) · exp(−∫₀^s κ(s') ds') ds

where j(s) is the emission coefficient derived from a physical temperature mapping of u(s):

  • Planck black-body emission at effective temperature T(u), integrated over the Johnson B, V, R bands and color-matched to sRGB with the CIE 1931 2° observer.
  • Bremsstrahlung boost in regions with T > 10⁶ K (cluster cores) reproducing the X-ray appearance of intracluster gas.

κ(s) is dominated by Thomson scattering in hot phases and by dust extinction (Calzetti law) in cool high-density regions.
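
A host-side reference for the discretized integral, marched front to back with a fixed step; emission() and extinction() stand in for the physical mappings above, and the production path runs in a GLSL compute shader rather than in C:

#include <math.h>

static double march_ray(double (*emission)(double s),
                        double (*extinction)(double s),
                        double s_max, double ds)
{
    double L   = 0.0;   /* accumulated radiance          */
    double tau = 0.0;   /* optical depth from the camera */
    for (double s = 0.5 * ds; s < s_max; s += ds) {
        tau += extinction(s) * ds;            /* grow the inner absorption integral */
        L   += emission(s) * exp(-tau) * ds;  /* attenuated emission contribution   */
    }
    return L;
}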

5.3 Stage 3 — Compositing

A fullscreen quad samples the HDR output, applies ACES tonemapping and a per-pixel bloom kernel for the brightest halo cores, and writes the final sRGB framebuffer. The fullscreen-quad + compute-texture pattern is inherited directly from the companion projects (see Section 8.1).


6. Build Environment

  • Language: C17 (-std=c17), no GNU extensions in public headers.
  • Compiler: GCC 13+, Clang 17+, or MSVC 19.37+.
  • Build system: CMake ≥ 3.21, single CMakeLists.txt at the root.
  • Required dependencies: FFTW3 (double precision), OpenGL 4.5, GLEW, GLFW 3, OpenMP.
  • Optional dependencies: MPI (Phase 6), CUDA/HIP (Phase 6 — heterogeneous TreePM short-range kernel).
  • Aggressive math flags, enabled by default in Release: -O3 -march=native -ffast-math -funroll-loops -ftree-vectorize -fopenmp (GCC/Clang) — /O2 /fp:fast /arch:AVX2 /openmp:llvm (MSVC).

7. Development Roadmap

The project is developed strictly in phases. No phase is allowed to begin before the previous one passes its verification test.

Phase 0 — Scaffolding ✅ complete

  • Directory tree, CMakeLists.txt, architectural README.md.
  • Zero code in hot paths. Placeholder main.c that verifies the toolchain compiles and links.

Phase 1 — Core Data Structures ✅ complete

Delivered in this phase:

  • Public headers under include/macrocosm/: types.h (fixed-width types, MC_RESTRICT / MC_ALIGN / MC_STATIC_ASSERT, status enum, particle-species enum), mem.h (cross-platform 64-byte aligned allocator), linalg.h (scalar vec3 / mat3 / sym3 primitives plus batched SoA routines), particles.h (SoA container), morton.h (21-bit-per-axis Z-order encoder), radix_sort.h (parallel LSD radix sort).
  • Implementations under src/core/: linalg.c (axpy, dot, pairwise tree sum — forward error O(log₂ n · ε · Σ|xᵢ|) after Higham 2002 Thm 4.1; deterministic at fixed thread count), particles.c (aligned allocator + gather-based permutation over all 16 SoA fields), morton.c (BMI2 pdep/pext fast path with scalar fallback per Raman & Wise 2008; a minimal encoder sketch follows this list), radix_sort.c (stable 8-bit-digit LSD sort with per-thread / per-bin exclusive scan per Wassenberg & Sanders 2011).
  • Verification driver in src/main.c: allocates 2²⁰ particles, fills positions from a deterministic 64-bit LCG (Knuth TAoCP 3.2.1.3), encodes Morton keys, sorts, applies the permutation to every SoA field, and asserts (a) monotonicity of the sorted keys and (b) bit-exact round-trip of morton and id post-permutation. Exits non-zero on any failure.
  • Deferred to Phase 6: huge-page allocation (MADV_HUGEPAGE / MEM_LARGE_PAGES) and the per-module arena allocator — both are memory-footprint optimizations that only become measurable at the 10⁸-particle production scale and are not required for Phase 1 correctness.
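
A minimal sketch of the 21-bit-per-axis Morton encoder with the BMI2 fast path and scalar fallback described above; it assumes the inputs are already quantized to 21-bit grid coordinates, and the function name is illustrative:

#include <stdint.h>
#if defined(__BMI2__)
#include <immintrin.h>
#endif

/* Interleave the low 21 bits of x, y, z into a 63-bit Z-order key:
 * bit 3k of the key holds bit k of x, bit 3k+1 bit k of y, bit 3k+2 bit k of z. */
static uint64_t morton3d_encode(uint32_t x, uint32_t y, uint32_t z)
{
#if defined(__BMI2__)
    /* pdep scatters the source bits onto the set bits of the mask. */
    return _pdep_u64(x, 0x1249249249249249ull)
         | _pdep_u64(y, 0x2492492492492492ull)
         | _pdep_u64(z, 0x4924924924924924ull);
#else
    uint64_t key = 0;
    for (unsigned b = 0; b < 21; ++b) {
        key |= (uint64_t)((x >> b) & 1u) << (3u * b)
            |  (uint64_t)((y >> b) & 1u) << (3u * b + 1u)
            |  (uint64_t)((z >> b) & 1u) << (3u * b + 2u);
    }
    return key;
#endif
}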

Phase 2 — Cosmology and Initial Conditions

  • a(t) integrator and D₊(a) table.
  • Gaussian random field generator with a user-supplied P(k).
  • Zel'dovich displacement of a regular grid.
  • Verification: the measured P(k) of the generated particles matches the input spectrum to within 2% in every bin.

Phase 3 — TreePM Gravity + Leapfrog

  • Barnes–Hut octree with multipole expansion up to quadrupole.
  • PM solver on 512³ (dev) / 1024³ (production) mesh.
  • KDK Leapfrog integrator with cosmological drift factors.
  • Verification: Zel'dovich test (linear growth of a plane wave) matches the analytic D₊(a) to < 1% at z = 0.

Phase 4 — SPH Baryons

  • Neighbour search piggybacking on the tree.
  • Density loop, pressure force, artificial viscosity, adiabatic energy.
  • Radiative cooling sub-step.
  • Verification: Sod shock tube and Evrard collapse test reproduce published reference solutions.

Phase 5 — Rendering

  • Splat compute shader and 3D sparse texture.
  • Physical ray-march compute shader with black-body + bremsstrahlung.
  • ACES tonemap and bloom pass.
  • Verification: render a Gaussian blob with known analytic column density and confirm radiometric correctness to < 5%.

Phase 6 — Production Scaling

  • Peano–Hilbert domain decomposition across OpenMP work-stealing queues.
  • Optional MPI transport layer.
  • Optional CUDA/HIP short-range tree walk.
  • Continuous regression on the Santa Barbara Cluster Comparison setup.

8. Provenance & References

8.1 Companion Projects (directly reused)

Macrocosm builds on two existing sibling projects in this workspace. Their design decisions and, in specific cases, their code are adapted here:

  • ../atoms/Hydrogen Quantum Orbital Visualizer.

    • Reused concept: inverse-transform sampling via a precomputed CDF tabulated with lower_bound, applied in atoms/src/atom_realtime.cpp (sampleR, sampleTheta). In Macrocosm this pattern is redeployed in the initial-conditions generator to sample displacement amplitudes from an arbitrary P(k) without ever storing the full field in memory.
    • Reused concept: the inferno / fire colormap (heatmap_fire in atom_realtime.cpp) — adapted to temperature-based emission in the render pipeline as a fallback for the black-body path when physical color mapping is disabled.
    • Reused concept: the SSBO-backed fullscreen-quad render pattern used in atoms/src/atom_raytracer.cpp — mirrored by our compositing stage.
  • ../black_hole/Schwarzschild Raytracer.

    • Reused concept: the GLSL compute-shader + UBO + image-store pipeline of black_hole/geodesic.comp and black_hole.cpp is the direct template for Macrocosm's render compute shaders. Binding points, dispatch strategy (16×16 local size, ceil-divided global size), and adaptive resolution while camera is moving are all adopted.
    • Reused concept: the classical 4th-order Runge–Kutta integrator used to step rays through curved spacetime is structurally identical to the integrator we use for a(t) and for the optional volumetric-ray refraction term.
    • Reused concept: the CPU-side Newtonian N-body loop inside black_hole.cpp (pairwise G·m₁·m₂/r²) is our reference naïve baseline against which the TreePM force error is measured in Phase 3 verification.

8.2 Scientific References

Citations are listed here in the exact form they will appear in code comments at the site where each technique is applied.

  • Springel, V. (2005). The cosmological simulation code GADGET-2. MNRAS 364, 1105. — applied in: src/gravity/treepm.c for the short/long-range force split and in src/core/integrator.c for the cosmological Leapfrog drift/kick factors.
  • Barnes, J. & Hut, P. (1986). A hierarchical O(N log N) force-calculation algorithm. Nature 324, 446. — applied in: src/gravity/octree.c (tree construction and multipole acceptance criterion).
  • Hockney, R. W. & Eastwood, J. W. (1988). Computer Simulation Using Particles. — applied in: src/gravity/pm_solver.c (CIC mass assignment and CIC deconvolution in Fourier space).
  • Monaghan, J. J. (1992). Smoothed Particle Hydrodynamics. ARAA 30, 543. — applied in: src/hydro/sph.c (M₄ kernel, pressure force).
  • Price, D. J. (2012). Smoothed Particle Hydrodynamics and Magnetohydrodynamics. J. Comp. Phys. 231, 759. — applied in: src/hydro/sph.c (artificial viscosity switches and energy equation).
  • Quinn, T., Katz, N., Stadel, J., Lake, G. (1997). Time stepping N-body simulations. arXiv:astro-ph/9710043. — applied in: src/core/integrator.c (cosmological drift factor ∫ dt/a²).
  • Zel'dovich, Ya. B. (1970). Gravitational instability: An approximate theory for large wavelength perturbations. A&A 5, 84. — applied in: src/core/cosmology.c (initial displacement field).
  • Sutherland, R. S. & Dopita, M. A. (1993). Cooling functions for low-density astrophysical plasmas. ApJS 88, 253. — applied in: src/hydro/sph.c sub-step (Λ(T, Z) tabulated cooling).
  • Planck Collaboration (2018). Cosmological parameters. A&A 641, A6. — applied in: src/core/cosmology.c (fiducial Ωₘ, Ω_Λ, h).
  • Frigo, M. & Johnson, S. G. (2005). The design and implementation of FFTW3. Proc. IEEE 93, 216. — applied in: src/gravity/pm_solver.c (FFT backend).
  • Warren, M. S. & Salmon, J. K. (1993). A parallel hashed oct-tree N-body algorithm. Proc. Supercomputing '93. — applied in: src/parallel/domain.c (Peano–Hilbert domain decomposition).

8.3 Engineering References

  • Golub, G. H. & Van Loan, C. F. Matrix Computations, 4th ed. — conditioning and numerical stability of the discrete Poisson operator in src/gravity/pm_solver.c.
  • Trefethen, L. N. & Bau, D. Numerical Linear Algebra. — error analysis of the batched SIMD reductions in src/core/integrator.c.
  • Drepper, U. (2007). What Every Programmer Should Know About Memory. — cache blocking, alignment and NUMA strategies in src/parallel/threads.c and include/macrocosm/particles.h.
  • Kajiya, J. T. (1986). The rendering equation. SIGGRAPH '86. — basis of the emission–absorption integral in shaders/raymarch.comp.
  • Wrenninge, M. (2012). Production Volume Rendering. — delta-tracking and transfer-function design used in shaders/raymarch.comp.

This document is the architectural foundation of the project. All subsequent implementation is required to be justifiable in terms of a section above. Any deviation must be reflected here in the same commit.
