
ClimaCore refactoring plan for discussion #2468

@tapios

Description


ClimaCore Refactoring Plan

Purpose

ClimaCore.jl has ~35k lines of source and ~33k lines of tests. The three largest modules (Operators (7.5k), MatrixFields (5.2k), and DataLayouts (4.2k)) account for 48% of the source code and contain substantial structural duplication. This plan targets a ~20% source reduction, making the code easier to extend, maintain, and optimize for performance.

The refactoring is organized into 6 phases. Each phase is independently shippable and has a clear "done" criterion. The phases are designed so that later ones benefit from earlier ones but do not require them.

This is meant as a starting point for discussion, not yet as a task list. It's likely I'm missing things, so please edit this as needed.


Phase 1: DataLayouts — Parameterized Layouts

Target: src/DataLayouts/*, ext/cuda/data_layouts_* (5,055 lines → ~2,000 lines)

Problem

14 DataLayout types are defined, but they are all permutations of 5 universal index dimensions (i, j, f, v, h). The FH/HF variants (e.g., IJFH vs IJHF, VIFH vs VIHF) are identical except for whether the field dimension f precedes or follows the element dimension h in memory. This creates systematic duplication:

  • 9 slab() implementations differing only in which dimension is dropped
  • 11 column() implementations differing only in which dimensions are extracted
  • 5 level() implementations differing only in slice position
  • fill.jl, copyto.jl, mapreduce.jl, broadcast.jl all re-implement the same loops per layout
  • to_data_specific() and to_data_specific_field() encode 22 manual index permutations
  • field_dim(), h_dim(), type_params(), union_all() each have 12-15 one-liner definitions

The existing code already shows awareness of this: to_data_specific encodes permutations as tuple selections, field_dim/h_dim return integer positions, and singleton types provide a dispatch key. The infrastructure for parameterization is half-built, and it should be straightforward to generalize into a simpler, more concise interface.
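
As a rough illustration of where this generalization leads, a single permutation tuple carried in the type can replace the 22 hand-written index permutations. This is a hedged sketch, not the current API; to_universal and the example tuples are illustrative:

```julia
# Sketch: apply a layout's permutation tuple to an index tuple;
# ntuple with Val(N) keeps this fully unrolled at compile time.
to_universal(perm::NTuple{N, Int}, I::NTuple{N, Int}) where {N} =
    ntuple(d -> I[perm[d]], Val(N))

# VIJFH stores (v, i, j, f, h) directly; VIJHF swaps the last two axes:
to_universal((1, 2, 3, 5, 4), (10, 2, 3, 7, 1))  # (10, 2, 3, 1, 7)
```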

Possible Approach

Step 1.1 — Unify all layouts into three data structures.

By retaining singleton dimensions (corresponding to axes of length one), all the various layout combinations can be compacted into just three alternatives:

struct VIJFH{S, A}
    array::A
end
struct IH1JH2{S, A}
    array::A
end
struct IV1JH2{S, A}
    array::A
end

The current design also encodes array size information in the type parameters, but these sizes belong in the array field rather than in the DataLayout wrapper. For example, a VIJFH layout can be expressed as follows:

using StaticArrays: SizedArray, Dynamic
# given: element type T; static size parameters Nv, Ni, Nj, Nf; dynamic parameter Nh
array = zeros(T, Nv, Ni, Nj, Nf, Nh)
sized_array = SizedArray{Tuple{Nv, Ni, Nj, Nf, Dynamic()}}(array)
data = VIJFH(sized_array)

Step 1.2 — Define API functions as loops over the three layouts.

All indexing will be standardized to either data[v, i, j, f, h], data[i, h1, j, h2], or data[i, v1, j, h2]. Other API functions like slab, column, and level can then be implemented as thin wrappers around these indexing schemes. Replace the duplicate implementations of fill!, copyto!, etc., with generic loops over the CartesianIndices corresponding to each layout. Use UnrolledUtilities to ensure that symbols in the layout tuple are always constant-propagated during compilation, producing the same specialized code we have today.
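
As a sketch of what one generic loop looks like in place of the per-layout fill! methods (the wrapper below is a simplified stand-in for the real VIJFH, and generic_fill! is an illustrative name):

```julia
# Simplified stand-in for the proposed layout wrapper (S and A as in Step 1.1):
struct VIJFH{S, A}
    array::A
end
VIJFH{S}(array::A) where {S, A} = VIJFH{S, A}(array)

# One generic loop over CartesianIndices replaces every per-layout fill!:
function generic_fill!(data::VIJFH, val)
    for I in CartesianIndices(data.array)
        @inbounds data.array[I] = val
    end
    return data
end

data = VIJFH{Float64}(zeros(2, 3, 3, 1, 4))
generic_fill!(data, 1.0)
```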

Step 1.3 — Collapse broadcast implementations.

The 11 DataStyle types (IFHStyle, IHFStyle, IJFHStyle, ...) can be reduced to a single DataLayoutStyle, with distinct styles combined by merging their layout tuples:

struct DataLayoutStyle{L, A} <: Base.Broadcast.BroadcastStyle end
Base.Broadcast.BroadcastStyle(
    ::DataLayoutStyle{L1, A1},
    ::DataLayoutStyle{L2, A2},
) where {L1, L2, A1, A2} =
    DataLayoutStyle{unrolled_unique((L1..., L2...)), promote_array_type(A1, A2)}()

Also, having separate code paths and data structures for CPUs and GPUs adds a lot of unnecessary complexity. Use ClimaComms.@threaded to define a unified implementation, confining all device-specific abstractions to extensions of ClimaComms.

Testing Strategy

  • Add a parametric test that verifies slab, column, level, fill!, copyto!, mapreduce for all relevant layouts in a single test file.
  • Add a round-trip property test: for each layout type, fill with known data via universal indexing, then read back via storage indexing and verify equivalence.
  • Add a dead-code elimination test: for several different struct types, evaluate broadcast expressions that only use a subset of the struct fields, and check that the compiled LLVM code has the minimum number of reads and allocations.
  • Either run all unit tests on both CPUs and GPUs, or confirm that CPU-only tests are also GPU-compatible by checking for type instabilities with JET.@test_opt.
    • Running on both devices is preferable to JET tests, since it is not uncommon for type instabilities to exist on only one of two devices.
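
A minimal version of the round-trip property test could look like this (the permutation dictionary is an illustrative stand-in for real layout metadata from DataLayouts):

```julia
using Test

# Write via universal order, read back via storage order, verify equivalence:
perms = Dict(:VIJFH => (1, 2, 3, 4, 5), :VIJHF => (1, 2, 3, 5, 4))
universal = rand(2, 3, 3, 1, 4)
@testset "round-trip: $name" for (name, perm) in perms
    storage = permutedims(universal, perm)       # "write" in storage order
    back = permutedims(storage, invperm(perm))   # "read" back universally
    @test back == universal
end
```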

Done Criterion

API functions pass all correctness tests, demonstrating optimal inlining and type stability. Line count reduced by ≥ 50%.


Phase 2: Finite Difference Operators — Stencil Abstraction

Target: src/Operators/finitedifference.jl, ext/cuda/operators_finite_difference.jl, ext/cuda/operators_fd_*, ext/cuda/column_matrix_helpers.jl (6,390 lines → ~3,000 lines)

Problem

53 operator struct types are defined, with 80+ boundary condition method implementations. The dominant pattern is:

  1. Interior stencils are nearly identical across operator pairs (e.g., GradientF2C and GradientC2F share the same interior computation; InterpolateF2C and InterpolateC2F are rdiv(a⁺ ⊞ a⁻, 2) in both cases).
  2. Each operator re-defines stencil_interior, stencil_left_boundary, stencil_right_boundary, boundary_width, and stencil_interior_width.
  3. Boundary condition handling is duplicated per-operator × per-BC type (SetValue, SetGradient, Extrapolate), with left/right mirrors that are often identical modulo sign.
  4. The F2C/C2F distinction affects only return_space and which boundary points exist — the stencil computation itself is the same.

Approach

Step 2.1 — Drop redundant information from operator types.

The return space of an operator can be inferred from the types of its inputs, so this information does not need to be duplicated in the type signature. That is, instead of separate GradientF2C and GradientC2F operators, we can use a single Gradient:

@kwdef struct Gradient{BC <: NamedTuple} <: FiniteDifferenceOperator
    bcs::BC = NamedTuple()
end

Although the current design also generalizes the boundary conditions into a NamedTuple with generic names, this seems like an unnecessary degree of abstraction. If needed, custom constructors can be used to handle generic boundary names, but we can just use one set of names for the internals of our implementation. We can discuss as a group which combination will be most clear: top/bottom, left/right, bc1/bc2, or something else.
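
If we do want to keep accepting generic user-facing names, a small normalizing constructor is enough. This sketch assumes internal :bottom/:top names and a hypothetical rename table; it is not the existing constructor:

```julia
abstract type FiniteDifferenceOperator end
struct Gradient{BC <: NamedTuple} <: FiniteDifferenceOperator
    bcs::BC
end

# Accept user-facing names but normalize to one internal set (:bottom/:top):
function Gradient(; kwargs...)
    rename = (left = :bottom, right = :top)
    normalized = (get(rename, k, k) => v for (k, v) in kwargs)
    return Gradient(NamedTuple(normalized))
end

Gradient(left = 0.0).bcs  # (bottom = 0.0,)
```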

Step 2.2 — Eliminate code divergence at boundaries.

Having separate code paths for boundaries and interiors is only beneficial when all iteration is statically unrolled. When this is not the case (e.g., in GPU kernels where iteration is split across multiple threads running in parallel), this design pattern introduces runtime warp divergence. To avoid this performance penalty, replace the stencil_left_boundary/stencil_interior/stencil_right_boundary functions with a single stencil function, compacting all warp divergence into the bottommost call to getindex:

@inline stencil((; bcs)::Interpolate, space, idx, hidx, arg) =
    (getidx(arg, space, idx + half, hidx, bcs) +
     getidx(arg, space, idx - half, hidx, bcs)) / 2
# getidx applies boundary conditions when out-of-bounds

This was partially completed in #2466, but only for the CUDA code relating to linear operators. It should be straightforward to generalize for CPU code.
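
To make the single-code-path idea concrete, here is a self-contained toy version in which all divergence is confined to getidx (ColumnField and centered_diff are illustrative stand-ins, not the real API):

```julia
struct ColumnField{V <: AbstractVector, BC <: NamedTuple}
    values::V
    bcs::BC  # e.g. (bottom = ..., top = ...)
end

# All boundary handling lives in the bottommost index lookup:
@inline function getidx(f::ColumnField, idx)
    idx < 1 && return f.bcs.bottom              # out of bounds below
    idx > length(f.values) && return f.bcs.top  # out of bounds above
    return @inbounds f.values[idx]
end

# The same stencil expression is then valid at every level, boundaries included:
centered_diff(f, idx) = (getidx(f, idx + 1) - getidx(f, idx - 1)) / 2

f = ColumnField(collect(1.0:5.0), (bottom = 0.0, top = 6.0))
centered_diff(f, 1)  # (2.0 - 0.0) / 2 = 1.0
```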

Step 2.3 — Define all operators in terms of a single primitive.

All finite difference operators can be defined in terms of a single "convolution" primitive, with a special case for linear operators, for which the convolution is an inner product. This was partially started in #2466, but a large amount of complexity was needed to handle AxisTensors, whose projections are not currently expressible using simple products. Merging #2457 should allow us to eliminate this complexity.
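
In scalar form, the primitive is just a weighted sum over stencil points; the names below are illustrative, and the real version must additionally handle AxisTensor projections:

```julia
# The "convolution" primitive: a weighted sum over neighboring values.
# For linear operators this is exactly an inner product with fixed weights.
convolve(weights, values) = sum(w * v for (w, v) in zip(weights, values))

# Interpolation and a centered gradient as instances of the same primitive:
interpolate(a⁻, a⁺) = convolve((1 / 2, 1 / 2), (a⁻, a⁺))
gradient(a⁻, a⁺, Δz) = convolve((-1 / Δz, 1 / Δz), (a⁻, a⁺))

interpolate(1.0, 3.0)    # 2.0
gradient(1.0, 3.0, 0.5)  # 4.0
```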

Step 2.4 — Reduce advection operator variants.

The 8 advection operators (Upwind, LinVanLeer, TVDLimited, 3rdOrder variants × C2F/F2C) share the same control flow — they differ in the interpolation rule applied within the stencil. Factor the interpolation rule into a strategy parameter rather than duplicating the entire operator.
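
A minimal sketch of the strategy parameter, with two illustrative reconstruction rules (the type and function names here are hypothetical, not the current operator API):

```julia
abstract type InterpolationRule end
struct FirstOrderUpwind <: InterpolationRule end
struct Centered <: InterpolationRule end

# Only the face reconstruction differs between rules:
face_value(::FirstOrderUpwind, w, a⁻, a⁺) = w > 0 ? a⁻ : a⁺
face_value(::Centered, w, a⁻, a⁺) = (a⁻ + a⁺) / 2

# The advection control flow is written once, for every rule:
advective_flux(rule::InterpolationRule, w, a⁻, a⁺) = w * face_value(rule, w, a⁻, a⁺)

advective_flux(FirstOrderUpwind(), 1.0, 2.0, 4.0)  # 2.0
advective_flux(Centered(), 1.0, 2.0, 4.0)          # 3.0
```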

Testing Strategy

  • Existing convergence tests (convergence_column.jl, convergence_advection_diffusion1d.jl) are the primary correctness gate.
  • Add boundary symmetry property tests: for symmetric BCs (SetValue), verify that applying the operator to a symmetric input produces symmetric output.
  • Extend dead-code elimination tests from DataLayouts for chains of operators.
  • Consolidate the operator performance tests/benchmarks.
  • Run all tests on CPUs and GPUs.

Done Criterion

API functions pass all correctness and performance tests, demonstrating optimal inlining and type stability. Operator count reduced from 53 to ~25. Line count reduced by ≥ 50%.


Phase 3: Spectral Element Operators — Dimension-Generic Kernels

Target: src/Operators/spectralelement.jl (1,873 lines → ~1,200 lines)

Problem

Each spectral element operator (Divergence, Gradient, WeakDivergence, WeakGradient, Curl, WeakCurl) has separate 1D {(1,)} and 2D {(1,2)} implementations. The 2D version repeats the 1D logic for each axis. Example: Divergence{(1,)} is a single loop over i ∈ 1:Nq computing D[ii,i] ⊠ Jv¹, while Divergence{(1,2)} repeats this for both i and j axes with Jv¹ and Jv².

Approach

Step 3.1 — Write dimension-generic apply_operator using ntuple.

The core pattern for all SE operators can be written generically over N spatial dimensions using ntuple and CartesianIndices:

function apply_operator(op::Divergence{I}, space, slabidx, arg) where {I}
    N = length(I)
    # ... generic loop over N dimensions
    for d in 1:N
        Jvd = local_geometry.J ⊠ contravariant_component(v, d, local_geometry)
        # accumulate D along axis d
    end
end

Step 3.2 — Unify Strong/Weak operator variants.

Strong form operators (Divergence, Gradient) and weak form operators (WeakDivergence, WeakGradient) differ only in the use of J vs WJ and which metric terms appear. Factor this as a FormType parameter (:strong / :weak) rather than duplicating the entire operator.
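
A sketch of the FormType selection, using a Val parameter so the choice is static; the LocalGeometry fields here are illustrative stand-ins for the real geometry struct:

```julia
struct LocalGeometry{T}
    J::T   # Jacobian determinant
    WJ::T  # quadrature-weighted Jacobian
end

# The only strong/weak difference is which Jacobian factor is used:
jacobian(::Val{:strong}, g::LocalGeometry) = g.J
jacobian(::Val{:weak}, g::LocalGeometry) = g.WJ

# so a single kernel body can be shared between both forms:
scaled(form, g, x) = jacobian(form, g) * x

g = LocalGeometry(2.0, 0.5)
scaled(Val(:strong), g, 3.0)  # 6.0
scaled(Val(:weak), g, 3.0)    # 1.5
```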

GPU Performance Implications

  • The generic loop over N dimensions is easier to map to shared-memory tiling strategies (iterate tiles of Nq^N points).
  • Reducing the number of distinct operator kernels means fewer compilations and less GPU binary size.
  • The unified strong/weak form allows a single fused kernel to compute both when needed (e.g., for diffusion = weak divergence of gradient).

Testing Strategy

  • Existing spectral element convergence tests (sphere gradient, divergence, curl, diffusion) are the primary correctness gate.
  • Add @test_opt tests to verify zero-overhead for N=1 and N=2.
  • Add equivalence tests between old and new implementations on a reference mesh before removing old code.

Done Criterion

All spectral element tests pass. Each operator has one apply_operator method instead of two. Line count reduced by ~35%.


Phase 4: Type Alias Cleanup and Structural Simplification

Target: Spaces (1,394 lines), Grids (1,296 lines), Fields (2,453 lines)

Problem

45+ const type aliases create a Cartesian product of names across three modules:

  • Grids: ExtrudedSpectralElementGrid2D, ExtrudedCubedSphereSpectralElementGrid3D, DeviceSpectralElementGrid2D, etc.
  • Spaces: ExtrudedFiniteDifferenceSpace2D, CenterExtrudedFiniteDifferenceSpace3D, ExtrudedRectilinearSpectralElementSpace3D, etc.
  • Fields: SpectralElementField2D, CenterExtrudedFiniteDifferenceField, ExtrudedCubedSphereSpectralElementField3D, etc.

Every new grid variant requires adding aliases in all three modules. The aliases don't carry behavior — they're just type constraints used for dispatch. In practice, runtime isa checks are used anyway (e.g., space isa ExtrudedFiniteDifferenceSpace3D ? ... : ...).

Approach

Step 4.1 — Replace type aliases with trait queries.

Instead of:

const ExtrudedFiniteDifferenceSpace3D = ExtrudedFiniteDifferenceSpace{<:ExtrudedFiniteDifferenceGrid{<:SpectralElementGrid2D}}

Define:

horizontal_ndims(::ExtrudedFiniteDifferenceSpace{<:ExtrudedFiniteDifferenceGrid{<:SpectralElementGrid2D}}) = 2  # 3D = 2D horiz + 1D vert
is_cubed_sphere(space) = grid(space) isa CubedSphereSpectralElementGrid2D

Then dispatch on horizontal_ndims(space) rather than matching a named type alias. This is both more extensible and more explicit about what property is being dispatched on.
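
For example (with stand-in space types; returning Val keeps the trait usable for static dispatch):

```julia
struct PlaneSpace end   # stand-in: 1D horizontal
struct SphereSpace end  # stand-in: 2D horizontal
horizontal_ndims(::PlaneSpace) = Val(1)
horizontal_ndims(::SphereSpace) = Val(2)

# Dispatch on the trait value rather than on a named type alias:
describe(space) = describe(horizontal_ndims(space), space)
describe(::Val{1}, space) = "1D horizontal"
describe(::Val{2}, space) = "2D horizontal"

describe(SphereSpace())  # "2D horizontal"
```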

Step 4.2 — Eliminate Field type aliases entirely.

The 23 Field aliases add zero functionality — Field{V, S} already carries the space type S, so any dispatch that works on a Field alias works equally well with Field{<:Any, <:SomeSpace}. Remove all Field aliases and use Field{<:Any, <:SpaceType} where needed (or the trait queries from Step 4.1).

Step 4.3 — Consolidate Device grid variants.

DeviceSpectralElementGrid2D, DeviceFiniteDifferenceGrid, DeviceExtrudedFiniteDifferenceGrid are separate structs that mirror main grid types for GPU use. Replace with Adapt.adapt_structure on the main grid types, using Adapt.jl's existing pattern. This eliminates the parallel type hierarchy.
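
Adapt.jl's @adapt_structure macro generates the required adapt_structure method directly from the struct definition; the grid fields below are illustrative stand-ins:

```julia
using Adapt

# Stand-in for a main grid type; fields are illustrative:
struct SpectralElementGrid2D{T, G}
    topology::T
    local_geometry::G
end

# One line replaces the hand-written Device* mirror type; adapt recurses into
# every field, so newly added fields can never be silently left un-adapted:
Adapt.@adapt_structure SpectralElementGrid2D

grid = SpectralElementGrid2D(:topo, rand(3))
adapt(Array, grid)  # on GPU, adapt(CuArray, grid) would convert the arrays
```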

Step 4.4 — Simplify space constructors.

30+ constructor methods for space types can be reduced by:

  • Removing deprecated constructors (already tested in deprecations.jl)
  • Using a single entry point per space type with keyword arguments
  • Delegating grid construction to the Grids module exclusively

GPU Performance Implications

  • Eliminating Device* structs in favor of Adapt.adapt_structure means GPU adaptations are automatically correct when fields are added to grid types — currently, Device* types can silently become stale.
  • Trait-based dispatch is more amenable to compiler specialization than Union-type matching.

Testing Strategy

  • Existing tests cover correctness.
  • Add tests that verify trait queries return correct values for all space/grid combinations.
  • Ensure Adapt.adapt_structure round-trips correctly for GPU grids.

Done Criterion

All existing tests pass. Zero type aliases remain in Fields. Grid/Space alias count reduced from 45+ to < 10 (keeping only the most common user-facing names as a thin compatibility layer). Device* types eliminated.


Phase 5: Test Suite Consolidation

Target: test/ (32,590 lines → ~22,000 lines)

Problems

  1. Duplication: convergence_rate() is defined locally in 68 places. Fill/copyto helpers duplicated across test files. Matrix field test harnesses duplicated 21 times.
  2. Missing coverage: No property-based tests, no cross-module integration tests, sparse GPU correctness tests (only verify kernel launch, not output).
  3. Ad-hoc organization: Some files mix unit tests, convergence tests, and benchmarks.

Approach

Step 5.1 — Extract shared test utilities.

Move convergence_rate, common space constructors, and fill/copyto helpers into test/TestUtilities/. The module already exists but is underused.
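
The shared utility could be as small as this; a sketch consistent with how the local copies are typically written, not necessarily the exact existing signature:

```julia
# Estimated order of accuracy from errors err at grid spacings Δh:
convergence_rate(err, Δh) =
    [log(err[i + 1] / err[i]) / log(Δh[i + 1] / Δh[i]) for i in 1:(length(err) - 1)]

# A second-order scheme produces rates ≈ 2:
Δh = [0.1, 0.05, 0.025]
err = Δh .^ 2
convergence_rate(err, Δh)  # ≈ [2.0, 2.0]
```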

Step 5.2 — Parametric layout tests (supports Phase 1).

Replace the per-layout test files (data0d.jl, data1d.jl, data2d.jl, data1dx.jl, data2dx.jl) with a single parametric test that loops over all layout types:

@testset "DataLayout: $DL" for DL in [DataF, IF, IFH, IHF, IJF, IJFH, IJHF, VF, VIFH, VIHF, VIJFH, VIJHF]
    # ... uniform tests for slab, column, level, fill!, copyto!, mapreduce
end

Step 5.3 — Add GPU correctness tests.

Current GPU tests only verify that kernels launch without error. Add tests that:

  • Run the same computation on CPU and GPU, compare results within tolerance.
  • Cover all operator families (spectral element, finite difference, hybrid).
  • Test edge cases: single-element meshes, boundary-only columns.

Step 5.4 — Add cross-module integration tests.

Test the full pipeline: Domain → Mesh → Topology → Grid → Space → Field → Operator → result. Currently each module is tested in isolation. Integration tests catch interface mismatches.

Step 5.5 — Fix or remove disabled tests.

  • Terrain warp: diagnose the hang, fix or mark as a known issue with a tracking issue.
  • MatrixFields solver/operator tests: either speed them up or move them to Buildkite CI.

Tests to Consider Deleting

  • Per-layout data tests that will be superseded by parametric tests (Step 5.2).
  • Redundant convergence rate implementations (replaced by shared utility).
  • Benchmark files that duplicate unit test setup with minimal additional value (keep benchmarks in benchmarks/ or perf/ only).

Tests to Add

Category        | What                                                   | Why
Property-based  | Round-trip: data == read(write(data)) for all layouts  | Catches permutation bugs
Property-based  | Operator adjoint identity: ⟨div(u), f⟩ = -⟨u, grad(f)⟩ | Mathematical correctness
GPU correctness | CPU vs GPU comparison for all operator families        | Currently untested
Integration     | Full pipeline: domain → field → operator → result      | Interface mismatches
Regression      | @test_opt for all generic methods from Phases 1-3      | Zero-overhead guarantee

Done Criterion

Shared utilities eliminate ≥ 50 duplicate function definitions. GPU tests verify numerical correctness. All disabled tests either fixed or tracked.


Phase 6: Documentation Addition

  • Add docstrings that are clear, concise, and consistent with the code
  • Add a docs/ suite that outlines basic formulations and functionality, summarizing key aspects of the dycore paper, at the level of detail and in the style of, e.g., ClimaTimeSteppers.jl

Done Criterion

Docs suite approved by dycore paper co-authors.


Phasing and Dependencies

Phase 1 (DataLayouts)  ─────────────────────────────────────────►
Phase 2 (FD Operators) ──────────────────────────────────────────►
Phase 3 (SE Operators)          (can start after Phase 1 done) ──►
Phase 4 (Type Aliases)          (independent, can start anytime) ►
Phase 5 (Tests)         (parallel with all phases) ──────────────►
Phase 6 (Docs)          (parallel with all phases) ──────────────►

Risk Mitigation

  1. Performance regression: Every phase requires @test_opt and allocation tests before merging. Run benchmarks (perf/, benchmarks/) before and after each phase.
  2. Downstream breakage: ClimaCore is used by ClimaAtmos and other CliMA packages. Each phase should be merged as a minor version bump with a CHANGELOG entry. Type alias removal (Phase 4) is the highest-risk change for downstream packages, so provide a deprecation period.
  3. GPU correctness: Phases 1-3 change code that runs on GPU. The current GPU tests only verify kernel launch. Adding GPU correctness tests should be done before or concurrently with Phases 1-3, not after.
  4. Compiler regressions: Generic code using ntuple / @generated can hit Julia compiler edge cases. Test on Julia 1.10-1.12.

Summary

Phase | Target                             | Current Lines | Estimated After | Reduction
1     | DataLayouts                        | 4,231         | ~1,800          | ~57%
2     | FD Operators                       | 4,282         | ~2,000          | ~53%
3     | SE Operators                       | 1,873         | ~1,200          | ~36%
4     | Type Aliases (Spaces/Grids/Fields) | 5,143         | ~3,800          | ~26%
Total | src                                | 35,423        | ~28,700         | ~19%
