ClimaCore Refactoring Plan
Purpose
ClimaCore.jl has ~35k lines of source and ~33k lines of tests. The three largest modules (Operators (7.5k), MatrixFields (5.2k), and DataLayouts (4.2k)) account for 48% of the source code and contain substantial structural duplication. This plan targets a ~20% source reduction, making the code easier to extend, maintain, and optimize for performance.
The refactoring is organized into 6 phases. Each phase is independently shippable and has a clear "done" criterion. The phases are designed so that later ones benefit from earlier ones but do not require them.
This is meant as a starting point for discussion, not yet as a task list. It's likely I'm missing things, so please edit this as needed.
Phase 1: DataLayouts — Parameterized Layouts
Target: src/DataLayouts/*, ext/cuda/data_layouts_* (5,055 lines → ~2,000 lines)
Problem
14 DataLayout types are defined, but they are all permutations of 5 universal index dimensions (i, j, f, v, h). The FH/HF variants (e.g., IJFH vs IJHF, VIFH vs VIHF) are identical except for whether the field dimension f precedes or follows the element dimension h in memory. This creates systematic duplication:
- 9 slab() implementations differing only in which dimension is dropped
- 11 column() implementations differing only in which dimensions are extracted
- 5 level() implementations differing only in slice position
- fill.jl, copyto.jl, mapreduce.jl, broadcast.jl all re-implement the same loops per layout
- to_data_specific() and to_data_specific_field() encode 22 manual index permutations
- field_dim(), h_dim(), type_params(), union_all() each have 12-15 one-liner definitions
The existing code already shows awareness of this: to_data_specific encodes permutations as tuple selections, field_dim/h_dim return integer positions, and singleton types provide a dispatch key. The infrastructure for parameterization is half-built, and it should be straightforward to generalize into a simpler, more concise interface.
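To make the generalization concrete, here is a minimal sketch of a single permutation primitive that could replace the 22 hand-written index permutations. The names (to_storage_index, universal_dims) are illustrative, not the existing API:

```julia
# Hypothetical sketch: one permutation primitive in place of the 22
# hand-written to_data_specific methods. Universal order is (v, i, j, f, h).
const universal_dims = (:v, :i, :j, :f, :h)

# Reorder a universal index tuple (v, i, j, f, h) into a layout's storage order.
to_storage_index(layout_dims, I) =
    map(d -> I[findfirst(==(d), universal_dims)], layout_dims)

to_storage_index((:i, :j, :f, :h), (1, 2, 3, 4, 5))  # (2, 3, 4, 5)
to_storage_index((:v, :i, :f, :h), (1, 2, 3, 4, 5))  # (1, 2, 4, 5)
```

Because the layout tuple is a type-level constant, this map should constant-fold to the same code as the current hand-written methods.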
Possible Approach
Step 1.1 — Unify all layouts into three data structures.
By retaining singleton dimensions (corresponding to axes of length one), all the various layout combinations can be compacted into just three alternatives:
struct VIJFH{S, A}
array::A
end
struct IH1JH2{S, A}
array::A
end
struct IV1JH2{S, A}
array::A
end
The current design also encodes array size information in the type parameters, but this information belongs in the array field rather than the DataLayout wrapper. For example, a VIJFH layout can be expressed as follows:
using StaticArrays: SizedArray, Dynamic
# given: static size parameters Nv, Ni, Nj, Nf; dynamic parameter Nh
array = zeros(T, Nv, Ni, Nj, Nf, Nh)
sized_array = SizedArray{Tuple{Nv, Ni, Nj, Nf, Dynamic()}}(array)
data = VIJFH(sized_array)
Step 1.2 — Define API functions as loops over the three layouts.
All indexing will be standardized to either data[v, i, j, f, h], data[i, h1, j, h2], or data[i, v1, j, h2]. Other API functions like slab, column, and level can be implemented as wrappers for these indices. Replace duplicate implementations of fill!, copyto!, etc., with generic loops over the CartesianIndices corresponding to each layout. Use UnrolledUtilities to ensure that symbols in the layout tuple are always constant-propagated during compilation, producing the same specialized code we have today.
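The following is a runnable sketch of this pattern with a mock wrapper type (MockData, universal_size, and generic_copyto! are placeholders, not the ClimaCore API): once indexing is universal, one CartesianIndices loop serves every layout.

```julia
# Minimal runnable sketch: a single generic loop replaces per-layout copyto!.
struct MockData{A <: AbstractArray}
    array::A
end
universal_size(d::MockData) = size(d.array)
Base.getindex(d::MockData, I::CartesianIndex) = d.array[I]
Base.setindex!(d::MockData, v, I::CartesianIndex) = (d.array[I] = v)

function generic_copyto!(dest::MockData, src::MockData)
    @inbounds for I in CartesianIndices(universal_size(dest))
        dest[I] = src[I]  # universal indexing; the layout decides storage order
    end
    return dest
end

src = MockData(rand(2, 3, 4))
dest = MockData(zeros(2, 3, 4))
generic_copyto!(dest, src)
dest.array == src.array  # true
```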
Step 1.3 — Collapse broadcast implementations.
The 11 DataStyle types (IFHStyle, IHFStyle, IJFHStyle, ...) can be reduced to a single DataLayoutStyle, with different types of styles combined according to their layout tuples:
struct DataLayoutStyle{L, A} <: Base.Broadcast.BroadcastStyle end
Base.Broadcast.BroadcastStyle(
    ::DataLayoutStyle{L1, A1},
    ::DataLayoutStyle{L2, A2},
) where {L1, L2, A1, A2} =
    DataLayoutStyle{unrolled_unique((L1..., L2...)), promote_array_type(A1, A2)}()
Also, having separate code paths and data structures for CPUs and GPUs adds a lot of unnecessary complexity. Use ClimaComms.@threaded to define a unified implementation, confining all device-specific abstractions to extensions of ClimaComms.
Testing Strategy
- Add a parametric test that verifies slab, column, level, fill!, copyto!, mapreduce for all relevant layouts in a single test file.
- Add a round-trip property test: for each layout type, fill with known data via universal indexing, then read back via storage indexing and verify equivalence.
- Add a dead-code elimination test: for several different struct types, evaluate broadcast expressions that only use a subset of the struct fields, and check that the compiled LLVM code has the minimum number of reads and allocations.
- Either run all unit tests on both CPUs and GPUs, or confirm that CPU-only tests are also GPU-compatible by checking for type instabilities with JET.@test_opt.
- Running on both devices is preferable to JET tests, since it is not uncommon for type instabilities to exist on only one of two devices.
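The round-trip property test can be sketched with a plain array standing in for a DataLayout — write via one index order, read back via the permuted order, and check value equivalence:

```julia
using Test

# Sketch: storage order vs. a permuted "layout" of the same data.
A = reshape(1:24, 2, 3, 4)     # storage order (i, j, k)
B = permutedims(A, (3, 1, 2))  # a different layout, storage order (k, i, j)

@testset "round-trip (sketch)" begin
    for i in 1:2, j in 1:3, k in 1:4
        @test A[i, j, k] == B[k, i, j]  # same value through either index order
    end
end
```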
Done Criterion
API functions pass all correctness tests, demonstrating optimal inlining and type stability. Line count reduced by ≥ 50%.
Phase 2: Finite Difference Operators — Stencil Abstraction
Target: src/Operators/finitedifference.jl, ext/cuda/operators_finite_difference.jl, ext/cuda/operators_fd_*, ext/cuda/column_matrix_helpers.jl (6,390 lines → ~3,000 lines)
Problem
53 operator struct types are defined, with 80+ boundary condition method implementations. The dominant pattern is:
- Interior stencils are nearly identical across operator pairs (e.g., GradientF2C and GradientC2F share the same interior computation; InterpolateF2C and InterpolateC2F are rdiv(a⁺ ⊞ a⁻, 2) in both cases).
- Each operator re-defines stencil_interior, stencil_left_boundary, stencil_right_boundary, boundary_width, and stencil_interior_width.
- Boundary condition handling is duplicated per-operator × per-BC type (SetValue, SetGradient, Extrapolate), with left/right mirrors that are often identical modulo sign.
- The F2C/C2F distinction affects only return_space and which boundary points exist — the stencil computation itself is the same.
Approach
Step 2.1 — Drop redundant information from operator types.
The return space of an operator can be inferred from the types of its inputs, so this information does not need to be duplicated in the type signature. That is, instead of separate GradientF2C and GradientC2F operators, we can use a single Gradient:
@kwdef struct Gradient{BC <: NamedTuple} <: FiniteDifferenceOperator
bcs::BC = NamedTuple()
end
Although the current design also generalizes the boundary conditions into a NamedTuple with generic names, this seems like an unnecessary degree of abstraction. If needed, custom constructors can be used to handle generic boundary names, but we can just use one set of names for the internals of our implementation. We can discuss as a group which combination will be most clear: top/bottom, left/right, bc1/bc2, or something else.
Step 2.2 — Eliminate code divergence at boundaries.
Having separate code paths for boundaries and interiors is only beneficial when all iteration is statically unrolled. When this is not the case (e.g., in GPU kernels where iteration is split across multiple threads running in parallel), this design pattern introduces runtime warp divergence. To avoid this performance penalty, replace the stencil_left_boundary/stencil_interior/stencil_right_boundary functions with a single stencil function, compacting all warp divergence into the bottommost call to getindex:
@inline stencil((; bcs)::Interpolate, space, idx, hidx, arg) =
    (getidx(arg, space, idx + half, hidx, bcs) +
     getidx(arg, space, idx - half, hidx, bcs)) / 2
# getidx applies boundary conditions when out-of-bounds
This was partially completed in #2466, but only for the CUDA code relating to linear operators. It should be straightforward to generalize for CPU code.
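A toy 1D version illustrates the idea of pushing all divergence into getidx (getidx_sketch and interp_sketch are illustrative stand-ins, not the real functions):

```julia
# Toy sketch: the only branch is the out-of-bounds check inside getidx_sketch,
# so the interpolation stencil itself is branch-free.
getidx_sketch(data, idx, bc) =
    1 <= idx <= length(data) ? data[idx] : bc  # boundary value when out of bounds

interp_sketch(data, idx, bc) =
    (getidx_sketch(data, idx, bc) + getidx_sketch(data, idx - 1, bc)) / 2

data = [1.0, 3.0, 5.0]
interp_sketch(data, 2, 0.0)  # interior point: (3 + 1) / 2 = 2.0
interp_sketch(data, 1, 0.0)  # boundary point: (1 + 0) / 2 = 0.5
```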
Step 2.3 — Define all operators in terms of a single primitive.
All finite difference operators can be defined using a single "convolution" primitive, with a special case for linear operators, for which the convolution is an inner product. This was partially started in #2466, but a large amount of complexity was needed to handle AxisTensors, whose projections are not currently expressible using simple products. Merging in #2457 should allow us to eliminate this complexity.
Step 2.4 — Reduce advection operator variants.
The 8 advection operators (Upwind, LinVanLeer, TVDLimited, 3rdOrder variants × C2F/F2C) share the same control flow — they differ in the interpolation rule applied within the stencil. Factor the interpolation rule into a strategy parameter rather than duplicating the entire operator.
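A strategy parameter could look like the following sketch (type and function names are hypothetical; the real operators act on fields, not scalars):

```julia
# Sketch: the interpolation rule becomes a strategy field, so one Advection
# operator covers all variants that currently duplicate the control flow.
abstract type InterpolationRule end
struct FirstOrderUpwind <: InterpolationRule end
struct Centered <: InterpolationRule end

struct Advection{R <: InterpolationRule}
    rule::R
end

# Shared control flow calls face_value; only the rule differs per variant.
face_value(::FirstOrderUpwind, w, a⁻, a⁺) = w > 0 ? a⁻ : a⁺
face_value(::Centered, w, a⁻, a⁺) = (a⁻ + a⁺) / 2

face_value(FirstOrderUpwind(), 1.0, 2.0, 4.0)  # 2.0 (takes upwind value)
face_value(Centered(), 1.0, 2.0, 4.0)          # 3.0 (arithmetic mean)
```

Adding a new limiter then means adding one face_value method, not a new operator type with its own stencil and boundary methods.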
Testing Strategy
- Existing convergence tests (convergence_column.jl, convergence_advection_diffusion1d.jl) are the primary correctness gate.
- Add boundary symmetry property tests: for symmetric BCs (SetValue), verify that applying the operator to a symmetric input produces symmetric output.
- Extend dead-code elimination tests from DataLayouts for chains of operators.
- Consolidate the operator performance tests/benchmarks.
- Run all tests on CPUs and GPUs.
Done Criterion
API functions pass all correctness and performance tests, demonstrating optimal inlining and type stability. Operator count reduced from 53 to ~25. Line count reduced by ≥ 50%.
Phase 3: Spectral Element Operators — Dimension-Generic Kernels
Target: src/Operators/spectralelement.jl (1,873 lines → ~1,200 lines)
Problem
Each spectral element operator (Divergence, Gradient, WeakDivergence, WeakGradient, Curl, WeakCurl) has separate 1D {(1,)} and 2D {(1,2)} implementations. The 2D version repeats the 1D logic for each axis. Example: Divergence{(1,)} is a single loop over i ∈ 1:Nq computing D[ii,i] ⊠ Jv¹, while Divergence{(1,2)} repeats this for both i and j axes with Jv¹ and Jv².
Approach
Step 3.1 — Write dimension-generic apply_operator using ntuple.
The core pattern for all SE operators can be written generically over N spatial dimensions using ntuple and CartesianIndices:
function apply_operator(op::Divergence{I}, space, slabidx, arg) where {I}
N = length(I)
# ... generic loop over N dimensions
for d in 1:N
Jvd = local_geometry.J ⊠ contravariant_component(v, d, local_geometry)
# accumulate D along axis d
end
end
Step 3.2 — Unify Strong/Weak operator variants.
Strong form operators (Divergence, Gradient) and weak form operators (WeakDivergence, WeakGradient) differ only in the use of J vs WJ and which metric terms appear. Factor this as a FormType parameter (:strong / :weak) rather than duplicating the entire operator.
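One possible shape for the FormType parameter (names are hypothetical; only the Jacobian-weight selection is shown, not the full kernel):

```julia
# Sketch: the strong/weak choice selects which Jacobian weight enters the
# shared kernel, instead of duplicating the operator.
struct Strong end
struct Weak end

jacobian_weight(::Strong, geom) = geom.J
jacobian_weight(::Weak, geom) = geom.WJ

geom = (J = 2.0, WJ = 0.5)  # stand-in for a LocalGeometry
jacobian_weight(Strong(), geom)  # 2.0
jacobian_weight(Weak(), geom)    # 0.5
```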
GPU Performance Implications
- The generic loop over N dimensions is easier to map to shared-memory tiling strategies (iterate tiles of Nq^N points).
- Reducing the number of distinct operator kernels means fewer compilations and less GPU binary size.
- The unified strong/weak form allows a single fused kernel to compute both when needed (e.g., for diffusion = weak divergence of gradient).
Testing Strategy
- Existing spectral element convergence tests (sphere gradient, divergence, curl, diffusion) are the primary correctness gate.
- Add @test_opt tests to verify zero overhead for N=1 and N=2.
- Add equivalence tests between old and new implementations on a reference mesh before removing old code.
Done Criterion
All spectral element tests pass. Each operator has one apply_operator method instead of two. Line count reduced by ~35%.
Phase 4: Type Alias Cleanup and Structural Simplification
Target: Spaces (1,394 lines), Grids (1,296 lines), Fields (2,453 lines)
Problem
45+ const type aliases create a Cartesian product of names across three modules:
- Grids: ExtrudedSpectralElementGrid2D, ExtrudedCubedSphereSpectralElementGrid3D, DeviceSpectralElementGrid2D, etc.
- Spaces: ExtrudedFiniteDifferenceSpace2D, CenterExtrudedFiniteDifferenceSpace3D, ExtrudedRectilinearSpectralElementSpace3D, etc.
- Fields: SpectralElementField2D, CenterExtrudedFiniteDifferenceField, ExtrudedCubedSphereSpectralElementField3D, etc.
Every new grid variant requires adding aliases in all three modules. The aliases don't carry behavior — they're just type constraints used for dispatch. In practice, runtime isa checks are used anyway (e.g., space isa ExtrudedFiniteDifferenceSpace3D ? ... : ...).
Approach
Step 4.1 — Replace type aliases with trait queries.
Instead of:
const ExtrudedFiniteDifferenceSpace3D = ExtrudedFiniteDifferenceSpace{<:ExtrudedFiniteDifferenceGrid{<:SpectralElementGrid2D}}
Define:
horizontal_ndims(::ExtrudedFiniteDifferenceSpace{<:ExtrudedFiniteDifferenceGrid{<:SpectralElementGrid2D}}) = 2 # 3D = 2D horiz + 1D vert
is_cubed_sphere(space) = grid(space) isa CubedSphereSpectralElementGrid2D
Then dispatch on horizontal_ndims(space) rather than matching a named type alias. This is both more extensible and more explicit about what property is being dispatched on.
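A runnable sketch of trait-based dispatch with mock types (MockSpace and describe are illustrative, not the real Space hierarchy):

```julia
# Sketch: a trait query plus Val-dispatch replaces the alias Cartesian product.
struct MockSpace{G} end
horizontal_ndims(::MockSpace{:plane1d}) = 1
horizontal_ndims(::MockSpace{:sphere2d}) = 2

describe(space) = describe(Val(horizontal_ndims(space)), space)
describe(::Val{1}, _) = "1D horizontal"
describe(::Val{2}, _) = "2D horizontal"

describe(MockSpace{:sphere2d}())  # "2D horizontal"
```

A new grid variant then only needs a horizontal_ndims method, rather than aliases in three modules.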
Step 4.2 — Eliminate Field type aliases entirely.
The 23 Field aliases add zero functionality — Field{V, S} already carries the space type S, so any dispatch that works on a Field alias works equally well with Field{<:Any, <:SomeSpace}. Remove all Field aliases and use Field{<:Any, <:SpaceType} where needed (or the trait queries from Step 4.1).
Step 4.3 — Consolidate Device grid variants.
DeviceSpectralElementGrid2D, DeviceFiniteDifferenceGrid, DeviceExtrudedFiniteDifferenceGrid are separate structs that mirror main grid types for GPU use. Replace with Adapt.adapt_structure on the main grid types, using Adapt.jl's existing pattern. This eliminates the parallel type hierarchy.
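The Adapt.jl pattern looks like the following sketch (MockGrid stands in for the real grid types):

```julia
using Adapt

# Sketch: instead of a parallel Device* struct, tell Adapt how the main grid
# type adapts its fields; the same struct then works on CPU and GPU.
struct MockGrid{T, L}
    topology::T
    local_geometry::L
end
Adapt.@adapt_structure MockGrid  # adapts each field, preserving the wrapper type

grid = MockGrid("topology", [1.0, 2.0])
Adapt.adapt(Array, grid)  # still a MockGrid; array fields converted as needed
```

Because the adaptation is derived from the struct definition, adding a field to the grid cannot leave a stale Device* mirror behind.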
Step 4.4 — Simplify space constructors.
30+ constructor methods for space types can be reduced by:
- Removing deprecated constructors (already tested in deprecations.jl)
- Using a single entry point per space type with keyword arguments
- Delegating grid construction to the Grids module exclusively
GPU Performance Implications
- Eliminating Device* structs in favor of Adapt.adapt_structure means GPU adaptations are automatically correct when fields are added to grid types — currently, Device* types can silently become stale.
- Trait-based dispatch is more amenable to compiler specialization than Union-type matching.
Testing Strategy
- Existing tests cover correctness.
- Add tests that verify trait queries return correct values for all space/grid combinations.
- Ensure Adapt.adapt_structure round-trips correctly for GPU grids.
Done Criterion
All existing tests pass. Zero type aliases remain in Fields. Grid/Space alias count reduced from 45+ to < 10 (keeping only the most common user-facing names as a thin compatibility layer). Device* types eliminated.
Phase 5: Test Suite Consolidation
Target: test/ (32,590 lines → ~22,000 lines)
Problems
- Duplication: convergence_rate() is defined locally in 68 places. Fill/copyto helpers duplicated across test files. Matrix field test harnesses duplicated 21 times.
- Missing coverage: No property-based tests, no cross-module integration tests, sparse GPU correctness tests (only verify kernel launch, not output).
- Ad-hoc organization: Some files mix unit tests, convergence tests, and benchmarks.
Approach
Step 5.1 — Extract shared test utilities.
Move convergence_rate, common space constructors, and fill/copyto helpers into test/TestUtilities/. The module already exists but is underused.
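As one plausible consolidation (the 68 local definitions vary slightly; this is a sketch, not the existing code), the shared helper could be:

```julia
# Sketch of a shared convergence_rate helper: observed order of accuracy
# between successive refinements, from errors and grid spacings.
convergence_rate(errors, Δxs) = [
    log(errors[i + 1] / errors[i]) / log(Δxs[i + 1] / Δxs[i]) for
    i in 1:(length(errors) - 1)
]

# Second-order errors on a halving grid give rates ≈ 2:
convergence_rate([1e-2, 2.5e-3, 6.25e-4], [0.1, 0.05, 0.025])  # ≈ [2.0, 2.0]
```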
Step 5.2 — Parametric layout tests (supports Phase 1).
Replace the per-layout test files (data0d.jl, data1d.jl, data2d.jl, data1dx.jl, data2dx.jl) with a single parametric test that loops over all layout types:
@testset "DataLayout: $DL" for DL in [DataF, IF, IFH, IHF, IJF, IJFH, IJHF, VF, VIFH, VIHF, VIJFH, VIJHF]
# ... uniform tests for slab, column, level, fill!, copyto!, mapreduce
end
Step 5.3 — Add GPU correctness tests.
Current GPU tests only verify that kernels launch without error. Add tests that:
- Run the same computation on CPU and GPU, compare results within tolerance.
- Cover all operator families (spectral element, finite difference, hybrid).
- Test edge cases: single-element meshes, boundary-only columns.
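A device-agreement harness could be sketched as follows (make_data and apply_op are placeholders for real field/operator setup; on GPU CI the device tuple would include ClimaComms.CUDADevice()):

```julia
using Test, ClimaComms

# Sketch: run the same computation on several devices and compare values,
# rather than only checking that kernels launch.
function test_device_agreement(make_data, apply_op, devices; rtol = 1e-12)
    results = [Array(apply_op(make_data(dev))) for dev in devices]
    for r in results[2:end]
        @test isapprox(results[1], r; rtol)
    end
end

# Example with CPU device types (placeholder data and operator):
test_device_agreement(
    dev -> ones(4),        # stand-in for device-aware field construction
    data -> cumsum(data),  # stand-in for an operator application
    (ClimaComms.CPUSingleThreaded(), ClimaComms.CPUMultiThreaded()),
)
```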
Step 5.4 — Add cross-module integration tests.
Test the full pipeline: Domain → Mesh → Topology → Grid → Space → Field → Operator → result. Currently each module is tested in isolation. Integration tests catch interface mismatches.
Step 5.5 — Fix or remove disabled tests.
- Terrain warp: diagnose the hang, fix or mark as a known issue with a tracking issue.
- MatrixFields solver/operator tests: either speed them up or move them to Buildkite CI.
Tests to Consider Deleting
- Per-layout data tests that will be superseded by parametric tests (Step 5.2).
- Redundant convergence rate implementations (replaced by shared utility).
- Benchmark files that duplicate unit test setup with minimal additional value (keep benchmarks in benchmarks/ or perf/ only).
Tests to Add
| Category | What | Why |
| --- | --- | --- |
| Property-based | Round-trip: data == read(write(data)) for all layouts | Catches permutation bugs |
| Property-based | Operator adjoint identity: ⟨div(u), f⟩ = -⟨u, grad(f)⟩ | Mathematical correctness |
| GPU correctness | CPU vs GPU comparison for all operator families | Currently untested |
| Integration | Full pipeline: domain → field → operator → result | Interface mismatches |
| Regression | @test_opt for all generic methods from Phases 1-3 | Zero-overhead guarantee |
Done Criterion
Shared utilities eliminate ≥ 50 duplicate function definitions. GPU tests verify numerical correctness. All disabled tests either fixed or tracked.
Phase 6: Documentation Addition
- Add docstrings that are clear, concise, and consistent with the code
- Add a docs/ suite that outlines basic formulations and functionality, summarizing key aspects of the dycore paper, at the level of detail and in the style of, e.g., ClimaTimeSteppers.jl
Done Criterion
Docs suite approved by dycore paper co-authors.
Phasing and Dependencies
Phase 1 (DataLayouts) ─────────────────────────────────────────►
Phase 2 (FD Operators) ──────────────────────────────────────────►
Phase 3 (SE Operators) (can start after Phase 1 done) ──►
Phase 4 (Type Aliases) (independent, can start anytime) ►
Phase 5 (Tests) (parallel with all phases) ──────────────►
Phase 6 (Docs) (parallel with all phases) ──────────────►
Risk Mitigation
- Performance regression: Every phase requires @test_opt and allocation tests before merging. Run benchmarks (perf/, benchmarks/) before and after each phase.
- Downstream breakage: ClimaCore is used by ClimaAtmos and other CliMA packages. Each phase should be merged as a minor version bump with a CHANGELOG entry. Type alias removal (Phase 4) is the highest-risk change for downstream packages, so provide a deprecation period.
- GPU correctness: Phases 1-3 change code that runs on GPU. The current GPU tests only verify kernel launch. Adding GPU correctness tests should be done before or concurrently with Phases 1-3, not after.
- Compiler regressions: Generic code using ntuple / @generated can hit Julia compiler edge cases. Test on Julia 1.10-1.12.
Summary
| Phase | Target | Current Lines | Estimated After | Reduction |
| --- | --- | --- | --- | --- |
| 1 | DataLayouts | 4,231 | ~1,800 | ~57% |
| 2 | FD Operators | 4,282 | ~2,000 | ~53% |
| 3 | SE Operators | 1,873 | ~1,200 | ~36% |
| 4 | Type Aliases (Spaces/Grids/Fields) | 5,143 | ~3,800 | ~26% |
| Total src | | 35,423 | ~28,700 | ~19% |