
ClimaCore refactoring plan for discussion #2468

@tapios

Description


ClimaCore Refactoring Plan

Purpose

ClimaCore.jl has ~35k lines of source and ~33k lines of tests. The three largest modules (Operators (7.5k), MatrixFields (5.2k), and DataLayouts (4.2k)) account for 48% of the source code and contain substantial structural duplication. This plan targets a ~20% source reduction, making the code easier to extend, maintain, and optimize for performance.

The refactoring is organized into 6 phases. Each phase is independently shippable and has a clear "done" criterion. The phases are designed so that later ones benefit from earlier ones but do not require them.

This is meant as a starting point for discussion, not yet as a task list. It's likely I'm missing things, so please edit this as needed.


Phase 1: DataLayouts — Parameterized Layouts

Target: src/DataLayouts/*, ext/cuda/data_layouts_* (5,055 lines → ~2,000 lines)

Problem

14 DataLayout types are defined, but they are all permutations of 5 universal index dimensions (i, j, f, v, h). The FH/HF variants (e.g., IJFH vs IJHF, VIFH vs VIHF) are identical except for whether the field dimension f precedes or follows the element dimension h in memory. This creates systematic duplication:

  • 9 slab() implementations differing only in which dimension is dropped
  • 11 column() implementations differing only in which dimensions are extracted
  • 5 level() implementations differing only in slice position
  • fill.jl, copyto.jl, mapreduce.jl, broadcast.jl all re-implement the same loops per layout
  • to_data_specific() and to_data_specific_field() encode 22 manual index permutations
  • field_dim(), h_dim(), type_params(), union_all() each have 12-15 one-liner definitions

The existing code already shows awareness of this: to_data_specific encodes permutations as tuple selections, field_dim/h_dim return integer positions, and singleton types provide a dispatch key. The infrastructure for parameterization is half-built, and it should be straightforward to generalize into a simpler, more concise interface.
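
As a rough illustration of where this generalization leads, a single permutation tuple carried in the type can replace the 22 hand-written index permutations. This is a hedged sketch, not the current API; to_universal and the example tuples are illustrative:

```julia
# Sketch: apply a layout's permutation tuple to an index tuple;
# ntuple with Val(N) keeps this fully unrolled at compile time.
to_universal(perm::NTuple{N, Int}, I::NTuple{N, Int}) where {N} =
    ntuple(d -> I[perm[d]], Val(N))

# VIJFH stores (v, i, j, f, h) directly; VIJHF swaps the last two axes:
to_universal((1, 2, 3, 5, 4), (10, 2, 3, 7, 1))  # (10, 2, 3, 1, 7)
```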

Possible Approach

Step 1.1 — Unify all layouts into three data structures.

By retaining singleton dimensions (corresponding to axes of length one), all the various layout combinations can be compacted into just three alternatives:

struct VIJFH{S, A}
    array::A
end
struct IH1JH2{S, A}
    array::A
end
struct IV1JH2{S, A}
    array::A
end

The current design also encodes array size information in the type parameters, but these sizes belong in the array field rather than in the DataLayout wrapper. For example, a VIJFH layout can be expressed as follows:

using StaticArrays: SizedArray, Dynamic
# given: element type T; static size parameters Nv, Ni, Nj, Nf; dynamic parameter Nh
array = zeros(T, Nv, Ni, Nj, Nf, Nh)
sized_array = SizedArray{Tuple{Nv, Ni, Nj, Nf, Dynamic()}}(array)
data = VIJFH(sized_array)

Step 1.2 — Define API functions as loops over the three layouts.

All indexing will be standardized to either data[v, i, j, f, h], data[i, h1, j, h2], or data[i, v1, j, h2]. Other API functions like slab, column, and level can then be implemented as thin wrappers around these indexing schemes. Replace the duplicate implementations of fill!, copyto!, etc., with generic loops over the CartesianIndices corresponding to each layout. Use UnrolledUtilities to ensure that symbols in the layout tuple are always constant-propagated during compilation, producing the same specialized code we have today.
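
As a sketch of what one generic loop looks like in place of the per-layout fill! methods (the wrapper below is a simplified stand-in for the real VIJFH, and generic_fill! is an illustrative name):

```julia
# Simplified stand-in for the proposed layout wrapper (S and A as in Step 1.1):
struct VIJFH{S, A}
    array::A
end
VIJFH{S}(array::A) where {S, A} = VIJFH{S, A}(array)

# One generic loop over CartesianIndices replaces every per-layout fill!:
function generic_fill!(data::VIJFH, val)
    for I in CartesianIndices(data.array)
        @inbounds data.array[I] = val
    end
    return data
end

data = VIJFH{Float64}(zeros(2, 3, 3, 1, 4))
generic_fill!(data, 1.0)
```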

Step 1.3 — Collapse broadcast implementations.

The 11 DataStyle types (IFHStyle, IHFStyle, IJFHStyle, ...) can be reduced to a single DataLayoutStyle, with distinct styles combined by merging their layout tuples:

struct DataLayoutStyle{L, A} <: Base.Broadcast.BroadcastStyle end
Base.Broadcast.BroadcastStyle(
    ::DataLayoutStyle{L1, A1},
    ::DataLayoutStyle{L2, A2},
) where {L1, L2, A1, A2} =
    DataLayoutStyle{unrolled_unique((L1..., L2...)), promote_array_type(A1, A2)}()

Also, having separate code paths and data structures for CPUs and GPUs adds a lot of unnecessary complexity. Use ClimaComms.@threaded to define a unified implementation, confining all device-specific abstractions to extensions of ClimaComms.

Testing Strategy

  • Add a parametric test that verifies slab, column, level, fill!, copyto!, mapreduce for all relevant layouts in a single test file.
  • Add a round-trip property test: for each layout type, fill with known data via universal indexing, then read back via storage indexing and verify equivalence.
  • Add a dead-code elimination test: for several different struct types, evaluate broadcast expressions that only use a subset of the struct fields, and check that the compiled LLVM code has the minimum number of reads and allocations.
  • Either run all unit tests on both CPUs and GPUs, or confirm that CPU-only tests are also GPU-compatible by checking for type instabilities with JET.@test_opt.
    • Running on both devices is preferable to JET tests, since it is not uncommon for type instabilities to exist on only one of two devices.
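
A minimal version of the round-trip property test could look like this (the permutation dictionary is an illustrative stand-in for real layout metadata from DataLayouts):

```julia
using Test

# Write via universal order, read back via storage order, verify equivalence:
perms = Dict(:VIJFH => (1, 2, 3, 4, 5), :VIJHF => (1, 2, 3, 5, 4))
universal = rand(2, 3, 3, 1, 4)
@testset "round-trip: $name" for (name, perm) in perms
    storage = permutedims(universal, perm)       # "write" in storage order
    back = permutedims(storage, invperm(perm))   # "read" back universally
    @test back == universal
end
```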

Done Criterion

API functions pass all correctness tests, demonstrating optimal inlining and type stability. Line count reduced by ≥ 50%.


Phase 2: Finite Difference Operators — Stencil Abstraction

Target: src/Operators/finitedifference.jl, ext/cuda/operators_finite_difference.jl, ext/cuda/operators_fd_*, ext/cuda/column_matrix_helpers.jl (6,390 lines → ~3,000 lines)

Problem

53 operator struct types are defined, with 80+ boundary condition method implementations. The dominant pattern is:

  1. Interior stencils are nearly identical across operator pairs (e.g., GradientF2C and GradientC2F share the same interior computation; InterpolateF2C and InterpolateC2F are rdiv(a⁺ ⊞ a⁻, 2) in both cases).
  2. Each operator re-defines stencil_interior, stencil_left_boundary, stencil_right_boundary, boundary_width, and stencil_interior_width.
  3. Boundary condition handling is duplicated per-operator × per-BC type (SetValue, SetGradient, Extrapolate), with left/right mirrors that are often identical modulo sign.
  4. The F2C/C2F distinction affects only return_space and which boundary points exist — the stencil computation itself is the same.

Approach

Step 2.1 — Drop redundant information from operator types.

The return space of an operator can be inferred from the types of its inputs, so this information does not need to be duplicated in the type signature. That is, instead of separate GradientF2C and GradientC2F operators, we can use a single Gradient:

@kwdef struct Gradient{BC <: NamedTuple} <: FiniteDifferenceOperator
    bcs::BC = NamedTuple()
end

Although the current design also generalizes the boundary conditions into a NamedTuple with generic names, this seems like an unnecessary degree of abstraction. If needed, custom constructors can be used to handle generic boundary names, but we can just use one set of names for the internals of our implementation. We can discuss as a group which combination will be most clear: top/bottom, left/right, bc1/bc2, or something else.
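
If we do want to keep accepting generic user-facing names, a small normalizing constructor is enough. This sketch assumes internal :bottom/:top names and a hypothetical rename table; it is not the existing constructor:

```julia
abstract type FiniteDifferenceOperator end
struct Gradient{BC <: NamedTuple} <: FiniteDifferenceOperator
    bcs::BC
end

# Accept user-facing names but normalize to one internal set (:bottom/:top):
function Gradient(; kwargs...)
    rename = (left = :bottom, right = :top)
    normalized = (get(rename, k, k) => v for (k, v) in kwargs)
    return Gradient(NamedTuple(normalized))
end

Gradient(left = 0.0).bcs  # (bottom = 0.0,)
```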

Step 2.2 — Eliminate code divergence at boundaries.

Having separate code paths for boundaries and interiors is only beneficial when all iteration is statically unrolled. When this is not the case (e.g., in GPU kernels where iteration is split across multiple threads running in parallel), this design pattern introduces runtime warp divergence. To avoid this performance penalty, replace the stencil_left_boundary/stencil_interior/stencil_right_boundary functions with a single stencil function, compacting all warp divergence into the bottommost call to getindex:

@inline stencil((; bcs)::Interpolate, space, idx, hidx, arg) =
    (getidx(arg, space, idx + half, hidx, bcs) +
     getidx(arg, space, idx - half, hidx, bcs)) / 2
# getidx applies boundary conditions when out-of-bounds

This was partially completed in #2466, but only for the CUDA code relating to linear operators. It should be straightforward to generalize for CPU code.
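
To make the single-code-path idea concrete, here is a self-contained toy version in which all divergence is confined to getidx (ColumnField and centered_diff are illustrative stand-ins, not the real API):

```julia
struct ColumnField{V <: AbstractVector, BC <: NamedTuple}
    values::V
    bcs::BC  # e.g. (bottom = ..., top = ...)
end

# All boundary handling lives in the bottommost index lookup:
@inline function getidx(f::ColumnField, idx)
    idx < 1 && return f.bcs.bottom              # out of bounds below
    idx > length(f.values) && return f.bcs.top  # out of bounds above
    return @inbounds f.values[idx]
end

# The same stencil expression is then valid at every level, boundaries included:
centered_diff(f, idx) = (getidx(f, idx + 1) - getidx(f, idx - 1)) / 2

f = ColumnField(collect(1.0:5.0), (bottom = 0.0, top = 6.0))
centered_diff(f, 1)  # (2.0 - 0.0) / 2 = 1.0
```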

Step 2.3 — Define all operators in terms of a single primitive.

All finite difference operators can be defined in terms of a single "convolution" primitive, with a special case for linear operators, for which the convolution is an inner product. This was partially started in #2466, but a large amount of complexity was needed to handle AxisTensors, whose projections are not currently expressible using simple products. Merging #2457 should allow us to eliminate this complexity.
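
In scalar form, the primitive is just a weighted sum over stencil points; the names below are illustrative, and the real version must additionally handle AxisTensor projections:

```julia
# The "convolution" primitive: a weighted sum over neighboring values.
# For linear operators this is exactly an inner product with fixed weights.
convolve(weights, values) = sum(w * v for (w, v) in zip(weights, values))

# Interpolation and a centered gradient as instances of the same primitive:
interpolate(a⁻, a⁺) = convolve((1 / 2, 1 / 2), (a⁻, a⁺))
gradient(a⁻, a⁺, Δz) = convolve((-1 / Δz, 1 / Δz), (a⁻, a⁺))

interpolate(1.0, 3.0)    # 2.0
gradient(1.0, 3.0, 0.5)  # 4.0
```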

Step 2.4 — Reduce advection operator variants.

The 8 advection operators (Upwind, LinVanLeer, TVDLimited, 3rdOrder variants × C2F/F2C) share the same control flow — they differ in the interpolation rule applied within the stencil. Factor the interpolation rule into a strategy parameter rather than duplicating the entire operator.
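
A minimal sketch of the strategy parameter, with two illustrative reconstruction rules (the type and function names here are hypothetical, not the current operator API):

```julia
abstract type InterpolationRule end
struct FirstOrderUpwind <: InterpolationRule end
struct Centered <: InterpolationRule end

# Only the face reconstruction differs between rules:
face_value(::FirstOrderUpwind, w, a⁻, a⁺) = w > 0 ? a⁻ : a⁺
face_value(::Centered, w, a⁻, a⁺) = (a⁻ + a⁺) / 2

# The advection control flow is written once, for every rule:
advective_flux(rule::InterpolationRule, w, a⁻, a⁺) = w * face_value(rule, w, a⁻, a⁺)

advective_flux(FirstOrderUpwind(), 1.0, 2.0, 4.0)  # 2.0
advective_flux(Centered(), 1.0, 2.0, 4.0)          # 3.0
```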

Testing Strategy

  • Existing convergence tests (convergence_column.jl, convergence_advection_diffusion1d.jl) are the primary correctness gate.
  • Add boundary symmetry property tests: for symmetric BCs (SetValue), verify that applying the operator to a symmetric input produces symmetric output.
  • Extend dead-code elimination tests from DataLayouts for chains of operators.
  • Consolidate the operator performance tests/benchmarks.
  • Run all tests on CPUs and GPUs.

Done Criterion

API functions pass all correctness and performance tests, demonstrating optimal inlining and type stability. Operator count reduced from 53 to ~25. Line count reduced by ≥ 50%.


Phase 3: Spectral Element Operators — Dimension-Generic Kernels

Target: src/Operators/spectralelement.jl (1,873 lines → ~1,200 lines)

Problem

Each spectral element operator (Divergence, Gradient, WeakDivergence, WeakGradient, Curl, WeakCurl) has separate 1D {(1,)} and 2D {(1,2)} implementations. The 2D version repeats the 1D logic for each axis. Example: Divergence{(1,)} is a single loop over i ∈ 1:Nq computing D[ii,i] ⊠ Jv¹, while Divergence{(1,2)} repeats this for both i and j axes with Jv¹ and Jv².

Approach

Step 3.1 — Write dimension-generic apply_operator using ntuple.

The core pattern for all SE operators can be written generically over N spatial dimensions using ntuple and CartesianIndices:

function apply_operator(op::Divergence{I}, space, slabidx, arg) where {I}
    N = length(I)
    # ... generic loop over N dimensions
    for d in 1:N
        Jvd = local_geometry.J ⊠ contravariant_component(v, d, local_geometry)
        # accumulate D along axis d
    end
end

Step 3.2 — Unify Strong/Weak operator variants.

Strong form operators (Divergence, Gradient) and weak form operators (WeakDivergence, WeakGradient) differ only in the use of J vs WJ and which metric terms appear. Factor this as a FormType parameter (:strong / :weak) rather than duplicating the entire operator.
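
A sketch of the FormType selection, using a Val parameter so the choice is static; the LocalGeometry fields here are illustrative stand-ins for the real geometry struct:

```julia
struct LocalGeometry{T}
    J::T   # Jacobian determinant
    WJ::T  # quadrature-weighted Jacobian
end

# The only strong/weak difference is which Jacobian factor is used:
jacobian(::Val{:strong}, g::LocalGeometry) = g.J
jacobian(::Val{:weak}, g::LocalGeometry) = g.WJ

# so a single kernel body can be shared between both forms:
scaled(form, g, x) = jacobian(form, g) * x

g = LocalGeometry(2.0, 0.5)
scaled(Val(:strong), g, 3.0)  # 6.0
scaled(Val(:weak), g, 3.0)    # 1.5
```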

GPU Performance Implications

  • The generic loop over N dimensions is easier to map to shared-memory tiling strategies (iterate tiles of Nq^N points).
  • Reducing the number of distinct operator kernels means fewer compilations and less GPU binary size.
  • The unified strong/weak form allows a single fused kernel to compute both when needed (e.g., for diffusion = weak divergence of gradient).

Testing Strategy

  • Existing spectral element convergence tests (sphere gradient, divergence, curl, diffusion) are the primary correctness gate.
  • Add @test_opt tests to verify zero-overhead for N=1 and N=2.
  • Add equivalence tests between old and new implementations on a reference mesh before removing old code.

Done Criterion

All spectral element tests pass. Each operator has one apply_operator method instead of two. Line count reduced by ~35%.


Phase 4: Type Alias Cleanup and Structural Simplification

Target: Spaces (1,394 lines), Grids (1,296 lines), Fields (2,453 lines)

Problem

45+ const type aliases create a Cartesian product of names across three modules:

  • Grids: ExtrudedSpectralElementGrid2D, ExtrudedCubedSphereSpectralElementGrid3D, DeviceSpectralElementGrid2D, etc.
  • Spaces: ExtrudedFiniteDifferenceSpace2D, CenterExtrudedFiniteDifferenceSpace3D, ExtrudedRectilinearSpectralElementSpace3D, etc.
  • Fields: SpectralElementField2D, CenterExtrudedFiniteDifferenceField, ExtrudedCubedSphereSpectralElementField3D, etc.

Every new grid variant requires adding aliases in all three modules. The aliases don't carry behavior — they're just type constraints used for dispatch. In practice, runtime isa checks are used anyway (e.g., space isa ExtrudedFiniteDifferenceSpace3D ? ... : ...).

Approach

Step 4.1 — Replace type aliases with trait queries.

Instead of:

const ExtrudedFiniteDifferenceSpace3D = ExtrudedFiniteDifferenceSpace{<:ExtrudedFiniteDifferenceGrid{<:SpectralElementGrid2D}}

Define:

horizontal_ndims(::ExtrudedFiniteDifferenceSpace{<:ExtrudedFiniteDifferenceGrid{<:SpectralElementGrid2D}}) = 2  # 3D = 2D horiz + 1D vert
is_cubed_sphere(space) = grid(space) isa CubedSphereSpectralElementGrid2D

Then dispatch on horizontal_ndims(space) rather than matching a named type alias. This is both more extensible and more explicit about what property is being dispatched on.
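
For example (with stand-in space types; returning Val keeps the trait usable for static dispatch):

```julia
struct PlaneSpace end   # stand-in: 1D horizontal
struct SphereSpace end  # stand-in: 2D horizontal
horizontal_ndims(::PlaneSpace) = Val(1)
horizontal_ndims(::SphereSpace) = Val(2)

# Dispatch on the trait value rather than on a named type alias:
describe(space) = describe(horizontal_ndims(space), space)
describe(::Val{1}, space) = "1D horizontal"
describe(::Val{2}, space) = "2D horizontal"

describe(SphereSpace())  # "2D horizontal"
```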

Step 4.2 — Eliminate Field type aliases entirely.

The 23 Field aliases add zero functionality — Field{V, S} already carries the space type S, so any dispatch that works on a Field alias works equally well with Field{<:Any, <:SomeSpace}. Remove all Field aliases and use Field{<:Any, <:SpaceType} where needed (or the trait queries from Step 4.1).

Step 4.3 — Consolidate Device grid variants.

DeviceSpectralElementGrid2D, DeviceFiniteDifferenceGrid, DeviceExtrudedFiniteDifferenceGrid are separate structs that mirror main grid types for GPU use. Replace with Adapt.adapt_structure on the main grid types, using Adapt.jl's existing pattern. This eliminates the parallel type hierarchy.
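
Adapt.jl's @adapt_structure macro generates the required adapt_structure method directly from the struct definition; the grid fields below are illustrative stand-ins:

```julia
using Adapt

# Stand-in for a main grid type; fields are illustrative:
struct SpectralElementGrid2D{T, G}
    topology::T
    local_geometry::G
end

# One line replaces the hand-written Device* mirror type; adapt recurses into
# every field, so newly added fields can never be silently left un-adapted:
Adapt.@adapt_structure SpectralElementGrid2D

grid = SpectralElementGrid2D(:topo, rand(3))
adapt(Array, grid)  # on GPU, adapt(CuArray, grid) would convert the arrays
```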

Step 4.4 — Simplify space constructors.

30+ constructor methods for space types can be reduced by:

  • Removing deprecated constructors (already tested in deprecations.jl)
  • Using a single entry point per space type with keyword arguments
  • Delegating grid construction to the Grids module exclusively

GPU Performance Implications

  • Eliminating Device* structs in favor of Adapt.adapt_structure means GPU adaptations are automatically correct when fields are added to grid types — currently, Device* types can silently become stale.
  • Trait-based dispatch is more amenable to compiler specialization than Union-type matching.

Testing Strategy

  • Existing tests cover correctness.
  • Add tests that verify trait queries return correct values for all space/grid combinations.
  • Ensure Adapt.adapt_structure round-trips correctly for GPU grids.

Done Criterion

All existing tests pass. Zero type aliases remain in Fields. Grid/Space alias count reduced from 45+ to < 10 (keeping only the most common user-facing names as a thin compatibility layer). Device* types eliminated.


Phase 5: Test Suite Consolidation

Target: test/ (32,590 lines → ~22,000 lines)

Problems

  1. Duplication: convergence_rate() is defined locally in 68 places. Fill/copyto helpers duplicated across test files. Matrix field test harnesses duplicated 21 times.
  2. Missing coverage: No property-based tests, no cross-module integration tests, sparse GPU correctness tests (only verify kernel launch, not output).
  3. Ad-hoc organization: Some files mix unit tests, convergence tests, and benchmarks.

Approach

Step 5.1 — Extract shared test utilities.

Move convergence_rate, common space constructors, and fill/copyto helpers into test/TestUtilities/. The module already exists but is underused.
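
The shared utility could be as small as this; a sketch consistent with how the local copies are typically written, not necessarily the exact existing signature:

```julia
# Estimated order of accuracy from errors err at grid spacings Δh:
convergence_rate(err, Δh) =
    [log(err[i + 1] / err[i]) / log(Δh[i + 1] / Δh[i]) for i in 1:(length(err) - 1)]

# A second-order scheme produces rates ≈ 2:
Δh = [0.1, 0.05, 0.025]
err = Δh .^ 2
convergence_rate(err, Δh)  # ≈ [2.0, 2.0]
```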

Step 5.2 — Parametric layout tests (supports Phase 1).

Replace the per-layout test files (data0d.jl, data1d.jl, data2d.jl, data1dx.jl, data2dx.jl) with a single parametric test that loops over all layout types:

@testset "DataLayout: $DL" for DL in [DataF, IF, IFH, IHF, IJF, IJFH, IJHF, VF, VIFH, VIHF, VIJFH, VIJHF]
    # ... uniform tests for slab, column, level, fill!, copyto!, mapreduce
end

Step 5.3 — Add GPU correctness tests.

Current GPU tests only verify that kernels launch without error. Add tests that:

  • Run the same computation on CPU and GPU, compare results within tolerance.
  • Cover all operator families (spectral element, finite difference, hybrid).
  • Test edge cases: single-element meshes, boundary-only columns.

Step 5.4 — Add cross-module integration tests.

Test the full pipeline: Domain → Mesh → Topology → Grid → Space → Field → Operator → result. Currently each module is tested in isolation. Integration tests catch interface mismatches.

Step 5.5 — Fix or remove disabled tests.

  • Terrain warp: diagnose the hang, fix or mark as a known issue with a tracking issue.
  • MatrixFields solver/operator tests: either speed them up or move them to Buildkite CI.

Tests to Consider Deleting

  • Per-layout data tests that will be superseded by parametric tests (Step 5.2).
  • Redundant convergence rate implementations (replaced by shared utility).
  • Benchmark files that duplicate unit test setup with minimal additional value (keep benchmarks in benchmarks/ or perf/ only).

Tests to Add

Category        | What                                                   | Why
Property-based  | Round-trip: data == read(write(data)) for all layouts  | Catches permutation bugs
Property-based  | Operator adjoint identity: ⟨div(u), f⟩ = -⟨u, grad(f)⟩ | Mathematical correctness
GPU correctness | CPU vs GPU comparison for all operator families        | Currently untested
Integration     | Full pipeline: domain → field → operator → result      | Interface mismatches
Regression      | @test_opt for all generic methods from Phases 1-3      | Zero-overhead guarantee

Done Criterion

Shared utilities eliminate ≥ 50 duplicate function definitions. GPU tests verify numerical correctness. All disabled tests either fixed or tracked.


Phase 6: Documentation Addition

  • Add docstrings that are clear, concise, and consistent with the code
  • Add a docs/ suite that outlines basic formulations and functionality, summarizing key aspects of the dycore paper, at the level of detail and in the style of, e.g., ClimaTimeSteppers.jl

Done Criterion

Docs suite approved by dycore paper co-authors.


Phasing and Dependencies

Phase 1 (DataLayouts)  ─────────────────────────────────────────►
Phase 2 (FD Operators) ──────────────────────────────────────────►
Phase 3 (SE Operators)          (can start after Phase 1 done) ──►
Phase 4 (Type Aliases)          (independent, can start anytime) ►
Phase 5 (Tests)         (parallel with all phases) ──────────────►
Phase 6 (Docs)          (parallel with all phases) ──────────────►

Risk Mitigation

  1. Performance regression: Every phase requires @test_opt and allocation tests before merging. Run benchmarks (perf/, benchmarks/) before and after each phase.
  2. Downstream breakage: ClimaCore is used by ClimaAtmos and other CliMA packages. Each phase should be merged as a minor version bump with a CHANGELOG entry. Type alias removal (Phase 4) is the highest-risk change for downstream packages, so provide a deprecation period.
  3. GPU correctness: Phases 1-3 change code that runs on GPU. The current GPU tests only verify kernel launch. Adding GPU correctness tests should be done before or concurrently with Phases 1-3, not after.
  4. Compiler regressions: Generic code using ntuple / @generated can hit Julia compiler edge cases. Test on Julia 1.10-1.12.

Summary

Phase | Target                             | Current Lines | Estimated After | Reduction
1     | DataLayouts                        | 4,231         | ~1,800          | ~57%
2     | FD Operators                       | 4,282         | ~2,000          | ~53%
3     | SE Operators                       | 1,873         | ~1,200          | ~36%
4     | Type Aliases (Spaces/Grids/Fields) | 5,143         | ~3,800          | ~26%
Total | src                                | 35,423        | ~28,700         | ~19%
