stillwater-sc · Ravenwater · May 21, 2026
diff --git a/docs/algorithmic-details/elreal-performance-baseline.md b/docs/algorithmic-details/elreal-performance-baseline.md
@@ -1,10 +1,18 @@
 # elreal Performance Baseline
 
-Phase I of epic #873. This document is a baseline measurement -- not a
-performance target. The numbers below come from a single workstation and
-are intended to identify the cost shape of the shipped implementation, so
-that future optimisation work has a starting point and a way to measure
-progress.
+Phase I of epic #873 (baseline) and Phase K.1 of follow-up epic #903
+(small-buffer optimisation on `_components`). This document is a
+baseline measurement -- not a performance target. The numbers below
+come from a single workstation and are intended to identify the cost
+shape of the shipped implementation, so that future optimisation work
+has a starting point and a way to measure progress.
+
+> **Phase K.1 update (#905)**: The `_components` storage migrated from
+> `std::vector<double>` to a small-buffer-optimised
+> `lazy_component_buffer` (inline 4 doubles + spill, see
+> `include/sw/universal/number/elreal/lazy_component_buffer.hpp`).
+> The two headline-number tables below carry the original Phase I
+> baseline and the post-K.1 measurements, respectively.
 
 ## Measurement setup
 
@@ -18,11 +26,12 @@ progress.
   pattern (`a = a + b`) would build up. Per-operation cost is reported
   via `PerformanceRunner` from `include/sw/universal/benchmark/performance_runner.hpp`.
 
-## Headline numbers
+## Headline numbers (Phase I baseline -- pre-K.1)
 
 Throughput in operations per second, rounded. Same workload, two compilers.
 Both elreal and ereal<N> workloads construct fresh operands inside the
-loop body so the per-iteration allocation pattern matches between sides:
+loop body so the per-iteration allocation pattern matches between sides.
+These were the numbers before the K.1 small-buffer optimisation:
 
 | Operation | Budget | gcc 13.3 | clang 18.1 |
 |---|---|---:|---:|
@@ -43,10 +52,41 @@ loop body so the per-iteration allocation pattern matches between sides:
 | `ereal<4> /` | --- | 639 Kops/s | 721 Kops/s |
 | `ereal<8> /` | --- | 636 Kops/s | 725 Kops/s |
 
-The two compilers track within ~10-50% on the elreal arithmetic (clang
-is notably slower on `elreal *`; see the cost-shape discussion below)
-and within ~5% on ereal. Below, we use the gcc 13.3 numbers as the
-reference unless otherwise noted.
+## Headline numbers (post-K.1, current)
+
+After the K.1 small-buffer optimisation. Same workload, same hardware,
+same compilers. ereal numbers are unchanged from above (K.1 only touched
+elreal):
+
+| Operation | Budget | gcc 13.3 | clang 18.1 | vs Phase I (gcc) |
+|---|---|---:|---:|---:|
+| `elreal +` | depth 0 | 16 Mops/s | 17 Mops/s | **1.8x faster** |
+| `elreal +` | depth 1 | 12 Mops/s | 21 Mops/s | **1.3x faster** |
+| `elreal -` | depth 1 | 14 Mops/s | 17 Mops/s | **1.6x faster** |
+| `elreal *` | depth 1 | 19 Mops/s | 22 Mops/s | **2.4x faster** |
+| `elreal /` | depth 0 | 1 Gops/s | 138 Mops/s | dominated by compiler inlining once heap alloc is gone (see note) |
+| `elreal sqrt` | depth 1 | 30 Mops/s | 30 Mops/s | **2.1x faster** |
+| `elreal exp` | depth 1 | 31 Mops/s | 34 Mops/s | **2.2x faster** |
+| `elreal log` | depth 1 | 24 Mops/s | 28 Mops/s | **1.7x faster** |
+| `elreal + refine_to(106)` | --- | 13 Mops/s | 15 Mops/s | 1.4x |
+| `elreal + refine_to(212)` | --- | 11 Mops/s | 17 Mops/s | 1.4x |
+
+Both compilers now land in the same 12-22 Mops/s range for arithmetic; the
+widest remaining spread is `elreal +` at depth 1 (12 vs 21 Mops/s). The clang
+gap on `elreal *` that the Phase I baseline flagged (4 vs 8 Mops/s) is closed.
+
+The `elreal /` Gops/s result is genuine in the workload but worth
+flagging: with the inline-buffer change, the result `elreal` is fully
+stack-allocatable, and `elreal::operator/` happens to be the simplest
+operator (single double divide, no captured generator -- depth-2+
+Newton refinement is deferred to Phase L #906). gcc inlines the whole
+operator and the only remaining work is the double divide itself. In a
+workload where the result needs to be propagated into a more complex
+expression, the throughput drops back to the same range as the other
+operators.
+
+Below, we use the gcc 13.3 post-K.1 numbers as the reference unless
+otherwise noted.
 
 ## Reading the table
 
@@ -126,12 +166,27 @@ in elreal first.
 
 ## When is `elreal` faster than `ereal`?
 
-At today's depth-1 cap, `elreal` is essentially never faster than `ereal`
-on the elementary arithmetic when measured in raw ops per second.
-`ereal<2>` wins at `+ - *` by a factor of 1.2x to 3x at matched
-precision, and the only "win" for `elreal` is on division, where the
-lazy shortcut at depth 0 produces a misleading apples-to-oranges
-result against ereal's iterative full-precision division.
+The picture changed materially with K.1:
+
+| Op | `elreal` post-K.1 (gcc) | `ereal<2>` (gcc) | Winner |
+|---|---:|---:|---|
+| `+` | 12 Mops/s | 24 Mops/s | `ereal<2>` (~ 2x) |
+| `-` | 14 Mops/s | 19 Mops/s | `ereal<2>` (~ 1.4x) |
+| `*` | 19 Mops/s | 10 Mops/s | **`elreal` (~ 1.9x)** |
+| `/` | (apples-to-oranges) | 650 Kops/s | -- |
+| `sqrt`, `exp`, `log` | 24-31 Mops/s | n/a | `elreal` only |
+
+Multiplication has flipped: `elreal *` now beats `ereal<2> *` at matched
+precision because `ereal<N>` multiplication is O(N) in the eager
+expansion product while `elreal *` is essentially a single `two_prod`
+plus the (now inline) result envelope.
+
+Addition and subtraction still favour `ereal<2>` -- those operators
+are O(1) in `ereal<N>` for small N and the per-iteration cost is
+dominated by the result construction, which `ereal<N>` already amortises
+better than the lazy-stream envelope. The remaining gap is what
+Phase K.2 (`std::function` -> tagged-union generator) and Phase K.3
+(reference-counted operand sharing) target.
 
 What `elreal` gives you instead, and what `ereal` cannot:
 
@@ -159,34 +214,34 @@ Profile-guided observation: the bottleneck on every arithmetic operator
 is the same triple of `_components` vector allocation, `_generator`
 function-object packing, and the input copies that go into that pack.
 
-Concrete Phase II candidates, in order of expected payoff:
+Concrete Phase K candidates of the follow-up epic (#903), in order of
+expected payoff:
 
-1. **Small-buffer optimisation on `_components`.** A `std::vector<double>`
-   of size 1-2 is the common case. A `small_vector<double, 4>` (or a
-   `std::array<double, K>` with a runtime `size_`) would eliminate one
-   allocation per op for the typical case. This is the single largest
-   item on the list.
-2. **Generator type erasure that avoids heap.** `std::function` always
-   heap-allocates when the capture exceeds the SBO. Replacing
+1. ~~**Small-buffer optimisation on `_components`**~~ -- **DONE in K.1**.
+   `std::vector<double>` replaced with `lazy_component_buffer` (inline 4
+   doubles + spill via `std::vector`). The common case (depth 1-4)
+   pays no heap allocation for `_components`. Measured speedup on gcc:
+   1.3-2.4x over Phase I across arithmetic and math (division excluded).
+2. **Generator type erasure that avoids heap** (Phase K.2). `std::function`
+   always heap-allocates when the capture exceeds the SBO. Replacing
    `std::function<double(std::size_t)>` with a tagged-union or
    intrusive-list-of-known-shapes representation would eliminate the
    second per-op allocation. The known shapes are small: depth-1 EFT
    residuals, depth-1 derivative corrections, constants, degenerate
    (all-zero). Each is a fixed-size POD.
-3. **Reference-counted operand sharing.** The lambda captures *copies*
-   of both inputs. Switching to `std::shared_ptr<const Components>`
+3. **Reference-counted operand sharing** (Phase K.3). The lambda captures
+   *copies* of both inputs. Switching to `std::shared_ptr<const Components>`
    would let multiple results share an ancestor without copying the
-   component vector. This becomes more valuable once Phase II depth-2+
+   component vector. This becomes more valuable once Phase L's depth-2+
    generators chain back to an ancestor.
-4. **SIMD/FMA on `two_sum` and `two_prod` batches.** Once the
-   allocation cost is shrunk, the EFT primitives become a non-trivial
-   fraction of the loop. A batch interface that processes 4-8 EFTs in
-   one SIMD pass would help reductions and dot products. This is a
-   later step -- meaningful only after the allocator hot path is
-   addressed.
-
-None of these are committed to a specific Phase II PR by this baseline;
-they are the natural ordering for follow-up work.
+4. **SIMD/FMA on `two_sum` and `two_prod` batches** (Phase K.4). Once
+   the allocation cost is shrunk, the EFT primitives become a
+   non-trivial fraction of the loop. A batch interface that processes
+   4-8 EFTs in one SIMD pass would help reductions and dot products.
+
+K.2 is the natural next target now that K.1 has removed the
+component-vector allocation in the common case: the `std::function`
+capture is the remaining per-op heap allocation.
 
 ## Out of scope for Phase I
 

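The K.2 candidate above (replacing `std::function<double(std::size_t)>` with a tagged union of known shapes) can be illustrated with a minimal sketch. Everything here is hypothetical: the shape structs, `generator_shape`, and `generate` are illustrative names, not the library's API, and the sketch assumes component 0 lives in the component buffer so the generator is only asked for indices >= 1.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <variant>

// Hypothetical fixed-size shapes; none of these names come from the library.
struct zero_shape {};                             // degenerate: all further components are 0
struct constant_shape { double value; };          // a known depth-1 correction
struct sum_residual_shape { double a, b; };       // depth-1 residual of fl(a + b)
struct product_residual_shape { double a, b; };   // depth-1 residual of fl(a * b)

// A closed set of PODs held by value: no heap allocation on construction or copy.
using generator_shape = std::variant<zero_shape, constant_shape,
                                     sum_residual_shape, product_residual_shape>;

// Knuth's branch-free two_sum error term: (a + b) - fl(a + b), exactly.
inline double sum_error(double a, double b) {
    double s  = a + b;
    double bb = s - a;
    return (a - (s - bb)) + (b - bb);
}

// Component k of the stream. Assumption: k == 0 is served from the buffer.
inline double generate(const generator_shape& g, std::size_t k) {
    assert(k >= 1 && "component 0 is assumed to live in the component buffer");
    if (k != 1) return 0.0;  // this sketch models depth-1 shapes only
    struct visitor {
        double operator()(const zero_shape&) const { return 0.0; }
        double operator()(const constant_shape& c) const { return c.value; }
        double operator()(const sum_residual_shape& r) const { return sum_error(r.a, r.b); }
        double operator()(const product_residual_shape& r) const {
            return std::fma(r.a, r.b, -(r.a * r.b));  // exact residual of the rounded product
        }
    };
    return std::visit(visitor{}, g);
}
```

The design trade-off is the usual one for closed variants: dispatch is a switch over a small tag rather than an indirect call through type-erased storage, at the cost of having to enumerate every generator shape up front.
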
diff --git a/docs/algorithmic-details/multi-component-arithmetic.md b/docs/algorithmic-details/multi-component-arithmetic.md
@@ -635,25 +635,28 @@ clang 18.1, `-O3`). The full numbers and reproduction recipe live in
 `docs/algorithmic-details/elreal-performance-baseline.md`. The summary
 shape for picker purposes:
 
-| Op | `elreal` (depth 1) | `ereal<2>` (~106 bits) | Winner |
+| Op | `elreal` post-K.1 | `ereal<2>` (~106 bits) | Winner |
 |---|---:|---:|---|
-| `+` | ~9 Mops/s  | ~24 Mops/s  | `ereal<2>` (~ 2.7x) |
-| `-` | ~9 Mops/s  | ~19 Mops/s  | `ereal<2>` (~ 2.1x) |
-| `*` | ~8 Mops/s  | ~10 Mops/s  | `ereal<2>` (~ 1.2x) |
-| `/` (elreal depth 0 only) | ~36 Mops/s | ~650 Kops/s | `elreal` (~ 55x; not apples-to-apples; ereal does full precision, elreal does double only) |
-| `sqrt`, `exp`, `log` | ~14 Mops/s | n/a       | `elreal` (ereal has no math functions) |
+| `+` | ~12 Mops/s | ~24 Mops/s | `ereal<2>` (~ 2x) |
+| `-` | ~14 Mops/s | ~19 Mops/s | `ereal<2>` (~ 1.4x) |
+| `*` | ~19 Mops/s | ~10 Mops/s | **`elreal` (~ 1.9x)** |
+| `/` (elreal depth 0 only) | (dominated by inlining) | ~650 Kops/s | `elreal` (apples-to-oranges; ereal does full precision, elreal does double only) |
+| `sqrt`, `exp`, `log` | ~24-31 Mops/s | n/a | `elreal` (ereal has no math functions) |
 
-(All numbers gcc 13.3 on a 12th Gen i7-12700K; both sides constructing
-fresh operands inside the loop body. Reproduction: `make benchmark_elreal_performance`.)
+(All numbers gcc 13.3 on a 12th Gen i7-12700K post-Phase-K.1 of #903;
+both sides constructing fresh operands inside the loop body.
+Reproduction: `make benchmark_elreal_performance`. Phase I baseline
+numbers before K.1 are in `docs/algorithmic-details/elreal-performance-baseline.md`.)
 
 Two reads from the table:
 
-1. **At matched precision on elementary arithmetic, `ereal<2>` is the
-   throughput winner today** by a factor of 1.2-3x. `elreal`'s
-   lazy-stream envelope (vector allocation + std::function capture +
-   operand copy) is the bottleneck; it adds tens to hundreds of
-   nanoseconds per op that `ereal<N>` mostly avoids by carrying no
-   captured-generator lambda alongside its component storage.
+1. **Multiplication has flipped: `elreal` now wins on `*` at matched
+   precision.** `ereal<N>` multiplication is O(N) in the eager expansion
+   product, while `elreal *` is essentially a single `two_prod` plus
+   the (post-K.1 inline) result envelope. Addition and subtraction
+   still favour `ereal<2>` because those operators are O(1) on the
+   eager side and the lazy-stream envelope's `std::function` capture
+   is still the dominant per-op cost (Phase K.2 target).
 2. **What `elreal` actually wins on is *correctness*, not throughput.**
    Decidable sign (Section 4 of `lazy-real-arithmetic.md`),
    precision-on-demand without committed-upfront budget, and access to

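To make the multiplication comparison above concrete, the sketch below contrasts a single FMA-based `two_prod` with a textbook two-component (double-double) product. It is an assumption-laden illustration: `two_prod_sketch`, `fast_two_sum_sketch`, and `dd_mul_sketch` are hypothetical names, and the double-double routine is a standard textbook form, not necessarily the algorithm `ereal<N>` uses.

```cpp
#include <cassert>
#include <cmath>

struct dd { double hi; double lo; };  // value = hi + lo

// Lazy-side core: one multiply and one FMA.
// hi = fl(a * b), lo = a * b - hi exactly (barring overflow/underflow).
inline dd two_prod_sketch(double a, double b) {
    double hi = a * b;
    double lo = std::fma(a, b, -hi);
    return {hi, lo};
}

// Renormalisation step; requires |a| >= |b|.
inline dd fast_two_sum_sketch(double a, double b) {
    double s = a + b;
    double e = b - (s - a);
    return {s, e};
}

// Eager two-component product: the same two_prod, plus cross terms and a
// renormalisation. The extra work grows with the number of components.
inline dd dd_mul_sketch(dd x, dd y) {
    dd p = two_prod_sketch(x.hi, y.hi);
    double cross = x.hi * y.lo + x.lo * y.hi;  // first-order cross terms
    return fast_two_sum_sketch(p.hi, p.lo + cross);
}
```

For example, with `a = 1 + 2^-30`, `two_prod_sketch(a, a)` yields `hi = 1 + 2^-29` and `lo = 2^-60`, which together represent the exact square.
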
diff --git a/include/sw/universal/number/elreal/elreal_impl.hpp b/include/sw/universal/number/elreal/elreal_impl.hpp
@@ -31,7 +31,8 @@
 //    The value is stored as a memoized stream of double-precision components
 //    plus a generator that produces successive components on demand:
 //
-//        mutable std::vector<double>               _components;
+//        mutable lazy_component_buffer             _components;
+//             (4-double inline + spill via std::vector; Phase K.1, #905)
 //        mutable std::function<double(std::size_t)> _generator;   // (Phase C)
 //        mutable std::size_t                        _computed_depth = 0;
 //
@@ -128,10 +129,11 @@
 //   - Refinement budget is a per-call argument with a sensible default
 //     (`elreal_default_budget = 8` components, ~424 bits cumulative).
 //
-// Triviality is *not* claimed: the type contains `std::vector` and
-// `std::function`, neither of which is trivially constructible. Universal's
-// library-wide `ReportTrivialityOfType` is reported, not asserted, for
-// elastic types -- consistent with `ereal`.
+// Triviality is *not* claimed: the type contains `lazy_component_buffer`
+// (which itself holds a `std::vector` for spill) and `std::function`,
+// neither of which is trivially constructible. Universal's library-wide
+// `ReportTrivialityOfType` is reported, not asserted, for elastic types
+// -- consistent with `ereal`.
 //
 // Deferred to later phases:
 //   - Math functions (Phase E, #878)
@@ -157,6 +159,7 @@
 
 #include <universal/number/elreal/exceptions.hpp>
 #include <universal/number/elreal/elreal_fwd.hpp>
+#include <universal/number/elreal/lazy_component_buffer.hpp>
 #include <universal/number/shared/specific_value_encoding.hpp>
 #include <universal/numerics/error_free_ops.hpp>
 
@@ -396,9 +399,12 @@ class elreal {
 		return s;
 	}
 
-	// Component access for tests and inspection. Returns a copy; the
-	// underlying vector is not mutable through this accessor.
-	const std::vector<double>& components() const noexcept { return _components; }
+	// Component access for tests and inspection. The underlying buffer
+	// is not mutable through this accessor. Phase K.1 (#905) changed the
+	// storage from std::vector<double> to lazy_component_buffer; for
+	// read access the buffer exposes size() and operator[], which is
+	// what all known callers use.
+	const lazy_component_buffer& components() const noexcept { return _components; }
 
 	bool iszero() const noexcept { return _computed_depth == 0 || double(*this) == 0.0; }
 
@@ -438,7 +444,9 @@ class elreal {
 	elreal operator-() const {
 		elreal result;
 		result._components.reserve(_components.size());
-		for (double c : _components) result._components.push_back(-c);
+		for (std::size_t i = 0; i < _components.size(); ++i) {
+			result._components.push_back(-_components[i]);
+		}
 		result._computed_depth = _computed_depth;
 		if (_generator) {
 			auto gen_cap = _generator;
@@ -457,7 +465,7 @@ class elreal {
 private:
 	// The lazy stream of components. Mutable because refinement is invoked
 	// in const contexts (comparison, decode-to-double).
-	mutable std::vector<double> _components;
+	mutable lazy_component_buffer _components;
 
 	// The high-water mark of materialised components. Phase A always has
 	// _computed_depth == _components.size(); Phase C may pre-allocate
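For readers without `lazy_component_buffer.hpp` at hand, the storage change in this diff can be pictured with the following minimal sketch of a 4-double inline buffer that spills to `std::vector`. The class name `small_component_buffer`, its members, and the `spilled()` helper are hypothetical; only the `size()`, `operator[]`, `reserve()`, and `push_back()` calls are taken from how the diff uses the real buffer, and their exact signatures are assumed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for lazy_component_buffer; not the library's code.
class small_component_buffer {
public:
    static constexpr std::size_t inline_capacity = 4;

    std::size_t size() const noexcept { return size_; }

    // Only the part beyond the inline slots ever needs heap capacity.
    void reserve(std::size_t n) {
        if (n > inline_capacity) spill_.reserve(n - inline_capacity);
    }

    void push_back(double c) {
        if (size_ < inline_capacity) inline_[size_] = c;  // common case: no heap traffic
        else spill_.push_back(c);                          // overflow goes to the heap
        ++size_;
    }

    double operator[](std::size_t i) const noexcept {
        assert(i < size_);
        return i < inline_capacity ? inline_[i] : spill_[i - inline_capacity];
    }

    // Illustrative helper: true once any component lives on the heap.
    bool spilled() const noexcept { return !spill_.empty(); }

private:
    std::array<double, inline_capacity> inline_{};
    std::vector<double> spill_;  // an empty std::vector does not allocate
    std::size_t size_ = 0;
};
```

A buffer like this has no iterators, which is consistent with the index-based loop that `operator-` adopts in this diff: a negation only needs `size()` and `operator[]` on the source and `push_back()` on the result.
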