129 changes: 92 additions & 37 deletions docs/algorithmic-details/elreal-performance-baseline.md
@@ -1,10 +1,18 @@
# elreal Performance Baseline

Phase I of epic #873. This document is a baseline measurement -- not a
performance target. The numbers below come from a single workstation and
are intended to identify the cost shape of the shipped implementation, so
that future optimisation work has a starting point and a way to measure
progress.
Phase I of epic #873 (baseline) and Phase K.1 of follow-up epic #903
(small-buffer optimisation on `_components`). This document is a
baseline measurement -- not a performance target. The numbers below
come from a single workstation and are intended to identify the cost
shape of the shipped implementation, so that future optimisation work
has a starting point and a way to measure progress.
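
The small-buffer idea behind the `_components` change can be sketched as follows. This is a minimal illustration, not the shipped `lazy_component_buffer`: the class name `small_double_buffer`, the `spilled()` helper, and the exact member layout are assumptions made for this example. It only shows the general shape of "four inline doubles, then spill to a `std::vector`".

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a small-buffer component store.
// The real lazy_component_buffer may differ in interface and layout.
class small_double_buffer {
public:
    static constexpr std::size_t inline_capacity = 4;

    std::size_t size() const noexcept { return _size; }

    // True once at least one element lives in the heap-backed spill vector.
    bool spilled() const noexcept { return _size > inline_capacity; }

    // Only requests beyond the inline slots touch the spill vector.
    void reserve(std::size_t n) {
        if (n > inline_capacity) _spill.reserve(n - inline_capacity);
    }

    void push_back(double v) {
        if (_size < inline_capacity) {
            _inline[_size] = v;      // common case: no heap traffic
        } else {
            _spill.push_back(v);     // rare case: depth beyond 4
        }
        ++_size;
    }

    // Caller must pass i < size(); no bounds checking in this sketch.
    double operator[](std::size_t i) const noexcept {
        return i < inline_capacity ? _inline[i] : _spill[i - inline_capacity];
    }

private:
    std::array<double, inline_capacity> _inline{};
    std::vector<double> _spill;      // stays empty in the common case
    std::size_t _size = 0;
};
```

With this layout, a result holding one to four components never calls the allocator for its component storage, which is the per-operation cost the optimisation targets.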

> **Phase K.1 update (#905)**: The `_components` storage migrated from
> `std::vector<double>` to a small-buffer-optimised
> `lazy_component_buffer` (inline 4 doubles + spill, see
> `include/sw/universal/number/elreal/lazy_component_buffer.hpp`).
> The headline numbers tables below carry both the original Phase I
> baseline and the post-K.1 measurements.

## Measurement setup

@@ -18,11 +26,12 @@ progress.
pattern (`a = a + b`) would build up. Per-operation cost is reported
via `PerformanceRunner` from `include/sw/universal/benchmark/performance_runner.hpp`.

## Headline numbers
## Headline numbers (Phase I baseline -- pre-K.1)

Throughput in operations per second, rounded. Same workload, two compilers.
Both elreal and ereal<N> workloads construct fresh operands inside the
loop body so the per-iteration allocation pattern matches between sides:
loop body so the per-iteration allocation pattern matches between sides.
These were the numbers before the K.1 small-buffer optimisation:

| Operation | Budget | gcc 13.3 | clang 18.1 |
|---|---|---:|---:|
@@ -43,10 +52,41 @@ loop body so the per-iteration allocation pattern matches between sides:
| `ereal<4> /` | --- | 639 Kops/s | 721 Kops/s |
| `ereal<8> /` | --- | 636 Kops/s | 725 Kops/s |

The two compilers track within ~10-50% on the elreal arithmetic (clang
is notably slower on `elreal *`; see the cost-shape discussion below)
and within ~5% on ereal. Below, we use the gcc 13.3 numbers as the
reference unless otherwise noted.
## Headline numbers (post-K.1, current)

After the K.1 small-buffer optimisation. Same workload, same hardware,
same compilers. ereal numbers are unchanged from above (K.1 only touched
elreal):

| Operation | Budget | gcc 13.3 | clang 18.1 | vs Phase I (gcc) |
|---|---|---:|---:|---:|
| `elreal +` | depth 0 | 16 Mops/s | 17 Mops/s | **1.8x faster** |
| `elreal +` | depth 1 | 12 Mops/s | 21 Mops/s | **1.3x faster** |
| `elreal -` | depth 1 | 14 Mops/s | 17 Mops/s | **1.6x faster** |
| `elreal *` | depth 1 | 19 Mops/s | 22 Mops/s | **2.4x faster** |
| `elreal /` | depth 0 | 1 Gops/s | 138 Mops/s | dominated by compiler inlining once heap alloc is gone (see note) |
| `elreal sqrt` | depth 1 | 30 Mops/s | 30 Mops/s | **2.1x faster** |
| `elreal exp` | depth 1 | 31 Mops/s | 34 Mops/s | **2.2x faster** |
| `elreal log` | depth 1 | 24 Mops/s | 28 Mops/s | **1.7x faster** |
| `elreal + refine_to(106)` | --- | 13 Mops/s | 15 Mops/s | 1.4x |
| `elreal + refine_to(212)` | --- | 11 Mops/s | 17 Mops/s | 1.4x |

The two compilers no longer differ materially on most operators -- both
land in the same 12-22 Mops/s range for arithmetic. The clang gap on
`elreal *` that the Phase I baseline flagged (4 vs 8 Mops/s) is closed.

The `elreal /` Gops/s result is genuine in the workload but worth
flagging: with the inline-buffer change, the result `elreal` is fully
stack-allocatable, and `elreal::operator/` happens to be the simplest
operator (single double divide, no captured generator -- depth-2+
Newton refinement is deferred to Phase L #906). gcc inlines the whole
operator and the only remaining work is the double divide itself. In a
workload where the result needs to be propagated into a more complex
expression, the throughput drops back to the same range as the other
operators.

Below, we use the gcc 13.3 post-K.1 numbers as the reference unless
otherwise noted.

## Reading the table

@@ -126,12 +166,27 @@ in elreal first.

## When is `elreal` faster than `ereal`?

At today's depth-1 cap, `elreal` is essentially never faster than `ereal`
on the elementary arithmetic when measured in raw ops per second.
`ereal<2>` wins at `+ - *` by a factor of 1.2x to 3x at matched
precision, and the only "win" for `elreal` is on division, where the
lazy shortcut at depth 0 produces a misleading apples-to-oranges
result against ereal's iterative full-precision division.
The picture changed materially with K.1:

| Op | `elreal` post-K.1 (gcc) | `ereal<2>` (gcc) | Winner |
|---|---:|---:|---|
| `+` | 12 Mops/s | 24 Mops/s | `ereal<2>` (~ 2x) |
| `-` | 14 Mops/s | 19 Mops/s | `ereal<2>` (~ 1.4x) |
| `*` | 19 Mops/s | 10 Mops/s | **`elreal` (~ 1.9x)** |
| `/` | (apples-to-oranges) | 650 Kops/s | -- |
| `sqrt`, `exp`, `log` | 24-31 Mops/s | n/a | `elreal` only |

Multiplication has flipped: `elreal *` now beats `ereal<2> *` at matched
precision because `ereal<N>` multiplication is O(N) in the eager
expansion product while `elreal *` is essentially a single `two_prod`
plus the (now inline) result envelope.
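
For readers unfamiliar with the error-free transformations named here, the following is a minimal sketch of `two_sum` and `two_prod` on doubles. It uses the textbook formulations (Knuth's branch-free sum and an FMA-based product); the `sketch` namespace and the `eft_result` struct are hypothetical names for this example, and the library's own versions in `error_free_ops.hpp` may have different signatures. The identities hold barring overflow and underflow, and assume the compiler is not run with value-changing floating-point flags such as `-ffast-math`.

```cpp
#include <cmath>

namespace sketch {

// value is the rounded result; error is the exact rounding error,
// so value + error equals the true result in exact arithmetic.
struct eft_result {
    double value;
    double error;
};

// Knuth's branch-free two_sum: six floating-point operations.
inline eft_result two_sum(double a, double b) {
    const double s  = a + b;
    const double bb = s - a;                       // the part of b that made it into s
    const double e  = (a - (s - bb)) + (b - bb);   // what rounding discarded
    return {s, e};
}

// FMA-based two_prod: one multiply plus one fused multiply-add.
inline eft_result two_prod(double a, double b) {
    const double p = a * b;
    const double e = std::fma(a, b, -p);           // exact a*b minus the rounded p
    return {p, e};
}

}  // namespace sketch
```

The product costs two floating-point instructions regardless of how the result is later refined, which is consistent with `elreal *` being cheap once the result envelope no longer allocates.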

Addition and subtraction still favour `ereal<2>` -- those operators
are O(1) in `ereal<N>` for small N and the per-iteration cost is
dominated by the result construction, which `ereal<N>` already amortises
better than the lazy-stream envelope. The remaining gap is what
Phase K.2 (`std::function` -> tagged-union generator) and Phase K.3
(reference-counted operand sharing) target.

What `elreal` gives you instead, and what `ereal` cannot:

@@ -159,34 +214,34 @@ Profile-guided observation: the bottleneck on every arithmetic operator
is the same triple of `_components` vector allocation, `_generator`
function-object packing, and the input copies that go into that pack.

Concrete Phase II candidates, in order of expected payoff:
Concrete Phase K candidates from the follow-up epic (#903), in order of
expected payoff:

1. **Small-buffer optimisation on `_components`.** A `std::vector<double>`
of size 1-2 is the common case. A `small_vector<double, 4>` (or a
`std::array<double, K>` with a runtime `size_`) would eliminate one
allocation per op for the typical case. This is the single largest
item on the list.
2. **Generator type erasure that avoids heap.** `std::function` always
heap-allocates when the capture exceeds the SBO. Replacing
1. ~~**Small-buffer optimisation on `_components`**~~ -- **DONE in K.1**.
`std::vector<double>` replaced with `lazy_component_buffer` (inline 4
doubles + spill via `std::vector`). The common case (depth 1-4)
pays no heap allocation. Achieved 1.3-2.4x speedup across
arithmetic and math at matched precision.
2. **Generator type erasure that avoids heap** (Phase K.2). `std::function`
typically heap-allocates when the capture exceeds its small-buffer storage. Replacing
`std::function<double(std::size_t)>` with a tagged-union or
intrusive-list-of-known-shapes representation would eliminate the
second per-op allocation. The known shapes are small: depth-1 EFT
residuals, depth-1 derivative corrections, constants, degenerate
(all-zero). Each is a fixed-size POD.
3. **Reference-counted operand sharing.** The lambda captures *copies*
of both inputs. Switching to `std::shared_ptr<const Components>`
3. **Reference-counted operand sharing** (Phase K.3). The lambda captures
*copies* of both inputs. Switching to `std::shared_ptr<const Components>`
would let multiple results share an ancestor without copying the
component vector. This becomes more valuable once Phase II depth-2+
component vector. This becomes more valuable once Phase L's depth-2+
generators chain back to an ancestor.
4. **SIMD/FMA on `two_sum` and `two_prod` batches.** Once the
allocation cost is shrunk, the EFT primitives become a non-trivial
fraction of the loop. A batch interface that processes 4-8 EFTs in
one SIMD pass would help reductions and dot products. This is a
later step -- meaningful only after the allocator hot path is
addressed.

None of these are committed to a specific Phase II PR by this baseline;
they are the natural ordering for follow-up work.
4. **SIMD/FMA on `two_sum` and `two_prod` batches** (Phase K.4). Once
the allocation cost is shrunk, the EFT primitives become a
non-trivial fraction of the loop. A batch interface that processes
4-8 EFTs in one SIMD pass would help reductions and dot products.

K.2 is the natural next target now that K.1 has shrunk the
component-vector allocation: the `std::function` capture is the
remaining per-op allocation cost.
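
One possible shape for the K.2 tagged-union generator is sketched below using `std::variant`, which stores its active alternative in place and never allocates. Everything here is an assumption for illustration: the shape names (`zero_gen`, `constant_gen`, `sum_residual_gen`, `prod_residual_gen`), the `component` function, and the convention that index 1 is the first correction term are not taken from the elreal source. The sketch only models depth 1.

```cpp
#include <cmath>
#include <cstddef>
#include <variant>

namespace sketch {

// Hypothetical fixed-size shapes; each is a plain aggregate of doubles.
struct zero_gen {};                          // degenerate: all further components are zero
struct constant_gen      { double tail; };   // precomputed correction of a constant
struct sum_residual_gen  { double a, b; };   // depth-1 residual of a + b
struct prod_residual_gen { double a, b; };   // depth-1 residual of a * b

using generator =
    std::variant<zero_gen, constant_gen, sum_residual_gen, prod_residual_gen>;

inline double eval(const zero_gen&) { return 0.0; }
inline double eval(const constant_gen& c) { return c.tail; }
inline double eval(const sum_residual_gen& r) {
    const double s  = r.a + r.b;
    const double bb = s - r.a;
    return (r.a - (s - bb)) + (r.b - bb);    // two_sum error term
}
inline double eval(const prod_residual_gen& r) {
    return std::fma(r.a, r.b, -(r.a * r.b)); // two_prod error term
}

// Same call shape as std::function<double(std::size_t)>, minus the heap.
// Assumption: k == 1 is the first correction; deeper indices yield 0 here.
inline double component(const generator& g, std::size_t k) {
    if (k != 1) return 0.0;
    return std::visit([](const auto& shape) { return eval(shape); }, g);
}

}  // namespace sketch
```

The trade-off against `std::function` is that the set of shapes is closed: adding a new generator kind means extending the variant, whereas a type-erased callable accepts any lambda at the cost of a possible allocation per capture.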

## Out of scope for Phase I

31 changes: 17 additions & 14 deletions docs/algorithmic-details/multi-component-arithmetic.md
@@ -635,25 +635,28 @@ clang 18.1, `-O3`). The full numbers and reproduction recipe live in
`docs/algorithmic-details/elreal-performance-baseline.md`. The summary
shape for picker purposes:

| Op | `elreal` (depth 1) | `ereal<2>` (~106 bits) | Winner |
| Op | `elreal` post-K.1 | `ereal<2>` (~106 bits) | Winner |
|---|---:|---:|---|
| `+` | ~9 Mops/s | ~24 Mops/s | `ereal<2>` (~ 2.7x) |
| `-` | ~9 Mops/s | ~19 Mops/s | `ereal<2>` (~ 2.1x) |
| `*` | ~8 Mops/s | ~10 Mops/s | `ereal<2>` (~ 1.2x) |
| `/` (elreal depth 0 only) | ~36 Mops/s | ~650 Kops/s | `elreal` (~ 55x; not apples-to-apples; ereal does full precision, elreal does double only) |
| `sqrt`, `exp`, `log` | ~14 Mops/s | n/a | `elreal` (ereal has no math functions) |
| `+` | ~12 Mops/s | ~24 Mops/s | `ereal<2>` (~ 2x) |
| `-` | ~14 Mops/s | ~19 Mops/s | `ereal<2>` (~ 1.4x) |
| `*` | ~19 Mops/s | ~10 Mops/s | **`elreal` (~ 1.9x)** |
| `/` (elreal depth 0 only) | (dominated by inlining) | ~650 Kops/s | `elreal` (apples-to-oranges; ereal does full precision, elreal does double only) |
| `sqrt`, `exp`, `log` | ~24-31 Mops/s | n/a | `elreal` (ereal has no math functions) |

(All numbers gcc 13.3 on a 12th Gen i7-12700K; both sides constructing
fresh operands inside the loop body. Reproduction: `make benchmark_elreal_performance`.)
(All numbers are gcc 13.3 on a 12th Gen i7-12700K, measured after Phase K.1 of #903;
both sides constructing fresh operands inside the loop body.
Reproduction: `make benchmark_elreal_performance`. Phase I baseline
numbers before K.1 are in `docs/algorithmic-details/elreal-performance-baseline.md`.)

Two reads from the table:

1. **At matched precision on elementary arithmetic, `ereal<2>` is the
throughput winner today** by a factor of 1.2-3x. `elreal`'s
lazy-stream envelope (vector allocation + std::function capture +
operand copy) is the bottleneck; it adds tens to hundreds of
nanoseconds per op that `ereal<N>` mostly avoids by carrying no
captured-generator lambda alongside its component storage.
1. **Multiplication has flipped: `elreal` now wins on `*` at matched
precision.** `ereal<N>` multiplication is O(N) in the eager expansion
product, while `elreal *` is essentially a single `two_prod` plus
the (post-K.1 inline) result envelope. Addition and subtraction
still favour `ereal<2>` because those operators are O(1) on the
eager side and the lazy-stream envelope's `std::function` capture
is still the dominant per-op cost (Phase K.2 target).
2. **What `elreal` actually wins on is *correctness*, not throughput.**
Decidable sign (Section 4 of `lazy-real-arithmetic.md`),
precision-on-demand without committed-upfront budget, and access to
28 changes: 18 additions & 10 deletions include/sw/universal/number/elreal/elreal_impl.hpp
@@ -31,7 +31,8 @@
// The value is stored as a memoized stream of double-precision components
// plus a generator that produces successive components on demand:
//
// mutable std::vector<double> _components;
// mutable lazy_component_buffer _components;
// (4-double inline + spill via std::vector; Phase K.1, #905)
// mutable std::function<double(std::size_t)> _generator; // (Phase C)
// mutable std::size_t _computed_depth = 0;
//
@@ -128,10 +129,11 @@
// - Refinement budget is a per-call argument with a sensible default
// (`elreal_default_budget = 8` components, ~424 bits cumulative).
//
// Triviality is *not* claimed: the type contains `std::vector` and
// `std::function`, neither of which is trivially constructible. Universal's
// library-wide `ReportTrivialityOfType` is reported, not asserted, for
// elastic types -- consistent with `ereal`.
// Triviality is *not* claimed: the type contains `lazy_component_buffer`
// (which itself holds a `std::vector` for spill) and `std::function`,
// neither of which is trivially constructible. Universal's library-wide
// `ReportTrivialityOfType` is reported, not asserted, for elastic types
// -- consistent with `ereal`.
//
// Deferred to later phases:
// - Math functions (Phase E, #878)
@@ -157,6 +159,7 @@

#include <universal/number/elreal/exceptions.hpp>
#include <universal/number/elreal/elreal_fwd.hpp>
#include <universal/number/elreal/lazy_component_buffer.hpp>
#include <universal/number/shared/specific_value_encoding.hpp>
#include <universal/numerics/error_free_ops.hpp>

@@ -396,9 +399,12 @@ class elreal {
return s;
}

// Component access for tests and inspection. Returns a copy; the
// underlying vector is not mutable through this accessor.
const std::vector<double>& components() const noexcept { return _components; }
// Component access for tests and inspection. The underlying buffer
// is not mutable through this accessor. Phase K.1 (#905) changed
// the storage from std::vector<double> to lazy_component_buffer;
// read access goes through size() and operator[], which is what all
// known callers use.
const lazy_component_buffer& components() const noexcept { return _components; }

bool iszero() const noexcept { return _computed_depth == 0 || double(*this) == 0.0; }

@@ -438,7 +444,9 @@ class elreal {
elreal operator-() const {
elreal result;
result._components.reserve(_components.size());
for (double c : _components) result._components.push_back(-c);
for (std::size_t i = 0; i < _components.size(); ++i) {
result._components.push_back(-_components[i]);
}
result._computed_depth = _computed_depth;
if (_generator) {
auto gen_cap = _generator;
@@ -457,7 +465,7 @@ class elreal {
private:
// The lazy stream of components. Mutable because refinement is invoked
// in const contexts (comparison, decode-to-double).
mutable std::vector<double> _components;
mutable lazy_component_buffer _components;

// The high-water mark of materialised components. Phase A always has
// _computed_depth == _components.size(); Phase C may pre-allocate