perf(elreal): Phase K.1 -- small-buffer optimisation on _components#912
Conversation
Phase K.1 of follow-up epic #903. First sub-phase of Phase K (#905): allocator hot-path optimisation. The Phase I baseline identified per-operator vector allocation as the dominant cost; this PR closes that part of the gap. What landed ----------- - include/sw/universal/number/elreal/lazy_component_buffer.hpp -- new small-buffer storage type: 4-double inline array + spill via std::vector. 64 bytes total (one cache line). API surface is the minimum elreal needs: push_back, operator[], size, clear, reserve. - include/sw/universal/number/elreal/elreal_impl.hpp -- migrated _components from std::vector<double> to lazy_component_buffer. The components() accessor now returns const lazy_component_buffer&; the single external user in manipulators.hpp uses .size() and operator[] which are unchanged. - docs/algorithmic-details/elreal-performance-baseline.md -- new headline-numbers table for the post-K.1 measurements. Phase I baseline numbers retained for the before/after comparison. - docs/algorithmic-details/multi-component-arithmetic.md -- section 7.1 picker rule updated: multiplication now favors elreal over ereal<2> at matched precision (was 1.2x ereal-favored, now 1.9x elreal-favored). Headline numbers (gcc 13.3 on 12th Gen Intel i7-12700K) -------------------------------------------------------- | Op | Phase I | Phase K.1 | Speedup | |-----------------|--------:|----------:|--------:| | elreal + d1 | 9 Mops | 12 Mops | 1.3x | | elreal - d1 | 9 Mops | 14 Mops | 1.6x | | elreal * d1 | 8 Mops | 19 Mops | 2.4x | | elreal / d0 | 36 Mops | (gops) | dominated by inlining once heap alloc is gone | | elreal sqrt d1 | 14 Mops | 30 Mops | 2.1x | | elreal exp d1 | 14 Mops | 31 Mops | 2.2x | | elreal log d1 | 14 Mops | 24 Mops | 1.7x | clang 18.1 shows even larger gains (the Phase I clang-vs-gcc gap on multiplication, 4 vs 8 Mops/s, is closed -- both now land at ~ 20 Mops/s). The crossover with ereal<2> at matched precision: - Addition and subtraction: ereal<2> still wins by 1.4-2x (was 2-2.7x). - Multiplication: elreal now beats ereal<2> by 1.9x. - Division: not apples-to-apples (elreal depth-0 only until Phase L). - sqrt / exp / log / pow / trig / hyperbolic: elreal-only (ereal has no math functions). Validation ---------- All 30 elreal regression tests PASS under both gcc 13.3 and clang 18.1. Phase J oracle sweep across dd / qd / dd_cascade / td_cascade / qd_cascade / ereal<N> PASS under both compilers. What this does NOT do --------------------- - Eliminate the std::function generator capture. That's Phase K.2. - Eliminate the operand copies into the lambda capture. That's Phase K.3 (refcounted operand sharing). - Add SIMD/FMA batching to the EFTs. That's Phase K.4. Part of #905 (Phase K of the elreal follow-up epic #903). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThis PR implements Phase K.1: introduces a small-buffer-optimized lazy_component_buffer, migrates elreal's _components from std::vector to that buffer (updating code and APIs), and updates performance documentation with post-K.1 benchmark results and revised operator crossover narratives. ChangesPhase K.1 _components Buffer Optimization
Sequence Diagram(s)sequenceDiagram
participant Caller
participant elreal
participant lazy_component_buffer as buffer
Caller->>elreal: components()
elreal->>buffer: return const ref
Caller->>buffer: size()
buffer-->>Caller: size
Caller->>buffer: operator[](i)
buffer-->>Caller: double value (inline or spill)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
include/sw/universal/number/elreal/lazy_component_buffer.hpp (1)
96-98: ⚡ Quick winAdd a debug bounds check in
operator[].This accessor currently allows silent UB on stale/out-of-range indexes. A debug-only assert keeps zero-cost release behavior while catching misuse quickly.
Proposed patch
+#include <cassert> ... double operator[](std::size_t i) const noexcept { + assert(i < _size && "lazy_component_buffer index out of range"); return (i < inline_capacity) ? _inline[i] : _spill[i - inline_capacity]; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@include/sw/universal/number/elreal/lazy_component_buffer.hpp` around lines 96 - 98, Add a debug-only bounds check to the const subscript operator[] to prevent silent UB: inside double operator[](std::size_t i) const noexcept (the function using inline_capacity, _inline and _spill) assert that i is less than the current element count (e.g., assert(i < size()) or assert(i < _size) depending on the class member that tracks length); include <cassert> and keep the function noexcept so this only affects debug builds and preserves zero-cost release behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@include/sw/universal/number/elreal/lazy_component_buffer.hpp`:
- Around line 96-98: Add a debug-only bounds check to the const subscript
operator[] to prevent silent UB: inside double operator[](std::size_t i) const
noexcept (the function using inline_capacity, _inline and _spill) assert that i
is less than the current element count (e.g., assert(i < size()) or assert(i <
_size) depending on the class member that tracks length); include <cassert> and
keep the function noexcept so this only affects debug builds and preserves
zero-cost release behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5913b625-0ed2-46c0-940f-fbea0e2d795d
📒 Files selected for processing (4)
docs/algorithmic-details/elreal-performance-baseline.mddocs/algorithmic-details/multi-component-arithmetic.mdinclude/sw/universal/number/elreal/elreal_impl.hppinclude/sw/universal/number/elreal/lazy_component_buffer.hpp
Add debug-only bounds check in lazy_component_buffer::operator[]. CodeRabbit flagged the missing assert as a quick win: zero-cost in release (NDEBUG strips the assert), catches misuse early in debug builds. Applied directly per the proposed patch. Validation: all sampled elreal tests PASS under gcc 13.3 and clang 18.1 with the assert added (the invariant `i < size()` already held across all elreal indexing sites; the assert just makes it explicit). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Addressed the single CodeRabbit nitpick in 329c989: added The invariant Validation: all sampled elreal tests (api, arithmetic, math, constants, oracle sweep) PASS under both gcc 13.3 and clang 18.1. |
Coverage Report for CI Build 26257273704Coverage increased (+0.01%) to 84.245%Details
Uncovered ChangesNo uncovered changes found. Coverage Regressions2 previously-covered lines in 1 file lost coverage.
Coverage Stats
💛 - Coveralls |
…std::function Phase K.2 of follow-up epic #903. The Phase I baseline identified the per-op std::function heap allocation (~216-byte capture per binary op) as a major cost. K.1 (#912) closed the _components vector alloc; this PR closes the _generator function-object alloc. Design (#905 K.2 + K.3 combined) -------------------------------- elreal's _generator field migrates from `std::function<double(std::size_t)>` to a `std::variant` of small POD shapes: gen_unary_linear (1 handle + 1 double = 24 bytes) coeff * a.at(1) at depth 1; covers all 19 math functions gen_binary_linear (2 handles + 3 doubles = 56 bytes) c0 + ca*a.at(1) + cb*b.at(1); covers +, -, *, pow, atan2 gen_sqrt (1 handle + 1 double + 1 size_t = 32 bytes) sqrt-specific EFT residual gen_unary_neg (1 handle = 16 bytes) trampoline through wrapped.at(k) gen_rational_residual (3 doubles = 24 bytes) elreal(p, q) constructor residual std::monostate (0 bytes) default; depth-0-only result Operand captures are std::shared_ptr<const elreal> (16 bytes each) so the variant fits comfortably inline. No std::function, no heap allocation per generator state. Files ----- - include/sw/universal/number/elreal/elreal_data.hpp -- NEW. Variant alternative types + lazy_generator alias + evaluate_generator forward declaration. - include/sw/universal/number/elreal/elreal_impl.hpp -- _generator migrated. All ~25 operator/math function generators rewritten as variant emplacements. at() walks via evaluate_generator. The evaluator is defined after the class so each alternative can call elreal::at() on its operand handle. - docs/algorithmic-details/elreal-performance-baseline.md -- K.2 numbers added. K.1 numbers retained for the before/after comparison. - docs/algorithmic-details/multi-component-arithmetic.md -- section 7.1 picker rule updated: arithmetic gap with ereal<2> has nearly closed. Results ------- 12th Gen Intel i7-12700K, gcc 13.3, -O3: | Op | Phase I | K.1 | K.2 | vs Phase I | |---------------|--------:|--------:|--------:|-----------:| | elreal + d1 | 9 Mops | 12 Mops | 17 Mops | 1.9x | | elreal - d1 | 9 Mops | 14 Mops | 17 Mops | 1.9x | | elreal * d1 | 8 Mops | 19 Mops | 16 Mops | 2.0x | | elreal sqrt | 14 Mops | 30 Mops | 36 Mops | 2.6x | | elreal exp | 14 Mops | 31 Mops | 43 Mops | 3.1x | | elreal log | 14 Mops | 24 Mops | 38 Mops | 2.7x | Versus ereal<2> at matched precision: + (17 vs 19, ereal +12%), - (17 vs 20, ereal +18%), * (16 vs 11, elreal +45%). The arithmetic gap with ereal<2> on +/- has shrunk from 2x at K.1 to within ~20% at K.2. Multiplication still favors elreal. Validation ---------- All 30 elreal regression tests PASS under gcc 13.3 and clang 18.1. Phase J oracle sweep across dd / qd / dd_cascade / td_cascade / qd_cascade / ereal<N> PASS under both compilers. No public API changes (elreal looks the same to existing callers). What this does NOT do --------------------- - Convert elreal itself to a shared-ptr handle (would shrink sizeof(elreal) from 104 to 16 bytes at the cost of an additional indirection on every access). Trade-off not clearly worth it given the current numbers; deferred. - Address per-op make_shared<const elreal>(a) operand wrapping. Each binary op still pays 2 small heap allocs (one per operand). On modern allocators with small-object pools this is typically cheap; measured throughput is excellent. - Phase K.4 (SIMD batching on EFT primitives). Part of #905 (Phase K of the elreal follow-up epic #903). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…std::function (#916) * fix(cmake): support FetchContent in dependent repo workflows * perf(elreal): Phase K.2 -- tagged-union generator (variant) replaces std::function Phase K.2 of follow-up epic #903. The Phase I baseline identified the per-op std::function heap allocation (~216-byte capture per binary op) as a major cost. K.1 (#912) closed the _components vector alloc; this PR closes the _generator function-object alloc. Design (#905 K.2 + K.3 combined) -------------------------------- elreal's _generator field migrates from `std::function<double(std::size_t)>` to a `std::variant` of small POD shapes: gen_unary_linear (1 handle + 1 double = 24 bytes) coeff * a.at(1) at depth 1; covers all 19 math functions gen_binary_linear (2 handles + 3 doubles = 56 bytes) c0 + ca*a.at(1) + cb*b.at(1); covers +, -, *, pow, atan2 gen_sqrt (1 handle + 1 double + 1 size_t = 32 bytes) sqrt-specific EFT residual gen_unary_neg (1 handle = 16 bytes) trampoline through wrapped.at(k) gen_rational_residual (3 doubles = 24 bytes) elreal(p, q) constructor residual std::monostate (0 bytes) default; depth-0-only result Operand captures are std::shared_ptr<const elreal> (16 bytes each) so the variant fits comfortably inline. No std::function, no heap allocation per generator state. Files ----- - include/sw/universal/number/elreal/elreal_data.hpp -- NEW. Variant alternative types + lazy_generator alias + evaluate_generator forward declaration. - include/sw/universal/number/elreal/elreal_impl.hpp -- _generator migrated. All ~25 operator/math function generators rewritten as variant emplacements. at() walks via evaluate_generator. The evaluator is defined after the class so each alternative can call elreal::at() on its operand handle. - docs/algorithmic-details/elreal-performance-baseline.md -- K.2 numbers added. K.1 numbers retained for the before/after comparison. - docs/algorithmic-details/multi-component-arithmetic.md -- section 7.1 picker rule updated: arithmetic gap with ereal<2> has nearly closed. Results ------- 12th Gen Intel i7-12700K, gcc 13.3, -O3: | Op | Phase I | K.1 | K.2 | vs Phase I | |---------------|--------:|--------:|--------:|-----------:| | elreal + d1 | 9 Mops | 12 Mops | 17 Mops | 1.9x | | elreal - d1 | 9 Mops | 14 Mops | 17 Mops | 1.9x | | elreal * d1 | 8 Mops | 19 Mops | 16 Mops | 2.0x | | elreal sqrt | 14 Mops | 30 Mops | 36 Mops | 2.6x | | elreal exp | 14 Mops | 31 Mops | 43 Mops | 3.1x | | elreal log | 14 Mops | 24 Mops | 38 Mops | 2.7x | Versus ereal<2> at matched precision: + (17 vs 19, ereal +12%), - (17 vs 20, ereal +18%), * (16 vs 11, elreal +45%). The arithmetic gap with ereal<2> on +/- has shrunk from 2x at K.1 to within ~20% at K.2. Multiplication still favors elreal. Validation ---------- All 30 elreal regression tests PASS under gcc 13.3 and clang 18.1. Phase J oracle sweep across dd / qd / dd_cascade / td_cascade / qd_cascade / ereal<N> PASS under both compilers. No public API changes (elreal looks the same to existing callers). What this does NOT do --------------------- - Convert elreal itself to a shared-ptr handle (would shrink sizeof(elreal) from 104 to 16 bytes at the cost of an additional indirection on every access). Trade-off not clearly worth it given the current numbers; deferred. - Address per-op make_shared<const elreal>(a) operand wrapping. Each binary op still pays 2 small heap allocs (one per operand). On modern allocators with small-object pools this is typically cheap; measured throughput is excellent. - Phase K.4 (SIMD batching on EFT primitives). Part of #905 (Phase K of the elreal follow-up epic #903). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * perf(elreal): address CodeRabbit review on PR #916 Three findings, all valid: 1. elreal-performance-baseline.md mixed pre-K.1 narrative with K.2 current numbers. Restructured the "Reading the table" and "When is elreal faster than ereal" sections to reflect K.2-current state. Moved the pre-K.1 analysis into a new "Historical: Phase I baseline analysis (pre-K.1)" section for archival reference. 2. multi-component-arithmetic.md section 7.1 heading said "Phase I baseline" but the table held post-K.2 numbers. Updated heading to "post-Phase-K.2" and rewrote the body paragraph that referenced "Phase II of epic #873 targets the allocation hot path" -- K.1 and K.2 have shipped, so the picker rule conclusion is now "elreal is throughput-competitive AND exposes correctness features ereal<N> lacks". 3. fetch-content-architecture.md contained em-dashes (U+2014) that violate the project's ASCII-only rule. Replaced with -- per the convention used elsewhere in the docs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…#919) Two related changes: 1. The lazy_component_buffer member `_inline` collides with the MSVC- specific `_inline` reserved keyword (a legacy alias for `inline` that MSVC still treats specially in some parses). Symptom: MSVC's parser sees `_inline{}` in the constructor's member-initializer list and interprets it as `inline {}`, cascading into a string of C2059 syntax errors. gcc and clang accept the original name. Renamed to `_inl_buf` (the inline buffer) -- non-reserved, descriptive, mechanical search/replace. 2. UNIVERSAL_BUILD_CI_LITE (the lite CI tier used by the Windows MSVC CI runner) did not enable UNIVERSAL_BUILD_NUMBER_ELREALS. As a result the K.1 bug above shipped in PR #912 and stayed undetected on main. Added ELREALS to CI_LITE so subsequent MSVC-specific issues in elreal land at PR-review time, not after merge. Background ---------- The CI matrix in `.github/workflows/cmake.yml` separates: - Linux gcc + clang: UNIVERSAL_BUILD_CI=ON (full) - Windows MSVC, macOS, cross-compilers: UNIVERSAL_BUILD_CI_LITE=ON The lite tier exists to keep the Windows runner fast (~ 8 min vs ~ 25 min for full). It includes one representative type from each major category: integer, fixpnt, cfloat, posit, lns, one cascade, the MX block formats, one elastic (einteger). It did NOT include the lazy elreal -- the recent multi-PR follow-up epic #903 grew elreal but the CI matrix wasn't updated to track it. MinGW cross-compilation, also in the CI matrix, uses gcc as the compiler (just targeting Windows) and so doesn't surface the MSVC `_inline` keyword issue. Only the actual MSVC build catches it. Validation ---------- Local gcc 13.3 and clang 18.1 builds pass after the rename (behavior unchanged; just a member-name swap). The CI matrix change adds elreal coverage to the Windows MSVC runner, which would have caught this bug at the K.1 (#912) PR-review stage. Fixes the MSVC build break observed at HEAD of main as of 2026-05-22.
Summary
Phase K.1 of follow-up epic #903. First sub-phase of Phase K (#905): allocator hot-path optimisation. The Phase I baseline identified per-operator vector allocation as the dominant cost; this PR closes that part of the gap.
K is a multi-PR phase. K.1 is the first and largest payoff item per the Phase I baseline doc: replace `std::vector` storage with a small-buffer-optimised type so the common case (depth 1-4 components) pays no heap allocation.
What landed
Headline numbers (gcc 13.3 on 12th Gen Intel i7-12700K)
clang 18.1 shows even larger gains -- the Phase I clang-vs-gcc gap on multiplication (4 vs 8 Mops/s) is closed; both compilers now land at ~ 20 Mops/s.
Crossover with `ereal<2>` at matched precision
Test plan
What this PR does NOT do
Part of #905 (Phase K of follow-up epic #903).
🤖 Generated with Claude Code
Summary by CodeRabbit
Documentation
Refactor