Skip to content

perf(elreal): Phase K.1 -- small-buffer optimisation on _components#912

Merged
Ravenwater merged 2 commits into
mainfrom
feat/elreal-phase-k1
May 21, 2026
Merged

perf(elreal): Phase K.1 -- small-buffer optimisation on _components#912
Ravenwater merged 2 commits into
mainfrom
feat/elreal-phase-k1

Conversation

@Ravenwater

@Ravenwater Ravenwater commented May 21, 2026

Copy link
Copy Markdown
Contributor

Summary

Phase K.1 of follow-up epic #903. First sub-phase of Phase K (#905): allocator hot-path optimisation. The Phase I baseline identified per-operator vector allocation as the dominant cost; this PR closes that part of the gap.

K is a multi-PR phase. K.1 is the first and largest payoff item per the Phase I baseline doc: replace `std::vector` storage with a small-buffer-optimised type so the common case (depth 1-4 components) pays no heap allocation.

What landed

  • `include/sw/universal/number/elreal/lazy_component_buffer.hpp` -- new small-buffer storage type: 4-double inline array + spill via `std::vector`. 64 bytes total (one cache line). API surface is the minimum elreal needs: `push_back`, `operator[]`, `size`, `clear`, `reserve`.
  • `include/sw/universal/number/elreal/elreal_impl.hpp` -- migrated `_components` from `std::vector` to `lazy_component_buffer`. The `components()` accessor now returns `const lazy_component_buffer&`; the single external user in `manipulators.hpp` uses `.size()` and `operator[]` which are unchanged.
  • Two range-for loops rewritten as indexed loops (the buffer does not provide iterators -- intentional, see the design header).
  • `docs/algorithmic-details/elreal-performance-baseline.md` -- new headline-numbers table for the post-K.1 measurements. Phase I baseline numbers retained for the before/after comparison.
  • `docs/algorithmic-details/multi-component-arithmetic.md` -- section 7.1 picker rule updated: multiplication now favors elreal over `ereal<2>` at matched precision (was 1.2x ereal-favored, now 1.9x elreal-favored).

Headline numbers (gcc 13.3 on 12th Gen Intel i7-12700K)

Op Phase I Phase K.1 Speedup
elreal + d1 9 Mops/s 12 Mops/s 1.3x
elreal - d1 9 Mops/s 14 Mops/s 1.6x
elreal * d1 8 Mops/s 19 Mops/s 2.4x
elreal / d0 36 Mops/s ~1 Gops/s dominated by inlining once heap alloc is gone
elreal sqrt d1 14 Mops/s 30 Mops/s 2.1x
elreal exp d1 14 Mops/s 31 Mops/s 2.2x
elreal log d1 14 Mops/s 24 Mops/s 1.7x

clang 18.1 shows even larger gains -- the Phase I clang-vs-gcc gap on multiplication (4 vs 8 Mops/s) is closed; both compilers now land at ~ 20 Mops/s.

Crossover with `ereal<2>` at matched precision

  • Addition and subtraction: ereal<2> still wins by 1.4-2x (was 2-2.7x). The remaining gap is the `std::function` generator capture, which Phase K.2 targets.
  • Multiplication: elreal now beats ereal<2> by 1.9x (was 1.2x ereal-favored). `ereal` multiplication is O(N) in the eager expansion product; `elreal *` is essentially a single `two_prod` plus the now-inline result envelope.
  • Division: not apples-to-apples (elreal depth-0 only until Phase L elreal Phase L: Depth-2+ arithmetic refinement -- Newton for / and sqrt #906).
  • sqrt / exp / log / pow / trig / hyperbolic: elreal-only (ereal has no math functions today).

Test plan

  • Builds clean under gcc 13.3 (`build_elreal/`)
  • Builds clean under clang 18.1 (`build_clang_elreal/`)
  • All 30 elreal regression tests PASS under both compilers (api, conversion, logic, arithmetic, math, geometry, oracle)
  • Phase J oracle sweep across dd / qd / dd_cascade / td_cascade / qd_cascade / ereal PASS under both compilers
  • `benchmark_elreal_performance` shows the documented gains under both compilers
  • CI fast tier green (gcc + clang lite)
  • CodeRabbit review

What this PR does NOT do

  • Eliminate the `std::function` generator capture. That's Phase K.2 -- the natural next target now that K.1 has shrunk the component allocation.
  • Eliminate the operand copies into the lambda capture. That's Phase K.3 (refcounted operand sharing).
  • Add SIMD/FMA batching to the EFTs. That's Phase K.4.

Part of #905 (Phase K of follow-up epic #903).

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Documentation

    • Updated performance baseline with Phase K.1 post-migration results, new post-K.1 tables, per-operator throughput comparisons, and clarified winner profiles and notes on compiler behaviour
    • Revised arithmetic benchmark docs: new throughput table and narrative showing multiplication now favors elreal at matched precision while addition/subtraction still favor ereal<2>
  • Refactor

    • Optimized component storage with a small-buffer optimization, improving throughput in key cases

Review Change Stack

Phase K.1 of follow-up epic #903. First sub-phase of Phase K (#905):
allocator hot-path optimisation. The Phase I baseline identified
per-operator vector allocation as the dominant cost; this PR closes
that part of the gap.

What landed
-----------

- include/sw/universal/number/elreal/lazy_component_buffer.hpp -- new
  small-buffer storage type: 4-double inline array + spill via
  std::vector. 64 bytes total (one cache line). API surface is the
  minimum elreal needs: push_back, operator[], size, clear, reserve.

- include/sw/universal/number/elreal/elreal_impl.hpp -- migrated
  _components from std::vector<double> to lazy_component_buffer.
  The components() accessor now returns const lazy_component_buffer&;
  the single external user in manipulators.hpp uses .size() and
  operator[] which are unchanged.

- docs/algorithmic-details/elreal-performance-baseline.md -- new
  headline-numbers table for the post-K.1 measurements. Phase I
  baseline numbers retained for the before/after comparison.

- docs/algorithmic-details/multi-component-arithmetic.md -- section
  7.1 picker rule updated: multiplication now favors elreal over
  ereal<2> at matched precision (was 1.2x ereal-favored, now 1.9x
  elreal-favored).

Headline numbers (gcc 13.3 on 12th Gen Intel i7-12700K)
--------------------------------------------------------

| Op              | Phase I | Phase K.1 | Speedup |
|-----------------|--------:|----------:|--------:|
| elreal + d1     | 9 Mops  | 12 Mops   | 1.3x    |
| elreal - d1     | 9 Mops  | 14 Mops   | 1.6x    |
| elreal * d1     | 8 Mops  | 19 Mops   | 2.4x    |
| elreal / d0     | 36 Mops | (gops)    | dominated by inlining once heap alloc is gone |
| elreal sqrt d1  | 14 Mops | 30 Mops   | 2.1x    |
| elreal exp  d1  | 14 Mops | 31 Mops   | 2.2x    |
| elreal log  d1  | 14 Mops | 24 Mops   | 1.7x    |

clang 18.1 shows even larger gains (the Phase I clang-vs-gcc gap on
multiplication, 4 vs 8 Mops/s, is closed -- both now land at ~ 20 Mops/s).

The crossover with ereal<2> at matched precision:
- Addition and subtraction: ereal<2> still wins by 1.4-2x (was 2-2.7x).
- Multiplication: elreal now beats ereal<2> by 1.9x.
- Division: not apples-to-apples (elreal depth-0 only until Phase L).
- sqrt / exp / log / pow / trig / hyperbolic: elreal-only (ereal has
  no math functions).

Validation
----------

All 30 elreal regression tests PASS under both gcc 13.3 and clang 18.1.
Phase J oracle sweep across dd / qd / dd_cascade / td_cascade /
qd_cascade / ereal<N> PASS under both compilers.

What this does NOT do
---------------------

- Eliminate the std::function generator capture. That's Phase K.2.
- Eliminate the operand copies into the lambda capture. That's
  Phase K.3 (refcounted operand sharing).
- Add SIMD/FMA batching to the EFTs. That's Phase K.4.

Part of #905 (Phase K of the elreal follow-up epic #903).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 705a2a69-e0bd-40a3-b1f4-f7bc4aa73ff5

📥 Commits

Reviewing files that changed from the base of the PR and between 6cb2071 and 329c989.

📒 Files selected for processing (1)
  • include/sw/universal/number/elreal/lazy_component_buffer.hpp

📝 Walkthrough

Walkthrough

This PR implements Phase K.1: introduces a small-buffer-optimized lazy_component_buffer, migrates elreal's _components from std::vector to that buffer (updating code and APIs), and updates performance documentation with post-K.1 benchmark results and revised operator crossover narratives.

Changes

Phase K.1 _components Buffer Optimization

Layer / File(s) Summary
Small-buffer container implementation
include/sw/universal/number/elreal/lazy_component_buffer.hpp
Introduces lazy_component_buffer: hybrid inline (4 doubles) / spill (std::vector<double>) storage with constructors, push_back, clear, indexed operator[], size/empty, and reserve.
elreal type migration to lazy_component_buffer
include/sw/universal/number/elreal/elreal_impl.hpp
Updates elreal to use lazy_component_buffer for _components: adds include, changes member and components() accessor types, updates documentation comments, and rewrites operator-() to iterate by size()/operator[].
Post-K.1 performance documentation
docs/algorithmic-details/elreal-performance-baseline.md, docs/algorithmic-details/multi-component-arithmetic.md
Refreshes baseline and Phase I docs to include post-K.1 measurements and per-operator throughput deltas, revises winner profile (elreal now wins on multiplication at matched precision), and marks the _components small-buffer optimisation as DONE while reordering remaining allocation-phase items.

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant elreal
  participant lazy_component_buffer as buffer
  Caller->>elreal: components()
  elreal->>buffer: return const ref
  Caller->>buffer: size()
  buffer-->>Caller: size
  Caller->>buffer: operator[](i)
  buffer-->>Caller: double value (inline or spill)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

Possibly related PRs

Poem

🐰 A clever buffer snug and small,
Four doubles inline, then vector for all,
Phase K.1 hops through components with glee,
Multiplication speeds up — hooray for me!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 41.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and specifically describes the main change: implementing Phase K.1 of elreal optimization by replacing std::vector with a small-buffer-optimized lazy_component_buffer for _components storage.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/elreal-phase-k1

Comment @coderabbitai help to get the list of available commands and usage tips.

@Ravenwater Ravenwater self-assigned this May 21, 2026
@Ravenwater Ravenwater added this to the V4 milestone May 21, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
include/sw/universal/number/elreal/lazy_component_buffer.hpp (1)

96-98: ⚡ Quick win

Add a debug bounds check in operator[].

This accessor currently allows silent UB on stale/out-of-range indexes. A debug-only assert keeps zero-cost release behavior while catching misuse quickly.

Proposed patch
+#include <cassert>
 ...
 	double operator[](std::size_t i) const noexcept {
+		assert(i < _size && "lazy_component_buffer index out of range");
 		return (i < inline_capacity) ? _inline[i] : _spill[i - inline_capacity];
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@include/sw/universal/number/elreal/lazy_component_buffer.hpp` around lines 96
- 98, Add a debug-only bounds check to the const subscript operator[] to prevent
silent UB: inside double operator[](std::size_t i) const noexcept (the function
using inline_capacity, _inline and _spill) assert that i is less than the
current element count (e.g., assert(i < size()) or assert(i < _size) depending
on the class member that tracks length); include <cassert> and keep the function
noexcept so this only affects debug builds and preserves zero-cost release
behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@include/sw/universal/number/elreal/lazy_component_buffer.hpp`:
- Around line 96-98: Add a debug-only bounds check to the const subscript
operator[] to prevent silent UB: inside double operator[](std::size_t i) const
noexcept (the function using inline_capacity, _inline and _spill) assert that i
is less than the current element count (e.g., assert(i < size()) or assert(i <
_size) depending on the class member that tracks length); include <cassert> and
keep the function noexcept so this only affects debug builds and preserves
zero-cost release behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5913b625-0ed2-46c0-940f-fbea0e2d795d

📥 Commits

Reviewing files that changed from the base of the PR and between 0571b64 and 6cb2071.

📒 Files selected for processing (4)
  • docs/algorithmic-details/elreal-performance-baseline.md
  • docs/algorithmic-details/multi-component-arithmetic.md
  • include/sw/universal/number/elreal/elreal_impl.hpp
  • include/sw/universal/number/elreal/lazy_component_buffer.hpp

Add debug-only bounds check in lazy_component_buffer::operator[].

CodeRabbit flagged the missing assert as a quick win: zero-cost in
release (NDEBUG strips the assert), catches misuse early in debug
builds. Applied directly per the proposed patch.

Validation: all sampled elreal tests PASS under gcc 13.3 and clang
18.1 with the assert added (the invariant `i < size()` already held
across all elreal indexing sites; the assert just makes it explicit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Ravenwater

Copy link
Copy Markdown
Contributor Author

Addressed the single CodeRabbit nitpick in 329c989: added assert(i < _size && "lazy_component_buffer index out of range"); to operator[] along with #include <cassert>. Zero-cost in release (NDEBUG strips it), catches misuse in debug builds.

The invariant i < _components.size() already held across all elreal indexing sites (every _components[i] call is guarded by a check against _computed_depth, which is kept equal to _components.size()). The assert just makes it explicit so future contributors get an early signal if they break the invariant.

Validation: all sampled elreal tests (api, arithmetic, math, constants, oracle sweep) PASS under both gcc 13.3 and clang 18.1.

@Ravenwater Ravenwater marked this pull request as ready for review May 21, 2026 22:40
@coveralls

Copy link
Copy Markdown

Coverage Report for CI Build 26257273704

Coverage increased (+0.01%) to 84.245%

Details

  • Coverage increased (+0.01%) from the base build.
  • Patch coverage: No coverable lines changed in this PR.
  • 2 coverage regressions across 1 file.

Uncovered Changes

No uncovered changes found.

Coverage Regressions

2 previously-covered lines in 1 file lost coverage.

File Lines Losing Coverage Coverage
include/sw/universal/number/posito/posito_impl.hpp 2 89.78%

Coverage Stats

Coverage Status
Relevant Lines: 55729
Covered Lines: 46949
Line Coverage: 84.25%
Coverage Strength: 5277739.23 hits per line

💛 - Coveralls

@Ravenwater Ravenwater merged commit 9eddae1 into main May 21, 2026
33 checks passed
@Ravenwater Ravenwater deleted the feat/elreal-phase-k1 branch May 21, 2026 23:12
Ravenwater added a commit that referenced this pull request May 22, 2026
…std::function

Phase K.2 of follow-up epic #903. The Phase I baseline identified the
per-op std::function heap allocation (~216-byte capture per binary op)
as a major cost. K.1 (#912) closed the _components vector alloc; this
PR closes the _generator function-object alloc.

Design (#905 K.2 + K.3 combined)
--------------------------------

elreal's _generator field migrates from
`std::function<double(std::size_t)>` to a `std::variant` of small POD
shapes:

  gen_unary_linear      (1 handle + 1 double = 24 bytes)
    coeff * a.at(1) at depth 1; covers all 19 math functions
  gen_binary_linear     (2 handles + 3 doubles = 56 bytes)
    c0 + ca*a.at(1) + cb*b.at(1); covers +, -, *, pow, atan2
  gen_sqrt              (1 handle + 1 double + 1 size_t = 32 bytes)
    sqrt-specific EFT residual
  gen_unary_neg         (1 handle = 16 bytes)
    trampoline through wrapped.at(k)
  gen_rational_residual (3 doubles = 24 bytes)
    elreal(p, q) constructor residual
  std::monostate        (0 bytes)
    default; depth-0-only result

Operand captures are std::shared_ptr<const elreal> (16 bytes each) so
the variant fits comfortably inline. No std::function, no heap
allocation per generator state.

Files
-----

- include/sw/universal/number/elreal/elreal_data.hpp -- NEW. Variant
  alternative types + lazy_generator alias + evaluate_generator
  forward declaration.
- include/sw/universal/number/elreal/elreal_impl.hpp -- _generator
  migrated. All ~25 operator/math function generators rewritten as
  variant emplacements. at() walks via evaluate_generator. The
  evaluator is defined after the class so each alternative can call
  elreal::at() on its operand handle.
- docs/algorithmic-details/elreal-performance-baseline.md -- K.2
  numbers added. K.1 numbers retained for the before/after comparison.
- docs/algorithmic-details/multi-component-arithmetic.md -- section
  7.1 picker rule updated: arithmetic gap with ereal<2> has nearly
  closed.

Results
-------

12th Gen Intel i7-12700K, gcc 13.3, -O3:

| Op            | Phase I | K.1     | K.2     | vs Phase I |
|---------------|--------:|--------:|--------:|-----------:|
| elreal + d1   | 9 Mops  | 12 Mops | 17 Mops | 1.9x       |
| elreal - d1   | 9 Mops  | 14 Mops | 17 Mops | 1.9x       |
| elreal * d1   | 8 Mops  | 19 Mops | 16 Mops | 2.0x       |
| elreal sqrt   | 14 Mops | 30 Mops | 36 Mops | 2.6x       |
| elreal exp    | 14 Mops | 31 Mops | 43 Mops | 3.1x       |
| elreal log    | 14 Mops | 24 Mops | 38 Mops | 2.7x       |

Versus ereal<2> at matched precision: + (17 vs 19, ereal +12%),
- (17 vs 20, ereal +18%), * (16 vs 11, elreal +45%). The arithmetic
gap with ereal<2> on +/- has shrunk from 2x at K.1 to within ~20%
at K.2. Multiplication still favors elreal.

Validation
----------

All 30 elreal regression tests PASS under gcc 13.3 and clang 18.1.
Phase J oracle sweep across dd / qd / dd_cascade / td_cascade /
qd_cascade / ereal<N> PASS under both compilers. No public API
changes (elreal looks the same to existing callers).

What this does NOT do
---------------------

- Convert elreal itself to a shared-ptr handle (would shrink
  sizeof(elreal) from 104 to 16 bytes at the cost of an additional
  indirection on every access). Trade-off not clearly worth it given
  the current numbers; deferred.
- Address per-op make_shared<const elreal>(a) operand wrapping. Each
  binary op still pays 2 small heap allocs (one per operand). On
  modern allocators with small-object pools this is typically cheap;
  measured throughput is excellent.
- Phase K.4 (SIMD batching on EFT primitives).

Part of #905 (Phase K of the elreal follow-up epic #903).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ravenwater added a commit that referenced this pull request May 22, 2026
…std::function (#916)

* fix(cmake): support FetchContent in dependent repo workflows

* perf(elreal): Phase K.2 -- tagged-union generator (variant) replaces std::function

Phase K.2 of follow-up epic #903. The Phase I baseline identified the
per-op std::function heap allocation (~216-byte capture per binary op)
as a major cost. K.1 (#912) closed the _components vector alloc; this
PR closes the _generator function-object alloc.

Design (#905 K.2 + K.3 combined)
--------------------------------

elreal's _generator field migrates from
`std::function<double(std::size_t)>` to a `std::variant` of small POD
shapes:

  gen_unary_linear      (1 handle + 1 double = 24 bytes)
    coeff * a.at(1) at depth 1; covers all 19 math functions
  gen_binary_linear     (2 handles + 3 doubles = 56 bytes)
    c0 + ca*a.at(1) + cb*b.at(1); covers +, -, *, pow, atan2
  gen_sqrt              (1 handle + 1 double + 1 size_t = 32 bytes)
    sqrt-specific EFT residual
  gen_unary_neg         (1 handle = 16 bytes)
    trampoline through wrapped.at(k)
  gen_rational_residual (3 doubles = 24 bytes)
    elreal(p, q) constructor residual
  std::monostate        (0 bytes)
    default; depth-0-only result

Operand captures are std::shared_ptr<const elreal> (16 bytes each) so
the variant fits comfortably inline. No std::function, no heap
allocation per generator state.

Files
-----

- include/sw/universal/number/elreal/elreal_data.hpp -- NEW. Variant
  alternative types + lazy_generator alias + evaluate_generator
  forward declaration.
- include/sw/universal/number/elreal/elreal_impl.hpp -- _generator
  migrated. All ~25 operator/math function generators rewritten as
  variant emplacements. at() walks via evaluate_generator. The
  evaluator is defined after the class so each alternative can call
  elreal::at() on its operand handle.
- docs/algorithmic-details/elreal-performance-baseline.md -- K.2
  numbers added. K.1 numbers retained for the before/after comparison.
- docs/algorithmic-details/multi-component-arithmetic.md -- section
  7.1 picker rule updated: arithmetic gap with ereal<2> has nearly
  closed.

Results
-------

12th Gen Intel i7-12700K, gcc 13.3, -O3:

| Op            | Phase I | K.1     | K.2     | vs Phase I |
|---------------|--------:|--------:|--------:|-----------:|
| elreal + d1   | 9 Mops  | 12 Mops | 17 Mops | 1.9x       |
| elreal - d1   | 9 Mops  | 14 Mops | 17 Mops | 1.9x       |
| elreal * d1   | 8 Mops  | 19 Mops | 16 Mops | 2.0x       |
| elreal sqrt   | 14 Mops | 30 Mops | 36 Mops | 2.6x       |
| elreal exp    | 14 Mops | 31 Mops | 43 Mops | 3.1x       |
| elreal log    | 14 Mops | 24 Mops | 38 Mops | 2.7x       |

Versus ereal<2> at matched precision: + (17 vs 19, ereal +12%),
- (17 vs 20, ereal +18%), * (16 vs 11, elreal +45%). The arithmetic
gap with ereal<2> on +/- has shrunk from 2x at K.1 to within ~20%
at K.2. Multiplication still favors elreal.

Validation
----------

All 30 elreal regression tests PASS under gcc 13.3 and clang 18.1.
Phase J oracle sweep across dd / qd / dd_cascade / td_cascade /
qd_cascade / ereal<N> PASS under both compilers. No public API
changes (elreal looks the same to existing callers).

What this does NOT do
---------------------

- Convert elreal itself to a shared-ptr handle (would shrink
  sizeof(elreal) from 104 to 16 bytes at the cost of an additional
  indirection on every access). Trade-off not clearly worth it given
  the current numbers; deferred.
- Address per-op make_shared<const elreal>(a) operand wrapping. Each
  binary op still pays 2 small heap allocs (one per operand). On
  modern allocators with small-object pools this is typically cheap;
  measured throughput is excellent.
- Phase K.4 (SIMD batching on EFT primitives).

Part of #905 (Phase K of the elreal follow-up epic #903).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(elreal): address CodeRabbit review on PR #916

Three findings, all valid:

1. elreal-performance-baseline.md mixed pre-K.1 narrative with K.2
   current numbers. Restructured the "Reading the table" and "When
   is elreal faster than ereal" sections to reflect K.2-current
   state. Moved the pre-K.1 analysis into a new "Historical: Phase
   I baseline analysis (pre-K.1)" section for archival reference.

2. multi-component-arithmetic.md section 7.1 heading said "Phase I
   baseline" but the table held post-K.2 numbers. Updated heading
   to "post-Phase-K.2" and rewrote the body paragraph that
   referenced "Phase II of epic #873 targets the allocation hot
   path" -- K.1 and K.2 have shipped, so the picker rule
   conclusion is now "elreal is throughput-competitive AND
   exposes correctness features ereal<N> lacks".

3. fetch-content-architecture.md contained em-dashes (U+2014) that
   violate the project's ASCII-only rule. Replaced with -- per the
   convention used elsewhere in the docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ravenwater added a commit that referenced this pull request May 22, 2026
…#919)

Two related changes:

1. The lazy_component_buffer member `_inline` collides with the MSVC-
   specific `_inline` reserved keyword (a legacy alias for `inline`
   that MSVC still treats specially in some parses). Symptom: MSVC's
   parser sees `_inline{}` in the constructor's member-initializer
   list and interprets it as `inline {}`, cascading into a string of
   C2059 syntax errors. gcc and clang accept the original name.
   Renamed to `_inl_buf` (the inline buffer) -- non-reserved,
   descriptive, mechanical search/replace.

2. UNIVERSAL_BUILD_CI_LITE (the lite CI tier used by the Windows MSVC
   CI runner) did not enable UNIVERSAL_BUILD_NUMBER_ELREALS. As a
   result the K.1 bug above shipped in PR #912 and stayed undetected
   on main. Added ELREALS to CI_LITE so subsequent MSVC-specific
   issues in elreal land at PR-review time, not after merge.

Background
----------

The CI matrix in `.github/workflows/cmake.yml` separates:

- Linux gcc + clang: UNIVERSAL_BUILD_CI=ON (full)
- Windows MSVC, macOS, cross-compilers: UNIVERSAL_BUILD_CI_LITE=ON

The lite tier exists to keep the Windows runner fast (~ 8 min vs
~ 25 min for full). It includes one representative type from each
major category: integer, fixpnt, cfloat, posit, lns, one cascade,
the MX block formats, one elastic (einteger). It did NOT include
the lazy elreal -- the recent multi-PR follow-up epic #903 grew
elreal but the CI matrix wasn't updated to track it.

MinGW cross-compilation, also in the CI matrix, uses gcc as the
compiler (just targeting Windows) and so doesn't surface the MSVC
`_inline` keyword issue. Only the actual MSVC build catches it.

Validation
----------

Local gcc 13.3 and clang 18.1 builds pass after the rename
(behavior unchanged; just a member-name swap). The CI matrix change
adds elreal coverage to the Windows MSVC runner, which would have
caught this bug at the K.1 (#912) PR-review stage.

Fixes the MSVC build break observed at HEAD of main as of 2026-05-22.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants