Skip to content

Commit f3b53ed

Browse files
sifakisclaude
andcommitted
ex_*stencil_cpu: add legacy-transposed pass; BatchAccessor: document §8l
Added a tap-outer, voxel-inner variant of the Legacy path (same leaf-only ReadAccessor, same probeLeaf + getValue mechanics, just the nested loops swapped) as a new `legacy-transposed` benchmark pass in both examples. Checksums match the voxel-outer `legacy` pass byte- for-byte on both synthetic and narrow-band workloads. During the experiment we hit a GCC inlining pitfall: a runtime-args inner lambda `[&](int di, int dj, int dk)` invoked 18 times via parameter-pack fold did *not* get inlined (lambda body contains a 128-iteration loop; GCC's inline budget × 18 is exhausted). Result: 18 explicit call instructions to a 542-byte processTap function with 6-register prologue/epilogue per call, plus tap offsets becoming runtime register arguments (one spilled to stack) — accounting for ~13 % of the observed slowdown vs. Legacy. Fix is a templated lambda `[&]<int DI, int DJ, int DK>() [[gnu::always_inline]]` dispatched via `.template operator()<...>()`. Standalone processTap symbol vanishes; transposed body grows 4.4 → 9.8 KB, matching Legacy's 10.5 KB. Measured at ~32M active voxels on i9-285K (24 threads): - Narrowband taperLER: Legacy 2.2 ns/vox vs Transposed 2.1 ns/vox (marginal, within the ~10 % noise floor) - Synthetic 64M/50%: Legacy 2.4 ns/vox vs Transposed 2.8 ns/vox (+19 %, outside noise) Implementation verdict: LegacyStencilAccessor's voxel-outer moveTo stays the default. Tap-outer has no consistent perf advantage, and voxel-outer wins on cleanliness (self-contained accessor, no scratch arrays, no compiler-inlining fragility, 1:1 mapping to the stencil- operator mental model). `legacy-transposed` kept as a benchmark pass for future reference. BatchAccessor.md §8l captures the experiment, the inlining-pitfall lesson, the measurement matrix, and the implementation-quality rationale behind keeping voxel-outer as the production default. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
1 parent 24c2de7 commit f3b53ed

3 files changed

Lines changed: 238 additions & 5 deletions

File tree

nanovdb/nanovdb/examples/ex_narrowband_stencil_cpu/narrowband_stencil_cpu.cpp

Lines changed: 82 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -356,7 +356,8 @@ static void runPerf(
356356
const std::string& passFilter = "all")
357357
{
358358
// wantPass(<name>) returns true if this pass should run under the current filter.
359-
// Supported names: "decode", "stencil", "framing", "legacy". "all" runs everything.
359+
// Supported names: "decode", "stencil", "framing", "legacy",
360+
// "legacy-transposed". "all" runs everything.
360361
auto wantPass = [&](const char* name) {
361362
return passFilter == "all" || passFilter == name;
362363
};
@@ -539,6 +540,79 @@ static void runPerf(
539540
[](uint64_t a, uint64_t b) { return a ^ b; });
540541
} // end wantPass("legacy")
541542

543+
// ---- Legacy transposed: tap-outer, voxel-inner ----
544+
// Same semantics as `legacy`, reordered. For each of the 18 WENO5 taps,
545+
// sweep all BlockWidth voxels — giving long runs of probeLeaf + getValue
546+
// calls with the SAME compile-time tap offset but varying center voxels.
547+
double legacyXposedUs = 0.0;
548+
uint64_t legacyXposedChecksum = 0;
549+
if (wantPass("legacy-transposed")) {
550+
std::fill(sums.begin(), sums.end(), uint64_t(0));
551+
552+
using Weno5Taps = nanovdb::Weno5Stencil::Taps;
553+
static constexpr int SIZE = int(std::tuple_size_v<Weno5Taps>);
554+
555+
legacyXposedUs = timeForEach([&] {
556+
nanovdb::util::forEach(size_t(0), size_t(nBlocks), size_t(1),
557+
[&](const nanovdb::util::Range1D& range) {
558+
alignas(64) uint32_t leafIndex[BlockWidth];
559+
alignas(64) uint16_t voxelOffset[BlockWidth];
560+
alignas(64) nanovdb::Coord centers[BlockWidth];
561+
alignas(64) uint64_t s[BlockWidth];
562+
nanovdb::ReadAccessor<BuildT, 0, -1, -1> acc(grid->tree().root());
563+
uint64_t* bs0 = sums.data();
564+
565+
for (size_t bID = range.begin(); bID != range.end(); ++bID) {
566+
CPUVBM::decodeInverseMaps(
567+
grid, firstLeafID[bID],
568+
&jumpMap[bID * CPUVBM::JumpMapLength],
569+
firstOffset + bID * BlockWidth,
570+
leafIndex, voxelOffset);
571+
572+
for (int i = 0; i < BlockWidth; ++i) {
573+
s[i] = 0;
574+
if (leafIndex[i] == CPUVBM::UnusedLeafIndex) continue;
575+
const uint16_t vo = voxelOffset[i];
576+
const uint32_t li = leafIndex[i];
577+
const nanovdb::Coord cOrigin = firstLeaf[li].origin();
578+
centers[i] = cOrigin + nanovdb::Coord(
579+
(vo >> 6) & 7, (vo >> 3) & 7, vo & 7);
580+
}
581+
582+
auto processTap = [&]<int DI, int DJ, int DK>()
583+
[[gnu::always_inline]]
584+
{
585+
for (int i = 0; i < BlockWidth; ++i) {
586+
if (leafIndex[i] == CPUVBM::UnusedLeafIndex) continue;
587+
const nanovdb::Coord c = centers[i]
588+
+ nanovdb::Coord(DI, DJ, DK);
589+
const LeafT* leaf = acc.probeLeaf(c);
590+
if (!leaf) continue;
591+
const uint32_t offset = (uint32_t(c[0] & 7) << 6)
592+
| (uint32_t(c[1] & 7) << 3)
593+
| uint32_t(c[2] & 7);
594+
s[i] += leaf->data()->getValue(offset);
595+
}
596+
};
597+
598+
[&]<size_t... Is>(std::index_sequence<Is...>) {
599+
(processTap.template operator()<
600+
std::tuple_element_t<Is, Weno5Taps>::di,
601+
std::tuple_element_t<Is, Weno5Taps>::dj,
602+
std::tuple_element_t<Is, Weno5Taps>::dk>(), ...);
603+
}(std::make_index_sequence<SIZE>{});
604+
605+
uint64_t* bs = bs0 + bID * BlockWidth;
606+
for (int i = 0; i < BlockWidth; ++i) bs[i] = s[i];
607+
}
608+
});
609+
});
610+
611+
legacyXposedChecksum =
612+
std::accumulate(sums.begin(), sums.end(), uint64_t(0),
613+
[](uint64_t a, uint64_t b) { return a ^ b; });
614+
} // end wantPass("legacy-transposed")
615+
542616
std::printf("\nEnd-to-end stencil gather (%u blocks, %lu active voxels):\n",
543617
nBlocks, nVoxels);
544618
std::printf(" decodeInverseMaps only: %7.1f ms (%5.1f ns/voxel)\n",
@@ -552,9 +626,14 @@ static void runPerf(
552626
std::printf(" LegacyStencilAccessor : %7.1f ms (%5.1f ns/voxel) [%+5.1f ms over decode] checksum=0x%016lx\n",
553627
legacyUs / 1e3, legacyUs * 1e3 / double(nVoxels),
554628
(legacyUs - decodeUs) / 1e3, legacyChecksum);
629+
std::printf(" Legacy transposed : %7.1f ms (%5.1f ns/voxel) [%+5.1f ms over decode] checksum=0x%016lx\n",
630+
legacyXposedUs / 1e3, legacyXposedUs * 1e3 / double(nVoxels),
631+
(legacyXposedUs - decodeUs) / 1e3, legacyXposedChecksum);
555632

556633
if (stencilChecksum != legacyChecksum)
557-
std::cerr << " WARNING: checksums differ — accessor results disagree!\n";
634+
std::cerr << " WARNING: stencil/legacy checksums differ — accessor results disagree!\n";
635+
if (legacyChecksum != legacyXposedChecksum)
636+
std::cerr << " WARNING: legacy/legacy-transposed checksums differ — ordering bug!\n";
558637
}
559638

560639
// ============================================================
@@ -574,7 +653,7 @@ static void printUsage(const char* argv0)
574653
<< " --grid=<name> Select grid by name (default: first FloatGrid)\n"
575654
<< " --pass=<name> Run one perf pass:\n"
576655
<< " all (default), verify, decode, stencil,\n"
577-
<< " framing, legacy\n"
656+
<< " framing, legacy, legacy-transposed\n"
578657
<< " --threads=<n> Limit TBB parallelism (0 = TBB default)\n"
579658
<< " --skip-validation Skip the sidecar ordering sanity check\n";
580659
}

nanovdb/nanovdb/examples/ex_stencil_gather_cpu/stencil_gather_cpu.cpp

Lines changed: 81 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -262,7 +262,8 @@ static void runPerf(
262262
const std::string& passFilter = "all")
263263
{
264264
// wantPass(<name>) returns true if this pass should run under the current filter.
265-
// Supported names: "decode", "stencil", "framing", "legacy". "all" runs everything.
265+
// Supported names: "decode", "stencil", "framing", "legacy",
266+
// "legacy-transposed". "all" runs everything.
266267
auto wantPass = [&](const char* name) {
267268
return passFilter == "all" || passFilter == name;
268269
};
@@ -445,6 +446,79 @@ static void runPerf(
445446
[](uint64_t a, uint64_t b) { return a ^ b; });
446447
} // end wantPass("legacy")
447448

449+
// ---- Legacy transposed: tap-outer, voxel-inner ----
450+
// Same semantics as `legacy`, reordered. For each of the 18 WENO5 taps,
451+
// sweep all BlockWidth voxels — giving long runs of probeLeaf + getValue
452+
// calls with the SAME compile-time tap offset but varying center voxels.
453+
double legacyXposedUs = 0.0;
454+
uint64_t legacyXposedChecksum = 0;
455+
if (wantPass("legacy-transposed")) {
456+
std::fill(sums.begin(), sums.end(), uint64_t(0));
457+
458+
using Weno5Taps = nanovdb::Weno5Stencil::Taps;
459+
static constexpr int SIZE = int(std::tuple_size_v<Weno5Taps>);
460+
461+
legacyXposedUs = timeForEach([&] {
462+
nanovdb::util::forEach(size_t(0), size_t(nBlocks), size_t(1),
463+
[&](const nanovdb::util::Range1D& range) {
464+
alignas(64) uint32_t leafIndex[BlockWidth];
465+
alignas(64) uint16_t voxelOffset[BlockWidth];
466+
alignas(64) nanovdb::Coord centers[BlockWidth];
467+
alignas(64) uint64_t s[BlockWidth];
468+
nanovdb::ReadAccessor<BuildT, 0, -1, -1> acc(grid->tree().root());
469+
uint64_t* bs0 = sums.data();
470+
471+
for (size_t bID = range.begin(); bID != range.end(); ++bID) {
472+
CPUVBM::decodeInverseMaps(
473+
grid, firstLeafID[bID],
474+
&jumpMap[bID * CPUVBM::JumpMapLength],
475+
firstOffset + bID * BlockWidth,
476+
leafIndex, voxelOffset);
477+
478+
for (int i = 0; i < BlockWidth; ++i) {
479+
s[i] = 0;
480+
if (leafIndex[i] == CPUVBM::UnusedLeafIndex) continue;
481+
const uint16_t vo = voxelOffset[i];
482+
const uint32_t li = leafIndex[i];
483+
const nanovdb::Coord cOrigin = firstLeaf[li].origin();
484+
centers[i] = cOrigin + nanovdb::Coord(
485+
(vo >> 6) & 7, (vo >> 3) & 7, vo & 7);
486+
}
487+
488+
auto processTap = [&]<int DI, int DJ, int DK>()
489+
[[gnu::always_inline]]
490+
{
491+
for (int i = 0; i < BlockWidth; ++i) {
492+
if (leafIndex[i] == CPUVBM::UnusedLeafIndex) continue;
493+
const nanovdb::Coord c = centers[i]
494+
+ nanovdb::Coord(DI, DJ, DK);
495+
const LeafT* leaf = acc.probeLeaf(c);
496+
if (!leaf) continue;
497+
const uint32_t offset = (uint32_t(c[0] & 7) << 6)
498+
| (uint32_t(c[1] & 7) << 3)
499+
| uint32_t(c[2] & 7);
500+
s[i] += leaf->data()->getValue(offset);
501+
}
502+
};
503+
504+
[&]<size_t... Is>(std::index_sequence<Is...>) {
505+
(processTap.template operator()<
506+
std::tuple_element_t<Is, Weno5Taps>::di,
507+
std::tuple_element_t<Is, Weno5Taps>::dj,
508+
std::tuple_element_t<Is, Weno5Taps>::dk>(), ...);
509+
}(std::make_index_sequence<SIZE>{});
510+
511+
uint64_t* bs = bs0 + bID * BlockWidth;
512+
for (int i = 0; i < BlockWidth; ++i) bs[i] = s[i];
513+
}
514+
});
515+
});
516+
517+
legacyXposedChecksum =
518+
std::accumulate(sums.begin(), sums.end(), uint64_t(0),
519+
[](uint64_t a, uint64_t b) { return a ^ b; });
520+
} // end wantPass("legacy-transposed")
521+
448522
std::printf("\nEnd-to-end stencil gather (%u blocks, %lu active voxels):\n",
449523
nBlocks, nVoxels);
450524
std::printf(" decodeInverseMaps only: %7.1f ms (%5.1f ns/voxel)\n",
@@ -458,9 +532,14 @@ static void runPerf(
458532
std::printf(" LegacyStencilAccessor : %7.1f ms (%5.1f ns/voxel) [%+5.1f ms over decode] checksum=0x%016lx\n",
459533
legacyUs / 1e3, legacyUs * 1e3 / double(nVoxels),
460534
(legacyUs - decodeUs) / 1e3, legacyChecksum);
535+
std::printf(" Legacy transposed : %7.1f ms (%5.1f ns/voxel) [%+5.1f ms over decode] checksum=0x%016lx\n",
536+
legacyXposedUs / 1e3, legacyXposedUs * 1e3 / double(nVoxels),
537+
(legacyXposedUs - decodeUs) / 1e3, legacyXposedChecksum);
461538

462539
if (stencilChecksum != legacyChecksum)
463-
std::cerr << " WARNING: checksums differ — accessor results disagree!\n";
540+
std::cerr << " WARNING: stencil/legacy checksums differ — accessor results disagree!\n";
541+
if (legacyChecksum != legacyXposedChecksum)
542+
std::cerr << " WARNING: legacy/legacy-transposed checksums differ — ordering bug!\n";
464543
}
465544

466545
// ============================================================

nanovdb/nanovdb/util/BatchAccessor.md

Lines changed: 75 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1467,6 +1467,71 @@ follow-ons implied by this work but not pursued here:
14671467
legacy compatibility shim (to be retired after a deprecation window)
14681468
rather than a permanent toggle.
14691469
1470+
### 8l. Follow-up: tap-outer loop ordering in the Legacy path
1471+
1472+
Tested whether flipping `legacy`'s loop nest to tap-outer,
1473+
voxel-inner helps on spatially-coherent workloads where many voxels in
1474+
a batch are likely to share the same `valueMask` word. Added a
1475+
`legacy-transposed` benchmark pass in both `ex_stencil_gather_cpu` and
1476+
`ex_narrowband_stencil_cpu`; checksums match `legacy` byte-for-byte on
1477+
both workloads.
1478+
1479+
#### 8l.1 Inlining pitfall
1480+
1481+
First attempt used a runtime-args inner lambda
1482+
`[&](int di, int dj, int dk) { for (int i = 0; i < 128; ++i) ... }`
1483+
invoked 18 times via a parameter-pack fold. GCC refused to inline the
1484+
18 instantiations — the lambda body contains a 128-iteration loop with
1485+
`probeLeaf` + `getValue` inside, which blew past the per-caller inline
1486+
budget × 18. Result: 18 explicit `call` instructions to a 542-byte
1487+
`processTap` function with a 6-register prologue/epilogue per call,
1488+
and tap offsets `(di, dj, dk)` as runtime register arguments (one
1489+
spilled to stack) — so the compiler also couldn't specialise the loop
1490+
body per tap. That alone accounted for ~10 ms (~13 %) of the observed
1491+
slowdown vs. Legacy.
1492+
1493+
Fix: templated lambda
1494+
`[&]<int DI, int DJ, int DK>() [[gnu::always_inline]] { ... }`
1495+
dispatched via `.template operator()<di, dj, dk>()` inside the fold.
1496+
The standalone `processTap` symbol disappears; transposed body grows
1497+
from 4.4 KB → 9.8 KB (matching Legacy's 10.5 KB), and only cold-path
1498+
tree-walk helpers remain as call targets.
1499+
1500+
#### 8l.2 Results
1501+
1502+
Measured at ~32M active voxel scale on i9-285K (24 threads, no HT):
1503+
1504+
| Workload | Legacy (voxel-outer) | Transposed (tap-outer) | Δ |
1505+
|----------|---------------------:|-----------------------:|--:|
1506+
| Narrowband taperLER.vdb | 2.2 ns/vox | 2.1 ns/vox | −3 to −6 % (within noise) |
1507+
| Synthetic 64M/50% | 2.4 ns/vox | 2.8 ns/vox | +19 % |
1508+
1509+
The narrowband tap-outer edge is marginal and within the ~10 %
1510+
run-to-run noise floor observed on this host. Synthetic's tap-outer
1511+
slowdown is clearly outside noise. Not a consistent win.
1512+
1513+
#### 8l.3 Implementation verdict: voxel-outer stays the default
1514+
1515+
`LegacyStencilAccessor`'s voxel-outer `moveTo(center)` kept as the
1516+
production default:
1517+
1518+
- **Clean abstraction**: `moveTo(center)` + indexed tap access maps
1519+
1:1 to the stencil operator's mental model. A tap-outer batched
1520+
form would need external accumulator state and a centers-array
1521+
input, with no natural class boundary.
1522+
- **No scratch arrays**: voxel-outer keeps the per-voxel accumulator
1523+
in a register and the 18-tap buffer inside the accessor; tap-outer
1524+
needs stack-local `centers[128]` and `s[128]`.
1525+
- **Compiler robustness**: voxel-outer's 18-call single source
1526+
location is reliably collapsed by GCC. Tap-outer relies on an
1527+
explicit `[[gnu::always_inline]]` workaround that, if lost during
1528+
future refactors, would silently regress performance by ~13 %.
1529+
1530+
`legacy-transposed` retained as a benchmark pass for reference and as
1531+
a datapoint reinforcing why the hybrid `StencilAccessor` is structured
1532+
tap-outer (SIMD direction-computation inherently amortises across
1533+
lanes at the same tap).
1534+
14701535
---
14711536
14721537
## 9. Relationship to Phase 1 Prototype
@@ -1559,6 +1624,16 @@ follow-ons implied by this work but not pursued here:
15591624
Scope: benchmark-only; the library default is unchanged (right default
15601625
for `probeValue`/`probeLeaf`/mixed workloads).
15611626
1627+
- **Tap-outer loop ordering evaluation in the Legacy path (§8l)**:
1628+
added `legacy-transposed` benchmark pass (checksums match byte-for-byte)
1629+
and tested on both workloads at matched ~32M-voxel scale. Narrowband:
1630+
marginal tap-outer edge, within noise. Synthetic: tap-outer ~19 %
1631+
slower. Uncovered a GCC inlining pitfall for runtime-args inner lambdas
1632+
(fixed via templated lambda + `[[gnu::always_inline]]`).
1633+
**Verdict**: voxel-outer `LegacyStencilAccessor` remains the default —
1634+
cleaner abstraction, no scratch arrays, no compiler-inlining fragility,
1635+
and no consistent perf advantage to tap-outer.
1636+
15621637
### Remaining
15631638
15641639
- **`[[gnu::always_inline]]` on `Simd.h` helpers** (§8f) vs

0 commit comments

Comments
 (0)