Add NKI partition equivalence and hardware scaling tests by scttfrdmn · Pull Request #49 · trnsci/trnrand

scttfrdmn · 2026-04-25T17:14:32Z

Phase 4 PR 3 — simulator and hardware tests for counter-partitioned multi-chip RNG

Closes a portion of #20 (hardware validation items).

What this adds

tests/test_nki_partitioning.py — all tests skipped automatically on dev hosts
without neuronxcc. Two gate markers:

- @pytest.mark.nki_simulator: requires TRNRAND_USE_SIMULATOR=1 + nki>=0.3.0
- @pytest.mark.neuron: trn1.32xlarge hardware only, run manually

`TestPartitionEquivalencePerTile` (nki_simulator)

Bit-exact partition equivalence for uniform, normal, exponential via the
per-tile dispatch path (threefry_uniform_nki / threefry_normal_nki), P=2 and P=4.

Concatenating P-chip draws in rank order must equal the 1-chip draw — torch.equal.
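The invariant can be illustrated on a dev host with a toy counter-based generator. This is a minimal sketch, not the trnrand kernel: `toy_word` and `toy_draw` are hypothetical names, SHA-256 stands in for Threefry, and each rank is assumed to own one contiguous slice of the counter space.

```python
import hashlib
import struct


def toy_word(seed: int, counter: int) -> int:
    """Toy counter-based generator: one 32-bit word per (seed, counter).

    A stand-in for Threefry, used only to illustrate the invariant.
    """
    digest = hashlib.sha256(struct.pack("<QQ", seed, counter)).digest()
    return struct.unpack("<I", digest[:4])[0]


def toy_draw(seed: int, n: int, rank: int = 0, size: int = 1) -> list:
    """Draw this rank's contiguous share of an n-element stream."""
    per_rank = n // size      # assumes n is a multiple of size
    start = rank * per_rank   # counter offset for this rank (assumed layout)
    return [toy_word(seed, start + i) for i in range(per_rank)]


seed, n = 1234, 64
single = toy_draw(seed, n)

# Concatenate the P per-rank draws in rank order and compare to the 1-chip draw.
results = {}
for p in (2, 4):
    combined = [x for r in range(p) for x in toy_draw(seed, n, rank=r, size=p)]
    results[p] = combined == single
```

Because every output depends only on the seed and its counter, no rank needs to know what another rank generated, which is the property the simulator tests check with torch.equal.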

`TestPartitionEquivalenceStreaming` (nki_simulator)

Same invariant tested via threefry_stream_normal / threefry_stream_uniform
directly, so counter_offset arithmetic in streaming dispatch is independently
validated without going through GeneratorProgram.

Counter offset per chip: r × ceil(n / 16384) × _PROGRAM_TILES.
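The offset formula above can be written out as a small helper. This is a sketch under assumptions: the function name is hypothetical, the value passed for `program_tiles` is a placeholder (the PR does not state the value of _PROGRAM_TILES), and `n` is taken exactly as it appears in the formula.

```python
import math

STREAM_ELEMS = 16384  # divisor taken from the formula above


def counter_offset(rank: int, n: int, program_tiles: int) -> int:
    """r * ceil(n / 16384) * _PROGRAM_TILES, as stated in the PR text.

    program_tiles is passed in explicitly because its real value is not
    given here; any number used below is a placeholder.
    """
    return rank * math.ceil(n / STREAM_ELEMS) * program_tiles


PLACEHOLDER_TILES = 32  # hypothetical value, for illustration only

offset_rank0 = counter_offset(0, 32768, PLACEHOLDER_TILES)
offset_rank3 = counter_offset(3, 32768, PLACEHOLDER_TILES)
offset_partial = counter_offset(1, 16385, PLACEHOLDER_TILES)
```

Rank 0 always starts at offset 0, and the ceil means a partially filled final launch still reserves a full launch's worth of counters.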

`TestPartitionEquivalenceProgram` (nki_simulator)

Validates GeneratorProgram.stream_into applies partition offsets correctly.
Includes a multi-distribution program test (normal + uniform in the same program, P=2).

`TestAdvanceResumeEquivalence` (nki_simulator)

advance(N_skip) followed by continued generation matches the tail of an
uninterrupted N_skip + N_draw run. Also validates advance_to(N) == advance(N).
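The advance/resume property follows the same counter logic. The sketch below uses a hypothetical `ToyGenerator` class with SHA-256 as a stand-in for Threefry; the method names mirror the ones mentioned in this PR, but their semantics here (element-granular positions) are an assumption, not the trnrand implementation.

```python
import hashlib
import struct


class ToyGenerator:
    """Toy counter-based generator with an explicit stream position."""

    def __init__(self, seed: int) -> None:
        self.seed = seed
        self.position = 0

    def _word(self, counter: int) -> int:
        digest = hashlib.sha256(struct.pack("<QQ", self.seed, counter)).digest()
        return struct.unpack("<I", digest[:4])[0]

    def advance(self, n: int) -> None:
        """Skip n elements without generating them."""
        self.position += n

    def advance_to(self, position: int) -> None:
        """Jump to an absolute position in the stream."""
        self.position = position

    def draw(self, n: int) -> list:
        out = [self._word(self.position + i) for i in range(n)]
        self.position += n
        return out


n_skip, n_draw = 100, 50

# Uninterrupted run of n_skip + n_draw elements.
full = ToyGenerator(7).draw(n_skip + n_draw)

# Skip ahead, then draw: should match the tail of the uninterrupted run.
resumed = ToyGenerator(7)
resumed.advance(n_skip)
tail = resumed.draw(n_draw)

# From a fresh generator, advance_to(N) lands at the same position as advance(N).
jumped = ToyGenerator(7)
jumped.advance_to(n_skip)
tail_via_advance_to = jumped.draw(n_draw)
```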

`TestHardwarePartition` (neuron — trn1.32xlarge, manual)

- test_zero_cross_chip_coordination: smoke test; the profiler confirms no collective ops run during the kernel (NEURON_PPROF_SINK=/tmp/profile)
- test_near_linear_strong_scaling: records the 1-chip baseline for the ≥87% efficiency target at 32 chips (≥28× the 1-chip result)
- test_partition_equivalence_P32_hardware: chip 0 of 32 matches the first slice of the 1-chip output (structural validation of counter arithmetic at P=32 without requiring 32 physical chips)
- test_neff_cache_reuse_with_partition: a second stream_into call with partition args completes in <100ms (NEFF cache hit)
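The scaling target is simple arithmetic: a 28× speedup on 32 chips is 28 / 32 = 87.5% efficiency. A minimal sketch of that check follows, with a hypothetical helper name and made-up timings; it is not the benchmark code from the test suite.

```python
def strong_scaling_efficiency(t_one_chip: float, t_p_chips: float, p: int) -> float:
    """Efficiency = speedup / P for a fixed total problem size."""
    speedup = t_one_chip / t_p_chips
    return speedup / p


TARGET_CHIPS = 32
TARGET_SPEEDUP = 28.0
threshold = TARGET_SPEEDUP / TARGET_CHIPS  # 28 / 32

# Illustrative timings in seconds (invented for the example, not measurements).
eff_pass = strong_scaling_efficiency(56.0, 2.0, TARGET_CHIPS)   # speedup of 28
eff_fail = strong_scaling_efficiency(56.0, 2.5, TARGET_CHIPS)   # speedup of 22.4

meets_target = eff_pass >= threshold
misses_target = eff_fail < threshold
```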

Partition equivalence boundary

Equivalence holds when N is a multiple of 512 × P (per-tile) or 16,384 × P (streaming).
Test sizes are chosen to satisfy this. Partial-batch behavior is already covered in
test_partitioning.py.
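The boundary condition can be expressed as a small predicate when picking test sizes. The helper name and the path labels below are hypothetical; only the 512 and 16,384 granule sizes come from the text above.

```python
GRANULE = {
    "per_tile": 512,       # elements per tile on the per-tile dispatch path
    "streaming": 16384,    # elements per launch on the streaming path
}


def equivalence_size_ok(n: int, p: int, path: str) -> bool:
    """True when n is a multiple of granule * P for the given dispatch path."""
    if path not in GRANULE:
        raise ValueError(f"unknown dispatch path: {path!r}")
    return n % (GRANULE[path] * p) == 0


per_tile_ok = equivalence_size_ok(2048, 4, "per_tile")       # 2048 = 512 * 4
per_tile_bad = equivalence_size_ok(1024, 4, "per_tile")      # not a multiple of 2048
streaming_ok = equivalence_size_ok(65536, 4, "streaming")    # 65536 = 16384 * 4
streaming_bad = equivalence_size_ok(2048, 4, "streaming")    # far below one launch per chip
```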

Running

# Dev host (all marked tests skip automatically)
pytest tests/test_nki_partitioning.py -v

# NKI simulator
TRNRAND_USE_SIMULATOR=1 pytest tests/test_nki_partitioning.py -m nki_simulator -v

# Hardware
pytest tests/test_nki_partitioning.py -m neuron -v

Covers the simulator-gated and hardware-gated tests for Phase 4 counter-partitioned multi-chip RNG:

- TestPartitionEquivalencePerTile (nki_simulator): bit-exact partition equivalence for uniform/normal/exponential via per-tile dispatch, P=2,4
- TestPartitionEquivalenceStreaming (nki_simulator): equivalence via threefry_stream_normal/uniform direct dispatch, P=2,4
- TestPartitionEquivalenceProgram (nki_simulator): GeneratorProgram.stream_into partition equivalence; multi-distribution program P=2 test
- TestAdvanceResumeEquivalence (nki_simulator): advance() then draw matches uninterrupted tail; advance_to(N) == advance(N)
- TestHardwarePartition (neuron): smoke test, near-linear strong scaling benchmark, P=32 chip-0 slice check, NEFF cache reuse with partition args

Tests run on dev hosts without neuronxcc: all nki_simulator/neuron-marked tests skip automatically. Skipping criteria documented in autouse fixture.

Phase 4 (issue #20): bit-exact reproducibility across chip counts as a first-class API property. A 1-chip and P-chip run with the same seed produce the same combined stream byte-for-byte with zero cross-chip coordination.

- Generator partition API: partition_rank/size, advance, position, advance_to, _chip_counter_offset, _advance_by_elements (#47)
- Dispatch wiring: uniform/normal/exponential + _into variants route counter_offset through Generator partition state; ProgramBuilder and GeneratorProgram apply per-step partition offsets (#48)
- Simulator + hardware tests: 43 CPU tests, nki_simulator partition equivalence suite, neuron scaling/profiler tests (#49)

Deferred: gamma/beta/chi_squared/truncated_normal partition wiring (data-dependent batch counts); hardware validation run pending.

scttfrdmn merged commit 5dd7b8a into main Apr 28, 2026
3 of 5 checks passed

scttfrdmn mentioned this pull request Apr 28, 2026

Release v0.7.0 — counter-partitioned multi-chip RNG #50

Merged

scttfrdmn mentioned this pull request Apr 29, 2026

Fix 4 simulator CI failures introduced in v0.7.0 #52

Merged

scttfrdmn merged 1 commit into main from feat/partition-simulator-tests
