Add NKI partition equivalence and hardware scaling tests #49
Merged
Covers the simulator-gated and hardware-gated tests for Phase 4 counter-partitioned multi-chip RNG:
- TestPartitionEquivalencePerTile (nki_simulator): bit-exact partition equivalence for uniform/normal/exponential via per-tile dispatch, P=2,4
- TestPartitionEquivalenceStreaming (nki_simulator): equivalence via threefry_stream_normal/uniform direct dispatch, P=2,4
- TestPartitionEquivalenceProgram (nki_simulator): GeneratorProgram.stream_into partition equivalence; multi-distribution program P=2 test
- TestAdvanceResumeEquivalence (nki_simulator): advance() then draw matches uninterrupted tail; advance_to(N) == advance(N)
- TestHardwarePartition (neuron): smoke test, near-linear strong scaling benchmark, P=32 chip-0 slice check, NEFF cache reuse with partition args

Tests run on dev hosts without neuronxcc: all nki_simulator/neuron-marked tests skip automatically. Skipping criteria documented in autouse fixture.
scttfrdmn added a commit that referenced this pull request on Apr 28, 2026
Phase 4 (issue #20): bit-exact reproducibility across chip counts as a first-class API property. A 1-chip and P-chip run with the same seed produce the same combined stream byte-for-byte with zero cross-chip coordination.
- Generator partition API: partition_rank/size, advance, position, advance_to, _chip_counter_offset, _advance_by_elements (#47)
- Dispatch wiring: uniform/normal/exponential + _into variants route counter_offset through Generator partition state; ProgramBuilder and GeneratorProgram apply per-step partition offsets (#48)
- Simulator + hardware tests: 43 CPU tests, nki_simulator partition equivalence suite, neuron scaling/profiler tests (#49)

Deferred: gamma/beta/chi_squared/truncated_normal partition wiring (data-dependent batch counts); hardware validation run pending.
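The advance / advance_to / position semantics listed in the commit message can be pictured with a toy counter-based generator. This is a minimal sketch, not the project's Generator: `ToyGenerator`, `_mix`, and `draw` are hypothetical stand-ins, the SHA-256 mixer replaces Threefry purely for illustration, and the method signatures are assumed.

```python
import hashlib
import struct


def _mix(seed: int, counter: int) -> int:
    """Toy stateless mixer: (seed, counter) -> 32-bit value.

    Hypothetical stand-in for a real counter-based RNG such as Threefry.
    """
    digest = hashlib.sha256(struct.pack("<QQ", seed, counter)).digest()
    return int.from_bytes(digest[:4], "little")


class ToyGenerator:
    """Hypothetical counter-based generator (assumed API, for illustration)."""

    def __init__(self, seed: int) -> None:
        self.seed = seed
        self.position = 0  # number of elements consumed so far

    def draw(self, n: int) -> list[int]:
        # Each element depends only on (seed, absolute position).
        out = [_mix(self.seed, self.position + i) for i in range(n)]
        self.position += n
        return out

    def advance(self, n: int) -> None:
        """Skip n elements without generating them."""
        self.position += n

    def advance_to(self, position: int) -> None:
        """Jump to an absolute element position."""
        self.position = position


N_SKIP, N_DRAW = 1024, 512

# Uninterrupted run of N_SKIP + N_DRAW elements.
full = ToyGenerator(seed=7).draw(N_SKIP + N_DRAW)

# advance(N_SKIP) then draw must reproduce the tail of the full run.
resumed = ToyGenerator(seed=7)
resumed.advance(N_SKIP)
tail = resumed.draw(N_DRAW)
assert tail == full[N_SKIP:]

# From a fresh generator, advance_to(N) lands at the same place as advance(N).
jumped = ToyGenerator(seed=7)
jumped.advance_to(N_SKIP)
assert jumped.draw(N_DRAW) == tail
```

Because every output is a pure function of the seed and an absolute counter, skipping ahead is only an integer update. That same property is what lets chips start at different offsets without exchanging state.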
Phase 4 PR 3 — simulator and hardware tests for counter-partitioned multi-chip RNG
Closes part of #20 (hardware validation items).
What this adds
tests/test_nki_partitioning.py: all tests skipped automatically on dev hosts without neuronxcc. Two gate markers:
- @pytest.mark.nki_simulator: requires TRNRAND_USE_SIMULATOR=1 + nki>=0.3.0
- @pytest.mark.neuron: trn1.32xlarge hardware only, run manually

TestPartitionEquivalencePerTile (nki_simulator)
Bit-exact partition equivalence for uniform, normal, exponential via the per-tile dispatch path (threefry_uniform_nki / threefry_normal_nki), P=2 and P=4. Concatenating P-chip draws in rank order must equal the 1-chip draw (torch.equal).

TestPartitionEquivalenceStreaming (nki_simulator)
Same invariant tested via threefry_stream_normal / threefry_stream_uniform directly, so counter_offset arithmetic in streaming dispatch is independently validated without going through GeneratorProgram.
Counter offset per chip: r × ceil(n / 16384) × _PROGRAM_TILES.

TestPartitionEquivalenceProgram (nki_simulator)
Validates GeneratorProgram.stream_into applies partition offsets correctly. Includes a multi-distribution program test (normal + uniform in the same program, P=2).
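The invariant these three classes check, together with the streaming counter-offset formula, can be illustrated without NKI or torch. The sketch below is a toy model under stated assumptions: `toy_tile`, `toy_stream`, `chip_counter_offset`, and `partitioned_draw` are hypothetical helpers, `n` is taken to be the per-chip element count, and `PROGRAM_TILES = 32` with 512 elements per tile is assumed only because 32 × 512 = 16,384. The real `_PROGRAM_TILES` value is not stated in this PR.

```python
import hashlib
import math
import struct

TILE_ELEMS = 512      # assumed elements produced per counter value (one tile)
LAUNCH_ELEMS = 16384  # elements per streaming launch, from the formula above
PROGRAM_TILES = 32    # assumed so that PROGRAM_TILES * TILE_ELEMS == LAUNCH_ELEMS


def toy_tile(seed: int, counter: int) -> list[int]:
    """Toy stand-in for one Threefry tile: 512 values from (seed, counter)."""
    out = []
    for lane in range(TILE_ELEMS):
        d = hashlib.sha256(struct.pack("<QQQ", seed, counter, lane)).digest()
        out.append(int.from_bytes(d[:4], "little"))
    return out


def toy_stream(seed: int, counter_offset: int, n: int) -> list[int]:
    """Generate n elements starting at the given tile counter."""
    out = []
    for t in range(math.ceil(n / TILE_ELEMS)):
        out.extend(toy_tile(seed, counter_offset + t))
    return out[:n]


def chip_counter_offset(rank: int, n_per_chip: int) -> int:
    """r * ceil(n / 16384) * PROGRAM_TILES, with n assumed to be per-chip."""
    return rank * math.ceil(n_per_chip / LAUNCH_ELEMS) * PROGRAM_TILES


def partitioned_draw(seed: int, n_total: int, p: int) -> list[int]:
    """Concatenate per-chip draws in rank order; chips share only the seed."""
    assert n_total % (LAUNCH_ELEMS * p) == 0, "N must be a multiple of 16384 * P"
    n_per_chip = n_total // p
    combined = []
    for rank in range(p):
        offset = chip_counter_offset(rank, n_per_chip)
        combined.extend(toy_stream(seed, offset, n_per_chip))
    return combined


N = LAUNCH_ELEMS * 4
single = toy_stream(seed=42, counter_offset=0, n=N)
for P in (2, 4):
    # The P-chip concatenation must match the 1-chip stream element for element.
    assert partitioned_draw(42, N, P) == single
```

Each rank derives its starting counter from its own rank and draw size alone, so the toy model needs no communication between chips. The real tests make the same comparison on tensors with torch.equal.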
TestAdvanceResumeEquivalence (nki_simulator)
advance(N_skip) followed by continued generation matches the tail of an uninterrupted N_skip + N_draw run. Also validates advance_to(N) == advance(N).

TestHardwarePartition (neuron, trn1.32xlarge, manual)
- test_zero_cross_chip_coordination: smoke test; profiler confirms no collective ops during kernel (NEURON_PPROF_SINK=/tmp/profile)
- test_near_linear_strong_scaling: 1-chip baseline for ≥87% efficiency (≥28× 1-chip) target at 32 chips
- test_partition_equivalence_P32_hardware: chip 0 of 32 matches first slice of 1-chip output (structural validation of counter arithmetic at P=32 without requiring 32 physical chips)
- test_neff_cache_reuse_with_partition: second stream_into with partition args completes in <100ms (NEFF cache hit)

Partition equivalence boundary
Equivalence holds when N is a multiple of 512 × P (per-tile) or 16,384 × P (streaming).
Test sizes are chosen to satisfy this. Partial-batch behavior is already covered in
test_partitioning.py.

Running