
Add NKI partition equivalence and hardware scaling tests #49

Merged
scttfrdmn merged 1 commit into main from feat/partition-simulator-tests
Apr 28, 2026


@scttfrdmn
Collaborator

Phase 4 PR 3 — simulator and hardware tests for counter-partitioned multi-chip RNG

Closes a portion of #20 (hardware validation items).

What this adds

tests/test_nki_partitioning.py: all tests are skipped automatically on dev hosts
without neuronxcc. Two markers gate execution:

  • @pytest.mark.nki_simulator — requires TRNRAND_USE_SIMULATOR=1 + nki>=0.3.0
  • @pytest.mark.neuron — trn1.32xlarge hardware only, run manually
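The gating described above can be pictured as a small decision function. This is only a sketch of the idea, not the autouse fixture in the PR: `skip_reason` and its parameters are hypothetical names, and both the `nki>=0.3.0` version check and any detection of trn1 hardware are omitted.

```python
import importlib.util
import os


def skip_reason(marker, env=None, find_spec=importlib.util.find_spec):
    """Return a skip message for a marked test, or None if it may run.

    Hypothetical helper: the real fixture's criteria may differ.
    """
    env = os.environ if env is None else env
    if marker not in ("nki_simulator", "neuron"):
        return None  # unmarked tests are never gated
    if find_spec("neuronxcc") is None:
        return "neuronxcc not installed"
    if marker == "nki_simulator" and env.get("TRNRAND_USE_SIMULATOR") != "1":
        return "set TRNRAND_USE_SIMULATOR=1 to run simulator tests"
    return None
```

In a real suite a fixture would call something like this for each marker on the requesting test and pass any non-None result to `pytest.skip`.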

TestPartitionEquivalencePerTile (nki_simulator)

Bit-exact partition equivalence for uniform, normal, exponential via the
per-tile dispatch path (threefry_uniform_nki / threefry_normal_nki), P=2 and P=4.

Concatenating P-chip draws in rank order must equal the 1-chip draw, as checked with torch.equal.
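The invariant can be illustrated with a toy counter-based generator. The sketch below uses SHA-256 as a stand-in for the block function (it is not Threefry) and plain list equality in place of `torch.equal`; `toy_block`, `draw`, and `partitioned_draw` are hypothetical names.

```python
import hashlib


def toy_block(seed, counter):
    """Stand-in for one counter-indexed block of output (NOT Threefry)."""
    digest = hashlib.sha256(f"{seed}:{counter}".encode()).digest()
    return list(digest[:4])  # four values per counter, for illustration


def draw(seed, n_blocks, counter_offset=0):
    """Generate n_blocks blocks starting at counter_offset."""
    out = []
    for counter in range(counter_offset, counter_offset + n_blocks):
        out.extend(toy_block(seed, counter))
    return out


def partitioned_draw(seed, n_blocks, size):
    """Concatenate per-rank draws in rank order; n_blocks must divide by size."""
    per_chip = n_blocks // size
    out = []
    for rank in range(size):
        out.extend(draw(seed, per_chip, counter_offset=rank * per_chip))
    return out


single = draw(seed=1234, n_blocks=8)
for P in (2, 4):
    assert partitioned_draw(1234, 8, P) == single
```

Because each rank only needs its own starting counter, no rank has to see another rank's output, which is the property the tests rely on.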

TestPartitionEquivalenceStreaming (nki_simulator)

Same invariant tested via threefry_stream_normal / threefry_stream_uniform
directly, so counter_offset arithmetic in streaming dispatch is independently
validated without going through GeneratorProgram.

Counter offset for the chip with partition rank r: r × ceil(n / 16384) × _PROGRAM_TILES.
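As a worked example of the offset formula, the sketch below evaluates it with integer arithmetic. It assumes r is the chip's partition rank and takes n exactly as it appears in the formula. The value of `_PROGRAM_TILES` is not given here, so it is passed in as a parameter, and the 4 used in the example is made up. `counter_offset_for_rank` is a hypothetical name, not the library's function.

```python
STREAM_TILE_ELEMS = 16_384  # divisor from the formula above


def counter_offset_for_rank(rank, n, program_tiles):
    """Evaluate rank * ceil(n / 16384) * program_tiles without floats."""
    tiles = -(-n // STREAM_TILE_ELEMS)  # integer ceiling division
    return rank * tiles * program_tiles


# Illustrative values only: program_tiles=4 is an assumption.
for rank in range(4):
    print(rank, counter_offset_for_rank(rank, n=32_768, program_tiles=4))
```

Rank 0 always starts at offset 0, and consecutive ranks are separated by a fixed stride of ceil(n / 16384) × _PROGRAM_TILES counters.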

TestPartitionEquivalenceProgram (nki_simulator)

Validates GeneratorProgram.stream_into applies partition offsets correctly.
Includes a multi-distribution program test (normal + uniform in the same program, P=2).

TestAdvanceResumeEquivalence (nki_simulator)

advance(N_skip) followed by continued generation matches the tail of an
uninterrupted N_skip + N_draw run. Also validates advance_to(N) == advance(N).
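The advance/resume property can be sketched with a toy generator whose output depends only on the seed and an absolute position. `ToyGenerator` is a hypothetical stand-in, not the library's Generator, and SHA-256 replaces the real counter-based cipher.

```python
import hashlib


class ToyGenerator:
    """Toy position-indexed generator (hypothetical; not the real Generator)."""

    def __init__(self, seed):
        self.seed = seed
        self.position = 0

    def _value(self, index):
        # Each output depends only on (seed, absolute index).
        return hashlib.sha256(f"{self.seed}:{index}".encode()).digest()[0]

    def advance(self, n):
        """Skip n elements relative to the current position."""
        self.position += n

    def advance_to(self, n):
        """Jump to absolute position n."""
        self.position = n

    def draw(self, n):
        start = self.position
        self.position += n
        return [self._value(i) for i in range(start, start + n)]


N_SKIP, N_DRAW = 5, 7
full = ToyGenerator(42).draw(N_SKIP + N_DRAW)

resumed = ToyGenerator(42)
resumed.advance(N_SKIP)
assert resumed.draw(N_DRAW) == full[N_SKIP:]

jumped = ToyGenerator(42)
jumped.advance_to(N_SKIP)  # same as advance(N_SKIP) on a fresh generator
assert jumped.draw(N_DRAW) == full[N_SKIP:]
```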

TestHardwarePartition (neuron — trn1.32xlarge, manual)

  • test_zero_cross_chip_coordination — smoke test; profiler confirms no collective ops during kernel (NEURON_PPROF_SINK=/tmp/profile)
  • test_near_linear_strong_scaling — records the 1-chip baseline; the target at 32 chips is ≥87% scaling efficiency (≥28× the 1-chip baseline)
  • test_partition_equivalence_P32_hardware — chip 0 of 32 matches first slice of 1-chip output (structural validation of counter arithmetic at P=32 without requiring 32 physical chips)
  • test_neff_cache_reuse_with_partition — second stream_into with partition args completes in <100ms (NEFF cache hit)
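The scaling target in the list above is simple arithmetic: a 28× speedup on 32 chips is 28 / 32 = 87.5% efficiency. The sketch below assumes the usual strong-scaling definition (fixed total work, speedup measured as a ratio of wall-clock times). The function name and the timings are illustrative, not measurements from this PR.

```python
def strong_scaling_efficiency(t_one_chip, t_p_chips, chips):
    """Efficiency = (t_1 / t_P) / P for a fixed total problem size."""
    speedup = t_one_chip / t_p_chips
    return speedup / chips


# Made-up timings giving exactly a 28x speedup on 32 chips.
eff = strong_scaling_efficiency(t_one_chip=28.0, t_p_chips=1.0, chips=32)
assert eff == 0.875
assert eff >= 0.87  # meets the stated target
```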

Partition equivalence boundary

Equivalence holds when N is a multiple of 512 × P (per-tile) or 16,384 × P (streaming).
Test sizes are chosen to satisfy this. Partial-batch behavior is already covered in
test_partitioning.py.
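A size check matching the boundary condition above might look like the following. The constants come from the text; `equivalence_holds` is a hypothetical helper, not part of the test suite.

```python
PER_TILE_ELEMS = 512       # per-tile dispatch granule
STREAMING_ELEMS = 16_384   # streaming dispatch granule


def equivalence_holds(n, size, granule):
    """True when n is a whole multiple of granule * size (size = P)."""
    return n % (granule * size) == 0


assert equivalence_holds(2_048, size=4, granule=PER_TILE_ELEMS)
assert not equivalence_holds(1_024, size=4, granule=PER_TILE_ELEMS)
assert equivalence_holds(65_536, size=4, granule=STREAMING_ELEMS)
```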

Running

# Dev host (all marked tests skip automatically)
pytest tests/test_nki_partitioning.py -v

# NKI simulator
TRNRAND_USE_SIMULATOR=1 pytest tests/test_nki_partitioning.py -m nki_simulator -v

# Hardware
pytest tests/test_nki_partitioning.py -m neuron -v

Covers the simulator-gated and hardware-gated tests for Phase 4
counter-partitioned multi-chip RNG:

- TestPartitionEquivalencePerTile (nki_simulator): bit-exact partition
  equivalence for uniform/normal/exponential via per-tile dispatch, P=2,4
- TestPartitionEquivalenceStreaming (nki_simulator): equivalence via
  threefry_stream_normal/uniform direct dispatch, P=2,4
- TestPartitionEquivalenceProgram (nki_simulator): GeneratorProgram.stream_into
  partition equivalence; multi-distribution program P=2 test
- TestAdvanceResumeEquivalence (nki_simulator): advance() then draw matches
  uninterrupted tail; advance_to(N) == advance(N)
- TestHardwarePartition (neuron): smoke test, near-linear strong scaling
  benchmark, P=32 chip-0 slice check, NEFF cache reuse with partition args

Tests run on dev hosts without neuronxcc: all nki_simulator/neuron-marked
tests skip automatically. Skipping criteria documented in autouse fixture.
scttfrdmn merged commit 5dd7b8a into main Apr 28, 2026
3 of 5 checks passed
scttfrdmn added a commit that referenced this pull request Apr 28, 2026
Phase 4 (issue #20): bit-exact reproducibility across chip counts as a
first-class API property. A 1-chip and P-chip run with the same seed produce
the same combined stream byte-for-byte with zero cross-chip coordination.

- Generator partition API: partition_rank/size, advance, position,
  advance_to, _chip_counter_offset, _advance_by_elements (#47)
- Dispatch wiring: uniform/normal/exponential + _into variants route
  counter_offset through Generator partition state; ProgramBuilder and
  GeneratorProgram apply per-step partition offsets (#48)
- Simulator + hardware tests: 43 CPU tests, nki_simulator partition
  equivalence suite, neuron scaling/profiler tests (#49)

Deferred: gamma/beta/chi_squared/truncated_normal partition wiring
(data-dependent batch counts); hardware validation run pending.