From 942713575d66d7dff0309cdb4318ed949d3d92b5 Mon Sep 17 00:00:00 2001
From: scttfrdmn <3011922+scttfrdmn@users.noreply.github.com>
Date: Mon, 27 Apr 2026 18:20:00 -0700
Subject: [PATCH] =?UTF-8?q?Release=20v0.7.0=20=E2=80=94=20counter-partitio?=
 =?UTF-8?q?ned=20multi-chip=20RNG?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Phase 4 (issue #20): bit-exact reproducibility across chip counts as a
first-class API property. A 1-chip and P-chip run with the same seed
produce the same combined stream byte-for-byte with zero cross-chip
coordination.

- Generator partition API: partition_rank/size, advance, position,
  advance_to, _chip_counter_offset, _advance_by_elements (#47)
- Dispatch wiring: uniform/normal/exponential + _into variants route
  counter_offset through Generator partition state; ProgramBuilder and
  GeneratorProgram apply per-step partition offsets (#48)
- Simulator + hardware tests: 43 CPU tests, nki_simulator partition
  equivalence suite, neuron scaling/profiler tests (#49)

Deferred: gamma/beta/chi_squared/truncated_normal partition wiring
(data-dependent batch counts); hardware validation run pending.
---
 CHANGELOG.md   | 94 ++++++++++++++++++++++++++++++++++++++++++++++++++
 pyproject.toml |  2 +-
 2 files changed, 95 insertions(+), 1 deletion(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 6f3f943..dd5a794 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -7,6 +7,100 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+## [0.7.0] - 2026-04-27
+
+### Added
+
+#### Counter-partitioned multi-chip RNG — bit-exact reproducibility across chip counts (#47 #48 #49)
+
+The central claim: a 1-chip run and a P-chip run with the same seed produce the
+same combined sample stream — byte-for-byte — with zero cross-chip coordination.
+Concatenating P-chip draws in rank order exactly reproduces the single-chip output.
+GPU libraries cannot offer this without user-side key-threading (JAX split) or
+manual offset management (cuRAND). On Trainium, it falls out of the Threefry
+counter structure for free.
+
+**`Generator` partition API (#47)**
+
+- `Generator(seed, partition_rank=r, partition_size=P)` — declares this generator
+  as chip `r` of `P`. Defaults `rank=0, size=1` so all existing single-chip code
+  is unchanged.
+- `partition_rank` / `partition_size` validation at construction time: `size ≥ 1`,
+  `0 ≤ rank < size`.
+- `generator.position() → int` — logical sample count consumed (rounded up to the
+  nearest 512-sample Threefry batch). Authoritative for checkpoint / resume.
+- `generator.advance(n_samples)` — skip `n_samples` forward without generating
+  them. Increments internal counter by `ceil(n / 512)` batches. Cheap — no RNG
+  work performed.
+- `generator.advance_to(position)` — jump to an absolute sample position for
+  checkpoint / resume. Rounded down to the nearest batch boundary.
+- `generator._chip_counter_offset(n)` — counter offset for chip `r` of `P`
+  dispatching `n` elements: `partition_rank × ceil(n / 512) + _counter`.
+- `generator._advance_by_elements(n)` — advance internal counter by batches
+  consumed by `n` samples. Called by NKI dispatch after each kernel launch.
+- `manual_seed()` resets `_counter` to 0 — reseeding starts a fresh stream.
+- `_BATCH_SIZE = 512` exported at module level (`generator.py`) for use in tests
+  and dispatch without NKI dependency.
+
+**Dispatch wiring (#48)**
+
+- `uniform()`, `normal()`, `exponential()` (per-tile NKI path): pass
+  `counter_offset = gen._chip_counter_offset(n)` to the NKI kernel, then call
+  `gen._advance_by_elements(n)`. Single-chip behaviour is unchanged
+  (`rank=0` → offset = `_counter`, same as before).
+- `normal_into()`, `uniform_into()`, `exponential_into()` (streaming path):
+  `counter_offset = partition_rank × streaming_batches + gen._counter`;
+  `gen._counter += streaming_batches` after dispatch.
+- `ProgramBuilder` captures `partition_rank`, `partition_size`, and `_counter`
+  from the `Generator` at `.new_program()` time.
+- `GeneratorProgram._stream_into_nki` applies per-step partition offset:
+  `counter_offset = self._counter + partition_rank × streaming_batches`.
+  Counter advances by `streaming_batches` per step (not `partition_size ×
+  streaming_batches`), so consecutive single-chip calls remain independent.
+
+**Simulator and hardware tests (#49)**
+
+- `tests/test_partitioning.py` — 43 CPU tests (no NKI required):
+  `TestGeneratorPartitionAPI` (30 tests), `TestPartitionEquivalenceCPU` using
+  `threefry_uniform_cpu` with 4-sample block units (8 tests),
+  `TestProgramBuilderPartitionContext` (5 tests).
+- `tests/test_nki_partitioning.py` — gated tests (skip automatically on dev hosts):
+  - `nki_simulator` marker: per-tile equivalence (uniform/normal/exponential, P=2,4),
+    streaming equivalence (stream_normal/stream_uniform, P=2,4), GeneratorProgram
+    equivalence including multi-distribution programs, advance/resume equivalence.
+  - `neuron` marker (trn1.32xlarge, manual): zero cross-chip coordination smoke test
+    (profiler confirms no collective ops), near-linear strong scaling benchmark (≥87%
+    efficiency / ≥28× 1-chip target), P=32 chip-0 slice structural check, NEFF cache
+    reuse with partition args (<100ms second launch).
+
+### Architecture notes
+
+**Counter unit distinction**: `threefry_uniform_cpu` uses 4-sample Threefry output
+blocks as its `counter_offset` unit; the NKI kernel aggregates 128 lanes into
+512-sample batches. `Generator._counter` and `_BATCH_SIZE = 512` track NKI batch
+units. CPU-reference tests use `ceil(n / 4)` block units directly.
+
+**NEFF is not partition-aware at compile time.** Partition rank and counter offset
+live in HBM arguments passed at runtime. The same compiled NEFF serves all chips in
+a partition — no per-rank recompilation.
+
+**Partition equivalence boundary**: N must be a multiple of `512 × P` (per-tile)
+or `16,384 × P` (streaming) for exact batch alignment. Partial-batch tails
+produce correct but non-partitionable outputs — documented and tested.
+
+### Deferred
+
+- **Partition wiring for `gamma`, `beta`, `chi_squared`, `truncated_normal`** —
+  rejection-loop batch counts are data-dependent and cannot be tracked at the
+  `Generator` level without significant complexity. These distributions remain
+  single-chip on the NKI path. Partition support will be added once a fixed-batch
+  rejection kernel is available.
+
+- **Hardware validation** — zero cross-chip coordination profiler trace and
+  near-linear strong scaling numbers (trn1.32xlarge, 32 chips) are pending a
+  manual hardware run. The `neuron`-marked tests in `test_nki_partitioning.py`
+  capture the assertions; results will be documented in a follow-up release note.
+
 ## [0.6.0] - 2026-04-22
 
 ### Added
diff --git a/pyproject.toml b/pyproject.toml
index 47b2285..39b21e8 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "trnrand"
-version = "0.6.0"
+version = "0.7.0"
 description = "Random number generation for AWS Trainium via NKI"
 readme = "README.md"
 license = "Apache-2.0"
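
Illustrative note (not part of the patch): the counter arithmetic described in the changelog above can be sketched in plain Python. This is a minimal model under stated assumptions, not the trnrand implementation. `SketchGenerator`, `toy_batch`, and `batches_for` are hypothetical names, and `toy_batch` uses a SHAKE-256 hash as a stand-in for a 512-sample Threefry batch. Only the formulas `partition_rank × ceil(n / 512) + _counter`, the rounding rules for `position` / `advance` / `advance_to`, and the batch size of 512 are taken from the changelog text.

```python
import hashlib
import struct

BATCH_SIZE = 512  # batch size stated in the changelog; all other names are illustrative


def toy_batch(seed: int, counter: int) -> list[float]:
    """Stand-in for one 512-sample batch (NOT Threefry): a pure function of (seed, counter)."""
    raw = hashlib.shake_256(struct.pack("<QQ", seed, counter)).digest(4 * BATCH_SIZE)
    return [word / 2**32 for word in struct.unpack(f"<{BATCH_SIZE}I", raw)]


def batches_for(n: int) -> int:
    """ceil(n / 512) using integer arithmetic."""
    return -(-n // BATCH_SIZE)


class SketchGenerator:
    """Hypothetical model of the partition bookkeeping; not the real Generator class."""

    def __init__(self, seed: int, partition_rank: int = 0, partition_size: int = 1):
        if partition_size < 1 or not 0 <= partition_rank < partition_size:
            raise ValueError("need partition_size >= 1 and 0 <= partition_rank < partition_size")
        self.seed = seed
        self.partition_rank = partition_rank
        self.partition_size = partition_size
        self._counter = 0  # counted in 512-sample batches

    def position(self) -> int:
        # Samples consumed, always a whole number of batches.
        return self._counter * BATCH_SIZE

    def advance(self, n_samples: int) -> None:
        # Skip forward without generating anything.
        self._counter += batches_for(n_samples)

    def advance_to(self, position: int) -> None:
        # Absolute jump, rounded down to a batch boundary.
        self._counter = position // BATCH_SIZE

    def _chip_counter_offset(self, n: int) -> int:
        return self.partition_rank * batches_for(n) + self._counter

    def uniform(self, n: int) -> list[float]:
        start = self._chip_counter_offset(n)
        out: list[float] = []
        for b in range(batches_for(n)):
            out.extend(toy_batch(self.seed, start + b))
        self._counter += batches_for(n)  # mirrors _advance_by_elements(n)
        return out[:n]


# One draw of N samples on 1 chip versus N / P samples on each of P chips.
P, N = 4, 512 * 4 * 3  # N is a multiple of 512 * P, so batches align exactly
single = SketchGenerator(seed=7).uniform(N)
combined: list[float] = []
for rank in range(P):
    chip = SketchGenerator(seed=7, partition_rank=rank, partition_size=P)
    combined += chip.uniform(N // P)
assert combined == single

# Partial-batch tail: N = 1000 is not a multiple of 512 * 2, so the slices no longer line up.
tail_single = SketchGenerator(seed=7).uniform(1000)
tail_combined = (SketchGenerator(7, 0, 2).uniform(500)
                 + SketchGenerator(7, 1, 2).uniform(500))
assert tail_combined != tail_single
```

In this model each chip derives its starting batch from its own rank and local counter alone, which is why no cross-chip communication is needed for a single aligned draw. The second assertion shows the alignment boundary from the architecture notes: with a partial-batch tail, every chip's output is still a valid slice of some batch, but the rank-ordered concatenation no longer equals the single-chip stream.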