
Commit 9427135

Release v0.7.0 — counter-partitioned multi-chip RNG
Phase 4 (issue #20): bit-exact reproducibility across chip counts as a first-class API property. A 1-chip and P-chip run with the same seed produce the same combined stream byte-for-byte with zero cross-chip coordination.

- Generator partition API: partition_rank/size, advance, position, advance_to, _chip_counter_offset, _advance_by_elements (#47)
- Dispatch wiring: uniform/normal/exponential + _into variants route counter_offset through Generator partition state; ProgramBuilder and GeneratorProgram apply per-step partition offsets (#48)
- Simulator + hardware tests: 43 CPU tests, nki_simulator partition equivalence suite, neuron scaling/profiler tests (#49)

Deferred: gamma/beta/chi_squared/truncated_normal partition wiring (data-dependent batch counts); hardware validation run pending.
1 parent 5dd7b8a commit 9427135

2 files changed

Lines changed: 95 additions & 1 deletion

CHANGELOG.md

Lines changed: 94 additions & 0 deletions
@@ -7,6 +7,100 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [0.7.0] - 2026-04-27

### Added

#### Counter-partitioned multi-chip RNG: bit-exact reproducibility across chip counts (#47 #48 #49)

The central claim: a 1-chip run and a P-chip run with the same seed produce the
same combined sample stream, byte for byte, with zero cross-chip coordination.
Concatenating P-chip draws in rank order exactly reproduces the single-chip output.
GPU libraries cannot offer this without user-side key-threading (JAX split) or
manual offset management (cuRAND). On Trainium, it falls out of the Threefry
counter structure for free.
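
The equivalence claim above can be illustrated with a toy counter-based generator. This is a minimal sketch under stated assumptions, not the trnrand implementation: SHA-256 stands in for Threefry, the batch size is 4 instead of 512, and `toy_batch` and `toy_draw` are hypothetical names introduced only for this example.

```python
import hashlib
import struct

BATCH = 4  # toy batch size (assumption); the NKI batch in this release is 512


def toy_batch(seed: int, counter: int) -> list[int]:
    """Return BATCH 32-bit values that depend only on (seed, counter)."""
    digest = hashlib.sha256(struct.pack("<QQ", seed, counter)).digest()
    return list(struct.unpack("<4I", digest[:16]))


def toy_draw(seed: int, counter_offset: int, n: int) -> list[int]:
    """Draw n values starting at batch index counter_offset."""
    n_batches = -(-n // BATCH)  # integer ceil(n / BATCH)
    out: list[int] = []
    for b in range(n_batches):
        out.extend(toy_batch(seed, counter_offset + b))
    return out[:n]


seed, total, chips = 42, 32, 4
per_chip = total // chips                  # samples drawn by each chip
batches_per_chip = -(-per_chip // BATCH)   # batches consumed by each chip

# One chip draws everything starting at counter 0.
single = toy_draw(seed, 0, total)

# Each chip computes its offset from its own rank; nothing is exchanged.
combined: list[int] = []
for rank in range(chips):
    combined.extend(toy_draw(seed, rank * batches_per_chip, per_chip))

print(combined == single)
```

Because each output batch is a pure function of the seed and a counter index, a chip only needs its rank and the per-chip batch count to find its slice of the stream.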

**`Generator` partition API (#47)**

- `Generator(seed, partition_rank=r, partition_size=P)`: declares this generator
  as chip `r` of `P`. Defaults are `rank=0, size=1`, so all existing single-chip code
  is unchanged.
- `partition_rank` / `partition_size` validation at construction time: `size ≥ 1`,
  `0 ≤ rank < size`.
- `generator.position() → int`: logical sample count consumed (rounded up to the
  nearest 512-sample Threefry batch). Authoritative for checkpoint / resume.
- `generator.advance(n_samples)`: skip `n_samples` forward without generating
  them. Increments the internal counter by `ceil(n_samples / 512)` batches. Cheap:
  no RNG work is performed.
- `generator.advance_to(position)`: jump to an absolute sample position for
  checkpoint / resume. Rounded down to the nearest batch boundary.
- `generator._chip_counter_offset(n)`: counter offset for chip `r` of `P`
  dispatching `n` elements: `partition_rank × ceil(n / 512) + _counter`.
- `generator._advance_by_elements(n)`: advance the internal counter by the batches
  consumed by `n` samples. Called by NKI dispatch after each kernel launch.
- `manual_seed()` resets `_counter` to 0, so reseeding starts a fresh stream.
- `_BATCH_SIZE = 512` exported at module level (`generator.py`) for use in tests
  and dispatch without NKI dependency.
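
The counter arithmetic in the list above can be sketched as a small stand-alone class. This is a hedged model of the bookkeeping only: `SketchGenerator` is a hypothetical name, it generates no samples, and it assumes `position()` is simply the batch counter multiplied by 512. The real `Generator` may differ in details the changelog does not state.

```python
_BATCH_SIZE = 512  # samples per batch, as stated in the changelog


def _ceil_batches(n: int) -> int:
    """Integer ceil(n / _BATCH_SIZE)."""
    return -(-n // _BATCH_SIZE)


class SketchGenerator:
    """Toy model of the partition counter bookkeeping (not the trnrand class)."""

    def __init__(self, seed: int, partition_rank: int = 0, partition_size: int = 1):
        if partition_size < 1:
            raise ValueError("partition_size must be >= 1")
        if not 0 <= partition_rank < partition_size:
            raise ValueError("partition_rank must satisfy 0 <= rank < size")
        self.seed = seed
        self.partition_rank = partition_rank
        self.partition_size = partition_size
        self._counter = 0  # measured in 512-sample batches

    def manual_seed(self, seed: int) -> None:
        self.seed = seed
        self._counter = 0  # reseeding starts a fresh stream

    def position(self) -> int:
        # Assumption: position is the batch counter expressed in samples.
        return self._counter * _BATCH_SIZE

    def advance(self, n_samples: int) -> None:
        self._counter += _ceil_batches(n_samples)

    def advance_to(self, position: int) -> None:
        self._counter = position // _BATCH_SIZE  # rounds down to a batch boundary

    def _chip_counter_offset(self, n: int) -> int:
        return self.partition_rank * _ceil_batches(n) + self._counter

    def _advance_by_elements(self, n: int) -> None:
        self._counter += _ceil_batches(n)


gen = SketchGenerator(seed=0, partition_rank=2, partition_size=4)
gen.advance(1000)                      # ceil(1000 / 512) = 2 batches
print(gen.position())                  # 1024
print(gen._chip_counter_offset(2048))  # 2 * 4 + 2 = 10
gen.advance_to(1500)                   # 1500 // 512 = 2 batches
print(gen.position())                  # 1024
```

With the defaults `rank=0, size=1` the offset reduces to `_counter`, which matches the statement that single-chip behaviour is unchanged.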

**Dispatch wiring (#48)**

- `uniform()`, `normal()`, `exponential()` (per-tile NKI path): pass
  `counter_offset = gen._chip_counter_offset(n)` to the NKI kernel, then call
  `gen._advance_by_elements(n)`. Single-chip behaviour is unchanged
  (`rank=0` → offset = `_counter`, same as before).
- `normal_into()`, `uniform_into()`, `exponential_into()` (streaming path):
  `counter_offset = partition_rank × streaming_batches + gen._counter`;
  `gen._counter += streaming_batches` after dispatch.
- `ProgramBuilder` captures `partition_rank`, `partition_size`, and `_counter`
  from the `Generator` at `.new_program()` time.
- `GeneratorProgram._stream_into_nki` applies per-step partition offset:
  `counter_offset = self._counter + partition_rank × streaming_batches`.
  Counter advances by `streaming_batches` per step (not `partition_size ×
  streaming_batches`), so consecutive single-chip calls remain independent.

**Simulator and hardware tests (#49)**

- `tests/test_partitioning.py` (43 CPU tests, no NKI required):
  `TestGeneratorPartitionAPI` (30 tests), `TestPartitionEquivalenceCPU` using
  `threefry_uniform_cpu` with 4-sample block units (8 tests),
  `TestProgramBuilderPartitionContext` (5 tests).
- `tests/test_nki_partitioning.py` (gated tests that skip automatically on dev hosts):
  - `nki_simulator` marker: per-tile equivalence (uniform/normal/exponential, P=2,4),
    streaming equivalence (stream_normal/stream_uniform, P=2,4), GeneratorProgram
    equivalence including multi-distribution programs, advance/resume equivalence.
  - `neuron` marker (trn1.32xlarge, manual): zero cross-chip coordination smoke test
    (profiler confirms no collective ops), near-linear strong scaling benchmark (≥87%
    efficiency / ≥28× 1-chip target), P=32 chip-0 slice structural check, NEFF cache
    reuse with partition args (<100ms second launch).

### Architecture notes

**Counter unit distinction**: `threefry_uniform_cpu` uses 4-sample Threefry output
blocks as its `counter_offset` unit; the NKI kernel aggregates 128 lanes into
512-sample batches. `Generator._counter` and `_BATCH_SIZE = 512` track NKI batch
units. CPU-reference tests use `ceil(n / 4)` block units directly.
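
The two counter units can be compared side by side with a short sketch. The helper names `cpu_blocks` and `nki_batches` are hypothetical and exist only for this illustration; the sizes 4 and 512 come from the note above.

```python
CPU_BLOCK = 4    # samples per Threefry output block (CPU reference unit)
NKI_BATCH = 512  # samples per NKI batch (128 lanes x 4 samples)


def cpu_blocks(n: int) -> int:
    """Number of 4-sample blocks needed for n samples: ceil(n / 4)."""
    return -(-n // CPU_BLOCK)


def nki_batches(n: int) -> int:
    """Number of 512-sample batches needed for n samples: ceil(n / 512)."""
    return -(-n // NKI_BATCH)


for n in (10, 512, 2048):
    print(n, cpu_blocks(n), nki_batches(n))
```

For sample counts that are multiples of 512, the block count is 128 times the batch count, so offsets written for one unit should not be passed where the other is expected.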

**NEFF is not partition-aware at compile time.** Partition rank and counter offset
live in HBM arguments passed at runtime. The same compiled NEFF serves all chips in
a partition, with no per-rank recompilation.

**Partition equivalence boundary**: N must be a multiple of `512 × P` (per-tile)
or `16,384 × P` (streaming) for exact batch alignment. Partial-batch tails
produce correct but non-partitionable outputs, which is documented and tested.
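
The alignment rule above can be checked before choosing a partition size. The following is a minimal sketch; `is_partition_aligned` is a hypothetical helper, not a function provided by the library.

```python
PER_TILE_UNIT = 512      # samples per chip-level unit on the per-tile path
STREAMING_UNIT = 16_384  # samples per chip-level unit on the streaming path


def is_partition_aligned(n: int, partition_size: int, streaming: bool = False) -> bool:
    """True when n total samples split into exact batches across all chips."""
    unit = STREAMING_UNIT if streaming else PER_TILE_UNIT
    return n > 0 and n % (unit * partition_size) == 0


print(is_partition_aligned(4096, 4))                   # 4096 % 2048 == 0
print(is_partition_aligned(4096, 4, streaming=True))   # 4096 % 65536 != 0
print(is_partition_aligned(65_536, 4, streaming=True)) # 65536 % 65536 == 0
```

A caller could use such a check to fall back to a single-chip draw, or to pad N, when bit-exact partition equivalence is required.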

### Deferred

- **Partition wiring for `gamma`, `beta`, `chi_squared`, `truncated_normal`**:
  rejection-loop batch counts are data-dependent and cannot be tracked at the
  `Generator` level without significant complexity. These distributions remain
  single-chip on the NKI path. Partition support will be added once a fixed-batch
  rejection kernel is available.

- **Hardware validation**: the zero cross-chip coordination profiler trace and
  near-linear strong scaling numbers (trn1.32xlarge, 32 chips) are pending a
  manual hardware run. The `neuron`-marked tests in `test_nki_partitioning.py`
  capture the assertions; results will be documented in a follow-up release note.

## [0.6.0] - 2026-04-22

### Added

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "trnrand"
-version = "0.6.0"
+version = "0.7.0"
 description = "Random number generation for AWS Trainium via NKI"
 readme = "README.md"
 license = "Apache-2.0"
