@@ -7,6 +7,100 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [Unreleased]
+## [0.7.0] - 2026-04-27
+
+### Added
+
+#### Counter-partitioned multi-chip RNG: bit-exact reproducibility across chip counts (#47 #48 #49)
+
+The central claim: a 1-chip run and a P-chip run with the same seed produce the
+same combined sample stream, byte for byte, with zero cross-chip coordination.
+Concatenating P-chip draws in rank order exactly reproduces the single-chip output.
+GPU libraries cannot offer this without user-side key-threading (JAX split) or
+manual offset management (cuRAND). On Trainium, it falls out of the Threefry
+counter structure for free.
+
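The equivalence claim above can be illustrated with a toy counter-based generator. This is a minimal sketch, not the library's kernel: `toy_batch` uses SHA-256 as a stand-in for Threefry, and `draw` and `partitioned_draw` are hypothetical helpers. The only property it relies on is that each 512-sample batch depends solely on `(seed, counter)`.

```python
import hashlib
import struct

BATCH = 512  # samples per counter value, mirroring the 512-sample batch above


def toy_batch(seed: int, counter: int) -> list[float]:
    """Return BATCH uniforms in [0, 1) determined only by (seed, counter).

    SHA-256 is an illustrative stand-in for Threefry, not the real kernel.
    """
    out = []
    for lane in range(BATCH):
        digest = hashlib.sha256(struct.pack("<QQQ", seed, counter, lane)).digest()
        # Keep the top 53 bits so the float conversion is exact and < 1.0.
        out.append((int.from_bytes(digest[:8], "little") >> 11) / 2**53)
    return out


def draw(seed: int, n: int, counter_offset: int = 0) -> list[float]:
    """Draw n samples (a multiple of BATCH) starting at counter_offset."""
    if n % BATCH != 0:
        raise ValueError("n must be a multiple of BATCH in this sketch")
    samples = []
    for b in range(n // BATCH):
        samples.extend(toy_batch(seed, counter_offset + b))
    return samples


def partitioned_draw(seed: int, n_total: int, size: int) -> list[float]:
    """Simulate `size` chips, each drawing its slice with no shared state."""
    per_chip = n_total // size
    batches = per_chip // BATCH
    chunks = [
        draw(seed, per_chip, counter_offset=rank * batches)
        for rank in range(size)
    ]
    # Concatenate in rank order.
    return [x for chunk in chunks for x in chunk]


# One "chip" and four "chips" yield the same stream for the same seed.
assert partitioned_draw(42, 4096, 4) == draw(42, 4096)
```

Each simulated chip needs only its rank and its own batch count to locate its slice of the stream, which is why no cross-chip communication appears anywhere in the sketch.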
+**`Generator` partition API (#47)**
+
+- `Generator(seed, partition_rank=r, partition_size=P)`: declares this generator
+  as chip `r` of `P`. Defaults to `partition_rank=0, partition_size=1`, so all
+  existing single-chip code is unchanged.
+- `partition_rank` / `partition_size` are validated at construction time:
+  `partition_size ≥ 1`, `0 ≤ partition_rank < partition_size`.
+- `generator.position() → int`: logical sample count consumed (rounded up to the
+  nearest 512-sample Threefry batch). Authoritative for checkpoint / resume.
+- `generator.advance(n_samples)`: skips `n_samples` forward without generating
+  them. Increments the internal counter by `ceil(n_samples / 512)` batches.
+  Cheap, because no RNG work is performed.
+- `generator.advance_to(position)`: jumps to an absolute sample position for
+  checkpoint / resume. Rounded down to the nearest batch boundary.
+- `generator._chip_counter_offset(n)`: counter offset for chip `r` of `P`
+  dispatching `n` elements: `partition_rank × ceil(n / 512) + _counter`.
+- `generator._advance_by_elements(n)`: advances the internal counter by the
+  batches consumed by `n` samples. Called by NKI dispatch after each kernel launch.
+- `manual_seed()` resets `_counter` to 0, so reseeding starts a fresh stream.
+- `_BATCH_SIZE = 512` is exported at module level (`generator.py`) for use in
+  tests and dispatch without an NKI dependency.
+
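The counter bookkeeping described in this list can be sketched in a few lines of pure Python. `PartitionCounterSketch` is a hypothetical class written only from the bullet descriptions above; it is not the library's `Generator`, and the real implementation may differ in details such as error types.

```python
BATCH_SIZE = 512  # mirrors the documented _BATCH_SIZE


def batches_for(n: int) -> int:
    """Integer ceil(n / BATCH_SIZE)."""
    return -(-n // BATCH_SIZE)


class PartitionCounterSketch:
    """Hypothetical model of the counter arithmetic, with no RNG attached."""

    def __init__(self, seed: int, partition_rank: int = 0, partition_size: int = 1):
        if partition_size < 1:
            raise ValueError("partition_size must be >= 1")
        if not 0 <= partition_rank < partition_size:
            raise ValueError("partition_rank must satisfy 0 <= rank < size")
        self.seed = seed
        self.partition_rank = partition_rank
        self.partition_size = partition_size
        self._counter = 0  # measured in 512-sample batches

    def position(self) -> int:
        # Whole batches consumed, expressed in samples.
        return self._counter * BATCH_SIZE

    def advance(self, n_samples: int) -> None:
        # Skip forward; partial batches round up.
        self._counter += batches_for(n_samples)

    def advance_to(self, position: int) -> None:
        # Absolute jump; rounds down to a batch boundary.
        self._counter = position // BATCH_SIZE

    def chip_counter_offset(self, n: int) -> int:
        # partition_rank * ceil(n / 512) + _counter
        return self.partition_rank * batches_for(n) + self._counter


gen = PartitionCounterSketch(seed=0, partition_rank=1, partition_size=4)
gen.advance(513)          # 513 samples occupy 2 batches
checkpoint = gen.position()
gen.advance_to(checkpoint)  # resuming from a saved position is a no-op here
```

In this model a value returned by `position()` is always a batch boundary, so feeding it back to `advance_to()` restores the same counter exactly.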
+**Dispatch wiring (#48)**
+
+- `uniform()`, `normal()`, `exponential()` (per-tile NKI path): pass
+  `counter_offset = gen._chip_counter_offset(n)` to the NKI kernel, then call
+  `gen._advance_by_elements(n)`. Single-chip behaviour is unchanged
+  (`partition_rank=0` → offset = `_counter`, same as before).
+- `normal_into()`, `uniform_into()`, `exponential_into()` (streaming path):
+  `counter_offset = partition_rank × streaming_batches + gen._counter`;
+  `gen._counter += streaming_batches` after dispatch.
+- `ProgramBuilder` captures `partition_rank`, `partition_size`, and `_counter`
+  from the `Generator` at `.new_program()` time.
+- `GeneratorProgram._stream_into_nki` applies a per-step partition offset:
+  `counter_offset = self._counter + partition_rank × streaming_batches`.
+  The counter advances by `streaming_batches` per step (not
+  `partition_size × streaming_batches`), so consecutive single-chip calls remain independent.
+
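The streaming offset formula can be checked with plain integer arithmetic. The sketch below covers a single dispatch step only; `step_counter_ranges` is a hypothetical helper, and the values `counter=10`, `streaming_batches=32`, and `partition_size=4` are illustrative assumptions rather than library defaults.

```python
def step_counter_ranges(counter: int, streaming_batches: int, partition_size: int) -> list[range]:
    """Counter values each rank consumes in one streaming step.

    Rank r starts at counter + r * streaming_batches, following the
    counter_offset formula in the list above.
    """
    ranges = []
    for rank in range(partition_size):
        start = counter + rank * streaming_batches
        ranges.append(range(start, start + streaming_batches))
    return ranges


# Illustrative values (assumptions, not library defaults).
ranges = step_counter_ranges(counter=10, streaming_batches=32, partition_size=4)

# Within one step, the per-rank ranges tile one contiguous block, in rank order.
flat = [c for r in ranges for c in r]
assert flat == list(range(10, 10 + 4 * 32))
```

With `partition_size=1` the helper returns a single range starting at `counter`, which matches the statement that single-chip behaviour is unchanged.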
+**Simulator and hardware tests (#49)**
+
+- `tests/test_partitioning.py` (43 CPU tests, no NKI required):
+  `TestGeneratorPartitionAPI` (30 tests), `TestPartitionEquivalenceCPU` using
+  `threefry_uniform_cpu` with 4-sample block units (8 tests),
+  `TestProgramBuilderPartitionContext` (5 tests).
+- `tests/test_nki_partitioning.py` (gated tests, skipped automatically on dev hosts):
+  - `nki_simulator` marker: per-tile equivalence (uniform/normal/exponential, P=2,4),
+    streaming equivalence (stream_normal/stream_uniform, P=2,4), GeneratorProgram
+    equivalence including multi-distribution programs, advance/resume equivalence.
+  - `neuron` marker (trn1.32xlarge, manual): zero cross-chip coordination smoke test
+    (profiler confirms no collective ops), near-linear strong scaling benchmark (≥87%
+    efficiency / ≥28× 1-chip target), P=32 chip-0 slice structural check, NEFF cache
+    reuse with partition args (<100 ms second launch).
+
+### Architecture notes
+
+**Counter unit distinction**: `threefry_uniform_cpu` uses 4-sample Threefry output
+blocks as its `counter_offset` unit; the NKI kernel aggregates 128 lanes into
+512-sample batches. `Generator._counter` and `_BATCH_SIZE = 512` track NKI batch
+units. CPU-reference tests use `ceil(n / 4)` block units directly.
+
+**NEFF is not partition-aware at compile time.** Partition rank and counter offset
+live in HBM arguments passed at runtime. The same compiled NEFF serves all chips in
+a partition, with no per-rank recompilation.
+
+**Partition equivalence boundary**: N must be a multiple of `512 × P` (per-tile)
+or `16,384 × P` (streaming) for exact batch alignment. Partial-batch tails
+produce correct but non-partitionable outputs; this behaviour is documented and tested.
+
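The unit conversions and the alignment rule in these notes reduce to a few integer checks. The helper names below (`cpu_blocks`, `nki_batches`, `is_partition_aligned`) are hypothetical and exist only for this sketch; the constants come from the notes above.

```python
CPU_BLOCK = 4             # samples per Threefry output block (CPU reference)
NKI_BATCH = 512           # samples per NKI batch
STREAMING_UNIT = 16_384   # per-chip alignment unit on the streaming path


def cpu_blocks(n: int) -> int:
    """ceil(n / 4): counter units used by the CPU reference."""
    return -(-n // CPU_BLOCK)


def nki_batches(n: int) -> int:
    """ceil(n / 512): counter units tracked by the generator."""
    return -(-n // NKI_BATCH)


def is_partition_aligned(n: int, partition_size: int, streaming: bool = False) -> bool:
    """True when n splits into whole batches on every chip."""
    unit = (STREAMING_UNIT if streaming else NKI_BATCH) * partition_size
    return n > 0 and n % unit == 0


# 4096 samples split cleanly over 4 chips on the per-tile path...
assert is_partition_aligned(4096, 4)
# ...but not on the streaming path, which needs multiples of 16,384 * 4.
assert not is_partition_aligned(4096, 4, streaming=True)
```

A caller could run a check like this before dispatch and fall back to a single-chip draw when the sample count leaves a partial-batch tail.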
+### Deferred
+
+- **Partition wiring for `gamma`, `beta`, `chi_squared`, `truncated_normal`**:
+  rejection-loop batch counts are data-dependent and cannot be tracked at the
+  `Generator` level without significant complexity. These distributions remain
+  single-chip on the NKI path. Partition support will be added once a fixed-batch
+  rejection kernel is available.
+
+- **Hardware validation**: the zero cross-chip coordination profiler trace and
+  near-linear strong scaling numbers (trn1.32xlarge, 32 chips) are pending a
+  manual hardware run. The `neuron`-marked tests in `test_nki_partitioning.py`
+  capture the assertions; results will be documented in a follow-up release note.
+
## [0.6.0] - 2026-04-22

### Added