Skip to content

Commit cb37049

Browse files
authored
Release v0.7.1 — patch fix for multi-step partition counter bug (#53)
Fixes a logic error in GeneratorProgram._stream_into_nki where _counter was advanced by per-chip streaming_batches instead of full-step batches, breaking partition equivalence in multi-distribution programs (e.g. ProgramBuilder.normal().uniform().build() with partition_size > 1). Also resolves 3 NKI simulator CI test issues: timing test meaningless under simulator (skip), bit-exact tests moved to neuron marker, ruff format.
1 parent 4adff24 commit cb37049

2 files changed

Lines changed: 35 additions & 1 deletion

File tree

CHANGELOG.md

Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,40 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
77

88
## [Unreleased]
99

10+
## [0.7.1] - 2026-04-29
11+
12+
### Fixed
13+
14+
#### Multi-step `GeneratorProgram` partition counter bug (#52)
15+
16+
`GeneratorProgram._stream_into_nki` advanced `_counter` by the per-chip
17+
`streaming_batches` after each distribution step. In a multi-step partitioned
18+
program (`ProgramBuilder.normal(…).uniform(…).build()`) this caused the base
19+
counter for step 2 onward to be misaligned: chip r's step 2 started at the
20+
per-chip offset instead of the correct full-step offset. Concretely, the
21+
uniform step in a 2-distribution P=2 program drew from the wrong region of
22+
the Threefry stream, breaking partition equivalence.
23+
24+
Fix: advance `_counter` by `streaming_batches × partition_size` (the full
25+
single-chip step width) rather than `streaming_batches` alone. For
26+
`partition_size=1` this is a no-op; all single-step programs and single-chip
27+
programs are unaffected.
28+
29+
Detected by `test_multi_distribution_program_equivalence_P2` in the NKI
30+
simulator CI job introduced in #49.
31+
32+
#### Simulator CI test fixes (#52)
33+
34+
- `test_neff_cache_reuse_timing` — skip when `TRNRAND_USE_SIMULATOR=1`. The
35+
NKI simulator compiles every kernel invocation fresh (~10s each); asserting
36+
<500ms is meaningless without actual NEFF caching.
37+
- `TestSimulatorBitExact` renamed `TestHardwareBitExact`, moved from
38+
`nki_simulator` to `neuron` marker. The simulator's `nl.static_range` tile
39+
execution order differs from sequential per-tile invocations, producing
40+
completely different output sequences (max diff 5.58 for normal). On
41+
hardware, NEFF compilation guarantees consistent deterministic output.
42+
- `ruff format` — reformatted `program.py` to resolve CI lint failure.
43+
1044
## [0.7.0] - 2026-04-27
1145

1246
### Added

pyproject.toml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
44

55
[project]
66
name = "trnrand"
7-
version = "0.7.0"
7+
version = "0.7.1"
88
description = "Random number generation for AWS Trainium via NKI"
99
readme = "README.md"
1010
license = "Apache-2.0"

0 commit comments

Comments
 (0)