Skip to content

Zubax Kulibin Float (ZKF) -- a compact IEEE 754-like simple floating point library#4

Draft
pavel-kirienko wants to merge 242 commits into
mainfrom
dev
Draft

Zubax Kulibin Float (ZKF) -- a compact IEEE 754-like simple floating point library#4
pavel-kirienko wants to merge 242 commits into
mainfrom
dev

Conversation

@pavel-kirienko

@pavel-kirienko pavel-kirienko commented Jun 13, 2026

Copy link
Copy Markdown
Member

pavel-kirienko and others added 16 commits June 8, 2026 21:02
The shared _zkf_cordic_m* table exposes inv_tau/kinv_mag/kinv for atan2's
use; zkf_sincos (MODE=0) does not consume them. Leaving them unconnected
is fine for Icarus but verilator treats the missing pins as fatal
(Warning-PINMISSING), failing the sincos-verilator PR matrix. Tie them off
with explicit empty connections, matching the existing .busy() style.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@pavel-kirienko pavel-kirienko changed the title Float library verification and timing fixes Zubax Kulibin Float (ZKF) -- a compact IEEE 754-like simple floating point library Jun 13, 2026
pavel-kirienko and others added 13 commits June 13, 2026 18:14
zkf_atan2: the generic angle unmap emitted the out-of-range -1/2 turn
(sign bit set) for finite x<0, |y|->0, but the documented theta range is
the half-open (-0.5, +0.5]. Fold -1/2 to the canonical +1/2 (mirrors
turn8's k==4 / _atan2_turn frac==1/2 clamp). Timing-neutral: an
exponent-only detect + sign-bit-only fold keeps the wide magnitude
datapath a plain mux into the registered output (atan2_w8m36 stays at
~111 MHz). Mirrored in atan2_reference/atan2_true, which shared the
defect via a false "no endpoint clamp needed" assumption.

Remaining items from the float HDL audit:
- Harden exp2/log2/sincos/atan2 _true mpmath oracles to high working
  precision; the default mis-rounds the last bit at WMAN >= 32.
- Enforce the <=1 ULP faithful-rounding contract in CI: new
  verify-float-accuracy target (zkf_transcendental/zkf_trig --check,
  reference vs _true) as the float-accuracy deep job. The per-PR cocotb
  suite only checks RTL==reference (a shared oracle).
- Coverage: log2 STAGE_NORMALIZE_OUTPUT, from_int @WMAN53 +
  STAGE_NORMALIZE=2, div/resize/round @WMAN48, directed atan2
  negative-x-axis (both y signs) and div inf/0.
- Docs: hypot overflow-boundary wording, to_int knob/guard note,
  zkf_div_eq pipeline-depth prose, zkf_matrix sno comment; drop the
  vestigial lod_reference model helper.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CI verify-deep on 55722d0 surfaced two issues in the just-added atan2 work:

- float-accuracy (zkf_trig --check) reported theta_ulp=28671 at 5/11 for a
  near-(-0.5)-axis pair: the reference rounds the magnitude to exactly 1/2 and
  canonicalizes to +0.5 while the mpmath oracle keeps -0.5+1ULP. The two are
  1 ULP apart on the angle circle, but the linear signed-magnitude metric maps
  +0.5 and -0.5 to opposite ends. Add _theta_ulp_diff, a circular metric
  (min(d, ring-d), ring = 2*index(+0.5)) used for atan2 THETA only; sin/cos and
  hypot magnitude keep the linear metric. A real >1 ULP boundary error still
  measures its true (small) distance, so only the spurious wrap collapses.
  Verified: the 5/11 gate now passes (theta_ulp=1 over 1,032,313 pairs).

- The new constant half_pos net added uncovered toggle points to the deep
  verilator coverage gate; wrap it in // verilator coverage_off like the
  module's other constant cones.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The CI toolchain image is upgraded to 2026-06-14 (Verilator 5.048), which
instruments generate-block and wide-datapath toggle/branch coverage the prior
5.032 build skipped. Close the resulting deep coverage-float-gate-full gap by
preferring bona-fide coverage over suppression.

Bona-fide coverage (deep matrix, float/tb/zkf_matrix.py):
- exp2/log2 w8m36 across STAGE_PRODUCT=2/3/4 fill the shared _zkf_pmul DSP-grid
  reduction trees (csum/r_p full-width product sums) and widen the significands
  so the hidden-bit MSBs become ordinary toggling bits.
- exp2 STAGE_REDUCE=1 at the wide WEXP drives the range-reduction lost-sticky
  path (r0_lost_sticky).
- atan2 w8m24 turns the divider/shamt/significand/magnitude bits into ordinary
  mid-bits instead of leaving them above the w5m11 format ceiling.
- sincos w8m24 plus a PARALLEL=1 row cover the wide CORDIC datapath, the
  decoupled sigma-replay engine, and the phi_seen fast-skip branch.
This adds ~12.8k bona-fide-covered toggle points.

Honest structural suppression (residual only), each per-net justified:
- per-row/column partial-product accumulators bounded below WP at any format
  (the full-sum csum/r_p are covered, not suppressed);
- sign/hidden bits at the widest covered format, engine constants, the phase
  FSM top state bit, accumulator/divider/CORDIC datapath headroom;
- the back-pressure output-hold path (the streaming matrix holds out_ready high);
- config-fixed phi/MODE branches and a few constant-sigma replay slots.

CI: verify.yml / verify-deep.yml -> ghcr image 2026-06-14.

#ci-float

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lost_sticky

Two deep-tier follow-ups to the 5.048 coverage closure (the prior commit's
float-extended-verilator coverage gate already passes; these fix float-accuracy
and the per-PR line gate).

float-accuracy: the atan2 faithful-rounding check reported theta_ulp=1024 on a
large-ratio input -- a tiny angle that one side rounds to +-MIN_NORMAL and the
other to 0. _ordered_index counted the no-subnormal gap between 0 and the
smallest normal as a full binade (1<<wfrac) rather than one representable step,
so a genuine <=1 ULP straddle of the zero/underflow boundary measured 1024.
Make _ordered_index a dense rank (adjacent representable values one apart); the
+-0.5 circular wrap and genuine >1-ULP straddles are unchanged.

per-PR coverage-float-gate: zkf_exp2.v r0_lost_sticky's line is uncovered under
Verilator 5.048 -- at the only STAGE_REDUCE=1 coverage format (w2m16, WEXP<=4)
lost_sticky is structurally 0, so its registered store folds to a constant the
line gate cannot reach. Re-suppress the registered copy (eval_lost_sticky stays
covered and the logic is verified alive by a directed sim) and drop the now
purposeless wide STAGE_REDUCE=1 grid rows.

#ci-float

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…#ci-float

The hard toggle-coverage gate proved unsustainable: keeping it green required
262 `// verilator coverage_off`/`coverage_on` suppressions littering float/hdl/.
Line and branch coverage stay mandatory exactly as before; toggle coverage is
demoted to advisory -- reported during RTL development but never fatal.

- zkf_coverage.py: --full now gates on v_line + v_branch only; v_toggle is
  reported as [advisory] and never returns nonzero. Docstrings/help updated.
- Makefile: deep-gate comment block updated to line+branch mandatory / toggle
  advisory.
- zkf_trig.py, zkf_transcendental.py: stop emitting the blanket coverage_off
  wrapper around generated tables; regenerated all 25 _tables/*.v.
- float/hdl/*.v: removed all 262 toggle-driven coverage_off pragmas and their
  justification comments (net -744 lines). Re-introduced 14 narrow blocks for
  the genuinely-unreachable line/branch points only:
    * 8 deep line/branch breakers (dead default arms, MODE-fixed ternaries,
      elaboration-only constant functions, small-W overshoot, r0_lost_sticky).
    * 6 per-PR-only line breakers (unused sideband tie-offs, backpressure-hold
      registers, deep-only force_zero path).
- zkf_matrix.py: simplified a stale sincos comment that referenced removed
  pragma line numbers.

RTL logic is byte-identical modulo comments (one benign cd_xn/cd_yn
declaration re-merge). Lint clean; Verilator-parsed all touched modules
(no BADVLTPRAGMA).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…#ci-float

The per-PR `--gate` was failing on `_zkf_fixed_to_float.v:45` (the `force_zero`
control input) even though every per-PR line was covered. Root cause: `--gate`
gated on the merged LCOV info, which is ptype-blind -- `verilator_coverage
--write-info` collapses line/branch/toggle into one DA record per source line.
force_zero is never asserted in the per-PR matrix, so its isolated *toggle*
point (line 45 has no line point of its own) showed as a 0-hit DA record and
failed the line gate. This ptype-blindness is exactly what forced the old
`// verilator coverage_off` sprawl: toggle points had to be suppressed to keep
the "line" gate green.

Fix: gate the per-PR tier on ptype-aware v_line points only (the same
merged_points data the --full tier already uses), consistent with the demotion
of toggle to advisory. Branch stays mandatory at the deep tier; toggle is
advisory everywhere and never gated.

With the gate no longer ptype-blind, the 6 per-PR `coverage_off` wraps added in
the previous commit are unnecessary -- all 6 suppressed toggle-only points
(unused sideband tie-offs and backpressure-hold register declarations, none of
which carry a v_line point, verified by building each module with
--coverage-line). Removed them; the 8 deep line/branch breakers remain.

Removed the now-dead ptype-blind `uncovered_lines()` helper. RTL logic is
byte-identical (comment-only deltas). Verified: both gates RC=0 against the
163-config deep coverage set (0 line, 0 branch, 4462 toggle advisory); lint
clean; no BADVLTPRAGMA.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Follow-up review caught three stale doc spots that still asserted the old
toggle-mandatory / ptype-blind behavior after the gate semantics changed:

- zkf_coverage.py module docstring: --gate is per-PR line-only (not "historical"),
  --full is line+branch mandatory with toggle advisory (not "line, branch, OR toggle").
- zkf_coverage.py HTML report empty-state: "every line and branch covered (toggle
  advisory)" instead of implying toggle is gated.
- Makefile: the verify-float-extended block said "line+branch+toggle coverage
  closure"; toggle is advisory, so "line+branch coverage closure (toggle advisory)".

Docs/comments only; no logic or RTL change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The verify-deep workflow took ~2h41m, gated by two near-equal poles:
float-extended-icarus (161m, 789 configs at 2-way xdist on a 2-core runner) and
verify-base (155m), which ran in parallel.

- Shard float-extended-icarus across 4 runners via strategy.matrix. The Icarus
  sweep is pure pass/fail correctness with no coverage artifacts, so it shards
  cleanly with no cross-job merge. New FLOAT_SHARD='k/N' env in
  test_float_matrix.py keeps configs whose stable md5(run.id)%N == k-1; unset
  keeps everything (bare `make verify` unchanged). Stable hash so every shard
  process partitions identically -- shards exactly tile the matrix (verified
  196+208+172+213 = 789). ~161m -> ~40m; bump to 6 shards to reach the ~30m
  float-accuracy floor.

- Remove the verify-base job. It re-ran the exact `make lint && make verify`
  that verify.yml already runs on the same push, so it was pure duplication
  (155m). The per-PR baseline (lint + sims + line-coverage gate) now lives only
  in verify.yml. Note: a manual `workflow_dispatch` of verify-deep no longer
  runs that baseline; normal #ci-float/main pushes still get it from verify.yml.

Expected deep wall-clock: ~2h41m -> ~40m (next pole float-accuracy at ~30m).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant