
fix(deps): update adder to fix silent reconnect#54

Merged
wcatz merged 4 commits into master from fix/adder-reconnect-fix
Feb 9, 2026


wcatz (Owner) commented Feb 9, 2026

Summary

Context

The bug caused incomplete block data in the nonce tracking DB, leading to wrong epoch nonce computation and incorrect leader schedules. Epoch 612 nonce was 52f585... instead of the correct 8fcd93....

Test plan

  • go build passes
  • Wipe DB and resync from genesis — verify nonces across Shelley/Mary/Alonzo/Babbage eras match Koios
  • Monitor for silent stalls after reconnection events

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Added a configurable node-query timeout (new config option, default "10m"); startup logs show the active timeout.
  • Refactor
    • Nonce evolution now uses XOR-based semantics and carries block-hash through processing, changing epoch nonce behavior.
  • Tests
    • Updated unit and integration tests to validate XOR-based nonce evolution.
  • Documentation
    • Docs updated to reflect XOR nonce semantics, freeze threshold, and batch processing notes.
  • Chores
    • Updated a dependency to a newer release.

Updates blinklabs-io/adder to commit 460d03e which preserves event
channels during auto-reconnect, preventing silent block delivery
stalls.

Fixes: blinklabs-io/adder#611

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai bot commented Feb 9, 2026

📝 Walkthrough


Bumped adder dependency; added helm value and ConfigMap conditional for ntcQueryTimeout; made NodeQueryClient accept a per-instance queryTimeout and threaded it through initialization; propagated block_hash through DB/store row scanning and updated nonce evolution from Blake2b to XOR across nonce logic, tests, and epoch handling.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Dependency: `go.mod` | Updated `github.com/blinklabs-io/adder` from v0.37.0 to v0.37.1-0.20260209154719-460d03ed24c1. |
| Helm config & templates: `helm-chart/values.yaml`, `helm-chart/templates/configmap.yaml` | Added `config.leaderlog.ntcQueryTimeout` (default "10m") and conditional rendering of `ntcQueryTimeout` in the ConfigMap leaderlog section. |
| Node query client wiring: `localquery.go`, `main.go` | Added `queryTimeout time.Duration` field to `NodeQueryClient`; changed `NewNodeQueryClient` signature to accept `queryTimeout` (defaults to 10m when zero); passed `ntcQueryTimeout` through main initialization and updated related logging. |
| Storage / DB row plumbing: `store.go`, `db.go` | Extended SELECTs to include `block_hash`; added `blockHash` field to row iterator structs; changed `Scan()` signatures to return `blockHash`; updated Next/Scan implementations and `StreamBlockNonces` query. |
| Nonce algorithm and epoch logic: `nonce.go`, `epoch612_integration_test.go`, `nonce_test.go` | Replaced Blake2b-based nonce evolution with XOR-based semigroup (added `xorBytes` helper); updated `evolveNonce`, `ComputeEpochNonce`, `BackfillNonces`, epoch612 integration test, and unit tests to use XOR and to track previous block hash/`prevHashNonce` and `lastBlockHash` through transitions. |
| Docs/notes: `README.md`, `CLAUDE.md` | Updated documentation and notes to reflect XOR-based nonce semantics, new epoch boundary rules (freeze behavior), and bumped adder version. |

Sequence Diagram(s)

sequenceDiagram
  participant Config as "Helm / Config"
  participant Main as "main.go"
  participant NtC as "NodeQueryClient"
  participant Store as "Store / DB"
  participant Nonce as "Nonce logic"

  Config->>Main: provide ntcQueryTimeout
  Main->>NtC: NewNodeQueryClient(host, magic, ntcQueryTimeout)
  NtC->>NtC: use queryTimeout for node queries
  Main->>Store: StreamBlockNonces()  (select epoch, slot, nonce_value, block_hash)
  Store->>Main: rows(stream of epoch,slot,nonceValue,blockHash)
  Main->>Nonce: evolveNonce(currentEta, nonceValue)  (xorBytes)
  Nonce->>Nonce: track lastBlockHash -> prevHashNonce
  Note right of Nonce: On epoch boundary\nepochNonce = etaC XOR prevHashNonce

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

🐇 I hopped through bytes and hashes too,
I xor'd the nonce where hash once grew,
I carried block hashes, timeout in tow,
Configs, constructors — off we go!
Hoppity-hop, the pipeline's new.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(deps): update adder to fix silent reconnect' directly and clearly describes the main change: updating the adder dependency to fix a silent reconnect bug. It is concise, specific, and accurately reflects the primary objective of the PR. |


wcatz and others added 2 commits February 9, 2026 13:34
Add leaderlog.ntcQueryTimeout config option (Go duration string).
Defaults to 10m if not set. Configurable via helm values.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dano spec

The Cardano Nonce semigroup defines: Nonce a <> Nonce b = Nonce (xor a b)
The code was incorrectly using BLAKE2b-256(a || b) instead of XOR for:
1. Per-block nonce evolution (evolveNonce)
2. Epoch nonce transition (TICKN rule)

Additionally, the epoch transition was using the previous epoch nonce as
the second operand instead of η_ph (prev block hash nonce from TICKN state).

Correct formula per cardano-ledger TICKN rule:
  η(new) = η_c ⊕ η_ph
where η_c = candidate nonce, η_ph = last block hash of prior epoch boundary.

Changes:
- evolveNonce: BLAKE2b-256(a||b) → XOR(a,b)
- Epoch transition: hash(etaC||eta0) → XOR(etaC, prevHashNonce)
- StreamBlockNonces now returns block_hash for η_ph tracking
- Updated both SQLite and PostgreSQL store implementations
- Updated all tests to verify XOR behavior

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
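
A minimal sketch of the XOR combination as this commit message describes it. The function names `evolve` and `epochNonce` are hypothetical, the bodies are assumptions rather than the code in `nonce.go`, and the sketch takes the commit's reading of the Nonce semigroup at face value without checking it against the cardano-ledger source.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

const nonceLen = 32

// xorNonce returns the byte-wise XOR of two 32-byte values.
func xorNonce(a, b [nonceLen]byte) [nonceLen]byte {
	var out [nonceLen]byte
	for i := range out {
		out[i] = a[i] ^ b[i]
	}
	return out
}

// evolve folds one block's nonce value into the running nonce.
func evolve(eta, blockNonce [nonceLen]byte) [nonceLen]byte {
	return xorNonce(eta, blockNonce)
}

// epochNonce applies the transition as the commit states it:
// eta(new) = eta_c XOR eta_ph.
func epochNonce(etaC, etaPH [nonceLen]byte) [nonceLen]byte {
	return xorNonce(etaC, etaPH)
}

func main() {
	// Illustrative inputs only; these are not real chain values.
	var eta, n1, n2, prevHash [nonceLen]byte
	n1[0], n2[0], prevHash[31] = 0x0f, 0xf0, 0x01

	eta = evolve(eta, n1)
	eta = evolve(eta, n2)
	next := epochNonce(eta, prevHash)
	fmt.Println(hex.EncodeToString(next[:]))
}
```

Using fixed-size arrays instead of slices turns a length mismatch into a compile-time error, which sidesteps the out-of-range panic that a slice-based helper can hit at runtime.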

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nonce.go (1)

334-344: ⚠️ Potential issue | 🟠 Major

Post-loop final nonce has the same silent hex.DecodeString error discard.

Same issue as line 301. If the last block hash in the DB is corrupted, this silently produces a wrong nonce rather than erroring.

🤖 Fix all issues with AI agents
In `@nonce.go`:
- Around line 299-302: The code silently ignores errors from hex.DecodeString
when decoding lastBlockHash before calling xorBytes, which can yield a nil/short
prevHashNonce and cause panics or wrong nonces; update each occurrence (the
blocks setting prevHashNonce before calling xorBytes at the xorBytes(etaC,
prevHashNonce) call sites and inside BackfillNonces) to check the error returned
by hex.DecodeString(lastBlockHash), handle it (return or propagate an error) and
only call xorBytes when prevHashNonce was successfully decoded; ensure any error
includes context (e.g., lastBlockHash value or a descriptive message) so callers
can handle corrupted/invalid hex data.
🧹 Nitpick comments (2)
nonce.go (1)

80-88: xorBytes will panic if either input is shorter than 32 bytes.

If a caller passes a slice shorter than 32 (e.g., from a failed hex.DecodeString), this will panic with an index-out-of-range. A defensive length check would prevent a hard crash.

🛡️ Proposed fix
 func xorBytes(a, b []byte) []byte {
+	if len(a) < 32 || len(b) < 32 {
+		panic(fmt.Sprintf("xorBytes: expected 32-byte inputs, got %d and %d", len(a), len(b)))
+	}
 	result := make([]byte, 32)
 	for i := 0; i < 32; i++ {
 		result[i] = a[i] ^ b[i]
 	}
 	return result
 }
epoch612_integration_test.go (1)

75-84: Same silent hex.DecodeString error discard as in nonce.go.

In test code this is lower risk, but if a block hash in the DB is corrupted, this would silently produce a wrong nonce and a confusing test failure rather than a clear error message. Consider at least t.Fatalf on decode error for debuggability.

Also applies to: 115-118

Comment on lines +299 to +302
eta0 = xorBytes(etaC, prevHashNonce)
if lastBlockHash != "" {
prevHashNonce, _ = hex.DecodeString(lastBlockHash)
}

⚠️ Potential issue | 🟠 Major

Silent discard of hex.DecodeString error could cause a downstream panic in xorBytes.

If lastBlockHash contains invalid hex (e.g., corrupted DB data or odd-length string), hex.DecodeString returns a short/nil slice and the discarded error hides the root cause. The next call to xorBytes(etaC, prevHashNonce) would then panic or produce a wrong nonce.

The same pattern appears at lines 339–341 and in BackfillNonces at lines 394–396.

🛡️ Proposed fix (apply to all three occurrences)
 			if lastBlockHash != "" {
-				prevHashNonce, _ = hex.DecodeString(lastBlockHash)
+				decoded, decErr := hex.DecodeString(lastBlockHash)
+				if decErr != nil || len(decoded) != 32 {
+					return nil, fmt.Errorf("invalid block hash %q for η_ph: %w", lastBlockHash, decErr)
+				}
+				prevHashNonce = decoded
 			}
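
One detail of the proposed fix is worth noting: when decoding succeeds but the result is not 32 bytes, `decErr` is nil, so the `%w` verb has no error to wrap and the message loses its cause. A sketch that reports the two failure modes separately is shown below; `decodeBlockHash` is a hypothetical helper, not part of the PR.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// decodeBlockHash decodes a hex-encoded block hash and requires exactly
// 32 bytes, with a distinct message for each failure mode.
func decodeBlockHash(s string) ([]byte, error) {
	decoded, err := hex.DecodeString(s)
	if err != nil {
		return nil, fmt.Errorf("invalid block hash %q: %w", s, err)
	}
	if len(decoded) != 32 {
		return nil, fmt.Errorf("invalid block hash %q: got %d bytes, want 32", s, len(decoded))
	}
	return decoded, nil
}

func main() {
	// "zz" is not valid hex; "abcd" is valid hex but only 2 bytes long.
	for _, s := range []string{"zz", "abcd"} {
		_, err := decodeBlockHash(s)
		fmt.Println(err)
	}
}
```

Centralising the check in one helper would also cover the three occurrences the review mentions without repeating the validation at each site.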

…y window

- README: nonce evolution uses XOR per Cardano Nonce semigroup
- README: epoch transition uses TICKN rule (η_c XOR η_ph)
- CLAUDE.md: fix stability window references from 70% to 60%
- CLAUDE.md: update nonce evolution description to XOR
- Update adder dependency version reference

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
README.md (1)

245-245: ⚠️ Potential issue | 🟡 Minor

Stability window math is internally inconsistent.

This line says “60% into the current epoch” but then equates it to 4k/f = 172,800 slots, which is 40% of 432,000. Elsewhere you cite 259,200 slots for 60%. Please pick one and make the math consistent across docs.

✏️ Suggested edit (if 60% is correct)
-The next epoch's nonce becomes available after the stability window — 60% into the current epoch (Conway era, `4k/f` = 172,800 slots). duckBot automatically triggers leader schedule calculation at this point.
+The next epoch's nonce becomes available after the stability window — 60% into the current epoch (Conway era, 259,200 slots on mainnet). duckBot automatically triggers leader schedule calculation at this point.
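
The two figures may be reconcilable if `4k/f` is read as the length of the window measured back from the end of the epoch, which would place the freeze point at 432,000 − 172,800 = 259,200 slots, or 60% into the epoch. The arithmetic below is a sketch under that reading, assuming mainnet values of k = 2160 and f = 0.05, which are not stated in this PR.

```go
package main

import "fmt"

// stabilityWindow computes 4k/f, taking 1/f as an integer to avoid
// floating-point rounding (f = 0.05 gives fInverse = 20).
func stabilityWindow(k, fInverse int) int {
	return 4 * k * fInverse
}

// freezeSlot is the slot within the epoch at which the window begins,
// assuming the window is measured back from the end of the epoch.
func freezeSlot(epochLength, window int) int {
	return epochLength - window
}

func main() {
	const (
		k           = 2160   // security parameter (assumed mainnet value)
		fInverse    = 20     // 1/f for f = 0.05 (assumed mainnet value)
		epochLength = 432000 // slots per epoch, as cited in the review
	)
	window := stabilityWindow(k, fInverse)
	freeze := freezeSlot(epochLength, window)
	fmt.Printf("window=%d freeze=%d percent=%d\n", window, freeze, 100*freeze/epochLength)
}
```

If that reading is right, the README sentence is consistent but ambiguous, and naming both numbers (172,800-slot window, freeze at slot 259,200) would remove the ambiguity.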
🤖 Fix all issues with AI agents
In `@README.md`:
- Line 13: Update the "Epoch Nonces" blurb to correct the backfill timing:
replace the phrase "~400 epochs in under 2 minutes" with a figure that aligns
with the later README statement (e.g., reflect the measured full Shelley-to-tip
sync of ~43 minutes or remove the numeric claim), and ensure the revised
sentence mentions the same sync context as "full Shelley-to-tip" to avoid
inconsistency with the rest of the README.

**Leader Schedule** — Pure Go CPRAOS implementation checking every slot per epoch against your VRF key. Calculates next epoch schedule automatically at the stability window (60% into epoch). On-demand via `/leaderlog`.

-**Epoch Nonces** — In full mode, streams every block from Shelley genesis extracting VRF outputs per era, evolving the nonce via BLAKE2b-256, and freezing at the stability window. Backfills ~400 epochs in under 2 minutes.
+**Epoch Nonces** — In full mode, streams every block from Shelley genesis extracting VRF outputs per era, evolving the nonce via XOR (Cardano Nonce semigroup), and freezing at the stability window. Backfills ~400 epochs in under 2 minutes.

⚠️ Potential issue | 🟡 Minor

Fix the backfill timing claim to match actual sync performance.

Line 13 says “~400 epochs in under 2 minutes,” but later in this README the full Shelley-to-tip sync is ~43 minutes. Please align these statements to avoid misleading expectations.

✏️ Suggested edit
-**Epoch Nonces** — In full mode, streams every block from Shelley genesis extracting VRF outputs per era, evolving the nonce via XOR (Cardano Nonce semigroup), and freezing at the stability window. Backfills ~400 epochs in under 2 minutes.
+**Epoch Nonces** — In full mode, streams every block from Shelley genesis extracting VRF outputs per era, evolving the nonce via XOR (Cardano Nonce semigroup), and freezing at the stability window. Full Shelley‑to‑tip backfill completes in ~43 minutes.

@wcatz wcatz merged commit 96872b3 into master Feb 9, 2026
2 checks passed
@wcatz wcatz deleted the fix/adder-reconnect-fix branch February 9, 2026 20:49
wcatz added a commit that referenced this pull request Feb 18, 2026
* fix(deps): update adder to fix silent reconnect channel orphaning

Updates blinklabs-io/adder to commit 460d03e which preserves event
channels during auto-reconnect, preventing silent block delivery
stalls.

Fixes: blinklabs-io/adder#611

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(config): make NtC query timeout configurable

Add leaderlog.ntcQueryTimeout config option (Go duration string).
Defaults to 10m if not set. Configurable via helm values.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(nonce): use XOR for nonce evolution and epoch transitions per Cardano spec

The Cardano Nonce semigroup defines: Nonce a <> Nonce b = Nonce (xor a b)
The code was incorrectly using BLAKE2b-256(a || b) instead of XOR for:
1. Per-block nonce evolution (evolveNonce)
2. Epoch nonce transition (TICKN rule)

Additionally, the epoch transition was using the previous epoch nonce as
the second operand instead of η_ph (prev block hash nonce from TICKN state).

Correct formula per cardano-ledger TICKN rule:
  η(new) = η_c ⊕ η_ph
where η_c = candidate nonce, η_ph = last block hash of prior epoch boundary.

Changes:
- evolveNonce: BLAKE2b-256(a||b) → XOR(a,b)
- Epoch transition: hash(etaC||eta0) → XOR(etaC, prevHashNonce)
- StreamBlockNonces now returns block_hash for η_ph tracking
- Updated both SQLite and PostgreSQL store implementations
- Updated all tests to verify XOR behavior

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: fix nonce evolution description (XOR not BLAKE2b), 60% stability window

- README: nonce evolution uses XOR per Cardano Nonce semigroup
- README: epoch transition uses TICKN rule (η_c XOR η_ph)
- CLAUDE.md: fix stability window references from 70% to 60%
- CLAUDE.md: update nonce evolution description to XOR
- Update adder dependency version reference

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>