
fix(leaderlog): verify and enforce chain-derived epoch nonces before calculating schedules#71

Closed
wcatz wants to merge 1 commit into master from codex/identify-issue-with-leaderlog-output-for-epoch-612

Conversation

@wcatz
Owner

@wcatz wcatz commented Feb 12, 2026

Motivation

  • Leader schedule output could be wrong when the DB contained stale or incorrect final_nonce values, causing cached schedules to be served with the wrong nonce.
  • The system needs a canonical, self-derived nonce for full mode (no Koios dependency) so schedule generation is guaranteed correct.

Description

  • Added GetVerifiedNonceForEpoch(epoch int) to NonceTracker, which in full mode recomputes the epoch nonce from local chain data (ComputeEpochNonce) and repairs the DB cache if it differs, while delegating to the existing path in lite mode.
  • Replaced nonce lookups used for schedule generation and validation to call the verified path, covering manual /leaderlog, /nextblock, automatic calculateAndPostLeaderlog, schedule cache validation (scheduleNonceMatches), and /nonce output.
  • Fixed historical backfill schedule logic to use the correct nonce epoch (schedule for epoch N uses the nonce from epoch N-1) and made the code skip cached schedules when their nonces no longer match the verified value.
  • Minor formatting cleanup (gofmt) and small struct/variable layout normalization.
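
The cache-validation step described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the `LeaderSchedule` fields, the hex encoding of the stored nonce, and the `nonceVerifier` function type are assumptions made so the example is self-contained. Which epoch's nonce a schedule must match is decided inside the verifier and is not specified here.

```go
package main

import (
	"bytes"
	"encoding/hex"
	"errors"
	"fmt"
)

// LeaderSchedule is a hypothetical, trimmed-down stand-in for the cached
// schedule record; the real struct is not shown in this thread.
type LeaderSchedule struct {
	Epoch      int
	EpochNonce string // hex-encoded nonce the schedule was computed with (assumed field)
}

// nonceVerifier abstracts a call such as GetVerifiedNonceForEpoch so the
// sketch does not depend on NonceTracker.
type nonceVerifier func(epoch int) ([]byte, error)

// scheduleNonceMatches reports whether a cached schedule was computed with
// the nonce the verifier currently returns. Any lookup or decode failure is
// treated as a mismatch, so the caller falls back to recomputing.
func scheduleNonceMatches(s *LeaderSchedule, verify nonceVerifier) bool {
	if s == nil {
		return false
	}
	stored, err := hex.DecodeString(s.EpochNonce)
	if err != nil {
		return false
	}
	verified, err := verify(s.Epoch)
	if err != nil {
		return false
	}
	return bytes.Equal(stored, verified)
}

func main() {
	verify := func(epoch int) ([]byte, error) {
		if epoch != 612 {
			return nil, errors.New("unknown epoch")
		}
		return []byte{0xde, 0xad, 0xbe, 0xef}, nil
	}
	fresh := &LeaderSchedule{Epoch: 612, EpochNonce: "deadbeef"}
	stale := &LeaderSchedule{Epoch: 612, EpochNonce: "00000000"}
	fmt.Println(scheduleNonceMatches(fresh, verify)) // true
	fmt.Println(scheduleNonceMatches(stale, verify)) // false
}
```

Treating every error as a mismatch keeps the failure mode conservative: a schedule is only served from cache when its nonce has been positively confirmed.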

Testing

  • Formatted modified files with gofmt -w nonce.go commands.go main.go and the formatter completed successfully.
  • Ran targeted tests with go test ./... -run 'Test(Nonce|Store)' -count=1 which completed and returned OK (no tests to run in this environment).
  • Ran full test suite with go test ./... under a timeout, which completed successfully (exit 0).

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced validation of cached leader schedules against current nonce states to prevent stale data usage
    • Improved nonce verification consistency and accuracy across schedule calculation workflows
    • Strengthened cache invalidation mechanisms to ensure live computation when necessary for schedule recalculation

@wcatz
Owner Author

wcatz commented Feb 12, 2026

@CodeRabbit review

@coderabbitai

coderabbitai bot commented Feb 12, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai

coderabbitai bot commented Feb 12, 2026

Walkthrough

This PR strengthens nonce verification by introducing a GetVerifiedNonceForEpoch method, which in full mode recomputes the epoch nonce from local chain data and checks the cached value against it, and a scheduleNonceMatches helper that guards cached leader schedules. Nonce retrieval calls throughout the command and leaderlog flows are replaced with the verified path so that stale cached data is not served.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Nonce Verification Core (nonce.go) | Introduces GetVerifiedNonceForEpoch method that computes epoch nonce from local chain data, compares against cached value, and persists corrected nonce with source label. Full mode performs verification; lite mode delegates to existing GetNonceForEpoch. |
| Command Flows (commands.go) | Adds internal helper scheduleNonceMatches to validate cached LeaderSchedule against currently expected epoch nonce. Replaces GetNonceForEpoch calls with GetVerifiedNonceForEpoch and guards cached schedules with early nonce validation to avoid stale data. |
| Leaderlog Trigger & Calculation (main.go) | Updates checkLeaderlogTrigger to require scheduleNonceMatches check for nextEpoch schedules; changes nonce retrieval to verified path in calculateAndPostLeaderlog and backfillSchedules; aligns epoch-1 nonce fetching with verified nonce for backfill logic. |

Sequence Diagram

sequenceDiagram
    participant Client
    participant CommandHandler as Command<br/>Handler
    participant NonceTracker as NonceTracker
    participant Cache as Local<br/>Cache
    participant Ledger as Ledger/Chain
    
    Client->>CommandHandler: Request (e.g., nextblock/leaderlog)
    CommandHandler->>Cache: Check cached LeaderSchedule
    alt Schedule exists in cache
        CommandHandler->>NonceTracker: GetVerifiedNonceForEpoch(epoch)
        NonceTracker->>Ledger: Compute nonce from chain data
        NonceTracker->>Cache: Compare computed vs cached final_nonce
        alt Nonces match
            NonceTracker-->>CommandHandler: Return verified nonce
            CommandHandler->>CommandHandler: scheduleNonceMatches = true
            CommandHandler-->>Client: Use cached schedule
        else Nonces mismatch (stale)
            NonceTracker->>Cache: Update cache with correction
            NonceTracker-->>CommandHandler: Return verified nonce
            CommandHandler->>CommandHandler: scheduleNonceMatches = false
            CommandHandler->>NonceTracker: Recompute schedule live
            CommandHandler-->>Client: Use fresh schedule
        end
    else No cached schedule
        CommandHandler->>NonceTracker: GetVerifiedNonceForEpoch(epoch)
        NonceTracker-->>CommandHandler: Return verified nonce
        CommandHandler->>NonceTracker: Compute schedule with verified nonce
        CommandHandler-->>Client: Return computed schedule
    end

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly Related PRs

Poem

🐰 Hop through the epochs, verify as you go,
Cache-checked nonces ensure schedules don't grow stale,
No more stale schedules lurking in the shadows—
Verified nonces stand guard, fresh and true!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: verification and enforcement of chain-derived epoch nonces before calculating leader schedules, which is the core motivation and primary change across commands.go, main.go, and nonce.go. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
main.go (1)

1256-1263: ⚠️ Potential issue | 🟡 Minor

Canceled context passed to scheduleNonceMatches.

Line 1258 calls cancel(), then line 1259 passes the now-canceled ctx to scheduleNonceMatches. Currently this is harmless because scheduleNonceMatches never uses the ctx parameter (it creates its own contexts internally), but it's fragile — if someone later adds a DB call using ctx inside scheduleNonceMatches, it will silently fail.

Suggested fix: defer the cancel or create a fresh context
 		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+		defer cancel()
 		existing, err := i.store.GetLeaderSchedule(ctx, nextEpoch)
-		cancel()
 		if err == nil && existing != nil && i.scheduleNonceMatches(ctx, existing) {
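
Why the canceled context only becomes a problem once the callee starts honoring it can be shown with a small standalone program. This is an illustrative sketch: `lookup` is a hypothetical stand-in for any context-aware call, such as a DB query, and is not part of the PR.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// lookup stands in for any context-aware call (for example a DB query).
// Like most such calls, it fails fast if the context is already done.
func lookup(ctx context.Context) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	return nil
}

// checkEarlyCancel mirrors the fragile pattern: cancel first, then reuse ctx.
func checkEarlyCancel() error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	cancel()
	return lookup(ctx)
}

// checkDeferredCancel keeps ctx alive until the function returns.
func checkDeferredCancel() error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return lookup(ctx)
}

func main() {
	fmt.Println(errors.Is(checkEarlyCancel(), context.Canceled)) // true
	fmt.Println(checkDeferredCancel())                            // <nil>
}
```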
🤖 Fix all issues with AI agents
In `@commands.go`:
- Around line 922-927: The ctxShort scope is canceled via cancelShort() before
being passed to i.scheduleNonceMatches, which is fragile; either postpone
calling cancelShort() until after i.scheduleNonceMatches(...) completes or call
i.scheduleNonceMatches with a fresh context (e.g., context.Background() or a
newly created short-lived context) so you never pass a canceled ctx to
scheduleNonceMatches; update the code around ctxShort/cancelShort,
i.store.GetLeaderSchedule(ctxShort, currentEpoch) and the subsequent call to
i.scheduleNonceMatches to use one of these two approaches.

In `@nonce.go`:
- Around line 319-356: GetVerifiedNonceForEpoch currently always recomputes the
epoch nonce in fullMode which can block hot paths like
scheduleNonceMatches/checkLeaderlogTrigger; add an in-memory cache on
NonceTracker (e.g., verifiedNonceCache map[int][]byte) protected by the existing
mutex so GetVerifiedNonceForEpoch first checks the cache and returns immediately
if present, otherwise proceeds to ComputeEpochNonce and then stores the verified
result into verifiedNonceCache before returning; ensure you update/evict entries
from verifiedNonceCache when final nonces are updated from SetFinalNonce or
other write paths so cache stays consistent.
🧹 Nitpick comments (2)
commands.go (2)

1070-1087: scheduleNonceMatches accepts ctx but never uses it.

The ctx parameter is unused — GetVerifiedNonceForEpoch creates its own internal contexts. This is confusing for callers (some of whom pass already-canceled contexts). Either remove the parameter or propagate it into the nonce verification call.

Option A: drop the unused ctx parameter
-func (i *Indexer) scheduleNonceMatches(ctx context.Context, schedule *LeaderSchedule) bool {
+func (i *Indexer) scheduleNonceMatches(schedule *LeaderSchedule) bool {

Then update all call sites to remove the ctx argument.


637-637: /nonce command now returns verified nonce — note latency impact in full mode.

For /nonce next, this calls GetVerifiedNonceForEpoch(epoch+1) which in full mode triggers ComputeEpochNonce (streaming all blocks from genesis). The user-facing Telegram command could take several minutes to respond. Consider showing a "computing..." message before the call, similar to the leaderlog commands.
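
A rough sketch of the "computing..." acknowledgement follows. The `send` and `verify` parameters are hypothetical stand-ins for the bot's message-sending function and for GetVerifiedNonceForEpoch; neither signature is taken from the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// handleNonce acknowledges the request before starting the slow
// verification, then reports either the nonce or the failure.
func handleNonce(epoch int, send func(string), verify func(int) ([]byte, error)) {
	// Acknowledge first so the user sees progress during a long recompute.
	send(fmt.Sprintf("Computing verified nonce for epoch %d, this may take a few minutes...", epoch))

	nonce, err := verify(epoch)
	if err != nil {
		send(fmt.Sprintf("Failed to verify nonce for epoch %d: %v", epoch, err))
		return
	}
	send(fmt.Sprintf("Epoch %d nonce: %x", epoch, nonce))
}

func main() {
	var sent []string
	send := func(msg string) { sent = append(sent, msg) }
	verify := func(int) ([]byte, error) { return []byte{0xab, 0xcd}, nil }

	handleNonce(612, send, verify)
	fmt.Println(strings.Join(sent, "\n"))
}
```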

Comment on lines 922 to +927
ctxShort, cancelShort := context.WithTimeout(context.Background(), 10*time.Second)
schedule, err := i.store.GetLeaderSchedule(ctxShort, currentEpoch)
cancelShort()
if err == nil && schedule != nil && !i.scheduleNonceMatches(ctxShort, schedule) {
	schedule = nil
}

⚠️ Potential issue | 🟡 Minor

Same canceled-context pattern as checkLeaderlogTrigger.

ctxShort is canceled on line 924, then passed to scheduleNonceMatches on line 925. Currently safe because the ctx isn't used inside that function, but fragile.

Suggested fix
 	ctxShort, cancelShort := context.WithTimeout(context.Background(), 10*time.Second)
 	schedule, err := i.store.GetLeaderSchedule(ctxShort, currentEpoch)
-	cancelShort()
-	if err == nil && schedule != nil && !i.scheduleNonceMatches(ctxShort, schedule) {
+	if err == nil && schedule != nil && !i.scheduleNonceMatches(ctxShort, schedule) {
 		schedule = nil
 	}
+	cancelShort()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-ctxShort, cancelShort := context.WithTimeout(context.Background(), 10*time.Second)
-schedule, err := i.store.GetLeaderSchedule(ctxShort, currentEpoch)
-cancelShort()
-if err == nil && schedule != nil && !i.scheduleNonceMatches(ctxShort, schedule) {
-	schedule = nil
-}
+ctxShort, cancelShort := context.WithTimeout(context.Background(), 10*time.Second)
+schedule, err := i.store.GetLeaderSchedule(ctxShort, currentEpoch)
+if err == nil && schedule != nil && !i.scheduleNonceMatches(ctxShort, schedule) {
+	schedule = nil
+}
+cancelShort()
🤖 Prompt for AI Agents
In `@commands.go` around lines 922 - 927, The ctxShort scope is canceled via
cancelShort() before being passed to i.scheduleNonceMatches, which is fragile;
either postpone calling cancelShort() until after i.scheduleNonceMatches(...)
completes or call i.scheduleNonceMatches with a fresh context (e.g.,
context.Background() or a newly created short-lived context) so you never pass a
canceled ctx to scheduleNonceMatches; update the code around
ctxShort/cancelShort, i.store.GetLeaderSchedule(ctxShort, currentEpoch) and the
subsequent call to i.scheduleNonceMatches to use one of these two approaches.

Comment on lines +319 to +356
// GetVerifiedNonceForEpoch returns a nonce that is verified against canonical
// data for that epoch, repairing stale DB cache entries if needed.
//
// Full mode: always recompute from local chain data and upsert DB cache.
// Lite mode: use existing lookup priority (DB -> Koios).
func (nt *NonceTracker) GetVerifiedNonceForEpoch(epoch int) ([]byte, error) {
	if !nt.fullMode {
		return nt.GetNonceForEpoch(epoch)
	}

	computeCtx, computeCancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer computeCancel()
	computed, err := nt.ComputeEpochNonce(computeCtx, epoch)
	if err != nil {
		return nil, fmt.Errorf("failed to verify nonce for epoch %d: %w", epoch, err)
	}

	checkCtx, checkCancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer checkCancel()
	cached, cacheErr := nt.store.GetFinalNonce(checkCtx, epoch)
	if cacheErr == nil && cached != nil && bytes.Equal(cached, computed) {
		return cached, nil
	}

	source := "computed-verified"
	if cacheErr == nil && cached != nil && !bytes.Equal(cached, computed) {
		log.Printf("Correcting stale cached nonce for epoch %d: cached %x != computed %x", epoch, cached, computed)
		source = "computed-correction"
	}

	storeCtx, storeCancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer storeCancel()
	if err := nt.store.SetFinalNonce(storeCtx, epoch, computed, source); err != nil {
		log.Printf("Failed to persist verified nonce for epoch %d: %v", epoch, err)
	}

	return computed, nil
}

⚠️ Potential issue | 🟠 Major

GetVerifiedNonceForEpoch always recomputes from genesis in full mode — consider callers on hot paths.

In full mode this streams every block from Shelley genesis (up to 10 minutes). This is fine for on-demand commands (/leaderlog, /nonce) but problematic when called from scheduleNonceMatches inside checkLeaderlogTrigger, which runs synchronously on the block-processing goroutine. After a restart with an existing schedule in DB, the first block past the stability window will block the adder pipeline's event loop for the entire recomputation duration.

Consider caching the verified result per epoch in-memory (e.g., a map[int][]byte guarded by the existing mutex) so repeated calls for the same epoch return instantly after the first verification pass.

Sketch: add in-memory verified nonce cache
 type NonceTracker struct {
 	mu             sync.Mutex
 	store          Store
 	koiosClient    *koios.Client
 	evolvingNonce  []byte
 	currentEpoch   int
 	blockCount     int
 	candidateFroze bool
 	networkMagic   int
 	fullMode       bool
+	verifiedNonces map[int][]byte // cache: epoch -> verified nonce
 }

Then in GetVerifiedNonceForEpoch:

 func (nt *NonceTracker) GetVerifiedNonceForEpoch(epoch int) ([]byte, error) {
 	if !nt.fullMode {
 		return nt.GetNonceForEpoch(epoch)
 	}
+	nt.mu.Lock()
+	if cached, ok := nt.verifiedNonces[epoch]; ok {
+		nt.mu.Unlock()
+		return cached, nil
+	}
+	nt.mu.Unlock()
 
 	computeCtx, computeCancel := context.WithTimeout(context.Background(), 10*time.Minute)
 	defer computeCancel()
 	computed, err := nt.ComputeEpochNonce(computeCtx, epoch)
 	if err != nil {
 		return nil, fmt.Errorf("failed to verify nonce for epoch %d: %w", epoch, err)
 	}
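+	nt.mu.Lock()
+	if nt.verifiedNonces == nil {
+		// Lazily initialize: writing to a nil map panics.
+		nt.verifiedNonces = make(map[int][]byte)
+	}
+	nt.verifiedNonces[epoch] = computed
+	nt.mu.Unlock()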
 	// ... rest of DB cache check/repair
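
For reference, a standalone version of the same caching idea is sketched below, including the map initialization and the eviction step mentioned in the agent prompt. The `verifiedNonceCache` type and its method names are hypothetical; the sketch above stores the map directly on NonceTracker instead, and the `compute` callback stands in for ComputeEpochNonce.

```go
package main

import (
	"fmt"
	"sync"
)

// verifiedNonceCache is a hypothetical helper holding verified nonces per epoch.
type verifiedNonceCache struct {
	mu     sync.Mutex
	nonces map[int][]byte
}

func newVerifiedNonceCache() *verifiedNonceCache {
	// The map must be initialized: writing to a nil map panics in Go.
	return &verifiedNonceCache{nonces: make(map[int][]byte)}
}

// get returns a copy so callers cannot mutate the cached bytes.
func (c *verifiedNonceCache) get(epoch int) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	n, ok := c.nonces[epoch]
	if !ok {
		return nil, false
	}
	return append([]byte(nil), n...), true
}

func (c *verifiedNonceCache) put(epoch int, nonce []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.nonces[epoch] = append([]byte(nil), nonce...)
}

// invalidate drops an entry, for example when another write path updates the
// final nonce for that epoch.
func (c *verifiedNonceCache) invalidate(epoch int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.nonces, epoch)
}

// getOrCompute runs compute only on a cache miss. The mutex is not held
// during compute, so the slow path never runs under the lock; concurrent
// callers for the same epoch may each compute once.
func (c *verifiedNonceCache) getOrCompute(epoch int, compute func(int) ([]byte, error)) ([]byte, error) {
	if n, ok := c.get(epoch); ok {
		return n, nil
	}
	n, err := compute(epoch)
	if err != nil {
		return nil, err
	}
	c.put(epoch, n)
	return n, nil
}

func main() {
	calls := 0
	compute := func(epoch int) ([]byte, error) {
		calls++
		return []byte{byte(epoch)}, nil
	}
	c := newVerifiedNonceCache()
	_, _ = c.getOrCompute(5, compute)
	_, _ = c.getOrCompute(5, compute) // served from cache
	fmt.Println(calls)                // 1
	c.invalidate(5)
	_, _ = c.getOrCompute(5, compute) // recomputed after invalidation
	fmt.Println(calls)                // 2
}
```

Releasing the lock before the computation matters here because the same mutex guards other NonceTracker state; holding it for a multi-minute recompute would stall every other caller.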
🤖 Prompt for AI Agents
In `@nonce.go` around lines 319 - 356, GetVerifiedNonceForEpoch currently always
recomputes the epoch nonce in fullMode which can block hot paths like
scheduleNonceMatches/checkLeaderlogTrigger; add an in-memory cache on
NonceTracker (e.g., verifiedNonceCache map[int][]byte) protected by the existing
mutex so GetVerifiedNonceForEpoch first checks the cache and returns immediately
if present, otherwise proceeds to ComputeEpochNonce and then stores the verified
result into verifiedNonceCache before returning; ensure you update/evict entries
from verifiedNonceCache when final nonces are updated from SetFinalNonce or
other write paths so cache stays consistent.

wcatz added a commit that referenced this pull request Feb 12, 2026
1. Fix canceled context in /nextblock command:
   - Move cancelShort() after scheduleNonceMatches() call
   - Prevents passing canceled context to function

2. Add in-memory verified nonce cache:
   - Add verifiedNonces map[int][]byte to NonceTracker
   - Check cache before recomputing from genesis (10min operation)
   - Prevents blocking adder pipeline on hot paths
   - Cache guards with existing mutex

Resolves CodeRabbit findings from PR #71.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@wcatz
Owner Author

wcatz commented Feb 12, 2026

CodeRabbit issues addressed in #72:

  1. Fixed canceled context in /nextblock command
  2. Added in-memory cache to prevent repeated genesis recomputation (10min operation blocking hot paths)

Please review and merge #72 into this PR before final merge to master.

wcatz added a commit that referenced this pull request Feb 12, 2026
* fix(leaderlog): verify nonce from chain before serving schedules

* fix(pr71): address CodeRabbit performance and context issues

1. Fix canceled context in /nextblock command:
   - Move cancelShort() after scheduleNonceMatches() call
   - Prevents passing canceled context to function

2. Add in-memory verified nonce cache:
   - Add verifiedNonces map[int][]byte to NonceTracker
   - Check cache before recomputing from genesis (10min operation)
   - Prevents blocking adder pipeline on hot paths
   - Cache guards with existing mutex

Resolves CodeRabbit findings from PR #71.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
@wcatz wcatz closed this Feb 12, 2026
wcatz added a commit that referenced this pull request Feb 12, 2026
1. Fix canceled context in /nextblock command:
   - Move cancelShort() after scheduleNonceMatches() call
   - Prevents passing canceled context to function

2. Add in-memory verified nonce cache:
   - Add verifiedNonces map[int][]byte to NonceTracker
   - Check cache before recomputing from genesis (10min operation)
   - Prevents blocking adder pipeline on hot paths
   - Cache guards with existing mutex

Resolves CodeRabbit findings from PR #71.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
wcatz added a commit that referenced this pull request Feb 18, 2026
* fix(leaderlog): verify nonce from chain before serving schedules

* fix(pr71): address CodeRabbit performance and context issues

1. Fix canceled context in /nextblock command:
   - Move cancelShort() after scheduleNonceMatches() call
   - Prevents passing canceled context to function

2. Add in-memory verified nonce cache:
   - Add verifiedNonces map[int][]byte to NonceTracker
   - Check cache before recomputing from genesis (10min operation)
   - Prevents blocking adder pipeline on hot paths
   - Cache guards with existing mutex

Resolves CodeRabbit findings from PR #71.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>