
[KLC-2388] Gate klv_node_type / klv_peer_type writes on cache miss#67

Open
nickgs1337 wants to merge 6 commits into develop from klc-2388-skip-metric-on-cache-miss

Contributor

@nickgs1337 nickgs1337 commented May 29, 2026

On validator startup, the heartbeat sender's updateMetrics was overwriting the correct startup-time klv_node_type=validator with observer, because peerTypeProvider.ComputeForPubKey returns ObserverList silently on cache miss and the cache isn't populated until the first epoch-start event fires. Same root cause hit klv_peer_type. Operators and monitoring tooling saw the wrong node type for the entire bootstrap window — minutes to hours depending on epoch length.

The fix adds an IsCachePopulated() readiness gate on PeerTypeProvider that flips true only when the epoch-start handler fires. Sender.updateMetrics early-returns when the gate is closed, preserving the startup-init values. A fourth commit also seeds klv_peer_type at startup alongside klv_node_type, so consumers never observe an empty/absent field during the bootstrap window.

What changed

4 atomic commits:

  • Add IsCachePopulated readiness gate to PeerTypeProvider — new isReady field flipped inside the existing epoch-start handler.
  • Expose IsCachePopulated on PeerTypeProviderHandler interface — wired through the heartbeat-side interface + mock stub.
  • Skip klv_node_type and klv_peer_type writes on cache miss — early-return in Sender.updateMetrics + sender tests via an UpdateMetrics exporter.
  • Initialize klv_peer_type at startup alongside klv_node_type — one-line seed of ObserverList so the field is never empty pre-gate.
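
As a rough illustration of how the gate and the early return fit together, here is a minimal, self-contained sketch. It is not the code from this PR: `fakeProvider`, the `metrics` map, and the string values `"elected"`, `"observer"`, and `"validator"` are stand-ins assumed for the example, and the real sender writes through the status handler rather than a map.

```go
package main

import "fmt"

// peerTypeProvider is a hypothetical, trimmed-down stand-in for the
// PeerTypeProviderHandler interface described above.
type peerTypeProvider interface {
	IsCachePopulated() bool
	ComputeForPubKey(pubKey []byte) (string, error)
}

// fakeProvider mimics the cache-miss behaviour from the description:
// it silently answers "observer" until the cache is populated.
type fakeProvider struct {
	ready bool
}

func (p *fakeProvider) IsCachePopulated() bool { return p.ready }

func (p *fakeProvider) ComputeForPubKey(_ []byte) (string, error) {
	if !p.ready {
		return "observer", nil // silent fallback on cache miss
	}
	return "elected", nil
}

// metrics is an assumed stand-in for the status handler's string metrics.
type metrics map[string]string

// updateMetrics sketches the gated write: it returns before calling
// ComputeForPubKey when the provider reports an unpopulated cache.
func updateMetrics(p peerTypeProvider, m metrics, pubKey []byte) {
	if !p.IsCachePopulated() {
		return // gate closed: keep the startup-init values
	}
	peerType, err := p.ComputeForPubKey(pubKey)
	if err != nil {
		return // sketch only: leave the previous values in place
	}
	nodeType := "observer"
	if peerType != "observer" {
		nodeType = "validator"
	}
	m["klv_peer_type"] = peerType
	m["klv_node_type"] = nodeType
}

func main() {
	m := metrics{"klv_node_type": "validator", "klv_peer_type": "observer"}
	p := &fakeProvider{}

	updateMetrics(p, m, []byte("key")) // gate closed, nothing is written
	fmt.Println(m["klv_node_type"], m["klv_peer_type"])

	p.ready = true
	updateMetrics(p, m, []byte("key")) // gate open, real values are written
	fmt.Println(m["klv_node_type"], m["klv_peer_type"])
}
```

With the gate closed the seeded pair survives the heartbeat tick; once `ready` flips, the computed classification replaces it.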

Validation

  1. Unit tests (4 new): `IsCachePopulated` false at construction, flips true after epoch-start event; `updateMetrics` skips writes when not ready, writes correct values when ready. All green via `go test ./node/heartbeat/process/... ./core/process/peer/...`.

  2. Binary verification: extracted the validator binary from the docker image and confirmed via `go tool nm` that the `IsCachePopulated` + `markReady` symbols compile in, and via `go tool objdump -s 'Sender.*updateMetrics'` that the gate is actually called at sender.go:177 before `computePeerList` at sender.go:181.

  3. In-the-wild observation: during this sprint's KLC-1920 multi-validator harness work, captured the live bug on the fallback container reporting klv_node_type=observer despite klv_redundancy_level=2 — independent pre-fix proof.

  4. End-to-end Docker A/B/C/D runs on a 3-validator + 1-redundancy-2-fallback localnet (kleverchain-localnet harness, both images built from the same Klever HEAD):

    • Run A — pre-fix baseline, `slotsPerEpoch=300`: fallback flips to `observer` at T+25s — KLC-2388 bug reproduces cleanly.
    • Run B — patched gate-only, same config: fallback stays at `validator` for 4+ minutes through 50+ post-genesis slots; `klv_peer_type` empty (gate holds).
    • Run C — patched gate-only, `slotsPerEpoch=20`: captures the gate opening at slot 26 when the first epoch-start event fires; transitions to legitimate values (`validator/elected` for primaries, `observer/observer` for the fallback's observer-key heartbeat).
    • Run D — patched + `klv_peer_type` startup-init, `slotsPerEpoch=20`: confirms `peer_type=observer` from t=0 (vs empty in B/C), held through gate-closed window, transitions cleanly at gate open. No dangling-dash TUI artifact, no empty Prometheus label.
  5. Side-effect audit: grep'd every reader of MetricNodeType / MetricPeerType in the codebase. Only consumers are the statusHandler presenter (TUI dashboard) and the /node/status + Prometheus exposers. No consensus, sync, fork-detection, heartbeat-message, or block-processing code branches on these values. Fix has no internal impact beyond observability.

Notes

The KLC-2388 bug requires the peerTypeProvider cache to be unpopulated when the heartbeat first ticks (a production peer-churn condition). On localnet the cache populates correctly at construction, so the A/B runs deliberately exercise the cache-not-yet-populated codepath. The gate is correct regardless of whether the underlying cache-miss is reproducible in any specific environment.

One build-system gotcha worth flagging for reviewers: the first two patched docker builds produced a binary where the `IsCachePopulated` symbol existed but `updateMetrics` wasn't calling it — BuildKit cache mounts (`--mount=type=cache,target=/root/.cache/go-build`) persist across `docker build --no-cache` and reused a stale `.a` for `node/heartbeat/process`. Dropping the cache mounts temporarily for the rebuild defeated it; the upstream Dockerfile is unchanged in this PR.

This PR fixes a startup observability bug by preventing the heartbeat sender from emitting incorrect peer-type metrics during node bootstrap. It gates metric writes on a peer-type cache readiness flag and seeds a bootstrap peer-type value to avoid empty metrics before the cache is populated.

Impact assessment

  • Affected components:
    • Heartbeat sender metrics (node/heartbeat/process/sender.go / Sender.updateMetrics)
    • Peer-type provider cache/state and readiness API (core/process/peer/peerTypeProvider.go)
    • Observability surface: TUI, /node/status, Prometheus exporter (metrics and status codepaths)
  • Blockchain-critical components: none. The change is confined to observability and the peer-type cache readiness API. It does not change consensus, transaction processing, state management, KVM, or networking decision logic.
  • Node stability and data integrity: unchanged. No on-chain state, consensus flows, or persistent node state are modified. The gating only prevents emission of misleading metric labels during bootstrap.

Key changes

  • Readiness gate
    • PeerTypeProvider gains an isReady field and an exported IsCachePopulated() bool (protected by RW locks). isReady is set inside existing cache-refresh/updateCache logic (including epoch-start refreshes) when a non-empty cache is produced.
  • Interface and test stubs
    • PeerTypeProviderHandler interface now includes IsCachePopulated(); mocks/stubs updated to support test control.
  • Metric gating and bootstrap seeding
    • Sender.updateMetrics early-returns when IsCachePopulated() is false, skipping ComputeForPubKey and writes of klv_node_type / klv_peer_type to avoid falsely writing "observer" on cache miss.
    • InitMetrics seeds core.MetricPeerType at startup (ObserverList placeholder) alongside core.MetricNodeType so klv_peer_type is not empty before the gate opens.
  • Concurrency and error handling
    • isReady is read/written under locks to ensure safe concurrent access by the heartbeat sender.
    • The gate prevents fallback-to-observer behavior that previously occurred when ComputeForPubKey ran on cache miss or error during bootstrap.

Validation

  • Unit tests: added coverage for IsCachePopulated transitions (construction and epoch-start) and for Sender.UpdateMetrics behavior (skips writes when not ready; writes correct values when ready). Tests passed for relevant packages.
  • Binary inspection: go tool nm/objdump verified IsCachePopulated symbol and that updateMetrics checks the gate before computing peer lists.
  • Local harness: Docker runs reproduced the pre-fix misreporting; the gate + startup seed preserved correct metrics through bootstrap and transitioned correctly when the cache populated.

Notes

  • A stale Docker build-cache produced a binary where updateMetrics didn't call the new gate; rebuilding without cache resolved it.


coderabbitai Bot commented May 29, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a93756bc-6881-4ea4-a2ab-1ea1b783d87e

📥 Commits

Reviewing files that changed from the base of the PR and between bd70a0e and c660f88.

📒 Files selected for processing (1)
  • cmd/node/metrics/metrics.go
📜 Recent review details
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go

📄 CodeRabbit inference engine (Custom checks)

**/*.go: Verify that any new or modified concurrent code (goroutines, channels, mutexes, sync primitives) is free of race conditions. Check for: proper lock/unlock pairing, no goroutine leaks, correct channel lifecycle management, and proper context cancellation propagation.
Verify that errors are not silently discarded. Check for: unchecked error returns, error wrapping with context, proper error propagation up the call chain, and no bare panic() calls outside of init() functions.

Files:

  • cmd/node/metrics/metrics.go
🧠 Learnings (3)
📚 Learning: 2026-04-07T14:36:46.394Z
Learnt from: RomuloSiebra
Repo: klever-io/klever-go PR: 35
File: network/p2p/libp2p/peerid_stability_test.go:100-116
Timestamp: 2026-04-07T14:36:46.394Z
Learning: In `network/p2p/libp2p/peerid_stability_test.go` (Go, klever-go repo), the empty-seed tests (`TestCreateP2PPrivKey_EmptySeed_NoError` and `TestCreateP2PPrivKey_EmptySeed_LegacySeed_NoError`) intentionally only assert `NoError/NotNil`. A previous `NotEqual` assertion across two `crypto/rand`-backed calls was deliberately removed because it is a probabilistic assertion that re-verifies OS/stdlib entropy rather than project logic. Do not suggest adding `NotEqual` comparisons for empty-seed / `crypto/rand` paths in this codebase.

Applied to files:

  • cmd/node/metrics/metrics.go
📚 Learning: 2026-04-21T20:12:22.959Z
Learnt from: phcarneirobc
Repo: klever-io/klever-go PR: 38
File: indexer/eventsProcessor.go:188-211
Timestamp: 2026-04-21T20:12:22.959Z
Learning: In Go structs that are JSON-marshaled, if a field is a `bool` and has the `json:"...,omitempty"` tag, then leaving that field at its zero value (`false`) is functionally equivalent (in the resulting JSON) to explicitly setting `Foundation: false`. Reviewers should not flag struct literals that omit such `bool` fields as an inconsistency; they will serialize identically because `omitempty` suppresses `false` values.

Applied to files:

  • cmd/node/metrics/metrics.go
📚 Learning: 2026-05-23T22:52:58.065Z
Learnt from: fbsobreira
Repo: klever-io/klever-go PR: 65
File: data/blockchain/blockchain.go:170-172
Timestamp: 2026-05-23T22:52:58.065Z
Learning: In Go, the pattern `append([]byte(nil), src...)` should be treated as preserving nil identity when `src` is a nil `[]byte`: spreading a nil slice contributes zero variadic arguments, so `append` performs no allocation and returns the original nil destination slice unchanged (i.e., result is nil, not an empty non-nil slice). Do not flag this as an incorrect empty-slice conversion; it intentionally maintains `nil`.

Applied to files:

  • cmd/node/metrics/metrics.go
🔇 Additional comments (1)
cmd/node/metrics/metrics.go (1)

47-57: LGTM!


Walkthrough

Adds a cache-populated readiness flag to PeerTypeProvider, exposes IsCachePopulated() via the heartbeat interface and mock, seeds MetricPeerType with ObserverList at startup, and prevents sender metric updates until the peer-type cache is populated. Tests cover provider readiness and sender behavior.

Changes

Peer-type cache readiness gating

| Layer | File(s) | Summary |
|---|---|---|
| Cache-readiness interface and mock | `node/heartbeat/interface.go`, `node/heartbeat/mock/peerTypeProviderStub.go` | Adds IsCachePopulated() bool to PeerTypeProviderHandler and a configurable IsCachePopulated() on the test stub (optional callback, defaults true). |
| PeerTypeProvider readiness state | `core/process/peer/peerTypeProvider.go` | Adds isReady field, IsCachePopulated() read-locked accessor, and sets isReady = true inside updateCache when a non-empty cache is built. |
| PeerTypeProvider readiness tests | `core/process/peer/peerTypeProvider_test.go` | Tests: false when coordinator empty at construction, true when coordinator has data at construction, transitions to true after epoch-start that populates cache, and remains false if epoch-start yields empty data. |
| Bootstrap peer-type metric initialization | `cmd/node/metrics/metrics.go` | InitMetrics seeds core.MetricPeerType with core.ObserverList as a temporary bootstrap value. |
| Sender metric guard and tests | `node/heartbeat/process/sender.go`, `node/heartbeat/process/export_test.go`, `node/heartbeat/process/sender_test.go` | Sender.updateMetrics early-exits when IsCachePopulated() is false. Adds Sender.UpdateMetrics wrapper for tests and tests verifying no metric writes when cache unpopulated and correct metric writes when populated. |
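
The "optional callback, defaults true" stub pattern from the summary above can be sketched as follows. This is an illustration only: the field name `IsCachePopulatedCalled` is an assumption and may not match the stub in the repository, and the reason given for the default in the comment is a guess.

```go
package main

import "fmt"

// peerTypeProviderStub is a hypothetical reduction of the test stub: the
// method delegates to an optional callback field.
type peerTypeProviderStub struct {
	IsCachePopulatedCalled func() bool
}

// IsCachePopulated calls the callback when one is set. Otherwise it reports
// true, which presumably lets tests that never configure the gate keep
// exercising the normal write path.
func (s *peerTypeProviderStub) IsCachePopulated() bool {
	if s.IsCachePopulatedCalled != nil {
		return s.IsCachePopulatedCalled()
	}
	return true
}

func main() {
	defaultStub := &peerTypeProviderStub{}
	closedStub := &peerTypeProviderStub{
		IsCachePopulatedCalled: func() bool { return false },
	}
	fmt.Println(defaultStub.IsCachePopulated(), closedStub.IsCachePopulated())
}
```

A test that wants the gate closed sets the callback to return false; every other test gets an open gate without extra setup.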

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

breaking-change

🚥 Pre-merge checks | ✅ 4 | ❌ 4

❌ Failed checks (1 warning, 3 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Concurrency Safety | ❓ Inconclusive | Repository clone failed, so this custom check could not run with code access. | Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures. |
| Error Handling | ❓ Inconclusive | Repository clone failed, so this custom check could not run with code access. | Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures. |
| State Consistency | ❓ Inconclusive | Repository clone failed, so this custom check could not run with code access. | Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title follows the required format [KLC-XXXX] type: description with a valid JIRA key and type prefix, and accurately describes the main change: gating metric writes on cache availability. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.



coderabbitai[bot]
coderabbitai Bot previously approved these changes May 29, 2026
Member

@fbsobreira fbsobreira left a comment


One comment on the readiness gate — needs verification: the metric suppression is correct on cold start, but on a mid-epoch reboot where the cache is already valid it may suppress correct data. Please validate whether the coordinator is populated at construction time. Observability-only, non-blocking. Concurrency on isReady and the test coverage look good.
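
The mid-epoch reboot question comes down to whether the cache build also runs at construction. The sketch below assumes it does, using the same routine as the epoch-start handler; the `coordinator` function type, `newProvider`, and the `"elected"` value are hypothetical names for the example and are not taken from the repository.

```go
package main

import (
	"fmt"
	"sync"
)

// coordinator is a hypothetical stand-in for the nodes coordinator: it
// returns the validator public keys known for an epoch.
type coordinator func(epoch uint32) []string

type provider struct {
	mutCache sync.RWMutex
	cache    map[string]string
	isReady  bool
	coord    coordinator
}

// newProvider builds the cache once at construction (an assumption of this
// sketch), so a node restarting with a populated coordinator opens the gate
// without waiting for the next epoch-start event.
func newProvider(coord coordinator, epoch uint32) *provider {
	p := &provider{coord: coord, cache: map[string]string{}}
	p.updateCache(epoch)
	return p
}

// updateCache rebuilds the cache and latches isReady on a non-empty result.
func (p *provider) updateCache(epoch uint32) {
	newCache := make(map[string]string)
	for _, key := range p.coord(epoch) {
		newCache[key] = "elected"
	}

	p.mutCache.Lock()
	p.cache = newCache
	if len(newCache) > 0 {
		p.isReady = true
	}
	p.mutCache.Unlock()
}

// IsCachePopulated reads the latch under the read lock.
func (p *provider) IsCachePopulated() bool {
	p.mutCache.RLock()
	defer p.mutCache.RUnlock()
	return p.isReady
}

func main() {
	empty := func(uint32) []string { return nil }
	warm := func(uint32) []string { return []string{"pk1"} }

	cold := newProvider(empty, 0)
	fmt.Println(cold.IsCachePopulated()) // cold start: gate closed

	fmt.Println(newProvider(warm, 7).IsCachePopulated()) // warm restart: gate open

	cold.coord = warm
	cold.updateCache(1) // simulated epoch-start refresh
	fmt.Println(cold.IsCachePopulated())
}
```

Under that assumption the gate only suppresses writes when the coordinator genuinely has no data yet, which is the cold-start case the PR targets.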

Comment thread core/process/peer/peerTypeProvider.go Outdated
coderabbitai[bot]
coderabbitai Bot previously approved these changes Jun 1, 2026
@nickgs1337 nickgs1337 requested a review from fbsobreira June 2, 2026 14:10
// is populated. The heartbeat sender's IsCachePopulated gate keeps this
// startup value in place until the first epoch-start event refreshes the
// cache, at which point the real peer-list classification takes over.
appStatusHandler.SetStringValue(core.MetricPeerType, string(core.ObserverList))
Member


One thing I want to check before this goes in: seeding peer_type=observer here is fine, but node_type is still seeded to validator up at line 47 (it's hardcoded to NodeTypeValidator in startup.go). Once we gate the sender, both seeds get frozen until the cache populates — so during the whole bootstrap window the node publishes node_type=validator + peer_type=observer, which contradict each other.

The part that worries me more than the cosmetics: on develop the sender corrected a bad node_type on the first heartbeat, so an observer node only showed validator for a second or two. With the gate, that wrong validator now sticks for the entire bootstrap window until the first cache refresh. So we fix "validator briefly shown as observer" but introduce "observer shown as validator for longer."

Is keeping node_type=validator stable the intent here, or should bootstrap be conservative? Two ways to square it:

  • seed both to observer (matches what the sender derives from ObserverList anyway — ObserverList → NodeTypeObserver), so observers are correct from t=0 and validators flip up once the cache loads; or
  • keep node_type as-is but drop a comment that peer_type is intentionally pessimistic and the two are expected to diverge until the cache is ready.

Either way it's a one-liner — just want to make sure we pick deliberately since this is the exact metric the PR is meant to make trustworthy.
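
To make the two seeding options concrete, here is a small sketch that checks whether a seeded pair agrees with the ObserverList → NodeTypeObserver mapping mentioned above. The plain strings and the rule that every non-observer peer type maps to `"validator"` are simplifying assumptions for the example, not the repository's constants.

```go
package main

import "fmt"

// nodeTypeFor mirrors the mapping mentioned above: the observer list maps to
// the observer node type. Treating everything else as a validator is an
// assumption of this sketch.
func nodeTypeFor(peerType string) string {
	if peerType == "observer" {
		return "observer"
	}
	return "validator"
}

// seed holds the two startup values that stay frozen while the gate is closed.
type seed struct {
	nodeType string
	peerType string
}

// consistent reports whether the seeded pair agrees with the mapping.
func (s seed) consistent() bool {
	return nodeTypeFor(s.peerType) == s.nodeType
}

func main() {
	current := seed{nodeType: "validator", peerType: "observer"}
	conservative := seed{nodeType: "observer", peerType: "observer"}
	fmt.Println(current.consistent(), conservative.consistent())
}
```

The first pair is the one that diverges during the bootstrap window; the second is internally consistent but shows validators as observers until the cache loads.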

Contributor Author


Good catch on picking deliberately. Seeding both to observer re-introduces the exact symptom this PR is for (validators briefly showing observer on cache miss), so going the other way: keep node_type=validator, accept that observers briefly show validator in the bootstrap window. Pushed a comment in metrics.go making the choice explicit.


ptp.mutCache.Lock()
ptp.cache = newCache
if len(newCache) > 0 {
Member


Not a blocker for this PR — more a note for whoever touches this next. The isReady latch only ever flips false→true, but updateCache still swaps in the new map unconditionally, and createNewCache swallows the coordinator errors (logs at Debug and carries on with a nil slice). So the cache can go populated→empty on a later epoch refresh while IsCachePopulated() keeps returning true, and in that window the sender reclassifies a real validator as observer again.

To be clear this isn't something you introduced — on develop the sender already ran unconditionally, so that empty-cache mislabel was always possible at epoch boundaries; the new gate just doesn't extend its protection there. So nothing to fix here for the PR's scope.

If we ever want to close it though, the cheap option is to not clobber a good cache with an empty one:

ptp.mutCache.Lock()
if len(newCache) > 0 {
    ptp.cache = newCache
    ptp.isReady = true
}
ptp.mutCache.Unlock()

which keeps the last-known-good classification through a transient empty refresh. Fine to leave as-is for now — just flagging so the latch's sticky semantics are on record.
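
The difference between the two variants can be shown with a simplified model. Locking is left out for brevity, and `state`, `swapAlways`, `swapIfPopulated`, and `classify` are hypothetical names for this sketch; `classify` only imitates the described fallback to observer on a cache miss.

```go
package main

import "fmt"

// state is a lock-free simplification of the provider's cache and latch.
type state struct {
	cache   map[string]string
	isReady bool
}

// swapAlways models the behaviour described above: the new map always
// replaces the old one, and the latch only ever moves from false to true.
func (s *state) swapAlways(newCache map[string]string) {
	s.cache = newCache
	if len(newCache) > 0 {
		s.isReady = true
	}
}

// swapIfPopulated models the suggested guard: an empty refresh is ignored,
// so the last-known-good cache survives.
func (s *state) swapIfPopulated(newCache map[string]string) {
	if len(newCache) > 0 {
		s.cache = newCache
		s.isReady = true
	}
}

// classify falls back to "observer" on a cache miss.
func (s *state) classify(pubKey string) string {
	if peerType, ok := s.cache[pubKey]; ok {
		return peerType
	}
	return "observer"
}

func main() {
	good := map[string]string{"pk1": "elected"}

	a := &state{}
	a.swapAlways(good)
	a.swapAlways(nil) // transient empty refresh clobbers the cache

	b := &state{}
	b.swapIfPopulated(good)
	b.swapIfPopulated(nil) // transient empty refresh is ignored

	fmt.Println(a.isReady, a.classify("pk1"))
	fmt.Println(b.isReady, b.classify("pk1"))
}
```

In the first case the gate stays open while the lookup falls back to observer; in the second the earlier classification is retained.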

Contributor Author


Yeah, sticky-true on a transient empty refresh is the gap. Your len(newCache) > 0 guard is the clean way to close it. Leaving for a follow-up per your call.
