[KLC-2388] Gate klv_node_type / klv_peer_type writes on cache miss#67
[KLC-2388] Gate klv_node_type / klv_peer_type writes on cache miss#67nickgs1337 wants to merge 6 commits into
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
📜 Recent review details🧰 Additional context used📓 Path-based instructions (1)**/*.go📄 CodeRabbit inference engine (Custom checks)
Files:
🧠 Learnings (3)📚 Learning: 2026-04-07T14:36:46.394ZApplied to files:
📚 Learning: 2026-04-21T20:12:22.959ZApplied to files:
📚 Learning: 2026-05-23T22:52:58.065ZApplied to files:
🔇 Additional comments (1)
WalkthroughAdds a cache-populated readiness flag to PeerTypeProvider, exposes IsCachePopulated() via the heartbeat interface and mock, seeds MetricPeerType with ObserverList at startup, and prevents sender metric updates until the peer-type cache is populated. Tests cover provider readiness and sender behavior. ChangesPeer-type cache readiness gating
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
🚥 Pre-merge checks | ✅ 4 | ❌ 4❌ Failed checks (1 warning, 3 inconclusive)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsGit: Failed to clone repository. Please run the Comment |
fbsobreira
left a comment
There was a problem hiding this comment.
One comment on the readiness gate — needs verification: the metric suppression is correct on cold start, but on a mid-epoch reboot where the cache is already valid it may suppress correct data. Please validate whether the coordinator is populated at construction time. Observability-only, non-blocking. Concurrency on isReady and the test coverage look good.
| // is populated. The heartbeat sender's IsCachePopulated gate keeps this | ||
| // startup value in place until the first epoch-start event refreshes the | ||
| // cache, at which point the real peer-list classification takes over. | ||
| appStatusHandler.SetStringValue(core.MetricPeerType, string(core.ObserverList)) |
There was a problem hiding this comment.
One thing I want to check before this goes in: seeding peer_type=observer here is fine, but node_type is still seeded to validator up at line 47 (it's hardcoded to NodeTypeValidator in startup.go). Once we gate the sender, both seeds get frozen until the cache populates — so during the whole bootstrap window the node publishes node_type=validator + peer_type=observer, which contradict each other.
The part that worries me more than the cosmetics: on develop the sender corrected a bad node_type on the first heartbeat, so an observer node only showed validator for a second or two. With the gate, that wrong validator now sticks for the entire bootstrap window until the first cache refresh. So we fix "validator briefly shown as observer" but introduce "observer shown as validator for longer."
Is keeping node_type=validator stable the intent here, or should bootstrap be conservative? Two ways to square it:
- seed both to
observer(matches what the sender derives fromObserverListanyway —ObserverList → NodeTypeObserver), so observers are correct from t=0 and validators flip up once the cache loads; or - keep
node_typeas-is but drop a comment thatpeer_typeis intentionally pessimistic and the two are expected to diverge until the cache is ready.
Either way it's a one-liner — just want to make sure we pick deliberately since this is the exact metric the PR is meant to make trustworthy.
There was a problem hiding this comment.
Good catch on picking deliberately. Seeding both to observer re-introduces the exact symptom this PR is for (validators briefly showing observer on cache miss), so going the other way: keep node_type=validator, accept that observers briefly show validator in the bootstrap window. Pushed a comment in metrics.go making the choice explicit.
|
|
||
| ptp.mutCache.Lock() | ||
| ptp.cache = newCache | ||
| if len(newCache) > 0 { |
There was a problem hiding this comment.
Not a blocker for this PR — more a note for whoever touches this next. The isReady latch only ever flips false→true, but updateCache still swaps in the new map unconditionally, and createNewCache swallows the coordinator errors (logs at Debug and carries on with a nil slice). So the cache can go populated→empty on a later epoch refresh while IsCachePopulated() keeps returning true, and in that window the sender reclassifies a real validator as observer again.
To be clear this isn't something you introduced — on develop the sender already ran unconditionally, so that empty-cache mislabel was always possible at epoch boundaries; the new gate just doesn't extend its protection there. So nothing to fix here for the PR's scope.
If we ever want to close it though, the cheap option is to not clobber a good cache with an empty one:
ptp.mutCache.Lock()
if len(newCache) > 0 {
ptp.cache = newCache
ptp.isReady = true
}
ptp.mutCache.Unlock()which keeps the last-known-good classification through a transient empty refresh. Fine to leave as-is for now — just flagging so the latch's sticky semantics are on record.
There was a problem hiding this comment.
Yeah, sticky-true on a transient empty refresh is the gap. Your len(newCache) > 0 guard is the clean way to close it. Leaving for a follow-up per your call.
On validator startup, the heartbeat sender's
updateMetricswas overwriting the correct startup-timeklv_node_type=validatorwithobserver, becausepeerTypeProvider.ComputeForPubKeyreturnsObserverListsilently on cache miss and the cache isn't populated until the first epoch-start event fires. Same root cause hitklv_peer_type. Operators and monitoring tooling saw the wrong node type for the entire bootstrap window — minutes to hours depending on epoch length.The fix adds an
IsCachePopulated()readiness gate onPeerTypeProviderthat flips true only when the epoch-start handler fires.Sender.updateMetricsearly-returns when the gate is closed, preserving the startup-init values. A fourth commit also seedsklv_peer_typeat startup alongsideklv_node_type, so consumers never observe an empty/absent field during the bootstrap window.What changed
4 atomic commits:
Add IsCachePopulated readiness gate to PeerTypeProvider— newisReadyfield flipped inside the existing epoch-start handler.Expose IsCachePopulated on PeerTypeProviderHandler interface— wired through the heartbeat-side interface + mock stub.Skip klv_node_type and klv_peer_type writes on cache miss— early-return inSender.updateMetrics+ sender tests via anUpdateMetricsexporter.Initialize klv_peer_type at startup alongside klv_node_type— one-line seed ofObserverListso the field is never empty pre-gate.Validation
Unit tests (4 new):
IsCachePopulatedfalse at construction, flips true after epoch-start event;updateMetricsskips writes when not ready, writes correct values when ready. All green viago test ./node/heartbeat/process/... ./core/process/peer/....Binary verification: extracted the validator binary from the docker image and confirmed via
go tool nmthatIsCachePopulated+markReadysymbols compile in, and viago tool objdump -s 'Sender.*updateMetrics'that the gate is actually called atsender.go:177beforecomputePeerListatsender.go:181.In-the-wild observation: during this sprint's KLC-1920 multi-validator harness work, captured the live bug on the fallback container reporting
klv_node_type=observerdespiteklv_redundancy_level=2— independent pre-fix proof.End-to-end Docker A/B/C/D runs on a 3-validator + 1-redundancy-2-fallback localnet (kleverchain-localnet harness, both images built from the same Klever HEAD):
Side-effect audit: grep'd every reader of
MetricNodeType/MetricPeerTypein the codebase. Only consumers are the statusHandler presenter (TUI dashboard) and the/node/status+ Prometheus exposers. No consensus, sync, fork-detection, heartbeat-message, or block-processing code branches on these values. Fix has no internal impact beyond observability.Notes
The KLC-2388 bug requires the
peerTypeProvidercache to be unpopulated when the heartbeat first ticks (a production peer-churn condition). On localnet the cache populates correctly at construction, so the A/B runs deliberately exercise the cache-not-yet-populated codepath. The gate is correct regardless of whether the underlying cache-miss is reproducible in any specific environment.One build-system gotcha worth flagging for reviewers: the first two patched docker builds produced a binary where the `IsCachePopulated` symbol existed but `updateMetrics` wasn't calling it — BuildKit cache mounts (`--mount=type=cache,target=/root/.cache/go-build`) persist across `docker build --no-cache` and reused a stale `.a` for `node/heartbeat/process`. Dropping the cache mounts temporarily for the rebuild defeated it; the upstream Dockerfile is unchanged in this PR.
This PR fixes a startup observability bug by preventing the heartbeat sender from emitting incorrect peer-type metrics during node bootstrap. It gates metric writes on a peer-type cache readiness flag and seeds a bootstrap peer-type value to avoid empty metrics before the cache is populated.
Impact assessment
Key changes
Validation
Notes