[KLC-2388] Gate klv_node_type / klv_peer_type writes on cache miss #67
base: develop
Changes from all commits
0ae8619
08c114f
0e2d007
287a9c3
bd70a0e
c660f88
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -23,6 +23,7 @@ type PeerTypeProvider struct { | |
| nodesCoordinator sharding.NodesCoordinator | ||
| cache map[string]*peerListAndShard | ||
| mutCache sync.RWMutex | ||
| isReady bool | ||
| } | ||
|
|
||
| // ArgPeerTypeProvider contains all parameters needed for creating a PeerTypeProvider | ||
|
|
@@ -54,6 +55,16 @@ func NewPeerTypeProvider(arg ArgPeerTypeProvider) (*PeerTypeProvider, error) { | |
| return ptp, nil | ||
| } | ||
|
|
||
| // IsCachePopulated returns true once at least one updateCache call has produced a | ||
| // non-empty cache. Until then, the NodesCoordinator has not yet exposed any | ||
| // validator keys for the active epoch, so callers should treat the peer-type result | ||
| // as unreliable and avoid overwriting startup-seeded values. | ||
| func (ptp *PeerTypeProvider) IsCachePopulated() bool { | ||
| ptp.mutCache.RLock() | ||
| defer ptp.mutCache.RUnlock() | ||
| return ptp.isReady | ||
| } | ||
|
|
||
| // ComputeForPubKey returns the peer type for a given public key and shard id | ||
| func (ptp *PeerTypeProvider) ComputeForPubKey(pubKey []byte) (core.PeerType, uint32, error) { | ||
| ptp.mutCache.RLock() | ||
|
|
@@ -105,6 +116,9 @@ func (ptp *PeerTypeProvider) updateCache(epoch uint32) { | |
|
|
||
| ptp.mutCache.Lock() | ||
| ptp.cache = newCache | ||
| if len(newCache) > 0 { | ||
|
Member
Not a blocker for this PR, more a note for whoever touches this next. The […] To be clear, this isn't something you introduced: on […] If we ever want to close it though, the cheap option is to not clobber a good cache with an empty one:

    ptp.mutCache.Lock()
    if len(newCache) > 0 {
        ptp.cache = newCache
        ptp.isReady = true
    }
    ptp.mutCache.Unlock()

which keeps the last-known-good classification through a transient empty refresh. Fine to leave as-is for now; just flagging so the latch's sticky semantics are on record.
Contributor
Author
Yeah, sticky-true on a transient empty refresh is the gap. Your […]
||
| ptp.isReady = true | ||
| } | ||
| ptp.mutCache.Unlock() | ||
| } | ||
|
|
||
|
|
||
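The keep-last-known-good variant suggested in the review thread on updateCache can be sketched in isolation as follows. This is a minimal illustration, not the PR's code: `provider`, `peerInfo`, and `lookup` are hypothetical stand-ins for `PeerTypeProvider`, `peerListAndShard`, and `ComputeForPubKey`, and the cache is passed in directly instead of being built from the NodesCoordinator.

```go
package main

import (
	"fmt"
	"sync"
)

// peerInfo is a hypothetical stand-in for the PR's peerListAndShard entry.
type peerInfo struct {
	peerType string
	shardID  uint32
}

// provider mirrors only the cache-related fields of PeerTypeProvider.
type provider struct {
	mutCache sync.RWMutex
	cache    map[string]peerInfo
	isReady  bool
}

// updateCache applies the reviewer's variant: an empty refresh is ignored,
// so the last non-empty cache stays in place. With this variant, isReady
// being true always means the cache currently holds at least one entry.
func (p *provider) updateCache(newCache map[string]peerInfo) {
	p.mutCache.Lock()
	defer p.mutCache.Unlock()
	if len(newCache) > 0 {
		p.cache = newCache
		p.isReady = true
	}
}

// IsCachePopulated reports whether a non-empty cache has ever been stored.
func (p *provider) IsCachePopulated() bool {
	p.mutCache.RLock()
	defer p.mutCache.RUnlock()
	return p.isReady
}

// lookup returns the cached entry for a public key, if any.
func (p *provider) lookup(pubKey string) (peerInfo, bool) {
	p.mutCache.RLock()
	defer p.mutCache.RUnlock()
	info, ok := p.cache[pubKey]
	return info, ok
}

func main() {
	p := &provider{}

	p.updateCache(nil) // nothing known yet
	fmt.Println(p.IsCachePopulated()) // false

	p.updateCache(map[string]peerInfo{"pk1": {peerType: "eligible", shardID: 0}})
	p.updateCache(map[string]peerInfo{}) // transient empty refresh is ignored

	info, ok := p.lookup("pk1")
	fmt.Println(p.IsCachePopulated(), ok, info.peerType) // true true eligible
}
```

In the version shown in the diff, the assignment `ptp.cache = newCache` happens unconditionally, so the same sequence would leave `isReady` true while the lookup for `pk1` misses. That combination is the sticky-latch gap the thread discusses.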
One thing I want to check before this goes in: seeding peer_type=observer here is fine, but node_type is still seeded to validator up at line 47 (it's hardcoded to NodeTypeValidator in startup.go). Once we gate the sender, both seeds get frozen until the cache populates, so during the whole bootstrap window the node publishes node_type=validator + peer_type=observer, which contradict each other.

The part that worries me more than the cosmetics: on develop the sender corrected a bad node_type on the first heartbeat, so an observer node only showed validator for a second or two. With the gate, that wrong validator now sticks for the entire bootstrap window until the first cache refresh. So we fix "validator briefly shown as observer" but introduce "observer shown as validator for longer."

Is keeping node_type=validator stable the intent here, or should bootstrap be conservative? Two ways to square it:

- seed both to observer (matches what the sender derives from ObserverList anyway: ObserverList → NodeTypeObserver), so observers are correct from t=0 and validators flip up once the cache loads; or
- keep node_type as-is but drop a comment that peer_type is intentionally pessimistic and the two are expected to diverge until the cache is ready.

Either way it's a one-liner. I just want to make sure we pick deliberately, since this is the exact metric the PR is meant to make trustworthy.
Good catch on picking deliberately. Seeding both to observer re-introduces the exact symptom this PR is for (validators briefly showing observer on cache miss), so going the other way: keep node_type=validator, accept that observers briefly show validator in the bootstrap window. Pushed a comment in metrics.go making the choice explicit.
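For readers following the trade-off, here is a rough sketch of the gated write as the thread describes it. It is an illustration under assumptions, not the code in this PR: the `metrics` map, `seedStartup`, `nodeTypeFor`, and `publishTypes` are hypothetical names, the string labels stand in for the real `core.PeerType` and node-type values, and the rule that every non-observer peer type maps to validator is assumed (the thread only states ObserverList → NodeTypeObserver).

```go
package main

import "fmt"

const (
	metricNodeType = "klv_node_type"
	metricPeerType = "klv_peer_type"
)

// metrics is a hypothetical stand-in for the node's status metrics store.
type metrics map[string]string

// seedStartup writes the bootstrap values settled on in the thread:
// node_type=validator and peer_type=observer.
func seedStartup(m metrics) {
	m[metricNodeType] = "validator"
	m[metricPeerType] = "observer"
}

// nodeTypeFor derives the node type from the peer type. Only the observer
// mapping comes from the thread; treating everything else as validator is
// an assumption made for this sketch.
func nodeTypeFor(peerType string) string {
	if peerType == "observer" {
		return "observer"
	}
	return "validator"
}

// publishTypes overwrites both metrics only once the peer-type cache is
// populated. It returns false when the write was skipped, in which case the
// startup seeds stay untouched.
func publishTypes(m metrics, cachePopulated bool, peerType string) bool {
	if !cachePopulated {
		return false
	}
	m[metricPeerType] = peerType
	m[metricNodeType] = nodeTypeFor(peerType)
	return true
}

func main() {
	m := metrics{}
	seedStartup(m)

	// Bootstrap window: the cache is empty, so a cache miss reported as
	// "observer" must not overwrite the seeds.
	publishTypes(m, false, "observer")
	fmt.Println(m[metricNodeType], m[metricPeerType]) // validator observer

	// First populated refresh: an actual observer is now corrected.
	publishTypes(m, true, "observer")
	fmt.Println(m[metricNodeType], m[metricPeerType]) // observer observer
}
```

The first printed line shows the accepted cost: during bootstrap an observer node reports node_type=validator alongside peer_type=observer until the first populated refresh.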