Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 10 additions & 0 deletions cmd/node/metrics/metrics.go
Original file line number Diff line number Diff line change
Expand Up @@ -44,7 +44,17 @@ func InitMetrics(
initZeroString := "0"

appStatusHandler.SetStringValue(core.MetricPublicKeyBlockSign, pubkeyStr)
// KLC-2388: node_type stays "validator" during the bootstrap window even
// though peer_type below is seeded to "observer". The two intentionally
// diverge until the peerTypeProvider cache populates: node_type is the
// PR's protected value (the original bug was validators flipping to
// observer on cache miss), so we cannot seed it observer without
// re-introducing exactly that symptom; peer_type is left pessimistic
// because it's a peer-list classification that we genuinely don't know
// yet. Once the gate opens the sender writes the real values from the
// fork detector / nodes coordinator.
appStatusHandler.SetStringValue(core.MetricNodeType, string(nodeType))
appStatusHandler.SetStringValue(core.MetricPeerType, string(core.ObserverList))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I want to check before this goes in: seeding peer_type=observer here is fine, but node_type is still seeded to validator up at line 47 (it's hardcoded to NodeTypeValidator in startup.go). Once we gate the sender, both seeds get frozen until the cache populates — so during the whole bootstrap window the node publishes node_type=validator + peer_type=observer, which contradict each other.

The part that worries me more than the cosmetics: on develop the sender corrected a bad node_type on the first heartbeat, so an observer node only showed validator for a second or two. With the gate, that wrong validator now sticks for the entire bootstrap window until the first cache refresh. So we fix "validator briefly shown as observer" but introduce "observer shown as validator for longer."

Is keeping node_type=validator stable the intent here, or should bootstrap be conservative? Two ways to square it:

  • seed both to observer (matches what the sender derives from ObserverList anyway — ObserverList → NodeTypeObserver), so observers are correct from t=0 and validators flip up once the cache loads; or
  • keep node_type as-is but drop a comment that peer_type is intentionally pessimistic and the two are expected to diverge until the cache is ready.

Either way it's a one-liner — just want to make sure we pick deliberately since this is the exact metric the PR is meant to make trustworthy.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch on picking deliberately. Seeding both to observer re-introduces the exact symptom this PR is for (validators briefly showing observer on cache miss), so going the other way: keep node_type=validator, accept that observers briefly show validator in the bootstrap window. Pushed a comment in metrics.go making the choice explicit.

appStatusHandler.SetUInt64Value(core.MetricSlotTime, slotInterval/millisecondsInSecond)
appStatusHandler.SetStringValue(core.MetricAppVersion, version)
appStatusHandler.SetUInt64Value(core.MetricSlotsPerEpoch, uint64(slotsPerEpoch))
Expand Down
14 changes: 14 additions & 0 deletions core/process/peer/peerTypeProvider.go
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ type PeerTypeProvider struct {
nodesCoordinator sharding.NodesCoordinator
cache map[string]*peerListAndShard
mutCache sync.RWMutex
isReady bool
}

// ArgPeerTypeProvider contains all parameters needed for creating a PeerTypeProvider
Expand Down Expand Up @@ -54,6 +55,16 @@ func NewPeerTypeProvider(arg ArgPeerTypeProvider) (*PeerTypeProvider, error) {
return ptp, nil
}

// IsCachePopulated returns true once at least one updateCache call has produced a
// non-empty cache. Until then, the NodesCoordinator has not yet exposed any
// validator keys for the active epoch, so callers should treat the peer-type result
// as unreliable and avoid overwriting startup-seeded values.
func (ptp *PeerTypeProvider) IsCachePopulated() bool {
ptp.mutCache.RLock()
defer ptp.mutCache.RUnlock()
return ptp.isReady
}

// ComputeForPubKey returns the peer type for a given public key and shard id
func (ptp *PeerTypeProvider) ComputeForPubKey(pubKey []byte) (core.PeerType, uint32, error) {
ptp.mutCache.RLock()
Expand Down Expand Up @@ -105,6 +116,9 @@ func (ptp *PeerTypeProvider) updateCache(epoch uint32) {

ptp.mutCache.Lock()
ptp.cache = newCache
if len(newCache) > 0 {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a blocker for this PR — more a note for whoever touches this next. The isReady latch only ever flips false→true, but updateCache still swaps in the new map unconditionally, and createNewCache swallows the coordinator errors (logs at Debug and carries on with a nil slice). So the cache can go populated→empty on a later epoch refresh while IsCachePopulated() keeps returning true, and in that window the sender reclassifies a real validator as observer again.

To be clear this isn't something you introduced — on develop the sender already ran unconditionally, so that empty-cache mislabel was always possible at epoch boundaries; the new gate just doesn't extend its protection there. So nothing to fix here for the PR's scope.

If we ever want to close it though, the cheap option is to not clobber a good cache with an empty one:

ptp.mutCache.Lock()
if len(newCache) > 0 {
    ptp.cache = newCache
    ptp.isReady = true
}
ptp.mutCache.Unlock()

which keeps the last-known-good classification through a transient empty refresh. Fine to leave as-is for now — just flagging so the latch's sticky semantics are on record.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, sticky-true on a transient empty refresh is the gap. Your len(newCache) > 0 guard is the clean way to close it. Leaving for a follow-up per your call.

ptp.isReady = true
}
ptp.mutCache.Unlock()
}

Expand Down
61 changes: 61 additions & 0 deletions core/process/peer/peerTypeProvider_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -205,6 +205,67 @@ func TestNewPeerTypeProvider_ComputeForKeyNotFoundInCacheReturnsObserver(t *test
assert.Nil(t, err)
}

func TestPeerTypeProvider_IsCachePopulated_FalseWhenCoordinatorReturnsEmpty(t *testing.T) {
arg := createDefaultArgPeerTypeProvider()

ptp, err := NewPeerTypeProvider(arg)
require.Nil(t, err)
require.NotNil(t, ptp)

assert.False(t, ptp.IsCachePopulated())
}

func TestPeerTypeProvider_IsCachePopulated_TrueAtConstructionWhenCachePopulated(t *testing.T) {
arg := createDefaultArgPeerTypeProvider()
arg.NodesCoordinator = &mock.NodesCoordinatorMock{
GetAllElectedValidatorsKeysCalled: func() ([][]byte, error) {
return [][]byte{[]byte("validator-pk")}, nil
},
}

ptp, err := NewPeerTypeProvider(arg)
require.Nil(t, err)
require.NotNil(t, ptp)

assert.True(t, ptp.IsCachePopulated())
}

func TestPeerTypeProvider_IsCachePopulated_TrueAfterEpochStartFillsCache(t *testing.T) {
arg := createDefaultArgPeerTypeProvider()
epochStartNotifier := &mock.EpochStartNotifierStub{}
arg.EpochStartEventNotifier = epochStartNotifier

coordinator := &mock.NodesCoordinatorMock{}
arg.NodesCoordinator = coordinator

ptp, err := NewPeerTypeProvider(arg)
require.Nil(t, err)
require.NotNil(t, ptp)
require.False(t, ptp.IsCachePopulated())

coordinator.GetAllElectedValidatorsKeysCalled = func() ([][]byte, error) {
return [][]byte{[]byte("validator-pk")}, nil
}
epochStartNotifier.NotifyAll(&block.Block{Header: &block.BlockHeader{Nonce: 1, Epoch: 1}})

assert.True(t, ptp.IsCachePopulated())
}

func TestPeerTypeProvider_IsCachePopulated_StaysFalseWhenEpochStartFiresButCacheEmpty(t *testing.T) {
arg := createDefaultArgPeerTypeProvider()
epochStartNotifier := &mock.EpochStartNotifierStub{}
arg.EpochStartEventNotifier = epochStartNotifier

ptp, err := NewPeerTypeProvider(arg)
require.Nil(t, err)
require.NotNil(t, ptp)
require.False(t, ptp.IsCachePopulated())

epochStartNotifier.NotifyAll(&block.Block{Header: &block.BlockHeader{Nonce: 1, Epoch: 1}})

assert.False(t, ptp.IsCachePopulated())
}

func TestNewPeerTypeProvider_IsInterfaceNil(t *testing.T) {
arg := createDefaultArgPeerTypeProvider()

Expand Down
1 change: 1 addition & 0 deletions node/heartbeat/interface.go
Original file line number Diff line number Diff line change
Expand Up @@ -80,6 +80,7 @@ type P2PAntifloodHandler interface {
type PeerTypeProviderHandler interface {
ComputeForPubKey(pubKey []byte) (core.PeerType, uint32, error)
GetAllPeerTypeInfos() []*state.PeerTypeInfo
IsCachePopulated() bool
IsInterfaceNil() bool
}

Expand Down
10 changes: 10 additions & 0 deletions node/heartbeat/mock/peerTypeProviderStub.go
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@ import (
// PeerTypeProviderStub -
type PeerTypeProviderStub struct {
ComputeForPubKeyCalled func(pubKey []byte) (core.PeerType, uint32, error)
IsCachePopulatedCalled func() bool
}

// ComputeForPubKey -
Expand All @@ -24,6 +25,15 @@ func (p *PeerTypeProviderStub) GetAllPeerTypeInfos() []*state.PeerTypeInfo {
return nil
}

// IsCachePopulated -
func (p *PeerTypeProviderStub) IsCachePopulated() bool {
if p.IsCachePopulatedCalled != nil {
return p.IsCachePopulatedCalled()
}

return true
}

// IsInterfaceNil -
func (p *PeerTypeProviderStub) IsInterfaceNil() bool {
return p == nil
Expand Down
5 changes: 5 additions & 0 deletions node/heartbeat/process/export_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -92,6 +92,11 @@ func VerifyLengths(hbmi *data.Heartbeat) error {
return verifyLengths(hbmi)
}

// UpdateMetrics -
func (s *Sender) UpdateMetrics(hb *data.Heartbeat) {
s.updateMetrics(hb)
}

// GetMaxSizeInBytes -
func GetMaxSizeInBytes() int {
return maxSizeInBytes
Expand Down
4 changes: 4 additions & 0 deletions node/heartbeat/process/sender.go
Original file line number Diff line number Diff line change
Expand Up @@ -174,6 +174,10 @@ func (s *Sender) getCurrentPrivateAndPublicKeys() (crypto.PrivateKey, crypto.Pub
}

func (s *Sender) updateMetrics(hb *heartbeatData.Heartbeat) {
if !s.peerTypeProvider.IsCachePopulated() {
return
}

result := s.computePeerList(hb.Pubkey)

nodeType := ""
Expand Down
60 changes: 60 additions & 0 deletions node/heartbeat/process/sender_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@ import (
"testing"

cMock "github.com/klever-io/klever-go/common/mock"
"github.com/klever-io/klever-go/core"
"github.com/klever-io/klever-go/crypto"
"github.com/klever-io/klever-go/node/heartbeat"
"github.com/klever-io/klever-go/node/heartbeat/data"
Expand Down Expand Up @@ -621,3 +622,62 @@ func TestSender_SendHeartbeatAfterTriggerWithRecorededPayloadShouldWork(t *testi
assert.True(t, genPubKeyCalled)
assert.True(t, marshalCalled)
}

func TestSender_UpdateMetricsSkipsWritesWhenCacheNotPopulated(t *testing.T) {
t.Parallel()

arg := createMockArgHeartbeatSender()
arg.PeerTypeProvider = &mock.PeerTypeProviderStub{
IsCachePopulatedCalled: func() bool {
return false
},
ComputeForPubKeyCalled: func(pubKey []byte) (core.PeerType, uint32, error) {
t.Fatalf("ComputeForPubKey must not be called when cache is not populated")
return "", 0, nil
},
}
setStringCalls := make(map[string]string)
arg.StatusHandler = &mock.AppStatusHandlerStub{
SetStringValueHandler: func(key string, value string) {
setStringCalls[key] = value
},
}

sender, err := process.NewSender(arg)
assert.Nil(t, err)

sender.UpdateMetrics(&data.Heartbeat{Pubkey: []byte("pk")})

_, gotNodeType := setStringCalls[core.MetricNodeType]
_, gotPeerType := setStringCalls[core.MetricPeerType]
assert.False(t, gotNodeType)
assert.False(t, gotPeerType)
}

func TestSender_UpdateMetricsWritesWhenCachePopulated(t *testing.T) {
t.Parallel()

arg := createMockArgHeartbeatSender()
arg.PeerTypeProvider = &mock.PeerTypeProviderStub{
IsCachePopulatedCalled: func() bool {
return true
},
ComputeForPubKeyCalled: func(pubKey []byte) (core.PeerType, uint32, error) {
return core.EligibleList, 0, nil
},
}
setStringCalls := make(map[string]string)
arg.StatusHandler = &mock.AppStatusHandlerStub{
SetStringValueHandler: func(key string, value string) {
setStringCalls[key] = value
},
}

sender, err := process.NewSender(arg)
assert.Nil(t, err)

sender.UpdateMetrics(&data.Heartbeat{Pubkey: []byte("pk")})

assert.Equal(t, string(core.NodeTypeValidator), setStringCalls[core.MetricNodeType])
assert.Equal(t, string(core.EligibleList), setStringCalls[core.MetricPeerType])
}
Loading