fix(broker): sync host/port changes to chain#1337
Open
redstartechno wants to merge 1 commit into
Open
Conversation
areHardwareNodesEqual compared LocalId, Status, Hardware, Models and Version but not Host or Port, even though the sync loop submits both and the chain validates and stores them. A node whose host or port changed (the typical server migration) was reported as unchanged, so MsgSubmitHardwareDiff was never sent and the on-chain registration kept pointing at the old endpoint - even when the operator updated the node through the admin API. Addresses the diff-detection half of gonka-ai#447; the node_config.json merge-once behavior described there is a separate design question. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What problem does this solve?
When a hardware node's host or port changes — typically because the operator migrated the inference server to a new machine — the API node never updates the on-chain registration, which keeps pointing at the old endpoint. Addresses the diff-detection half of #447.
How do you know this is a real problem?
convertInferenceNodeToHardwareNodepopulatesHostandPortand the chain validates and stores both (message_submit_hardware_diff.go,msg_server_submit_hardware_diff.go), butareHardwareNodesEqual— the functioncalculateNodesDiffuses to decide whether a node needs resubmission — compared onlyLocalId,Status,Hardware,ModelsandVersion. A node whose host or port changed (with the same GPU and models, the classic migration) was reported as unchanged, soMsgSubmitHardwareDiffwas never sent.This also breaks the supported runtime path: updating a node via
PUT /admin/v1/nodes/:idchanges local broker state, but the 60-second sync loop (nodeSyncWorker→syncNodes) still considers it equal to the stale chain record and submits nothing.Issue #447 reports the operator-facing symptom. The reporter's full scenario also involves the
node_config.jsonmerge-once behavior (a separate design question, see scope note under Risks); the host/port comparison gap is independently reproducible from code — demonstrated below.How does this solve the problem?
Adds
HostandPortto the field-by-field comparison inareHardwareNodesEqual, so a host/port change marks the node asNewOrModifiedand the sync loop submits the corrective diff. AllHardwareNodeproto fields are now compared.What risks does this introduce? How can we mitigate them?
SubmitHardwareDiffis the only writer ofHardwareNodesand stores submitted nodes verbatim;Portis produced canonically viastrconv.Itoa(PoCPort). After one successful submit, local equals chain and the sync loop goes quiet. The newTestAreHardwareNodesEqual_HostPortincludes an identical-nodes case to guard this.node_config.jsonmerge-once behavior described in Node Registration Does Not Update After Migration (API stuck using old on-chain config) #447 (config merges into the local DB once, gated bykvKeyNodeConfigMerged; runtime node management is the admin API). Whether that design should change is left to maintainers.InferencePortis not part of theHardwareNodeproto, so inference-port-only changes remain invisible to the chain registry. Out of scope here.How do you know this PR fixes the problem?
The new unit test
TestAreHardwareNodesEqual_HostPortcovers the equal pair, a host-only change and a port-only change; it fails against the old implementation and passes against the new one (CI runs the package tests — see local limitation below). Additionally, an extracted-function harness reproduces the exact #447 migration scenario against both versions of the function (evidence below).Which components are affected?
decentralized-api/broker/broker.go—areHardwareNodesEqualdecentralized-api/broker/broker_test.go— new testNo chain-side changes.
Testing & evidence
Local limitation, stated openly: the
brokerpackage transitively requires cgo (supranational/blst) and this dev machine has no C compiler, sogo test ./broker/cannot run here — CI covers the real package tests. To still verify behavior locally, the old and new versions ofareHardwareNodesEqual(plushardwareEquals) were extracted verbatim into a standalone module with theHardwareNodefields mirrored 1:1 fromhardware_node.pb.go, and exercised with the #447 migration scenario: sameLocalId/Status/Hardware/Models/Version, new host and port.Before
Old function — the harness test is written to fail loudly when the bug fires:
After
New function — detects the migration; identical nodes still compare equal (no diff churn):