fix(broker): sync host/port changes to chain #1337

Open
redstartechno wants to merge 1 commit into gonka-ai:main from redstartechno:fix/447-hardware-node-host-port-diff

What problem does this solve?

When a hardware node's host or port changes — typically because the operator migrated the inference server to a new machine — the API node never updates the on-chain registration, which keeps pointing at the old endpoint. Addresses the diff-detection half of #447.

How do you know this is a real problem?

convertInferenceNodeToHardwareNode populates Host and Port and the chain validates and stores both (message_submit_hardware_diff.go, msg_server_submit_hardware_diff.go), but areHardwareNodesEqual — the function calculateNodesDiff uses to decide whether a node needs resubmission — compared only LocalId, Status, Hardware, Models and Version. A node whose host or port changed (with the same GPU and models, the classic migration) was reported as unchanged, so MsgSubmitHardwareDiff was never sent.

This also breaks the supported runtime path: updating a node via PUT /admin/v1/nodes/:id changes local broker state, but the 60-second sync loop (nodeSyncWorker / syncNodes) still considers it equal to the stale chain record and submits nothing.

Issue #447 reports the operator-facing symptom. The reporter's full scenario also involves the node_config.json merge-once behavior (a separate design question, see scope note under Risks); the host/port comparison gap is independently reproducible from code — demonstrated below.

How does this solve the problem?

Adds Host and Port to the field-by-field comparison in areHardwareNodesEqual, so a host/port change marks the node as NewOrModified and the sync loop submits the corrective diff. All HardwareNode proto fields are now compared.
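
For readers without the repository at hand, the comparison can be sketched as follows. This is a minimal illustration, not the code in broker.go: the HardwareNode and Hardware types are simplified, hypothetical mirrors of the generated proto types (the real field types may differ), and nodesEqual, sameHardware and sameStrings are stand-in names. The slice comparisons are order-sensitive, which is an assumption of the sketch.

```go
package main

import "fmt"

// Hardware and HardwareNode are simplified, hypothetical mirrors of the
// generated proto types; the real field types may differ.
type Hardware struct {
	Type  string
	Count uint32
}

type HardwareNode struct {
	LocalId  string
	Status   string
	Hardware []Hardware
	Models   []string
	Version  string
	Host     string
	Port     string
}

// sameHardware reports whether two hardware lists match element by element.
func sameHardware(a, b []Hardware) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// sameStrings reports whether two string lists match element by element.
func sameStrings(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// nodesEqual compares every field, including Host and Port, so an
// endpoint change alone is enough to make two nodes unequal.
func nodesEqual(a, b HardwareNode) bool {
	return a.LocalId == b.LocalId &&
		a.Status == b.Status &&
		a.Version == b.Version &&
		a.Host == b.Host &&
		a.Port == b.Port &&
		sameHardware(a.Hardware, b.Hardware) &&
		sameStrings(a.Models, b.Models)
}

func main() {
	// Illustrative values only; the addresses match the harness output in this PR.
	chain := HardwareNode{
		LocalId:  "node-1",
		Status:   "INFERENCE",
		Hardware: []Hardware{{Type: "GPU", Count: 8}},
		Models:   []string{"model-a"},
		Version:  "v1",
		Host:     "198.51.100.7",
		Port:     "8080",
	}
	local := chain
	local.Host, local.Port = "203.0.113.20", "8081"

	fmt.Println(nodesEqual(chain, chain)) // true: identical nodes stay equal
	fmt.Println(nodesEqual(local, chain)) // false: the migration is detected
}
```

Dropping the two Host/Port lines from nodesEqual reproduces the old behavior, where the migrated node compares equal to the stale chain record.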

What risks does this introduce? How can we mitigate them?

  • Resubmission churn considered and ruled out. SubmitHardwareDiff is the only writer of HardwareNodes and stores submitted nodes verbatim; Port is produced canonically via strconv.Itoa(PoCPort). After one successful submit, local equals chain and the sync loop goes quiet. The new TestAreHardwareNodesEqual_HostPort includes an identical-nodes case to guard this.
  • One-time effect after upgrade: nodes whose on-chain host/port is stale will submit a single corrective diff on the next sync tick, then converge. That is the intended healing, not churn.
  • Scope: this does not change the node_config.json merge-once behavior described in #447 ("Node Registration Does Not Update After Migration (API stuck using old on-chain config)"): config merges into the local DB once, gated by kvKeyNodeConfigMerged, and runtime node management goes through the admin API. Whether that design should change is left to maintainers.
  • Known limitation (pre-existing): InferencePort is not part of the HardwareNode proto, so inference-port-only changes remain invisible to the chain registry. Out of scope here.
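
The convergence argument in the first two bullets can be illustrated with a small simulation. This is a sketch under stated assumptions, not the broker's sync loop: Endpoint, LocalNode, toChain and syncTick are hypothetical names, the chain is modeled as a map that stores submitted values verbatim, and only host and port are tracked. The one detail taken from the PR is that the port is rendered with strconv.Itoa, so the same integer always yields the same string.

```go
package main

import (
	"fmt"
	"strconv"
)

// Endpoint is a hypothetical, trimmed-down chain record holding only
// the two fields this PR adds to the comparison.
type Endpoint struct {
	Host string
	Port string
}

// LocalNode is a hypothetical view of broker state, where the port is an int.
type LocalNode struct {
	Host    string
	PoCPort int
}

// toChain renders the port canonically: the same int always produces
// the same string, so repeated conversions cannot drift.
func toChain(n LocalNode) Endpoint {
	return Endpoint{Host: n.Host, Port: strconv.Itoa(n.PoCPort)}
}

// syncTick submits a record only when the local and chain views differ
// and returns how many records it submitted. The chain map stands in
// for on-chain storage that keeps submitted values verbatim.
func syncTick(local map[string]LocalNode, chain map[string]Endpoint) int {
	submitted := 0
	for id, n := range local {
		want := toChain(n)
		if got, ok := chain[id]; !ok || got != want {
			chain[id] = want
			submitted++
		}
	}
	return submitted
}

func main() {
	local := map[string]LocalNode{"node-1": {Host: "203.0.113.20", PoCPort: 8081}}
	chain := map[string]Endpoint{"node-1": {Host: "198.51.100.7", Port: "8080"}}

	fmt.Println(syncTick(local, chain)) // 1: single corrective diff after upgrade
	fmt.Println(syncTick(local, chain)) // 0: local equals chain, loop goes quiet
}
```

The second tick submits nothing because the stored value is exactly what toChain produces, which is the property the identical-nodes test case guards.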

How do you know this PR fixes the problem?

The new unit test TestAreHardwareNodesEqual_HostPort covers the equal pair, a host-only change and a port-only change; it fails against the old implementation and passes against the new one (CI runs the package tests — see local limitation below). Additionally, an extracted-function harness reproduces the exact #447 migration scenario against both versions of the function (evidence below).

Which components are affected?

  • decentralized-api/broker/broker.go: areHardwareNodesEqual
  • decentralized-api/broker/broker_test.go: new test

No chain-side changes.

Testing & evidence

Local limitation, stated openly: the broker package transitively requires cgo (supranational/blst) and this dev machine has no C compiler, so go test ./broker/ cannot run here — CI covers the real package tests. To still verify behavior locally, the old and new versions of areHardwareNodesEqual (plus hardwareEquals) were extracted verbatim into a standalone module with the HardwareNode fields mirrored 1:1 from hardware_node.pb.go, and exercised with the #447 migration scenario: same LocalId/Status/Hardware/Models/Version, new host and port.

Before

Old function — the harness test is written to fail loudly when the bug fires:

=== RUN   TestOldEquality_HostPortChangeNotDetected
    equal_test.go:32: BUG (#447): old equality reports local 203.0.113.20:8081 == chain 198.51.100.7:8080 — diff never submitted
--- FAIL: TestOldEquality_HostPortChangeNotDetected (0.00s)

After

New function — detects the migration; identical nodes still compare equal (no diff churn):

=== RUN   TestNewEquality_HostPortChangeDetected
--- PASS: TestNewEquality_HostPortChangeDetected (0.00s)
=== RUN   TestNewEquality_IdenticalNodesStillEqual
--- PASS: TestNewEquality_IdenticalNodesStillEqual (0.00s)

areHardwareNodesEqual compared LocalId, Status, Hardware, Models and
Version but not Host or Port, even though the sync loop submits both
and the chain validates and stores them. A node whose host or port
changed (the typical server migration) was reported as unchanged, so
MsgSubmitHardwareDiff was never sent and the on-chain registration
kept pointing at the old endpoint - even when the operator updated
the node through the admin API.

Addresses the diff-detection half of gonka-ai#447; the node_config.json
merge-once behavior described there is a separate design question.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>