Node Registration Does Not Update After Migration (API stuck using old on-chain config) #447

@Asplana92

🐞 [BUG] Node Registration Cannot Update After Migration (stuck on old on-chain state; diff returns no changes)
Summary

After migrating the ML inference server to new infrastructure and switching from an unsupported model (Llama) to a governance-approved one (Qwen2.5-7B-Instruct), the node becomes permanently stuck using outdated on-chain registration data.

Even though node_config.json contains the correct configuration and the API reads it correctly on startup, the API never submits an update transaction: the diff logic always reports “No diff to submit”, even when the on-chain state is completely different.

This leaves the operator unable to recover or participate in inference, even with fully correct infrastructure.

✅ Expected Behavior

When node_config.json changes (host, port, hardware, models),
→ API should detect differences.
→ API should submit an update transaction.
→ On-chain registration should be updated.

❌ Actual Behavior

API loads correct local config

On-chain data is old and completely mismatched

But diff logic reports:
[sync nodes] Hardware diff: NewOrModified:[] Removed:[]
[sync nodes] No diff to submit
API then continues to use the old host + port, even though they are no longer valid:
ERROR: queryNodeStatus → dial tcp OLD_IP:8081 → connection refused
Node is stuck in FAILED state across epochs.

📌 Timeline & Reproduction Story

This issue appeared during a normal model migration, following official guidance.

(1) Initial setup (worked, but received 0 assignments)

Model: Meta-Llama-3.1-8B-Instruct

Framework: llama.cpp

Host: OLD_IP:8081

Hardware: RTX 4090

Registered successfully

Received 0 inference tasks for multiple epochs, so the decision was made to switch to a governance-approved model.

(2) Official guidance from Discord

Gonka team advised:

Llama is not governance-supported

Required: models from https://gonka.hyperfusion.io/v1/models

Recommended: use vLLM

Also required: additional disk space

A migration to new hardware was therefore performed.

(3) Infrastructure migration
A new server with enough disk space was deployed:
Inference server:

  • GPU: RTX 4090
  • Disk: ~90GB
  • Model: Qwen/Qwen2.5-7B-Instruct
  • Framework: vLLM
  • vLLM running on port 8081

Validation:
curl http://localhost:8081/v1/models
→ returns Qwen2.5-7B-Instruct (OK)
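
The same check can be scripted so it runs before every API restart. This is a minimal sketch, assuming the endpoint returns the OpenAI-style list body (`{"object": "list", "data": [{"id": ...}]}`) that vLLM's OpenAI-compatible server exposes. The helper names `served_model_ids` and `check_model_served` are hypothetical, and the exact model ID string depends on how vLLM was launched.

```python
import json
from urllib.request import urlopen


def served_model_ids(models_response: dict) -> list:
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in models_response.get("data", [])]


def check_model_served(base_url: str, expected_model: str, timeout: float = 5.0) -> bool:
    """Query <base_url>/v1/models and report whether expected_model is listed.

    base_url is an assumption for illustration, e.g. "http://localhost:8081".
    """
    with urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        body = json.load(resp)
    return expected_model in served_model_ids(body)


# Offline illustration with a hand-written payload (not captured from a real server).
sample_body = {"object": "list", "data": [{"id": "Qwen/Qwen2.5-7B-Instruct", "object": "model"}]}
print(served_model_ids(sample_body))
```

Calling `check_model_served("http://localhost:8081", "Qwen/Qwen2.5-7B-Instruct")` on the inference host would perform the live request; it raises an exception if the port is unreachable.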

(4) node_config.json updated
Correct configuration:
[{
"id": "node1",
"host": "NEW_IP",
"inference_port": 8081,
"poc_port": 8081,
"models": { "Qwen/Qwen2.5-7B-Instruct": { "args": [] }},
"hardware": [{ "type": "NVIDIA RTX 4090", "count": 1 }],
"max_concurrent": 500
}]
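
To rule out a malformed local file before blaming the diff logic, the config can be sanity-checked with a short script. This is only a sketch: the list of required keys is taken from the example above, and the checks themselves (key presence, port range, non-empty models) are assumptions, not the API's actual validation rules. `find_config_problems` is a hypothetical helper.

```python
import json

# Keys taken from the node_config.json example in this report.
REQUIRED_KEYS = ("id", "host", "inference_port", "poc_port", "models", "hardware", "max_concurrent")


def find_config_problems(nodes: list) -> list:
    """Return human-readable problems found in a parsed node_config.json list."""
    problems = []
    for index, node in enumerate(nodes):
        label = node.get("id", f"node[{index}]")
        for key in REQUIRED_KEYS:
            if key not in node:
                problems.append(f"{label}: missing key '{key}'")
        for port_key in ("inference_port", "poc_port"):
            port = node.get(port_key)
            if port is not None and not (isinstance(port, int) and 1 <= port <= 65535):
                problems.append(f"{label}: {port_key} is not a valid TCP port: {port!r}")
        if "models" in node and not node["models"]:
            problems.append(f"{label}: no models configured")
    return problems


EXAMPLE_CONFIG = json.loads("""
[{
  "id": "node1",
  "host": "NEW_IP",
  "inference_port": 8081,
  "poc_port": 8081,
  "models": { "Qwen/Qwen2.5-7B-Instruct": { "args": [] }},
  "hardware": [{ "type": "NVIDIA RTX 4090", "count": 1 }],
  "max_concurrent": 500
}]
""")

print(find_config_problems(EXAMPLE_CONFIG))
```

An empty list means the file passes these basic checks, which is consistent with the API reading it correctly in step (5).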

(5) API reads config correctly
Startup logs:
INFO Registered node:
Host: NEW_IP
InferencePort: 8081
Models: Qwen/Qwen2.5-7B-Instruct
Hardware: RTX 4090

(6) BUT the API immediately switches back to old on-chain data
ERROR queryNodeStatus
dial tcp OLD_IP:8081 → connection refused

(7) On-chain query shows old/corrupted data
host: "inference" (incorrect)
port: "8080" (incorrect)
models: Qwen3-235B-* (incorrect)
hardware: H200 140GB (incorrect)
No fields match local config.

(8) Diff logic incorrectly concludes “no changes”
[sync nodes] Local nodes: 1
[sync nodes] Chain nodes: 1
[sync nodes] Hardware diff: NewOrModified:[] Removed:[]
[sync nodes] No diff to submit

As a result:

No update transaction is sent

Node cannot join inference

Node stays FAILED each epoch

Operator is permanently stuck
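
For comparison, a plain field-by-field check of the values from steps (4) and (7) flags every field. The sketch below is not the API's sync code: the flattened field names, the pairing of the on-chain `port` with the local `inference_port`, and the conversion of the port to an integer are all assumptions made for illustration.

```python
MISSING = "<missing>"  # sentinel so an absent field never compares equal to a real value

# Local values from node_config.json in step (4), flattened for comparison.
LOCAL = {
    "host": "NEW_IP",
    "port": 8081,
    "models": ["Qwen/Qwen2.5-7B-Instruct"],
    "hardware": ["NVIDIA RTX 4090"],
}

# On-chain values as reported in step (7); wildcard and labels kept verbatim.
CHAIN = {
    "host": "inference",
    "port": 8080,
    "models": ["Qwen3-235B-*"],
    "hardware": ["H200 140GB"],
}


def changed_fields(local: dict, chain: dict) -> dict:
    """Map each differing field to a (chain_value, local_value) pair."""
    changes = {}
    for key in sorted(set(local) | set(chain)):
        chain_value = chain.get(key, MISSING)
        local_value = local.get(key, MISSING)
        if chain_value != local_value:
            changes[key] = (chain_value, local_value)
    return changes


for field, (old, new) in changed_fields(LOCAL, CHAIN).items():
    print(f"{field}: chain={old!r} local={new!r}")
```

With these inputs all four fields are reported as changed, which is what makes the empty `NewOrModified:[] Removed:[]` result in the log surprising.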

🔍 Root Cause (Hypothesis)

The problem seems to be triggered by a combination of:

  1. Old registration containing invalid values

(e.g. models not in governance list, incorrect hardware type, host="inference")

  2. API’s diff logic

Fails to detect differences between:

local node_config.json

on-chain corrupted registration

This is likely because the comparison:

ignores certain fields

or normalizes the structures so differently that mismatches become “equal”

or treats missing fields as defaults that match

  3. Stale on-chain registration overrides local state

Even after deleting:
rm -rf ~/.dapi
and restarting the API, the old chain state overwrites config-dump.json.

Result

The operator has no way to force re-registration, even with valid config + valid model.
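
To make the “ignores certain fields” hypothesis concrete, here is a purely illustrative failure mode. It is not taken from the API source. `lossy_fingerprint` is a hypothetical normalization that keeps only the node ID and the total device count, and the on-chain `id` and `count` values below are assumed, since the report does not show them.

```python
import json


def lossy_fingerprint(node: dict) -> tuple:
    """Hypothetical lossy normalization: keeps only node ID and total device count."""
    total_devices = sum(item.get("count", 0) for item in node.get("hardware", []))
    return (node["id"], total_devices)


def full_fingerprint(node: dict) -> str:
    """Canonical JSON of the whole record, so every field takes part in the comparison."""
    return json.dumps(node, sort_keys=True)


local_node = {
    "id": "node1",
    "host": "NEW_IP",
    "inference_port": 8081,
    "models": ["Qwen/Qwen2.5-7B-Instruct"],
    "hardware": [{"type": "NVIDIA RTX 4090", "count": 1}],
}

chain_node = {
    "id": "node1",  # assumed: same node ID as the local config
    "host": "inference",
    "inference_port": 8080,
    "models": ["Qwen3-235B-*"],
    "hardware": [{"type": "H200 140GB", "count": 1}],  # count assumed
}

print("lossy comparison equal:", lossy_fingerprint(local_node) == lossy_fingerprint(chain_node))
print("full comparison equal: ", full_fingerprint(local_node) == full_fingerprint(chain_node))
```

Under these assumptions the lossy comparison treats the two records as identical while the full comparison does not, which would produce exactly the “No diff to submit” symptom. Whether the real diff behaves this way can only be confirmed against the API code.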

🧪 Steps to Reproduce (Generalized)

  1. Register a node with a model that was once accepted but is now invalid (e.g. before governance model validation was strict)
  2. Later switch to a different model + different infra:
       • new host
       • new port
       • new model
       • new hardware specs
  3. Update local config
  4. Restart API
  5. API loads correct local config
  6. API loads outdated on-chain data
  7. API diff logic sees no differences
  8. API never submits update transaction
  9. Node remains stuck in FAILED

🎯 Impact

Node cannot join inference for many epochs

Operator cannot fix the issue manually

Registry becomes permanently stuck in invalid state

Requires team intervention to force removal or re-registration

📝 What would help resolve the issue

Please advise:

  1. Is there a way to force a fresh registration?

(e.g. delete existing hardware-node entry on chain)

  2. Should operators create a new node ID instead of updating existing nodes?
  3. Is this a known issue with the current diff logic?

The behavior strongly suggests it.

  4. Should the compare logic be updated to catch mismatched:

host

ports

hardware type

model list

number of models

missing fields

defaults vs explicit values

These should always trigger a diff.

🙏 Thank you

This report is intended to help improve the reliability of node onboarding and recovery, especially during model migrations.

If additional logs are needed, I can provide:

full API startup logs

vLLM logs

config-dump.json snapshots

raw output of inferenced query inference hardware-nodes-all
