🐞 [BUG] Node Registration Cannot Update After Migration (stuck on old on-chain state; diff returns no changes)
Summary
After migrating the ML inference server to new infrastructure and switching from an unsupported model (Llama) to a governance-approved one (Qwen2.5-7B-Instruct), the node becomes permanently stuck using outdated on-chain registration data.
Even though node_config.json contains the correct configuration and the API reads it correctly on startup, the API never submits an update transaction: the diff logic always reports “No diff to submit”, even when the on-chain state is completely different.
This leaves the operator unable to recover or participate in inference, even with fully correct infrastructure.
✅ Expected Behavior
When node_config.json changes (host, port, hardware, models),
→ API should detect differences.
→ API should submit an update transaction.
→ On-chain registration should be updated.
❌ Actual Behavior
API loads correct local config
On-chain data is old and completely mismatched
But diff logic reports:
[sync nodes] Hardware diff: NewOrModified:[] Removed:[]
[sync nodes] No diff to submit
API then continues to use the old host + port, even though they are no longer valid:
ERROR: queryNodeStatus → dial tcp OLD_IP:8081 → connection refused
Node is stuck in FAILED state across epochs.
📌 Timeline & Reproduction Story
This issue appeared during a normal model migration, following official guidance.
(1) Initial setup (worked, but received 0 assignments)
Model: Meta-Llama-3.1-8B-Instruct
Framework: llama.cpp
Host: OLD_IP:8081
Hardware: RTX 4090
Registered successfully
Received 0 inference tasks for multiple epochs, so the decision was made to switch to a governance-approved model.
(2) Official guidance from Discord
Gonka team advised:
Llama is not governance-supported
Required: models from https://gonka.hyperfusion.io/v1/models
Recommended: use vLLM
Also required: additional disk space
So migration to new hardware was performed.
(3) Infrastructure migration
A new server with enough disk space was deployed:
Inference server:
- GPU: RTX 4090
- Disk: ~90GB
- Model: Qwen/Qwen2.5-7B-Instruct
- Framework: vLLM
- vLLM running on port 8081
Validation:
curl http://localhost:8081/v1/models
→ returns Qwen2.5-7B-Instruct (OK)
(4) node_config.json updated
Correct configuration:
[{
"id": "node1",
"host": "NEW_IP",
"inference_port": 8081,
"poc_port": 8081,
"models": { "Qwen/Qwen2.5-7B-Instruct": { "args": [] }},
"hardware": [{ "type": "NVIDIA RTX 4090", "count": 1 }],
"max_concurrent": 500
}]
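Before restarting the API, the local file can be sanity-checked against what the inference server actually serves. The following is a minimal sketch and not part of the Gonka tooling: the helper name `check_node_config` and the `REQUIRED_FIELDS` tuple are hypothetical, the field names are taken from the config above, and the sample response assumes the OpenAI-compatible list shape (`{"data": [{"id": ...}]}`) that vLLM returns from /v1/models.

```python
import json

# Fields this sketch treats as mandatory (an assumption based on the config above).
REQUIRED_FIELDS = ("id", "host", "inference_port", "poc_port", "models", "hardware")

def check_node_config(nodes, served_model_ids):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for node in nodes:
        node_id = node.get("id", "<missing id>")
        for field in REQUIRED_FIELDS:
            if field not in node:
                problems.append(f"{node_id}: missing field '{field}'")
        for model in node.get("models", {}):
            if model not in served_model_ids:
                problems.append(f"{node_id}: model '{model}' is not served")
    return problems

# Contents of node_config.json as shown above.
config_text = """
[{
  "id": "node1",
  "host": "NEW_IP",
  "inference_port": 8081,
  "poc_port": 8081,
  "models": { "Qwen/Qwen2.5-7B-Instruct": { "args": [] }},
  "hardware": [{ "type": "NVIDIA RTX 4090", "count": 1 }],
  "max_concurrent": 500
}]
"""

# Illustrative stand-in for the body returned by `curl http://localhost:8081/v1/models`.
models_response = {"object": "list", "data": [{"id": "Qwen/Qwen2.5-7B-Instruct"}]}

nodes = json.loads(config_text)
served = {entry["id"] for entry in models_response["data"]}
problems = check_node_config(nodes, served)
print(problems or "config looks consistent with the inference server")
```

A check like this only confirms that the local side is sane; it says nothing about what is registered on chain.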
(5) API reads config correctly
Startup logs:
INFO Registered node:
Host: NEW_IP
InferencePort: 8081
Models: Qwen/Qwen2.5-7B-Instruct
Hardware: RTX 4090
(6) BUT the API immediately switches back to old on-chain data
ERROR queryNodeStatus
dial tcp OLD_IP:8081 → connection refused
(7) On-chain query shows old/corrupted data
host: "inference" (incorrect)
port: "8080" (incorrect)
models: Qwen3-235B-* (incorrect)
hardware: H200 140GB (incorrect)
No fields match local config.
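To make the mismatch concrete, the two records can be compared field by field. This is a sketch for illustration only: the dictionary layout and the `mismatched_fields` helper are assumptions, not the API's data model, and the values are the ones quoted in this report.

```python
def mismatched_fields(local, chain):
    """Return {field: (local_value, chain_value)} for every field whose values differ."""
    fields = sorted(set(local) | set(chain))
    return {
        f: (local.get(f), chain.get(f))
        for f in fields
        if local.get(f) != chain.get(f)
    }

# Values from node_config.json (local) and from the on-chain query (chain).
local_node = {
    "host": "NEW_IP",
    "port": "8081",
    "models": ["Qwen/Qwen2.5-7B-Instruct"],
    "hardware": "NVIDIA RTX 4090",
}
chain_node = {
    "host": "inference",
    "port": "8080",
    "models": ["Qwen3-235B-*"],
    "hardware": "H200 140GB",
}

diff = mismatched_fields(local_node, chain_node)
for field, (local_value, chain_value) in diff.items():
    print(f"{field}: local={local_value!r} chain={chain_value!r}")
```

With these inputs every one of the four fields is reported, which is what a diff between the two states would be expected to show.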
(8) Diff logic incorrectly concludes “no changes”
[sync nodes] Local nodes: 1
[sync nodes] Chain nodes: 1
[sync nodes] Hardware diff: NewOrModified:[] Removed:[]
[sync nodes] No diff to submit
As a result:
No update transaction is sent
Node cannot join inference
Node stays FAILED each epoch
Operator is permanently stuck
🔍 Root Cause (Hypothesis)
The problem seems to be triggered by a combination of:
- Old registration containing invalid values
  (e.g. models not in governance list, incorrect hardware type, host="inference")
- API’s diff logic failing to detect differences between:
  - local node_config.json
  - on-chain corrupted registration
  This is likely because the comparison:
  - ignores certain fields
  - or normalizes the structures so differently that mismatches become “equal”
  - or treats missing fields as defaults that match
- Stale on-chain registration overriding local state
  Even after deleting:
    rm -rf ~/.dapi
  and restarting the API, the old chain state overwrites config-dump.json.
Result
The operator has no way to force re-registration, even with valid config + valid model.
🧪 Steps to Reproduce (Generalized)
(1) Register a node with a model that was once accepted but is now invalid (e.g. before governance model validation was strict)
(2) Later switch to a different model + different infra:
- new host
- new port
- new model
- new hardware specs
(3) Update local config
(4) Restart API
Observed:
- API loads correct local config
- API loads outdated on-chain data
- API diff logic sees no differences
- API never submits update transaction
- Node remains stuck in FAILED
🎯 Impact
Node cannot join inference for many epochs
Operator cannot fix the issue manually
Registry becomes permanently stuck in invalid state
Requires team intervention to force removal or re-registration
📝 What would help resolve the issue
Please advise:
- Is there a way to force a fresh registration?
(e.g. delete existing hardware-node entry on chain)
- Should operators create a new node ID instead of updating existing nodes?
- Is this a known issue with the current diff logic?
The behavior strongly suggests it.
- Should the compare logic be updated to catch mismatched:
  - host
  - ports
  - hardware type
  - model list
  - number of models
  - missing fields
  - defaults vs explicit values
  These should always trigger a diff.
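One way to get this behaviour is to compare records without filling in defaults first, so that an absent field never compares equal to an explicit value. The sketch below is an assumption about how such a comparison could look, not the API's actual implementation; `MISSING`, `strict_diff`, `lenient_diff`, and the default value are all hypothetical.

```python
MISSING = object()  # sentinel: field absent, distinct from any real value including None

def strict_diff(local, chain):
    """List fields that differ; a missing field always differs from a present one."""
    changed = []
    for field in sorted(set(local) | set(chain)):
        if local.get(field, MISSING) != chain.get(field, MISSING):
            changed.append(field)
    return changed

def lenient_diff(local, chain, defaults):
    """Counter-example: filling in defaults before comparing can hide a real difference."""
    changed = []
    for field in sorted(set(local) | set(chain) | set(defaults)):
        left = local.get(field, defaults.get(field))
        right = chain.get(field, defaults.get(field))
        if left != right:
            changed.append(field)
    return changed

defaults = {"poc_port": 8081}                  # hypothetical default
local = {"host": "NEW_IP", "poc_port": 8081}   # explicit value in the local config
chain = {"host": "NEW_IP"}                     # poc_port was never recorded

print(lenient_diff(local, chain, defaults))  # []
print(strict_diff(local, chain))             # ['poc_port']
```

The lenient variant reports nothing because the missing on-chain field is silently replaced by a default that happens to match, while the strict variant flags it.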
🙏 Thank you
This report is intended to help improve the reliability of node onboarding and recovery, especially during model migrations.
If additional logs are needed, I can provide:
full API startup logs
vLLM logs
config-dump.json snapshots
raw output of inferenced query inference hardware-nodes-all