Skip to content

[BUG]: Protocol compliance test_health_mirrors_runtime_mode_state fails in Edge mode due to narrow convergence predicate #4440

@jonpspri

Description

@jonpspri

🐞 Bug Summary

test_health_mirrors_runtime_mode_state in tests/protocol_compliance/test_runtime_mode.py fails in Edge (Rust) mode because the convergence polling loop only checks effective_mode, but the subsequent assertion also requires override_active to match across endpoints.

In a multi-replica deployment (docker-compose runs 3 gateway replicas behind Nginx), the two HTTP requests (GET /health and GET /admin/runtime/mcp-mode) can land on different pods. When boot_mode is edge and one pod has received an override-to-edge via Redis pub/sub but the other hasn't yet:

  • Both pods report effective_mode="edge" → the convergence loop exits immediately
  • But override_active diverges: True on the pod that received the override, False on the pod that hasn't

🧩 Affected Component

  • Federation or Transports
  • Other: protocol compliance test harness + runtime state propagation

🔁 Steps to Reproduce

  1. Run protocol compliance tests against a multi-replica Edge-mode deployment
  2. pytest tests/protocol_compliance/test_runtime_mode.py::test_health_mirrors_runtime_mode_state

🤔 Expected Behavior

The test should wait for all four compared fields (boot_mode, effective_mode, override_active, cluster_propagation) to converge before asserting, not just effective_mode.


📓 Logs / Error Output

tests/protocol_compliance/test_runtime_mode.py:149: in test_health_mirrors_runtime_mode_state
    assert mcp_rt.get(key) == admin.get(key), f"mcp_runtime.{key}={mcp_rt.get(key)!r} vs admin.{key}={admin.get(key)!r}"
E   AssertionError: mcp_runtime.override_active=True vs admin.override_active=False
E   assert True == False

🧠 Environment Info

Key Value
Runtime mode Edge (Rust MCP runtime active)
Deployment docker-compose, multi-replica behind Nginx

🧩 Additional Context

The root cause is in the convergence loop (lines 139-147 of test_runtime_mode.py):

if mcp_rt.get("effective_mode") == admin.get("effective_mode"):
    break

This should check all four asserted keys, not just effective_mode. When boot_mode already equals the override target (both edge), effective_mode is identical on both pods regardless of whether the override has propagated, masking the override_active divergence.

This is not a Rust runtime bug — both Python endpoints compute override_active the same way. It's a test convergence gap exposed by multi-pod state propagation timing.

Metadata

Metadata

Labels

bugSomething isn't workingtriageIssues / Features awaiting triage

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions