ADR-034: Multi-Pod Health Registration

Status

Proposed

Date

2026-03-23

Context

The current /api/v1/health endpoint (PR #69) probes a hardcoded list of upstream services via env-var-configured addresses:

_UPSTREAM_SERVICES: list[tuple[str, str, str]] = [
    ("Primary Orchestrator", "APME_PRIMARY_ADDRESS", "127.0.0.1:50051"),
    ("Native Validator",     "NATIVE_GRPC_ADDRESS",  "127.0.0.1:50055"),
    ...
]

This works for single-pod and local daemon deployments where all services share localhost. It breaks in multi-pod topologies (ADR-012) because:

Unknown pod count. The gateway cannot enumerate N engine pods from static env vars. Each pod is a self-contained stack (Primary + validators + Galaxy Proxy), and pods are created/destroyed dynamically by an orchestrator.
No registration path. ADR-005 rejected etcd/service-discovery for the single-pod case. That decision is correct within a pod (fixed ports, known services), but does not address the between-pod problem: how does the gateway know which pods exist?
Health is unidirectional. The gateway currently pulls health by probing addresses it already knows. In a multi-pod world, the gateway has no addresses to probe until pods announce themselves.
ADR-020 already established a push channel. Engine pods push ScanCompleted/FixCompleted events to the gateway's Reporting service via gRPC. Extending this channel with a heartbeat is a natural, low-cost addition.

Forces in tension

ADR-005: No service discovery infrastructure (etcd, Consul) — still correct
ADR-012: Scale pods, not services — pods must self-announce to a central aggregation point
ADR-020: Engine → gateway push model already exists for events — reusable for health
Operational simplicity: Operators should not have to manually configure pod addresses in the gateway

Decision

Push-based registration with TTL and selective pull verification.

Engine pods announce themselves to the gateway via periodic heartbeat messages over the existing gRPC Reporting channel. The gateway maintains a registry of known pods with TTL-based expiry. On health queries, the gateway returns the registry and optionally probes pods nearing TTL expiry.

Design

                         heartbeat (10s interval)
┌──────────────┐   ────────────────────────────────▶  ┌──────────────────┐
│ Engine Pod 1 │    PodHeartbeat{pod_id, services,    │                  │
│              │     started_at, version, load}       │     Gateway      │
└──────────────┘                                      │                  │
                                                      │  ┌────────────┐ │
┌──────────────┐   ────────────────────────────────▶  │  │  Registry  │ │
│ Engine Pod 2 │                                      │  │  (TTL=30s) │ │
└──────────────┘                                      │  └────────────┘ │
                                                      │                  │
┌──────────────┐   ────────────────────────────────▶  │  GET /health     │
│ Engine Pod N │                                      │  → all pods      │
└──────────────┘                                      └──────────────────┘

Heartbeat message

message PodHeartbeat {
  string pod_id       = 1;  // stable identifier (hostname or UUID)
  string version      = 2;  // apme-engine version
  map<string, string> services = 3;  // name → address (e.g. "primary" → "10.0.1.5:50051")
  google.protobuf.Timestamp started_at = 4;
  float cpu_load      = 5;  // optional: enables future load-aware routing
}

// Added to Reporting service
rpc PodHeartbeat(PodHeartbeat) returns (ReportAck);

Gateway registry

In-memory dictionary keyed by pod_id, storing the latest heartbeat + received timestamp
Entries expire after TTL (default 30s, configurable)
Expired entries are pruned lazily on health queries and periodically by a background task
No persistent storage needed — pods re-announce on restart

Health endpoint behavior

Scenario	Behavior
Single pod (current)	Falls back to static `_UPSTREAM_SERVICES` list when registry is empty — fully backward-compatible
Multi-pod	Returns all registered pods with per-service health status from the last heartbeat
Pod nearing TTL	Gateway optionally does an on-demand gRPC probe to distinguish "slow heartbeat" from "actually dead" before marking unavailable
Pod crash	TTL expires → pod removed from registry on next prune cycle

Engine-side integration

The GrpcReportingSink already runs a background health loop every 10s. The heartbeat piggybacks on this loop:

async def _health_loop(self) -> None:
    while True:
        await asyncio.sleep(_HEALTH_INTERVAL_S)
        await self._probe()
        if self._available:
            await self._send_heartbeat()  # new: announce to gateway

No new connections, no new dependencies. The heartbeat uses the same gRPC channel and stub.

Alternatives Considered

Alternative 1: Pure pull (gateway probes registered addresses)

Description: Pods register on startup via a POST /internal/register endpoint. Gateway stores the address list and probes all pods on health queries.

Pros:

Gateway verifies health directly (not trusting self-reports)
Detects network partitions (pod thinks it's healthy but gateway can't reach it)

Cons:

Startup ordering dependency: gateway must be reachable when a pod starts
Explicit deregistration needed for clean shutdown (or TTL fallback anyway)
Gateway becomes the active prober, scaling O(N) probes per health query
No natural crash recovery — requires TTL anyway, converging to the chosen approach

Why not chosen: Introduces a startup ordering constraint and reimplements TTL-based expiry, which the push model handles more naturally.

Alternative 2: External service discovery (Consul, etcd, Kubernetes endpoints)

Description: Delegate registration and health to infrastructure tooling.

Pros:

Mature, battle-tested solutions
Built-in health checking, DNS, and load balancing
Kubernetes Endpoints API is zero-config for k8s deployments

Cons:

Adds infrastructure dependencies (violates ADR-005 spirit)
Overkill for Podman-based deployments where there is no orchestrator
Different solution needed per deployment target (Podman vs. k8s vs. bare metal)

Why not chosen: APME must work in Podman-only environments with no orchestrator. External service discovery is welcome as an optional overlay (e.g. Kubernetes Endpoints) but cannot be the primary mechanism.

Alternative 3: Shared database polling

Description: Pods write their status to a shared database (the gateway's SQLite or a separate Redis). Gateway reads the table on health queries.

Pros:

Durable registration survives gateway restarts
Simple query model

Cons:

SQLite doesn't support concurrent writes from multiple pods (file locking)
Adds a database dependency to the engine (violates ADR-020's "engine never imports a database client")
Redis adds infrastructure

Why not chosen: Violates the engine's stateless design principle (ADR-020) and introduces shared-state coordination.

Consequences

Positive

Multi-pod health visibility with zero operator configuration
Backward-compatible: single-pod deployments work unchanged
Reuses existing gRPC channel — no new ports, protocols, or dependencies
Natural crash recovery via TTL expiry
Foundation for future load-aware routing (heartbeat carries cpu_load)
ADR-005 remains valid: no etcd, no external service discovery

Negative

Heartbeat adds ~1 small RPC every 10s per pod — negligible but nonzero
30s TTL means up to 30s stale health data after a pod crash
In-memory registry is lost on gateway restart (pods re-announce within 10s)

Neutral

Does not change intra-pod service discovery (ADR-005 still governs that)
Does not change the scaling model (ADR-012 still governs that)
The heartbeat is a superset of the existing health probe — could eventually replace it

Implementation Notes

Phase 1: Heartbeat proto + gateway registry (this ADR)

Add PodHeartbeat message and RPC to reporting.proto
Implement in-memory registry in gateway's Reporting servicer
Extend /api/v1/health to merge static services + registered pods
Add heartbeat sender to GrpcReportingSink._health_loop

Phase 2: UI and operational tooling

Update HealthPage.tsx to show per-pod breakdown
Add apme-cli pods command to list registered pods
Add Prometheus metrics for pod count, heartbeat latency

Phase 3: Load-aware routing (future)

Use cpu_load from heartbeats to weight routing decisions
Gateway forwards scan requests to least-loaded pod
Requires a routing layer in front of Primary (separate ADR)

Backward compatibility

When the registry is empty (no pods have sent heartbeats), the health endpoint falls back to the current static _UPSTREAM_SERVICES list. This means:

Existing single-pod deployments work with zero changes
Existing daemon-mode deployments work with zero changes
Multi-pod deployments get automatic registration once the engine is updated

Related Decisions

ADR-005: No service discovery — still valid for intra-pod; this ADR addresses inter-pod
ADR-012: Scale pods, not services — pods self-announce to gateway
ADR-020: Reporting service and event delivery — heartbeat extends the same push channel
ADR-029: Web gateway architecture — health endpoint is a gateway concern
ADR-048: Pod-internal admin endpoints rely on network isolation — if inter-pod routing exposes admin endpoints, auth must be added per ADR-048

References

PR #69: health endpoint implementation that motivated this discussion
src/apme_gateway/api/router.py: current static _UPSTREAM_SERVICES list
src/apme_engine/daemon/sinks/grpc_reporting.py: existing health loop to extend

Revision History

Date	Author	Change
2026-03-23	cidrblock	Initial proposal

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ADR-034: Multi-Pod Health Registration

Status

Date

Context

Forces in tension

Decision

Design

Heartbeat message

Gateway registry

Health endpoint behavior

Engine-side integration

Alternatives Considered

Alternative 1: Pure pull (gateway probes registered addresses)

Alternative 2: External service discovery (Consul, etcd, Kubernetes endpoints)

Alternative 3: Shared database polling

Consequences

Positive

Negative

Neutral

Implementation Notes

Phase 1: Heartbeat proto + gateway registry (this ADR)

Phase 2: UI and operational tooling

Phase 3: Load-aware routing (future)

Backward compatibility

Related Decisions

References

Revision History

FilesExpand file tree

ADR-034-multi-pod-health-registration.md

Latest commit

History

ADR-034-multi-pod-health-registration.md

File metadata and controls

ADR-034: Multi-Pod Health Registration

Status

Date

Context

Forces in tension

Decision

Design

Heartbeat message

Gateway registry

Health endpoint behavior

Engine-side integration

Alternatives Considered

Alternative 1: Pure pull (gateway probes registered addresses)

Alternative 2: External service discovery (Consul, etcd, Kubernetes endpoints)

Alternative 3: Shared database polling

Consequences

Positive

Negative

Neutral

Implementation Notes

Phase 1: Heartbeat proto + gateway registry (this ADR)

Phase 2: UI and operational tooling

Phase 3: Load-aware routing (future)

Backward compatibility

Related Decisions

References

Revision History