| Author | HyeokJin Kim (hyeokjin@lablup.com) |
|---|---|
| Status | Draft |
| Created | 2026-02-07 |
| Created-Version | 26.3.0 |
| Target-Version | |
| Implemented-Version | |

- JIRA Epic: BA-4313
- JIRA: BA-4314 (BEP), BA-4315 (Foundation), BA-4316 (CLI Migration), BA-4317 (Runtime), BA-4318 (Client Pool)

| Document | Contents | Related Phase |
|---|---|---|
| Data Model | DB DDL, role/scope definitions and details | Phase 1 |
| Config Schema | TOML config examples, Pydantic models, instance_id per-component rules | Phase 1 |
| Event & Registration Flow | Event type definitions, registration flow details, config_hash optimization, per-component capabilities | Phase 1, 3 |
| CLI Migration | CLI usage, per-component mapping rules, output examples | Phase 2 |
| Client Pool & Query API | ServiceCatalogCache, SDClientPool, existing address transition strategy, GraphQL query API | Phase 4 |
Backend.AI's current service discovery (SD) uses different mechanisms per component, lacks consistency, and makes it difficult to query and expose service status to users.
Current state:
| Component | SD Registration | How other services discover it |
|---|---|---|
| Agent | Heartbeat event → Manager → agents table | Manager queries DB |
| Manager | ServiceDiscoveryLoop (ETCD/Redis) | Webserver directly sets address in config |
| Storage Proxy | ServiceDiscoveryLoop (ETCD/Redis) | Manager reads address from etcd config |
| AppProxy Coordinator | ServiceDiscoveryLoop (Redis only) | Manager uses wsproxy_addr from scaling_groups row |
| AppProxy Worker | ServiceDiscoveryLoop (Redis only) | Worker requests Coordinator for connection |
| Webserver | Not registered | No server needs to discover it |
Problems:
- Inadequate query method: SD data stored in Redis hash or etcd is hard to expose to users in a structured way
- Lack of consistency: Agent uses events, others use `ServiceDiscoveryLoop`, AppProxy address is in scaling_group row, Manager address is hardcoded in config; each component takes a different approach
- No endpoint model: Services have multiple endpoints for different roles (API, RPC, metrics) and network contexts (container, cluster, public), but there is no structure to represent this
- Client pool not integrated: `AgentClientPool` and `ClientPool` are not connected to SD, requiring manual address injection
- Lack of failure resilience: Risk of SD data loss on Redis/etcd failure
Current ServiceDiscoveryLoop code (reference)

    class ServiceDiscoveryLoop:
        async def start(self):
            await self._service_discovery.register(self._service_metadata)
            self._heartbeat_task = asyncio.create_task(self._heartbeat_loop())  # 60s

        async def stop(self):
            await self._service_discovery.unregister(...)

- Redis backend: stores hash at `service_discovery.{group}.{id}` key with 3min TTL
- ETCD backend: stores at `ai/backend/service_discovery/{group}/{id}` prefix
Current endpoint model (reference)
The current ServiceEndpoint represents only a single address, but each component actually has multiple addresses:
- Agent: `rpc_listen_addr`, `advertised_rpc_addr`, `service_addr`, `announce_internal_addr`
- Storage Proxy: `client.service_addr`, `manager.service_addr`, `manager.announce_addr`, `manager.announce_internal_addr`
- AppProxy Worker: `api_bind_addr`, `api_advertised_addr`, `wildcard_domain.bind_addr`, `port_proxy.bind_host`

- Agent publishes `AgentHeartbeatEvent` via Redis Stream anycast → Manager syncs to `agents` table
- Other components register via `ServiceDiscoveryLoop` to Redis/ETCD
- Endpoint model is a single-address structure, unable to represent multiple addresses

- Redis events as transport channel: Reuse existing Redis Stream event infrastructure
- DB as authoritative store: Enables SQL queries, GraphQL exposure, audit trails
- role + scope based multi-endpoint: Distinguish by role (what the endpoint is for) and scope (where it's accessed from)
- Config-driven: Explicit declaration via `[service-discovery.endpoints]` section. CLI migration tool provided
- Self-preservation: Freeze last known state on backend failure to prevent service eviction (Netflix Eureka pattern)

Composed of service_catalog + service_catalog_endpoint tables. Each service is identified by (service_group, instance_id), and endpoints represent multiple addresses via (role, scope) combinations.
- role: Functional purpose of the endpoint (`api`, `rpc`, `metrics`, `client-api`, `internal-api`)
- scope: Network access context (`cluster`, `container`, `public`)

DDL and role/scope details → Data Model Details
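
The (role, scope) addressing described above can be pictured with a small in-memory model. This is only a sketch under assumptions: the class names `Role`, `Scope`, `Endpoint`, `ServiceEntry`, the `find` method, and the `host:port` address form are hypothetical and are not taken from the actual DDL or models, which live in the Data Model document.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Role(str, Enum):
    # Role values as listed in this proposal.
    API = "api"
    RPC = "rpc"
    METRICS = "metrics"
    CLIENT_API = "client-api"
    INTERNAL_API = "internal-api"


class Scope(str, Enum):
    # Scope values as listed in this proposal.
    CLUSTER = "cluster"
    CONTAINER = "container"
    PUBLIC = "public"


@dataclass(frozen=True)
class Endpoint:
    role: Role
    scope: Scope
    address: str  # hypothetical "host:port" representation


@dataclass
class ServiceEntry:
    # A service is identified by (service_group, instance_id).
    service_group: str
    instance_id: str
    endpoints: List[Endpoint] = field(default_factory=list)

    def find(self, role: Role, scope: Scope) -> Optional[Endpoint]:
        """Return the endpoint matching the (role, scope) pair, if any."""
        for ep in self.endpoints:
            if ep.role is role and ep.scope is scope:
                return ep
        return None


# Example: one Storage Proxy instance exposing two endpoints.
entry = ServiceEntry(
    service_group="storage-proxy",
    instance_id="sp-1",
    endpoints=[
        Endpoint(Role.CLIENT_API, Scope.PUBLIC, "203.0.113.10:6021"),
        Endpoint(Role.INTERNAL_API, Scope.CLUSTER, "10.0.0.5:6022"),
    ],
)
```

A caller inside the cluster would ask for `entry.find(Role.INTERNAL_API, Scope.CLUSTER)`, while a pair that was never declared simply yields `None`.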
A stable identifier that persists across service restarts. Uses config instance-id as first priority; falls back to component-specific rules if omitted. Services with multiple instances (Storage Proxy, AppProxy Coordinator) must explicitly set it in config.
Per-component rules and code → Config Schema
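
The priority rule for `instance_id` could look roughly like the following. The function name `resolve_instance_id`, its parameters, and the `require_explicit` flag are assumptions made for illustration; the real per-component fallback rules are defined in the Config Schema document.

```python
from typing import Callable, Optional


def resolve_instance_id(
    configured: Optional[str],
    fallback: Callable[[], str],
    *,
    require_explicit: bool = False,
) -> str:
    """Pick a stable instance id.

    1. Use the value from config when present.
    2. Otherwise, components that may run multiple instances must fail fast.
    3. Otherwise, use the component-specific fallback rule.
    """
    if configured:
        return configured
    if require_explicit:
        raise ValueError("instance-id must be set explicitly for this component")
    return fallback()


# Hypothetical fallback: an agent reusing an id it already has.
agent_id = resolve_instance_id(None, lambda: "agent-host01")
# Explicit config always wins over the fallback.
proxy_id = resolve_instance_id("sp-1", lambda: "unused")
```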
Startup → Read config + compute config_hash
→ Publish ServiceRegisteredEvent (Redis Stream anycast)
→ Manager consumes → DB UPSERT
Heartbeat (60s) → Compare config_hash → Full UPSERT if changed, update last_heartbeat only if same
Shutdown → ServiceDeregisteredEvent → status = DEREGISTERED
Stale detection (5min) → Manager sweep → status = UNHEALTHY
Event type definitions and details → Event & Registration Flow Details
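
The `config_hash` comparison in the heartbeat step can be sketched as below. This is an illustration only: hashing canonical JSON with SHA-256, the `CatalogRow` shape, and the in-memory `catalog` dict standing in for the DB are all assumptions, not the proposal's actual hashing scheme or schema.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


def compute_config_hash(sd_config: Dict[str, Any]) -> str:
    # Canonical JSON (sorted keys) so that key order does not change the hash.
    canonical = json.dumps(sd_config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class CatalogRow:
    config_hash: str
    endpoints: List[Dict[str, Any]]
    last_heartbeat: float


def apply_heartbeat(
    catalog: Dict[Tuple[str, str], CatalogRow],
    key: Tuple[str, str],  # (service_group, instance_id)
    sd_config: Dict[str, Any],
    now: float,
) -> str:
    """Full upsert when the hash changed or the row is new, else touch only last_heartbeat."""
    new_hash = compute_config_hash(sd_config)
    row = catalog.get(key)
    if row is None or row.config_hash != new_hash:
        catalog[key] = CatalogRow(new_hash, list(sd_config.get("endpoints", [])), now)
        return "full_upsert"
    row.last_heartbeat = now
    return "heartbeat_only"
```

The intent is that a steady-state heartbeat touches a single timestamp column, and the endpoint rows are rewritten only when the declared configuration actually changes.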
In Agent's multi-agent mode (PRIMARY + AUXILIARY), SD registers only one entry per process. Uses the PRIMARY agent's agent_id as instance_id, and AUXILIARY agents continue using the existing AgentHeartbeatEvent → agents table path. Multi-agent is treated as a sub-concept of SD, with minimal integration in the current scope.
- Existing `ServiceDiscoveryLoop` is not removed immediately; marked as deprecated
- Falls back to existing SD method when `[service-discovery]` config section is absent
- `config migrate service-discovery` CLI supports migration of existing deployments

- New DB tables added (Alembic migration required)
- EventProducer added to Webserver (no impact on existing behavior)
- New value added to `EventDomain`

- Create `service_catalog` and `service_catalog_endpoint` tables via Alembic migration
- Generate `[service-discovery]` section in existing config files using `config migrate service-discovery` CLI
- Restart services with new config → SD event publishing begins
- Remove existing `ServiceDiscoveryLoop` as a separate task after all components are transitioned

Current: Component → Redis Event → Manager (consumer) → DB
Future: Component → Redis Event → SD Server (consumer) → DB
No changes required on the event publishing side (each component).
- Add `ServiceDiscoveryConfig`, `ServiceEndpointConfig` Pydantic models
- Add `[service-discovery]` section to each component's unified config
- Define `ServiceRegisteredEvent`, `ServiceDeregisteredEvent` event types
- Add `EventDomain.SERVICE_DISCOVERY`
- `service_catalog`, `service_catalog_endpoint` DB models + Alembic migration
- Update `generate-sample` command

- Define `MappingRule` data structure
- Write per-component mapping rules (Agent, Manager, Storage, AppProxy)
- Implement `config migrate service-discovery` CLI command
- Unit tests for mapping rules

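
One possible shape for a mapping rule is sketched here. Only the name `MappingRule` comes from this proposal; its fields, the dotted-path lookup, the `apply_rules` helper, and the specific key-to-(role, scope) assignments are hypothetical and may differ from the rules defined in the CLI Migration document.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class MappingRule:
    source_key: str  # dotted path into the legacy config (hypothetical layout)
    role: str
    scope: str


def lookup(config: Dict[str, Any], dotted: str) -> Optional[Any]:
    """Walk a nested dict along a dotted path; return None if any part is missing."""
    node: Any = config
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def apply_rules(config: Dict[str, Any], rules: List[MappingRule]) -> List[Dict[str, Any]]:
    """Produce endpoint declarations for every rule whose source key exists."""
    endpoints = []
    for rule in rules:
        value = lookup(config, rule.source_key)
        if value is None:
            continue  # legacy key not set: emit nothing rather than guess
        endpoints.append({"role": rule.role, "scope": rule.scope, "address": value})
    return endpoints


# Illustrative rules only; the role/scope chosen for each key is an assumption.
AGENT_RULES = [
    MappingRule("agent.advertised_rpc_addr", role="rpc", scope="cluster"),
    MappingRule("agent.service_addr", role="api", scope="cluster"),
]
legacy = {"agent": {"rpc_listen_addr": "0.0.0.0:6001", "advertised_rpc_addr": "10.0.0.7:6001"}}
migrated = apply_rules(legacy, AGENT_RULES)
```

Keeping the rules as plain data makes them easy to cover with unit tests, one table of rules per component.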
- Add SD event publishing logic to each component (startup, heartbeat, shutdown)
- Add EventProducer to Webserver
- Implement SD event handler in Manager (DB upsert)
- Implement Manager background sweep (mark stale services as UNHEALTHY)
- Implement GraphQL query API
- Mark `ServiceDiscoveryLoop` as deprecated

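
The Manager background sweep might be reduced to a check like the one below. It is a sketch, assuming a 5-minute threshold as stated in the registration flow, an in-memory list in place of DB rows, and hypothetical names (`CatalogEntry`, `sweep_stale`, the `HEALTHY` status string).

```python
from dataclasses import dataclass
from typing import List

STALE_THRESHOLD_SEC = 5 * 60  # "Stale detection (5min)" from the flow above


@dataclass
class CatalogEntry:
    instance_id: str
    last_heartbeat: float  # seconds since epoch
    status: str = "HEALTHY"  # hypothetical name for the normal state


def sweep_stale(
    entries: List[CatalogEntry],
    now: float,
    threshold: float = STALE_THRESHOLD_SEC,
) -> List[str]:
    """Mark entries whose last heartbeat is older than the threshold as UNHEALTHY.

    Entries that are already DEREGISTERED or UNHEALTHY are left untouched.
    Returns the instance ids that were newly marked.
    """
    marked = []
    for entry in entries:
        if entry.status == "HEALTHY" and now - entry.last_heartbeat > threshold:
            entry.status = "UNHEALTHY"
            marked.append(entry.instance_id)
    return marked


entries = [
    CatalogEntry("mgr-1", last_heartbeat=1000.0),
    CatalogEntry("sp-1", last_heartbeat=600.0),
    CatalogEntry("old-agent", last_heartbeat=0.0, status="DEREGISTERED"),
]
newly_unhealthy = sweep_stale(entries, now=1000.0)
```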
- Implement `ServiceCatalogCache` (in-memory cache + self-preservation)
- Implement `SDClientPool` (role/scope based endpoint selection)
- Transition Manager's Storage Proxy client to SD-based
- Transition Manager's AppProxy client to SD-based
- Transition `AgentClientPool` address resolution to SD-based

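
The self-preservation behavior of the cache can be illustrated with a minimal sketch. Only the class name `ServiceCatalogCache` comes from this proposal; the `fetch` callable, the `BackendUnavailable` exception, the `preserved` flag, and the snapshot layout are assumptions for illustration.

```python
from typing import Callable, Dict, Tuple

Snapshot = Dict[Tuple[str, str], str]  # (service_group, instance_id) -> address


class BackendUnavailable(Exception):
    """Hypothetical error raised when the catalog backend cannot be reached."""


class ServiceCatalogCache:
    def __init__(self, fetch: Callable[[], Snapshot]) -> None:
        self._fetch = fetch
        self._snapshot: Snapshot = {}
        self.preserved = False  # True while serving a frozen snapshot

    def refresh(self) -> Snapshot:
        """Reload from the backend; on failure keep the last known state."""
        try:
            fresh = self._fetch()
        except BackendUnavailable:
            # Self-preservation: do not evict anything, freeze what we have.
            self.preserved = True
            return self._snapshot
        self._snapshot = dict(fresh)
        self.preserved = False
        return self._snapshot
```

With this shape, a client pool that resolves addresses through the cache keeps using the last known endpoints during a backend outage instead of dropping every connection at once.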
| Date | Decision | Rationale | Alternatives Considered |
|---|---|---|---|
| 2026-02-07 | Adopt Redis Event → DB architecture | Reuse existing event infrastructure + SQL queryability | Direct DB polling (latency), etcd watch (complexity) |
| 2026-02-07 | role + scope 2D endpoint model | Express role and network scope independently | Single name field (ambiguous), role-only (no network distinction) |
| 2026-02-07 | Config-driven endpoint registration | Explicit, auditable, enables CLI migration tool | Code-based auto-inference (implicit), hybrid (complex) |
| 2026-02-07 | Self-preservation (Eureka pattern) | Preserve existing connections on backend failure, prevent cascading failures | Immediate eviction (cascading failure risk), TTL-only (inflexible) |
| 2026-02-07 | Remove host from scope → cluster/container/public | localhost included in cluster; standalone host scope unnecessary | Keep host (over-segmentation) |
| 2026-02-07 | Agent SD: one entry per process | Multi-agent is sub-concept; minimize complexity | Register each auxiliary (excessive, duplicates existing heartbeat) |
| 2026-02-07 | instance_id: SD config first + component fallback | Consistent priority rules across all components | Per-component separate rules (inconsistent), fixed UUID (file management overhead) |
| 2026-02-07 | Webserver: add EventProducer only | SD registration needed but event consumption (Dispatcher) unnecessary | Full event system adoption (excessive) |
- Heartbeat DB load optimization
  - Designed to only update `last_heartbeat` when `config_hash` is unchanged
  - Monitoring needed if service count grows very large
- Relationship with existing Agent heartbeat
  - `AgentHeartbeatEvent` contains large data (resource slots, image lists, etc.)
  - Likely appropriate to keep separate from SD since purposes differ, but final decision needed after reviewing existing integration structures (AgentRow, scaling_group, etc.)
- AppProxy Coordinator to scaling_group mapping
  - Currently `wsproxy_addr`/`wsproxy_api_token` stored in `scaling_groups` row
  - Need to decide: labels-based mapping vs separate junction table when transitioning to SD catalog

- Webserver EventProducer: Add Redis Stream connection + EventProducer only. EventDispatcher (event consumption) remains Manager-only
- Scope query method: Registration is config-based with explicit per-scope endpoint declaration. Queries filter by specifying desired scope
- BEP-1024: Agent RPC Connection Pooling — existing client pool patterns
- BEP-1023: Unified Config Consolidation — config structure reference
- HashiCorp Consul — Catalog API, agent cache, blocking queries, anti-entropy
- Netflix Eureka — Client-side registry cache, self-preservation mode
- Kubernetes — EndpointSlices, kube-proxy iptables rules persistence
- Apache ZooKeeper — Ephemeral nodes, Curator ServiceCache