- Status: Accepted
- Date: 2026-04-11
- Deciders: Architecture Team, Ricardo Cataldi
- Tags: infrastructure, kubernetes, aks, namespace, security, istio, flux, network-policy
- References: ADR-024, ADR-017, ADR-008, ADR-021, ADR-007

The holiday-peak-hub platform deploys 27 production services to AKS: 1 CRUD service (transactional system of record) and 26 agent services (including 4 truth-layer services). All services currently run in a single holiday-peak Kubernetes namespace, despite node-level isolation already being enforced via dedicated node pools (aks-crud, aks-agents, aks-system) with taints (ADR-008).
A single shared namespace presents the following problems:
- Blast radius — A misconfigured RBAC binding, resource quota, or network policy in one workload class affects all 27 services. A runaway agent pod can consume the namespace's resource quota, starving the CRUD service (the transactional system of record).
- Observability noise — Logs, metrics, and traces for 27 services share label selectors, making per-concern dashboards noisy and alert routing less precise.
- Security posture: Kubernetes RBAC Roles, Secrets, ConfigMaps, and ServiceAccounts are namespace-scoped. In a single namespace, any Role that grants Secret access without resourceNames restrictions covers every service's Secrets, and any pod in the namespace can mount any of them.
- Deployment independence — With Flux CD (ADR-017), a single Kustomization reconciling all 27 services means a rendering error in one agent blocks deployment of the CRUD service and vice versa.
- Istio policy granularity: Istio AuthorizationPolicy and PeerAuthentication resources are namespace-scoped. A single namespace makes it impossible to enforce distinct mTLS modes or access policies for the two workload classes without per-service policy explosion.
Architecture frameworks applied:
- TOGAF (Architecture Building Blocks): Namespace boundaries are governed infrastructure building blocks whose scope affects security, deployment, and observability.
- Domain-Driven Design (Bounded Contexts): CRUD (transactional) and Agent (intelligence) are fundamentally different bounded contexts with different scaling, availability, and security characteristics.
- Azure Well-Architected Framework (Security / Operational Excellence): Least privilege, defense in depth, and independent deployment units.
- microservices.io (Database per Service, Bulkhead): Namespace isolation is the Kubernetes-native bulkhead mechanism.
Split the existing holiday-peak namespace into exactly two namespaces:
| Namespace | Services | Node Pool | Purpose |
|---|---|---|---|
| holiday-peak-crud | crud-service (1 service) | aks-crud | Transactional system of record: products, orders, cart, customers |
| holiday-peak-agents | All 26 agent services (including truth-ingestion, truth-enrichment, truth-hitl, truth-export) | aks-agents | Intelligence, enrichment, search, support, logistics, CRM, product management |
Three or more namespaces (e.g., separating truth-layer or per-domain) were evaluated and rejected:
- Truth services share the same scaling characteristics, node pool, memory patterns (ADR-007), and MCP communication topology as other agents. Separating them adds cross-namespace complexity without measurable security or operational benefit.
- Per-domain namespaces (CRM, eCommerce, inventory, logistics, product-management) would create 6+ namespaces for 26 services, multiplying Flux Kustomizations, Istio policies, AGC routing rules, and RBAC bindings. The incremental isolation is not justified at current scale.
The two-namespace split maximizes the ratio of isolation benefit to operational cost by targeting the single most meaningful workload boundary: transactional vs. intelligence.
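To make the operational-cost side of that ratio concrete, the sketch below counts the per-namespace objects whose manifests appear later in this ADR (one Flux Kustomization, one PeerAuthentication, one AuthorizationPolicy, two NetworkPolicies) and scales the count by the number of namespaces. It assumes every additional namespace would need the same set, which is a simplification, and the function name is illustrative.

```shell
#!/bin/sh
# Per-namespace objects defined in this ADR:
#   1 Flux Kustomization, 1 PeerAuthentication, 1 AuthorizationPolicy, 2 NetworkPolicies
OBJECTS_PER_NAMESPACE=5

# policy_objects <namespace_count>: objects to maintain if every namespace
# needs the same set (an assumption; real per-domain policies could differ).
policy_objects() {
  echo $(( $1 * OBJECTS_PER_NAMESPACE ))
}

echo "2 namespaces: $(policy_objects 2) objects"
echo "7 namespaces: $(policy_objects 7) objects"
```

Under these assumptions the two-namespace split maintains 10 such objects, while a seven-namespace per-domain layout would maintain 35, before counting AGC routes or RBAC bindings.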
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#1a1a2e', 'primaryTextColor': '#e0e0e0', 'primaryBorderColor': '#00d4ff', 'lineColor': '#00d4ff', 'secondaryColor': '#16213e', 'tertiaryColor': '#0f3460', 'background': '#0d1117', 'mainBkg': '#1a1a2e', 'nodeBorder': '#00d4ff', 'clusterBkg': '#16213e', 'clusterBorder': '#00d4ff', 'titleColor': '#00d4ff', 'edgeLabelBackground': '#1a1a2e', 'nodeTextColor': '#e0e0e0' }}}%%
flowchart TB
subgraph AKS["AKS Cluster"]
direction TB
subgraph SYS["aks-system pool"]
direction LR
ISTIO[Istio Control Plane]
FLUX[Flux Controllers]
KEDA[KEDA Operator]
end
subgraph CRUD_NS["namespace: holiday-peak-crud<br/>(aks-crud pool — 1 node)"]
direction LR
CRUD[crud-service]
end
subgraph AGENTS_NS["namespace: holiday-peak-agents<br/>(aks-agents pool — 5 nodes)"]
direction LR
CRM1[crm-profile-aggregation]
CRM2[crm-segmentation-personalization]
CRM3[crm-support-assistance]
CRM4[crm-campaign-intelligence]
ECOM1[ecommerce-cart-intelligence]
ECOM2[ecommerce-catalog-search]
ECOM3[ecommerce-checkout-support]
ECOM4[ecommerce-order-status]
ECOM5[ecommerce-product-detail-enrichment]
INV1[inventory-alerts-triggers]
INV2[inventory-health-check]
INV3[inventory-jit-replenishment]
INV4[inventory-reservation-validation]
LOG1[logistics-carrier-selection]
LOG2[logistics-eta-computation]
LOG3[logistics-returns-support]
LOG4[logistics-route-issue-detection]
PM1[product-management-acp-transformation]
PM2[product-management-assortment-optimization]
PM3[product-management-consistency-validation]
PM4[product-management-normalization-classification]
SE[search-enrichment-agent]
T1[truth-ingestion]
T2[truth-enrichment]
T3[truth-hitl]
T4[truth-export]
end
end
EXT_APIM[Azure APIM + AGC] -->|north-south| CRUD
EXT_APIM -->|north-south| ECOM2
EXT_APIM -->|north-south| CRM3
CRUD -->|APIM gateway| CRM1
ECOM1 -->|cross-namespace DNS| CRUD
T1 -->|cross-namespace DNS| CRUD
CRM1 -->|MCP via APIM| CRM2
Agent services that need transactional data from the CRUD service will use the cross-namespace Kubernetes DNS FQDN:
http://crud-service-crud-service.holiday-peak-crud.svc.cluster.local:8000
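As a minimal sketch of how that address is composed, the helper below builds the URL from the Service name, namespace, and port. The function name `svc_url` is hypothetical and not part of the repository; the values passed to it are taken from the address above.

```shell
#!/bin/sh
# Hypothetical helper: compose the in-cluster URL for a Kubernetes Service.
# Service DNS names follow <service>.<namespace>.svc.<cluster-domain>;
# cluster.local is assumed because it is the domain used in this ADR.
svc_url() {
  service="$1"
  namespace="$2"
  port="$3"
  printf 'http://%s.%s.svc.cluster.local:%s\n' "$service" "$namespace" "$port"
}

svc_url crud-service-crud-service holiday-peak-crud 8000
```

Only the namespace segment changes when a Service moves, which is why the migration can be expressed as a single value update per agent.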
Why direct Kubernetes DNS (Option A) over routing through APIM (Option B):
| Criterion | Option A: Direct K8s DNS | Option B: Via APIM |
|---|---|---|
| Latency | ~1-2 ms (in-cluster) | ~15-30 ms (APIM + AGC hop) |
| Cost | No additional per-request cost | APIM request units per call |
| Reliability | No external dependency | APIM outage blocks agent→CRUD |
| mTLS | Istio sidecar-to-sidecar | Terminated at AGC, re-encrypted to APIM |
| Observability | Full Istio telemetry | Split across APIM logs + Istio |
| Blast radius | Cluster-local, no public path | Uses public APIM facade for internal traffic |
| Symmetry | Asymmetric with CRUD→Agent | Symmetric but architecturally wasteful |
Decision: Option A. Agent→CRUD is a hot path (product lookups, order reads, cart validation) invoked by every agent enrichment cycle. Adding 15-30 ms per call across 26 services with multiple CRUD calls each would degrade the platform's primary value proposition — low-latency intelligent retail experiences. APIM exists as the public facade (ADR-021); using it for intra-cluster traffic violates separation of concerns.
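The effect of the per-call figures can be illustrated with simple arithmetic. The sketch below multiplies the latency ranges from the comparison table by a number of sequential CRUD calls per enrichment cycle. The value of `CALLS_PER_CYCLE` is an assumed figure chosen for illustration, not a measurement from the platform.

```shell
#!/bin/sh
# Illustrative arithmetic only. CALLS_PER_CYCLE is an assumed figure, not a measurement.
CALLS_PER_CYCLE=4

# added_latency_ms <calls> <per_call_ms>: total latency of sequential calls, in ms.
added_latency_ms() {
  echo $(( $1 * $2 ))
}

echo "direct DNS: $(added_latency_ms "$CALLS_PER_CYCLE" 1)-$(added_latency_ms "$CALLS_PER_CYCLE" 2) ms"
echo "via APIM:   $(added_latency_ms "$CALLS_PER_CYCLE" 15)-$(added_latency_ms "$CALLS_PER_CYCLE" 30) ms"
```

With four sequential calls, the table's ranges work out to 4-8 ms for direct DNS and 60-120 ms through APIM per cycle.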
Mitigation for asymmetry: CRUD→Agent calls already go through APIM because CRUD is the entry point for external requests and agent calls are policy-gated intelligence lookups. Agent→CRUD calls are internal data-plane reads that benefit from locality. The asymmetry is intentional and architecturally sound.
Per ADR-021 and ADR-024, CRUD calls agent REST endpoints through APIM for enrichment/decision assist flows. This path is unchanged by the namespace split.
Per ADR-024, agent-to-agent communication uses MCP tools routed through APIM. All 26 agent services are registered as MCP Servers in APIM. This path is namespace-agnostic and unchanged.
Per ADR-021, external traffic flows through APIM → AGC → AKS. AGC HTTPRoute resources will now target services in their respective namespaces. The AGC Gateway resource must allow route attachment from both namespaces.
All 26 agent services read CRUD_SERVICE_URL from environment variables. The migration updates this single value:
| Before | After |
|---|---|
| http://crud-service-crud-service.holiday-peak.svc.cluster.local:8000 | http://crud-service-crud-service.holiday-peak-crud.svc.cluster.local:8000 |
Per ADR-024, CRUD_SERVICE_URL is used exclusively for approved cross-namespace DNS routing of transactional reads. Agents are forbidden from using this URL for general CRUD API consumption (e.g., creating orders, updating products, or invoking CRUD business logic).
This is set in Helm values per service and rendered into deployment manifests by render-helm.sh. The change is a values-only update — no application code changes required.
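Because the change is a single string substitution, it can be sketched as a small filter. The helper name `rewrite_crud_url` and the sample line fed to it are hypothetical; the actual key layout of the per-service values files is not shown in this ADR.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the old CRUD address to the new namespace.
# Reads text on stdin and writes the result to stdout, so it can be tried
# against a copy of a values file before anything is changed in place.
rewrite_crud_url() {
  sed 's#crud-service-crud-service\.holiday-peak\.svc\.cluster\.local#crud-service-crud-service.holiday-peak-crud.svc.cluster.local#g'
}

# The key name and layout of this sample line are illustrative.
echo 'CRUD_SERVICE_URL: http://crud-service-crud-service.holiday-peak.svc.cluster.local:8000' | rewrite_crud_url
```

The pattern matches only the old host name, so running the filter twice leaves an already migrated value unchanged.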
Enable STRICT mTLS independently per namespace to ensure all inter-namespace traffic is encrypted and identity-verified:
# holiday-peak-crud namespace
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: holiday-peak-crud
spec:
  mtls:
    mode: STRICT
---
# holiday-peak-agents namespace
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: holiday-peak-agents
spec:
  mtls:
    mode: STRICT

Restrict which workloads can reach the CRUD service. Only agent services from holiday-peak-agents and the Istio ingress gateway (for AGC traffic) are permitted:
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: crud-service-access
  namespace: holiday-peak-crud
spec:
  selector:
    matchLabels:
      app: crud-service
  action: ALLOW
  rules:
    # Allow agent services from the agents namespace
    - from:
        - source:
            namespaces: ["holiday-peak-agents"]
      to:
        - operation:
            ports: ["8000"]
    # Allow AGC ingress gateway
    - from:
        - source:
            namespaces: ["istio-system"]
      to:
        - operation:
            ports: ["8000"]

Agent-to-agent communication within holiday-peak-agents is unrestricted (same-namespace, governed by ADR-024 MCP policy). A namespace-wide ALLOW policy for agent services admits only holiday-peak-agents, istio-system, and holiday-peak-crud as sources; requests from any other namespace are denied:
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: agents-default-access
  namespace: holiday-peak-agents
spec:
  action: ALLOW
  rules:
    # Allow intra-namespace (agent-to-agent)
    - from:
        - source:
            namespaces: ["holiday-peak-agents"]
    # Allow AGC ingress
    - from:
        - source:
            namespaces: ["istio-system"]
    # Allow CRUD callbacks (e.g., event confirmations)
    - from:
        - source:
            namespaces: ["holiday-peak-crud"]

NetworkPolicies operate at L3/L4 as a defense-in-depth layer beneath Istio's L7 policies. If Istio sidecars fail to inject or are bypassed, these policies still enforce boundaries.
# Deny all ingress by default in holiday-peak-crud
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: holiday-peak-crud
spec:
  podSelector: {}
  policyTypes:
    - Ingress
---
# Allow ingress to CRUD from agents namespace and istio-system
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-agents-and-ingress
  namespace: holiday-peak-crud
spec:
  podSelector:
    matchLabels:
      app: crud-service
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: holiday-peak-agents
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: istio-system
      ports:
        - protocol: TCP
          port: 8000
---
# Deny all ingress by default in holiday-peak-agents
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: holiday-peak-agents
spec:
  podSelector: {}
  policyTypes:
    - Ingress
---
# Allow ingress to agents from same namespace, crud namespace, and istio-system
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-internal-and-ingress
  namespace: holiday-peak-agents
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: holiday-peak-agents
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: holiday-peak-crud
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: istio-system
      ports:
        - protocol: TCP
          port: 8000

Split the single Flux Kustomization into two independent reconciliation units, one per namespace. This ensures a rendering error in the agent manifests does not block deployment of the CRUD service.
# Kustomization for CRUD namespace
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: holiday-peak-crud
  namespace: flux-system
spec:
  interval: 5m
  path: .kubernetes/rendered/crud-service
  prune: true
  sourceRef:
    kind: GitRepository
    name: holiday-peak-gitops
  targetNamespace: holiday-peak-crud
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: crud-service-crud-service
      namespace: holiday-peak-crud
---
# Kustomization for Agents namespace
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: holiday-peak-agents
  namespace: flux-system
spec:
  interval: 5m
  # NOTE: this path also contains crud-service/ (the path used by the Kustomization
  # above). Unless a kustomization.yaml here lists only the agent directories,
  # CRUD manifests would also be reconciled into holiday-peak-agents.
  path: .kubernetes/rendered
  prune: true
  sourceRef:
    kind: GitRepository
    name: holiday-peak-gitops
  targetNamespace: holiday-peak-agents
  dependsOn:
    - name: holiday-peak-crud
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: search-enrichment-agent-search-enrichment-agent
      namespace: holiday-peak-agents

The dependsOn ensures CRUD is healthy before agents reconcile, matching the existing deployment ordering requirement (ADR-008: "Deploy CRUD first").
The Helm chart (/.kubernetes/chart/values.yaml) gains a namespace field per service, and render-helm.sh passes the correct namespace during rendering:
# render-helm.sh (sketch of the update; SERVICE_NAME, RELEASE_NAME, and
# VALUES_FILE are assumed to be set by the surrounding script)
if [ "$SERVICE_NAME" = "crud-service" ]; then
  NAMESPACE="holiday-peak-crud"
else
  NAMESPACE="holiday-peak-agents"
fi
# Make sure the per-service output directory exists before redirecting into it.
mkdir -p ".kubernetes/rendered/$SERVICE_NAME"
helm template "$RELEASE_NAME" .kubernetes/chart/ \
  --namespace "$NAMESPACE" \
  --values "$VALUES_FILE" \
  > ".kubernetes/rendered/$SERVICE_NAME/all.yaml"

All 26 agent services must be registered as MCP Servers in APIM. The APIM backend URLs will target AGC hostnames that route to holiday-peak-agents namespace services. The CRUD service backend URL targets holiday-peak-crud. This is a configuration extension of ADR-021, not a new pattern.
The AGC Gateway resource must allow route attachment from both namespaces:
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: holiday-peak-agc
  namespace: holiday-peak-crud # or a shared infra namespace
spec:
  gatewayClassName: azure-alb-external
  listeners:
    - name: http
      port: 80
      protocol: HTTP
      allowedRoutes:
        namespaces:
          from: Selector
          selector:
            matchLabels:
              holiday-peak/ingress-allowed: "true"

Both holiday-peak-crud and holiday-peak-agents namespaces receive the label holiday-peak/ingress-allowed: "true".
- Create both namespaces with labels and annotations:
  kubectl create namespace holiday-peak-crud
  kubectl label namespace holiday-peak-crud \
    holiday-peak/ingress-allowed=true \
    istio-injection=enabled
  kubectl create namespace holiday-peak-agents
  kubectl label namespace holiday-peak-agents \
    holiday-peak/ingress-allowed=true \
    istio-injection=enabled
- Deploy Istio PeerAuthentication in PERMISSIVE mode to both new namespaces.
- Deploy NetworkPolicies to both new namespaces.
- Create Flux Kustomizations for both namespaces (suspended).
- Scale down crud-service in holiday-peak to 0 replicas.
- Unsuspend Flux Kustomization for holiday-peak-crud; Flux deploys CRUD to the new namespace.
- Validate CRUD health endpoint via AGC and APIM.
- Update AGC HTTPRoute to target holiday-peak-crud namespace for CRUD paths.
- Smoke-test CRUD through APIM: GET /api/health, GET /api/products.
- Update CRUD_SERVICE_URL in all agent Helm values to the new cross-namespace FQDN.
- Unsuspend Flux Kustomization for holiday-peak-agents.
- Migrate agents in domain batches (5 batches of ~5 services), validating each batch:
  - Batch 1: truth-* services (4 services); validates cross-namespace CRUD connectivity
  - Batch 2: ecommerce-* services (5 services)
  - Batch 3: crm-* services (4 services)
  - Batch 4: inventory-* + logistics-* services (8 services)
  - Batch 5: product-management-* + search-enrichment-agent (5 services)
- After each batch: verify MCP tool discovery via APIM, check agent→CRUD latency, confirm Istio telemetry.
- Update AGC HTTPRoutes for agent paths to target holiday-peak-agents.
- Switch Istio PeerAuthentication from PERMISSIVE to STRICT in both namespaces.
- Delete all resources from old holiday-peak namespace.
- Delete old single-namespace Flux Kustomization.
- Remove old namespace: kubectl delete namespace holiday-peak.
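The domain batches used in the agent migration step can be expressed as a name-to-batch lookup, which is handy when scripting validation per batch. This is a minimal sketch: the function name `batch_for` is hypothetical, and the patterns simply mirror the batch list above.

```shell
#!/bin/sh
# batch_for <service>: print the migration batch number for an agent service,
# following the domain batches listed above. Prints nothing and returns 1 for
# names that match no batch (for example crud-service, which moves earlier).
batch_for() {
  case "$1" in
    truth-*) echo 1 ;;
    ecommerce-*) echo 2 ;;
    crm-*) echo 3 ;;
    inventory-*|logistics-*) echo 4 ;;
    product-management-*|search-enrichment-agent) echo 5 ;;
    *) return 1 ;;
  esac
}

for svc in truth-ingestion ecommerce-catalog-search crm-support-assistance \
           logistics-eta-computation search-enrichment-agent; do
  echo "$svc -> batch $(batch_for "$svc")"
done
```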
At any phase, rollback is:
- Suspend Flux Kustomizations for new namespaces.
- Scale up services in original holiday-peak namespace (Flux reconciles old Kustomization).
- Revert AGC HTTPRoutes to original namespace.
- Revert CRUD_SERVICE_URL values in agent Helm values.
The old namespace and its resources remain untouched until Phase 3 explicit deletion, making rollback a configuration revert.
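The Flux side of the rollback can be sketched as a dry-run script that prints the commands instead of running them. The name of the original single-namespace Kustomization is not given in this ADR, so `OLD_KUSTOMIZATION` below is a placeholder, and the `run` and `rollback` helpers are hypothetical. The HTTPRoute and CRUD_SERVICE_URL reverts are assumed to be Git changes and are left out.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed. Set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"
# Assumption: placeholder name for the original single-namespace Kustomization.
OLD_KUSTOMIZATION="${OLD_KUSTOMIZATION:-holiday-peak}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

rollback() {
  # Stop reconciling the new namespaces.
  run flux suspend kustomization holiday-peak-crud --namespace flux-system
  run flux suspend kustomization holiday-peak-agents --namespace flux-system
  # Ask Flux to reconcile the original Kustomization so workloads scale back up.
  # If it was suspended during migration, it would need to be resumed first.
  run flux reconcile kustomization "$OLD_KUSTOMIZATION" --namespace flux-system
}

rollback
```

Printing the commands first gives operators a chance to review the exact Kustomization names before anything touches the cluster.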
- Blast radius reduction — CRUD and agent workloads have independent resource quotas, RBAC bindings, and failure domains. A runaway agent cannot starve the transactional system.
- Deployment independence — Separate Flux Kustomizations allow CRUD and agents to deploy, fail, and roll back independently (ADR-017 extension).
- Security posture — Namespace-scoped Secrets, ConfigMaps, and ServiceAccounts are no longer shared. NetworkPolicies and Istio AuthorizationPolicies enforce explicit allow-lists.
- Observability clarity — Per-namespace metrics, logs, and traces reduce noise and enable namespace-scoped alerting and dashboards.
- Istio policy precision — mTLS mode and authorization can evolve independently per namespace (e.g., STRICT for CRUD, PERMISSIVE for agents during rollout).
- Future extensibility — If a third workload class emerges (e.g., batch processing), adding a namespace follows this same pattern without redesign.
- Operational complexity: Two Flux Kustomizations, two sets of NetworkPolicies, two PeerAuthentication resources, and namespace-aware render-helm.sh add configuration surface area.
- Cross-namespace DNS coupling: Agent services depend on a specific DNS FQDN for CRUD. If the CRUD service name or namespace changes, all 26 agent values files must update. Mitigated by the FQDN being a single Helm value.
- AGC multi-namespace routing: HTTPRoute resources must reference services in their local namespace or use cross-namespace ReferenceGrant. This adds Gateway API administrative overhead.
- Migration risk: A ~15-minute maintenance window is required for CRUD cutover. Mitigated by defined rollback plan and phased approach.
- External PaaS services unaffected — Redis, Cosmos DB, Blob Storage, Azure AI Foundry are namespace-agnostic; no changes required.
- Memory namespace contract (ADR-007) unaffected: The <service>:<tenantId>:<sessionId> key contract operates at application level, not Kubernetes namespace level.
- MCP communication policy (ADR-024) unaffected: Agent-to-agent MCP routing through APIM is namespace-transparent.
- Node pool assignment unchanged — Taints and tolerations already segregate compute (ADR-008). Namespace split aligns the logical boundary with the existing physical boundary.
Retain holiday-peak and rely solely on Istio AuthorizationPolicies and RBAC for isolation.
Rejected: Per-service policy proliferation (27 AuthorizationPolicies, 27 NetworkPolicies) is harder to maintain and audit than namespace-scoped defaults. Does not address deployment independence or resource quota isolation.
Split into holiday-peak-crud, holiday-peak-crm, holiday-peak-ecommerce, holiday-peak-inventory, holiday-peak-logistics, holiday-peak-product-mgmt, holiday-peak-truth.
Rejected: Multiplies Flux Kustomizations, NetworkPolicies, Istio policies, and AGC routing by 7x. All agent services share the same node pool, scaling characteristics, and MCP communication mesh. The incremental isolation does not justify the operational cost at current scale (26 agent services).
Use APIM for all service-to-service communication including agent→CRUD.
Rejected: Adds 15-30 ms latency per agent→CRUD call on a hot path. Introduces APIM as a single point of failure for intra-cluster data reads. Consumes APIM request units for internal traffic. Violates the architectural principle that APIM is the public facade (ADR-021), not an internal service bus.