Comprehensive deployment guide for Holiday Peak Hub infrastructure.
| Tool | Version | Purpose |
|---|---|---|
| Azure CLI | ≥ 2.67 | Azure resource management |
| Azure CLI `alb` extension | latest | AGC (Application Gateway for Containers) management; install with `az extension add --name alb` |
| azd | ≥ 1.10 | Provisioning + deployment |
| Bicep CLI | ≥ 0.30 | Infrastructure-as-code (bundled with Azure CLI) |
| Docker | ≥ 24 | Container image builds |
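
The minimum versions in the table above can be checked before provisioning. The following is a minimal sketch, not part of the repository: `version_ge` and `check_tool` are hypothetical helper names, and the comparison assumes `sort -V` (GNU coreutils version sort) is available.

```bash
# version_ge ACTUAL MINIMUM: succeeds when ACTUAL >= MINIMUM.
# Assumption: `sort -V` (version sort) is available on this machine.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# check_tool NAME ACTUAL MINIMUM: prints a one-line verdict, non-zero status when too old.
check_tool() {
  if version_ge "$2" "$3"; then
    echo "ok: $1 $2 (>= $3)"
  else
    echo "too old: $1 $2 (need >= $3)"
    return 1
  fi
}
```

For example: `check_tool "Azure CLI" "$(az version --query '"azure-cli"' -o tsv)" 2.67` or `check_tool Docker "$(docker version --format '{{.Client.Version}}')" 24`.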
- Permissions: Owner or Contributor + User Access Administrator on the subscription
- Resource Providers: Ensure the following providers are registered:
  - `Microsoft.ContainerService` (AKS)
  - `Microsoft.DocumentDB` (Cosmos DB)
  - `Microsoft.EventHub` (Event Hubs)
  - `Microsoft.CognitiveServices` (AI Foundry)
  - `Microsoft.ContainerRegistry` (ACR)
  - `Microsoft.Web` (Static Web Apps)

Register missing providers:
az provider register --namespace Microsoft.ContainerService
az provider register --namespace Microsoft.EventHub
az provider register --namespace Microsoft.ApiManagement
az provider register --namespace Microsoft.CognitiveServices
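
Registration can be made idempotent by checking the current state first. This is a sketch under assumptions: `register_missing_providers` is a hypothetical helper, and the namespace list simply mirrors the register commands above.

```bash
# Provider namespaces taken from the register commands in this guide.
PROVIDERS="Microsoft.ContainerService Microsoft.EventHub Microsoft.ApiManagement Microsoft.CognitiveServices"

# register_missing_providers: registers every namespace whose state is not "Registered".
register_missing_providers() {
  for ns in $PROVIDERS; do
    state="$(az provider show --namespace "$ns" --query registrationState -o tsv)"
    if [ "$state" = "Registered" ]; then
      echo "$ns: already registered"
    else
      echo "$ns: $state, registering"
      az provider register --namespace "$ns"
    fi
  done
}
```

Call `register_missing_providers` after `az login`. Registration is asynchronous, so a namespace can report `Registering` for a few minutes before it becomes `Registered`.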
### Quota Requirements (Dev)
| Resource | Required | SKU |
|----------|----------|-----|
| vCPUs (Ddsv5) | 32 | Standard_D8ds_v5 × 4 nodes |
| Cosmos DB (Serverless) | 1 account | — |
| APIM | 1 instance | Consumption |
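
The vCPU figure follows from the node size: Standard_D8ds_v5 has 8 vCPUs, so 4 nodes need 8 × 4 = 32. A small sketch of that arithmetic and a headroom check follows; `required_vcpus` and `quota_sufficient` are hypothetical helper names, and the limit and usage numbers must be read from your own subscription.

```bash
# required_vcpus VCPUS_PER_NODE NODE_COUNT: total vCPUs the nodes need.
required_vcpus() {
  echo $(( $1 * $2 ))
}

# quota_sufficient LIMIT CURRENT_USAGE REQUIRED: succeeds when the remaining quota covers REQUIRED.
quota_sufficient() {
  [ $(( $1 - $2 )) -ge "$3" ]
}
```

Current usage and limits per VM family can be listed with `az vm list-usage --location <location> -o table`; compare the Ddsv5 family row against `required_vcpus 8 4`.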
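```bash
# Sign in and select the target subscription
az login
az account set --subscription <SUBSCRIPTION_ID>

# Fetch AKS credentials, regenerate the CRUD service environment, then deploy services
az aks get-credentials \
  --resource-group holidaypeakhub405-<environment>-rg \
  --name holidaypeakhub405-<environment>-aks
FORCE=true ./.infra/azd/hooks/generate-crud-env.sh
azd deploy --service crud-service -e <environment>
azd deploy --all -e <environment>
```

Deployment order: Shared Infrastructure → CRUD Service + Agent Services → AGC Readiness Gate → APIM Sync + APIM Smoke Gate → Static Web App
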
Release gate notes:
- UI deployment is blocked unless backend deployment jobs are `success` or `skipped`.
- AGC readiness is validated before APIM sync by checking the configured GatewayClass, the shared `ApplicationLoadBalancer` contract, Azure traffic-controller health, shared-Gateway `Accepted` and `Programmed` state with an assigned status address, shared-Gateway route attachment, and direct CRUD plus changed-agent health reachability on the approved AGC frontend hostname.
- APIM gateway URL is propagated from `azd` outputs and checked against live APIM to catch config drift.
- APIM smoke checks validate direct AGC CRUD health, APIM `GET /api/health`, `GET /api/products?limit=1`, `GET /api/categories`, CRUD CORS preflight, negative CRUD path behavior, and changed agent `GET /agents/<service>/health` before UI publish.
- APIM sync consumes the approved AGC hostname contract from azd outputs and fails closed if the backend drifts to IP-based or cluster-local targets.
- UI deployment runs pre/post smoke checks to ensure API health and SWA hostname reachability.
- Foundry surface registration is explicit and separate from AKS rollout. Manual dev deployments can emit a dry-run plan by default; `foundrySurfaceMode=apply` creates or updates Foundry Hosted Agent versions from tested image digests and validates Custom Agent proxy metadata without replacing APIM -> AGC -> AKS product traffic.
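
Two of the gates above reduce to simple predicates: the UI gate on backend job results, and the fail-closed check on the APIM backend host. The sketch below illustrates that logic only; it is not the workflow's actual implementation, and `ui_gate_open` and `backend_host_approved` are hypothetical names.

```bash
# ui_gate_open RESULT...: succeeds only if every backend job result is "success" or "skipped".
ui_gate_open() {
  for result in "$@"; do
    case "$result" in
      success|skipped) ;;
      *) return 1 ;;
    esac
  done
  return 0
}

# backend_host_approved HOST: fails closed for IP literals and cluster-local names.
backend_host_approved() {
  case "$1" in
    "") return 1 ;;            # no host configured
    *[!0-9.]*) ;;              # contains a non-digit, non-dot character: not an IPv4 literal
    *) return 1 ;;             # digits and dots only: treat as an IPv4 literal
  esac
  case "$1" in
    *:*) return 1 ;;                        # IPv6 literal or host:port
    *.cluster.local|*.svc) return 1 ;;      # Kubernetes in-cluster DNS names
  esac
  return 0
}
```

For instance, `ui_gate_open success skipped` succeeds while `ui_gate_open success failure` does not, and `backend_host_approved 10.0.0.4` is rejected.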

azd env set deployShared true -e <environment>
azd env set deployStatic true -e <environment>
azd env set location <location> -e <environment>
azd up -e <environment>

The shared infrastructure module creates all platform resources using AVM.
cd .infra/modules/shared-infrastructure
az deployment sub create \
--name shared-infra-<environment> \
--location <location> \
--template-file shared-infrastructure-main.bicep \
--parameters environment=<environment> location=<location>

Or with azd:

azd env set deployShared true -e <environment>
azd env set environment <environment> -e <environment>
azd env set location <location> -e <environment>
azd provision -e <environment>

Resources created (all AVM, ~25 minutes):

| # | Resource | Module | Purpose |
|---|---|---|---|
| 1 | VNet (5 subnets) | `avm/res/network/virtual-network:0.7.2` | Network isolation |
| 2 | 5 NSGs | `avm/res/network/network-security-group:0.5.2` | Subnet security |
| 3 | 8 Private DNS Zones | `avm/res/network/private-dns-zone:0.8.0` | PE DNS resolution |
| 4 | Log Analytics | `avm/res/operational-insights/workspace:0.15.0` | Centralized logs |
| 5 | App Insights | `avm/res/insights/component:0.7.1` | APM telemetry |
| 6 | ACR (Premium) | `avm/res/container-registry/registry:0.9.3` | Container images |
| 7 | PostgreSQL Flexible Server | `avm/res/db-for-postgre-sql/flexible-server:0.15.0` | CRUD transactional data |
| 8 | Cosmos DB (agent warm memory) | `avm/res/document-db/database-account:0.18.0` | Agent session/history memory |
| 9 | Redis (Premium P1) | `avm/res/cache/redis:0.16.4` | Hot-tier memory |
| 10 | Storage Account | `avm/res/storage/storage-account:0.31.0` | Cold-tier memory |
| 11 | Event Hubs (5 topics) | `avm/res/event-hub/namespace:0.14.0` | Async events |
| 12 | Key Vault (Premium) | `avm/res/key-vault/vault:0.13.3` | Secrets |
| 13 | AI Foundry Project | `avm/ptn/ai-ml/ai-foundry:0.6.0` | AI/ML models |
| 14 | AKS (3 pools) | `avm/res/container-service/managed-cluster:0.12.0` | Compute |
| 15 | APIM | `avm/res/api-management/service:0.14.0` | API gateway |
| 16 | AGC controller identity | Native | ALB controller workload identity |
| — | 6-9 RBAC assignments | Native | Identity permissions |

If `agcSupportEnabled` is enabled for the environment, the shared stack also provisions:
- a delegated `agc` subnet sized for AGC association capacity,
- workload identity federation for `azure-alb-system/alb-controller-sa`,
- deterministic RBAC for AGC controller access to the AKS node resource group and delegated subnet (`Reader`, `AppGw for Containers Configuration Manager`, and `Network Contributor`).

During `azd provision`, the postprovision hook validates those AGC controller RBAC prerequisites, installs the ALB controller Helm chart, and verifies that GatewayClass `azure-alb-external` is present. No `ApplicationLoadBalancer` or workload route is created in this step.
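
The GatewayClass check can be repeated by hand after provisioning. This is a minimal sketch of such a check, not the hook's actual code; `gatewayclass_present` and `verify_agc_controller` are hypothetical names, and it assumes `kubectl` already points at the AKS cluster.

```bash
# gatewayclass_present NAME: succeeds when the named GatewayClass exists in the current cluster.
gatewayclass_present() {
  kubectl get gatewayclass "$1" -o name >/dev/null 2>&1
}

# verify_agc_controller: checks the GatewayClass named in this guide.
verify_agc_controller() {
  if gatewayclass_present azure-alb-external; then
    echo "GatewayClass azure-alb-external is present"
  else
    echo "GatewayClass azure-alb-external is missing" >&2
    return 1
  fi
}
```

If the check fails, `kubectl get pods -n azure-alb-system` shows whether the ALB controller pods are running.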

Note: The CI/CD pipeline's output recovery step uses `az network alb frontend list` to query AGC frontend hostnames. This requires the Azure CLI `alb` extension (`az extension add --name alb --only-show-errors`). The extension is GA and requires Azure CLI ≥ 2.67.

Private endpoints are created for: ACR, Cosmos DB, Redis, Storage, Event Hubs, Key Vault, AI Services.
cd .infra/modules/static-web-app
az deployment sub create \
--name static-web-app-<environment> \
--location <location> \
--template-file static-web-app-main.bicep \
--parameters environment=<environment> \
projectName=holidaypeakhub405 \
resourceGroupName=holidaypeakhub405-<environment>-rg

Resources created (~5 minutes):
- Azure Static Web Apps (Free tier for non-prod, Standard for prod)
- Custom domain (prod only)
Publish the UI with .github/workflows/deploy-ui-swa.yml. The canonical path is workflow-driven and validates the exact Static Web App identity before publishing.
az aks get-credentials \
--resource-group holidaypeakhub405-<environment>-rg \
--name holidaypeakhub405-<environment>-aks
kubectl get nodes

# CRUD service first
azd deploy --service crud-service -e <environment>
# All agent services
azd deploy --all -e <environment>

AKS deployment alone does not create visible Foundry Hosted Agent versions. To expose the public/human-facing agents in the Foundry portal, run the `deploy-azd-dev (entrypoint)` workflow with:
- `deployFoundrySurfaces=true`
- `foundrySurfaceMode=plan` to generate the `foundry-surface-plan-<environment>` artifact
- `foundrySurfaceMode=apply` to create or update Hosted Agent versions after reviewing the plan

The reusable workflow consumes the tested `repo@sha256:...` image references produced by `build-aks-images`, reads `apps/foundry-surfaces.yaml`, and calls `scripts/ops/register_foundry_surfaces.py`. Apply mode uses OIDC-backed Azure identity and `AIProjectClient(..., allow_preview=True)`. It does not create a second AKS runtime, does not call the retired `/assistants` surface, and does not turn Custom Agent proxy metadata into Foundry-managed compute.

Cost and networking notes:
- Hosted Agent apply mode can incur Foundry active-session CPU/memory charges.
- Foundry Hosted Agents require the ACR public endpoint to be intentionally reachable by the service. The workflow fails apply mode if the baseline ACR policy is private-only or `defaultAction=Deny`.
- Custom Agent entries remain metadata/proxy validated until Microsoft Foundry exposes a supported Custom Agent create API for APIM-backed proxy surfaces.
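
The ACR condition can be checked before choosing apply mode. The following is a sketch of that rule, not the workflow's own check; the function names are hypothetical, and it assumes "private-only" corresponds to `publicNetworkAccess` being anything other than `Enabled`.

```bash
# acr_reachable_for_hosted_agents PUBLIC_NETWORK_ACCESS DEFAULT_ACTION
# Succeeds only when the public endpoint is enabled and the default action is not Deny.
acr_reachable_for_hosted_agents() {
  [ "$1" = "Enabled" ] && [ "$2" != "Deny" ]
}

# check_acr_policy ACR_NAME: reads the live registry settings and applies the rule above.
check_acr_policy() {
  access="$(az acr show --name "$1" --query publicNetworkAccess -o tsv)"
  action="$(az acr show --name "$1" --query networkRuleSet.defaultAction -o tsv)"
  if acr_reachable_for_hosted_agents "$access" "$action"; then
    echo "ACR $1: public endpoint reachable (access=$access, defaultAction=$action)"
  else
    echo "ACR $1: private-only or default-deny (access=$access, defaultAction=$action)" >&2
    return 1
  fi
}
```

Run `check_acr_policy <acr-name>` with the registry name from your environment; a non-zero status means apply mode would be expected to fail.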
After deployment, retrieve outputs:
az deployment sub show \
--name shared-infra-<environment> \
--query properties.outputs -o json

Key outputs:
- `aksClusterName`: AKS cluster name
- `acrLoginServer`: ACR login server URL
- `cosmosEndpoint`: Cosmos DB endpoint
- `keyVaultUri`: Key Vault URI
- `apimGatewayUrl`: APIM gateway URL
- `appInsightsConnectionString`: App Insights connection string
- `aiProjectName`: AI Foundry project name
- `agcSubnetId`: Delegated subnet reserved for AGC associations
- `agcControllerDeploymentMode`: Controller installation mode for downstream automation
- `agcGatewayClass`: GatewayClass to target in later Kubernetes routing issues
- `agcFrontendReference`: Equivalent frontend reference exported before any real AGC frontend exists

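
Individual outputs can be pulled into shell variables for later steps. This is a sketch under assumptions: `get_output` and `require_outputs` are hypothetical helpers, and each output is assumed to be a plain string stored under `properties.outputs.<key>.value`.

```bash
# get_output DEPLOYMENT_NAME OUTPUT_KEY: prints the value of one subscription-level deployment output.
get_output() {
  az deployment sub show --name "$1" --query "properties.outputs.$2.value" -o tsv
}

# require_outputs DEPLOYMENT_NAME KEY...: prints KEY=value pairs, non-zero status if any is empty.
require_outputs() {
  deployment="$1"; shift
  missing=0
  for key in "$@"; do
    value="$(get_output "$deployment" "$key")"
    if [ -z "$value" ]; then
      echo "missing output: $key" >&2
      missing=1
    else
      echo "$key=$value"
    fi
  done
  return "$missing"
}
```

For example: `AKS_NAME="$(get_output "shared-infra-<environment>" aksClusterName)"`, or `require_outputs "shared-infra-<environment>" aksClusterName acrLoginServer apimGatewayUrl` as a quick completeness check.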
Each agent service deploys its own isolated set of resources. Suitable for single-service demos.
Note: This creates redundant resources per service and is significantly more expensive.
# Generate Bicep from template (if not already generated)
python cli.py generate-bicep --service ecommerce-catalog-search
cd .infra/modules/ecommerce-catalog-search
az deployment sub create \
--name ecommerce-catalog-search-demo \
--location eastus2 \
--template-file ecommerce-catalog-search-main.bicep \
--parameters appName=ecommerce-catalog-search \
appImage=ghcr.io/azure-samples/ecommerce-catalog-search:latest

Resources created per service (~15 minutes):

| Resource | SKU | Purpose |
|---|---|---|
| Cosmos DB | Standard | Chat memory (warm tier) |
| Redis | Standard C0 | Chat memory (hot tier) |
| Storage Account | Standard LRS | Chat memory (cold tier) |
| AI Search | Standard | Retrieval index |
| Azure OpenAI | S0 | GPT-4.1, GPT-4.1-mini, GPT-4.1-nano |
| AKS | Standard_B4ms (1 node) | Compute |
python cli.py generate-bicep --apply-all
python cli.py deploy-all --location eastus2

| # | Service | Domain | Type |
|---|---|---|---|
| 1 | crud-service | Core | REST API (CRUD operations) |
| 2 | crm-campaign-intelligence | CRM | Agent |
| 3 | crm-profile-aggregation | CRM | Agent |
| 4 | crm-segmentation-personalization | CRM | Agent |
| 5 | crm-support-assistance | CRM | Agent |
| 6 | ecommerce-cart-intelligence | eCommerce | Agent |
| 7 | ecommerce-catalog-search | eCommerce | Agent |
| 8 | ecommerce-checkout-support | eCommerce | Agent |
| 9 | ecommerce-order-status | eCommerce | Agent |
| 10 | ecommerce-product-detail-enrichment | eCommerce | Agent |
| 11 | inventory-alerts-triggers | Inventory | Agent |
| 12 | inventory-health-check | Inventory | Agent |
| 13 | inventory-jit-replenishment | Inventory | Agent |
| 14 | inventory-reservation-validation | Inventory | Agent |
| 15 | logistics-carrier-selection | Logistics | Agent |
| 16 | logistics-eta-computation | Logistics | Agent |
| 17 | logistics-returns-support | Logistics | Agent |
| 18 | logistics-route-issue-detection | Logistics | Agent |
| 19 | product-management-acp-transformation | Product Mgmt | Agent |
| 20 | product-management-assortment-optimization | Product Mgmt | Agent |
| 21 | product-management-consistency-validation | Product Mgmt | Agent |
| 22 | product-management-normalization-classification | Product Mgmt | Agent |
| 23 | search-enrichment-agent | Search | Agent |
| 24 | truth-enrichment | Truth Layer | Agent |
| 25 | truth-export | Truth Layer | Agent |
| 26 | truth-hitl | Truth Layer | Agent |
| 27 | truth-ingestion | Truth Layer | Agent |
| 28 | ui | Frontend | Next.js 15 (Azure Static Web Apps) |
| Parameter | Required | Default | Description |
|---|---|---|---|
| `environment` | Yes | `dev` | Target environment: `dev`, `staging`, `prod` |
| `location` | Yes | `eastus2` | Azure region |
| `projectName` | No | `holidaypeakhub` | Resource naming prefix |
| `keyVaultNameOverride` | No | `""` | Custom Key Vault name (3-24 chars) |
| `aksKubernetesVersion` | No | `""` | Specific K8s version (empty = Azure default) |

Resource groups follow the pattern `{projectName}-{environment}-rg`.
Examples: `holidaypeakhub405-dev-rg`, `holidaypeakhub405-prod-rg`

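
The parameter constraints and the naming pattern can be checked locally before a deployment is submitted. The helpers below are a hypothetical sketch, not part of the repository; they encode only what this guide states: three environment names, a 3-24 character Key Vault override, and the resource group pattern.

```bash
# valid_environment NAME: succeeds for the three documented environments.
valid_environment() {
  case "$1" in
    dev|staging|prod) return 0 ;;
    *) return 1 ;;
  esac
}

# valid_key_vault_override NAME: empty means "no override"; otherwise require 3-24 characters.
valid_key_vault_override() {
  [ -z "$1" ] && return 0
  len=${#1}
  [ "$len" -ge 3 ] && [ "$len" -le 24 ]
}

# resource_group_name PROJECT_NAME ENVIRONMENT: applies the {projectName}-{environment}-rg pattern.
resource_group_name() {
  echo "$1-$2-rg"
}
```

For example, `resource_group_name holidaypeakhub405 dev` prints `holidaypeakhub405-dev-rg`.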
azd env set deployShared true -e <environment>
azd env set deployStatic true -e <environment>
azd env set environment <environment> -e <environment>
azd env set location <location> -e <environment>
azd env set K8S_NAMESPACE holiday-peak -e <environment>
azd env set IMAGE_PREFIX ghcr.io/azure-samples -e <environment>
azd env set IMAGE_TAG latest -e <environment>

# AKS connectivity
kubectl get nodes
kubectl get namespaces
# Cosmos DB (Managed Identity token)
kubectl run test-cosmos --image=mcr.microsoft.com/azure-cli --restart=Never --rm -it \
--command -- bash -c "curl -sH 'Metadata:true' \
'http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https://cosmos.azure.com' | python3 -m json.tool"
# Private endpoint DNS resolution
kubectl run test-dns --image=busybox --restart=Never --rm -it \
--command -- nslookup holidaypeakhub405-<environment>-cosmos.documents.azure.com

az staticwebapp show \
--name holidaypeakhub405-ui-<environment> \
--resource-group holidaypeakhub405-<environment>-rg \
--query defaultHostname -o tsv

| Issue | Cause | Resolution |
|---|---|---|
| AKS timeout (> 30 min) | Region capacity | Retry; try eastus2 |
| `QuotaExceeded` | Insufficient vCPU quota | Request quota increase for Ddsv5 family |
| `NameNotAvailable` | Resource name conflict | Use `keyVaultNameOverride` parameter |
| Key Vault soft-delete name conflict (non-prod) | Previously deleted vault reserves name | deploy-azd.yml provision preflight auto-purges matching soft-deleted vault before azd provision |
| PostgreSQL Flexible Server is stopped (non-prod) | Cost-saving stop leaves control plane drift | deploy-azd.yml provision preflight starts the existing server and waits for Ready before azd provision |
| RBAC not propagated | Azure AD timing | Wait 5-10 minutes, retry |
| PE DNS resolution fails | DNS zone not linked | Verify Private DNS Zone VNet links |
| ACR pull fails | Missing role | Verify AcrPull role on kubelet identity |
| Bicep validation error | AVM version mismatch | Run az bicep upgrade |
| Cosmos Serverless limit | 1 GB partition limit | For prod, use provisioned throughput |
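
Several resolutions in the table amount to "wait and retry" (AKS timeouts, RBAC propagation). A generic retry wrapper is one way to script that; this is a sketch, and `retry` is a hypothetical helper rather than something the repository ships.

```bash
# retry ATTEMPTS DELAY_SECONDS COMMAND...: reruns COMMAND until it succeeds or attempts run out.
retry() {
  attempts="$1"; delay="$2"; shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $n attempts: $*" >&2
      return 1
    fi
    echo "attempt $n failed, retrying in ${delay}s" >&2
    sleep "$delay"
    n=$(( n + 1 ))
  done
}
```

For example, `retry 5 120 azd provision -e <environment>` spaces five attempts two minutes apart, which covers the 5-10 minute RBAC propagation window mentioned above.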

Preflight remediation guardrails:
- Applies only to non-production environments (`prod` and `production` are excluded).
- Performs only targeted reconciliation for known drift cases; infrastructure provisioning remains `azd provision` as source of truth.
- Never deletes active resources; only purges matching soft-deleted Key Vault tombstones.
- Fails fast if PostgreSQL cannot reach `Ready` to prevent unsafe partial deployments.

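
The environment exclusion and the PostgreSQL wait can be sketched as follows. This illustrates the guardrails only and is not the code in `deploy-azd.yml`; the function names, the default of 30 polls at 10-second intervals, and the server name argument are assumptions.

```bash
# is_nonprod ENVIRONMENT: succeeds unless the environment is prod or production.
is_nonprod() {
  case "$1" in
    prod|production) return 1 ;;
    *) return 0 ;;
  esac
}

# wait_for_postgres_ready RESOURCE_GROUP SERVER [ATTEMPTS] [DELAY_SECONDS]: polls the server state.
wait_for_postgres_ready() {
  tries="${3:-30}"; pause="${4:-10}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    state="$(az postgres flexible-server show --resource-group "$1" --name "$2" --query state -o tsv)"
    [ "$state" = "Ready" ] && return 0
    i=$(( i + 1 ))
    sleep "$pause"
  done
  echo "PostgreSQL server $2 did not reach Ready (last state: $state)" >&2
  return 1
}

# preflight ENVIRONMENT RESOURCE_GROUP SERVER: skips production; starts a stopped server and waits.
preflight() {
  if ! is_nonprod "$1"; then
    echo "skipping preflight for $1"
    return 0
  fi
  current="$(az postgres flexible-server show --resource-group "$2" --name "$3" --query state -o tsv)"
  if [ "$current" = "Stopped" ]; then
    az postgres flexible-server start --resource-group "$2" --name "$3"
  fi
  wait_for_postgres_ready "$2" "$3"
}
```

A non-zero status from `preflight` corresponds to the fail-fast behavior: provisioning should not continue while the server is not `Ready`.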
# Check deployment status
az deployment sub show --name shared-infra-<environment> --query properties.provisioningState
# View deployment operations (find failures)
az deployment sub operation list --name shared-infra-<environment> --query "[?properties.provisioningState=='Failed']"
# AKS diagnostics
kubectl describe nodes
kubectl get events --sort-by='.lastTimestamp' -A
# Check RBAC assignments
az role assignment list --scope /subscriptions/<SUB_ID>/resourceGroups/holidaypeakhub405-<environment>-rg -o table

# Delete entire resource group
az group delete --name holidaypeakhub405-<environment>-rg --yes --no-wait
# Or delete only the subscription-level deployment record (deployed resources are not removed)
az deployment sub delete --name shared-infra-<environment>

Current workflow gate behavior:
- `.github/workflows/deploy-azd.yml` enforces backend and APIM readiness before `deploy-ui`.
- `.github/workflows/deploy-azd.yml` fails on Foundry readiness check failures instead of warning-only.
- `.github/workflows/deploy-ui-swa.yml` resolves APIM gateway URL from Azure and rejects mismatched manual `apiUrl` overrides.
- `.github/workflows/deploy-ui-swa.yml` includes pre/post deployment smoke checks (`/api/health`, `/api/products?limit=1`, `/api/categories`, and SWA home page).

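
The smoke checks can be approximated locally with `curl`. This is a rough sketch rather than the workflow's implementation: `http_status` and `smoke_check` are hypothetical names, and treating only HTTP 200 as a pass is an assumption.

```bash
# http_status URL: prints the HTTP status code of a GET request (000 on connection failure).
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 15 "$1"
}

# smoke_check API_BASE_URL SWA_URL: checks the documented API paths and the SWA home page.
# Assumption: HTTP 200 is the only passing status.
smoke_check() {
  failed=0
  for url in "$1/api/health" "$1/api/products?limit=1" "$1/api/categories" "$2/"; do
    code="$(http_status "$url")"
    if [ "$code" = "200" ]; then
      echo "ok   $url"
    else
      echo "FAIL $url (HTTP $code)"
      failed=1
    fi
  done
  return "$failed"
}
```

Pass the APIM gateway URL (the `apimGatewayUrl` output) and the Static Web App URL, without trailing slashes, as the two arguments.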
# .github/workflows/deploy-infra.yml
name: Deploy Infrastructure
on:
  push:
    branches: [main]
    paths: ['.infra/**']
permissions:
  id-token: write   # required for OIDC sign-in with azure/login
  contents: read
jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: dev
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - run: |
          # The environment parameter matches the job's `environment: dev`
          az deployment sub create \
            --name shared-infra-${{ github.run_number }} \
            --location eastus2 \
            --template-file .infra/modules/shared-infrastructure/shared-infrastructure-main.bicep \
            --parameters environment=dev location=eastus2