Commit 96f4bca

fix: use one big vnet and attach AKS clusters to it to avoid creating bastion multiple times
1 parent ddbcdcc commit 96f4bca

8 files changed

Lines changed: 746 additions & 381 deletions

e2e/README.md

Lines changed: 111 additions & 19 deletions
@@ -20,38 +20,130 @@ From a high-level, for each scenario,
 To write an E2E scenario,

 - choose a testing cluster. There are a few defined
-in [cluster.go](https://github.com/Azure/AgentBaker/blob/dev/e2e/cluster.go), e.g,
-- ClusterKubenetAirgap
-- ClusterAzureNetwork
+in [cache.go](cache.go), e.g.,
 - ClusterKubenet
+- ClusterAzureNetwork
+- ClusterAzureOverlayNetwork
+- ClusterCiliumNetwork
 - use `NodeBootstrappingConfiguration` (`nbc`) to set up your scenario. It is used to invoke the primary
 node-bootstrapping
 API [GetLatestNodeBootstrapping](https://github.com/Azure/AgentBaker/blob/2e730b5a498c5be9b082d912fd08ac9346582db9/pkg/agent/bakerapi.go#L14).
 To modify agentpool properties, usually you need to set both `nbc.containerService.properties.AgentPoolProfiles[0].xxx`
 as well as `nbc.agentPoolProfile`. It is because when RP invokes AgentBaker, it will set the properties in this way
 and in e2e we follow the pattern.
 - use `VMConfigMutator` to set VMSS properties such as SKU when needed.
-Check [vmss](https://github.com/Azure/AgentBaker/blob/dev/e2e/vmss.go) for other configs.
+Check [vmss](vmss.go) for other configs.
 It is necessary to set `nbc.agentPoolProfile.VMSize` to match the VMSS SKU if you choose to change it.
 - use `Validator` to include your own verification of the VM's live state, such as file existence, sysctl settings, etc.

+## Infrastructure Architecture
+
+All E2E clusters share a single VNet and Azure Bastion in the `abe2e-{location}` resource group. This
+avoids creating a per-cluster Bastion (~10 min each) and ensures all clusters are reachable from a
+single SSH entry point.
+
+```mermaid
+graph TB
+subgraph RG["abe2e-{location} Resource Group"]
+subgraph VNET["abe2e-shared-vnet (10.0.0.0/8)"]
+BASTION_SUBNET["AzureBastionSubnet<br/>10.0.0.0/26"]
+FW_SUBNET["AzureFirewallSubnet<br/>10.0.1.0/24"]
+KUBENET_SUBNET["aks-subnet-abe2e-kubenet-v5<br/>10.1.0.0/20"]
+AZNET_SUBNET["aks-subnet-abe2e-azure-network-v4<br/>10.1.16.0/20"]
+OVERLAY_SUBNET["aks-subnet-abe2e-azure-overlay-...<br/>10.1.32.0/20"]
+MORE_SUBNETS["... more cluster subnets"]
+end
+BASTION["abe2e-shared-bastion<br/>(Standard SKU, Tunneling)"]
+FIREWALL["abe2e-fw<br/>(Azure Firewall)"]
+end
+
+subgraph MC_KUBENET["MC_abe2e-kubenet-v5 Resource Group"]
+VMSS_K["VMSS (system pool)"]
+VMSS_K_TEST["VMSS (test VMs)"]
+RT_K["Route Table<br/>(pod routes + firewall)"]
+end
+
+subgraph MC_AZNET["MC_abe2e-azure-network-v4 Resource Group"]
+VMSS_A["VMSS (system pool)"]
+VMSS_A_TEST["VMSS (test VMs)"]
+end
+
+BASTION --> BASTION_SUBNET
+FIREWALL --> FW_SUBNET
+VMSS_K --> KUBENET_SUBNET
+VMSS_K_TEST --> KUBENET_SUBNET
+RT_K -.->|associated| KUBENET_SUBNET
+VMSS_A --> AZNET_SUBNET
+VMSS_A_TEST --> AZNET_SUBNET
+
+DEV["Developer / CI"]
+DEV -->|SSH via tunnel| BASTION
+BASTION -->|"connects to any VM<br/>in shared VNet"| VMSS_K_TEST
+BASTION -->|"connects to any VM<br/>in shared VNet"| VMSS_A_TEST
+```
+
+### Shared Infrastructure Setup
+
+The shared infrastructure is created **automatically** on first test run via cached idempotent
+functions — no separate setup script is needed.
+
+| Resource | Name | Details |
+|----------|------|---------|
+| VNet | `abe2e-shared-vnet` | `10.0.0.0/8` — supports ~4096 `/20` cluster subnets |
+| Bastion | `abe2e-shared-bastion` | Standard SKU with tunneling enabled for native SSH |
+| Bastion Subnet | `AzureBastionSubnet` | `10.0.0.0/26` (required by Azure Bastion) |
+| Firewall Subnet | `AzureFirewallSubnet` | `10.0.1.0/24` (created by shared infra, firewall on-demand) |
+
+Each AKS cluster gets its own `/20` subnet (4091 usable IPs) in the shared VNet. The subnet is
+named `aks-subnet-{clusterName}`.
+
+### How It Works
+
+1. **`CachedEnsureSharedInfra`** — runs once per location per test run. Creates/verifies the shared
+VNet, Bastion, and Firewall subnet.
+2. **`CachedEnsureClusterSubnet`** — runs once per cluster. Creates/verifies the cluster's dedicated
+subnet in the shared VNet.
+3. Each cluster model sets `VnetSubnetID` on the agent pool profile (BYOV — Bring Your Own VNet).
+4. AKS creates VMSS and route tables in the `MC_` resource group, but uses the shared VNet's subnet.
+5. SSH to test VMs goes through the shared Bastion, which can reach any VM in the VNet.
+
+### Test Flow
+
 ```mermaid
 sequenceDiagram
-E2E->>+ARM: Get or Create AKS Cluster
-ARM-->>-E2E: Cluster details
-E2E->>+AgentBakerCode: Fetch VM Configuration (include CSE)
-AgentBakerCode-->>-E2E: VM Configuration
-E2E->>+ARM: Create VM using fetched VM Config in cluster network
-ARM-->>-E2E: VM instance
-E2E->>+Bastion: Create SSH Tunnel
-Bastion->>+VM: Forward SSH Connection
-E2E->>VM: Healthcheck via SSH Tunnel
-VM-->>E2E: Healthcheck OK
-E2E->>+KubeAPI: Verify Node Ready
-KubeAPI-->>-E2E: Node Ready
-E2E->>VM: Execute test validators via SSH Tunnel
-VM-->>-E2E: Test results
-Bastion-->>-E2E: Close SSH Tunnel
+participant CI as Developer / CI
+participant Infra as Shared Infra (cached)
+participant ARM as Azure Resource Manager
+participant AB as AgentBaker API
+participant Bastion as Shared Bastion
+participant VM as Test VM
+participant K8s as Kube API Server
+
+CI->>Infra: Ensure shared VNet + Bastion
+Infra-->>CI: Ready (cached after first run)
+
+CI->>Infra: Ensure cluster subnet
+Infra-->>CI: Subnet ID
+
+CI->>ARM: Create/Get AKS cluster (BYOV subnet)
+ARM-->>CI: Cluster details
+
+CI->>AB: Generate CSE + CustomData
+AB-->>CI: VM configuration
+
+CI->>ARM: Create VMSS in cluster subnet
+ARM-->>CI: VM instance
+
+CI->>Bastion: SSH tunnel to VM private IP
+Bastion->>VM: Forward SSH connection
+
+CI->>VM: Run health checks + validators
+VM-->>CI: Results
+
+CI->>K8s: Verify node ready
+K8s-->>CI: Node ready ✓
+
+Bastion-->>CI: Close tunnel
 ```

 ## Running Locally

e2e/aks_model.go

Lines changed: 30 additions & 101 deletions
@@ -173,6 +173,8 @@ func getBaseClusterModel(clusterName, location, k8sSystemPoolSKU string) *armcon
 			},
 			NetworkProfile: &armcontainerservice.NetworkProfile{
 				NetworkPlugin: to.Ptr(armcontainerservice.NetworkPluginKubenet),
+				ServiceCidr:   to.Ptr("172.16.0.0/16"),
+				DNSServiceIP:  to.Ptr("172.16.0.10"),
 			},
 			AddonProfiles: map[string]*armcontainerservice.ManagedClusterAddonProfile{
 				"omsagent": {
@@ -303,113 +305,34 @@ func getFirewall(ctx context.Context, location, firewallSubnetID, publicIPID str
 func addFirewallRules(
 	ctx context.Context, clusterModel *armcontainerservice.ManagedCluster,
 ) error {
-	location := *clusterModel.Location
 	defer toolkit.LogStepCtx(ctx, "adding firewall rules")()

-	rg := *clusterModel.Properties.NodeResourceGroup
-	vnet, err := getClusterVNet(ctx, rg)
+	nodeRG := *clusterModel.Properties.NodeResourceGroup
+	vnet, err := getClusterVNet(ctx, clusterModel)
 	if err != nil {
 		return err
 	}

+	// Get the shared firewall's private IP (firewall was created by ensureSharedInfra)
+	infra, err := CachedEnsureSharedInfra(ctx, *clusterModel.Location)
+	if err != nil {
+		return fmt.Errorf("getting shared infra for firewall IP: %w", err)
+	}
+	firewallPrivateIP := infra.FirewallIP
+
 	// For kubenet, the AKS-managed route table must stay attached so that pod
 	// routes (managed by cloud-provider-azure) and firewall routes coexist.
 	// For Azure CNI variants, the subnet may not have any route table, so we
 	// create and associate a dedicated one before adding the firewall routes.
-	aksSubnetResp, err := config.Azure.Subnet.Get(ctx, rg, vnet.name, "aks-subnet", nil)
+	aksSubnetResp, err := config.Azure.Subnet.Get(ctx, vnet.resourceGroup, vnet.name, vnet.subnetName, nil)
 	if err != nil {
 		return fmt.Errorf("failed to get AKS subnet: %w", err)
 	}
-	aksRTName, err := ensureFirewallRouteTable(ctx, clusterModel, vnet.name, aksSubnetResp.Subnet)
+	aksRTName, err := ensureFirewallRouteTable(ctx, clusterModel, vnet, aksSubnetResp.Subnet)
 	if err != nil {
 		return err
 	}

-	// Create AzureFirewallSubnet - this subnet name is required by Azure Firewall
-	firewallSubnetName := "AzureFirewallSubnet"
-	firewallSubnetParams := armnetwork.Subnet{
-		Properties: &armnetwork.SubnetPropertiesFormat{
-			AddressPrefix: to.Ptr("10.225.0.0/24"), // Use a different CIDR that doesn't overlap with 10.224.0.0/16
-		},
-	}
-
-	toolkit.Logf(ctx, "Creating subnet %s in VNet %s", firewallSubnetName, vnet.name)
-	subnetPoller, err := config.Azure.Subnet.BeginCreateOrUpdate(
-		ctx,
-		rg,
-		vnet.name,
-		firewallSubnetName,
-		firewallSubnetParams,
-		nil,
-	)
-	if err != nil {
-		return fmt.Errorf("failed to start creating firewall subnet: %w", err)
-	}
-
-	subnetResp, err := subnetPoller.PollUntilDone(ctx, config.DefaultPollUntilDoneOptions)
-	if err != nil {
-		return fmt.Errorf("failed to create firewall subnet: %w", err)
-	}
-
-	firewallSubnetID := *subnetResp.ID
-	toolkit.Logf(ctx, "Created firewall subnet with ID: %s", firewallSubnetID)
-
-	// Create public IP for the firewall
-	publicIPName := "abe2e-fw-pip"
-	publicIPParams := armnetwork.PublicIPAddress{
-		Location: to.Ptr(location),
-		SKU: &armnetwork.PublicIPAddressSKU{
-			Name: to.Ptr(armnetwork.PublicIPAddressSKUNameStandard),
-		},
-		Properties: &armnetwork.PublicIPAddressPropertiesFormat{
-			PublicIPAllocationMethod: to.Ptr(armnetwork.IPAllocationMethodStatic),
-		},
-	}
-
-	toolkit.Logf(ctx, "Creating public IP %s", publicIPName)
-	pipPoller, err := config.Azure.PublicIPAddresses.BeginCreateOrUpdate(
-		ctx,
-		rg,
-		publicIPName,
-		publicIPParams,
-		nil,
-	)
-	if err != nil {
-		return fmt.Errorf("failed to start creating public IP: %w", err)
-	}
-
-	pipResp, err := pipPoller.PollUntilDone(ctx, config.DefaultPollUntilDoneOptions)
-	if err != nil {
-		return fmt.Errorf("failed to create public IP: %w", err)
-	}
-
-	publicIPID := *pipResp.ID
-	toolkit.Logf(ctx, "Created public IP with ID: %s", publicIPID)
-
-	firewallName := "abe2e-fw"
-	firewall := getFirewall(ctx, location, firewallSubnetID, publicIPID)
-	fwPoller, err := config.Azure.AzureFirewall.BeginCreateOrUpdate(ctx, rg, firewallName, *firewall, nil)
-	if err != nil {
-		return fmt.Errorf("failed to start Firewall creation: %w", err)
-	}
-	fwResp, err := fwPoller.PollUntilDone(ctx, nil)
-	if err != nil {
-		return fmt.Errorf("failed to create Firewall: %w", err)
-	}
-
-	// Get the firewall's private IP address
-	var firewallPrivateIP string
-	if fwResp.Properties != nil && fwResp.Properties.IPConfigurations != nil && len(fwResp.Properties.IPConfigurations) > 0 {
-		if fwResp.Properties.IPConfigurations[0].Properties != nil && fwResp.Properties.IPConfigurations[0].Properties.PrivateIPAddress != nil {
-			firewallPrivateIP = *fwResp.Properties.IPConfigurations[0].Properties.PrivateIPAddress
-			toolkit.Logf(ctx, "Firewall private IP: %s", firewallPrivateIP)
-		}
-	}
-
-	if firewallPrivateIP == "" {
-		return fmt.Errorf("failed to get firewall private IP address")
-	}
-
 	// Add firewall routes to the existing AKS route table using individual
 	// route operations. This avoids replacing the entire table (which would
 	// race with cloud-provider-azure pod route updates) and preserves the
@@ -418,7 +341,7 @@ func addFirewallRules(
 		{
 			Name: to.Ptr("vnet-local"),
 			Properties: &armnetwork.RoutePropertiesFormat{
-				AddressPrefix: to.Ptr("10.224.0.0/16"),
+				AddressPrefix: to.Ptr(vnet.addressPrefix),
 				NextHopType:   to.Ptr(armnetwork.RouteNextHopTypeVnetLocal),
 			},
 		},
@@ -434,7 +357,7 @@ func addFirewallRules(

 	for _, route := range firewallRoutes {
 		toolkit.Logf(ctx, "Adding route %q to AKS route table %q", *route.Name, aksRTName)
-		poller, err := config.Azure.Routes.BeginCreateOrUpdate(ctx, rg, aksRTName, *route.Name, route, nil)
+		poller, err := config.Azure.Routes.BeginCreateOrUpdate(ctx, nodeRG, aksRTName, *route.Name, route, nil)
 		if err != nil {
 			return fmt.Errorf("failed to start adding route %q: %w", *route.Name, err)
 		}
@@ -451,7 +374,7 @@ func addFirewallRules(
 func ensureFirewallRouteTable(
 	ctx context.Context,
 	clusterModel *armcontainerservice.ManagedCluster,
-	vnetName string,
+	vnet VNet,
 	aksSubnet armnetwork.Subnet,
 ) (string, error) {
 	if aksSubnet.Properties == nil {
@@ -493,7 +416,7 @@ func ensureFirewallRouteTable(
 	aksSubnet.Properties.RouteTable = &armnetwork.RouteTable{
 		ID: routeTableResp.ID,
 	}
-	if err := updateSubnet(ctx, clusterModel, aksSubnet, vnetName); err != nil {
+	if err := updateSubnet(ctx, clusterModel, aksSubnet, vnet); err != nil {
 		return "", fmt.Errorf("failed to associate firewall route table %q with AKS subnet: %w", routeTableName, err)
 	}

@@ -512,7 +435,7 @@ func addPrivateAzureContainerRegistry(ctx context.Context, cluster *armcontainer
 	if err := createPrivateAzureContainerRegistryPullSecret(ctx, cluster, kube, resourceGroupName, isNonAnonymousPull); err != nil {
 		return fmt.Errorf("create private acr pull secret: %w", err)
 	}
-	vnet, err := getClusterVNet(ctx, *cluster.Properties.NodeResourceGroup)
+	vnet, err := getClusterVNet(ctx, cluster)
 	if err != nil {
 		return err
 	}
@@ -533,7 +456,7 @@ func addNetworkIsolatedSettings(ctx context.Context, clusterModel *armcontainers
 	location := *clusterModel.Location
 	defer toolkit.LogStepCtx(ctx, fmt.Sprintf("Adding network settings for network isolated cluster %s in rg %s", *clusterModel.Name, *clusterModel.Properties.NodeResourceGroup))

-	vnet, err := getClusterVNet(ctx, *clusterModel.Properties.NodeResourceGroup)
+	vnet, err := getClusterVNet(ctx, clusterModel)
 	if err != nil {
 		return err
 	}
@@ -549,16 +472,18 @@ func addNetworkIsolatedSettings(ctx context.Context, clusterModel *armcontainers
 		return err
 	}

+	subnetAddressPrefix := vnet.addressPrefix
+
 	subnetParameters := armnetwork.Subnet{
 		ID: to.Ptr(subnetId),
 		Properties: &armnetwork.SubnetPropertiesFormat{
-			AddressPrefix: to.Ptr("10.224.0.0/16"),
+			AddressPrefix: to.Ptr(subnetAddressPrefix),
 			NetworkSecurityGroup: &armnetwork.SecurityGroup{
 				ID: nsg.ID,
 			},
 		},
 	}
-	if err = updateSubnet(ctx, clusterModel, subnetParameters, vnet.name); err != nil {
+	if err = updateSubnet(ctx, clusterModel, subnetParameters, vnet); err != nil {
 		return err
 	}

@@ -944,7 +869,11 @@ func createPrivateDNSLink(ctx context.Context, vnet VNet, nodeResourceGroup, pri
 		return nil
 	}

-	vnetForId, err := config.Azure.VNet.Get(ctx, nodeResourceGroup, vnet.name, nil)
+	vnetRG := vnet.resourceGroup
+	if vnetRG == "" {
+		vnetRG = nodeResourceGroup
+	}
+	vnetForId, err := config.Azure.VNet.Get(ctx, vnetRG, vnet.name, nil)
 	if err != nil {
 		return fmt.Errorf("failed to get vnet: %w", err)
 	}
@@ -1118,8 +1047,8 @@ func createNetworkIsolatedSecurityGroup(ctx context.Context, cluster *armcontain
 	return &nsg, nil
 }

-func updateSubnet(ctx context.Context, cluster *armcontainerservice.ManagedCluster, subnetParameters armnetwork.Subnet, vnetName string) error {
-	poller, err := config.Azure.Subnet.BeginCreateOrUpdate(ctx, *cluster.Properties.NodeResourceGroup, vnetName, config.Config.DefaultSubnetName, subnetParameters, nil)
+func updateSubnet(ctx context.Context, cluster *armcontainerservice.ManagedCluster, subnetParameters armnetwork.Subnet, vnet VNet) error {
+	poller, err := config.Azure.Subnet.BeginCreateOrUpdate(ctx, vnet.resourceGroup, vnet.name, vnet.subnetName, subnetParameters, nil)
 	if err != nil {
 		return err
 	}
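
After this change, `updateSubnet` and `addFirewallRules` address the subnet through `vnet.resourceGroup`, `vnet.name`, and `vnet.subnetName` instead of the node resource group and a fixed subnet name. How `getClusterVNet` fills those fields is not part of the hunks shown here. One plausible approach, sketched below purely as an assumption, is to split the agent pool's `VnetSubnetID` into its components using the standard ARM subnet ID layout. `subnetRef` and `parseSubnetID` are hypothetical names, and the subscription ID in `main` is a placeholder.

```go
package main

import (
	"fmt"
	"strings"
)

// subnetRef carries the pieces of a subnet resource ID that the subnet and
// VNet clients need (hypothetical type, not the repository's VNet struct).
type subnetRef struct {
	resourceGroup string
	vnetName      string
	subnetName    string
}

// parseSubnetID splits an ARM subnet ID of the form
// /subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.Network/virtualNetworks/{vnet}/subnets/{subnet}
// into its resource group, VNet name, and subnet name.
func parseSubnetID(id string) (subnetRef, error) {
	parts := strings.Split(strings.Trim(id, "/"), "/")
	valid := len(parts) == 10 &&
		strings.EqualFold(parts[0], "subscriptions") &&
		strings.EqualFold(parts[2], "resourceGroups") &&
		strings.EqualFold(parts[4], "providers") &&
		strings.EqualFold(parts[5], "Microsoft.Network") &&
		strings.EqualFold(parts[6], "virtualNetworks") &&
		strings.EqualFold(parts[8], "subnets")
	if !valid {
		return subnetRef{}, fmt.Errorf("not a subnet resource ID: %q", id)
	}
	return subnetRef{resourceGroup: parts[3], vnetName: parts[7], subnetName: parts[9]}, nil
}

func main() {
	// Placeholder subscription ID; the other names come from the README diagram.
	id := "/subscriptions/00000000-0000-0000-0000-000000000000" +
		"/resourceGroups/abe2e-westus3/providers/Microsoft.Network" +
		"/virtualNetworks/abe2e-shared-vnet/subnets/aks-subnet-abe2e-kubenet-v5"
	ref, err := parseSubnetID(id)
	if err != nil {
		panic(err)
	}
	fmt.Printf("rg=%s vnet=%s subnet=%s\n", ref.resourceGroup, ref.vnetName, ref.subnetName)
}
```

Deriving the resource group from the subnet ID would also explain why the route operations keep using `nodeRG` while the subnet operations switch to `vnet.resourceGroup`: the route table lives in the `MC_` resource group, whereas the subnet lives in the shared one.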
