website/blog/2026-04-12-nap-disruption/index.md
Lines changed: 10 additions & 11 deletions
@@ -1,15 +1,15 @@
 ---
-title: "Managing Disruption with AKS Node Auto-Provisioning (NAP): PDBs, Consolidation, and Disruption Budgets"
-description: "Learn AKS best practices to control voluntary disruption from Node Auto-Provisioning (NAP): how Pod Disruption Budgets interact with Karpenter consolidation/drift/expiration, and how to use NodePool disruption budgetsand maintenance windows to keep workloads stable."
+title: "Managing Disruption with AKS Node Auto-Provisioning"
+description: "Learn AKS best practices to control NAP disruption with Pod Disruption Budgets (PDBs), node pool disruption budgets, consolidation, and maintenance windows."
 date: 2026-04-12
 authors: ["wilson-darko"]
 tags:
 - node-auto-provisioning
 ---
 
 ## Background
-AKS users want to ensure that their workloads scaling when needed, and are disrupted only when (or where) desired.
-AKS Node Auto-Provisioning (NAP) is designed to keep clusters efficient: it provisions nodes for pending pods, and it also continuously *removes* nodes when it’s safe to do so (for example, when nodes are empty or underutilized). That second half**disruption** is where many production surprises happen.
+AKS users want to ensure that their workloads scale when needed and are disrupted only when (and where) desired.
+AKS Node Auto-Provisioning (NAP) is designed to keep clusters efficient: it provisions nodes for pending pods, and it also continuously *removes* nodes when it’s safe to do so (for example, when nodes are empty or underutilized). That node-removal **disruption** is where many production surprises happen.
 
 When managing Kubernetes, operational questions that users might have are:
 
@@ -19,7 +19,7 @@ When managing Kubernetes, operational questions that users might have are:
 - Why do upgrades get “stuck” on certain nodes?
 
 
-This post focuses on **NAP disruption best practices**, and not workload scheduling (tools like topology spread constraints, node affinity, taints, etc.). For more on scheduling best practices, check out our [blog post](<will edit once part 1 blog is published>).
+This post focuses on **NAP disruption best practices**, and not workload scheduling (tools like topology spread constraints, node affinity, taints, etc.). For more on scheduling best practices, check out our earlier blog post on NAP scheduling fundamentals.
 
 If you’re new to these NAP features, this post will give you “good defaults” as a starting point. If you’re already deep into NAP disruption settings, treat it as a checklist for the behaviors AKS users most commonly ask about.
 
@@ -58,15 +58,15 @@ NAP is built on Karpenter concepts and exposes disruption controls on the **Node
 - **Consolidation policy** (when NAP is allowed to consolidate)
 - **Disruption budgets** (how many nodes can be disrupted at once, and when)
 - **Expire-after** (node lifetime)
-- **Drift**(replace nodes that are out o)
+- **Drift**(replace nodes that are out of date with the desired NodePool configuration)
 
 A good operational posture is: **use PDBs to protect *applications*** and **use NAP disruption tools to control *the cluster’s disruption rate***.
 
 ---
 
 ## Part 2 - NAP Overview
 
-Node auto-provisioning (NAP) provisions, scales, and manages nodes. NAP bases it's scheduling and disruption logic on settings from 3 sources:
+Node auto-provisioning (NAP) provisions, scales, and manages nodes. NAP bases its scheduling and disruption logic on settings from 3 sources:
 
 - Workload deployment file - For disruption NAP honors the pod disruption budgets defined by the user here
 - [NodePool CRD](https://learn.microsoft.com/azure/aks/node-auto-provisioning-node-pools) - Used to list the range of allowed virtual machine options (size, zones, architecture) and also disruption settings
@@ -119,8 +119,6 @@ spec:
 app: web
 ```
 
-Kubernetes describes minAvailable / maxUnavailable as the two key availability knobs, and notes you can only specify one per PDB.
-
 Why it works well in practice:
 - Consolidation/drift/expiration can still proceed.
 - You avoid large brownouts caused by draining too many replicas at once.
@@ -149,7 +147,8 @@ This can be intentional for extremely sensitive workloads, but it has a cost: if
 
 There are two different operator intents that often get conflated:
 
-- **When** consolidation is allowed and will happen- **How much** disruption can happen concurrently (budgets / rate limiting)
+- **When** consolidation is allowed and will happen
+- **How much** disruption can happen concurrently (budgets / rate limiting)
-NAP exposes Karpenter-style disruption budgets on the NodePool. If you don’t set them, a default budget of `nodes: 10%` is used. Use budgets to regulate how many nodes are consolidate at a time.
+NAP exposes Karpenter-style disruption budgets on the NodePool. If you don’t set them, a default budget of `nodes: 10%` is used. Use budgets to regulate how many nodes are consolidated at a time.
 
 The following example sets the node disruption budget to 1 node at a time.
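
For readers of this diff, the kind of budget that last context line describes can be sketched as the NodePool fragment below. This is an illustration, not the example from the post itself, which lies outside the captured hunk. The `karpenter.sh/v1` API version and the NodePool name `default` are assumptions, and the required `template` section is left out for brevity.

```yaml
# Hypothetical NodePool fragment: only the disruption settings are shown.
# The apiVersion and the name "default" are assumptions; a real NodePool
# also needs a spec.template section, omitted here.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    budgets:
      # Allow at most one node in this NodePool to be voluntarily
      # disrupted at any given time (replaces the default of 10%).
      - nodes: "1"
```

An absolute count such as `"1"` keeps the disruption rate constant as the pool grows, whereas a percentage scales with the number of nodes.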