
Conversation


@tmonty12 tmonty12 commented Nov 14, 2025

What type of PR is this?

/kind documentation

What this PR does / why we need it:

This PR creates a design proposal for allowing configurability over the default Grove RollingUpdate strategy for PodCliqueSets, PodCliqueScalingGroups and PodCliques.

  • Introduces a ReplicaRecreate strategy at the PodCliqueSet level to atomically recreate PCS replicas in the case where application-level version compatibility is not possible. Also introduces the notion of maxUnavailable/maxSurge for ReplicaRecreate (a rough API sketch follows this list).
  • For the PCS RollingUpdate strategy, introduces maxSurge/maxUnavailable at the PC and PCSG levels.
  • Includes considerations for the following:
    • Gang Scheduling
    • Index management when surging
    • Use cases and examples
    • Updates to webhook validation
    • Updates to APIs
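
To make the proposed shape easier to picture, here is a rough sketch of what the PCS-level strategy API could look like. This is illustrative only; the type, constant, and field names (e.g. `PodCliqueSetUpdateStrategy`, `RollingUpdateConfig`, the JSON fields) are assumptions for discussion, not the final API.

```go
package v1alpha1

import "k8s.io/apimachinery/pkg/util/intstr"

// Illustrative sketch only - names and shapes are assumptions for discussion,
// not the final API proposed by this design.
type PodCliqueSetUpdateStrategyType string

const (
	// RollingUpdate updates PCS replicas one at a time, delegating how pods and
	// PCSG replicas roll within each PCS replica to the PCLQ/PCSG update strategies.
	RollingUpdate PodCliqueSetUpdateStrategyType = "RollingUpdate"
	// ReplicaRecreate atomically recreates whole PCS replicas, for cases where
	// application-level version compatibility between old and new pods is not possible.
	ReplicaRecreate PodCliqueSetUpdateStrategyType = "ReplicaRecreate"
)

// PodCliqueSetUpdateStrategy configures how a PCS rolls out spec changes.
type PodCliqueSetUpdateStrategy struct {
	Type PodCliqueSetUpdateStrategyType `json:"type,omitempty"`
	// maxUnavailable/maxSurge knobs. Per this proposal they are honored at the PCS
	// level only for ReplicaRecreate; for RollingUpdate the equivalent knobs live on
	// the PCLQ and PCSG update strategies instead.
	RollingUpdateConfig *RollingUpdateConfig `json:"rollingUpdateConfig,omitempty"`
}

// RollingUpdateConfig mirrors the familiar Deployment-style knobs.
type RollingUpdateConfig struct {
	// Maximum number of replicas that may be unavailable during the update.
	MaxUnavailable *intstr.IntOrString `json:"maxUnavailable,omitempty"`
	// Maximum number of extra (surge) replicas created above the desired count.
	MaxSurge *intstr.IntOrString `json:"maxSurge,omitempty"`
}
```

Webhook validation would then reject combinations that don't make sense (for example, PCS-level maxUnavailable/maxSurge together with the RollingUpdate strategy), per the validation updates listed above.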

- Pod selection: oldest pod first (by creation timestamp)
- Individual pods are deleted and recreated

This default behavior provides safe, conservative updates but lacks user configurability. At the PCSG and standalone PC levels, the update corresponds to maxUnavailable=1 and maxSurge=0, where a single old replica is deleted and a new one is created.
Contributor

PC -> PCLQ



type PodCliqueSetUpdateStrategyType string
const (
RollingUpdate // Update replicas sequentially
@Ronkahn21 Ronkahn21 commented Nov 17, 2025

The name does not match the description: it does not update replicas sequentially, it updates the replica components sequentially. A better name would be ReplicaUpdate or something that refers to the replica.

Contributor

Agreed, ReplicaUpdate is better but it's still a bit confusing. Not sure what a better name would be though. Unless we can find one, we should document this very well.

@ZYecho11

Perhaps we can consider introducing a partition field to control the upgrade process? The partition field would be meaningful for the PD association upgrade scenario.
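
For illustration, a partition along the lines of StatefulSet's `spec.updateStrategy.rollingUpdate.partition` could look roughly like the sketch below; the names and placement here are hypothetical, not something the proposal currently defines.

```go
package sketch

// Hypothetical StatefulSet-style partition knob; it would sit alongside
// maxUnavailable/maxSurge in the rolling update configuration. The name and
// semantics here are assumptions, not part of the current proposal.
type rollingUpdatePartition struct {
	// partition is the ordinal below which replicas stay on the old revision.
	partition int32
}

// shouldUpdate reports whether the replica at replicaIndex is rolled to the new
// revision: indices >= partition are updated first, indices below stay on the old
// revision until partition is lowered (useful for a staged prefill/decode rollout).
func (p rollingUpdatePartition) shouldUpdate(replicaIndex int32) bool {
	return replicaIndex >= p.partition
}
```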

- **Frontend** standalone PC: 3 replicas
- **Prefill** PCSG: 2 replicas (prefill-leader PC: 1 replica, prefill-worker PC: 2 replicas)
- **Decode** PCSG: 2 replicas (decode-leader PC: 1 replica, decode-worker PC: 2 replicas)

@gflarity gflarity commented Nov 26, 2025

Suggest you include the YAML as well; it helps those who "see the matrix" ;)

```yaml
apiVersion: grove.io/v1alpha1
kind: PodCliqueSet
metadata:
  name: multinode-disaggregated-inference
  namespace: default
spec:
  # 2 PCS replicas - each replica contains the full inference pipeline
  replicas: 2
  template:
    podcliques:
    # Frontend - standalone PodClique (not part of any PCSG)
    - name: frontend
      spec:
        roleName: frontend
        replicas: 3
        podSpec:
          containers:
          - name: frontend
            # ...

    # Prefill Leader - part of prefill PCSG
    - name: prefill-leader
      spec:
        roleName: prefill-leader
        replicas: 1
        podSpec:
          containers:
          - name: prefill-leader
          # ...


    # Prefill Worker - part of prefill PCSG
    - name: prefill-worker
      spec:
        roleName: prefill-worker
        replicas: 2
        podSpec:
          containers:
          - name: prefill-worker
          # ... 

    # Decode Leader - part of decode PCSG
    - name: decode-leader
      spec:
        roleName: decode-leader
        replicas: 1
        podSpec:
          containers:
          - name: decode-leader
          # ... 

    # Decode Worker - part of decode PCSG
    - name: decode-worker
      spec:
        roleName: decode-worker
        replicas: 2
        podSpec:
          containers:
          - name: decode-worker
          # ... 

    # PodCliqueScalingGroups - groups of cliques that scale together
    podCliqueScalingGroups:
    # Prefill PCSG: 2 replicas of (1 leader + 2 workers)
    - name: prefill
      cliqueNames: [prefill-leader, prefill-worker]
      replicas: 2

    # Decode PCSG: 2 replicas of (1 leader + 2 workers)
    - name: decode
      cliqueNames: [decode-leader, decode-worker]
      replicas: 2
```

@gflarity gflarity left a comment

Thanks for the design. See comments. Ping me, happy to talk through any of them in a video chat.

Recommendation

Given that we live in a world of supply constraints around GPUs, I doubt there are going to be clusters out there with spare capacity to use for maxSurge. So I'd recommend we narrow the scope of this document to just cover ReplicaRecreate and maxUnavailable. I suspect those are the two knobs the real world actually cares about. Should maxSurge get requested by an organization with real use cases for it, we can dig in and understand why, and whether the approach of scale-then-roll is sufficient, as that seems a lot cleaner, though slower.

Consider a multinode aggregated inference serving deployment with 2 PCS replicas, where each replica contains:

- **Aggregated Workers** PCSG: 3 replicas (each replica includes inference workers with a frontend capable of tokenization that accepts OpenAI Chat Completion requests)

Contributor

So I think your example would look like the following.

Sorry to nitpick, I know this is a toy example, but in this situation you could also just do 1 replica of 6 pcsg, right? Or 6 pcs with 1 pcsg. Which makes the most sense?

```yaml
  apiVersion: grove.io/v1alpha1
  kind: PodCliqueSet
  metadata:
    name: multinode-aggregated-inference
    namespace: default
  spec:
    # 2 PCS replicas - each replica contains the full aggregated inference deployment
    replicas: 2
    template:
      podcliques:
      # Aggregated Worker - combines inference + frontend/tokenization in a single pod
      # Each worker can handle OpenAI Chat Completion requests directly
      - name: aggregated-worker
        spec:
          roleName: aggregated-worker
          replicas: 1
          podSpec: #... 

      # PodCliqueScalingGroup - groups the aggregated workers that scale together
      podCliqueScalingGroups:
      # Aggregated Workers PCSG: 3 replicas of aggregated inference workers
      - name: aggregated-workers
        cliqueNames: [aggregated-worker]
        replicas: 3
```

```
PodCliqueSet (Top Level)
├─ UpdateStrategy (controls PCS replica updates)
│ ├─ RollingUpdate: one replica at a time
```
Contributor

I'm a bit confused by this. Even if we were to ReplicaRecreate, you'd still do it one at a time, as there's no maxSurge or maxUnavailable on PCS?

Contributor

Oh, I see below there is? This diagram is a bit confusing.


Controls **how pods update within a standalone PodClique**.

```go
type ComponentUpdateStrategy struct {
```
Contributor

Not sure the Component is necessary here. Just call it UpdateStrategy. Though if recreate is set on the PCS, then this doesn't matter and we probably need to warn the user.


## Update Behavior

### RollingUpdate (Default)
Contributor

ReplicaUpdate

Contributor

Actually, another way to frame it: DelegateChildren or something like that. I.e., follow the strategy of the children, vs. ReplicaRecreate which ignores their strategies and just recreates.

- Need to clear all state at once within a replica
- Coordinated recreation of interdependent components to prevent cross-version communication issues

## MaxSurge Considerations
Contributor

I found this section a bit hard to follow. Leaving some suggestions below to make it a bit easier.

Comment on lines +234 to +365
## MaxSurge Considerations

### PodClique MaxSurge

**Indexing Strategy:**

PodClique uses an index tracker that extracts pod indices from hostnames and fills holes automatically. When surge pods are created:

1. **Surge pods get indices above replica count**: With `replicas=3` and `maxSurge=1`, surge pod gets index 3 (or higher if holes exist)
2. **Index tracker fills holes**: When old pods are deleted, their indices become available. The tracker fills holes from lowest to highest (starting from 0)
3. **No holes at end of update**: As old pods are deleted and recreated, new pods fill the lowest available indices, ensuring sequential indices `[0, replicas-1]` at completion

**Example with `replicas=3`, `maxSurge=1`, `maxUnavailable=0`:**

1. **Initial:** Pods with indices 0, 1, 2 (old spec)
2. **Create surge:** Pod with index 3 (surge, new spec) - now have [0, 1, 2, 3]
3. **Delete pod 0:** Index 0 becomes available
4. **Recreate pod 0:** New pod fills index 0 (new spec) - now have [0, 1, 2, 3]
5. **Delete pod 1:** Index 1 becomes available
6. **Recreate pod 1:** New pod fills index 1 (new spec) - now have [0, 1, 2, 3]
7. **Delete pod 2:** Index 2 becomes available
8. **Recreate pod 2:** New pod fills index 2 (new spec) - now have [0, 1, 2, 3]
9. **Delete surge pod 3:** Final state [0, 1, 2] - no holes
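
To make the hole-filling behavior concrete, here is a minimal illustrative sketch of an index tracker that always hands out the lowest free index. It is a simplification of the behavior described above, not Grove's actual implementation.

```go
package main

import "fmt"

// indexTracker hands out the lowest free pod index and reclaims indices when pods
// are deleted, so indices end up contiguous at [0, replicas-1] after an update.
type indexTracker struct {
	used map[int]bool
}

func newIndexTracker() *indexTracker { return &indexTracker{used: map[int]bool{}} }

// acquire returns the lowest index not currently in use; holes are filled first,
// so a surge pod lands above the current highest index only when no holes exist.
func (t *indexTracker) acquire() int {
	for i := 0; ; i++ {
		if !t.used[i] {
			t.used[i] = true
			return i
		}
	}
}

// release frees an index when its pod is deleted.
func (t *indexTracker) release(i int) { delete(t.used, i) }

func main() {
	t := newIndexTracker()
	for i := 0; i < 3; i++ {
		t.acquire() // pods 0, 1, 2 (old spec)
	}
	surge := t.acquire()    // surge pod gets index 3
	t.release(0)            // old pod 0 is deleted
	refilled := t.acquire() // recreated pod fills the hole at index 0
	fmt.Println(refilled, surge) // prints: 0 3
}
```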

**Gang Scheduling Impact:**

- Surge pods are added to the same PodGroup as existing pods
- PodGroup's `PodReferences` list includes all pods (old + surge)
- Gang scheduling requires the PodGroup to meet `MinReplicas` (from PodClique's `MinAvailable`)
- All pods in the PodGroup (including surge) must be scheduled together as part of the gang
- If surge pod cannot be scheduled, the entire gang is blocked

**PodGang/PodGroup Construction:**

- PodGroup contains pod references from the PodClique
- During surge, PodGroup temporarily has more pod references than `replicas` count
- PodGroup's `MinReplicas` is set to PodClique's `MinAvailable` (not affected by surge)
- Gang scheduling ensures at least `MinReplicas` pods are scheduled together

**Stuck Scenarios:**

- **Surge pod cannot be scheduled**: Gang scheduling blocks until surge pod can be scheduled, update stuck
- **Surge pod scheduled but not ready**: Update cannot proceed if `maxUnavailable=0` requires surge pod to be ready before deleting old pods

### PodCliqueScalingGroup MaxSurge

**Indexing Strategy:**

PodCliqueScalingGroup replicas use replica indices (0, 1, 2, ...). When surge replicas are created:

1. **Surge replicas get indices above replica count**: With `replicas=3` and `maxSurge=1`, surge replica gets index 3
2. **Replica placement depends on minAvailable**:
- Replica indices below `minAvailable` go into the base PodGang
- Replica indices at or above `minAvailable` go into scaled PodGangs; surge replicas always fall in this range, since their indices start at `replicas >= minAvailable`
3. **No holes at end of update**: Original replica indices `[0, replicas-1]` are maintained, surge replicas at `[replicas, replicas+maxSurge-1]` are deleted after update completes
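
The placement rule can be illustrated with a small sketch. The PodGang names used here are purely illustrative, not Grove's actual naming scheme.

```go
package main

import "fmt"

// podGangFor illustrates which PodGang a PCSG replica index lands in: indices below
// minAvailable belong to the base PodGang, everything else (including surge indices,
// which start at replicas >= minAvailable) gets its own scaled PodGang.
func podGangFor(pcsgName string, replicaIndex, minAvailable int) string {
	if replicaIndex < minAvailable {
		return pcsgName + "-base"
	}
	return fmt.Sprintf("%s-scaled-%d", pcsgName, replicaIndex-minAvailable)
}

func main() {
	// replicas=3, minAvailable=2, maxSurge=1: surge replica 3 lands in a scaled PodGang.
	for idx := 0; idx <= 3; idx++ {
		fmt.Println(idx, podGangFor("prefill", idx, 2))
	}
}
```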

**Example with `replicas=3`, `minAvailable=3`, `maxSurge=1`, `maxUnavailable=0`:**

1. **Initial:** Replicas 0, 1, 2 in base PodGang (old spec)
2. **Create surge:** Replica 3 in scaled PodGang (surge, new spec) - replicas 0, 1, 2 in base PodGang; replica 3 in scaled PodGang
3. **Wait for surge available:** Replica 3 becomes available (scaled PodGang gated by base PodGang readiness)
4. **Delete and recreate replica 0:** Replica 0 (new spec) in base PodGang
5. **Wait for replica 0 available:** Replica 0 becomes available
6. **Delete and recreate replica 1:** Replica 1 (new spec) in base PodGang
7. **Wait for replica 1 available:** Replica 1 becomes available
8. **Delete and recreate replica 2:** Replica 2 (new spec) in base PodGang
9. **Wait for replica 2 available:** Replica 2 becomes available
10. **Delete surge replica 3:** Final state replicas [0, 1, 2] - no holes

**Example with `replicas=3`, `minAvailable=2`, `maxSurge=1`:**

1. **Initial:** Replicas 0, 1 in base PodGang; Replica 2 in scaled PodGang (old spec)
2. **Create surge:** Replica 3 in scaled PodGang (surge, new spec)
3. **Update proceeds:** Replicas 0, 1, 2 updated, then surge replica 3 deleted

**Gang Scheduling Impact:**

- **Base PodGang (replicas 0 to minAvailable-1)**: All PodGroups in base PodGang must meet `MinReplicas` for gang scheduling to proceed.
- **Scaled PodGangs (replicas >= minAvailable)**: Surge replicas (always at indices >= replicas, which is >= minAvailable) get their own scaled PodGang. Scaled PodGangs are gated by base PodGang readiness - gates are removed only after base PodGang is ready.
- **Gang scheduling constraints**: Each PodGroup (one per PodClique in the PCSG replica) must meet its `MinReplicas` for the gang to be scheduled.

**PodGang/PodGroup Construction:**

- **Base PodGang**: Contains PodGroups for replicas 0 to `minAvailable-1`.
- **Scaled PodGangs**: Each replica >= `minAvailable` gets its own scaled PodGang. Surge replicas (always at indices >= replicas >= minAvailable) create new scaled PodGangs.
- **PodGroup per PodClique**: Each PodClique in a PCSG replica becomes a PodGroup. Surge replica creates PodGroups for all its PodCliques.

**Stuck Scenarios:**

- **Surge replica cannot be scheduled**: The surge replica is always in a scaled PodGang. If the scaled PodGang is blocked while the base PodGang is updating, this creates a circular dependency
- **Base PodGang update blocks surge scaled PodGang**: Surge replica in scaled PodGang is gated by base PodGang readiness. If base is updating, surge cannot proceed.
- **Surge replica scheduled but not ready**: Update cannot proceed if `maxUnavailable=0` requires surge replica to be available before deleting old replicas.

### PCS Replica-Level MaxSurge with ReplicaRecreate

**Behavior:**

With ReplicaRecreate, surge replicas are created at new indices above the desired replica count to avoid index holes. The update process:

1. Creates surge replicas at indices `[replicas, replicas+maxSurge-1]`
2. Recreates original indices `[0, replicas-1]` with the updated spec
3. Deletes surge replicas once original indices are recreated

**Example:**

With `replicas=3`, `maxSurge=1`, and `maxUnavailable=0`:

1. **Initial state:** Replicas 0, 1, 2 (old spec)
2. **Create surge replica:** Replicas 0, 1, 2 (old), 3 (surge, new spec)
3. **Wait for surge available:** Replica 3 becomes available
4. **Delete and recreate replica 0:** Replicas 0 (new), 1, 2 (old), 3 (surge, new)
5. **Wait for replica 0 available:** Replica 0 becomes available
6. **Delete and recreate replica 1:** Replicas 0, 1 (new), 2 (old), 3 (surge, new)
7. **Wait for replica 1 available:** Replica 1 becomes available
8. **Delete and recreate replica 2:** Replicas 0, 1, 2 (new), 3 (surge, new)
9. **Wait for replica 2 available:** Replica 2 becomes available
10. **Delete surge replica 3:** Replicas 0, 1, 2 (new spec) - no index holes

This approach maintains sequential indices throughout the update, avoiding DNS naming issues and ensuring applications always see consistent replica indices. With `maxUnavailable=0`, a surge replica must be available before deleting any original replica to maintain full capacity.
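
The ordering above can be sketched as follows for the `maxUnavailable=0` case. `replicaOps` and its helpers are hypothetical stand-ins used only to show the sequencing; they are not actual controller code.

```go
package sketch

// replicaOps groups hypothetical helpers for operating on PCS replicas by index.
type replicaOps struct {
	create, recreate, del func(index int) error
	waitAvailable         func(index int) error
}

// rollWithSurge sketches the ReplicaRecreate + maxSurge ordering for maxUnavailable=0.
func rollWithSurge(ops replicaOps, replicas, maxSurge int) error {
	// 1. Create surge replicas at indices [replicas, replicas+maxSurge-1] with the new spec.
	for i := replicas; i < replicas+maxSurge; i++ {
		if err := ops.create(i); err != nil {
			return err
		}
		// With maxUnavailable=0 the surge replica must become available before any
		// original replica is deleted; if it never does, the update is stuck here.
		if err := ops.waitAvailable(i); err != nil {
			return err
		}
	}
	// 2. Delete and recreate original indices [0, replicas-1] one at a time with the
	//    updated spec, waiting for each to become available again.
	for i := 0; i < replicas; i++ {
		if err := ops.recreate(i); err != nil {
			return err
		}
		if err := ops.waitAvailable(i); err != nil {
			return err
		}
	}
	// 3. Delete surge replicas, leaving sequential indices [0, replicas-1] and no holes.
	for i := replicas; i < replicas+maxSurge; i++ {
		if err := ops.del(i); err != nil {
			return err
		}
	}
	return nil
}
```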

**Stuck Scenarios with ReplicaRecreate and MaxSurge:**

When using ReplicaRecreate with `maxSurge > 0`, the update can get stuck if surge replicas fail to become available. This can happen in several scenarios:

1. **Surge replica is unscheduled**: The surge replica's pods cannot be scheduled due to insufficient cluster resources, topology constraints that cannot be satisfied, node selectors/affinity mismatches, or resource quotas exceeded.

2. **Surge replica has MinAvailable breached**: The surge replica's pods are scheduled but fail to become ready due to crash loops, health check failures, application startup failures, or dependency issues.

3. **Existing replicas are unhealthy**: Even if surge replica is healthy, if existing replicas are unscheduled or have MinAvailable breached, the update may be blocked by `maxUnavailable` constraints.

Users are responsible for identifying when a rolling update with `maxSurge` during ReplicaRecreate is stuck (e.g., update progress stalls, surge replica remains unscheduled or has MinAvailable breached) and manually intervening to unblock the update, such as by reducing `maxSurge` to 0 or deleting the stuck surge replica.

Contributor

I found this section a bit confusing. I worked with Opus 4.5 to make it a bit clearer. Please take a look and incorporate what you like.

# MaxSurge Considerations

## Overview

Enabling `maxSurge` changes the update strategy from **delete-then-create** (current default) to **create-then-delete**. This maintains full capacity during updates but introduces complexity around indexing, gang scheduling, and potential "stuck" scenarios.

**Key Risk**: With `maxSurge > 0`, updates can become **stuck** if surge resources fail to schedule or become healthy. Grove does not automatically detect or resolve these situations—users must monitor and manually intervene.


## Common Concepts Across All Levels

Before diving into level-specific details, here are concepts that apply at every level:

**Indexing Strategy**: Surge resources are assigned indices *above* the normal replica count to avoid index collisions. When old resources are deleted and recreated, they reclaim their original indices. Surge resources are deleted at the end of the update, leaving clean sequential indices.

**Availability Gating**: When `maxUnavailable=0`, the surge resource must become available *before* any old resource can be deleted. This is what enables zero-downtime updates but also creates the primary stuck scenario.


## PodClique MaxSurge (Pod Level)

**Scope**: Controls how individual pods update within a standalone PodClique.

**How It Works**:

- With `replicas=3` and `maxSurge=1`: surge pod gets index 3, update proceeds, surge pod deleted at end
- The index tracker fills holes from lowest to highest, ensuring no index gaps at completion

**Example with `replicas=3`, `maxSurge=1`, `maxUnavailable=0`:**

1. **Initial:** Pods with indices 0, 1, 2 (old spec)
2. **Create surge:** Pod with index 3 (new spec) — now have [0, 1, 2, 3]
3. **Delete pod 0:** Index 0 becomes available
4. **Recreate pod 0:** New pod fills index 0 (new spec) — now have [0, 1, 2, 3]
5. **Repeat for pods 1 and 2**
6. **Delete surge pod 3:** Final state [0, 1, 2] with new spec — no holes

**Gang Scheduling Impact**:

Surge pods are added to the **same PodGroup** as existing pods. This has an important implication:

> **Gang scheduling requires ALL pods in a PodGroup (including surge) to be schedulable together.**

If the cluster lacks resources for the surge pod, the entire gang becomes unschedulable, blocking the update.

**Stuck Scenarios**:

| Scenario | Cause | Result |
|----------|-------|--------|
| Surge pod unschedulable | Insufficient cluster resources | Gang blocked, update stuck |
| Surge pod not ready | Container failures, health check issues | Update blocked (if `maxUnavailable=0`) |


## PodCliqueScalingGroup MaxSurge (PCSG Replica Level)

**Scope**: Controls how PCSG replicas (groups of related PodCliques) update within a scaling group.

**How It Works**:

- With `replicas=3` and `maxSurge=1`: surge PCSG replica gets index 3
- Each PCSG replica contains multiple PodCliques that are updated together

**Example with `replicas=3`, `minAvailable=3`, `maxSurge=1`, `maxUnavailable=0`:**

1. **Initial:** Replicas 0, 1, 2 in base PodGang (old spec)
2. **Create surge:** Replica 3 in scaled PodGang (new spec)
3. **Wait for surge available:** Replica 3 becomes available
4. **Delete and recreate replicas 0, 1, 2** sequentially, waiting for each to become available
5. **Delete surge replica 3:** Final state replicas [0, 1, 2] with new spec

**Gang Scheduling Impact — The Base/Scaled PodGang Problem**:

This is where `maxSurge` becomes complicated. Grove uses a two-tier gang scheduling model:

- **Base PodGang**: Contains PCSG replicas 0 through `minAvailable-1`
- **Scaled PodGangs**: Contain PCSG replicas at index `minAvailable` and above

Since surge replicas are always at index ≥ `replicas` (which is ≥ `minAvailable`), **surge replicas always land in Scaled PodGangs**.

Scaled PodGangs have a dependency: they are **gated until the base PodGang is ready**. This creates a potential problem:

```
┌─────────────────────────────────────────────────────────────────┐
│ POTENTIAL CIRCULAR DEPENDENCY                                   │
│                                                                 │
│ 1. Surge replica (in scaled PodGang) waits for base to be ready │
│ 2. Base PodGang is being updated (may not be "ready")           │
│ 3. Update needs surge to be available before deleting old base  │
│ 4. Deadlock: surge waits for base, update waits for surge       │
└─────────────────────────────────────────────────────────────────┘
```

**Stuck Scenarios**:

| Scenario | Cause | Result |
|----------|-------|--------|
| Surge blocked by base PodGang | Base PodGang updating, not "ready" | Circular dependency, update stuck |
| Surge replica unschedulable | Resource constraints on scaled PodGang | Update stuck |
| Surge replica not available | Pod failures within the surge PCSG replica | Update blocked (if `maxUnavailable=0`) |

## PCS Replica MaxSurge with ReplicaRecreate (Top Level)

**Scope**: Controls how entire PCS replicas are recreated during version-incompatible updates.

**How It Works**:

With `replicas=2`, `maxSurge=1`, and `maxUnavailable=0`:

Step 1: [0-old, 1-old] Initial state
Step 2: [0-old, 1-old, 2-surge] Create surge replica at index 2
Step 3: Wait for replica 2 to become available
Step 4: [0-new, 1-old, 2-surge] Delete/recreate replica 0
Step 5: Wait for replica 0 to become available
Step 6: [0-new, 1-new, 2-surge] Delete/recreate replica 1
Step 7: Wait for replica 1 to become available
Step 8: [0-new, 1-new] Delete surge replica 2


This approach maintains full capacity (2 available replicas) throughout the update.

**Stuck Scenarios**:

Since surge PCS replicas are complete deployments (with their own PodGangs, PCSGs, and PodCliques), they can fail to become available for many reasons:

| Scenario | Examples |
|----------|----------|
| **Surge replica unscheduled** | Insufficient resources, topology constraints unsatisfiable, node selector mismatches, quota exceeded |
| **Surge replica unhealthy** | Container crash loops, health check failures, application startup failures, dependency issues |
| **Existing replicas degraded** | If `maxUnavailable` constraint prevents progress due to unhealthy existing replicas |


## User Responsibilities

**Grove does not automatically recover from stuck surge scenarios.** Users are responsible for:

1. **Monitoring update progress** — Watch for updates that stall (surge replica remains unscheduled or unhealthy)
2. **Diagnosing the cause** — Check pod events, resource availability, and PodGang status
3. **Manual intervention** — Options include:
   - Reducing `maxSurge` to 0 to switch to delete-then-create
   - Manually deleting the stuck surge replica
   - Freeing cluster resources to allow scheduling
   - Fixing application issues preventing readiness

## Summary

| Level | Surge Resource | Primary Risk | Gang Scheduling Concern |
|-------|---------------|--------------|------------------------|
| **PodClique** | Extra pod | Pod unschedulable | Surge pod blocks entire PodGroup gang |
| **PCSG** | Extra PCSG replica | Scaled PodGang gated | Base/scaled dependency creates circular wait |
| **PCS (ReplicaRecreate)** | Extra PCS replica | Replica unhealthy/unscheduled | Full replica must schedule and become healthy |

**Bottom Line**: `maxSurge` enables zero-downtime, full-capacity updates but shifts the failure mode from "reduced capacity during update" to "potentially stuck update requiring manual intervention."


Contributor

Please check out the draft PR for rolling update E2E tests; it seems like scaling during a roll is an edge case we currently care about. What are the impacts of this on maxSurge?

What if we just increase the replicas first (with the old revision), then treat this as a maxUnavailable situation once the pods are ready? Once the rollout is successful, reduce the replicas again. This might handle the intermingling mentioned above better. I suspect only really having to implement the maxUnavailable case would make the implementation a lot simpler too. The downside is you'll have to wait for new pods of the old revision to spin up, which could take quite a long time depending on what's being launched. That time would be wasted, but that might be a better trade-off than all the stuck cases.
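
To sketch the idea (the helpers here are hypothetical, and this assumes the cluster has headroom to scale up with the old revision first):

```go
package sketch

// scaleThenRoll sketches the alternative above: temporarily scale up with the old
// revision, roll using only maxUnavailable semantics, then scale back down.
// scalePCS, waitForReady, and rollWithMaxUnavailable are hypothetical helpers.
func scaleThenRoll(
	replicas, surge int,
	scalePCS func(n int) error,
	waitForReady func() error,
	rollWithMaxUnavailable func(maxUnavailable int) error,
) error {
	// 1. Scale up by `surge` replicas of the OLD revision and wait for them to be ready.
	if err := scalePCS(replicas + surge); err != nil {
		return err
	}
	if err := waitForReady(); err != nil {
		return err
	}
	// 2. Roll with plain maxUnavailable semantics: the extra old-revision replicas
	//    absorb the unavailability, so effective capacity never drops below `replicas`.
	if err := rollWithMaxUnavailable(surge); err != nil {
		return err
	}
	// 3. Scale back down to the desired replica count once the rollout succeeds.
	return scalePCS(replicas)
}
```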


### RollingUpdate (Default)

The default RollingUpdate behavior is described in the [Motivation](#motivation) section. When using the RollingUpdate strategy, `maxUnavailable` and `maxSurge` settings at the PodCliqueSet level are invalid and will be rejected by the validation webhook - PCS replicas are always updated one at a time sequentially. However, PC and PCSG `updateStrategy` settings (maxUnavailable/maxSurge) are observed and control how pods and PCSG replicas update within each PCS replica.
Contributor

Suggested change:
- The default RollingUpdate behavior is described in the [Motivation](#motivation) section. When using the RollingUpdate strategy, `maxUnavailable` and `maxSurge` settings at the PodCliqueSet level are invalid and will be rejected by the validation webhook - PCS replicas are always updated one at a time sequentially. However, PC and PCSG `updateStrategy` settings (maxUnavailable/maxSurge) are observed and control how pods and PCSG replicas update within each PCS replica.
+ The default ReplicaUpdate behavior is described in the [Motivation](#motivation) section. When using the ReplicaUpdate strategy, `maxUnavailable` and `maxSurge` settings at the PodCliqueSet level are invalid and will be rejected by the validation webhook - PCS replicas are always updated one at a time sequentially. However, PC and PCSG `updateStrategy` settings (maxUnavailable/maxSurge) are observed and control how pods and PCSG replicas update within each PCS replica.
