Validated with **Celeborn 0.6.2** on **Amazon EKS 1.30+**, using the TPC-DS 10 TB benchmark.
| 6 |**Ports**| Set all 4 worker ports to fixed values (9091 to 9094) | Dynamic `port=0` triggers `AssertionError` on every graceful shutdown. |
| 7 |**Graceful shutdown**|`celeborn.worker.graceful.shutdown.enabled: "true"`| Without it, abrupt worker exit causes Spark jobs to hang. |
| 8 |**Local shuffle reader**|`spark.sql.adaptive.localShuffleReader.enabled: "false"` on every job | If true, Spark reads from executor local disks where Celeborn data does not exist. Jobs fail with `FileNotFoundException`. |
| 9 |**terminationGracePeriodSeconds**| Set to at least 720s (600s graceful shutdown + 120s buffer) | Kubernetes default is 30s. If too short, SIGKILL fires before graceful shutdown completes, corrupting data and causing job failures. Must exceed `celeborn.worker.graceful.shutdown.timeout`. |
| 10 |**DNS registration**|`celeborn.network.bind.preferIpAddress: "false"`| Workers register with pod IPs by default. Pod IPs change on restart, so the master ends up with stale mappings and clients can't reconnect. DNS names are stable. |
| 11 |**Rolling restarts**|`kubectl delete pod` with 120s delay between workers | SIGTERM triggers graceful shutdown (requires rows 7 and 9 above). Replication covers the ~70s restart window. Zero job failures validated on TPC-DS 10 TB. |
| 12 |**Decommission API**| Optional. Use for 100+ worker clusters, not required for correctness | It stops new writes to a worker but does not migrate existing shuffle data. Fetch errors still happen (20-30 per worker) and are handled by Spark retries. Simple pod delete with replication achieves the same data safety outcome. |
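The worker-side items in this checklist can be sketched as one set of Helm values plus a pod spec override. The value paths below are illustrative, not authoritative; map them onto your chart's actual schema:

```yaml
# Sketch only: config keys come from the checklist, value paths are assumed
celebornConf:
  celeborn.worker.graceful.shutdown.enabled: "true"   # row 7
  celeborn.network.bind.preferIpAddress: "false"      # row 10: register DNS names, not pod IPs
  # Row 6: all four worker ports fixed, never port=0
  celeborn.worker.rpc.port: "9091"
  celeborn.worker.push.port: "9092"
  celeborn.worker.fetch.port: "9093"
  celeborn.worker.replicate.port: "9094"
worker:
  terminationGracePeriodSeconds: 720                  # row 9: 600s shutdown timeout + 120s buffer
```

Row 8 (`spark.sql.adaptive.localShuffleReader.enabled: "false"`) lives in each job's Spark conf rather than in the Celeborn deployment.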
With NVMe, a node failure means permanent data loss. It is highly recommended to enable replication:

```yaml
spark.celeborn.client.reserveSlots.rackAware.enabled: "true" # puts replicas on different nodes
```
:::danger
Without replication on NVMe, any node termination from Karpenter consolidation, spot interruption, or hardware failure causes immediate job failure. This is not a theoretical risk.
:::
**4. Rotate one node (one worker pod per node) at a time, never two at once**
With replication enabled, each shuffle partition exists on 2 workers. If you restart 2 workers at the same time, there is a window where some partitions have zero live copies. Always rotate sequentially: decommission, drain, wait for re-registration, then move to the next worker.
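A minimal sketch of that sequential rotation, assuming the `celeborn-worker` StatefulSet and `celeborn` namespace used elsewhere in this guide:

```shell
#!/usr/bin/env bash
# Rotate workers one at a time; replication covers each ~70s restart window.
set -euo pipefail

NS=celeborn
STS=celeborn-worker
PAUSE=120  # seconds between workers (see the rolling restart section)

REPLICAS=$(kubectl get statefulset "$STS" -n "$NS" -o jsonpath='{.spec.replicas}')
for i in $(seq 0 $((REPLICAS - 1))); do
  kubectl delete pod "${STS}-${i}" -n "$NS"           # SIGTERM -> graceful shutdown
  kubectl rollout status statefulset "$STS" -n "$NS"  # wait for the replacement pod
  sleep "$PAUSE"                                      # let the worker re-register
done
```

This is a sketch of the sequence, not a replacement for the repo's validated restart script.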
**5. Set `terminationGracePeriodSeconds` appropriately**
The `terminationGracePeriodSeconds` must be longer than `celeborn.worker.graceful.shutdown.timeout` (600s by default) to allow graceful shutdown to complete before Kubernetes sends SIGKILL.
Schedule NVMe maintenance during off-peak hours. Because each worker can take minutes to drain, rolling restarts on NVMe clusters take significantly longer than on EBS clusters.
- If `terminationGracePeriodSeconds` is too short, Kubernetes sends SIGKILL before graceful shutdown completes, causing data corruption
**For NVMe specifically:** While decommission typically drains in 0-5 seconds (only waits for in-flight writes), the graceful shutdown process still needs the full 600s to flush buffers and save metadata to RocksDB. The longer grace period ensures this completes even under heavy load.
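The invariant can be stated as a pair of settings; the 720s value follows the checklist's 600s shutdown timeout plus a 120s buffer:

```yaml
# Celeborn worker config
celeborn.worker.graceful.shutdown.enabled: "true"
celeborn.worker.graceful.shutdown.timeout: 600s   # default

# Worker pod spec: must exceed the timeout above
terminationGracePeriodSeconds: 720
```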
---
3. Worker ports set to `0` (dynamic) — `AssertionError` on every graceful shutdown
:::
### Large Cluster Tuning (100+ Workers)
#### Master Sizing
```yaml
  - "-Xmx64g"   # match to the master sizing table below
```
---
| **Test 1** | Simple pod delete | 20 to 30 "file not found" | Zero failures | ~70s | ~13 min |
| **Test 2** | Decommission API first | 62 errors (worker-5), 0 (worker-4) | Zero failures | ~70s | ~13 min |
#### Simple Restart with Replication
:::caution Not Recommended for Production
While our tests showed that simple pod restarts with replication enabled can work (zero job failures, zero executor losses), we **do not recommend this approach for production environments**.
**Production recommendation:** Always use the decommission API when performing rolling restarts or pod updates. The decommission API provides explicit coordination with the master, cleaner shutdown signals, and better observability, all critical for production operations.
The simple restart approach documented below is useful for understanding how Celeborn's replication and retry mechanisms work, but production deployments should use the decommission-based approach described in the next section.
:::
```bash
# Rolling restart: validated approach (testing/development only)
cd data-stacks/spark-on-eks/benchmarks/celeborn-benchmarks
./rolling-restart-celeborn.sh 120 # 120s pause between workers
```
Doing a rolling restart without graceful shutdown is not safe. GitHub issue [#3539](https://github.com/apache/celeborn/issues/3539) documents the failure mode: abrupt worker termination causes Spark jobs to hang with `"CommitManager: Worker shutdown, commit all its partition locations"`.
#### Production Approach: Decommission API
**This is the recommended approach for production environments.** The decommission API provides explicit coordination with the master and cleaner operational semantics.
```bash
# Decommission-based restart: production recommended
cd data-stacks/spark-on-eks/benchmarks/celeborn-benchmarks
```
:::tip
Decommission drain time is typically 0-5 seconds because it only waits for in-flight writes to complete, not for data migration. Existing shuffle files remain on disk and rely on replicas for availability during the restart.
:::
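A sketch of the per-worker decommission sequence. The exact REST path is wrapped by the repo's restart scripts, so `<decommission-endpoint>` below is a placeholder; the `app=celeborn-worker` label and worker HTTP port 9096 are also assumptions:

```shell
#!/usr/bin/env bash
# Decommission, wait for drain, then restart -- one worker at a time.
set -euo pipefail
NS=celeborn

for POD in $(kubectl get pods -n "$NS" -l app=celeborn-worker -o name); do
  # Tell the worker to stop accepting new writes (endpoint path is a placeholder)
  kubectl exec -n "$NS" "$POD" -- \
    curl -sf -X POST -H "Content-Type: application/json" \
    "http://localhost:9096/<decommission-endpoint>"
  sleep 5                                    # drain is typically 0-5s (in-flight writes only)
  kubectl delete pod -n "$NS" "${POD#pod/}"  # SIGTERM -> graceful shutdown
  sleep 120                                  # pause before the next worker
done
```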
### Storage Vertical Scaling
EBS volumes backing Celeborn workers can be resized online without pod restarts or data movement. Patch the existing PVCs directly and update the Helm values to match — Kubernetes handles the underlying volume expansion transparently.
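A sketch of the PVC patch loop. The StorageClass name and the `diskN-` claim prefix are assumptions; substitute the names from your deployment:

```shell
#!/usr/bin/env bash
# Online EBS expansion: patch every worker PVC; no pod restart required.
set -euo pipefail

NS=celeborn
STS=celeborn-worker
NEW_SIZE="2000Gi"

# The StorageClass must allow expansion (prints "true" if it does)
kubectl get storageclass gp3 -o jsonpath='{.allowVolumeExpansion}'

REPLICAS=$(kubectl get statefulset "$STS" -n "$NS" -o jsonpath='{.spec.replicas}')
for i in $(seq 0 $((REPLICAS - 1))); do
  for vol in 0 1 2 3; do  # 4 volumes per worker
    kubectl patch pvc "disk${vol}-${STS}-${i}" -n "$NS" --type merge \
      -p "{\"spec\":{\"resources\":{\"requests\":{\"storage\":\"${NEW_SIZE}\"}}}}"
  done
done
```

Because `volumeClaimTemplates` is immutable in StatefulSets, this only patches existing PVCs; also update the Helm values so workers added later get the larger size.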
### Blue-Green Worker Pool Upgrade
Two patterns exist for upgrading Celeborn workers, and the right choice depends on what is changing and how your deployment is managed.
**Worker pool replacement against shared masters** suits changes to immutable StatefulSet fields — instance type, storage class, node selectors — where only the workers need to change. Deploy a second worker StatefulSet with a new name pointing at the same existing masters, wait for the new workers to register and be confirmed healthy, then gracefully decommission the old workers via the Celeborn REST API so in-flight partitions drain completely before any pods terminate. The masters remain untouched, running Spark jobs never need to update their endpoint configuration, and rollback is as simple as scaling the old StatefulSet back up since EBS PVCs are preserved. Teams using ArgoCD should manage the second worker StatefulSet as a separate ArgoCD Application outside the primary Helm release, and remove it explicitly once migration is confirmed stable to avoid drift.
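The sequence might look like the following, reusing the `celeborn-worker-v2.yaml` naming from this guide; the log-grep verification step is illustrative only:

```shell
#!/usr/bin/env bash
# Worker pool replacement against shared masters (sketch)
set -euo pipefail

kubectl apply -f celeborn-nodepool-v2.yaml   # Karpenter NodePool for the new instance type
kubectl apply -f celeborn-worker-v2.yaml     # new worker StatefulSet, same master endpoints

# Confirm new workers registered before touching the old pool (illustrative check)
kubectl logs -n celeborn celeborn-master-0 | grep -i register | tail

# Decommission old workers one at a time (see the rolling restart section),
# then scale the old pool down; keep it around until the new pool is stable.
kubectl scale statefulset celeborn-worker -n celeborn --replicas=0
```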
**Full blue-green cluster deployment** suits major version upgrades that change master-worker wire compatibility, changes to the master layer itself, or any scenario requiring complete isolation before cutover. A complete second Celeborn cluster — masters and workers — is deployed alongside the existing one, and cutover happens by updating `spark.celeborn.master.endpoints` in Spark job configurations. This is the natural pattern for teams using GitOps tooling like ArgoCD or Flux, where both clusters are declarative manifests in Git and the cutover is a single config change that can be promoted across environments with approval gates. The tradeoff is managing two complete HA master quorums simultaneously until all in-flight jobs on the old cluster drain. When using ArgoCD, model each cluster as a separate Application or ApplicationSet pointing at versioned Helm values — the green cluster is promoted by updating the master endpoints in your Spark job values files, which ArgoCD then syncs automatically.
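The cutover itself is one value in each job's Spark conf. The hostnames and master RPC port below are illustrative:

```yaml
sparkConf:
  # Point jobs at the green cluster's master quorum
  spark.celeborn.master.endpoints: "celeborn-master-0.celeborn-green:9097,celeborn-master-1.celeborn-green:9097,celeborn-master-2.celeborn-green:9097"
```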
### Node Rotation with Karpenter
When Karpenter drains a node for consolidation, expiry, or drift, a `preStop` hook initiates the worker's graceful shutdown. The pod spec must give it time to finish:

```yaml
spec:
  template:
    spec:
      # Must be longer than celeborn.worker.graceful.shutdown.timeout (600s)
      # to allow graceful shutdown to complete before Kubernetes sends SIGKILL
      terminationGracePeriodSeconds: 720
```
To prevent Karpenter from consolidating Celeborn nodes while jobs are running, add `karpenter.sh/do-not-disrupt: "true"` to worker pods and set a conservative disruption policy on the NodePool.
### Karpenter Disruption Policy
**For production Celeborn clusters, disable automatic consolidation** and use controlled rolling restarts instead:
```yaml
# Karpenter NodePool for Celeborn workers
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: celeborn-workers
spec:
  disruption:
    consolidationPolicy: WhenEmpty   # Only consolidate completely empty nodes
    budgets:
      - nodes: "0"                   # Prevent any automatic disruption
        reasons:
          - Underutilized
          - Drifted
      - nodes: "1"                   # Allow one node at a time for empty node cleanup
        reasons:
          - Empty
```
**Why disable automatic consolidation?**
- Celeborn workers hold shuffle data that must be gracefully drained
- Automatic consolidation can disrupt multiple workers simultaneously
- Controlled rolling restarts (documented above) provide safer, predictable maintenance windows
- The decommission API provides explicit coordination that automatic consolidation cannot guarantee
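The `karpenter.sh/do-not-disrupt` annotation mentioned earlier belongs on the worker pod template; a sketch:

```yaml
# Worker StatefulSet excerpt: blocks Karpenter from voluntarily
# disrupting nodes while these pods are running
spec:
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: "true"
```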
**For AMI updates and node rotation:**
- Use the manual rolling restart procedures documented above
- Schedule during off-peak hours
- Control the pace (120s between workers)
- Monitor for issues before proceeding
**And add a PodDisruptionBudget for additional safety:**
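A sketch of such a PodDisruptionBudget; the selector label is an assumption and must match your worker pod labels:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: celeborn-worker-pdb
  namespace: celeborn
spec:
  maxUnavailable: 1        # never lose more than one worker to voluntary eviction
  selector:
    matchLabels:
      app: celeborn-worker
```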