Unified Helm chart for deploying Apache Spark 4.1.0 on Kubernetes with support for Spark Connect, GPU acceleration (RAPIDS), Apache Iceberg, Jupyter, and Airflow.
Deploy all core infrastructure components with a single command:
```bash
helm install my-spark charts/spark-4.1 \
  -f charts/spark-4.1/presets/core-baseline.yaml
```

This enables:

- MinIO (S3-compatible storage)
- PostgreSQL (database)
- Hive Metastore (metadata catalog)
- History Server (job history UI)

Other presets and scenarios are installed the same way:

```bash
# Core + GPU
helm install my-spark charts/spark-4.1 \
  -f charts/spark-4.1/presets/core-gpu.yaml

# Core + Iceberg
helm install my-spark charts/spark-4.1 \
  -f charts/spark-4.1/presets/core-iceberg.yaml

# Core + GPU + Iceberg
helm install my-spark charts/spark-4.1 \
  -f charts/spark-4.1/presets/core-gpu-iceberg.yaml

# Jupyter + Spark Connect on Kubernetes
helm install my-spark charts/spark-4.1 \
  -f charts/spark-4.1/values-scenario-jupyter-connect-k8s.yaml

# Jupyter + Spark Connect on Standalone
helm install my-spark charts/spark-4.1 \
  -f charts/spark-4.1/values-scenario-jupyter-connect-standalone.yaml

# Airflow + Spark Connect on Kubernetes
helm install my-spark charts/spark-4.1 \
  -f charts/spark-4.1/values-scenario-airflow-connect-k8s.yaml

# Airflow + Spark Connect on Standalone
helm install my-spark charts/spark-4.1 \
  -f charts/spark-4.1/values-scenario-airflow-connect-standalone.yaml

# Airflow + Kubernetes submit
helm install my-spark charts/spark-4.1 \
  -f charts/spark-4.1/values-scenario-airflow-k8s-submit.yaml

# Airflow + Spark Operator
helm install my-spark charts/spark-4.1 \
  -f charts/spark-4.1/values-scenario-airflow-operator.yaml
```

The chart provides unified configuration for core infrastructure components under the `core:` section.
```
┌─────────────────┐      ┌──────────┐
│  Spark Connect  │      │ Jupyter  │
└────────┬────────┘      └────┬─────┘
         │                    │
┌────────▼─────────────────────────────────┐
│           Core Infrastructure            │
├──────────────┬──────────────┬────────────┤
│    MinIO     │  PostgreSQL  │    Hive    │
│  (S3 Store)  │  (Metastore) │  Metastore │
└──────────────┴──────────────┴────────────┘
         │
┌────────▼────────┐      ┌──────────┐
│  History Server │      │ Airflow  │
└─────────────────┘      └──────────┘
```
- Spark Connect submits jobs to Kubernetes
- Jupyter provides interactive notebook interface
- MinIO stores event logs, warehouse data, and checkpoints
- PostgreSQL persists Hive Metastore catalog
- Hive Metastore manages table metadata for Spark SQL
- History Server reads event logs from MinIO for UI display
- Airflow orchestrates Spark job workflows
Stores event logs, warehouse data, and checkpoints.
```yaml
core:
  minio:
    enabled: true
    fullnameOverride: "minio-spark-41"
    buckets:
      - warehouse        # Lakehouse warehouse
      - spark-logs       # Spark event logs
      - spark-jobs       # Job artifacts
      - raw-data         # Raw input data
      - processed-data   # Processed output
      - checkpoints      # Streaming checkpoints
```

Access credentials:

- Default console: `http://minio-spark-41:9001`
- Default API: `http://minio-spark-41:9000`
- Default access key: `minioadmin`
- Default secret key: `minioadmin`
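For scripting against these defaults, the endpoint and credentials can be collected into one helper that emits the `fs.s3a.*` settings Spark expects. This is a minimal sketch; the function name and the idea of centralizing the settings are illustrative, while the keys and default values come from the chart configuration above:

```python
def s3a_settings(endpoint="http://minio-spark-41:9000",
                 access_key="minioadmin",
                 secret_key="minioadmin"):
    """Return the Hadoop S3A settings Spark needs to reach MinIO."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # MinIO serves buckets path-style, not virtual-host style
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }

conf = s3a_settings()
```

Each key/value pair can then be passed to `SparkSession.builder.config(k, v)` in a loop, keeping credentials out of the session-building code.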
S3 Configuration in Spark:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio-spark-41:9000") \
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin") \
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .getOrCreate()
```

Backend database for Hive Metastore.
```yaml
core:
  postgresql:
    enabled: true
    fullnameOverride: "postgresql-metastore-41"
    auth:
      username: hive
      password: "hive123"
      database: metastore
```

Connection string:

```
jdbc:postgresql://postgresql-metastore-41:5432/metastore
```
Centralized metadata catalog for Spark SQL and Lakehouse operations.
```yaml
core:
  hiveMetastore:
    enabled: true
    fullnameOverride: "hive-metastore-41"
    warehouseDir: "s3a://warehouse/spark-41"
```

Key features:

- Table metadata management
- Partition discovery
- Schema enforcement
- ACID transaction support (with Iceberg)
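Clients reach the Metastore over its Thrift endpoint. As a sketch (the helper name is illustrative; the conf keys are standard Spark/Hive ones, and the URI and warehouse path match the chart values used elsewhere in this document), the relevant settings can be grouped the same way as the S3 ones:

```python
def metastore_settings(uri="thrift://hive-metastore-41:9083",
                       warehouse_dir="s3a://warehouse/spark-41"):
    """Spark settings for talking to the shared Hive Metastore."""
    return {
        # Thrift endpoint of the Hive Metastore service
        "spark.hadoop.hive.metastore.uris": uri,
        # Default location for managed tables
        "spark.sql.warehouse.dir": warehouse_dir,
    }

conf = metastore_settings()
```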
Web UI for viewing completed Spark application history.
```yaml
core:
  historyServer:
    enabled: true
    fullnameOverride: "history-server-41"
    eventLogDir: "s3a://spark-logs/"
```

Access UI:

```bash
kubectl port-forward service/history-server-41 18080:18080
# Open http://localhost:18080
```

Enable the NVIDIA RAPIDS plugin for GPU-accelerated Spark operations.
Requirements:
- Kubernetes nodes with NVIDIA GPUs
- NVIDIA Device Plugin installed
- CUDA 12.0 compatible drivers
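The `taskResourceAmount` setting in the configuration below controls GPU sharing: with a value of `0.25`, up to four tasks run concurrently on one GPU, bounded by the executor's CPU cores. A quick sanity check of that arithmetic (pure Python, illustrative only — Spark's actual scheduler applies further constraints):

```python
def concurrent_gpu_tasks(task_gpu_amount: float, executor_cores: int) -> int:
    """Tasks that can run at once on an executor with one GPU.

    Spark schedules at most 1/task_gpu_amount tasks per GPU, but never
    more tasks than the executor has cores for.
    """
    per_gpu = int(1 / task_gpu_amount)
    return min(per_gpu, executor_cores)

# With taskResourceAmount: "0.25" and a 4-core executor, 4 tasks share the GPU.
print(concurrent_gpu_tasks(0.25, 4))
```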
Configuration:
```yaml
features:
  gpu:
    enabled: true
    cudaVersion: "12.0"
    nvidiaVisibleDevices: "all"
    taskResourceAmount: "0.25"   # Fraction of GPU per task
    rapids:
      plugins: "com.nvidia.spark.SQLPlugin"
      sql:
        enabled: true
      python:
        enabled: true
      fallback:
        enabled: true            # Fall back to CPU if GPU fails
      memory:
        allocFraction: "0.8"
        maxAllocFraction: "0.9"
        minAllocFraction: "0.3"
      shuffle:
        enabled: true
      format:
        parquet:
          read: true
          write: true
        orc:
          read: true
        csv:
          read: true
        json:
          read: true
```

Verifying GPU:
```python
# In PySpark
spark.conf.get("spark.rapids.sql.enabled")
# Should return 'true'

spark.conf.get("spark.task.resource.gpu.amount")
# Should return '0.25'
```

Enable Apache Iceberg for ACID transactions, time travel, and schema evolution.
Catalog Types:
- Hadoop Catalog (default, recommended for S3):

  ```yaml
  features:
    iceberg:
      enabled: true
      catalogType: "hadoop"
      warehouse: "s3a://warehouse/iceberg"
      ioImpl: "org.apache.iceberg.hadoop.HadoopFileIO"
  ```

- Hive Catalog (requires Hive Metastore):

  ```yaml
  features:
    iceberg:
      enabled: true
      catalogType: "hive"
      warehouse: "s3a://warehouse/iceberg"
      uri: "thrift://hive-metastore-41:9083"
  ```

- REST Catalog (new in Spark 4.1):

  ```yaml
  features:
    iceberg:
      enabled: true
      catalogType: "rest"
      warehouse: "s3a://warehouse/iceberg"
      uri: "http://iceberg-rest-catalog:8181"
  ```

Creating Iceberg tables:
```python
# Using Hadoop catalog
spark.sql("""
    CREATE TABLE my_table (
        id LONG,
        name STRING
    )
    USING iceberg
    LOCATION 's3a://warehouse/iceberg/my_table'
""")

# Using Hive catalog
spark.sql("""
    CREATE TABLE my_catalog.my_db.my_table (
        id LONG,
        name STRING
    )
    USING iceberg
""")

# Using REST catalog (Spark 4.1+)
spark.sql("""
    CREATE TABLE my_catalog.my_db.my_table (
        id LONG,
        name STRING
    )
    USING iceberg
""")
```

Additional buckets for Iceberg:
```yaml
core:
  minio:
    buckets:
      - warehouse
      - spark-logs
      - iceberg-metadata   # Iceberg metadata files
```

Pre-configured values files for common scenarios.
| Preset | Description | Features |
|---|---|---|
| `core-baseline` | Core components only | MinIO, PostgreSQL, Hive Metastore, History Server |
| `core-gpu` | Core + GPU | All core components + RAPIDS GPU acceleration |
| `core-iceberg` | Core + Iceberg | All core components + Apache Iceberg Lakehouse |
| `core-gpu-iceberg` | Core + GPU + Iceberg | All components + GPU + Iceberg |
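The preset file names follow a predictable pattern, so tooling can derive the right `-f` argument from feature flags. A sketch — the helper is hypothetical; the file names come from the table above:

```python
def preset_file(gpu: bool = False, iceberg: bool = False) -> str:
    """Map feature flags to a preset values file under charts/spark-4.1/presets/."""
    name = "core"
    if gpu:
        name += "-gpu"
    if iceberg:
        name += "-iceberg"
    if not gpu and not iceberg:
        name += "-baseline"
    return f"charts/spark-4.1/presets/{name}.yaml"

print(preset_file(gpu=True, iceberg=True))
```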
Usage:
```bash
helm install my-spark charts/spark-4.1 \
  -f charts/spark-4.1/presets/core-baseline.yaml
```

| Scenario | File | Backend | Features |
|---|---|---|---|
| Jupyter + Connect | `values-scenario-jupyter-connect-k8s.yaml` | Kubernetes | JupyterLab + Spark Connect |
| Jupyter + Connect | `values-scenario-jupyter-connect-standalone.yaml` | Standalone | JupyterLab + Spark Standalone |
| Airflow + Connect | `values-scenario-airflow-connect-k8s.yaml` | Kubernetes | Airflow + Connect |
| Airflow + Connect | `values-scenario-airflow-connect-standalone.yaml` | Standalone | Airflow + Standalone |
| Airflow + K8s Submit | `values-scenario-airflow-k8s-submit.yaml` | Kubernetes | Airflow + K8s Submit |
| Airflow + Operator | `values-scenario-airflow-operator.yaml` | Operator | Airflow + Spark Operator |
| Airflow + GPU + Connect | `values-scenario-airflow-gpu-connect-k8s.yaml` | Kubernetes | Airflow + GPU + Connect |
| Airflow + Iceberg + Connect | `values-scenario-airflow-iceberg-connect-k8s.yaml` | Kubernetes | Airflow + Iceberg + Connect |
Usage:
```bash
helm install my-spark charts/spark-4.1 \
  -f charts/spark-4.1/values-scenario-jupyter-connect-k8s.yaml
```

Override specific values:

```bash
helm install my-spark charts/spark-4.1 \
  -f charts/spark-4.1/presets/core-baseline.yaml \
  --set core.minio.persistence.size=100Gi \
  --set core.historyServer.resources.requests.memory=2Gi
```

Top-level sections in `values.yaml`:

```yaml
global:
  imagePullSecrets: []
  s3:
    endpoint: "http://minio:9000"
    accessKey: "minioadmin"
    secretKey: "minioadmin"
    pathStyleAccess: true
    sslEnabled: false
    existingSecret: "s3-credentials"
  postgresql:
    host: "postgresql-metastore-41"
    port: 5432
    user: "hive"
    password: "hive123"

rbac:
  create: true
  serviceAccountName: "spark-41"

connect:
  enabled: false
  replicas: 1
  image:
    repository: "apache/spark"
    tag: "4.1.0"
  resources:
    requests:
      memory: "2Gi"
      cpu: "1"
    limits:
      memory: "4Gi"
      cpu: "2"

jupyter:
  enabled: false
  replicas: 1
  image:
    repository: "jupyter/scipy-notebook"
    tag: "latest"
  resources:
    requests:
      memory: "2Gi"
      cpu: "1"

airflow:
  enabled: false
  replicas: 1
  image:
    repository: "apache/airflow"
    tag: "2.8.0-python3.10"
```

See `values.yaml` for the complete configuration reference.
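The `--set` flags shown earlier address nested values with dotted paths: Helm expands `core.minio.persistence.size=100Gi` into nested YAML before merging. A minimal sketch of that expansion (ignoring Helm's escaping, list-index, and type-coercion rules):

```python
def set_to_dict(expr: str) -> dict:
    """Expand a Helm --set 'a.b.c=value' expression into a nested dict."""
    path, value = expr.split("=", 1)
    result: dict = {}
    node = result
    keys = path.split(".")
    for key in keys[:-1]:
        # Descend, creating intermediate maps as needed
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return result

print(set_to_dict("core.minio.persistence.size=100Gi"))
```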
Migrating from Spark 3.5 to Spark 4.1 requires several changes due to API updates and new features.
| Feature | Spark 3.5 | Spark 4.1 |
|---|---|---|
| Python API | `pyspark` | `pyspark` (updated) |
| Connect Client | `spark://` | `sc://` (recommended) |
| Catalog API | `spark.catalog` | `spark.catalog` (enhanced) |
| Iceberg | 1.4.x | 1.5+ (with REST catalog) |
| GPU Support | RAPIDS 23.x | RAPIDS 24.x |
```yaml
# Before (Spark 3.5)
connect:
  image:
    tag: "3.5.8"

# After (Spark 4.1)
connect:
  image:
    tag: "4.1.0"
```

```python
# Before (Spark 3.5)
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .remote("spark://spark-connect-35:15002") \
    .getOrCreate()

# After (Spark 4.1)
spark = SparkSession.builder \
    .remote("sc://spark-connect-41:15002") \
    .getOrCreate()
```
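Scripted migration of notebook or DAG code can rewrite the remote URL scheme mechanically. A sketch (the helper name is illustrative; review each rewrite by hand, since standalone master URLs also use the `spark://` scheme):

```python
def migrate_connect_url(url: str) -> str:
    """Rewrite a Spark Connect remote URL from the 3.5 scheme to the 4.1 scheme."""
    if url.startswith("spark://"):
        return "sc://" + url[len("spark://"):]
    return url  # already migrated, or not a Connect URL

print(migrate_connect_url("spark://spark-connect-35:15002"))
```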
```yaml
# Before
fullnameOverride: "spark-connect-35"

# After
fullnameOverride: "spark-connect-41"
```

```yaml
# Before
rbac:
  serviceAccountName: "spark-35"

# After
rbac:
  serviceAccountName: "spark-41"
```

```yaml
# Before
warehouseDir: "s3a://warehouse/spark-35"

# After
warehouseDir: "s3a://warehouse/spark-41"
```
```yaml
# Spark 3.5
features:
  gpu:
    rapids:
      plugins: "com.nvidia.spark.SQLPlugin"
      sql:
        enabled: true

# Spark 4.1 (additional shuffle options)
features:
  gpu:
    rapids:
      plugins: "com.nvidia.spark.SQLPlugin,com.nvidia.spark.ShufflePlugin"
      sql:
        enabled: true
      shuffle:
        enabled: true   # New in 4.1
```

```yaml
# Spark 3.5 (Hadoop or Hive catalog only)
features:
  iceberg:
    catalogType: "hadoop"

# Spark 4.1 (can use REST catalog)
features:
  iceberg:
    catalogType: "rest"
    uri: "http://iceberg-rest-catalog:8181"
```

Data stored in MinIO and PostgreSQL is compatible between versions. However, you should:
- Backup existing data:

  ```bash
  kubectl exec -it deployment/minio-spark-35 -- mc mirror minio/ /backup/
  kubectl exec -it deployment/postgresql-metastore-35 -- pg_dump metastore > metastore_backup.sql
  ```

- Migrate warehouse data (if changing paths):

  ```bash
  # Copy data from old to new location
  kubectl exec -it deployment/minio-spark-41 -- mc cp --recursive minio/warehouse/spark-35/ minio/warehouse/spark-41/
  ```

- Update Hive Metastore pointers:
```sql
-- In PostgreSQL (Hive Metastore tables are created with quoted uppercase names)
UPDATE "DBS"
SET "DB_LOCATION_URI" = REPLACE("DB_LOCATION_URI", 'spark-35', 'spark-41')
WHERE "DB_LOCATION_URI" LIKE '%spark-35%';

UPDATE "SDS"
SET "LOCATION" = REPLACE("LOCATION", 'spark-35', 'spark-41')
WHERE "LOCATION" LIKE '%spark-35%';
```

If migration fails, roll back to Spark 3.5:

```bash
# Uninstall Spark 4.1
helm uninstall my-spark

# Restore from backup
kubectl exec -it deployment/minio-spark-35 -- mc mirror /backup/ minio/

# Reinstall Spark 3.5
helm install my-spark charts/spark-3.5 \
  -f charts/spark-3.5/presets/core-baseline.yaml
```

Test the migration in a development environment first:
```bash
# Install Spark 4.1 in a test namespace
helm install spark-test charts/spark-4.1 \
  -f charts/spark-4.1/presets/core-baseline.yaml \
  --namespace test

# Run test jobs
kubectl exec -it deployment/spark-test-connect -- spark-submit \
  --master k8s://https://kubernetes.default.svc \
  --deploy-mode client \
  /path/to/test_job.py

# Verify results
kubectl exec -it deployment/spark-test-connect -- spark-sql \
  -e "SELECT * FROM test_table LIMIT 10"
```

Check resource requests:

```bash
kubectl describe pod <pod-name>
```

Solution: Increase resource limits or add more nodes to the cluster.

Check event logs:

```bash
kubectl exec -it deployment/history-server-41 -- ls -la /spark-events
```

Solution: Verify MinIO is accessible and event logs are being written:

```bash
kubectl exec -it deployment/minio-spark-41 -- mc ls minio/spark-logs
```

Check GPU availability:

```bash
kubectl describe node <node-name> | grep nvidia.com/gpu
```

Verify RAPIDS configuration:

```bash
kubectl logs deployment/spark-connect-41 | grep rapids
```

Check database:

```bash
kubectl exec -it deployment/postgresql-metastore-41 -- psql -U hive -d metastore
```

Verify Metastore logs:

```bash
kubectl logs deployment/hive-metastore-41
```

Check S3 access:

```scala
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "minioadmin")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "minioadmin")
```

Verify warehouse location:

```bash
kubectl exec -it deployment/minio-spark-41 -- mc ls minio/warehouse
```

Check compatibility:

```bash
# Verify Spark version
kubectl exec -it deployment/spark-connect-41 -- spark-submit --version

# Check Connect client compatibility
python -c "import pyspark; print(pyspark.__version__)"
```

Solution: Use backward-compatible APIs and a gradual migration strategy.
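A version-mismatch check can be automated by comparing the version strings from the two commands above. This is a loose sketch — real compatibility rules for Spark Connect clients are more nuanced than a major-version match:

```python
def versions_compatible(client: str, server: str) -> bool:
    """Loose check: client and server should share the same major version."""
    c_major = int(client.split(".")[0])
    s_major = int(server.split(".")[0])
    return c_major == s_major

print(versions_compatible("4.1.0", "4.1.0"))  # same major version
print(versions_compatible("3.5.8", "4.1.0"))  # major version mismatch
```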
Enable debug logging:
```yaml
connect:
  sparkConf:
    "spark.executor.extraJavaOptions": "-Dlog4j.debug=true"
    "spark.driver.extraJavaOptions": "-Dlog4j.debug=true"
```

Access services locally:

```bash
# Spark Connect
kubectl port-forward service/spark-connect-41 15002:15002

# History Server
kubectl port-forward service/history-server-41 18080:18080

# MinIO Console
kubectl port-forward service/minio-spark-41-console 9001:9001

# Jupyter
kubectl port-forward service/jupyter-41 8888:8888

# Airflow Web UI
kubectl port-forward service/airflow-41-web 8080:8080
```

Upgrade the release:

```bash
helm upgrade my-spark charts/spark-4.1 \
  -f charts/spark-4.1/presets/core-baseline.yaml
```

Warning: Back up your data before upgrading:

```bash
kubectl exec -it deployment/minio-spark-41 -- mc mirror minio/ /backup/
```

Uninstall:

```bash
helm uninstall my-spark
```

Note: PVCs are not deleted by default. Delete them manually:

```bash
kubectl delete pvc -l app.kubernetes.io/instance=my-spark
```

- Main README - Project overview
- Spark 3.5 Chart - Spark 3.5 documentation
- Usage Guide - Complete user guide
- Quick Reference - Command cheat sheet
- Architecture - System architecture
Chart Version: 0.1.0 Spark Version: 4.1.0 Application Version: 4.1.0