Version: 0.1.0 | Last Updated: 2025-01-26
Modular Helm charts for deploying Apache Spark on Kubernetes. Deploy Spark Connect, Spark Standalone, and supporting components (Jupyter, Airflow, MLflow, MinIO, Hive Metastore, History Server) with preset configurations.
| Resource | Description | Link |
|---|---|---|
| Usage Guide | Complete user guide (RU/EN) | RU | EN |
| Quick Reference | Command cheat sheet (RU/EN) | RU | EN |
| Architecture | System architecture and components | Architecture |
| Recipes | Operations, Troubleshooting, Deployment, Integration | Recipes |
| Project Origins | Why spark_k8s — problem, solution, vision (EN/RU) | EN | RU |
| Pre-built Images | GHCR pull instructions, versioning | docs/guides/ghcr-images.md |
| What's New | Changelog and release notes | CHANGELOG.md |
| Platform | Status | Notes |
|---|---|---|
| Minikube | ✅ Tested | E2E + load tests validated |
| OpenShift | ✅ Prepared | PSS restricted / SCC restricted compatible |
See OpenShift notes for details.
| Chart | Description | Quick Start |
|---|---|---|
| spark-connect | Spark Connect server (gRPC) | helm install spark-connect charts/spark-3.5/charts/spark-connect |
| spark-standalone | Master + Workers + Airflow + MLflow | helm install spark-standalone charts/spark-3.5/charts/spark-standalone |
| Chart | Description | Quick Start |
|---|---|---|
| spark-4.1 | All-in-one: Connect, Jupyter, History Server, Hive Metastore | helm install spark charts/spark-4.1 |
| Component | Spark 3.5 | Spark 4.1 |
|---|---|---|
| Apache Spark | 3.5.7 | 4.1.0 |
| Python | 3.10 | 3.10 |
| Java | 17 | 17 |
Pre-configured values files for common scenarios:
| Scenario | Chart | Preset File | Backend |
|---|---|---|---|
| Jupyter + Connect (K8s) | 4.1 | values-scenario-jupyter-connect-k8s.yaml |
K8s |
| Jupyter + Connect (Standalone) | 4.1 | values-scenario-jupyter-connect-standalone.yaml |
Standalone |
| Airflow + Connect (K8s) | 4.1 | values-scenario-airflow-connect-k8s.yaml |
K8s |
| Airflow + Connect (Standalone) | 4.1 | values-scenario-airflow-connect-standalone.yaml |
Standalone |
| Airflow + K8s Submit | 4.1 | values-scenario-airflow-k8s-submit.yaml |
K8s |
| Airflow + Spark Operator | 4.1 | values-scenario-airflow-operator.yaml |
Operator |
| Jupyter + Connect (K8s) | 3.5 | values-scenario-jupyter-connect-k8s.yaml |
K8s |
| Jupyter + Connect (Standalone) | 3.5 | values-scenario-jupyter-connect-standalone.yaml |
Standalone |
| Airflow + Connect | 3.5 | values-scenario-airflow-connect.yaml |
Standalone |
| Airflow + K8s Submit | 3.5 | values-scenario-airflow-k8s-submit.yaml |
K8s |
| Airflow + Operator | 3.5 | values-scenario-airflow-operator.yaml |
Operator |
Usage:
# Spark 4.1 example
helm install spark charts/spark-4.1 -f charts/spark-4.1/values-scenario-jupyter-connect-k8s.yaml
# Spark 3.5 example
helm install spark-connect charts/spark-3.5/charts/spark-connect \
-f charts/spark-3.5/charts/spark-connect/values-scenario-jupyter-connect-k8s.yaml| Component | Description | Use Case |
|---|---|---|
| spark-connect | Remote Spark server (gRPC) | Data Scientists, Engineers |
| spark-standalone | Master + Workers cluster | Batch processing, ETL |
| jupyter | JupyterLab with remote Spark | Interactive notebooks |
| hive-metastore | Table metadata warehouse | SQL queries, ACID |
| history-server | Job history and metrics | Debugging, monitoring |
| airflow | Pipeline orchestration | DAG scheduling |
| mlflow | Experiment tracking | ML workflows |
| minio | S3-compatible storage | Object storage, event logs |
| Documentation | Scripts |
|---|---|
| Operations | scripts/recipes/operations |
| Troubleshooting | scripts/recipes/troubleshoot |
| Deployment | — |
| Integration | — |
Operations:
- Configure event log for MinIO
- Enable event log (Spark 4.1)
- Initialize Hive Metastore
Troubleshooting:
- S3 connection failed
- History Server empty
- Spark properties syntax
- Zstandard library missing
- Driver not starting
- Driver host resolution
- Helm installation label validation
- S3 credentials secret missing
- Connect crashloop (RBAC)
- Jupyter Python dependencies
Deployment:
- Deploy Spark Connect for new team
- Migrate Standalone → K8s
- Add History Server HA
- Setup resource quotas
Integration:
- Airflow + Spark Connect
- MLflow experiment tracking
- External Hive Metastore
- Kerberos authentication
- Prometheus monitoring
- ✅ Spark 3.5.7 and Spark 4.1.0 support
- ✅ 11 preset values files for production scenarios
- ✅ 23 operation, troubleshooting, deployment, and integration recipes
- ✅ Jupyter notebooks with remote Spark Connect
- ✅ MinIO S3-compatible storage with auto-configuration
- ✅ E2E test suite (Minikube validated)
- ✅ Load testing support (synthetic and parquet data)
- ✅ Policy-as-code validation (OPA/Conftest)
- ✅ Quick Reference Card
- ✅ ISSUE-031: Auto-create s3-credentials secret
- ✅ ✅ ISSUE-033: RBAC configmaps create permission
- ✅ ISSUE-034: Jupyter Python dependencies (grpcio, grpcio-status, zstandard)
- ✅ ISSUE-035: Parquet data loader upload mechanism
⚠️ ISSUE-030: Helm "N/A" label validation (workaround: install spark-base separately)
docs/
├── architecture/ # System architecture
│ └── spark-k8s-charts.md
├── guides/ # User guides
│ ├── en/ # English
│ │ ├── spark-k8s-constructor.md
│ │ └── quick-reference.md
│ └── ru/ # Russian
│ ├── spark-k8s-constructor.md
│ └── quick-reference.md
├── recipes/ # How-to guides
│ ├── operations/ # Day-to-day tasks
│ ├── troubleshoot/ # Problem diagnosis
│ ├── deployment/ # Setup procedures
│ └── integration/ # External systems
├── adr/ # Architectural decisions
├── issues/ # Issue reports
└── PROJECT_MAP.md # Repository map
See docs/PROJECT_MAP.md for complete navigation.
# Using preset
helm install spark charts/spark-4.1 \
-f charts/spark-4.1/values-scenario-jupyter-connect-k8s.yaml \
-n spark --create-namespace
# Or customize
helm install spark charts/spark-4.1 -n spark \
--set connect.enabled=true \
--set jupyter.enabled=true \
--set spark-base.minio.enabled=true# Port-forward Jupyter
kubectl port-forward -n spark svc/spark-4-1-spark-41-jupyter 8888:8888
# Open http://localhost:8888In Jupyter:
from pyspark.sql import SparkSession
spark = SparkSession.builder.remote("sc://spark-4-1-spark-41-connect:15002").getOrCreate()
df = spark.range(1000)
df.show()scripts/test-e2e-jupyter-connect.sh spark-test spark-connect 4.1Spark Connect supports three backend modes:
| Mode | Description | Use Case |
|---|---|---|
| k8s | Dynamic executors via Kubernetes API | Cloud-native, auto-scaling |
| standalone | Fixed cluster (master/workers) | Predictable resources, on-prem |
| operator | Spark Operator CRD-based | Advanced scheduling, pod templates |
Example (Standalone mode):
helm install spark charts/spark-4.1 -n spark \
--set connect.enabled=true \
--set connect.backendMode=standalone \
--set standalone.enabled=true# Validate all preset values
./scripts/validate-presets.sh# Validate against OPA policies
./scripts/validate-policy.sh# Helm template validation
helm template test charts/spark-4.1 -f charts/spark-4.1/values-scenario-*.yaml --dry-runSee CONTRIBUTING.md for development guidelines.
MIT License - see LICENSE file for details.
- Apache Spark community
- Kubernetes upstream
- Helm charts maintainers
Links: