Spark K8s Constructor: Apache Spark on Kubernetes


Version: 0.1.0 | Last Updated: 2025-01-26

Modular Helm charts for deploying Apache Spark on Kubernetes. Deploy Spark Connect, Spark Standalone, and supporting components (Jupyter, Airflow, MLflow, MinIO, Hive Metastore, History Server) with preset configurations.


Quick Links

| Resource | Description | Link |
|---|---|---|
| Usage Guide | Complete user guide (RU/EN) | RU / EN |
| Quick Reference | Command cheat sheet (RU/EN) | RU / EN |
| Architecture | System architecture and components | Architecture |
| Recipes | Operations, troubleshooting, deployment, integration | Recipes |
| Project Origins | Why spark_k8s: problem, solution, vision (EN/RU) | EN / RU |
| Pre-built Images | GHCR pull instructions and versioning | docs/guides/ghcr-images.md |
| What's New | Changelog and release notes | CHANGELOG.md |

Testing Status

| Platform | Status | Notes |
|---|---|---|
| Minikube | ✅ Tested | E2E and load tests validated |
| OpenShift | ✅ Prepared | Compatible with PSS restricted / SCC restricted |

See OpenShift notes for details.


Charts

Spark 3.5 (Modular Charts)

| Chart | Description | Quick Start |
|---|---|---|
| spark-connect | Spark Connect server (gRPC) | `helm install spark-connect charts/spark-3.5/charts/spark-connect` |
| spark-standalone | Master + Workers + Airflow + MLflow | `helm install spark-standalone charts/spark-3.5/charts/spark-standalone` |

Spark 4.1 (Unified Chart)

| Chart | Description | Quick Start |
|---|---|---|
| spark-4.1 | All-in-one: Connect, Jupyter, History Server, Hive Metastore | `helm install spark charts/spark-4.1` |

Component Versions

| Component | Spark 3.5 | Spark 4.1 |
|---|---|---|
| Apache Spark | 3.5.7 | 4.1.0 |
| Python | 3.10 | 3.10 |
| Java | 17 | 17 |

Preset Catalog

Pre-configured values files for common scenarios:

| Scenario | Chart | Preset File | Backend |
|---|---|---|---|
| Jupyter + Connect (K8s) | 4.1 | `values-scenario-jupyter-connect-k8s.yaml` | K8s |
| Jupyter + Connect (Standalone) | 4.1 | `values-scenario-jupyter-connect-standalone.yaml` | Standalone |
| Airflow + Connect (K8s) | 4.1 | `values-scenario-airflow-connect-k8s.yaml` | K8s |
| Airflow + Connect (Standalone) | 4.1 | `values-scenario-airflow-connect-standalone.yaml` | Standalone |
| Airflow + K8s Submit | 4.1 | `values-scenario-airflow-k8s-submit.yaml` | K8s |
| Airflow + Spark Operator | 4.1 | `values-scenario-airflow-operator.yaml` | Operator |
| Jupyter + Connect (K8s) | 3.5 | `values-scenario-jupyter-connect-k8s.yaml` | K8s |
| Jupyter + Connect (Standalone) | 3.5 | `values-scenario-jupyter-connect-standalone.yaml` | Standalone |
| Airflow + Connect | 3.5 | `values-scenario-airflow-connect.yaml` | Standalone |
| Airflow + K8s Submit | 3.5 | `values-scenario-airflow-k8s-submit.yaml` | K8s |
| Airflow + Operator | 3.5 | `values-scenario-airflow-operator.yaml` | Operator |

Usage:

```bash
# Spark 4.1 example
helm install spark charts/spark-4.1 -f charts/spark-4.1/values-scenario-jupyter-connect-k8s.yaml

# Spark 3.5 example
helm install spark-connect charts/spark-3.5/charts/spark-connect \
  -f charts/spark-3.5/charts/spark-connect/values-scenario-jupyter-connect-k8s.yaml
```

Components

| Component | Description | Use Case |
|---|---|---|
| spark-connect | Remote Spark server (gRPC) | Data scientists, engineers |
| spark-standalone | Master + Workers cluster | Batch processing, ETL |
| jupyter | JupyterLab with remote Spark | Interactive notebooks |
| hive-metastore | Table metadata warehouse | SQL queries, ACID |
| history-server | Job history and metrics | Debugging, monitoring |
| airflow | Pipeline orchestration | DAG scheduling |
| mlflow | Experiment tracking | ML workflows |
| minio | S3-compatible storage | Object storage, event logs |
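
Each component is switched on with an `enabled` toggle in the chart values, as the `--set` flags in the Quick Start below suggest. A minimal custom values file might look like the following sketch; only `connect.enabled`, `jupyter.enabled`, and `spark-base.minio.enabled` appear verbatim in this README, so verify every key against the chart's `values.yaml` before use:

```yaml
# my-values.yaml -- minimal interactive stack (sketch; verify key names
# against charts/spark-4.1/values.yaml)
connect:
  enabled: true        # Spark Connect gRPC server
jupyter:
  enabled: true        # JupyterLab wired to Connect
spark-base:
  minio:
    enabled: true      # S3-compatible storage for event logs
```

Then install with `helm install spark charts/spark-4.1 -n spark -f my-values.yaml`.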

Recipes

| Documentation | Scripts |
|---|---|
| Operations | scripts/recipes/operations |
| Troubleshooting | scripts/recipes/troubleshoot |
| Deployment | |
| Integration | |

Quick Recipe Index

Operations:

  • Configure event log for MinIO
  • Enable event log (Spark 4.1)
  • Initialize Hive Metastore

Troubleshooting:

  • S3 connection failed
  • History Server empty
  • Spark properties syntax
  • Zstandard library missing
  • Driver not starting
  • Driver host resolution
  • Helm installation label validation
  • S3 credentials secret missing
  • Connect crashloop (RBAC)
  • Jupyter Python dependencies

Deployment:

  • Deploy Spark Connect for new team
  • Migrate Standalone → K8s
  • Add History Server HA
  • Setup resource quotas

Integration:

  • Airflow + Spark Connect
  • MLflow experiment tracking
  • External Hive Metastore
  • Kerberos authentication
  • Prometheus monitoring

What's New in v0.1.0

Features

  • ✅ Spark 3.5.7 and Spark 4.1.0 support
  • ✅ 11 preset values files for production scenarios
  • ✅ 23 operation, troubleshooting, deployment, and integration recipes
  • ✅ Jupyter notebooks with remote Spark Connect
  • ✅ MinIO S3-compatible storage with auto-configuration
  • ✅ E2E test suite (Minikube validated)
  • ✅ Load testing support (synthetic and parquet data)
  • ✅ Policy-as-code validation (OPA/Conftest)
  • ✅ Quick Reference Card

Fixes

  • ✅ ISSUE-031: Auto-create s3-credentials secret
  • ✅ ISSUE-033: RBAC configmaps create permission
  • ✅ ISSUE-034: Jupyter Python dependencies (grpcio, grpcio-status, zstandard)
  • ✅ ISSUE-035: Parquet data loader upload mechanism

Known Issues

  • ⚠️ ISSUE-030: Helm "N/A" label validation (workaround: install spark-base separately)

Documentation Structure

```
docs/
├── architecture/          # System architecture
│   └── spark-k8s-charts.md
├── guides/                # User guides
│   ├── en/               # English
│   │   ├── spark-k8s-constructor.md
│   │   └── quick-reference.md
│   └── ru/               # Russian
│       ├── spark-k8s-constructor.md
│       └── quick-reference.md
├── recipes/               # How-to guides
│   ├── operations/       # Day-to-day tasks
│   ├── troubleshoot/     # Problem diagnosis
│   ├── deployment/       # Setup procedures
│   └── integration/      # External systems
├── adr/                   # Architectural decisions
├── issues/                # Issue reports
└── PROJECT_MAP.md         # Repository map
```

See docs/PROJECT_MAP.md for complete navigation.


Quick Start

1. Install Spark Connect + Jupyter (Spark 4.1)

```bash
# Using preset
helm install spark charts/spark-4.1 \
  -f charts/spark-4.1/values-scenario-jupyter-connect-k8s.yaml \
  -n spark --create-namespace

# Or customize
helm install spark charts/spark-4.1 -n spark \
  --set connect.enabled=true \
  --set jupyter.enabled=true \
  --set spark-base.minio.enabled=true
```

2. Connect to Spark

```bash
# Port-forward Jupyter
kubectl port-forward -n spark svc/spark-4-1-spark-41-jupyter 8888:8888

# Open http://localhost:8888
```

In Jupyter:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://spark-4-1-spark-41-connect:15002").getOrCreate()
df = spark.range(1000)
df.show()
```
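
The endpoint string follows Spark Connect's `sc://host:port` scheme, where 15002 is the server's default gRPC port. As a quick sanity check outside the cluster, the connection string can be split apart with the standard library — a small sketch, not part of the charts:

```python
from urllib.parse import urlparse

def connect_endpoint(url: str) -> tuple[str, int]:
    """Split a Spark Connect URL such as sc://host:15002 into (host, port)."""
    parsed = urlparse(url)
    if parsed.scheme != "sc":
        raise ValueError(f"expected an sc:// URL, got {url!r}")
    # Fall back to Spark Connect's default gRPC port when none is given.
    return parsed.hostname, parsed.port or 15002

print(connect_endpoint("sc://spark-4-1-spark-41-connect:15002"))
# -> ('spark-4-1-spark-41-connect', 15002)
```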

3. Run E2E Test

```bash
scripts/test-e2e-jupyter-connect.sh spark-test spark-connect 4.1
```

Backend Modes

Spark Connect supports three backend modes:

| Mode | Description | Use Case |
|---|---|---|
| k8s | Dynamic executors via the Kubernetes API | Cloud-native, auto-scaling |
| standalone | Fixed cluster (master/workers) | Predictable resources, on-prem |
| operator | Spark Operator (CRD-based) | Advanced scheduling, pod templates |

Example (Standalone mode):

```bash
helm install spark charts/spark-4.1 -n spark \
  --set connect.enabled=true \
  --set connect.backendMode=standalone \
  --set standalone.enabled=true
```
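
To make switching modes less error-prone, the flag combinations can be captured in a tiny helper. Only the standalone combination (including its `standalone.enabled=true` coupling) appears verbatim in this README; treat the `k8s` and `operator` variants as assumptions to verify against the chart's `values.yaml`:

```python
# Sketch: assemble the --set flags for each Spark Connect backend mode.
# Only the standalone combination is shown verbatim in this README; the
# k8s and operator variants are assumed, not verified chart values.
VALID_MODES = {"k8s", "standalone", "operator"}

def connect_set_flags(mode: str) -> list[str]:
    if mode not in VALID_MODES:
        raise ValueError(f"unknown backend mode: {mode!r}")
    flags = ["connect.enabled=true", f"connect.backendMode={mode}"]
    if mode == "standalone":
        # Standalone mode also needs the fixed master/worker cluster.
        flags.append("standalone.enabled=true")
    return flags

args = " ".join(f"--set {f}" for f in connect_set_flags("standalone"))
print(f"helm install spark charts/spark-4.1 -n spark {args}")
```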

Validation

Preset Validation

```bash
# Validate all preset values
./scripts/validate-presets.sh
```

Policy Validation

```bash
# Validate against OPA policies
./scripts/validate-policy.sh
```

Linting

```bash
# Helm template validation: render each preset to catch template errors
# (helm template is a client-side dry run, so no --dry-run flag is needed;
# a glob after a single -f would pass extra positional arguments)
for values in charts/spark-4.1/values-scenario-*.yaml; do
  helm template test charts/spark-4.1 -f "$values" > /dev/null
done
```

Contributing

See CONTRIBUTING.md for development guidelines.

License

MIT License - see LICENSE file for details.

Acknowledgments

  • Apache Spark community
  • Kubernetes upstream
  • Helm charts maintainers
