An innovative approach to implementing Spark jobs within the Spring Boot ecosystem, providing a developer-friendly environment. It integrates cross-cutting concerns as reusable libraries to minimize boilerplate code. Moreover, the framework supports one-click deployment of Spark jobs via RESTful APIs, making it easy to run jobs locally, on Minikube, or on Kubernetes.
Apache Spark is a distributed computing framework designed for fast and efficient large-scale data processing.
Its architecture enables in-memory computing, making it significantly faster than traditional disk-based systems like Hadoop MapReduce.
Here’s a brief breakdown of Spark’s architecture:
- Core Components
- Driver: The driver is the central control unit that coordinates the execution of tasks. It:
- Converts user-defined code into tasks.
- Distributes these tasks across executors.
- Tracks the progress and status of the tasks.
- Executors: These are worker processes running on cluster nodes. They:
- Execute the tasks assigned by the driver.
- Store data partitions in memory or disk, enabling iterative computations.
- Cluster Manager: Spark relies on a cluster manager (e.g., YARN, Mesos, or Kubernetes) to manage resources and schedule jobs across the cluster.
Resilient Distributed Dataset (RDD)
At the core of Spark’s architecture is the RDD, a fault-tolerant and immutable distributed collection of objects.
RDDs allow parallel operations and automatic recovery from node failures.
Recent Spark versions have introduced and recommend DataFrames and Datasets, which are backed by RDDs.
Key Abstractions
- Transformations: Operations like `map()` or `filter()` create a new RDD from an existing one. Transformations are lazy and executed only when an action is called.
- Actions: Operations like `reduce()` or `collect()` trigger the computation of RDDs and return a result (see the Java example below the architecture diagram).
- Execution Workflow
- Spark splits the execution of a job into stages.
- Within each stage, tasks are executed in parallel across the cluster.
- The Directed Acyclic Graph (DAG) Scheduler optimizes task execution by constructing a dependency graph of RDD transformations.
Apache Spark Distributed Architecture from Spark in Action.
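To make the lazy-evaluation behaviour concrete, here is a minimal, self-contained Java sketch run against a local master: `filter()` and `map()` only record lineage in the DAG, while `reduce()` triggers the actual computation. The class name and sample data are illustrative only.

```java
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

// Minimal sketch of lazy transformations vs. actions, run with a local master.
public class LazyEvaluationDemo {

  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("lazy-evaluation-demo")
        .master("local[*]")
        .getOrCreate();

    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

    JavaRDD<Integer> numbers = jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

    // Transformations: lazily build the lineage (DAG); nothing executes yet.
    JavaRDD<Integer> evens = numbers.filter(n -> n % 2 == 0);
    JavaRDD<Integer> squares = evens.map(n -> n * n);

    // Action: triggers the actual computation and returns a result to the driver.
    int sum = squares.reduce(Integer::sum);
    System.out.println("Sum of squares of even numbers: " + sum); // prints 20

    spark.stop();
  }
}
```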
Spring Boot has become the de facto standard for Java application development, offering a robust framework for building scalable, enterprise-grade applications. However, building Spark jobs with Spring Boot and deploying them locally or on Kubernetes can be tedious. This framework aims to simplify that process by providing a unified solution for building Spark jobs with Spring Boot, running them locally, and deploying them on Minikube or Kubernetes.
- Java 17. Spark does not support Java versions above 17 yet; see Spark Java compatibility.
- Maven. Make sure the environment variable `M2_REPO` is set to the local Maven repository `<your user home>/.m2/repository`.
- Scala 2.12.18
- Spark 3.5.3. Make sure the environment variable `SPARK_HOME` is set to the local Spark installation.
- Docker. Make sure Docker is allocated enough resources.
- Minikube. Make sure Docker is allocated enough resources.
- IDE (IntelliJ, Eclipse or VS Code); IntelliJ IDEA is recommended.
- Optional: configure the formatter in IntelliJ; refer to fmt-maven-plugin for details.
sdkman is recommended for managing Java and Scala installations.
Make sure `JAVA_HOME` is set to the Java 17 installation path and the `PATH` variable contains an entry for `$JAVA_HOME/bin`.
Check the Java version as follows. The output should look like the following, showing major version 17.
% java -version
openjdk version "17.0.12" 2024-07-16
OpenJDK Runtime Environment Temurin-17.0.12+7 (build 17.0.12+7)
OpenJDK 64-Bit Server VM Temurin-17.0.12+7 (build 17.0.12+7, mixed mode)
Check the Scala version as follows. The output should look like the following, showing Scala version 2.12.18.
% scala -version
Scala code runner version 2.12.18 -- Copyright 2002-2023, LAMP/EPFL and Lightbend, Inc.
Download and extract spark-3.5.3-bin-hadoop3 on your machine and set the following environment variables.
export SPARK_HOME="/<your directory>/spark-3.5.3-bin-hadoop3"
export SPARK_CONF_DIR=$SPARK_HOME/conf
export PATH="$SPARK_HOME/bin:$PATH"
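Optionally, verify the Spark installation as well; the version banner should report Spark 3.5.3 built for Scala 2.12.

```bash
# Should print the Spark version banner showing version 3.5.3 and Scala 2.12.x.
spark-submit --version
```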
The demo jobs and `spark-job-service` need the following services up and running.
- Make sure Postgres is running at `localhost:5432` with username `postgres` and password `admin`. Create databases `spark_jobs_db` and `error_logs_db` if they do not exist.
- Make sure MongoDB is running at `localhost:27017`.
- Make sure ArangoDB is running at `localhost:8529` with `root` password as `admin`.
- Make sure Kafka is running with bootstrap servers `localhost:9092`.
- Make sure Kafka UI is running at `http://localhost:8100`. Create topics `job-stop-requests` and `error-logs` if they do not exist.
Important
It is recommended to keep the port numbers the same as mentioned above; otherwise you may need to make changes in multiple places, e.g. in each job's `application-local.yml`, the `spark-job-service` application ymls, and the deployment yml.
All these services can be installed locally on your machine and should be accessible at the above-mentioned URLs and credentials (wherever applicable).
- The docker-compose.yml file defines the services and configurations to run the required infrastructure in Docker.
- In Terminal, go to the project root `spring-boot-spark-kubernetes`, execute the following command and confirm that all services are running.
docker compose up -d
- Create databases `spark_jobs_db` and `error_logs_db` in Postgres, and Kafka topics `job-stop-requests` and `error-logs`, if they do not exist (see the sketch after this list).
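If you prefer doing this from the command line, here is a minimal sketch using the Docker containers started by docker-compose.yml. It assumes the containers are named `postgres` and `kafka` (check `docker ps` for the actual names) and that the Kafka image puts the `kafka-topics` CLI on the PATH (on some images it is `kafka-topics.sh`).

```bash
# Create the Postgres databases (each command simply errors if the database already exists).
docker exec -it postgres psql -U postgres -c "CREATE DATABASE spark_jobs_db;"
docker exec -it postgres psql -U postgres -c "CREATE DATABASE error_logs_db;"

# Create the Kafka topics; --if-not-exists makes the commands idempotent.
docker exec -it kafka kafka-topics --bootstrap-server localhost:9092 --create --if-not-exists \
  --topic job-stop-requests --partitions 1 --replication-factor 1
docker exec -it kafka kafka-topics --bootstrap-server localhost:9092 --create --if-not-exists \
  --topic error-logs --partitions 1 --replication-factor 1
```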
Important
While using Docker Compose, make sure the required ports are free on your machine; otherwise a port-busy error may be thrown.
- In Terminal, go to the project root `spring-boot-spark-kubernetes` and execute the following commands to create a namespace `ksoot` and the necessary Kubernetes services in that namespace. Refer to the Kubernetes configuration files section for more details.
kubectl apply -f infra-k8s-deployment.yml
kubectl apply -f spark-rbac.yml
- Set the default namespace to `ksoot` in Minikube. You can always roll back to the default namespace.
kubectl config set-context --current --namespace=ksoot
- Check if all infra pods are running.
kubectl get pods
The output should look like the following.
NAME READY STATUS RESTARTS AGE
arango-65d6fff6c5-4bjwq 1/1 Running 0 6m16s
kafka-74c8d9579f-jmcr5 1/1 Running 0 6m16s
kafka-ui-797446869-9d8zw 1/1 Running 0 6m16s
mongo-6785c5cf8b-mtbk7 1/1 Running 0 6m16s
postgres-685b766f66-7dnsl 1/1 Running 0 6m16s
zookeeper-6fc87d48df-2t5pf 1/1 Running 0 6m16s
- Establish a Minikube tunnel to expose services of type `LoadBalancer` running in the Minikube cluster to your local machine. It creates a bridge between your local network and the Minikube cluster, making the required infrastructure accessible locally.
minikube tunnel
Keep it running in a separate terminal. The output should look like the following.
✅ Tunnel successfully started
📌 NOTE: Please do not close this terminal as this process must stay alive for the tunnel to be accessible ...
🏃 Starting tunnel for service arango.
🏃 Starting tunnel for service kafka.
🏃 Starting tunnel for service kafka-ui.
🏃 Starting tunnel for service mongo.
🏃 Starting tunnel for service postgres.
🏃 Starting tunnel for service zookeeper.
- There is no need to create any databases or Kafka topics required by the applications, as they are created automatically by infra-k8s-deployment.yml.
- Job Launching: Trigger Spark jobs via a REST endpoint for deployment locally or on Kubernetes. Either job jars or Docker images can be launched using `spark-submit`.
- Job Termination: Accept requests to stop running jobs via a REST endpoint, though this is not a guaranteed method; you may need to kill the job manually if it is not terminated this way.
- Job Monitoring: Track job status, start and end time, duration, and error messages (if any) via REST endpoints.
- Auto-configurations: Common components such as `SparkSession`, a job lifecycle listener, and connectors to read from and write to various datasources (see the sketch after this list).
- Demo Jobs: A Spark batch job and a Spark streaming job, to start with.
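As a rough illustration of how the auto-configured components are meant to be consumed, the sketch below injects a `SparkSession` into a Spring-managed bean. It assumes the `spring-boot-starter-spark` auto-configuration exposes a `SparkSession` bean; the class name, method, and file path are hypothetical.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.springframework.stereotype.Component;

// Hypothetical component: relies on a SparkSession bean assumed to be
// auto-configured by spring-boot-starter-spark.
@Component
public class SalesReader {

  private final SparkSession sparkSession;

  public SalesReader(final SparkSession sparkSession) {
    this.sparkSession = sparkSession;
  }

  // Reads a CSV file into a DataFrame; the path argument is illustrative only.
  public Dataset<Row> readSales(final String path) {
    return this.sparkSession.read().option("header", "true").csv(path);
  }
}
```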
The framework consists of the following components. Refer to the respective project's README for details.
- spark-job-service: A Spring Boot application to launch Spark jobs and monitor their status.
- spring-boot-starter-spark: Spring boot starter for Spark.
- spark-job-commons: A library to provide common Job components and utilities for Spark jobs.
- spark-batch-sales-report-job: A demo Spark Batch Job to generate Monthly sales reports.
- spark-stream-logs-analysis-job: A demo Spark Streaming Job to analyze logs in real-time.
The framework includes Kubernetes configuration files to deploy the required infrastructure and services in a Kubernetes cluster in the namespace `ksoot`. You can change the namespace in these two files as per your requirement.
Each service is configured with necessary environment variables, volume mounts, and ports to ensure proper operation within the Kubernetes cluster.
- The infra-k8s-deployment.yml file defines the Kubernetes resources required to deploy various services.
- Namespace: Creates a namespace `ksoot`.
- MongoDB: Deployment, PersistentVolumeClaim, and Service for MongoDB.
- ArangoDB: Deployment, PersistentVolumeClaim, and Service for ArangoDB.
- PostgreSQL: Deployment, PersistentVolumeClaim, ConfigMap (for initialization script), and Service for PostgreSQL.
- Zookeeper: Deployment, PersistentVolumeClaims (for data and logs), and Service for Zookeeper.
- Kafka: Deployment, PersistentVolumeClaim, and Service for Kafka.
- Kafka UI: Deployment and Service for Kafka UI.
- The spark-rbac.yml file defines the Kubernetes RBAC (Role-Based Access Control) resources required to allow Spark to manage Driver and Executor pods within the `ksoot` namespace. A rough sketch of what such resources look like follows this list.
- ClusterRoleBinding: Binds the default `ServiceAccount` to the `cluster-admin` `ClusterRole`, allowing it to have cluster-wide administrative privileges.
- ServiceAccount: Creates a ServiceAccount named `spark`.
- ClusterRoleBinding: Binds the `spark` ServiceAccount to the `edit` `ClusterRole`, granting it permissions to edit resources within the namespace.
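For orientation, the sketch below shows roughly what the ServiceAccount and its role binding look like; the repository's spark-rbac.yml is the authoritative version, and the binding name used here is hypothetical.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: ksoot
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: spark-edit-binding   # hypothetical name; see spark-rbac.yml for the real one
subjects:
  - kind: ServiceAccount
    name: spark
    namespace: ksoot
roleRef:
  kind: ClusterRole
  name: edit
  apiGroup: rbac.authorization.k8s.io
```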
- Individual Spark jobs can be run as Spring Boot applications locally in your favorite IDE. Refer to sales-report-job and logs-analysis-job.
- Spark jobs can be launched locally via the REST API provided by `spark-job-service`. Refer to spark-job-service for details.
Important
Spark jobs' `jar` files from the Maven repository are deployed locally via the `spark-submit` command.
- Make sure the Minikube infra is ready as mentioned in the Minikube Environment setup section.
- Build a custom base Docker image for Spark for more control over it; refer to the base Dockerfile for details. Spark ships with a lot of jars at `${SPARK_HOME}/jars`, some of which may conflict with your application jars, so you may need to exclude such jars from Spark.
For example, the following conflicting jars are excluded from Spark in the base Dockerfile.
# Remove any Spark jars that may conflict with the ones in your application dependencies.
rm -f jars/protobuf-java-2.5.0.jar; \
rm -f jars/guava-14.0.1.jar; \
rm -f jars/HikariCP-2.5.1.jar; \
- In Terminal, go to the root project `spring-boot-spark-kubernetes` and execute the following command to build the Spark base Docker image. All jobs' Dockerfiles extend from this image.
docker image build . -t ksoot/spark:3.5.3 -f Dockerfile
- In Terminal, go to the project `spring-boot-spark-kubernetes/spark-batch-sales-report-job` and execute the following command to build the Docker image for `sales-report-job`.
docker image build . -t spark-batch-sales-report-job:0.0.1 -f Dockerfile
- In Terminal, go to the project `spring-boot-spark-kubernetes/spark-stream-logs-analysis-job` and execute the following command to build the Docker image for `logs-analysis-job`.
docker image build . -t spark-stream-logs-analysis-job:0.0.1 -f Dockerfile
- Load the job images into Minikube.
minikube image load spark-batch-sales-report-job:0.0.1
minikube image load spark-stream-logs-analysis-job:0.0.1
- In Terminal, go to the project `spring-boot-spark-kubernetes/spark-job-service` and execute the following command to build the Docker image for `spark-job-service`.
docker image build . -t spark-job-service:0.0.1 -f Dockerfile
- Load the `spark-job-service` image into Minikube.
minikube image load spark-job-service:0.0.1
- Make sure the environment setup on Minikube is already done and the application artifacts are ready.
Important
No configuration change is required, except where specifically asked, to run this code locally.
- You can override any configurations that are defined in the spark-job-service application.yml and in the Spark jobs, using environment variables in the spark-job-service deployment.yml as follows.
env:
# Spark properties
- name: SPARK_MASTER
value: k8s://https://kubernetes.default.svc
- name: SPARK_KUBERNETES_NAMESPACE
value: ksoot
- name: SPARK_KUBERNETES_AUTHENTICATE_DRIVER_SERVICE_ACCOUNT_NAME
value: spark
- name: SPARK_KUBERNETES_DRIVER_ENV_SPARK_USER
value: spark
- name: SPARK_SUBMIT_DEPLOY_MODE
value: cluster
# Application properties
- name: KAFKA_BOOTSTRAP_SERVERS
value: kafka:9092
- name: POSTGRES_URL
value: jdbc:postgresql://postgres:5432/spark_jobs_db
- name: JDBC_URL
value: jdbc:postgresql://postgres:5432
- name: PERSIST_JOBS
value: "true"
- name: CAPTURE_JOBS_LOGS
value: "true"
- You can override any configurations that are not defined in the spark-job-service application.yml by passing arguments as follows.
args:
- "--spark.executor.instances=2"
- "--spark.default.parallelism=16"
- "--spark-launcher.env.POSTGRES_URL=jdbc:postgresql://postgres:5432/spark_jobs_db"
- "--spark-launcher.env.KAFKA_BOOTSTRAP_SERVERS=kafka:9092"
- "--spark-launcher.jobs.sales-report-job.env.MONGODB_URL=mongodb://mongo:27017"
- "--spark-launcher.jobs.sales-report-job.env.ARANGODB_URL=arango:8529"
- "--spark-launcher.jobs.logs-analysis-job.env.JDBC_URL=jdbc:postgresql://postgres:5432"
Important
In production, it is recommended to use Kubernetes Secrets for sensitive information such as passwords, tokens, and keys.
- Execute the following command to deploy `spark-job-service` on Minikube.
kubectl apply -f deployment.yml
- Verify that the `spark-job-service` pod is running.
kubectl get pods
The output should look like the following.
NAME READY STATUS RESTARTS AGE
spark-job-service-f545bd7d8-s4sn5 1/1 Running 0 9s
- Port-forward the `spark-job-service` server port in a separate terminal to access it from local. Replace the following pod name with your pod name.
kubectl port-forward spark-job-service-f545bd7d8-s4sn5 8090:8090
The output should look like the following.
Forwarding from 127.0.0.1:8090 -> 8090
Forwarding from [::1]:8090 -> 8090
- Now Minikube is ready for Spark job deployment; make API calls from Swagger or Postman to start and explore jobs.
- If the `spark-submit` command executed successfully while launching a job, you should be able to see the Spark driver and executor pods for the respective job running in Minikube.
kubectl get pods
The output should look like the following.
NAME READY STATUS RESTARTS AGE
sales-report-job-2e9c6f93ef784c17-driver 1/1 Running 0 11s
sales-report-job-9ac2e493ef78625a-exec-1 1/1 Running 0 6s
sales-report-job-9ac2e493ef78625a-exec-2 1/1 Running 0 6s
- Once the job is complete, executor pods are terminated automatically. The driver pod remains in `Completed` state, but it does not consume any resources.
NAME READY STATUS RESTARTS AGE
sales-report-job-2e9c6f93ef784c17-driver 0/1 Completed 0 2m56s
- If the job fails, executor pods are still terminated, but the driver pod remains in `Error` state. For debugging, you can check the pod logs.
- Eventually you may want to clean up by deleting the pods or running `minikube delete`, as shown below.
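A couple of example cleanup commands; the pod name is taken from the sample output above, so replace it with your own.

```bash
# Delete a finished driver pod by name (see `kubectl get pods` for the actual name).
kubectl delete pod sales-report-job-2e9c6f93ef784c17-driver

# Or tear down the whole local cluster when you are done.
minikube delete
```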
Important
All applications run in the `default` profile on Minikube.
Spark jobs' Docker images are deployed on Kubernetes via the `spark-submit` command.
There are two deploy modes for Spark jobs on Kubernetes; a cluster-mode `spark-submit` sketch follows the list below.
- Client Deploy Mode: The driver runs in the client’s JVM process and communicates with the executors managed by the cluster.
- Cluster Deploy Mode: The driver process runs as a separate JVM process in a cluster, and the cluster manages its resources.
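For reference, a cluster-mode submission against Kubernetes looks roughly like the sketch below. The flags are standard `spark-submit` options; the main class and jar path are placeholders, and in practice `spark-job-service` assembles the actual command when launching a job.

```bash
# Hedged sketch of a cluster-mode spark-submit against Kubernetes.
# The main class and jar path are placeholders, not the framework's actual values.
spark-submit \
  --master k8s://https://kubernetes.default.svc \
  --deploy-mode cluster \
  --name sales-report-job \
  --class com.example.SalesReportJob \
  --conf spark.kubernetes.namespace=ksoot \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=spark-batch-sales-report-job:0.0.1 \
  local:///opt/app/app.jar
```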
Configurations can be provided at multiple levels; at the individual project level, the standard Spring Boot configuration precedence order applies. Configurations are resolved in the following order.
- In the `application.yml` of individual job projects and their profile-specific ymls.
- In the `application.yml`s of `spark-job-service`.
- As environment variables in the `deployment.yml` of `spark-job-service`.
- As arguments in the `deployment.yml` of `spark-job-service`.
Access the Spark UI at `http://localhost:4040` to monitor and inspect Spark batch job execution.
On Minikube or Kubernetes you may need to port-forward to access it, and it may not be accessible if the job is not in a running state at that moment.
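For example, to reach the UI of a running driver pod on Kubernetes, you can port-forward port 4040; the pod name below is taken from the earlier sample output, so replace it with your own.

```bash
# Forward the Spark UI of a running driver pod to localhost:4040.
kubectl port-forward sales-report-job-2e9c6f93ef784c17-driver 4040:4040
```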
Important
In case port 4040 is busy, the Spark UI starts on another port; this new port is logged in the application logs, so check the logs to get the correct port.
24/12/26 01:07:11 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image.
24/12/26 01:07:12 ERROR Client: Please check "kubectl auth can-i create pod" first. It should be yes.
Possible cause: Wrong `spark.master` value, or `spark-rbac.yml` is not applied properly in the correct namespace.
Solution: Set the correct `spark.master` value in `application.yml` and `deployment.yml` of `spark-job-service`, or apply `spark-rbac.yml` properly.
Factory method 'sparkSession' threw exception with message: class org.apache.spark.storage.StorageUtils$ (in unnamed module @0x2049a9c1) cannot access class sun.nio.ch.DirectBuffer (in module java.base) because module java.base does not export sun.nio.ch to unnamed module @0x2049a9c1
Possible cause: Your Spark application is trying to access an internal Java class (sun.nio.ch.DirectBuffer) in the java.base module, which is not exported to Spark's unnamed module. This issue arises because the Java module system introduced in JDK 9 restricts access to internal APIs.
Solution: Add the VM option `--add-exports java.base/sun.nio.ch=ALL-UNNAMED`, for example as shown below.
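One way to apply this when launching a job locally (assuming you start it from a shell or your IDE) is via `JAVA_TOOL_OPTIONS`, which the JVM picks up automatically; setting the same flag in your IDE's run-configuration VM options works equally well.

```bash
# The JVM picks up this option automatically and logs "Picked up JAVA_TOOL_OPTIONS ..." at startup.
export JAVA_TOOL_OPTIONS="--add-exports java.base/sun.nio.ch=ALL-UNNAMED"
```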
Open source under the MIT License.
Rajveer Singh. In case you find any issues or need any support, please email me at [email protected]. Give it a ⭐ on GitHub and a 👏 on medium.com if you find it helpful.