This project orchestrates Spark jobs written in Python, Scala, and Java using Apache Airflow, all within a Dockerized environment. The DAG sparking_flow
submits Spark jobs in multiple languages to a local Spark cluster, enabling robust, scalable, and repeatable data workflows.
| Tool | Version |
|---|---|
| Python | 3.12.x |
| Java | 21 (LTS) |
| Scala | 3.4.x |
| Apache Spark | 3.5.x |
| Apache Airflow | 2.9.x |
| SBT | 1.10.x |
| Docker | Latest |
| Docker Compose | Latest |
.
├── dags/
│   └── spark_airflow.py        # Airflow DAG
├── img/
│   └── Capture.PNG             # Airflow graph view
├── jobs/
│   ├── python/
│   │   └── wordcountjob.py     # Python Spark job
│   ├── scala/
│   │   ├── build.sbt
│   │   └── wordcountjob.scala  # Scala Spark job
│   └── java/
│       └── spark-job/          # Java Spark job
│           └── src/...
├── airflow.env
├── docker-compose.yml
├── Dockerfile
└── .dockerignore
mkdir dags jobs
touch airflow.env docker-compose.yml Dockerfile .dockerignore
mkdir -p jobs/python jobs/scala jobs/java
touch dags/spark_airflow.py
touch jobs/python/wordcountjob.py
touch jobs/scala/{build.sbt,wordcountjob.scala}
Optional: create the Java job under jobs/java/spark-job/ following a standard Maven or Gradle project structure.
docker compose up -d --build
To gracefully stop all running services:
docker compose down
To rebuild and restart the environment:
docker compose up -d --build
- Airflow Web UI: http://localhost:8080
  Log in with the Airflow credentials configured for your environment.
- Spark Master UI: http://localhost:9090

The basic setup runs one Spark master and one worker. To scale out, duplicate the Spark worker section in docker-compose.yml, giving each container a unique name and hostname, for example:
spark-worker-2:
  image: bitnami/spark:latest
  container_name: spark-worker-2
  command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
  depends_on:
    - spark-master
  environment:
    SPARK_MODE: worker
    SPARK_WORKER_CORES: 2
    SPARK_WORKER_MEMORY: 1g
    SPARK_MASTER_URL: spark://spark-master:7077
Inside your development environment or Docker container:
pip install apache-airflow apache-airflow-providers-apache-spark pyspark
Create the DAG file at dags/spark_airflow.py with tasks that trigger:

- A PythonOperator to mark the start
- A SparkSubmitOperator for each of:
  - the Python job
  - the Scala job
  - the Java job
- A PythonOperator to mark the end
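The exact wiring depends on your project, but a minimal sketch of such a DAG could look like the following. The connection id spark-conn and the Scala/Java JAR paths are assumptions; point them at the connection you create in Airflow and at the artifacts your builds actually produce.

```python
# dags/spark_airflow.py -- illustrative sketch, not the project's exact DAG.
# The conn_id "spark-conn" and the Scala/Java JAR paths are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="sparking_flow",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:

    start = PythonOperator(
        task_id="start",
        python_callable=lambda: print("Jobs started"),
    )

    python_job = SparkSubmitOperator(
        task_id="python_job",
        conn_id="spark-conn",
        application="jobs/python/wordcountjob.py",
    )

    scala_job = SparkSubmitOperator(
        task_id="scala_job",
        conn_id="spark-conn",
        application="jobs/scala/target/wordcountjob.jar",  # use your sbt package output path
    )

    java_job = SparkSubmitOperator(
        task_id="java_job",
        conn_id="spark-conn",
        application="jobs/java/spark-job/target/spark-job-1.0.jar",  # use your Maven/Gradle output path
    )

    end = PythonOperator(
        task_id="end",
        python_callable=lambda: print("Jobs completed successfully"),
    )

    start >> [python_job, scala_job, java_job] >> end
```

Note that these application paths resolve inside the Airflow containers, so docker-compose.yml should mount the jobs/ directory (alongside dags/) into them.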
jobs/python/wordcountjob.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("WordCountPython").getOrCreate()
data = spark.read.text("path/to/input.txt")
words = data.selectExpr("explode(split(value, ' ')) as word")
word_counts = words.groupBy("word").count()
word_counts.show()
spark.stop()
Install Scala and SBT (on macOS, for example, with Homebrew):
brew install scala sbt
jobs/scala/wordcountjob.scala
import org.apache.spark.sql.SparkSession

object WordCountScala {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("WordCountScala").getOrCreate()
    import spark.implicits._ // encoders needed for the Dataset transformations below

    val data = spark.read.textFile("path/to/input.txt")
    val words = data.flatMap(_.split(" "))
    val wordCounts = words.groupByKey(identity).count()

    wordCounts.show()
    spark.stop()
  }
}
Make sure build.sbt declares Spark SQL as a dependency (usually marked "provided", since the cluster supplies Spark at runtime), then compile and package:
cd jobs/scala
sbt compile
sbt package
Follow the standard Maven or Gradle project layout; the job logic should mirror the Python and Scala versions.
mkdir -p jobs/java/spark-job/src/main/java/com/example
# Add Java class and pom.xml/build.gradle accordingly
Build the Java JAR:
cd jobs/java/spark-job
mvn clean package
After the containers are up and running:

- Open the Airflow UI at http://localhost:8080
- Configure the Spark connection (required before the Spark tasks can run):
  - Go to Admin > Connections > Add Connection
  - Conn Id: the same id referenced by the DAG's SparkSubmitOperator tasks
  - Conn Type: Spark
  - Host: spark://spark-master
  - Port: 7077
- Enable the sparking_flow DAG and trigger it manually
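If you prefer to script this step rather than use the UI, the connection can also be created from inside an Airflow container. The sketch below uses Airflow's metadata session; the conn_id spark-conn is an assumption and must match whatever your SparkSubmitOperator tasks reference.

```python
# add_spark_connection.py -- illustrative helper, run inside an Airflow container.
# "spark-conn" is an assumed conn_id; keep it in sync with your DAG.
from airflow import settings
from airflow.models import Connection

session = settings.Session()

# Only create the connection if it does not exist yet
exists = session.query(Connection).filter(Connection.conn_id == "spark-conn").first()
if not exists:
    session.add(
        Connection(
            conn_id="spark-conn",
            conn_type="spark",
            host="spark://spark-master",
            port=7077,
        )
    )
    session.commit()

session.close()
```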
To tear down the environment completely, removing volumes and orphaned containers:
docker compose down -v --remove-orphans
We welcome contributions via pull requests. Please follow the project's existing conventions and include a clear description of your changes.