A configuration-driven framework for building Spark pipelines with HOCON config files and PureConfig.
Python/PySpark users may also be interested in pyspark-pipeline-framework, the Python implementation of this framework using dataconf and HOCON. You can find it on GitHub and PyPI.
- Type-safe configuration via PureConfig with automatic case class binding
- Dynamic component instantiation via reflection (no compile-time coupling)
- Lifecycle hooks for monitoring, metrics, and custom error handling
- Built-in hooks for structured logging, Micrometer metrics, and audit trails
- Configuration validation for CI/CD pre-flight checks without Spark
- Secrets management with pluggable providers (env, AWS, Vault)
- Streaming support for Spark Structured Streaming pipelines
- Cross-compilation for Spark 3.x/4.x and Scala 2.12/2.13
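The type-safe configuration feature rests on PureConfig's automatic derivation, which maps kebab-case HOCON keys onto camelCase case-class fields. A minimal, standalone sketch of that binding (the `JobConfig` class here is illustrative, not part of the framework):

```scala
import pureconfig._
import pureconfig.generic.auto._ // automatic derivation (Scala 2)

// Kebab-case HOCON keys bind to camelCase fields automatically.
case class JobConfig(inputTable: String, maxRetries: Int)

val conf: JobConfig = ConfigSource.string(
  """
  input-table = "raw_data"
  max-retries = 3
  """).loadOrThrow[JobConfig]
// loadOrThrow fails fast with a readable error if a key is missing or mistyped.
```

Because binding failures surface as exceptions with field-level detail, misconfigured pipelines fail at load time rather than mid-run.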
| Module | Description |
|---|---|
| `spark-pipeline-core` | Traits, config models, instantiation (no Spark dependency) |
| `spark-pipeline-runtime` | `SparkSessionWrapper`, `DataFlow` trait |
| `spark-pipeline-runner` | `SimplePipelineRunner` entry point |
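The core module's reflection-based instantiation (components referenced by class name, with no compile-time coupling) can be sketched roughly as follows. The `instantiate` helper and the trait body are illustrative assumptions, not the framework's actual internals:

```scala
import com.typesafe.config.Config

// Minimal stand-in for the framework's ConfigurableInstance trait.
trait ConfigurableInstance {
  def createFromConfig(conf: Config): AnyRef
}

// A Scala `object` compiles to a class named "<name>$" exposing a static
// MODULE$ field, so a companion object can be recovered from the configured
// instance-type string at runtime.
def instantiate(instanceType: String, conf: Config): AnyRef = {
  val companion = Class.forName(instanceType + "$").getField("MODULE$").get(null)
  companion.asInstanceOf[ConfigurableInstance].createFromConfig(conf)
}
```

This is why the `instance-type` value in `pipeline.conf` must be the fully qualified name of an object implementing `ConfigurableInstance`: the runner only discovers it by name at runtime.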
```scala
// build.sbt
libraryDependencies += "io.github.dwsmith1983" %% "spark-pipeline-runtime-spark3" % "<version>"
```

```scala
import io.github.dwsmith1983.spark.pipeline.config.ConfigurableInstance
import io.github.dwsmith1983.spark.pipeline.runtime.DataFlow
import pureconfig._
import pureconfig.generic.auto._

object MyComponent extends ConfigurableInstance {
  case class Config(inputTable: String, outputPath: String)

  override def createFromConfig(conf: com.typesafe.config.Config): MyComponent =
    new MyComponent(ConfigSource.fromConfig(conf).loadOrThrow[Config])
}

class MyComponent(conf: MyComponent.Config) extends DataFlow {
  override def run(): Unit = {
    spark.table(conf.inputTable).write.parquet(conf.outputPath)
  }
}
```

```hocon
# pipeline.conf
spark {
  app-name = "My Pipeline"
}

pipeline {
  pipeline-name = "My Data Pipeline"
  pipeline-components = [
    {
      instance-type = "com.mycompany.MyComponent"
      instance-name = "MyComponent(prod)"
      instance-config {
        input-table = "raw_data"
        output-path = "/data/processed"
      }
    }
  ]
}
```

```bash
spark-submit \
  --class io.github.dwsmith1983.spark.pipeline.runner.SimplePipelineRunner \
  --jars /path/to/my-pipeline.jar \
  /path/to/spark-pipeline-runner-spark3_2.12.jar \
  -Dconfig.file=/path/to/pipeline.conf
```

- Getting Started - Quick start guide
- Configuration - HOCON configuration reference
- Config Validation - CI/CD validation
- Secrets Management - Secure credential handling
- Lifecycle Hooks - Logging, metrics, audit trails
- Streaming - Structured Streaming support
- Deployment - Production deployment guides
- Contributing - Development setup
Apache 2.0