
spark-pipeline-framework


A configuration-driven framework for building Spark pipelines with HOCON config files and PureConfig.

Python/PySpark users may also be interested in pyspark-pipeline-framework, the Python implementation of this framework using dataconf and HOCON. You can find it on GitHub and PyPI.

Features

  • Type-safe configuration via PureConfig with automatic case class binding
  • Dynamic component instantiation via reflection (no compile-time coupling)
  • Lifecycle hooks for monitoring, metrics, and custom error handling
  • Built-in hooks for structured logging, Micrometer metrics, and audit trails
  • Configuration validation for CI/CD pre-flight checks without Spark
  • Secrets management with pluggable providers (env, AWS, Vault)
  • Streaming support for Spark Structured Streaming pipelines
  • Cross-compilation for Spark 3.x/4.x and Scala 2.12/2.13
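The reflection-based, no-compile-time-coupling instantiation mentioned above can be sketched in plain Scala (a minimal stand-alone illustration using only the standard library; `Instance`, `Demo`, and `Reflect.lookup` are stand-in names, not the framework's actual API):

```scala
// Minimal sketch: look up a Scala `object` singleton by its fully
// qualified class name at runtime, with no compile-time reference.
// Names here are illustrative, not the framework's real classes.
trait Instance { def describe: String }

object Demo extends Instance { def describe = "demo component" }

object Reflect {
  // scalac compiles `object Demo` to a class `Demo$` whose static
  // MODULE$ field holds the singleton instance.
  def lookup(fqcn: String): Instance =
    Class.forName(fqcn + "$").getField("MODULE$").get(null).asInstanceOf[Instance]
}

object Main extends App {
  println(Reflect.lookup("Demo").describe) // prints "demo component"
}
```

Because the lookup is by string name, pipeline components can live in any jar on the classpath and be selected purely from configuration.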

📚 Full Documentation

Modules

| Module | Description |
| --- | --- |
| `spark-pipeline-core` | Traits, config models, instantiation (no Spark dependency) |
| `spark-pipeline-runtime` | `SparkSessionWrapper`, `DataFlow` trait |
| `spark-pipeline-runner` | `SimplePipelineRunner` entry point |

Quick Start

1. Add dependency

// build.sbt
libraryDependencies += "io.github.dwsmith1983" %% "spark-pipeline-runtime-spark3" % "<version>"

2. Create a component

import io.github.dwsmith1983.spark.pipeline.config.ConfigurableInstance
import io.github.dwsmith1983.spark.pipeline.runtime.DataFlow
import pureconfig._
import pureconfig.generic.auto._

object MyComponent extends ConfigurableInstance {
  case class Config(inputTable: String, outputPath: String)

  // Bind this component's HOCON block to the typed Config case class.
  override def createFromConfig(conf: com.typesafe.config.Config): MyComponent =
    new MyComponent(ConfigSource.fromConfig(conf).loadOrThrow[Config])
}

class MyComponent(conf: MyComponent.Config) extends DataFlow {
  // `spark` is supplied by the DataFlow trait.
  override def run(): Unit = {
    spark.table(conf.inputTable).write.parquet(conf.outputPath)
  }
}
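Conceptually, the runner instantiates each configured component and invokes `run()` in declaration order. That loop can be sketched as follows (illustrative names only, not the framework's real runner):

```scala
// Stand-in sketch of the runner loop: each component exposes run(),
// and the runner executes them sequentially in configured order.
trait Component { def run(): Unit }

// A toy component that records its effect in a shared buffer.
final class Append(buf: StringBuilder, s: String) extends Component {
  def run(): Unit = buf.append(s)
}

object SketchRunner {
  def runAll(components: Seq[Component]): Unit = components.foreach(_.run())
}

object Main extends App {
  val buf = new StringBuilder
  SketchRunner.runAll(Seq(new Append(buf, "a"), new Append(buf, "b")))
  println(buf.result()) // prints "ab"
}
```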

3. Create config file

# pipeline.conf
spark {
  app-name = "My Pipeline"
}

pipeline {
  pipeline-name = "My Data Pipeline"
  pipeline-components = [
    {
      instance-type = "com.mycompany.MyComponent"
      instance-name = "MyComponent(prod)"
      instance-config {
        input-table = "raw_data"
        output-path = "/data/processed"
      }
    }
  ]
}
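Because the config file is plain HOCON, standard Typesafe Config features such as optional environment-variable substitution work out of the box, which is useful for per-environment overrides. A sketch extending the file above (the `OUTPUT_PATH` variable name is illustrative):

```hocon
pipeline {
  pipeline-components = [
    {
      instance-type = "com.mycompany.MyComponent"
      instance-name = "MyComponent(prod)"
      instance-config {
        input-table = "raw_data"
        # default value, overridden only if OUTPUT_PATH is set in the environment
        output-path = "/data/processed"
        output-path = ${?OUTPUT_PATH}
      }
    }
  ]
}
```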

4. Run

spark-submit \
  --class io.github.dwsmith1983.spark.pipeline.runner.SimplePipelineRunner \
  --jars /path/to/my-pipeline.jar \
  /path/to/spark-pipeline-runner-spark3_2.12.jar \
  -Dconfig.file=/path/to/pipeline.conf

License

Apache 2.0
