84 changes: 84 additions & 0 deletions docs/docs/examples/mini-examples/partitions-vs-config.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,84 @@
---
title: Partitions vs config
description: Comparing Dagster's partitions with run configuration for parameterizing pipelines.
last_update:
  author: Dennis Hume
sidebar_custom_props:
  logo: images/dagster-primary-mark.svg
---

This example compares two approaches to parameterizing Dagster pipelines. When you need to process data for different segments (such as customers, regions, or dates), you can choose between Dagster's [partitions](/guides/build/partitions-and-backfills/partitioning-assets) and [run configuration](/guides/operate/configuration/run-configuration). Each approach has distinct trade-offs in tracking, observability, and workflow.

## Problem: Processing data for multiple customers

Imagine you need to process data for multiple customers, where each customer's data should be processed independently. You want to be able to run the pipeline for specific customers and potentially reprocess historical data when needed.

The key question is: Should you use partitions to create a segment for each customer, or should you use config to pass the customer ID as a parameter?

### Solution 1: Using partitions

[Partitions](/guides/build/partitions-and-backfills/partitioning-assets) divide your data into discrete segments. Each customer becomes a partition, giving you full visibility into which customers have been processed and the ability to backfill specific customers.

<CodeExample
path="docs_projects/project_mini/src/project_mini/defs/partitions_vs_config/with_partitions.py"
language="python"
title="src/project_mini/defs/partitions_vs_config/with_partitions.py"
/>

| | **Partitions approach** |
| ---------------------------- | ---------------------------------------------------- |
| **Materialization tracking** | Per-customer history visible in UI |
| **Backfilling** | Built-in support for reprocessing specific customers |
| **Scheduling** | Native support for processing all partitions |
| **UI experience** | Partition status bar shows processing state |
| **Setup complexity** | Requires defining the partition set up front |

### Solution 2: Using config

[Run configuration](/guides/operate/configuration/run-configuration) passes the customer ID as a parameter at runtime. This approach is simpler to set up, but every run lands in the same materialization history for the asset, so you can't see at a glance which customers have been processed.

<CodeExample
path="docs_projects/project_mini/src/project_mini/defs/partitions_vs_config/with_config.py"
language="python"
title="src/project_mini/defs/partitions_vs_config/with_config.py"
/>

| | **Config approach** |
| ---------------------------- | --------------------------------------------- |
| **Materialization tracking** | Single asset history (not per-customer) |
| **Backfilling** | Manual re-runs required |
| **Scheduling** | Requires custom logic to iterate customers |
| **UI experience** | Specify customer in Launchpad before each run |
| **Setup complexity** | Simple config class, no partition management |
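
The "custom logic to iterate customers" mentioned in the table could look something like the following sketch. It builds one run config dictionary per customer, using the nesting that Dagster run configuration expects for asset config (`ops`, then the asset name, then `config`). The `CUSTOMER_IDS` list and the `build_customer_run_configs` helper are hypothetical names introduced for illustration; in practice the list might come from a database or an API.

```python
# Hypothetical customer list; not part of the example project.
CUSTOMER_IDS = ["customer_a", "customer_b", "customer_c"]

# Both assets in with_config.py declare a CustomerConfig parameter,
# so each one needs its own config entry in the run config.
CONFIGURED_ASSETS = ["customer_orders", "customer_summary"]


def build_customer_run_configs(customer_ids):
    """Return a mapping of customer ID to a run config dictionary."""
    return {
        customer_id: {
            "ops": {
                asset_name: {"config": {"customer_id": customer_id}}
                for asset_name in CONFIGURED_ASSETS
            }
        }
        for customer_id in customer_ids
    }


run_configs = build_customer_run_configs(CUSTOMER_IDS)
```

A schedule or sensor could then yield one `dg.RunRequest(run_config=...)` per entry. Even so, all of those runs share a single materialization history for each asset, which is the main trade-off compared with partitions.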

## When to use each approach

The choice between partitions and config depends on your specific requirements:

**Use partitions when:**

- Your data naturally segments into discrete categories
- You need to track materialization status per segment
- Backfilling specific segments is a common operation
- You want to schedule processing for all segments automatically
- You need visibility into which segments are up to date and which are stale

**Use config when:**

- Processing is infrequent or ad-hoc
- Parameters are dynamic or come from an unbounded set
- You don't need per-parameter tracking
- A single materialization history is sufficient
- You want simple parameterization without partition overhead

## Hybrid approach

You can also combine both approaches: use partitions for the primary segmentation (e.g., by customer) and config for additional runtime parameters (e.g., processing options).

<CodeExample
path="docs_projects/project_mini/src/project_mini/defs/partitions_vs_config/with_partitions_and_config.py"
language="python"
title="src/project_mini/defs/partitions_vs_config/with_partitions_and_config.py"
/>

This gives you the benefits of partition tracking while maintaining flexibility for runtime parameters.
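
To make the split concrete, here is a minimal sketch of what a single hybrid run needs: the partition key identifies the customer, and the run config carries only the processing options. The `hybrid_run_request_kwargs` helper is a hypothetical name introduced for illustration; its defaults mirror the `ProcessingConfig` fields in the example, and the `customer_data` key assumes the asset name used there.

```python
def hybrid_run_request_kwargs(partition_key, include_archived=False, limit=1000):
    """Pair a customer partition with processing options for one run.

    The customer travels as the partition key; the run config holds
    only the options defined on ProcessingConfig.
    """
    return {
        "partition_key": partition_key,
        "run_config": {
            "ops": {
                "customer_data": {
                    "config": {
                        "include_archived": include_archived,
                        "limit": limit,
                    }
                }
            }
        },
    }


# Reprocess one customer with archived records included.
kwargs = hybrid_run_request_kwargs("customer_b", include_archived=True)
```

The dictionary keys match the `partition_key` and `run_config` parameters of `dg.RunRequest`, so a schedule or sensor could pass them through as `dg.RunRequest(**kwargs)`.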
@@ -0,0 +1,37 @@
import dagster as dg


class CustomerConfig(dg.Config):
    customer_id: str


@dg.asset
def customer_orders(context: dg.AssetExecutionContext, config: CustomerConfig):
    """Fetch and process orders for a specific customer."""
    customer_id = config.customer_id
    context.log.info(f"Fetching orders for customer: {customer_id}")

    orders = [
        {"order_id": f"{customer_id}-001", "amount": 150.00},
        {"order_id": f"{customer_id}-002", "amount": 275.50},
        {"order_id": f"{customer_id}-003", "amount": 89.99},
    ]

    context.log.info(f"Processed {len(orders)} orders for {customer_id}")
    return orders


@dg.asset(deps=[customer_orders])
def customer_summary(context: dg.AssetExecutionContext, config: CustomerConfig):
    """Generate a summary report for a specific customer."""
    customer_id = config.customer_id
    context.log.info(f"Generating summary for customer: {customer_id}")

    summary = {
        "customer_id": customer_id,
        "total_orders": 3,
        "total_revenue": 515.49,
    }

    context.log.info(f"Summary for {customer_id}: {summary}")
    return summary
@@ -0,0 +1,49 @@
import dagster as dg

customer_partitions = dg.StaticPartitionsDefinition(["customer_a", "customer_b", "customer_c"])


@dg.asset(partitions_def=customer_partitions)
def customer_orders(context: dg.AssetExecutionContext):
    """Fetch and process orders for a specific customer partition."""
    customer_id = context.partition_key
    context.log.info(f"Fetching orders for customer: {customer_id}")

    orders = [
        {"order_id": f"{customer_id}-001", "amount": 150.00},
        {"order_id": f"{customer_id}-002", "amount": 275.50},
        {"order_id": f"{customer_id}-003", "amount": 89.99},
    ]

    context.log.info(f"Processed {len(orders)} orders for {customer_id}")
    return orders


@dg.asset(partitions_def=customer_partitions, deps=[customer_orders])
def customer_summary(context: dg.AssetExecutionContext):
    """Generate a summary report for a specific customer partition."""
    customer_id = context.partition_key
    context.log.info(f"Generating summary for customer: {customer_id}")

    summary = {
        "customer_id": customer_id,
        "total_orders": 3,
        "total_revenue": 515.49,
    }

    context.log.info(f"Summary for {customer_id}: {summary}")
    return summary


@dg.schedule(
    cron_schedule="0 1 * * *",
    job=dg.define_asset_job(
        "all_customers_job",
        selection=[customer_orders, customer_summary],
        partitions_def=customer_partitions,
    ),
)
def daily_customer_schedule():
    """Trigger processing for all customer partitions."""
    for partition_key in customer_partitions.get_partition_keys():
        yield dg.RunRequest(partition_key=partition_key)
@@ -0,0 +1,14 @@
import dagster as dg

customer_partitions = dg.StaticPartitionsDefinition(["customer_a", "customer_b", "customer_c"])


class ProcessingConfig(dg.Config):
    include_archived: bool = False
    limit: int = 1000


@dg.asset(partitions_def=customer_partitions)
def customer_data(context: dg.AssetExecutionContext, config: ProcessingConfig):
    customer_id = context.partition_key
    context.log.info(f"Processing {customer_id} with include_archived={config.include_archived}")