Orchestration is the control plane of data engineering: scheduling, dependencies, retries, SLAs, environment promotion, and operational governance.
Without orchestration, data jobs become ad-hoc scripts with no reliability model.
Key orchestration responsibilities:
- Workflow scheduling (time/event-driven)
- Dependency management (upstream/downstream/task graph)
- Retry/backoff and timeout policies
- Alerting and incident hooks
- Parameterization (date ranges, environments)
- Lineage + auditability of runs
Reference flow (platform-neutral):
Scheduler/Event Trigger
→ DAG/State Machine
→ Ingestion Tasks
→ Validation Tasks
→ Transform Tasks
→ Publish Tasks
→ Notification + Metrics
Azure:
ADF Trigger (schedule/event)
→ Metadata-driven ADF Pipeline
→ Databricks Notebook Activities
→ Synapse Stored Procedures
→ Azure Monitor + Logic Apps alerts
AWS:
EventBridge/Cron
→ Step Functions state machine
→ Glue job / EMR step / Lambda task
→ Redshift publish step
→ CloudWatch + SNS + PagerDuty
GCP:
Cloud Composer (Airflow DAG)
→ Dataflow/Dataproc tasks
→ BigQuery transformations
→ Cloud Monitoring alerting
Azure Data Factory (ADF): best for enterprise ETL with metadata-driven pipelines and managed connectors.
Pattern:
- Lookup source metadata table.
- ForEach loops over entities.
- Copy, transform, and load with dependency chains.
- Retry on transient errors.
- Webhook/Logic App notifications.
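The pattern above can be sketched outside ADF in plain Python. This is a minimal illustration only, not ADF's API: `Entity`, `TransientError`, `run_pipeline`, and the sample metadata rows are hypothetical names, and the in-memory list stands in for the metadata table that a Lookup activity would read.

```python
from dataclasses import dataclass


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (timeouts, throttling)."""


@dataclass
class Entity:
    name: str
    source: str
    target: str


def run_pipeline(entities, copy_fn, notify_fn, max_attempts=3):
    """Loop over metadata rows, retrying only transient failures per entity."""
    results = {}
    for entity in entities:  # plays the role of ForEach over the Lookup output
        for attempt in range(1, max_attempts + 1):
            try:
                copy_fn(entity)
                results[entity.name] = "succeeded"
                break
            except TransientError:
                if attempt == max_attempts:
                    results[entity.name] = "failed"
                    notify_fn(f"{entity.name}: gave up after {attempt} attempts")
            except Exception as exc:
                # Permanent errors are not retried; alert immediately.
                results[entity.name] = "failed"
                notify_fn(f"{entity.name}: permanent failure: {exc}")
                break
    return results


# Stand-in for the source metadata table (illustrative rows).
METADATA = [
    Entity("orders", "sales.orders", "lake/orders"),
    Entity("inventory", "wh.inventory", "lake/inventory"),
]

calls = {}


def demo_copy(entity):
    """Fake copy step: orders is throttled once, inventory fails permanently."""
    calls[entity.name] = calls.get(entity.name, 0) + 1
    if entity.name == "orders" and calls["orders"] == 1:
        raise TransientError("throttled")
    if entity.name == "inventory":
        raise ValueError("schema mismatch")


alerts = []
results = run_pipeline(METADATA, demo_copy, alerts.append)
```

Adding a table then means adding a metadata row, not a new pipeline, and one entity failing does not stop the loop for the others.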
Production practices:
- global parameters per environment
- managed identity for secretless auth
- trigger windows aligned to upstream readiness
Apache Airflow: best for Python-first orchestration and custom workflows.
Recommended DAG settings:
- retries, retry_delay, execution_timeout
- max_active_runs, depends_on_past=False unless needed
- task-level SLA and pools for resource control
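As a rough sketch, these settings can be kept as plain dictionaries and handed to the DAG constructor. The values and the pool name `warehouse_pool` are assumptions for illustration; Airflow itself is not imported here, so the block only shows the shape of the configuration and a sanity check on what the retry policy implies for an SLA.

```python
from datetime import timedelta

# Task-level defaults, typically passed to an Airflow DAG as default_args.
DEFAULT_ARGS = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "execution_timeout": timedelta(hours=1),
    "depends_on_past": False,      # enable only when runs truly depend on the previous one
    "pool": "warehouse_pool",      # hypothetical pool name
}

# DAG-level settings (max_active_runs is a DAG argument, not a task argument).
DAG_KWARGS = {
    "max_active_runs": 1,
    "default_args": DEFAULT_ARGS,
}


def worst_case_task_seconds(args):
    """Upper bound on one task's wall-clock time if every attempt times out.

    Assumes a fixed retry_delay (no exponential backoff).
    """
    attempts = args["retries"] + 1
    run_time = attempts * args["execution_timeout"].total_seconds()
    wait_time = args["retries"] * args["retry_delay"].total_seconds()
    return run_time + wait_time


worst_case = worst_case_task_seconds(DEFAULT_ARGS)  # 4 * 3600 + 3 * 300 seconds
```

If the worst case exceeds the task's SLA, the retry count or the timeout is too generous for that task.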
Use cases:
- cross-cloud orchestration
- mixed Spark + SQL + API workloads
- complex branching and conditional logic
AWS Step Functions: best for serverless orchestration and explicit state transitions.
Strengths:
- visual state machine and native retry/catch
- strong integration with Lambda, Glue, ECS, EMR
- deterministic control flow and failure paths
Retry example concept:
{
  "Retry": [
    {
      "ErrorEquals": ["States.Timeout", "States.TaskFailed"],
      "IntervalSeconds": 10,
      "BackoffRate": 2.0,
      "MaxAttempts": 4
    }
  ]
}
Cloud Composer: managed Airflow with deep GCP integration.
Typical pattern:
- Composer DAG triggers Dataflow templates
- waits for completion
- runs BigQuery SQL transformations
- updates data quality status table
- sends alerts via Cloud Monitoring
Orchestration-layer controls:
- task retries with exponential backoff
- timeout + circuit breaker pattern
- checkpoint/marker tables for re-entry
- skip/recover downstream only after upstream consistency checks
- failure domains (one table failing should not kill all critical tables)
- auto-ticket creation for persistent failures
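A minimal sketch of the first control, capped retries with exponential backoff that only fire for transient errors. The parameter names mirror the Step Functions retry fields (IntervalSeconds, BackoffRate, MaxAttempts); `TransientError`, `backoff_schedule`, and `run_with_retry` are hypothetical helpers, and the sleep function is injectable so the example runs without waiting.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, throttling)."""


def backoff_schedule(interval_seconds, backoff_rate, max_attempts, max_delay=None):
    """Delay before each retry: interval * rate**n, optionally capped."""
    delays = []
    for attempt in range(max_attempts):
        delay = interval_seconds * (backoff_rate ** attempt)
        if max_delay is not None:
            delay = min(delay, max_delay)
        delays.append(delay)
    return delays


def run_with_retry(task, delays, sleep=time.sleep):
    """Run task, retrying transient errors once per entry in delays.

    Any other exception is treated as permanent and propagates immediately.
    """
    for delay in delays:
        try:
            return task()
        except TransientError:
            sleep(delay)
    return task()  # final attempt; a failure here surfaces to the caller


schedule = backoff_schedule(10, 2.0, 4)

attempts = {"n": 0}


def flaky():
    """Fails with a transient error twice, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] <= 2:
        raise TransientError("timeout")
    return "ok"


slept = []
result = run_with_retry(flaky, schedule, sleep=slept.append)
```

With the values shown, the waits are 10, 20, 40, and 80 seconds, so the retry budget is bounded and a persistent failure reaches alerting or ticketing instead of looping.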
DAG/state machine anti-fragility:
- make tasks idempotent
- isolate side effects
- explicit compensation tasks for partial writes
Azure monitoring:
- ADF activity run history
- pipeline-level SLA dashboards
- alert rules for failure spikes and delay
AWS monitoring:
- Step Functions state transition failures
- Glue/EMR duration anomalies
- centralized logs via CloudWatch log groups
GCP monitoring:
- Composer DAG success/failure trends
- Dataflow/BigQuery job durations and error ratios
Operational KPIs:
- success rate by DAG
- median and P95 runtime
- retries per run
- mean time to recovery (MTTR)
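These KPIs can be computed from run history with a few lines. The sketch below assumes a hypothetical record layout (`ok`, `runtime_s`, `retries`) for the runs of one DAG and uses the nearest-rank definition of P95; other percentile definitions give slightly different values on small samples.

```python
import math
from statistics import mean, median


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(math.ceil(pct / 100 * len(ordered)), 1)
    return ordered[rank - 1]


def run_kpis(runs):
    """Summarize the run history of a single DAG."""
    runtimes = [r["runtime_s"] for r in runs]
    return {
        "success_rate": sum(1 for r in runs if r["ok"]) / len(runs),
        "median_runtime_s": median(runtimes),
        "p95_runtime_s": nearest_rank_percentile(runtimes, 95),
        "retries_per_run": sum(r["retries"] for r in runs) / len(runs),
    }


def mttr_minutes(incidents):
    """Mean time to recovery from (failed_at, recovered_at) pairs in minutes."""
    return mean(recovered - failed for failed, recovered in incidents)


# Illustrative history for one DAG.
RUNS = [
    {"ok": True, "runtime_s": 600, "retries": 0},
    {"ok": True, "runtime_s": 660, "retries": 1},
    {"ok": False, "runtime_s": 720, "retries": 3},
    {"ok": True, "runtime_s": 1500, "retries": 0},
]

kpis = run_kpis(RUNS)
mttr = mttr_minutes([(0, 30), (100, 150)])
```

Tracking P95 next to the median matters because a single slow run, like the 1500-second one here, is invisible in the median but is what breaks the SLA.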
00:30 - Trigger ingest DAG
00:40 - Validate raw completeness
01:00 - Transform orders/customers/inventory in parallel
02:00 - Publish marts
02:10 - Run reconciliation checks
02:20 - Notify BI readiness
Key properties:
- dependency graph with parallel branches
- hard stop on reconciliation failure
- partial rerun capability by domain (orders, inventory)
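The three properties can be illustrated with the standard-library `graphlib` module (Python 3.9+). The task names are assumptions derived from the schedule above, and `execution_waves`, `run`, and `rerun_set` are hypothetical helpers, not part of any orchestrator.

```python
from graphlib import TopologicalSorter

# task -> set of upstream tasks it depends on
DAG = {
    "ingest": set(),
    "validate_raw": {"ingest"},
    "transform_orders": {"validate_raw"},
    "transform_customers": {"validate_raw"},
    "transform_inventory": {"validate_raw"},
    "publish_marts": {"transform_orders", "transform_customers", "transform_inventory"},
    "reconcile": {"publish_marts"},
    "notify_bi": {"reconcile"},
}


def execution_waves(dag):
    """Group tasks into waves; tasks in the same wave can run in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


def run(dag, task_fn):
    """Run wave by wave; stop at the first failing task (hard stop)."""
    completed = []
    for wave in execution_waves(dag):
        for task in wave:
            if not task_fn(task):
                return completed, task
            completed.append(task)
    return completed, None


def rerun_set(dag, start):
    """Return the start task plus everything downstream of it."""
    affected = {start}
    changed = True
    while changed:
        changed = False
        for task, upstream in dag.items():
            if task not in affected and upstream & affected:
                affected.add(task)
                changed = True
    return affected


waves = execution_waves(DAG)
completed, failed = run(DAG, lambda task: task != "reconcile")  # simulate a failed check
orders_rerun = rerun_set(DAG, "transform_orders")
```

In the simulated run, `notify_bi` never executes because reconciliation failed, and rerunning the orders domain touches only `transform_orders` and its downstream tasks, leaving the customers and inventory transforms alone.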
Hybrid batch + streaming example:
- Batch settlement DAG hourly
- Streaming jobs health checks every 5 min
- On streaming lag threshold breach, fallback aggregate pipeline invoked
Anti-patterns:
- Monolithic DAG with 200+ tightly coupled tasks.
- No environment promotion strategy (dev/stage/prod).
- Triggering pipelines on schedule only, ignoring upstream freshness.
- Hardcoding secrets/paths per task.
- Missing failure classification (transient vs permanent).
- No runbook for rerun/backfill.
- Unlimited retries causing resource exhaustion.
Performance and scaling practices:
- Parallelize independent branches aggressively.
- Use task pools/queues to prevent noisy neighbor effects.
- Avoid orchestration engines doing heavy compute.
- Keep tasks short and composable.
- Use metadata-driven loops instead of copy-paste DAGs.
- Cap concurrent runs to protect downstream systems.
- Profile critical path and eliminate serial bottlenecks.
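A small sketch of the pool idea: many tasks may be runnable at once, but only a fixed number of slots are allowed to hit the same downstream system. `Pool`, `load_table`, and the slot count are hypothetical; orchestrators such as Airflow provide pools natively, so this only illustrates the mechanism.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Pool:
    """Caps how many tasks may use a shared resource at the same time."""

    def __init__(self, slots):
        self._sem = threading.BoundedSemaphore(slots)
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0  # highest concurrency observed, for verification

    def run(self, fn, *args):
        with self._sem:  # blocks while all slots are taken
            with self._lock:
                self.active += 1
                self.peak = max(self.peak, self.active)
            try:
                return fn(*args)
            finally:
                with self._lock:
                    self.active -= 1


def load_table(name):
    """Stand-in for a task that puts load on the warehouse."""
    time.sleep(0.01)
    return name


TABLES = [f"table_{i}" for i in range(8)]
warehouse_pool = Pool(slots=2)

# Eight worker threads are available, but the pool admits only two at a time.
with ThreadPoolExecutor(max_workers=8) as executor:
    loaded = list(executor.map(lambda t: warehouse_pool.run(load_table, t), TABLES))
```

The worker count controls how much the orchestrator can do in parallel; the pool size controls how much of that parallelism one downstream system has to absorb.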
Tasks should be rerunnable without corrupting the final state. Use checkpoints, MERGE semantics, and deterministic outputs.
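A minimal sketch of a rerunnable load, using SQLite's `INSERT OR REPLACE` as a stand-in for a warehouse MERGE. The table names, columns, and the date are illustrative assumptions; the data write and the checkpoint write share one transaction, so a rerun either repeats both or neither.

```python
import sqlite3


def load_partition(conn, run_date, rows):
    """Upsert one partition keyed by order_id; safe to rerun with the same input."""
    with conn:  # single transaction: commit on success, roll back on error
        conn.executemany(
            "INSERT OR REPLACE INTO orders (order_id, run_date, amount) VALUES (?, ?, ?)",
            [(order_id, run_date, amount) for order_id, amount in rows],
        )
        conn.execute(
            "INSERT OR REPLACE INTO checkpoints (run_date, status) VALUES (?, 'done')",
            (run_date,),
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, run_date TEXT, amount REAL)")
conn.execute("CREATE TABLE checkpoints (run_date TEXT PRIMARY KEY, status TEXT)")

rows = [(1, 10.0), (2, 25.5)]
load_partition(conn, "2024-01-01", rows)
load_partition(conn, "2024-01-01", rows)  # rerun: same final state, no duplicates

row_count, total_amount = conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()
checkpoint_count = conn.execute("SELECT COUNT(*) FROM checkpoints").fetchone()[0]
```

An append-only `INSERT` in the same place would double the totals on every rerun, which is the failure mode idempotency is meant to rule out.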
For late-arriving data, use freshness sensors, delayed dependencies, and correction DAGs. Keep recent partitions mutable for a bounded window and run reconciliation jobs against them.
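Two of these ideas fit in a few lines: a freshness gate that decides whether a downstream run may start, and a list of recent partitions to reprocess. The thresholds, timestamps, and function names below are assumptions for illustration.

```python
from datetime import date, datetime, timedelta


def upstream_is_fresh(last_loaded_at, now, max_age):
    """True if the upstream dataset was loaded recently enough to build on."""
    return now - last_loaded_at <= max_age


def partitions_to_reprocess(run_date, mutable_days):
    """Recent partitions kept mutable so late rows can still be merged in."""
    return [run_date - timedelta(days=offset) for offset in range(mutable_days)]


now = datetime(2024, 1, 2, 0, 30)
max_age = timedelta(hours=1)

fresh = upstream_is_fresh(datetime(2024, 1, 1, 23, 50), now, max_age)  # 40 minutes old
stale = upstream_is_fresh(datetime(2024, 1, 1, 20, 0), now, max_age)   # 4.5 hours old

window = partitions_to_reprocess(date(2024, 1, 2), mutable_days=3)
```

A sensor would poll the first function until it returns true or a timeout fires, and a correction DAG would rerun the transform for each date in `window`.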
Exactly-once is achieved end-to-end through idempotent tasks, dedup keys, and transactional sinks, not by the scheduler alone.
Resilient orchestration relies on small idempotent tasks, explicit retries/catches, isolated failure domains, durable state checkpoints, and replay/backfill workflows.
- ADF vs Airflow vs Step Functions vs Composer: when to choose what?
- How do you prevent DAG explosion for 500 tables?
- How do you enforce SLA and alert policies consistently?
- How do you orchestrate across multi-cloud boundaries?