This runbook describes how to operate the Streaming Pipeline (ETL), the Batch Pipeline (ELT + ML), and how to handle common incidents: replay, backfill, and model rollback.
This is a sanitized, documentation-only case study. The steps below describe patterns and checklists, not live production commands.
- Streaming Pipeline (ETL):
  - Cloud Pub/Sub – `transactions.realtime`
  - Dataflow – streaming job
  - BigQuery – `raw_transactions`, `transaction_features`
  - DLQ – dead-letter topic/table for bad events
- Batch Pipeline (ELT + ML):
  - Cloud Composer – DAGs for:
    - feature aggregation
    - model training (BigQuery ML)
    - batch scoring
  - BigQuery – `bq_feats.*`, `bq_scores.transaction_risk_scores`
- Observability:
  - Cloud Monitoring – pipeline SLOs, alerts
  - Cloud Logging – job logs, error details
  - DQ results tables – rule-level pass/fail
- P1 – Scoring blocked
  - Streaming and batch scoring both impacted, or scores are clearly wrong.
  - Example: Dataflow streaming job down, Composer DAG failures, wrong model deployed.
- P2 – Partial degradation
  - One region/channel impacted, or only batch scoring delayed.
  - Example: only nightly batch scoring failed, streaming still live.
- P3 – Data-quality / reporting issue
  - DQ thresholds breached, but scoring still running.
  - Example: new upstream field missing, causing a higher DLQ rate.
Each incident gets a JIRA ticket (or equivalent), mapped to the SLO it violates.
- Streaming Dataflow job in error state or stopped.
- Sudden drop in messages processed per second.
- DLQ volume > agreed threshold.
- Gaps in the `raw_transactions` or `transaction_features` tables (see the gap-check sketch below).
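To quantify that last symptom, an ad-hoc freshness/gap check over `raw_transactions` helps. A minimal sketch, assuming the table is fully qualified as `project.dataset.raw_transactions` and has an `event_timestamp` column (both names are placeholders):

```python
from google.cloud import bigquery

bq = bigquery.Client()

# Hypothetical fully-qualified table and timestamp column; adjust to the real schema.
GAP_CHECK_SQL = """
WITH minutes AS (
  SELECT minute
  FROM UNNEST(GENERATE_TIMESTAMP_ARRAY(
    TIMESTAMP_SUB(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), MINUTE), INTERVAL 6 HOUR),
    TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), MINUTE),
    INTERVAL 1 MINUTE)) AS minute
),
ingested AS (
  SELECT TIMESTAMP_TRUNC(event_timestamp, MINUTE) AS minute, COUNT(*) AS n
  FROM `project.dataset.raw_transactions`
  WHERE event_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 6 HOUR)
  GROUP BY minute
)
SELECT m.minute, COALESCE(i.n, 0) AS rows_ingested
FROM minutes AS m
LEFT JOIN ingested AS i USING (minute)
WHERE COALESCE(i.n, 0) = 0      -- minutes with no ingested rows at all
ORDER BY m.minute
"""

for row in bq.query(GAP_CHECK_SQL).result():
    print(f"gap at {row.minute}")  # each row is a minute with zero ingested transactions
```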
- Check Dataflow job health
  - Inspect job status, worker errors, and autoscaling.
- Check Pub/Sub backlog (see the backlog sketch after this checklist)
  - Confirm messages are queued and not being consumed.
- Check DLQ
  - Understand whether the issue is schema-related or infrastructure-related.
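For the backlog check, the subscription's `num_undelivered_messages` metric in Cloud Monitoring is the usual signal. A hedged sketch using the `google-cloud-monitoring` client; the project id and subscription id (`transactions-realtime-sub`) are placeholders:

```python
import time
from google.cloud import monitoring_v3

PROJECT = "projects/your-project-id"            # placeholder project
SUBSCRIPTION_ID = "transactions-realtime-sub"   # placeholder subscription on transactions.realtime

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 600}, "end_time": {"seconds": now}}
)

# Pull the last 10 minutes of undelivered-message counts for the streaming subscription.
series = client.list_time_series(
    request={
        "name": PROJECT,
        "filter": (
            'metric.type = "pubsub.googleapis.com/subscription/num_undelivered_messages" '
            f'AND resource.labels.subscription_id = "{SUBSCRIPTION_ID}"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for ts in series:
    # Points come back newest-first; a steadily growing value means the job is not consuming.
    latest = ts.points[0].value.int64_value
    print(f"{SUBSCRIPTION_ID}: {latest} undelivered messages")
```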
- Stop / drain the broken streaming job if required.
- Fix configuration or code (e.g., schema, transform, credentials) and start a new job with:
  - the same Pub/Sub subscription
  - the correct template version / container image
- Reprocess from the Pub/Sub backlog
  - Let the new job consume accumulated messages until the backlog returns to normal.
- Replay from DLQ if needed (see the sketch after this list):
  - Export DLQ messages to a GCS bucket.
  - Start a short-lived batch Dataflow job that:
    - reads from GCS (or the DLQ topic)
    - applies the same ETL logic
    - writes to BigQuery with `insertId` to avoid duplicates
- Verify that:
  - `raw_transactions` row counts match expectations over time windows.
  - DQ tables show rules within thresholds.
  - Downstream `transaction_risk_scores` do not have large gaps.
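The `insertId` dedup step is the part most often missed. The sketch below shows the same idea without the Beam/Dataflow boilerplate, calling the BigQuery streaming-insert client directly; the bucket, prefix, table path, and the assumption that DLQ exports are newline-delimited JSON keyed by `transaction_id` are all illustrative:

```python
import json
from google.cloud import bigquery, storage

BUCKET = "dlq-replay-exports"                # placeholder GCS bucket holding exported DLQ messages
PREFIX = "dlq/2024-01-15/"                   # placeholder export prefix (one NDJSON file per batch)
TARGET = "project.dataset.raw_transactions"  # placeholder fully-qualified target table

bq = bigquery.Client()
gcs = storage.Client()

for blob in gcs.list_blobs(BUCKET, prefix=PREFIX):
    rows, row_ids = [], []
    for line in blob.download_as_text().splitlines():
        event = json.loads(line)
        # Apply the same ETL transform as the streaming job here (omitted in this sketch).
        rows.append(event)
        # A stable business key as the insertId gives best-effort dedup on retries/re-runs.
        row_ids.append(str(event["transaction_id"]))
    if not rows:
        continue
    errors = bq.insert_rows_json(TARGET, rows, row_ids=row_ids)
    if errors:
        raise RuntimeError(f"insert errors for {blob.name}: {errors}")
```

Note that `insertId` dedup is best-effort and time-bounded, which is why the verification checks above still matter.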
- Nightly Composer DAG failed and missed a feature or scoring run.
- New features were added and need historical values.
- A defect was fixed and scores/drivers must be recomputed from a specific date.
- Identify backfill window
  - Example: `event_date` from `2024-01-10` to `2024-01-15`.
- Disable conflicting schedules
  - Temporarily pause the standard Composer DAG for that window, or use a dedicated backfill DAG.
- Run feature backfill
  - Create a backfill task that:
    - reads from `raw_transactions`
    - recomputes `transaction_features` for the window
    - writes with idempotent keys (date + transaction_id)
- Run model scoring backfill
  - Use the approved model_version for that period (see model governance).
  - Re-run BigQuery ML `PREDICT` for the backfill window into a staging scores table: `bq_scores.transaction_risk_scores_backfill`
- Swap-in / merge scores (see the sketch after these steps)
  - After verification, merge backfilled rows into the main `transaction_risk_scores` table.
  - Maintain an `is_backfill` flag and a `backfill_run_id` for audit.
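A minimal sketch of the scoring backfill and merge. Only `bq_feats.transaction_features`, `bq_scores.transaction_risk_scores_backfill`, `bq_scores.transaction_risk_scores`, `is_backfill`, and `backfill_run_id` come from this runbook; the model path, the predicted-probability column, and the exact score-table columns are assumptions:

```python
from google.cloud import bigquery

bq = bigquery.Client()

SCORE_BACKFILL_SQL = """
-- Score the backfill window with the approved model version into the staging table.
CREATE OR REPLACE TABLE `bq_scores.transaction_risk_scores_backfill` AS
SELECT
  transaction_id,
  event_date,
  'v2024_01_20' AS model_version,                           -- approved version for this window
  predicted_is_fraud_probs[OFFSET(0)].prob AS risk_score,   -- column depends on the model's label
  TRUE AS is_backfill,
  'bf_2024_01_16_001' AS backfill_run_id                    -- placeholder run id
FROM ML.PREDICT(
  MODEL `bq_models.transaction_risk_v2024_01_20`,           -- placeholder model path
  (SELECT * FROM `bq_feats.transaction_features`
   WHERE event_date BETWEEN '2024-01-10' AND '2024-01-15'));
"""

MERGE_SQL = """
-- Idempotent swap-in: update existing (transaction_id, model_version) rows, insert the rest.
MERGE `bq_scores.transaction_risk_scores` AS t
USING `bq_scores.transaction_risk_scores_backfill` AS b
ON  t.transaction_id = b.transaction_id
AND t.model_version  = b.model_version
WHEN MATCHED THEN UPDATE SET
  risk_score      = b.risk_score,
  is_backfill     = TRUE,
  backfill_run_id = b.backfill_run_id
WHEN NOT MATCHED THEN INSERT
  (transaction_id, event_date, model_version, risk_score, is_backfill, backfill_run_id)
VALUES
  (b.transaction_id, b.event_date, b.model_version, b.risk_score, TRUE, b.backfill_run_id);
"""

bq.query(SCORE_BACKFILL_SQL).result()
# In practice, verify the staging table first, then run the merge as a separate step.
bq.query(MERGE_SQL).result()
```

After the merge, the checks below close out the backfill.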
- Row counts in features and scores match expectations.
- SLO dashboards show recovered freshness/latency.
- No duplicate scores for the same `(transaction_id, model_version)`.
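A hedged check for that last point, assuming the scores table carries an `event_date` column:

```python
from google.cloud import bigquery

bq = bigquery.Client()

DUP_CHECK_SQL = """
SELECT transaction_id, model_version, COUNT(*) AS n
FROM `bq_scores.transaction_risk_scores`
WHERE event_date BETWEEN '2024-01-10' AND '2024-01-15'   -- the backfill window
GROUP BY transaction_id, model_version
HAVING COUNT(*) > 1
"""

dups = list(bq.query(DUP_CHECK_SQL).result())
assert not dups, f"{len(dups)} duplicated (transaction_id, model_version) pairs after backfill"
```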
- New model_version is causing higher false positives/negatives.
- Business or risk team requests immediate reversion.
- Monitoring indicates drift or performance degradation after deployment.
- Each model is stored as a named BigQuery ML model with metadata: `model_name`, `model_version`, `training_data_snapshot`.
- The serving pipeline reads the active `model_version` from a configuration table: `bq_admin.active_models`.
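The exact shape of the lookup is not prescribed here; one plausible sketch, where the `model_name`, `status` columns and the `'transaction_risk'` value are assumptions:

```python
from google.cloud import bigquery

bq = bigquery.Client()

# Assumed columns: model_name, model_version, status; 'transaction_risk' is a placeholder name.
ACTIVE_MODEL_SQL = """
SELECT model_version
FROM `bq_admin.active_models`
WHERE model_name = 'transaction_risk'
  AND status = 'ACTIVE'
"""

rows = list(bq.query(ACTIVE_MODEL_SQL).result())
assert len(rows) == 1, "exactly one ACTIVE version expected per model"
active_version = rows[0].model_version
print(f"scoring with model_version={active_version}")
```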
- Freeze current version
  - Mark the current `model_version` as `ROLLBACK_CANDIDATE` in `bq_admin.active_models`.
- Promote previous stable version (see the sketch after these steps)
  - Update `bq_admin.active_models` to point to the previous `model_version` (e.g. from `v2024_02_15` back to `v2024_01_20`).
- Restart scoring jobs if necessary
  - For streaming scoring: ensure the Dataflow job reads the updated config.
  - For batch scoring: update Composer DAG parameters for the next run.
- Optional: re-score critical window
  - For the period impacted by the bad model, re-run scoring with the stable version into a staging table and merge.
- Document the rollback
  - Capture:
    - incident id
    - reason for rollback
    - old vs new `model_version`
    - metrics before/after
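The freeze and promote steps amount to two updates on `bq_admin.active_models`. A sketch reusing the example versions above; the `model_name` and `status` columns are assumptions:

```python
from google.cloud import bigquery

bq = bigquery.Client()

# Table name and example versions come from this runbook; column names are assumptions.
ROLLBACK_SQL = """
BEGIN TRANSACTION;

-- Freeze the misbehaving version.
UPDATE `bq_admin.active_models`
SET status = 'ROLLBACK_CANDIDATE'
WHERE model_name = 'transaction_risk' AND model_version = 'v2024_02_15';

-- Promote the previous stable version.
UPDATE `bq_admin.active_models`
SET status = 'ACTIVE'
WHERE model_name = 'transaction_risk' AND model_version = 'v2024_01_20';

COMMIT TRANSACTION;
"""

bq.query(ROLLBACK_SQL).result()
```

Running both updates in one transaction avoids a window where no version, or two versions, are marked active.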
When a DQ rule or SLO is breached:
- Identify which rule/SLO (freshness, volume, error rate, DLQ %, model performance).
- Check recent changes
  - new features, schema changes, config updates.
- Apply either replay, backfill, or model rollback
  - follow sections 3, 4, or 5 accordingly.
- Create an RCA note
  - cause, fix, and prevention steps logged in your internal tracker.
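To identify which rule breached, a query against the DQ results tables is usually the first step. The table and column names below are entirely hypothetical; adapt them to the actual rule-level pass/fail schema:

```python
from google.cloud import bigquery

bq = bigquery.Client()

# Hypothetical DQ results table and columns.
BREACHED_RULES_SQL = """
SELECT rule_name, table_name, run_ts, pass_rate, threshold
FROM `bq_admin.dq_results`
WHERE run_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
  AND pass_rate < threshold
ORDER BY run_ts DESC
"""

for row in bq.query(BREACHED_RULES_SQL).result():
    print(f"{row.rule_name} on {row.table_name}: {row.pass_rate:.2%} < {row.threshold:.2%}")
```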
In a real engagement, you would list:
- Data Engineering owner
- ML Engineering owner
- Risk / Business owner
- On-call rotation
For this sanitized case study, ownership is documented conceptually in the
docs/07-security-and-governance.md and docs/11-ml-governance-and-model-risk.md files.