RUNBOOK – ML-Based Risk Scoring – Tier-1 UK Retail Bank (GCP + BigQuery ML)

This runbook describes how to operate the Streaming Pipeline (ETL) and the Batch Pipeline (ELT + ML), and how to run the three common recovery procedures: replay, backfill, and model rollback.

This is a sanitized, documentation-only case study. The steps below describe patterns and checklists, not live production commands.


1. Components

  • Streaming Pipeline (ETL):

    • Cloud Pub/Sub – transactions.realtime
    • Dataflow – streaming job
    • BigQuery – raw_transactions, transaction_features
    • DLQ – dead-letter topic/table for bad events
  • Batch Pipeline (ELT + ML):

    • Cloud Composer – DAGs for:
      • feature aggregation
      • model training (BigQuery ML)
      • batch scoring
    • BigQuery – bq_feats.*, bq_scores.transaction_risk_scores
  • Observability:

    • Cloud Monitoring – pipeline SLOs, alerts
    • Cloud Logging – job logs, error details
    • DQ results tables – rule-level pass/fail

2. Incident classification

  1. P1 – Scoring blocked

    • Streaming and batch scoring both impacted, or scores are clearly wrong.
    • Example: Dataflow streaming job down, Composer DAG failures, wrong model deployed.
  2. P2 – Partial degradation

    • One region/channel impacted, or only batch scoring delayed.
    • Example: Only nightly batch scoring failed, streaming still live.
  3. P3 – Data-quality / reporting issue

    • DQ thresholds breached, but scoring still running.
    • Example: a new upstream field is missing, causing a higher DLQ rate.

Each incident gets a JIRA ticket (or equivalent), mapped to the SLO it violates.


3. Streaming Pipeline – replay procedure

3.1 Symptoms

  • Streaming Dataflow job in error state or stopped.
  • Sudden drop in messages processed per second.
  • DLQ volume > agreed threshold.
  • Gaps in raw_transactions or transaction_features tables.

3.2 Immediate actions

  1. Check Dataflow job health
    • Inspect job status, worker errors, and autoscaling.
  2. Check Pub/Sub backlog
    • Confirm messages are queued and not being consumed (see the backlog sketch after this list).
  3. Check DLQ
    • Understand whether the issue is schema-related or infrastructure-related.
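
The backlog check in step 2 can be scripted against Cloud Monitoring. A minimal sketch, assuming the google-cloud-monitoring client; the project and subscription IDs below are placeholders, not real resources:

```python
# Sketch: read the Pub/Sub backlog (undelivered message count) from Cloud
# Monitoring. Project and subscription IDs are hypothetical.
import time

from google.cloud import monitoring_v3

PROJECT_ID = "example-project"                 # placeholder
SUBSCRIPTION_ID = "transactions-realtime-sub"  # placeholder

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 600}}
)
series = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": (
            'metric.type = '
            '"pubsub.googleapis.com/subscription/num_undelivered_messages" '
            f'AND resource.labels.subscription_id = "{SUBSCRIPTION_ID}"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for ts in series:
    # Points are returned newest-first for this gauge metric.
    print(f"backlog: {ts.points[0].value.int64_value} undelivered messages")
```

A steadily growing value here, combined with a healthy publisher, points at the consumer (the Dataflow job) rather than upstream.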

3.3 Replay strategy (pattern)

  1. Stop / drain the broken streaming job if required.
  2. Fix configuration or code (e.g., schema, transform, credentials) and start a new job with:
    • The same Pub/Sub subscription
    • Correct template version / container image
  3. Reprocess from Pub/Sub backlog
    • Let the new job consume accumulated messages until backlog returns to normal.
  4. Replay from DLQ if needed (see the script sketch after this list):
    • Export DLQ messages to a GCS bucket.
    • Start a short-lived batch Dataflow job that:
      • reads from GCS (or DLQ topic)
      • applies the same ETL logic
      • writes to BigQuery with insertId to avoid duplicates
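
For small volumes, step 4 can also be done with a short script instead of a batch Dataflow job. A sketch assuming the DLQ export is newline-delimited JSON in GCS; bucket, prefix, and table paths are placeholders. The row_ids argument of insert_rows_json maps to BigQuery's insertId:

```python
# Sketch: replay exported DLQ events from GCS into BigQuery, keying insertId
# on transaction_id for best-effort deduplication. All names are placeholders.
import json

from google.cloud import bigquery, storage

BUCKET = "example-dlq-export"                   # placeholder
PREFIX = "replay/2024-01-12/"                   # placeholder
TABLE = "example-project.raw.raw_transactions"  # placeholder

gcs = storage.Client()
bq = bigquery.Client()

for blob in gcs.list_blobs(BUCKET, prefix=PREFIX):
    rows = [json.loads(line) for line in blob.download_as_text().splitlines()]
    # Apply the same ETL transform the streaming job uses (omitted here).
    errors = bq.insert_rows_json(
        TABLE,
        rows,
        row_ids=[r["transaction_id"] for r in rows],  # becomes insertId
    )
    if errors:
        raise RuntimeError(f"insert errors for {blob.name}: {errors}")
```

Note that insertId deduplication is best-effort over a short window; the validation queries in 3.4 remain the source of truth.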

3.4 Validation

  • Verify that:
    • raw_transactions row counts match expectations over time windows (see the query sketch below).
    • DQ tables show rules within thresholds.
    • Downstream transaction_risk_scores do not have large gaps.
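
The row-count check can be a single windowed query. A sketch with a placeholder table path and an assumed event_timestamp column:

```python
# Sketch: hourly row counts over the incident window, to spot gaps after a
# replay. The table path and event_timestamp column are assumptions.
import datetime as dt

from google.cloud import bigquery

client = bigquery.Client()
job = client.query(
    """
    SELECT TIMESTAMP_TRUNC(event_timestamp, HOUR) AS hour, COUNT(*) AS n
    FROM `example-project.raw.raw_transactions`
    WHERE event_timestamp BETWEEN @start AND @end
    GROUP BY hour
    ORDER BY hour
    """,
    job_config=bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter(
                "start", "TIMESTAMP", dt.datetime(2024, 1, 12, tzinfo=dt.timezone.utc)
            ),
            bigquery.ScalarQueryParameter(
                "end", "TIMESTAMP", dt.datetime(2024, 1, 13, tzinfo=dt.timezone.utc)
            ),
        ]
    ),
)
for row in job.result():
    print(row.hour, row.n)  # an hour with an unexpectedly low count is a gap
```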

4. Batch Pipeline – backfill procedure

4.1 When to backfill

  • Nightly Composer DAG failed and missed a feature or scoring run.
  • New features were added and need historical values.
  • A defect was fixed and scores/drivers must be recomputed from a specific date.

4.2 Backfill pattern

  1. Identify backfill window

    • Example: event_date from 2024-01-10 to 2024-01-15.
  2. Disable conflicting schedules

    • Temporarily pause the standard Composer DAG for that window, or use a dedicated backfill DAG.
  3. Run feature backfill

    • Create a backfill task that:
      • reads from raw_transactions
      • recomputes transaction_features for the window
      • writes with idempotent keys (date + transaction_id)
  4. Run model scoring backfill

    • Use the approved model_version for that period (see model governance).
    • Re-run ML.PREDICT in BigQuery ML for the backfill window into a staging scores table:
      • bq_scores.transaction_risk_scores_backfill
  5. Swap-in / merge scores

    • After verification, merge backfilled rows into the main transaction_risk_scores table.
    • Maintain an is_backfill flag and backfill_run_id for audit.
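
Steps 4 and 5 reduce to two statements run from the backfill task. A sketch with placeholder model, dataset, and column names; the predicted output column depends on the model type:

```python
# Sketch: re-score a backfill window with BigQuery ML into the staging table,
# then merge into the main scores table keyed on (transaction_id,
# model_version). All object and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
RUN_ID = "backfill-2024-01-16-001"  # placeholder backfill_run_id

# Step 4: score the window with the approved model version.
client.query("""
CREATE OR REPLACE TABLE `bq_scores.transaction_risk_scores_backfill` AS
SELECT
  transaction_id,
  'v2024_01_20' AS model_version,    -- approved version for the period
  predicted_label AS risk_score,     -- output column depends on model type
  event_date
FROM ML.PREDICT(
  MODEL `bq_models.risk_model_v2024_01_20`,   -- placeholder model path
  (SELECT *
   FROM `bq_feats.transaction_features`
   WHERE event_date BETWEEN '2024-01-10' AND '2024-01-15'))
""").result()

# Step 5: idempotent merge, with audit columns.
client.query(f"""
MERGE `bq_scores.transaction_risk_scores` AS t
USING `bq_scores.transaction_risk_scores_backfill` AS s
ON t.transaction_id = s.transaction_id
   AND t.model_version = s.model_version
WHEN MATCHED THEN
  UPDATE SET risk_score = s.risk_score,
             is_backfill = TRUE,
             backfill_run_id = '{RUN_ID}'
WHEN NOT MATCHED THEN
  INSERT (transaction_id, model_version, risk_score, event_date,
          is_backfill, backfill_run_id)
  VALUES (s.transaction_id, s.model_version, s.risk_score, s.event_date,
          TRUE, '{RUN_ID}')
""").result()
```

The same MERGE-on-natural-key pattern makes the feature backfill in step 3 idempotent as well, keyed on (date, transaction_id).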

4.3 Validation checklist

  • Row counts in features and scores match expectations.
  • SLO dashboards show recovered freshness/latency.
  • No duplicate scores for the same (transaction_id, model_version).
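
The duplicate check in the last item is a one-query assertion; the table path is a placeholder:

```python
# Sketch: fail the backfill validation if any (transaction_id, model_version)
# pair appears more than once in the merged scores table.
from google.cloud import bigquery

dupes = list(bigquery.Client().query("""
SELECT transaction_id, model_version, COUNT(*) AS n
FROM `bq_scores.transaction_risk_scores`
GROUP BY transaction_id, model_version
HAVING COUNT(*) > 1
LIMIT 10
""").result())
assert not dupes, f"duplicate scores found: {dupes}"
```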

5. Model rollback procedure (BigQuery ML)

5.1 When to roll back

  • The new model_version is causing higher false-positive or false-negative rates.
  • Business or risk team requests immediate reversion.
  • Monitoring indicates drift or performance degradation after deployment.

5.2 Assumptions

  • Each model is stored as a named BigQuery ML model with metadata:
    • model_name
    • model_version
    • training_data_snapshot
  • The serving pipeline reads the active model_version from a configuration table:
    • bq_admin.active_models
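
One possible shape for that configuration table, as a sketch; only model_name, model_version, and training_data_snapshot come from this runbook, while status and updated_at are illustrative additions:

```python
# Sketch: a minimal bq_admin.active_models table. The status column carries
# the lifecycle states used in 5.3 (ACTIVE, ROLLBACK_CANDIDATE).
from google.cloud import bigquery

bigquery.Client().query("""
CREATE TABLE IF NOT EXISTS `bq_admin.active_models` (
  model_name             STRING NOT NULL,
  model_version          STRING NOT NULL,
  training_data_snapshot STRING,      -- e.g. snapshot table or export date
  status                 STRING,      -- ACTIVE / ROLLBACK_CANDIDATE / RETIRED
  updated_at             TIMESTAMP
)
""").result()
```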

5.3 Rollback steps

  1. Freeze current version

    • Mark the current model_version as ROLLBACK_CANDIDATE in bq_admin.active_models.
  2. Promote previous stable version

    • Update bq_admin.active_models to point to the previous model_version (e.g. from v2024_02_15 back to v2024_01_20); see the sketch after this list.
  3. Restart scoring jobs if necessary

    • For streaming scoring: ensure the Dataflow job reads the updated config.
    • For batch scoring: update Composer DAG parameters for the next run.
  4. Optional: re-score critical window

    • For the period impacted by the bad model, re-run scoring with the stable version into a staging table and merge.
  5. Document the rollback

    • Capture:
      • incident id
      • reason for rollback
      • old vs new model_version
      • metrics before/after
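
Steps 1 and 2 reduce to two updates on the configuration table. A sketch using the example versions from step 2; model_name and the status column follow the table sketch in 5.2:

```python
# Sketch: demote the bad version and promote the previous stable one in
# bq_admin.active_models. model_name is a placeholder.
from google.cloud import bigquery

client = bigquery.Client()

# Step 1: freeze the current version.
client.query("""
UPDATE `bq_admin.active_models`
SET status = 'ROLLBACK_CANDIDATE', updated_at = CURRENT_TIMESTAMP()
WHERE model_name = 'transaction_risk' AND model_version = 'v2024_02_15'
""").result()

# Step 2: promote the previous stable version.
client.query("""
UPDATE `bq_admin.active_models`
SET status = 'ACTIVE', updated_at = CURRENT_TIMESTAMP()
WHERE model_name = 'transaction_risk' AND model_version = 'v2024_01_20'
""").result()
```

Scoring jobs that read the active version at startup (step 3) only pick this up after a restart or config refresh.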

6. DQ and SLO breach handling

When a DQ rule or SLO is breached:

  1. Identify which rule/SLO (freshness, volume, error rate, DLQ %, model performance).
  2. Check recent changes
    • new features, schema changes, config updates.
  3. Apply either replay, backfill, or model rollback
    • follow sections 3, 4, or 5 accordingly.
  4. Create an RCA note
    • Cause, fix, and prevention steps logged in your internal tracker.

7. Contact & ownership (sanitized)

In a real engagement, you would list:

  • Data Engineering owner
  • ML Engineering owner
  • Risk / Business owner
  • On-call rotation

For this sanitized case study, ownership is documented conceptually in docs/07-security-and-governance.md and docs/11-ml-governance-and-model-risk.md.