RUNBOOK – ML-Based Risk Scoring – Tier-1 UK Retail Bank (GCP + BigQuery ML)

This runbook describes how to operate the Streaming Pipeline (ETL) and the Batch Pipeline (ELT + ML), and how to run the three common recovery procedures: replay, backfill, and model rollback.

This is a sanitized, documentation-only case study. The steps below describe patterns and checklists, not live production commands.


1. Components

  • Streaming Pipeline (ETL):

    • Cloud Pub/Sub – transactions.realtime
    • Dataflow – streaming job
    • BigQuery – raw_transactions, transaction_features
    • DLQ – dead-letter topic/table for bad events
  • Batch Pipeline (ELT + ML):

    • Cloud Composer – DAGs for:
      • feature aggregation
      • model training (BigQuery ML)
      • batch scoring
    • BigQuery – bq_feats.*, bq_scores.transaction_risk_scores
  • Observability:

    • Cloud Monitoring – pipeline SLOs, alerts
    • Cloud Logging – job logs, error details
    • DQ results tables – rule-level pass/fail

2. Incident classification

  1. P1 – Scoring blocked

    • Streaming and batch scoring both impacted, or scores are clearly wrong.
    • Example: Dataflow streaming job down, Composer DAG failures, wrong model deployed.
  2. P2 – Partial degradation

    • One region/channel impacted, or only batch scoring delayed.
    • Example: Only nightly batch scoring failed, streaming still live.
  3. P3 – Data-quality / reporting issue

    • DQ thresholds breached, but scoring still running.
    • Example: a new upstream field is missing, causing a higher DLQ rate.

Each incident gets a JIRA ticket (or equivalent), mapped to the SLO it violates.


3. Streaming Pipeline – replay procedure

3.1 Symptoms

  • Streaming Dataflow job in error state or stopped.
  • Sudden drop in messages processed per second.
  • DLQ volume > agreed threshold.
  • Gaps in raw_transactions or transaction_features tables.

3.2 Immediate actions

  1. Check Dataflow job health
    • Inspect job status, worker errors, and autoscaling.
  2. Check Pub/Sub backlog
    • Confirm messages are queued and not being consumed (see the backlog sketch after this list).
  3. Check DLQ
    • Understand whether the issue is schema-related or infrastructure-related.
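
The backlog check in step 2 can be scripted against Cloud Monitoring. A minimal sketch, assuming the google-cloud-monitoring client; the project and subscription IDs below are placeholders, not real resources:

```python
# Sketch: read the Pub/Sub backlog (undelivered message count) from Cloud
# Monitoring. Project and subscription IDs are hypothetical.
import time

from google.cloud import monitoring_v3

PROJECT_ID = "example-project"                 # placeholder
SUBSCRIPTION_ID = "transactions-realtime-sub"  # placeholder

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 600}}
)
series = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": (
            'metric.type = '
            '"pubsub.googleapis.com/subscription/num_undelivered_messages" '
            f'AND resource.labels.subscription_id = "{SUBSCRIPTION_ID}"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for ts in series:
    # Points are returned newest-first for this gauge metric.
    print(f"backlog: {ts.points[0].value.int64_value} undelivered messages")
```

A steadily growing value here, combined with a healthy publisher, points at the consumer (the Dataflow job) rather than upstream.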

3.3 Replay strategy (pattern)

  1. Stop / drain the broken streaming job if required.
  2. Fix configuration or code (e.g., schema, transform, credentials) and start a new job with:
    • The same Pub/Sub subscription
    • Correct template version / container image
  3. Reprocess from Pub/Sub backlog
    • Let the new job consume accumulated messages until backlog returns to normal.
  4. Replay from DLQ if needed (see the script sketch after this list):
    • Export DLQ messages to a GCS bucket.
    • Start a short-lived batch Dataflow job that:
      • reads from GCS (or DLQ topic)
      • applies the same ETL logic
      • writes to BigQuery with insertId to avoid duplicates
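
For small volumes, step 4 can also be done with a short script instead of a batch Dataflow job. A sketch assuming the DLQ export is newline-delimited JSON in GCS; bucket, prefix, and table paths are placeholders. The row_ids argument of insert_rows_json maps to BigQuery's insertId:

```python
# Sketch: replay exported DLQ events from GCS into BigQuery, keying insertId
# on transaction_id for best-effort deduplication. All names are placeholders.
import json

from google.cloud import bigquery, storage

BUCKET = "example-dlq-export"                   # placeholder
PREFIX = "replay/2024-01-12/"                   # placeholder
TABLE = "example-project.raw.raw_transactions"  # placeholder

gcs = storage.Client()
bq = bigquery.Client()

for blob in gcs.list_blobs(BUCKET, prefix=PREFIX):
    rows = [json.loads(line) for line in blob.download_as_text().splitlines()]
    # Apply the same ETL transform the streaming job uses (omitted here).
    errors = bq.insert_rows_json(
        TABLE,
        rows,
        row_ids=[r["transaction_id"] for r in rows],  # becomes insertId
    )
    if errors:
        raise RuntimeError(f"insert errors for {blob.name}: {errors}")
```

Note that insertId deduplication is best-effort over a short window; the validation queries in 3.4 remain the source of truth.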

3.4 Validation

  • Verify that:
    • raw_transactions row counts match expectations over time windows (see the query sketch below).
    • DQ tables show rules within thresholds.
    • Downstream transaction_risk_scores do not have large gaps.
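
The row-count check can be a single windowed query. A sketch with a placeholder table path and an assumed event_timestamp column:

```python
# Sketch: hourly row counts over the incident window, to spot gaps after a
# replay. The table path and event_timestamp column are assumptions.
import datetime as dt

from google.cloud import bigquery

client = bigquery.Client()
job = client.query(
    """
    SELECT TIMESTAMP_TRUNC(event_timestamp, HOUR) AS hour, COUNT(*) AS n
    FROM `example-project.raw.raw_transactions`
    WHERE event_timestamp BETWEEN @start AND @end
    GROUP BY hour
    ORDER BY hour
    """,
    job_config=bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter(
                "start", "TIMESTAMP", dt.datetime(2024, 1, 12, tzinfo=dt.timezone.utc)
            ),
            bigquery.ScalarQueryParameter(
                "end", "TIMESTAMP", dt.datetime(2024, 1, 13, tzinfo=dt.timezone.utc)
            ),
        ]
    ),
)
for row in job.result():
    print(row.hour, row.n)  # an hour with an unexpectedly low count is a gap
```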

4. Batch Pipeline – backfill procedure

4.1 When to backfill

  • Nightly Composer DAG failed and missed a feature or scoring run.
  • New features were added and need historical values.
  • A defect was fixed and scores/drivers must be recomputed from a specific date.

4.2 Backfill pattern

  1. Identify backfill window

    • Example: event_date from 2024-01-10 to 2024-01-15.
  2. Disable conflicting schedules

    • Temporarily pause the standard Composer DAG for that window, or use a dedicated backfill DAG.
  3. Run feature backfill

    • Create a backfill task that:
      • reads from raw_transactions
      • recomputes transaction_features for the window
      • writes with idempotent keys (date + transaction_id)
  4. Run model scoring backfill

    • Use the approved model_version for that period (see model governance).
    • Re-run ML.PREDICT in BigQuery ML for the backfill window into a staging scores table:
      • bq_scores.transaction_risk_scores_backfill
  5. Swap-in / merge scores

    • After verification, merge backfilled rows into the main transaction_risk_scores table.
    • Maintain an is_backfill flag and backfill_run_id for audit.
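
Steps 4 and 5 reduce to two statements run from the backfill task. A sketch with placeholder model, dataset, and column names; the predicted output column depends on the model type:

```python
# Sketch: re-score a backfill window with BigQuery ML into the staging table,
# then merge into the main scores table keyed on (transaction_id,
# model_version). All object and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
RUN_ID = "backfill-2024-01-16-001"  # placeholder backfill_run_id

# Step 4: score the window with the approved model version.
client.query("""
CREATE OR REPLACE TABLE `bq_scores.transaction_risk_scores_backfill` AS
SELECT
  transaction_id,
  'v2024_01_20' AS model_version,    -- approved version for the period
  predicted_label AS risk_score,     -- output column depends on model type
  event_date
FROM ML.PREDICT(
  MODEL `bq_models.risk_model_v2024_01_20`,   -- placeholder model path
  (SELECT *
   FROM `bq_feats.transaction_features`
   WHERE event_date BETWEEN '2024-01-10' AND '2024-01-15'))
""").result()

# Step 5: idempotent merge, with audit columns.
client.query(f"""
MERGE `bq_scores.transaction_risk_scores` AS t
USING `bq_scores.transaction_risk_scores_backfill` AS s
ON t.transaction_id = s.transaction_id
   AND t.model_version = s.model_version
WHEN MATCHED THEN
  UPDATE SET risk_score = s.risk_score,
             is_backfill = TRUE,
             backfill_run_id = '{RUN_ID}'
WHEN NOT MATCHED THEN
  INSERT (transaction_id, model_version, risk_score, event_date,
          is_backfill, backfill_run_id)
  VALUES (s.transaction_id, s.model_version, s.risk_score, s.event_date,
          TRUE, '{RUN_ID}')
""").result()
```

The same MERGE-on-natural-key pattern makes the feature backfill in step 3 idempotent as well, keyed on (date, transaction_id).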

4.3 Validation checklist

  • Row counts in features and scores match expectations.
  • SLO dashboards show recovered freshness/latency.
  • No duplicate scores for the same (transaction_id, model_version).
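
The duplicate check in the last item is a one-query assertion; the table path is a placeholder:

```python
# Sketch: fail the backfill validation if any (transaction_id, model_version)
# pair appears more than once in the merged scores table.
from google.cloud import bigquery

dupes = list(bigquery.Client().query("""
SELECT transaction_id, model_version, COUNT(*) AS n
FROM `bq_scores.transaction_risk_scores`
GROUP BY transaction_id, model_version
HAVING COUNT(*) > 1
LIMIT 10
""").result())
assert not dupes, f"duplicate scores found: {dupes}"
```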

5. Model rollback procedure (BigQuery ML)

5.1 When to roll back

  • The new model_version is causing higher false-positive or false-negative rates.
  • Business or risk team requests immediate reversion.
  • Monitoring indicates drift or performance degradation after deployment.

5.2 Assumptions

  • Each model is stored as a named BigQuery ML model with metadata:
    • model_name
    • model_version
    • training_data_snapshot
  • The serving pipeline reads the active model_version from a configuration table:
    • bq_admin.active_models
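
One possible shape for that configuration table, as a sketch; only model_name, model_version, and training_data_snapshot come from this runbook, while status and updated_at are illustrative additions:

```python
# Sketch: a minimal bq_admin.active_models table. The status column carries
# the lifecycle states used in 5.3 (ACTIVE, ROLLBACK_CANDIDATE).
from google.cloud import bigquery

bigquery.Client().query("""
CREATE TABLE IF NOT EXISTS `bq_admin.active_models` (
  model_name             STRING NOT NULL,
  model_version          STRING NOT NULL,
  training_data_snapshot STRING,      -- e.g. snapshot table or export date
  status                 STRING,      -- ACTIVE / ROLLBACK_CANDIDATE / RETIRED
  updated_at             TIMESTAMP
)
""").result()
```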

5.3 Rollback steps

  1. Freeze current version

    • Mark the current model_version as ROLLBACK_CANDIDATE in bq_admin.active_models.
  2. Promote previous stable version

    • Update bq_admin.active_models to point to the previous model_version (e.g. from v2024_02_15 back to v2024_01_20); see the sketch after this list.
  3. Restart scoring jobs if necessary

    • For streaming scoring: ensure the Dataflow job reads the updated config.
    • For batch scoring: update Composer DAG parameters for the next run.
  4. Optional: re-score critical window

    • For the period impacted by the bad model, re-run scoring with the stable version into a staging table and merge.
  5. Document the rollback

    • Capture:
      • incident id
      • reason for rollback
      • old vs new model_version
      • metrics before/after
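
Steps 1 and 2 reduce to two updates on the configuration table. A sketch using the example versions from step 2; model_name and the status column follow the table sketch in 5.2:

```python
# Sketch: demote the bad version and promote the previous stable one in
# bq_admin.active_models. model_name is a placeholder.
from google.cloud import bigquery

client = bigquery.Client()

# Step 1: freeze the current version.
client.query("""
UPDATE `bq_admin.active_models`
SET status = 'ROLLBACK_CANDIDATE', updated_at = CURRENT_TIMESTAMP()
WHERE model_name = 'transaction_risk' AND model_version = 'v2024_02_15'
""").result()

# Step 2: promote the previous stable version.
client.query("""
UPDATE `bq_admin.active_models`
SET status = 'ACTIVE', updated_at = CURRENT_TIMESTAMP()
WHERE model_name = 'transaction_risk' AND model_version = 'v2024_01_20'
""").result()
```

Scoring jobs that read the active version at startup (step 3) only pick this up after a restart or config refresh.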

6. DQ and SLO breach handling

When a DQ rule or SLO is breached:

  1. Identify which rule/SLO (freshness, volume, error rate, DLQ %, model performance).
  2. Check recent changes
    • new features, schema changes, config updates.
  3. Apply either replay, backfill, or model rollback
    • follow sections 3, 4, or 5 accordingly.
  4. Create an RCA note
    • Cause, fix, and prevention steps logged in your internal tracker.

7. Contact & ownership (sanitized)

In a real engagement, you would list:

  • Data Engineering owner
  • ML Engineering owner
  • Risk / Business owner
  • On-call rotation

For this sanitized case study, ownership is documented conceptually in docs/07-security-and-governance.md and docs/11-ml-governance-and-model-risk.md.