Sanitized case study — ML-based Risk Scoring for a Tier-1 UK Retail Bank on GCP
(Streaming Pipeline + Batch Pipeline using Pub/Sub · Dataflow · BigQuery · BigQuery ML · Cloud Composer · GCS).
Patterns only; no client code or client data.
- Domain: Retail Banking · BFSI · Fraud & Credit Risk · Audit/Compliance
- Pipelines: Streaming Pipeline (ETL) + Batch Pipeline (ELT) on GCP
- Stack: Cloud Pub/Sub, Dataflow (Apache Beam – Python), BigQuery, BigQuery ML, Cloud Composer, GCS, Power BI / Looker Studio
- Throughput (simulated):
  - ~50–100 transactions per second (steady)
  - Up to ~5–10 million transactions per day
- SLOs (simulated):
  - p95 end-to-end risk-score latency: < 90 seconds from transaction to score
  - Data Quality (DQ) pass rate: ≥ 95% for reportable features and risk scores
  - Streaming Pipeline availability: ≥ 99.5%
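As a concrete illustration, the latency and DQ SLOs above could be monitored with scheduled BigQuery checks along these lines. This is a minimal sketch; the project, dataset, and column names are assumptions, not part of the published contracts:

```python
# Hypothetical SLO checks -- project, dataset, and column names are
# illustrative, not part of the published contracts.
from google.cloud import bigquery

client = bigquery.Client()

# p95 end-to-end latency: event timestamp -> persisted risk score.
p95 = list(client.query("""
    SELECT APPROX_QUANTILES(
             TIMESTAMP_DIFF(s.scored_at, r.event_ts, SECOND), 100)[OFFSET(95)] AS v
    FROM `project.bq_scores.transaction_risk_scores` AS s
    JOIN `project.bq_raw.raw_transactions` AS r USING (transaction_id)
    WHERE s.event_date = CURRENT_DATE()
"""))[0].v

# DQ pass rate over today's DQ run results.
dq = list(client.query("""
    SELECT SAFE_DIVIDE(SUM(rules_passed), SUM(rules_passed + rules_failed)) AS v
    FROM `project.bq_audit.dq_run_results`
    WHERE run_date = CURRENT_DATE()
"""))[0].v

assert p95 < 90, f"p95 latency SLO breached: {p95}s"
assert dq >= 0.95, f"DQ pass-rate SLO breached: {dq:.2%}"
```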
This project shows how a Tier-1 UK Retail Bank could implement an ML-Based Risk Scoring platform on GCP using a mix of:
- Streaming Pipeline (ETL Pipeline) for near-real-time ingestion and scoring of transactions
- Batch Pipeline (ELT Pipeline) for daily aggregates, model training, and re-scoring
The goal is to:
- Continuously ingest card & account transactions + customer behaviour events
- Build/maintain feature tables in BigQuery
- Train BigQuery ML models for fraud risk / credit risk
- Generate risk scores that are auditable, governed, and easy to consume by downstream systems and dashboards
The repo is docs-only (no client data, no production code).
It focuses on architecture, contracts, DQ, SLOs, ML governance, and operational patterns.
- Transactional events
  - Card payments, ATM withdrawals, online banking transactions
  - Ingested via Cloud Pub/Sub – topic: transactions.realtime
  - Payload schema: contracts/transactions.schema.json
- Customer & account attributes
  - Static/dimensional data (KYC, limits, risk bands)
  - Landed as Batch Pipeline loads into BigQuery staging tables
  - Schema: contracts/customers.schema.json
- Behavioural / device events (optional)
  - Login attempts, device fingerprints, channel usage
  - Ingested either through a separate Pub/Sub topic or as batch tables
- Feature tables (BigQuery)
  - Streaming + Batch ETL/ELT pipelines create curated feature tables: bq_feats.transaction_features and bq_feats.customer_features
  - Partitioned by event_date, clustered by customer_id / account_id (see the DDL sketch after this list)
- Risk score tables (BigQuery ML predictions)
  - bq_scores.transaction_risk_scores
  - Columns: transaction_id, customer_id, model_version, risk_score, risk_band, decision_flags, metadata
  - CMEK-encrypted, with row-level security (RLS) per team
- Aggregated risk views for dashboards
  - bq_marts.daily_risk_summary, used by Power BI / Looker Studio for operational risk monitoring
- Audit & DQ evidence
  - DQ run results with run_id, rules_passed/failed, and DQ score
  - DLQ tables/topics for rejected messages with replay capability
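To make the table shapes concrete, here is a hedged DDL sketch for the feature and score tables above. Only the table names, partitioning, and clustering keys come from this design; the column lists are illustrative assumptions:

```python
# Illustrative DDL for the curated feature and score tables -- column lists
# are assumptions, not the real contracts.
from google.cloud import bigquery

client = bigquery.Client()

FEATURES_DDL = """
CREATE TABLE IF NOT EXISTS `project.bq_feats.transaction_features` (
  transaction_id STRING NOT NULL,
  customer_id    STRING NOT NULL,
  account_id     STRING,
  event_date     DATE   NOT NULL,
  amount_gbp     NUMERIC,
  txn_count_24h  INT64,      -- rolling per-customer aggregate
  avg_amount_7d  NUMERIC     -- rolling per-customer aggregate
)
PARTITION BY event_date
CLUSTER BY customer_id, account_id
"""

SCORES_DDL = """
CREATE TABLE IF NOT EXISTS `project.bq_scores.transaction_risk_scores` (
  transaction_id STRING NOT NULL,
  customer_id    STRING NOT NULL,
  model_version  STRING NOT NULL,
  risk_score     FLOAT64,
  risk_band      STRING,     -- e.g. LOW / MEDIUM / HIGH
  decision_flags ARRAY<STRING>,
  metadata       JSON,
  event_date     DATE NOT NULL
)
PARTITION BY event_date
CLUSTER BY customer_id
-- CMEK and row-level security are applied via dataset and RLS policies,
-- not shown here.
"""

for ddl in (FEATURES_DDL, SCORES_DDL):
    client.query(ddl).result()
```

Partitioning on event_date keeps daily scans cheap; clustering on customer_id speeds the per-customer lookups that feature engineering and scoring rely on.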
End to end, the platform is designed to:
- Ingest every transaction in near real time through a Streaming Pipeline (ETL); a minimal Beam sketch follows this list.
- Enrich transaction events with customer/account attributes and historical aggregates.
- Engineer features (per-customer, per-card, per-device) in BigQuery.
- Train ML models (fraud/credit risk) using BigQuery ML on daily snapshots via the Batch Pipeline (ELT).
- Score new transactions:
  - Streaming path: low-latency scoring using the latest approved model
  - Batch path: end-of-day/offline re-scoring or challenger models
- Serve risk scores to downstream systems (decision engines, case management tools, dashboards).
- Govern everything with CMEK, VPC-SC, IAM/RBAC, Policy Tags, RLS/CLS, and full lineage.
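The streaming ingest–validate–land path could look like the following Apache Beam (Python) sketch. Topic, table, and field names are assumptions for illustration; the real payload contract lives in contracts/transactions.schema.json:

```python
# Minimal Apache Beam sketch of the Streaming Pipeline (ETL).
# Topic, table, and field names are illustrative only.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

REQUIRED = ("transaction_id", "customer_id", "amount", "event_ts")

def parse_and_validate(msg: bytes):
    """Yield valid rows to the main output, rejects to the DLQ output."""
    try:
        row = json.loads(msg.decode("utf-8"))
        if all(row.get(k) is not None for k in REQUIRED):
            yield row
        else:
            yield beam.pvalue.TaggedOutput(
                "dlq", {"raw": msg.decode("utf-8"), "reason": "missing_field"})
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        yield beam.pvalue.TaggedOutput("dlq", {"raw": str(msg), "reason": str(exc)})

def run():
    opts = PipelineOptions(streaming=True)  # plus project/region/runner flags
    with beam.Pipeline(options=opts) as p:
        parsed = (
            p
            | "ReadTx" >> beam.io.ReadFromPubSub(
                topic="projects/PROJECT/topics/transactions.realtime")
            | "ParseValidate" >> beam.FlatMap(parse_and_validate).with_outputs(
                "dlq", main="valid")
        )
        # Tables assumed pre-created (see the DDL sketch earlier).
        parsed.valid | "WriteRaw" >> beam.io.WriteToBigQuery(
            "project:bq_raw.raw_transactions",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
        parsed.dlq | "WriteDLQ" >> beam.io.WriteToBigQuery(
            "project:bq_audit.dlq_transactions",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)

if __name__ == "__main__":
    run()
```

Rejected messages land in a DLQ table with a reason attached, which is what supports the replay capability noted earlier.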
The final PNG is committed as assets/architecture_l2.png; the Mermaid version is kept inline for readability.
```mermaid
flowchart LR
    subgraph VPC_SC[VPC-SC Protected Boundary]
        TX["Client Channels<br/>(Card, ATM, Online)"]
        PUB["Cloud Pub/Sub<br/>transactions.realtime"]
        DF_STREAM["Dataflow<br/>Streaming Pipeline (ETL)"]
        BQ_RAW["BigQuery<br/>raw_transactions"]
        BQ_FEAT["BigQuery<br/>feature tables"]
        BQ_ML["BigQuery ML<br/>models"]
        DF_BATCH["Dataflow<br/>Batch Pipeline (ELT)"]
        COMP["Cloud Composer<br/>(Orchestration)"]
        GCS["GCS<br/>Model & DQ Artifacts"]
        BQ_SCORES["BigQuery<br/>risk_scores tables"]
    end
    TX --> PUB
    PUB --> DF_STREAM
    DF_STREAM --> BQ_RAW
    DF_STREAM --> BQ_FEAT
    COMP --> DF_BATCH
    DF_BATCH --> BQ_FEAT
    DF_BATCH --> BQ_ML
    BQ_ML --> BQ_SCORES
    BQ_SCORES -->|BI / Ops| BI[("Power BI / Looker Studio")]
    BQ_SCORES --> DOWNSTREAM[("Downstream<br/>Risk Engines")]
    BQ_FEAT --> GCS
    BQ_ML --> GCS
```
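In the diagram above, channel systems publish directly to the transactions.realtime topic. A minimal publisher sketch, assuming JSON payloads (field names are placeholders; the real contract is contracts/transactions.schema.json):

```python
# Hypothetical event publisher for the TX -> Pub/Sub edge above.
# Field names are placeholders for the real contract.
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("PROJECT", "transactions.realtime")

event = {
    "transaction_id": "tx-0001",
    "customer_id": "cust-42",
    "amount": 125.50,
    "currency": "GBP",
    "channel": "ONLINE",
    "event_ts": "2024-01-01T12:00:00Z",
}

future = publisher.publish(topic_path, json.dumps(event).encode("utf-8"))
print("published message id:", future.result())
```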
The sequence below traces one transaction through the streaming and batch paths:

```mermaid
sequenceDiagram
    participant Channel as Channel (POS/ATM/Online)
    participant PubSub as Cloud Pub/Sub
    participant DFStream as Dataflow Streaming Pipeline (ETL)
    participant BQRaw as BigQuery raw_transactions
    participant BQFeat as BigQuery feature tables
    participant Composer as Cloud Composer
    participant BQML as BigQuery ML
    participant BQScores as BigQuery risk_scores
    participant BI as Dashboards / Risk Ops
    Channel->>PubSub: Publish transaction event
    PubSub->>DFStream: Push message
    DFStream->>DFStream: Validate + ETL transforms<br/>(schema, enrichment, DQ checks)
    DFStream->>BQRaw: Insert raw record (partitioned)
    DFStream->>BQFeat: Update streaming feature tables
    DFStream-->>BQScores: (optional) Low-latency scoring call
    Composer->>BQRaw: Nightly Batch Pipeline (ELT) query
    Composer->>BQFeat: Build training features
    Composer->>BQML: Train / retrain model<br/>(tag model_version)
    Composer->>BQScores: Batch scoring jobs
    BQScores-->>BI: Risk dashboards, alerts, queues
```
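The Composer steps in the sequence above (nightly feature build, BigQuery ML training, batch scoring) could be orchestrated with a DAG along these lines. Dataset, model, and SQL details are illustrative, not production code:

```python
# Sketch of the nightly Batch Pipeline (ELT) DAG in Cloud Composer (Airflow).
# Dataset, model, and SQL details are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryInsertJobOperator,
)

def bq_task(task_id: str, sql: str) -> BigQueryInsertJobOperator:
    """Run a standard-SQL statement as a BigQuery job."""
    return BigQueryInsertJobOperator(
        task_id=task_id,
        configuration={"query": {"query": sql, "useLegacySql": False}},
    )

with DAG(
    dag_id="risk_scoring_nightly",
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",  # nightly at 02:00
    catchup=False,
) as dag:
    build_features = bq_task(
        "build_features",
        """
        INSERT INTO `project.bq_feats.customer_features`
        SELECT customer_id, CURRENT_DATE() AS event_date,
               COUNT(*) AS txn_count_1d, AVG(amount_gbp) AS avg_amount_1d
        FROM `project.bq_raw.raw_transactions`
        WHERE event_date = CURRENT_DATE()
        GROUP BY customer_id
        """,
    )

    train_model = bq_task(
        "train_model",
        """
        CREATE OR REPLACE MODEL `project.bq_ml.fraud_risk_v1`
        OPTIONS (model_type = 'LOGISTIC_REG', input_label_cols = ['is_fraud'])
        AS SELECT * FROM `project.bq_feats.training_snapshot`
        """,
    )

    batch_score = bq_task(
        "batch_score",
        """
        INSERT INTO `project.bq_scores.transaction_risk_scores`
          (transaction_id, customer_id, model_version, risk_score, event_date)
        SELECT transaction_id, customer_id, 'fraud_risk_v1',
               (SELECT prob FROM UNNEST(predicted_is_fraud_probs)
                WHERE label = TRUE),
               CURRENT_DATE()
        FROM ML.PREDICT(
          MODEL `project.bq_ml.fraud_risk_v1`,
          (SELECT * FROM `project.bq_feats.transaction_features`
           WHERE event_date = CURRENT_DATE()))
        """,
    )

    build_features >> train_model >> batch_score
```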
Detailed documentation lives under /docs:
- docs/01-context-and-usecase.md
- docs/02-architecture-overview.md
- docs/03-streaming-pipeline-event-flow.md
- docs/04-batch-pipeline-elt-and-ml-training.md
- docs/05-data-models-and-feature-store.md
- docs/06-data-quality-and-risk-metrics.md
- docs/07-security-and-governance.md
- docs/08-lineage-and-auditability.md
- docs/09-slos-observability-and-dashboards.md
- docs/10-cost-and-scaling-guardrails.md
- docs/11-ml-governance-and-model-risk.md
- docs/12-roadmap-and-future-work.md
Repo layout:
- README.md
- RUNBOOK.md
- SECURITY.md
- ETHICS.md
- LICENSE
- CODEOWNERS
- CODE_OF_CONDUCT.md
- CONTRIBUTING.md
- .pre-commit-config.yaml
- .markdownlint.jsonc
- .markdownlint-cli2.jsonc
- .editorconfig
- docs/
- contracts/
- adr/
- assets/
- qc_examples.sql
This is a documentation-only case study designed for LinkedIn, GitHub, and portfolio review.