ML-Based Risk Scoring – Tier-1 UK Retail Bank (GCP + BigQuery ML)

Sanitized case study — ML-based Risk Scoring for a Tier-1 UK Retail Bank on GCP
(Streaming Pipeline + Batch Pipeline using Pub/Sub · Dataflow · BigQuery · BigQuery ML · Cloud Composer · GCS).
Patterns only; no client code or client data.


🔍 Quick Facts

  • Domain: Retail Banking · BFSI · Fraud & Credit Risk · Audit/Compliance
  • Pipelines: Streaming Pipeline (ETL) + Batch Pipeline (ELT) on GCP
  • Stack: Cloud Pub/Sub, Dataflow (Apache Beam – Python), BigQuery, BigQuery ML, Cloud Composer, GCS, Power BI / Looker Studio
  • Throughput (simulated):
    • ~50–100 transactions per second (steady)
    • Up to ~5–10 million transactions per day
  • SLOs (simulated):
    • p95 end-to-end risk-score latency: < 90 seconds from transaction to score
    • Data Quality (DQ) pass rate: ≥ 95% for reportable features and risk scores
    • Streaming Pipeline availability: ≥ 99.5%
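
As an illustration of how these SLOs could be measured, a minimal BigQuery SQL sketch; the timestamp columns (event_ts, scored_ts) and the bq_audit.dq_runs table are illustrative assumptions, not part of the contracts in this repo.

```sql
-- Hypothetical SLO checks; column and table names are illustrative.

-- p95 end-to-end latency (transaction event to risk score), last hour.
SELECT
  APPROX_QUANTILES(TIMESTAMP_DIFF(scored_ts, event_ts, SECOND), 100)[OFFSET(95)]
    AS p95_latency_seconds
FROM bq_scores.transaction_risk_scores
WHERE scored_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);

-- DQ pass rate across today's rule executions.
SELECT
  SAFE_DIVIDE(SUM(rules_passed), SUM(rules_passed) + SUM(rules_failed))
    AS dq_pass_rate
FROM bq_audit.dq_runs
WHERE run_date = CURRENT_DATE();
```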

1. What this project is about

This project shows how a Tier-1 UK Retail Bank could implement an ML-Based Risk Scoring platform on GCP using a mix of:

  • Streaming Pipeline (ETL Pipeline) for near-real-time ingestion and scoring of transactions
  • Batch Pipeline (ELT Pipeline) for daily aggregates, model training, and re-scoring

The goal is to:

  • Continuously ingest card & account transactions + customer behaviour events
  • Build/maintain feature tables in BigQuery
  • Train BigQuery ML models for fraud risk / credit risk
  • Generate risk scores that are auditable, governed, and easy to consume by downstream systems and dashboards

The repo is docs-only (no client data, no production code).
It focuses on architecture, contracts, DQ, SLOs, ML governance, and operational patterns.


2. Inputs and outputs

2.1 Inputs (simulated)

  1. Transactional events

    • Card payments, ATM withdrawals, online banking transactions
    • Ingested via Cloud Pub/Sub – topic: transactions.realtime
    • Payload schema: contracts/transactions.schema.json
  2. Customer & account attributes

    • Static/dimensional data (KYC, limits, risk bands)
    • Loaded by the Batch Pipeline into BigQuery staging tables
    • Schema: contracts/customers.schema.json
  3. Behavioural / device events (optional)

    • Login attempts, device fingerprints, channel usage
    • Either ingested through a separate Pub/Sub topic or batch tables
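
To make the landing shape concrete, a minimal sketch of the raw table the Streaming Pipeline could write to. Field names here are illustrative; the authoritative contract is contracts/transactions.schema.json.

```sql
-- Hypothetical landing table for transactions.realtime events.
-- The JSON contract in contracts/ is authoritative; names are illustrative.
CREATE TABLE IF NOT EXISTS bq_raw.raw_transactions (
  transaction_id STRING NOT NULL,
  customer_id    STRING,
  account_id     STRING,
  channel        STRING,          -- e.g. CARD, ATM, ONLINE
  amount         NUMERIC,
  currency       STRING,
  event_ts       TIMESTAMP NOT NULL,
  event_date     DATE,            -- derived from event_ts for partitioning
  ingest_ts      TIMESTAMP        -- set by the pipeline at insert time
)
PARTITION BY event_date
CLUSTER BY customer_id, account_id;
```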

2.2 Outputs

  1. Feature tables (BigQuery)

    • Streaming + Batch ETL/ELT pipelines create curated feature tables:
      • bq_feats.transaction_features
      • bq_feats.customer_features
    • Partitioned by event_date, clustered by customer_id / account_id (DDL sketch after this list)
  2. Risk score tables (BigQuery ML predictions)

    • bq_scores.transaction_risk_scores
    • Columns: transaction_id, customer_id, model_version, risk_score, risk_band, decision_flags, metadata
    • CMEK-encrypted, with row-level security (RLS) per team (policy sketch after this list)
  3. Aggregated risk views for dashboards

    • bq_marts.daily_risk_summary
    • Used by Power BI / Looker Studio for operational risk monitoring
  4. Audit & DQ evidence

    • DQ run results with run_id, rules_passed/failed, and DQ score
    • DLQ tables/topics for rejected messages with replay capability
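
To make items 1–3 concrete, a minimal sketch of the feature table DDL, the dashboard mart, and an RLS policy. The engineered feature columns, the scored_ts column, the grantee, and the risk-band values are illustrative assumptions.

```sql
-- Hypothetical DDL for the curated transaction feature table.
-- Keys follow section 2.2; the engineered columns are illustrative.
CREATE TABLE IF NOT EXISTS bq_feats.transaction_features (
  transaction_id   STRING NOT NULL,
  customer_id      STRING NOT NULL,
  account_id       STRING,
  event_date       DATE   NOT NULL,
  amount_zscore_7d FLOAT64,  -- deviation from the customer's 7-day norm
  txn_count_24h    INT64,    -- rolling 24-hour transaction count
  new_device_flag  BOOL      -- first time this device is seen
)
PARTITION BY event_date
CLUSTER BY customer_id, account_id;

-- Hypothetical mart view behind the dashboards.
CREATE OR REPLACE VIEW bq_marts.daily_risk_summary AS
SELECT
  DATE(scored_ts) AS score_date,   -- scored_ts is an assumed column
  risk_band,
  COUNT(*)        AS scored_txns,
  AVG(risk_score) AS avg_risk_score
FROM bq_scores.transaction_risk_scores
GROUP BY score_date, risk_band;

-- Hypothetical row-level security on the scores table: restrict a
-- fraud-ops group to high-risk rows only.
CREATE ROW ACCESS POLICY fraud_ops_high_risk
ON bq_scores.transaction_risk_scores
GRANT TO ('group:fraud-ops@example.com')
FILTER USING (risk_band IN ('HIGH', 'CRITICAL'));
```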

3. High-level business logic (simplified)

  1. Ingest every transaction in near real time through a Streaming Pipeline (ETL).
  2. Enrich transaction events with customer/account attributes and historical aggregates.
  3. Engineer features (per-customer, per-card, per-device) in BigQuery.
  4. Train ML models (fraud/credit risk) using BigQuery ML on daily snapshots via the Batch Pipeline (ELT); a BigQuery ML sketch follows this list.
  5. Score new transactions:
    • Streaming path: low-latency scoring using latest approved model
    • Batch path: end-of-day/offline re-scoring or challenger models
  6. Serve risk scores to downstream systems (decision engines, case management tools, dashboards).
  7. Govern everything with CMEK, VPC-SC, IAM/RBAC, Policy Tags, RLS/CLS, and full lineage.
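
The training and batch-scoring steps could look like the following minimal BigQuery ML sketch. The model type (a logistic-regression baseline), the is_fraud label column, and the 90-day training window are assumptions for illustration, not a prescribed modelling choice.

```sql
-- Hypothetical BigQuery ML training step (logistic regression baseline).
CREATE OR REPLACE MODEL bq_ml.fraud_risk_v1
OPTIONS (
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['is_fraud']
) AS
SELECT
  amount_zscore_7d,
  txn_count_24h,
  new_device_flag,
  is_fraud  -- assumed label column maintained from confirmed fraud outcomes
FROM bq_feats.transaction_features
WHERE event_date BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
                     AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);

-- Hypothetical batch re-scoring into the scores table.
INSERT INTO bq_scores.transaction_risk_scores
  (transaction_id, customer_id, model_version, risk_score)
SELECT
  transaction_id,
  customer_id,
  'fraud_risk_v1' AS model_version,
  (SELECT prob FROM UNNEST(predicted_is_fraud_probs) WHERE label = 1)
    AS risk_score
FROM ML.PREDICT(
  MODEL bq_ml.fraud_risk_v1,
  (SELECT * FROM bq_feats.transaction_features
   WHERE event_date = CURRENT_DATE()));
```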

4. Architecture diagram (L2 – GCP components)

The final PNG is committed as assets/architecture_l2.png; the Mermaid source is kept below for readability.

```mermaid
flowchart LR
    subgraph VPC_SC[VPC-SC Protected Boundary]
        TX["Client Channels\n(Card, ATM, Online)"]
        PUB["Cloud Pub/Sub\ntransactions.realtime"]
        DF_STREAM["Dataflow\nStreaming Pipeline (ETL)"]
        BQ_RAW["BigQuery\nraw_transactions"]
        BQ_FEAT["BigQuery\nfeature tables"]
        BQ_ML["BigQuery ML\nmodels"]
        DF_BATCH["Dataflow\nBatch Pipeline (ELT)"]
        COMP["Cloud Composer\n(Orchestration)"]
        GCS["GCS\nModel & DQ Artifacts"]
        BQ_SCORES["BigQuery\nrisk_scores tables"]
    end

    TX --> PUB
    PUB --> DF_STREAM
    DF_STREAM --> BQ_RAW
    DF_STREAM --> BQ_FEAT

    COMP --> DF_BATCH
    DF_BATCH --> BQ_FEAT
    DF_BATCH --> BQ_ML
    BQ_ML --> BQ_SCORES

    BQ_SCORES -->|BI / Ops| BI[("Power BI / Looker Studio")]
    BQ_SCORES --> DOWNSTREAM[("Downstream\nRisk Engines")]

    BQ_FEAT --> GCS
    BQ_ML --> GCS
```

5. Dataflow / lifecycle diagram – from transaction to ML

```mermaid
sequenceDiagram
    participant Channel as Channel (POS/ATM/Online)
    participant PubSub as Cloud Pub/Sub
    participant DFStream as Dataflow<br/>Streaming Pipeline (ETL)
    participant BQRaw as BigQuery<br/>raw_transactions
    participant BQFeat as BigQuery<br/>feature tables
    participant Composer as Cloud Composer
    participant BQML as BigQuery ML
    participant BQScores as BigQuery<br/>risk_scores
    participant BI as Dashboards / Risk Ops

    Channel->>PubSub: Publish transaction event
    PubSub->>DFStream: Push message
    DFStream->>DFStream: Validate + ETL transforms<br/>(schema, enrichment, DQ checks)
    DFStream->>BQRaw: Insert raw record (partitioned)
    DFStream->>BQFeat: Update streaming feature tables
    DFStream-->>BQScores: (optional) Low-latency scoring call

    Composer->>BQRaw: Nightly Batch Pipeline (ELT) query
    Composer->>BQFeat: Build training features
    Composer->>BQML: Train / retrain model<br/>(tag model_version)
    Composer->>BQScores: Batch scoring jobs

    BQScores-->>BI: Risk dashboards, alerts, queues
```
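
The nightly ELT step that Composer triggers against the raw table could look like the following minimal sketch: a MERGE that upserts yesterday's per-customer aggregates into bq_feats.customer_features. All column names and aggregates are illustrative assumptions.

```sql
-- Hypothetical nightly ELT step run by Cloud Composer.
DECLARE run_date DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);

MERGE bq_feats.customer_features AS tgt
USING (
  SELECT
    customer_id,
    run_date                AS event_date,
    COUNT(*)                AS txn_count_1d,
    SUM(amount)             AS txn_amount_1d,
    COUNT(DISTINCT channel) AS channels_used_1d
  FROM bq_raw.raw_transactions
  WHERE event_date = run_date
  GROUP BY customer_id
) AS src
ON tgt.customer_id = src.customer_id AND tgt.event_date = src.event_date
WHEN MATCHED THEN UPDATE SET
  txn_count_1d     = src.txn_count_1d,
  txn_amount_1d    = src.txn_amount_1d,
  channels_used_1d = src.channels_used_1d
WHEN NOT MATCHED THEN INSERT
  (customer_id, event_date, txn_count_1d, txn_amount_1d, channels_used_1d)
VALUES
  (src.customer_id, src.event_date, src.txn_count_1d, src.txn_amount_1d,
   src.channels_used_1d);
```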

6. Docs index

Detailed documentation lives under /docs:

  • docs/01-context-and-usecase.md
  • docs/02-architecture-overview.md
  • docs/03-streaming-pipeline-event-flow.md
  • docs/04-batch-pipeline-elt-and-ml-training.md
  • docs/05-data-models-and-feature-store.md
  • docs/06-data-quality-and-risk-metrics.md
  • docs/07-security-and-governance.md
  • docs/08-lineage-and-auditability.md
  • docs/09-slos-observability-and-dashboards.md
  • docs/10-cost-and-scaling-guardrails.md
  • docs/11-ml-governance-and-model-risk.md
  • docs/12-roadmap-and-future-work.md

7. Repository map

- README.md
- RUNBOOK.md
- SECURITY.md
- ETHICS.md
- LICENSE
- CODEOWNERS
- CODE_OF_CONDUCT.md
- CONTRIBUTING.md
- .pre-commit-config.yaml
- .markdownlint.jsonc
- .markdownlint-cli2.jsonc
- .editorconfig
- docs/
- contracts/
- adr/
- assets/
- qc_examples.sql

8. Status

This is a documentation-only case study designed for LinkedIn, GitHub, and portfolio review.