ML-Based Risk Scoring – Tier-1 UK Retail Bank (GCP + BigQuery ML)

Sanitized case study — ML-based Risk Scoring for a Tier-1 UK Retail Bank on GCP
(Streaming Pipeline + Batch Pipeline using Pub/Sub · Dataflow · BigQuery · BigQuery ML · Cloud Composer · GCS).
Patterns only; no client code or client data.


🔍 Quick Facts

  • Domain: Retail Banking · BFSI · Fraud & Credit Risk · Audit/Compliance
  • Pipelines: Streaming Pipeline (ETL) + Batch Pipeline (ELT) on GCP
  • Stack: Cloud Pub/Sub, Dataflow (Apache Beam – Python), BigQuery, BigQuery ML, Cloud Composer, GCS, Power BI / Looker Studio
  • Throughput (simulated):
    • ~50–100 transactions per second (steady)
    • Up to ~5–10 million transactions per day
  • SLOs (simulated):
    • p95 end-to-end risk-score latency: < 90 seconds from transaction to score
    • Data Quality (DQ) pass rate: ≥ 95% for reportable features and risk scores
    • Streaming Pipeline availability: ≥ 99.5%
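
As an illustration of how these SLOs could be measured, a minimal BigQuery SQL sketch; the timestamp columns (event_ts, scored_ts) and the bq_audit.dq_runs table are illustrative assumptions, not part of the contracts in this repo.

```sql
-- Hypothetical SLO checks; column and table names are illustrative.

-- p95 end-to-end latency (transaction event to risk score), last hour.
SELECT
  APPROX_QUANTILES(TIMESTAMP_DIFF(scored_ts, event_ts, SECOND), 100)[OFFSET(95)]
    AS p95_latency_seconds
FROM bq_scores.transaction_risk_scores
WHERE scored_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);

-- DQ pass rate across today's rule executions.
SELECT
  SAFE_DIVIDE(SUM(rules_passed), SUM(rules_passed) + SUM(rules_failed))
    AS dq_pass_rate
FROM bq_audit.dq_runs
WHERE run_date = CURRENT_DATE();
```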

1. What this project is about

This project shows how a Tier-1 UK Retail Bank could implement an ML-Based Risk Scoring platform on GCP using a mix of:

  • Streaming Pipeline (ETL Pipeline) for near-real-time ingestion and scoring of transactions
  • Batch Pipeline (ELT Pipeline) for daily aggregates, model training, and re-scoring

The goal is to:

  • Continuously ingest card & account transactions + customer behaviour events
  • Build/maintain feature tables in BigQuery
  • Train BigQuery ML models for fraud risk / credit risk
  • Generate risk scores that are auditable, governed, and easy to consume by downstream systems and dashboards

The repo is docs-only (no client data, no production code).
It focuses on architecture, contracts, DQ, SLOs, ML governance, and operational patterns.


2. Inputs and outputs

2.1 Inputs (simulated)

  1. Transactional events

    • Card payments, ATM withdrawals, online banking transactions
    • Ingested via Cloud Pub/Sub – topic: transactions.realtime
    • Payload schema: contracts/transactions.schema.json
  2. Customer & account attributes

    • Static/dimensional data (KYC, limits, risk bands)
    • Loaded by the Batch Pipeline into BigQuery staging tables
    • Schema: contracts/customers.schema.json
  3. Behavioural / device events (optional)

    • Login attempts, device fingerprints, channel usage
    • Either ingested through a separate Pub/Sub topic or batch tables
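
To make the landing shape concrete, a minimal sketch of the raw table the Streaming Pipeline could write to. Field names here are illustrative; the authoritative contract is contracts/transactions.schema.json.

```sql
-- Hypothetical landing table for transactions.realtime events.
-- The JSON contract in contracts/ is authoritative; names are illustrative.
CREATE TABLE IF NOT EXISTS bq_raw.raw_transactions (
  transaction_id STRING NOT NULL,
  customer_id    STRING,
  account_id     STRING,
  channel        STRING,          -- e.g. CARD, ATM, ONLINE
  amount         NUMERIC,
  currency       STRING,
  event_ts       TIMESTAMP NOT NULL,
  event_date     DATE,            -- derived from event_ts for partitioning
  ingest_ts      TIMESTAMP        -- set by the pipeline at insert time
)
PARTITION BY event_date
CLUSTER BY customer_id, account_id;
```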

2.2 Outputs

  1. Feature tables (BigQuery)

    • Streaming + Batch ETL/ELT pipelines create curated feature tables:
      • bq_feats.transaction_features
      • bq_feats.customer_features
    • Partitioned by event_date, clustered by customer_id / account_id (DDL sketch after this list)
  2. Risk score tables (BigQuery ML predictions)

    • bq_scores.transaction_risk_scores
    • Columns: transaction_id, customer_id, model_version, risk_score, risk_band, decision_flags, metadata
    • CMEK-encrypted, with row-level security (RLS) per team (policy sketch after this list)
  3. Aggregated risk views for dashboards

    • bq_marts.daily_risk_summary
    • Used by Power BI / Looker Studio for operational risk monitoring
  4. Audit & DQ evidence

    • DQ run results with run_id, rules_passed/failed, and DQ score
    • DLQ tables/topics for rejected messages with replay capability
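
To make items 1–3 concrete, a minimal sketch of the feature table DDL, the dashboard mart, and an RLS policy. The engineered feature columns, the scored_ts column, the grantee, and the risk-band values are illustrative assumptions.

```sql
-- Hypothetical DDL for the curated transaction feature table.
-- Keys follow section 2.2; the engineered columns are illustrative.
CREATE TABLE IF NOT EXISTS bq_feats.transaction_features (
  transaction_id   STRING NOT NULL,
  customer_id      STRING NOT NULL,
  account_id       STRING,
  event_date       DATE   NOT NULL,
  amount_zscore_7d FLOAT64,  -- deviation from the customer's 7-day norm
  txn_count_24h    INT64,    -- rolling 24-hour transaction count
  new_device_flag  BOOL      -- first time this device is seen
)
PARTITION BY event_date
CLUSTER BY customer_id, account_id;

-- Hypothetical mart view behind the dashboards.
CREATE OR REPLACE VIEW bq_marts.daily_risk_summary AS
SELECT
  DATE(scored_ts) AS score_date,   -- scored_ts is an assumed column
  risk_band,
  COUNT(*)        AS scored_txns,
  AVG(risk_score) AS avg_risk_score
FROM bq_scores.transaction_risk_scores
GROUP BY score_date, risk_band;

-- Hypothetical row-level security on the scores table: restrict a
-- fraud-ops group to high-risk rows only.
CREATE ROW ACCESS POLICY fraud_ops_high_risk
ON bq_scores.transaction_risk_scores
GRANT TO ('group:fraud-ops@example.com')
FILTER USING (risk_band IN ('HIGH', 'CRITICAL'));
```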

3. High-level business logic (simplified)

  1. Ingest every transaction in near real time through a Streaming Pipeline (ETL).
  2. Enrich transaction events with customer/account attributes and historical aggregates.
  3. Engineer features (per-customer, per-card, per-device) in BigQuery.
  4. Train ML models (fraud/credit risk) using BigQuery ML on daily snapshots via the Batch Pipeline (ELT); a BigQuery ML sketch follows this list.
  5. Score new transactions:
    • Streaming path: low-latency scoring using latest approved model
    • Batch path: end-of-day/offline re-scoring or challenger models
  6. Serve risk scores to downstream systems (decision engines, case management tools, dashboards).
  7. Govern everything with CMEK, VPC-SC, IAM/RBAC, Policy Tags, RLS/CLS, and full lineage.
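
The training and batch-scoring steps could look like the following minimal BigQuery ML sketch. The model type (a logistic-regression baseline), the is_fraud label column, and the 90-day training window are assumptions for illustration, not a prescribed modelling choice.

```sql
-- Hypothetical BigQuery ML training step (logistic regression baseline).
CREATE OR REPLACE MODEL bq_ml.fraud_risk_v1
OPTIONS (
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['is_fraud']
) AS
SELECT
  amount_zscore_7d,
  txn_count_24h,
  new_device_flag,
  is_fraud  -- assumed label column maintained from confirmed fraud outcomes
FROM bq_feats.transaction_features
WHERE event_date BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
                     AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);

-- Hypothetical batch re-scoring into the scores table.
INSERT INTO bq_scores.transaction_risk_scores
  (transaction_id, customer_id, model_version, risk_score)
SELECT
  transaction_id,
  customer_id,
  'fraud_risk_v1' AS model_version,
  (SELECT prob FROM UNNEST(predicted_is_fraud_probs) WHERE label = 1)
    AS risk_score
FROM ML.PREDICT(
  MODEL bq_ml.fraud_risk_v1,
  (SELECT * FROM bq_feats.transaction_features
   WHERE event_date = CURRENT_DATE()));
```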

4. Architecture diagram (L2 – GCP components)

The final PNG is committed as assets/architecture_l2.png; the Mermaid source is kept below for readability.

```mermaid
flowchart LR
    subgraph VPC_SC[VPC-SC Protected Boundary]
        TX["Client Channels\n(Card, ATM, Online)"]
        PUB["Cloud Pub/Sub\ntransactions.realtime"]
        DF_STREAM["Dataflow\nStreaming Pipeline (ETL)"]
        BQ_RAW["BigQuery\nraw_transactions"]
        BQ_FEAT["BigQuery\nfeature tables"]
        BQ_ML["BigQuery ML\nmodels"]
        DF_BATCH["Dataflow\nBatch Pipeline (ELT)"]
        COMP["Cloud Composer\n(Orchestration)"]
        GCS["GCS\nModel & DQ Artifacts"]
        BQ_SCORES["BigQuery\nrisk_scores tables"]
    end

    TX --> PUB
    PUB --> DF_STREAM
    DF_STREAM --> BQ_RAW
    DF_STREAM --> BQ_FEAT

    COMP --> DF_BATCH
    DF_BATCH --> BQ_FEAT
    DF_BATCH --> BQ_ML
    BQ_ML --> BQ_SCORES

    BQ_SCORES -->|BI / Ops| BI[("Power BI / Looker Studio")]
    BQ_SCORES --> DOWNSTREAM[("Downstream\nRisk Engines")]

    BQ_FEAT --> GCS
    BQ_ML --> GCS
```

5. Dataflow / lifecycle diagram – from transaction to ML

```mermaid
sequenceDiagram
    participant Channel as Channel (POS/ATM/Online)
    participant PubSub as Cloud Pub/Sub
    participant DFStream as Dataflow<br/>Streaming Pipeline (ETL)
    participant BQRaw as BigQuery<br/>raw_transactions
    participant BQFeat as BigQuery<br/>feature tables
    participant Composer as Cloud Composer
    participant BQML as BigQuery ML
    participant BQScores as BigQuery<br/>risk_scores
    participant BI as Dashboards / Risk Ops

    Channel->>PubSub: Publish transaction event
    PubSub->>DFStream: Push message
    DFStream->>DFStream: Validate + ETL transforms<br/>(schema, enrichment, DQ checks)
    DFStream->>BQRaw: Insert raw record (partitioned)
    DFStream->>BQFeat: Update streaming feature tables
    DFStream-->>BQScores: (optional) Low-latency scoring call

    Composer->>BQRaw: Nightly Batch Pipeline (ELT) query
    Composer->>BQFeat: Build training features
    Composer->>BQML: Train / retrain model<br/>(tag model_version)
    Composer->>BQScores: Batch scoring jobs

    BQScores-->>BI: Risk dashboards, alerts, queues
```
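
The nightly ELT step that Composer triggers against the raw table could look like the following minimal sketch: a MERGE that upserts yesterday's per-customer aggregates into bq_feats.customer_features. All column names and aggregates are illustrative assumptions.

```sql
-- Hypothetical nightly ELT step run by Cloud Composer.
DECLARE run_date DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);

MERGE bq_feats.customer_features AS tgt
USING (
  SELECT
    customer_id,
    run_date                AS event_date,
    COUNT(*)                AS txn_count_1d,
    SUM(amount)             AS txn_amount_1d,
    COUNT(DISTINCT channel) AS channels_used_1d
  FROM bq_raw.raw_transactions
  WHERE event_date = run_date
  GROUP BY customer_id
) AS src
ON tgt.customer_id = src.customer_id AND tgt.event_date = src.event_date
WHEN MATCHED THEN UPDATE SET
  txn_count_1d     = src.txn_count_1d,
  txn_amount_1d    = src.txn_amount_1d,
  channels_used_1d = src.channels_used_1d
WHEN NOT MATCHED THEN INSERT
  (customer_id, event_date, txn_count_1d, txn_amount_1d, channels_used_1d)
VALUES
  (src.customer_id, src.event_date, src.txn_count_1d, src.txn_amount_1d,
   src.channels_used_1d);
```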

6. Docs index

Detailed documentation lives under /docs:

  • docs/01-context-and-usecase.md
  • docs/02-architecture-overview.md
  • docs/03-streaming-pipeline-event-flow.md
  • docs/04-batch-pipeline-elt-and-ml-training.md
  • docs/05-data-models-and-feature-store.md
  • docs/06-data-quality-and-risk-metrics.md
  • docs/07-security-and-governance.md
  • docs/08-lineage-and-auditability.md
  • docs/09-slos-observability-and-dashboards.md
  • docs/10-cost-and-scaling-guardrails.md
  • docs/11-ml-governance-and-model-risk.md
  • docs/12-roadmap-and-future-work.md

7. Repository map

- README.md
- RUNBOOK.md
- SECURITY.md
- ETHICS.md
- LICENSE
- CODEOWNERS
- CODE_OF_CONDUCT.md
- CONTRIBUTING.md
- .pre-commit-config.yaml
- .markdownlint.jsonc
- .markdownlint-cli2.jsonc
- .editorconfig
- docs/
- contracts/
- adr/
- assets/
- qc_examples.sql

8. Status

This is a documentation-only case study designed for LinkedIn, GitHub, and portfolio review.