Commit f8a68ac

Initial commit: added full ML Risk Scoring repo structure
0 parents  commit f8a68ac

31 files changed

Lines changed: 810 additions & 0 deletions

.editorconfig

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
root = true

[*]
charset = utf-8
end_of_line = lf
insert_final_newline = true
indent_style = space
indent_size = 2

.markdownlint-cli2.jsonc

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
{
  "default": true
}

.markdownlint.jsonc

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
{
  "default": true
}

.pre-commit-config.yaml

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
repos:
  - repo: https://github.com/markdownlint/markdownlint
    rev: v0.13.0
    hooks:
      - id: markdownlint

CODEOWNERS

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,4 @@
# CODEOWNERS – for illustration only

* @your-github-handle
docs/* @your-github-handle

CODE_OF_CONDUCT.md

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,6 @@
# CODE OF CONDUCT

This is a small, sanitized portfolio repository.

- Be respectful when opening issues or discussing ideas.
- No harassment, hate speech, or abusive behaviour.

CONTRIBUTING.md

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,9 @@
# CONTRIBUTING

This repository is primarily a **docs-only case study**.

If you want to extend it:

1. Open an issue describing the improvement.
2. Follow the existing docs structure in the `docs/` folder.
3. Keep all examples **fully sanitized** – no real client data or secrets.

ETHICS.md

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,16 @@
# ETHICS.md – Sanitization & Responsible Use

This repository is a **sanitized case study**.

- No real client code or client data is included.
- Bank names, volumes, and SLOs are illustrative.
- JSON schemas and table names are synthetic.

The patterns shown here – Streaming Pipelines, Batch Pipelines, ETL/ELT, and ML governance –
are intended for **learning, interview discussions, and portfolio demonstration only**.

When using similar patterns in a real environment:

- Follow your organisation's security, privacy, and model risk policies.
- Do not expose PII or confidential business metrics.
- Engage risk, legal, and compliance teams before deploying ML-based risk scoring to production.

LICENSE

Lines changed: 21 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

Lines changed: 213 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,213 @@
# ML-Based Risk Scoring – Tier-1 UK Retail Bank (GCP + BigQuery ML)

> Sanitized case study: ML-based Risk Scoring for a Tier-1 UK Retail Bank on GCP
> (Streaming Pipeline + Batch Pipeline using Pub/Sub · Dataflow · BigQuery · BigQuery ML · Cloud Composer · GCS).
> Patterns only; no client code or client data.

---

## 🔍 Quick Facts

- **Domain:** Retail Banking · BFSI · Fraud & Credit Risk · Audit/Compliance
- **Pipelines:** Streaming Pipeline (ETL) + Batch Pipeline (ELT) on GCP
- **Stack:** Cloud Pub/Sub, Dataflow (Apache Beam – Python), BigQuery, BigQuery ML, Cloud Composer, GCS, Power BI / Looker Studio
- **Throughput (simulated):**
  - ~50–100 transactions per second (steady)
  - Up to ~5–10 million transactions per day
- **SLOs (simulated):**
  - p95 end-to-end risk-score latency: **< 90 seconds** from transaction to score
  - Data Quality (DQ) pass rate: **≥ 95%** for reportable features and risk scores
  - Streaming Pipeline availability: **≥ 99.5%**

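As a quick sanity check on the simulated figures above, the sketch below converts a steady transactions-per-second rate into a daily volume and evaluates a p95 latency against the 90-second target. The function names and the sample latencies are illustrative assumptions, and nearest-rank is only one of several common percentile definitions.

```python
from __future__ import annotations

import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def daily_volume(tps: float) -> int:
    """Transactions per day at a constant rate of `tps`."""
    return int(tps * SECONDS_PER_DAY)


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


# 50 and 100 TPS give 4,320,000 and 8,640,000 transactions per day,
# which is in line with the rounded "~5-10 million" figure.
low, high = daily_volume(50), daily_volume(100)

# Hypothetical end-to-end latencies in seconds, made up for illustration.
latencies_s = [12, 18, 25, 30, 41, 44, 52, 60, 75, 88]
meets_latency_slo = p95(latencies_s) < 90
```
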
---

## 1. What this project is about

This project shows how a **Tier-1 UK Retail Bank** could implement an **ML-Based Risk Scoring platform on GCP** using a mix of:

- **Streaming Pipeline (ETL Pipeline)** for near-real-time ingestion and scoring of transactions
- **Batch Pipeline (ELT Pipeline)** for daily aggregates, model training, and re-scoring

The goal is to:

- Continuously ingest **card & account transactions + customer behaviour events**
- Build/maintain **feature tables** in BigQuery
- Train **BigQuery ML** models for fraud risk / credit risk
- Generate **risk scores** that are **auditable**, **governed**, and easy to consume by downstream systems and dashboards

The repo is **docs-only** (no client data, no production code).
It focuses on architecture, contracts, DQ, SLOs, ML governance, and operational patterns.

---

## 2. Inputs and outputs

### 2.1 Inputs (simulated)

1. **Transactional events**
   - Card payments, ATM withdrawals, online banking transactions
   - Ingested via **Cloud Pub/Sub** – topic: `transactions.realtime`
   - Payload schema: `contracts/transactions.schema.json`

2. **Customer & account attributes**
   - Static/dimensional data (KYC, limits, risk bands)
   - Landed as **Batch Pipeline** loads into BigQuery staging tables
   - Schema: `contracts/customers.schema.json`

3. **Behavioural / device events (optional)**
   - Login attempts, device fingerprints, channel usage
   - Ingested either through a separate Pub/Sub topic or as batch tables

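To make the idea of a payload contract concrete, here is a minimal validation sketch for a transaction message. The contents of `contracts/transactions.schema.json` are not shown in this README, so the required fields and types below are hypothetical placeholders (only `transaction_id` and `customer_id` appear elsewhere in this document), and the check is deliberately simpler than full JSON Schema validation.

```python
from __future__ import annotations

import json

# Hypothetical required fields and types, for illustration only.
# The real contract would come from contracts/transactions.schema.json.
REQUIRED_FIELDS = {
    "transaction_id": str,
    "customer_id": str,
    "amount": (int, float),
    "event_ts": str,
}


def validate_transaction(raw: bytes) -> tuple[dict | None, list[str]]:
    """Return (record, errors); record is None when validation fails."""
    try:
        record = json.loads(raw)
    except ValueError as exc:  # covers malformed JSON and bad UTF-8
        return None, [f"invalid JSON: {exc}"]
    if not isinstance(record, dict):
        return None, ["payload is not a JSON object"]

    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        # bool is a subclass of int, so reject it explicitly
        elif isinstance(record[name], bool) or not isinstance(record[name], expected):
            errors.append(f"wrong type for field: {name}")
    return (record if not errors else None), errors
```

A message that fails this check would be a candidate for the DLQ rather than the raw table.
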
### 2.2 Outputs

1. **Feature tables (BigQuery)**
   - Streaming + Batch ETL/ELT pipelines create curated **feature tables**:
     - `bq_feats.transaction_features`
     - `bq_feats.customer_features`
   - Partitioned by **event_date**, clustered by **customer_id / account_id**

2. **Risk score tables (BigQuery ML predictions)**
   - `bq_scores.transaction_risk_scores`
   - Columns: transaction_id, customer_id, model_version, risk_score, risk_band, decision_flags, metadata
   - CMEK-encrypted, with row-level security (RLS) policies per team

3. **Aggregated risk views for dashboards**
   - `bq_marts.daily_risk_summary`
   - Used by **Power BI / Looker Studio** for operational risk monitoring

4. **Audit & DQ evidence**
   - DQ run results with **run_id**, **rules_passed/failed**, and **DQ score**
   - DLQ tables/topics for rejected messages with replay capability

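The DQ evidence described above could be captured in a small record like the one sketched here. Defining the DQ score as passed checks divided by total checks is an assumption, as are the two example rules; the 0.95 threshold mirrors the simulated DQ pass-rate SLO.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field

DQ_PASS_THRESHOLD = 0.95  # mirrors the simulated ">= 95%" DQ SLO


@dataclass
class DQRunResult:
    rules_passed: int
    rules_failed: int
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    @property
    def dq_score(self) -> float:
        """Share of rule checks that passed (assumed definition)."""
        total = self.rules_passed + self.rules_failed
        return self.rules_passed / total if total else 0.0

    @property
    def meets_slo(self) -> bool:
        return self.dq_score >= DQ_PASS_THRESHOLD


def run_rules(rows, rules) -> DQRunResult:
    """Apply every rule to every row and count the outcomes."""
    outcomes = [bool(rule(row)) for row in rows for rule in rules]
    passed = sum(outcomes)
    return DQRunResult(rules_passed=passed, rules_failed=len(outcomes) - passed)


# Hypothetical rules and rows, for illustration only.
rules = [
    lambda r: bool(r.get("transaction_id")),
    lambda r: 0.0 <= r.get("risk_score", -1.0) <= 1.0,
]
rows = [
    {"transaction_id": "t1", "risk_score": 0.12},
    {"transaction_id": "t2", "risk_score": 1.70},  # out of range
]
result = run_rules(rows, rules)  # 3 checks pass, 1 fails
```
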
---

## 3. High-level business logic (simplified)

1. **Ingest** every transaction in near real time through a **Streaming Pipeline (ETL)**.
2. **Enrich** transaction events with customer/account attributes and historical aggregates.
3. **Engineer features** (per-customer, per-card, per-device) in BigQuery.
4. **Train** ML models (fraud/credit risk) using **BigQuery ML** on daily snapshots via the **Batch Pipeline (ELT)**.
5. **Score** new transactions:
   - Streaming path: low-latency scoring using the latest approved model
   - Batch path: end-of-day/offline re-scoring or challenger models
6. **Serve** risk scores to downstream systems (decision engines, case management tools, dashboards).
7. **Govern** everything with **CMEK, VPC-SC, IAM/RBAC, Policy Tags, RLS/CLS, and full lineage**.

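Steps 4 and 5 above could be expressed as BigQuery ML statements along these lines. This is a minimal sketch: the model name `bq_models.fraud_lr_v1`, the label column `is_fraud`, and the choice of logistic regression are assumptions for illustration, and the helpers only build SQL text; they do not call BigQuery.

```python
import re

# Accept only simple `dataset.table` and column identifiers in generated SQL.
_TABLE = re.compile(r"^[A-Za-z_]\w*\.[A-Za-z_]\w*$", re.ASCII)
_COLUMN = re.compile(r"^[A-Za-z_]\w*$", re.ASCII)


def _checked(pattern: "re.Pattern[str]", name: str) -> str:
    if not pattern.match(name):
        raise ValueError(f"unexpected identifier: {name!r}")
    return name


def training_sql(model: str, features: str, label: str = "is_fraud") -> str:
    """Build a CREATE MODEL statement for a logistic regression classifier."""
    # A real query would select only feature columns plus the label.
    return (
        f"CREATE OR REPLACE MODEL `{_checked(_TABLE, model)}`\n"
        f"OPTIONS (model_type = 'LOGISTIC_REG', "
        f"input_label_cols = ['{_checked(_COLUMN, label)}']) AS\n"
        f"SELECT * FROM `{_checked(_TABLE, features)}`"
    )


def scoring_sql(model: str, features: str) -> str:
    """Build an ML.PREDICT query that scores rows from a feature table."""
    return (
        f"SELECT * FROM ML.PREDICT(MODEL `{_checked(_TABLE, model)}`,\n"
        f"  (SELECT * FROM `{_checked(_TABLE, features)}`))"
    )


train_stmt = training_sql("bq_models.fraud_lr_v1", "bq_feats.transaction_features")
score_stmt = scoring_sql("bq_models.fraud_lr_v1", "bq_feats.transaction_features")
```

In a batch run, an orchestrator task would submit `train_stmt` and then `score_stmt`, writing the predictions to the risk score table together with the `model_version` that produced them.
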
---

## 4. Architecture diagram (L2 – GCP components)

> Final PNG committed as `assets/architecture_l2.png`.
> Mermaid version kept here for readability.

```mermaid
flowchart LR
  subgraph VPC_SC["VPC-SC Protected Boundary"]
    TX["Client Channels<br/>(Card, ATM, Online)"]
    PUB["Cloud Pub/Sub<br/>transactions.realtime"]
    DF_STREAM["Dataflow<br/>Streaming Pipeline (ETL)"]
    BQ_RAW["BigQuery<br/>raw_transactions"]
    BQ_FEAT["BigQuery<br/>feature tables"]
    BQ_ML["BigQuery ML<br/>models"]
    DF_BATCH["Dataflow<br/>Batch Pipeline (ELT)"]
    COMP["Cloud Composer<br/>(Orchestration)"]
    GCS["GCS<br/>Model & DQ Artifacts"]
    BQ_SCORES["BigQuery<br/>risk_scores tables"]
  end

  TX --> PUB
  PUB --> DF_STREAM
  DF_STREAM --> BQ_RAW
  DF_STREAM --> BQ_FEAT

  COMP --> DF_BATCH
  DF_BATCH --> BQ_FEAT
  DF_BATCH --> BQ_ML
  BQ_ML --> BQ_SCORES

  BQ_SCORES -->|BI / Ops| BI[("Power BI / Looker Studio")]
  BQ_SCORES --> DOWNSTREAM[("Downstream<br/>Risk Engines")]

  BQ_FEAT --> GCS
  BQ_ML --> GCS
```

---

## 5. Data flow / lifecycle diagram – from transaction to ML

```mermaid
sequenceDiagram
  participant Channel as Channel (POS/ATM/Online)
  participant PubSub as Cloud Pub/Sub
  participant DFStream as Dataflow<br/>Streaming Pipeline (ETL)
  participant BQRaw as BigQuery<br/>raw_transactions
  participant BQFeat as BigQuery<br/>feature tables
  participant Composer as Cloud Composer
  participant BQML as BigQuery ML
  participant BQScores as BigQuery<br/>risk_scores
  participant BI as Dashboards / Risk Ops

  Channel->>PubSub: Publish transaction event
  PubSub->>DFStream: Deliver message (pull subscription)
  DFStream->>DFStream: Validate + ETL transforms<br/>(schema, enrichment, DQ checks)
  DFStream->>BQRaw: Insert raw record (partitioned)
  DFStream->>BQFeat: Update streaming feature tables
  DFStream-->>BQScores: (optional) Low-latency scoring call

  Composer->>BQRaw: Nightly Batch Pipeline (ELT) query
  Composer->>BQFeat: Build training features
  Composer->>BQML: Train / retrain model<br/>(tag model_version)
  Composer->>BQScores: Batch scoring jobs

  BQScores-->>BI: Risk dashboards, alerts, queues
```

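The validate-then-insert step in the sequence above, together with dead-letter routing for rejected messages, can be sketched in plain Python. This is an illustration rather than Apache Beam code: the `event_ts` field name, the choice of UTC for the partition date, and the shape of the dead-letter entries are all assumptions.

```python
from __future__ import annotations

import json
from datetime import datetime, timezone


def event_date(event_ts: str) -> str:
    """Derive the UTC partition date (YYYY-MM-DD) from an ISO-8601 timestamp."""
    ts = datetime.fromisoformat(event_ts.replace("Z", "+00:00"))
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # treat naive timestamps as UTC
    return ts.astimezone(timezone.utc).date().isoformat()


def route(messages: list[bytes]) -> tuple[list[dict], list[dict]]:
    """Split messages into rows for the raw table and dead-letter entries."""
    raw_rows, dead_letters = [], []
    for payload in messages:
        try:
            record = json.loads(payload)
            record["event_date"] = event_date(record["event_ts"])
            raw_rows.append(record)
        except (ValueError, KeyError, TypeError, AttributeError) as exc:
            # Keep the original payload so the message can be replayed later.
            dead_letters.append({
                "payload": payload.decode("utf-8", "replace"),
                "error": type(exc).__name__,
            })
    return raw_rows, dead_letters


# Hypothetical messages: one valid, one missing event_ts, one not JSON.
messages = [
    b'{"transaction_id": "t1", "event_ts": "2025-01-01T23:30:00-02:00"}',
    b'{"transaction_id": "t2"}',
    b"oops",
]
raw_rows, dead_letters = route(messages)
```
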
---

## 6. Docs index

Detailed documentation lives under `/docs`:

- `docs/01-context-and-usecase.md`
- `docs/02-architecture-overview.md`
- `docs/03-streaming-pipeline-event-flow.md`
- `docs/04-batch-pipeline-elt-and-ml-training.md`
- `docs/05-data-models-and-feature-store.md`
- `docs/06-data-quality-and-risk-metrics.md`
- `docs/07-security-and-governance.md`
- `docs/08-lineage-and-auditability.md`
- `docs/09-slos-observability-and-dashboards.md`
- `docs/10-cost-and-scaling-guardrails.md`
- `docs/11-ml-governance-and-model-risk.md`
- `docs/12-roadmap-and-future-work.md`

---

## 7. Repository map

```text
- README.md
- RUNBOOK.md
- SECURITY.md
- ETHICS.md
- LICENSE
- CODEOWNERS
- CODE_OF_CONDUCT.md
- CONTRIBUTING.md
- .pre-commit-config.yaml
- .markdownlint.jsonc
- .markdownlint-cli2.jsonc
- .editorconfig
- docs/
- contracts/
- adr/
- assets/
- qc_examples.sql
```

---

## 8. Status

This is a **documentation-only** case study designed for LinkedIn, GitHub, and portfolio review.
