Medallion Architecture implementation (Bronze → Silver → Gold) for the Hackerschool data lakehouse on Microsoft Fabric.
- Overview
- Architecture
- Project Structure
- Getting Started
- Deployment
- Configuration
- Monitoring
- Development
- Contributing
This project implements a production-ready data lakehouse for Hackerschool using Microsoft Fabric's Medallion Architecture pattern. It processes data from Microsoft Dataverse through three quality layers:
- Bronze: Raw data (1:1 copy from Dataverse)
- Silver: Cleaned and validated data with quality checks
- Gold: Business-ready analytics with dimensional modeling and SCD Type 2
- ✅ Config-Driven Transformations - Add new tables without code changes
- ✅ Data Quality Framework - Automated validation with quarantine pattern (see the sketch below)
- ✅ SCD Type 2 - Historical tracking for dimension tables
- ✅ Comprehensive Monitoring - Metrics, logging, and alerting
- ✅ Multi-Environment - Dev, Test, Prod configurations
- ✅ Delta Lake - ACID transactions and time travel
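To make the quarantine pattern concrete, here is a minimal sketch of the Bronze → Silver split. It assumes a single illustrative not-null rule on the standard Dataverse `emailaddress1` column and that `quarantine_records` accepts the source columns plus the error metadata (`error_type`, `status`, `created_at`) used by the monitoring queries later in this README; the production logic lives in `notebooks/bronze_to_silver_contact.py`:

```python
from pyspark.sql import functions as F

# `spark` is the session provided by the Fabric notebook runtime.
# Illustrative rule: contacts must have an email address.
df = spark.read.table("hs_bronze_dev.contact")

valid = df.filter(F.col("emailaddress1").isNotNull())
invalid = (df.filter(F.col("emailaddress1").isNull())
    .withColumn("error_type", F.lit("missing_email"))
    .withColumn("status", F.lit("new"))
    .withColumn("created_at", F.current_timestamp()))

# Valid rows continue to Silver; invalid rows are quarantined for review.
valid.write.format("delta").mode("append").saveAsTable("hs_silver_dev.contact")
invalid.write.format("delta").mode("append").saveAsTable("hs_monitoring_dev.quarantine_records")
```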
Tables in Scope:

- `contact` - Students, teachers, coaches (SCD Type 2)
- `account` - Partner organizations (SCD Type 2)
- `hs_course` - Course/session data
Status: 🚧 In Development
- ✅ SQL Schemas created
- ✅ Bronze → Silver (contact) notebook
- ✅ Silver → Gold SCD (contact) notebook
- ✅ Monitoring infrastructure
- ⏳ Account and Course notebooks (next)
- ⏳ Pipeline orchestration deployment
- ⏳ Power BI dashboards
```
┌─────────────────────────────────────────────────────────────────┐
│ DATAVERSE │
│ (Microsoft Dynamics 365) │
└────────────────────────┬────────────────────────────────────────┘
│ Fabric Link
▼
┌─────────────────────────────────────────────────────────────────┐
│ BRONZE LAYER - Raw Data (hs_bronze_dev) │
│ • 1:1 copy from Dataverse │
│ • No transformations │
│ • Delta Lake format │
└────────────────────────┬────────────────────────────────────────┘
│ DQ Checks + Cleansing
▼
┌─────────────────────────────────────────────────────────────────┐
│ SILVER LAYER - Cleaned Data (hs_silver_dev) │
│ • Data quality validation │
│ • Standardization (trim, lowercase) │
│ • Deduplication │
│ • Invalid → Quarantine │
└────────────────────────┬────────────────────────────────────────┘
│ SCD Type 2 + Aggregations
▼
┌─────────────────────────────────────────────────────────────────┐
│ GOLD LAYER - Analytics Ready (hs_gold_dev) │
│ • Dimensional modeling (Star schema) │
│ • SCD Type 2 for history tracking │
│ • Business aggregations │
│ • Power BI ready │
└─────────────────────────────────────────────────────────────────┘
```
```
┌──────────────────────────────────────────────────────┐
│ MONITORING LAYER (hs_monitoring_dev) │
│ ┌──────────────────┐ ┌────────────────────────┐ │
│ │ Pipeline Metrics │ │ Quarantine Records │ │
│ │ • Records in/out │ │ • Invalid data │ │
│ │ • Duration │ │ • Error details │ │
│ │ • Error rates │ │ • Resolution tracking │ │
│ └──────────────────┘ └────────────────────────┘ │
│ │
│ ┌──────────────────┐ ┌────────────────────────┐ │
│ │ DQ Rules │ │ Pipeline Run History │ │
│ └──────────────────┘ └────────────────────────┘ │
└──────────────────────────────────────────────────────┘
```
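The SCD Type 2 step between Silver and Gold can be sketched as a Delta Lake merge: expire the current dimension rows whose tracked attributes changed, then append new current versions. Apart from `is_current` and `dim_contact_scd` (which appear in the verification queries below), the column names here (`contact_id`, `row_hash`, `valid_from`, `valid_to`) are assumptions for illustration; see `notebooks/silver_to_gold_scd_contact.py` for the actual implementation:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

updates = spark.read.table("hs_silver_dev.contact")
dim = DeltaTable.forName(spark, "hs_gold_dev.dim_contact_scd")

# Step 1: close out current rows whose tracked attributes changed
# (row_hash is an assumed hash over the tracked columns).
dim.alias("d").merge(
    updates.alias("u"),
    "d.contact_id = u.contact_id AND d.is_current = TRUE"
).whenMatchedUpdate(
    condition="d.row_hash <> u.row_hash",
    set={"is_current": "false", "valid_to": "current_timestamp()"}
).execute()

# Step 2: append new current versions (simplified; the real notebook
# inserts only rows that were actually new or changed).
new_rows = (updates
    .withColumn("is_current", F.lit(True))
    .withColumn("valid_from", F.current_timestamp())
    .withColumn("valid_to", F.lit(None).cast("timestamp")))
new_rows.write.format("delta").mode("append").saveAsTable("hs_gold_dev.dim_contact_scd")
```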
```
hckrschl-deploy/
├── notebooks/ # Spark notebooks for data processing
│ ├── bronze_to_silver_contact.py # ✅ Contact DQ pipeline
│ ├── silver_to_gold_scd_contact.py # ✅ Contact SCD Type 2
│ └── utils/
│ └── common.py # ✅ Shared utility functions
│
├── sql/ # SQL scripts
│ ├── schema/
│ │ ├── 01_monitoring_tables.sql # ✅ Monitoring infrastructure
│ │ └── 02_transformation_config.sql # ✅ Config-driven framework
│ └── seeds/
│ └── transformation_config_seed.sql # ✅ Initial config data
│
├── pipelines/ # Fabric pipeline definitions
│ └── medallion_orchestrator.json # ✅ Main orchestration pipeline
│
├── config/ # Environment configurations
│ ├── dev.json # ✅ Development settings
│ ├── test.json # ✅ Test settings
│ └── prod.json # ✅ Production settings
│
├── context/ # Planning & architecture docs
│ ├── architecture-flow.md # Detailed architecture guide
│ ├── medallion-plan.md # Technical specs
│ └── deployment-plan.md # 2-week implementation plan
│
├── tests/ # Unit and integration tests (TBD)
├── .github/workflows/ # CI/CD pipelines (TBD)
└── README.md # This file
```
- Microsoft Fabric Workspace with:
  - Lakehouse capability enabled
  - Notebook capability enabled
  - Pipeline capability enabled
- Microsoft Dataverse environment:
  - Fabric Link configured
  - Tables: `contact`, `account`, `hs_course`
- Access & Permissions:
  - Fabric Workspace Contributor or Admin
  - Dataverse System Administrator (for link setup)
In the Microsoft Fabric workspace `hs-fabric-dev`, create:

- `hs_bronze_dev` (use existing Dataverse link: `hs_shareddev`)
- `hs_silver_dev` (new)
- `hs_gold_dev` (new)
- `hs_monitoring_dev` (new)
1. Open the SQL endpoint for `hs_monitoring_dev`
2. Run `sql/schema/01_monitoring_tables.sql`
3. Verify the tables were created: `pipeline_metrics`, `quarantine_records`, `dq_rules`, `pipeline_run_history`
1. Open the SQL endpoint for `hs_silver_dev`
2. Run `sql/schema/02_transformation_config.sql`
3. Run `sql/seeds/transformation_config_seed.sql`
4. Verify the config loaded:

```sql
SELECT config_id, source_table, target_table, active
FROM transformation_config
WHERE active = TRUE;
```
1. Import the notebooks from `notebooks/` into the Fabric workspace:
   - `bronze_to_silver_contact.py`
   - `silver_to_gold_scd_contact.py`
   - `utils/common.py`
2. Attach them to the lakehouse `hs_silver_dev` (default)
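A quick smoke test before running the pipelines (a sketch; it assumes the cross-lakehouse name `hs_bronze_dev.contact` resolves from the attached workspace):

```python
# Smoke test: confirm the notebook can see the Bronze lakehouse.
df = spark.read.table("hs_bronze_dev.contact")
print(f"Bronze contact rows: {df.count()}")
df.printSchema()
```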
Manual execution:

```
# In Fabric notebook or SQL endpoint
# 1. Run Bronze → Silver
notebook: bronze_to_silver_contact
parameters: {"environment": "dev"}

# 2. Run Silver → Gold
notebook: silver_to_gold_scd_contact
parameters: {"environment": "dev"}
```
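From a driver notebook, the same two steps can be chained programmatically. A sketch using `mssparkutils` (bundled with Fabric Spark notebooks), assuming the notebooks were imported under the names above:

```python
from notebookutils import mssparkutils

# Run Bronze → Silver, then Silver → Gold (timeout in seconds).
mssparkutils.notebook.run("bronze_to_silver_contact", 1800, {"environment": "dev"})
mssparkutils.notebook.run("silver_to_gold_scd_contact", 1800, {"environment": "dev"})
```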
Check results:

```sql
-- Silver data
SELECT COUNT(*) AS record_count FROM hs_silver_dev.contact;

-- Gold data (current)
SELECT COUNT(*) AS current_count
FROM hs_gold_dev.dim_contact_scd
WHERE is_current = TRUE;

-- Metrics
SELECT * FROM hs_monitoring_dev.pipeline_metrics
ORDER BY start_time DESC
LIMIT 5;

-- Quarantine
SELECT error_type, COUNT(*) AS error_count
FROM hs_monitoring_dev.quarantine_records
GROUP BY error_type;
```

Configuration files in `config/` control behavior per environment:
| Setting | Dev | Test | Prod |
|---|---|---|---|
| Error Threshold | 10% | 5% | 2% |
| Batch Size | 1,000 | 5,000 | 10,000 |
| Schedule | 2:00 AM | 3:00 AM | 1:00 AM |
| Log Level | INFO | INFO | WARN |
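How a notebook might consume these settings is sketched below; the key names (`error_threshold`, `batch_size`) are assumptions, since the actual schema is defined by the JSON files themselves:

```python
import json

# Load the environment config (key names are illustrative).
with open("config/dev.json") as f:
    cfg = json.load(f)

error_threshold = cfg.get("error_threshold", 0.10)  # 10% in dev, per the table above
batch_size = cfg.get("batch_size", 1000)

# Placeholder counts; in the pipeline these come from the DQ step.
invalid_count, total_count = 12, 1000
error_rate = invalid_count / max(total_count, 1)
if error_rate > error_threshold:
    raise RuntimeError(f"Error rate {error_rate:.1%} exceeds threshold {error_threshold:.1%}")
```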
Option 1: Config-Driven (Recommended)

Add an entry to `transformation_config`:

```sql
INSERT INTO transformation_config VALUES (
    'mytable_bronze_silver_dq',
    'bronze.mytable',
    'silver.mytable',
    'silver',
    'dq_check',
    '{"validations": [...], "transformations": [...]}',
    10,
    true,
    1000,
    NULL,
    CURRENT_TIMESTAMP(),
    NULL,
    'data-team',
    'Description of mytable transformation'
);
```

Option 2: Copy Existing Notebook
1. Copy `bronze_to_silver_contact.py` → `bronze_to_silver_mytable.py`
2. Update the table names and validation rules
3. Add the new notebook to the pipeline orchestrator
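For Option 1, a generic notebook can read the active config rows and apply the rules at runtime. A sketch, assuming the JSON column is named `rules_json` and a hypothetical validation shape (`{"column": ..., "check": "not_null"}`):

```python
import json
from pyspark.sql import functions as F

# Read all active transformation configs (written by the INSERT above).
cfg_rows = spark.sql("""
    SELECT source_table, target_table, rules_json
    FROM transformation_config
    WHERE active = TRUE
""").collect()

for row in cfg_rows:
    rules = json.loads(row["rules_json"])
    df = spark.read.table(row["source_table"])
    # Hypothetical rule shape: {"validations": [{"column": "email", "check": "not_null"}]}
    for v in rules.get("validations", []):
        if v["check"] == "not_null":
            df = df.filter(F.col(v["column"]).isNotNull())
    df.write.format("delta").mode("overwrite").saveAsTable(row["target_table"])
```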
Pipeline Metrics Table:

```sql
SELECT
    stage,
    source_table,
    AVG(error_rate) AS avg_error_rate,
    AVG(duration_sec) AS avg_duration,
    SUM(records_written) AS total_records
FROM hs_monitoring_dev.pipeline_metrics
WHERE start_time >= CURRENT_DATE() - INTERVAL 7 DAYS
GROUP BY stage, source_table;
```

Quarantine Dashboard:
```sql
SELECT
    source_table,
    error_type,
    COUNT(*) AS error_count,
    MAX(created_at) AS last_occurrence
FROM hs_monitoring_dev.quarantine_records
WHERE status = 'new'
GROUP BY source_table, error_type
ORDER BY error_count DESC;
```

Azure Monitor alerts trigger on:
- Error rate > threshold
- Pipeline failures
- Long-running pipelines (> 2 hours)
- Quarantine spike
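The first condition could be evaluated with a check like the following sketch (the 10% threshold is the dev value from the configuration table; in practice the alert fires in Azure Monitor rather than in notebook code):

```python
from pyspark.sql import functions as F

# Flag stages whose average error rate over the last day breaches the threshold.
threshold = 0.10  # dev threshold; test and prod use 5% and 2%

breaches = spark.sql("""
    SELECT stage, source_table, AVG(error_rate) AS avg_error_rate
    FROM hs_monitoring_dev.pipeline_metrics
    WHERE start_time >= current_date() - INTERVAL 1 DAY
    GROUP BY stage, source_table
""").filter(F.col("avg_error_rate") > threshold)

if breaches.count() > 0:
    breaches.show()  # Azure Monitor would raise an alert here instead.
```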
Prerequisites:
- Python 3.10+
- PySpark 3.4+
- pytest
```bash
# Install dependencies
pip install -r requirements.txt  # TBD

# Run unit tests
pytest tests/  # TBD

# Run a notebook locally (with Databricks Connect)
python notebooks/bronze_to_silver_contact.py  # TBD
```

Branch strategy:

- `main` - Production code
- `develop` - Integration branch
- `feature/*` - Feature branches
- `hotfix/*` - Emergency fixes
- Linting: `flake8` (Python), `sqlfluff` (SQL)
- Formatting: `black` (Python)
- Type hints: use for all utility functions
- Documentation: docstrings for all functions (see the example below)
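An illustrative utility in that style, with a matching test (the function and test are hypothetical, not part of `utils/common.py`; the test assumes a pytest `spark` session fixture):

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def standardize_email(df: DataFrame, column: str = "emailaddress1") -> DataFrame:
    """Trim and lowercase an email column, per the Silver-layer rules.

    Args:
        df: Input DataFrame.
        column: Name of the email column to standardize.

    Returns:
        DataFrame with the column trimmed and lowercased.
    """
    return df.withColumn(column, F.lower(F.trim(F.col(column))))

# tests/test_common.py (sketch)
def test_standardize_email(spark):
    df = spark.createDataFrame([("  [email protected]  ",)], ["emailaddress1"])
    assert standardize_email(df).first()["emailaddress1"] == "[email protected]"
```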
See the `context/` folder for detailed planning documents:

- `architecture-flow.md` - Deep-dive into architecture decisions (899 lines)
- `medallion-plan.md` - Executive summary and technical specs (301 lines)
- `deployment-plan.md` - 14-day implementation roadmap (517 lines)
- Stefan Kochems (EY) - Architecture & Planning
- Saber - Implementation
1. Create a feature branch: `git checkout -b feature/my-feature`
2. Make changes and test locally
3. Commit with clear messages
4. Push and create a Pull Request
5. Tag reviewers: @stefan-kochems
Internal Hackerschool project - not for public distribution.
Issues: Create a GitHub issue with one of these labels:

- `bug` - Something isn't working
- `enhancement` - New feature request
- `question` - Need help
Contact:
- Data Team: [email protected]
- Stefan Kochems: [email protected]
- [x] Architecture design
- [x] SQL schemas
- [x] Contact table pipeline
- [ ] Account table pipeline
- [ ] Course table pipeline
- [ ] Pipeline orchestration
- [ ] Power BI dashboards
- [ ] Config-driven generic notebooks
- [ ] All Dataverse tables onboarded
- [ ] CI/CD pipeline
- [ ] Test environment deployment
- [ ] Performance optimization
- [ ] Production deployment
- [ ] Monitoring & alerting
- [ ] Data lineage tracking
- [ ] User documentation
- [ ] Training sessions
Last Updated: 2025-10-26
Version: 0.1.0 (MVP in progress)
Repository: github.com/hackerschool/fabric-medallion