Status: only the Executive Summary is complete; the full report is actively being worked on.
This project demonstrates a professional end-to-end data engineering pipeline built in a local environment to simulate a production cloud-scale architecture. The system extracts, transforms, and visualizes regional economic disparities in Poland using data from the Statistics Poland (GUS) BDL API.
Key Technical Pillars:
- Cloud Simulation: Full Azure Data Lake simulation using Azurite (Blob Storage) and local PySpark on Windows.
- Infrastructure Portability: Automated environment normalization (WinUtils/Hadoop integration) for seamless execution across multiple machines (Laptop/Desktop).
- Data Quality & Imputation: Advanced transformation logic implementing linear interpolation to fill data gaps or confidential records (e.g., Opolskie 2023).
- Business Intelligence: Enterprise-grade Power BI reporting featuring a Bento Grid layout, multi-page navigation, and dynamic DAX narratives.
Pipeline Architecture:
- Source: Statistics Poland API (GUS BDL).
- Ingestion (Extract): Python-based client with pagination handling and rate limiting, persisting raw JSON telemetry (a minimal sketch follows this list).
- Storage (Data Lake): Azurite Blob Storage Emulator organized into `raw`, `staging`, and `curated` zones.
- Processing (Transform): Apache Spark (PySpark) executing:
  - Hierarchical JSON flattening.
  - Data Imputation: Linear interpolation for missing/confidential data (`attr_id != 1`); see the sketch after this list.
  - Relational modeling (Star Schema).
- Serving (Load): Final assets stored as Parquet files with static URI mapping (`data.parquet`) for stable BI connectivity.
- Visualization: Power BI Desktop connected via HTTP/WASB, featuring dynamic trend analysis and regional benchmarking.
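The Ingestion layer above is a Python client with pagination handling and rate limiting. The snippet below is a minimal illustrative sketch of that pattern, not the project's actual client: the endpoint path, query parameters, and the `fetch_variable` helper are assumptions modelled on the public BDL API conventions.

```python
import json
import time
from pathlib import Path

import requests

BDL_BASE_URL = "https://bdl.stat.gov.pl/api/v1"  # public GUS BDL API root
RAW_DIR = Path("data/raw")                       # hypothetical raw zone path

def fetch_variable(variable_id: int, page_size: int = 100, delay_s: float = 0.5) -> list[dict]:
    """Fetch all pages for one BDL variable, throttling between requests."""
    results, page = [], 0
    while True:
        resp = requests.get(
            f"{BDL_BASE_URL}/data/by-variable/{variable_id}",
            params={"format": "json", "page": page, "page-size": page_size, "unit-level": 2},
            timeout=30,
        )
        resp.raise_for_status()
        payload = resp.json()

        # Persist the raw page verbatim so the Spark layer can re-process it later.
        RAW_DIR.mkdir(parents=True, exist_ok=True)
        (RAW_DIR / f"var_{variable_id}_page_{page}.json").write_text(
            json.dumps(payload, ensure_ascii=False), encoding="utf-8"
        )

        results.extend(payload.get("results", []))
        if "next" not in payload.get("links", {}):  # no further pages advertised
            break
        page += 1
        time.sleep(delay_s)                         # simple fixed-delay rate limiting
    return results
```

The fixed `time.sleep` delay is a simple way to stay under the API's request limits; the real client may throttle differently or authenticate with an API key.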
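Likewise, the imputation step (filling gaps such as Opolskie 2023) is only summarized above. Below is a hedged sketch of per-region linear interpolation over a yearly series using Spark window functions; the `region`, `year`, and `value` column names and the toy data are assumptions, and the project's transformer may be structured differently.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("interpolation-sketch").getOrCreate()

# Toy frame: Opolskie is missing 2023 (e.g., a confidential record removed upstream).
df = spark.createDataFrame(
    [("Opolskie", 2021, 100.0), ("Opolskie", 2022, 110.0),
     ("Opolskie", 2023, None),  ("Opolskie", 2024, 130.0)],
    ["region", "year", "value"],
)

w_prev = Window.partitionBy("region").orderBy("year").rowsBetween(Window.unboundedPreceding, 0)
w_next = Window.partitionBy("region").orderBy("year").rowsBetween(0, Window.unboundedFollowing)

df = (
    df
    # Closest known value (and its year) before and after each row.
    .withColumn("prev_val", F.last("value", ignorenulls=True).over(w_prev))
    .withColumn("prev_year", F.last(F.when(F.col("value").isNotNull(), F.col("year")), ignorenulls=True).over(w_prev))
    .withColumn("next_val", F.first("value", ignorenulls=True).over(w_next))
    .withColumn("next_year", F.first(F.when(F.col("value").isNotNull(), F.col("year")), ignorenulls=True).over(w_next))
    # Linear interpolation between the surrounding known points; nearest value at series edges.
    .withColumn(
        "value_filled",
        F.when(F.col("value").isNotNull(), F.col("value"))
         .when(
             F.col("prev_val").isNotNull() & F.col("next_val").isNotNull(),
             F.col("prev_val")
             + (F.col("next_val") - F.col("prev_val"))
             * (F.col("year") - F.col("prev_year"))
             / (F.col("next_year") - F.col("prev_year")),
         )
         .otherwise(F.coalesce("prev_val", "next_val")),
    )
    .drop("prev_val", "prev_year", "next_val", "next_year")
)
```

For the missing Opolskie 2023 row this yields 120.0, the midpoint of the 2022 and 2024 values; known values pass through untouched, and gaps at the edges of a series fall back to the nearest observation.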
The pipeline monitors a comprehensive set of indicators across 16 Voivodeships:
- Labor Market: Average Gross Wages, Registered Unemployment Rate.
- Living Standards: Disposable Income vs. Expenditures per capita.
- Macroeconomics: GDP per capita, Total GDP, Investment Outlays.
- Housing Market: Residential Price per m², Dwellings Completed, Market Transactions Volume.
- Public Finance & Business: Budget Revenues/Expenditures, Business Entities per 10k population.
Repository Structure:
- `data/`: Raw (JSON), Staging (Parquet), and Curated data layers.
- `src/pyspark/`: Core ETL logic (Main orchestrator, Spark setup, GUS client, Transformers).
- `configs/`: Environment-specific settings (dev/prod) and metric definitions (see the sketch below).
- `scripts/`: Automation scripts for service management and pipeline execution.
- `exploration/`: Advanced debugging tools, API inspectors, and data availability checkers.
- `assets/maps/`: TopoJSON files processed for Power BI Shape Map integration.
- `docs/`: Detailed technical documentation, DAX blueprints, and setup guides.
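The metric definitions kept under `configs/` lend themselves to a small typed loader. The sketch below is purely illustrative: the field names (`bdl_variable_id`, `unit`, `category`) and the placeholder entries are assumptions, not the project's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricDefinition:
    """One indicator tracked by the pipeline (illustrative schema, not the real config)."""
    key: str              # internal metric identifier used across the pipeline
    bdl_variable_id: int  # GUS BDL variable the metric is sourced from
    unit: str             # reporting unit, e.g. PLN or %
    category: str         # grouping used in the Power BI report

# Hypothetical excerpt of a metric definitions file; the IDs are placeholders, not real BDL variables.
RAW_METRICS = [
    {"key": "avg_gross_wage", "bdl_variable_id": 0, "unit": "PLN", "category": "Labor Market"},
    {"key": "unemployment_rate", "bdl_variable_id": 0, "unit": "%", "category": "Labor Market"},
]

def load_metrics(raw: list[dict]) -> dict[str, MetricDefinition]:
    """Validate raw config entries and index them by metric key."""
    return {entry["key"]: MetricDefinition(**entry) for entry in raw}

metrics = load_metrics(RAW_METRICS)
```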
Environment Requirements:
- OS: Windows 10/11.
- Runtime: Python 3.11 (pinned for PySpark 3.4.1 compatibility).
- Java: JDK 17 (required by the Spark/Hadoop ecosystem).
- Emulator: Node.js (runtime for the Azurite Blob Storage emulator).
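The Infrastructure Portability pillar refers to normalizing these prerequisites automatically before Spark starts. A minimal pre-flight check might look like the sketch below; the `C:\hadoop` location is an assumption, and the project's setup code may resolve paths from its configs instead.

```python
import os
import shutil
from pathlib import Path

def normalize_environment(hadoop_home: str = r"C:\hadoop") -> None:
    """Fail fast if the Windows Spark prerequisites are missing, then export the
    environment variables PySpark expects (sketch; paths are assumptions)."""
    if shutil.which("java") is None and "JAVA_HOME" not in os.environ:
        raise EnvironmentError("JDK 17 not found: install it or set JAVA_HOME.")

    winutils = Path(hadoop_home) / "bin" / "winutils.exe"
    if not winutils.exists():
        raise EnvironmentError(f"winutils.exe not found at {winutils}; Spark on Windows needs it.")

    # Make the Hadoop shim visible to the Spark/Hadoop libraries in this process.
    os.environ.setdefault("HADOOP_HOME", hadoop_home)
    os.environ["PATH"] = f"{Path(hadoop_home) / 'bin'}{os.pathsep}{os.environ['PATH']}"

if __name__ == "__main__":
    normalize_environment()
```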
Quick Start:
- Start Services (Azurite): `.\scripts\start_all.ps1`
- Run ETL Pipeline: `.\scripts\run_etl_dev.ps1`
To maintain a clean environment or reset data states, use the following utility scripts:
- Reset Cloud Storage: `python .\exploration\tools\reset_azurite.py` (wipes the Azurite containers; sketched below).
- Clear Spark Staging: `.\scripts\maintenance\clean_staging.ps1` (removes transient Parquet files).
- Purge Raw API Data: `.\scripts\maintenance\clean_raw_data.ps1` (deletes all JSON source files).
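For context, a container wipe like the one `reset_azurite.py` performs can be written in a few lines against the `azure-storage-blob` SDK using Azurite's documented development credentials. This is a sketch, not the project's script, and the zone container names are assumptions.

```python
from azure.storage.blob import BlobServiceClient

# Azurite's documented default development-storage connection string.
AZURITE_CONN_STR = (
    "DefaultEndpointsProtocol=http;"
    "AccountName=devstoreaccount1;"
    "AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IColQeLxBLutERFwnJrmcv6rW1CQ0A==;"
    "BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;"
)

# Assumed zone containers; the real pipeline may use different names.
ZONES = ["raw", "staging", "curated"]

def reset_containers() -> None:
    """Delete and recreate each zone container so the pipeline starts from a clean slate."""
    service = BlobServiceClient.from_connection_string(AZURITE_CONN_STR)
    existing = {c.name for c in service.list_containers()}
    for zone in ZONES:
        if zone in existing:
            service.delete_container(zone)  # removes the container and all blobs in it
        service.create_container(zone)

if __name__ == "__main__":
    reset_containers()
```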
Active `settings.json` files are ignored by Git. Use the provided templates:
- Local/Server Mode: Copy `settings.template.json` to `settings.json`.
- LAN Client Mode: Copy `settings.lan.template.json` to `settings.json` (update the Host IP).
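A small loader on top of those templates keeps the host switch in one place. The sketch below is an assumption about how such settings might be consumed; the file path and the `azurite_host` / `blob_port` keys are illustrative, not the template's actual schema.

```python
import json
from pathlib import Path

def load_settings(path: str = "configs/settings.json") -> dict:
    """Read the active settings file and derive the Azurite blob endpoint from it."""
    settings = json.loads(Path(path).read_text(encoding="utf-8"))
    host = settings.get("azurite_host", "127.0.0.1")  # LAN clients point this at the server's IP
    port = settings.get("blob_port", 10000)           # Azurite's default blob service port
    settings["blob_endpoint"] = f"http://{host}:{port}/devstoreaccount1"
    return settings
```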
This project was developed in collaboration with Google Gemini 2.5 Flash Preview. The AI served as a pair-programmer for:
- Architecting the Windows-compatible Spark environment.
- Designing complex DAX measures for economic benchmarking.
- Refactoring code for professional documentation standards and idempotency.