Skip to content

Bebelon2000/public-contracts-oltp-to-olap

Repository files navigation

Public Contracts Analytics: OLTP to OLAP Data Pipeline

A complete data engineering pipeline that extracts public procurement contracts from the Portuguese public portal (BASE.gov.pt), structures them into a normalized operational database (OLTP), and transforms them into an optimized dimensional model (OLAP / Star Schema) for analytical reporting and business intelligence.


🌟 Key Features

  • Automated Data Extraction: Scrapes and processes public contract records from the official BASE.gov API.
  • OLTP Relational Database: Stores raw transactional data in a normalized structure (Districts, Municipalities, Parishes, and Contracts).
  • OLAP Star Schema: Implements a clean dimensional model optimized for fast BI reporting (PowerBI, Metabase, etc.).
  • ETL Aggregations in Python: Pre-calculates heavy statistical measures during the ETL process, including:
    • Contract Count (quantidade_contratos)
    • Total Cost (preco_total)
    • Maximum Price (preco_maximo)
    • Minimum Price (preco_minimo)
    • Average Price (preco_medio)
    • Median Price (preco_mediano) (computed in Python to bypass complex database-side median calculations).
  • Performance Optimization: Star schema queries run up to 3.45x faster compared to equivalent queries in the normalized OLTP schema.

📐 Architecture & Database Models

1. Operational Database (OLTP)

Highly normalized schema to ensure transactional integrity and avoid redundancy.

  • Tabelas: distritosmunicipiosfreguesiascontratos_mun / contratos_freg

2. Analytical Database (OLAP - Star Schema)

Denormalized star schema designed around a central fact table and surrounding dimension tables to allow fast slice-and-dice operations.

erDiagram
    fact_contratos }|--|| dim_localidade : joins
    fact_contratos }|--|| dim_tempo : joins
    fact_contratos }|--|| dim_entidade : joins
    fact_contratos }|--|| dim_pais : joins
    fact_contratos }|--|| dim_procedimento : joins

    fact_contratos {
        bigint id_fact PK
        enum origem_contrato
        int id_localidade FK
        int id_tempo_publicacao FK
        int id_tempo_assinatura FK
        int id_entidade_contratante FK
        int id_entidade_contratada FK
        int id_pais FK
        int id_procedimento FK
        int quantidade_contratos
        decimal preco_total
        decimal preco_maximo
        decimal preco_minimo
        decimal preco_medio
        decimal preco_mediano
    }

    dim_localidade {
        int id_localidade PK
        enum nivel_entidade
        int nif_mun
        varchar nome_municipio
        char ine_mun
        int nif_freg
        varchar nome_freguesia
        char ine_freg
        char ine_dist
        varchar nome_distrito
    }

    dim_tempo {
        int id_tempo PK
        date data
        int ano
        int mes
        varchar nome_mes
        int trimestre
        int semestre
        int dia
        int dia_semana
        varchar nome_dia_semana
    }

    dim_entidade {
        int id_entidade PK
        varchar nome_entity
    }

    dim_procedimento {
        int id_procedimento PK
        varchar procedureType_base
    }

    dim_pais {
        int id_pais PK
        varchar pais_base
    }
Loading

⚡ Performance Comparison

Evaluating query execution times on a dataset of 450,000+ public contracts:

The Scenario:

Get the total number of contracts and total financial volume grouped by District.

Normalized OLTP Query (6.31 seconds):

Requires unioning municipality/parish contracts, traversing the multi-layered geographical hierarchy, and using an OR conditional join (which bypasses database indexes).

SELECT 
    d.nome AS nome_distrito,
    COUNT(*) AS total_contratos,
    SUM(c.price) AS valor_total
FROM (
    SELECT nif_mun AS nif, price FROM contratos_mun
    UNION ALL
    SELECT nif_freg AS nif, price FROM contratos_freg
) c
LEFT JOIN municipios m ON c.nif = m.nif_mun
LEFT JOIN freguesias f ON c.nif = f.nif_freg
LEFT JOIN municipios m2 ON f.nif_mun = m2.nif_mun
LEFT JOIN distritos d ON (m.ine_dist = d.ine_dist OR m2.ine_dist = d.ine_dist)
WHERE d.nome IS NOT NULL
GROUP BY d.nome;

Star Schema OLAP Query (1.83 seconds):

Uses a single, direct, index-friendly join between fact_contratos and dim_localidade.

SELECT 
    l.nome_distrito,
    SUM(f.quantidade_contratos) AS total_contratos,
    SUM(f.preco_total) AS valor_total
FROM fact_contratos f
JOIN dim_localidade l ON f.id_localidade = l.id_localidade
WHERE l.nome_distrito IS NOT NULL
GROUP BY l.nome_distrito;

Result: The OLAP query is 3.45x faster due to pre-aggregation and index-optimized design.


🛠️ How it Works (ETL Pipeline)

  1. Extraction (extrair_contratos_base_gov/): Scrapes BASE.gov JSON endpoints paginating by NIF, downloading records and saving them locally.
  2. Operational Load (enviar_dados_para_bd/): Imports normalized geographical data (Districts, Municipalities, Parishes) and populates the OLTP tables in MySQL.
  3. OLAP Transformation (ETL_OLAP/transformar_para_OLAP.py):
    • Builds dimension tables dynamically.
    • Extracts date parts (year, month, semester, trimester, weekday name) for dim_tempo.
    • Aggregates contracts by identical dimension keys in Python, calculating sums, counts, and statistical ranges (min, max, average, and median).
    • Inserts records in optimized database chunks (10,000 rows/batch) to avoid server timeouts.

🚀 Getting Started

Prerequisites

  • Python 3.8+
  • MySQL Server 8.0+
  • mysql-connector-python package

Setup

  1. Create a MySQL database (e.g., Bernardo).
  2. Run the DDL script to create the OLTP tables.
  3. Place your raw Excel/CSV data files in the designated folders.
  4. Set up database credentials inside the Python scripts.
  5. Run the ETL pipeline script to transform operational data into the analytical star schema:
    python ETL_OLAP/transformar_para_OLAP.py

About

OLTP to OLAP Data Pipeline for Portuguese Public Procurement Contracts (BASE.gov.pt)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages