A complete data engineering pipeline that extracts public procurement contracts from the Portuguese public portal (BASE.gov.pt), structures them into a normalized operational database (OLTP), and transforms them into an optimized dimensional model (OLAP / Star Schema) for analytical reporting and business intelligence.
- Automated Data Extraction: Scrapes and processes public contract records from the official BASE.gov API.
- OLTP Relational Database: Stores raw transactional data in a normalized structure (Districts, Municipalities, Parishes, and Contracts).
- OLAP Star Schema: Implements a clean dimensional model optimized for fast BI reporting (PowerBI, Metabase, etc.).
- ETL Aggregations in Python: Pre-calculates heavy statistical measures during the ETL process, including:
- Contract Count (
quantidade_contratos) - Total Cost (
preco_total) - Maximum Price (
preco_maximo) - Minimum Price (
preco_minimo) - Average Price (
preco_medio) - Median Price (
preco_mediano) (computed in Python to bypass complex database-side median calculations).
- Contract Count (
- Performance Optimization: Star schema queries run up to 3.45x faster compared to equivalent queries in the normalized OLTP schema.
Highly normalized schema to ensure transactional integrity and avoid redundancy.
- Tabelas:
distritos➔municipios➔freguesias➔contratos_mun/contratos_freg
Denormalized star schema designed around a central fact table and surrounding dimension tables to allow fast slice-and-dice operations.
erDiagram
fact_contratos }|--|| dim_localidade : joins
fact_contratos }|--|| dim_tempo : joins
fact_contratos }|--|| dim_entidade : joins
fact_contratos }|--|| dim_pais : joins
fact_contratos }|--|| dim_procedimento : joins
fact_contratos {
bigint id_fact PK
enum origem_contrato
int id_localidade FK
int id_tempo_publicacao FK
int id_tempo_assinatura FK
int id_entidade_contratante FK
int id_entidade_contratada FK
int id_pais FK
int id_procedimento FK
int quantidade_contratos
decimal preco_total
decimal preco_maximo
decimal preco_minimo
decimal preco_medio
decimal preco_mediano
}
dim_localidade {
int id_localidade PK
enum nivel_entidade
int nif_mun
varchar nome_municipio
char ine_mun
int nif_freg
varchar nome_freguesia
char ine_freg
char ine_dist
varchar nome_distrito
}
dim_tempo {
int id_tempo PK
date data
int ano
int mes
varchar nome_mes
int trimestre
int semestre
int dia
int dia_semana
varchar nome_dia_semana
}
dim_entidade {
int id_entidade PK
varchar nome_entity
}
dim_procedimento {
int id_procedimento PK
varchar procedureType_base
}
dim_pais {
int id_pais PK
varchar pais_base
}
Evaluating query execution times on a dataset of 450,000+ public contracts:
Get the total number of contracts and total financial volume grouped by District.
Requires unioning municipality/parish contracts, traversing the multi-layered geographical hierarchy, and using an OR conditional join (which bypasses database indexes).
SELECT
d.nome AS nome_distrito,
COUNT(*) AS total_contratos,
SUM(c.price) AS valor_total
FROM (
SELECT nif_mun AS nif, price FROM contratos_mun
UNION ALL
SELECT nif_freg AS nif, price FROM contratos_freg
) c
LEFT JOIN municipios m ON c.nif = m.nif_mun
LEFT JOIN freguesias f ON c.nif = f.nif_freg
LEFT JOIN municipios m2 ON f.nif_mun = m2.nif_mun
LEFT JOIN distritos d ON (m.ine_dist = d.ine_dist OR m2.ine_dist = d.ine_dist)
WHERE d.nome IS NOT NULL
GROUP BY d.nome;Uses a single, direct, index-friendly join between fact_contratos and dim_localidade.
SELECT
l.nome_distrito,
SUM(f.quantidade_contratos) AS total_contratos,
SUM(f.preco_total) AS valor_total
FROM fact_contratos f
JOIN dim_localidade l ON f.id_localidade = l.id_localidade
WHERE l.nome_distrito IS NOT NULL
GROUP BY l.nome_distrito;Result: The OLAP query is 3.45x faster due to pre-aggregation and index-optimized design.
- Extraction (
extrair_contratos_base_gov/): Scrapes BASE.gov JSON endpoints paginating by NIF, downloading records and saving them locally. - Operational Load (
enviar_dados_para_bd/): Imports normalized geographical data (Districts, Municipalities, Parishes) and populates the OLTP tables in MySQL. - OLAP Transformation (
ETL_OLAP/transformar_para_OLAP.py):- Builds dimension tables dynamically.
- Extracts date parts (year, month, semester, trimester, weekday name) for
dim_tempo. - Aggregates contracts by identical dimension keys in Python, calculating sums, counts, and statistical ranges (min, max, average, and median).
- Inserts records in optimized database chunks (10,000 rows/batch) to avoid server timeouts.
- Python 3.8+
- MySQL Server 8.0+
mysql-connector-pythonpackage
- Create a MySQL database (e.g.,
Bernardo). - Run the DDL script to create the OLTP tables.
- Place your raw Excel/CSV data files in the designated folders.
- Set up database credentials inside the Python scripts.
- Run the ETL pipeline script to transform operational data into the analytical star schema:
python ETL_OLAP/transformar_para_OLAP.py