Skip to content

Latest commit

 

History

History
295 lines (225 loc) · 11.6 KB

File metadata and controls

295 lines (225 loc) · 11.6 KB

Data Catalog — 100cep Gateway

Gold Layer 6 Tables Dimensional Model


📚 About This Catalog

This document contains the complete documentation of the data model for the Gold layer of the 100cep Gateway MVP. The model was structured following dimensional Data Warehouse principles, optimized for business analytics.


Relationship Table

Source Table FK Column Target Table PK Column Cardinalidade
fato_transacoes cliente_id dim_clientes cliente_id N:1
fato_transacoes vendedor_id dim_vendedores vendedor_id N:1
fato_transacoes data_pedido dim_data data_calendario N:1
fato_transacoes pedido_id dim_chargebacks pedido_id N:0,1
dim_clientes cep_prefixo dim_geolocalizacao cep_prefixo N:1
dim_vendedores cep_prefixo dim_geolocalizacao cep_prefixo N:1

Note: Not all orders have chargebacks.


dim_chargebacks

Type: Event Dimension
Granularity: 1 record = 1 chargeback request
Records: ~1,000 chargebacks

dim_chargebacks

Column Type Description
pedido_id string Unique identifier for each chargeback request, consisting of 13 alphanumeric characters in lowercase.
motivo_chargeback string Reason for the chargeback request, which may include fraud, non-receipt of products, defective product, among other causes.
status_chargeback string Current status of the chargeback request, indicating its progress stage.
resposta_emitente string Response provided by the card issuer regarding the chargeback request.
resposta_adquirente string Response provided by the acquirer regarding the chargeback request.

Value Domains:

  • motivo_chargeback: Fraud, Product not received, Product defective, Service issue, Unauthorized transaction, etc.
  • status_chargeback: pending, approved, denied, investigating, closed
  • resposta_emitente: Approved, Denied, Under Review, Pending Documentation
  • resposta_adquirente: Accepted, Rejected, Requesting Evidence, Processing

Business Rules:

  • ✅ Not all orders have chargebacks (1:0..1 relationship)
  • ✅ Chargebacks are generated by an additional dataset (AI-generated)
  • ✅ Used for risk analysis and fraud detection

Use:

  • Chargeback rate analysis
  • Identification of fraud patterns
  • Risk assessment by payment method/region
  • Monitoring of financial losses

dim_data

dim_data

Column Type Description
data_calendario date Date when the order was placed. Format: YYYY-MM-DD.
dia int Day number corresponding to the date in integer format.
mes int Month number corresponding to the date.
ano int Year number corresponding to the date.
nome_dia_semana string Weekday name corresponding to the date.
nome_mes string Month name corresponding to the date.
Use:
  • Identification of sellers in transactions
  • Performance analysis by seller
  • Geographic analysis of sellers

dim_geolocalizacao

Type: Geographic Dimension
Granularity: 1 record = 1 ZIP code prefix (5 digits)
Records: ~19,000 ZIP code prefixes

dim_geolocalizacao

Column Type Description
cep_prefixo string The first 5 digits of the ZIP code.
cidade string City name associated with the ZIP code.
estado string Brazilian state abbreviation (two uppercase letters) associated with the ZIP code.
latitude string Geographic coordinate specifying the north-south position.
longitude string Geographic coordinate specifying the east-west position.
Use:
  • Identification of customers in transactions
  • Geographic analysis (via cep_prefixo)
  • Customer segmentation by region

dim_vendedores

Type: Dimension Table
Granularity: 1 record = 1 seller
Records: ~3,000 sellers

dim_vendedores

Column Type Description
vendedor_id string Unique identifier for each seller, consisting of 13 alphanumeric characters in lowercase.
cep_prefixo string The first 5 digits of the seller's postal code.

fato_transacoes

Type: Fact Table
Granularity: 1 record = 1 payment transaction per order
Records: ~99,000 transactions

ato_transacoes

Column Type Description
pedido_id string Unique identifier for each transaction, consisting of 13 alphanumeric characters in lowercase.
cliente_id string Unique identifier for each client, consisting of 13 alphanumeric characters in lowercase.
vendedor_id string Unique identifier for each seller, consisting of 13 alphanumeric characters in lowercase.
data_pedido date Date when the order was placed. Format: YYYY-MM-DD.
horario_pedido date Time when the order was placed in the format HH:MM:SS.
tipo_pagamento string Payment method used for the transaction.
valor_transacao decimal(12,2) Transaction value in a decimal format with up to 12 digits and 2 decimal places, only positive values.
preco_total decimal(16,2) Total order price, including product price and shipping cost.
frete_total decimal(15,2) Total shipping cost in a decimal format with up to 15 digits and 2 decimal places, only positive values.
status_pedido string Current delivery status of the order.

Value Domains:

  • tipo_pagamento: credit_card, boleto, voucher, debit_card
  • status_pedido: delivered, shipped, canceled, processing, unavailable, invoiced

Business Rules:

  • ✅ Only orders with status "delivered" are included
  • preco_total = sum of all items in the order
  • frete_total = sum of all shipping values of the items
  • valor_transacao can be different from preco_total (multiple payments)

dim_data

Type: Time Dimension
Granularity: 1 record = 1 day
Records: ~365 days (2016-2018)

dim_data

Column Type Description
data_calendario date Date when the order was placed. Format: YYYY-MM-DD.
dia int Day number corresponding to the date in integer format.
mes int Month number corresponding to the date.
ano int Year number corresponding to the date.
nome_dia_semana string Weekday name corresponding to the date.
nome_mes string Month name corresponding to the date.

Value Domains:

  • dia: 1-31
  • mes: 1-12
  • ano: 2016-2018
  • nome_dia_semana: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday
  • nome_mes: January, February, March, ..., December

dim_clientes

Type: Dimension Table
Granularity: 1 record = 1 customer
Records: ~99,000 customers

dim_clientes

Column Type Description
cliente_id string Unique identifier for each client, consisting of 13 alphanumeric characters in lowercase.
cep_prefixo string The first 5 digits of the customer postal code.

dim_vendedores

Type: Dimension Table
Granularity: 1 record = 1 seller
Records: ~3,000 sellers

dim_vendedores

Column Type Description
vendedor_id string Unique identifier for each seller, consisting of 13 alphanumeric characters in lowercase.
cep_prefixo string The first 5 digits of the seller's postal code.

dim_geolocalizacao

Type: Geographic Dimension
Granularity: 1 record = 1 ZIP code prefix (5 digits)
Records: ~19,000 ZIP code prefixes

dim_geolocalizacao

Column Type Description
cep_prefixo string The first 5 digits of the ZIP code.
cidade string City name associated with the ZIP code.
estado string Brazilian state abbreviation (two uppercase letters) associated with the ZIP code.
latitude string Geographic coordinate specifying the north-south position.
longitude string Geographic coordinate specifying the east-west position.

Value Domains:

  • estado: Abbreviations of the 27 Brazilian states (AC, AL, AM, AP, BA, CE, DF, ES, GO, MA, MG, MS, MT, PA, PB, PE, PI, PR, RJ, RN, RO, RR, RS, SC, SE, SP, TO)
  • latitude: -33.75 to 5.27 (Brazil boundaries)
  • longitude: -73.99 to -34.79 (Brazil boundaries)

🔗 Data Lineage

Transformation Flow

📂 Kaggle CSV Files
    ↓
💾 Unity Catalog Volumes
    ↓
🟫 Bronze Layer (Raw)
    ├─ bronze_customers
    ├─ bronze_orders
    ├─ bronze_order_payments
    ├─ bronze_order_items
    ├─ bronze_sellers
    ├─ bronze_geolocation
    └─ bronze_chargebacks
    ↓
🧹 Silver Layer (Cleaned)
    ├─ silver_customers      [limpeza + validação]
    ├─ silver_orders        [conversão temporal]
    ├─ silver_order_payments [agregação]
    ├─ silver_order_items    [agregação]
    ├─ silver_sellers       [padronização]
    ├─ silver_geolocation   [deduplicação]
    └─ silver_chargebacks   [enriquecimento]
    ↓
🏆 Gold Layer (Analytics) ← YOU ARE HERE!
    ├─ fato_transacoes
    ├─ dim_data
    ├─ dim_clientes
    ├─ dim_vendedores
    ├─ dim_geolocalizacao
    └─ dim_chargebacks

Column Mapping (Main)

Origin (Kaggle) Bronze Silver Gold Transformation
order_id order_id pedido_id pedido_id Translation
customer_id customer_id cliente_id cliente_id Translation
order_purchase_timestamp order_purchase_timestamp data_compra (TIMESTAMP) data_pedido (DATE) Conversion + Split
payment_value payment_value (STRING) valor_pagamento (DECIMAL) valor_transacao (DECIMAL) Type cast + Aggregation
zip_code_prefix zip_code_prefix cep_prefixo cep_prefixo Translation

Complete transformation documentation: /docs/etl.md


✅ Quality Validations

  • Unique primary keys (no duplicates)
  • Valid foreign keys (referential integrity)
  • Correct and consistent data types
  • Values within expected domains
  • Nulls only in optional fields
  • Geographic coordinates within Brazil's boundaries
  • Dates within the expected range (2016-2018)
  • Positive numeric values (except lat/long)