The MVP simulates the transactional pipeline of 100cep Gateway, including ingestion, processing, reconciliation and chargebacks, following acquiring and financial infrastructure standards.
Data pipeline built on Databricks to simulate the processing of orders, payments and chargebacks for a fictitious company in the payments sector, 100cep Gateway.
The project follows Data Lakehouse best practices, using Delta Lake, Unity Catalog and the Bronze → Silver → Gold architecture.
Repository organization:
100cep-gateway
├── .databricks
│   ├── pipeline
│   ├── html # contains Databricks files in .html format
│   └── notebooks # contains Databricks files in .ipynb format
├── datasets
│   ├── ai_dataset # contains the dataset generated by the OpenAI 5.0 model
│   └── olist_dataset # contains the Brazilian E-Commerce Public Dataset by Olist
├── dbdiagram # contains the code created in dbdiagram.io
├── images
│   ├── databricks # Databricks evidence
│   ├── dbdiagram # dbdiagram.io schema
│   └── logo # 100cep Gateway logo
100cep Gateway is a fictitious borderless payments infrastructure company, specialized in processing global payments in a fast, secure and interoperable way.
Our goal is to enable fast, secure and borderless transactions. After all, we are 100cep: no city, state or country limits the flow of payments.
- Global Payments: Processing without geographic restrictions
- High Performance: Infrastructure prepared for high transaction volume
- Security: Real-time fraud and chargeback monitoring
- Analytics: Dashboards and metrics for decision-making
- Databricks: Unified data platform
- Delta Lake: Transactional storage format
- Unity Catalog: Data governance and cataloging
- UC Volumes: Raw file storage
- Apache Spark: Distributed processing engine
- PySpark: Python API for Spark
- SQL: Analytical queries and transformations
- Pandas: Exploratory data analysis
- Seaborn: Statistical visualizations
- Matplotlib: Charts and plots
- GeoPandas: Geospatial analyses
- dbdiagram.io: Data modeling
This MVP aims to build a complete data engineering pipeline to:
- ingest transactional e-commerce data;
- standardize, relate and organize entities (orders, payments, items, customers, sellers);
- generate analytical layers for monitoring risk, antifraud and chargebacks;
- answer business questions typical of payment companies, acquirers and gateways.
The central focus is to understand:
How can 100cep Gateway monitor, reconcile and anticipate payment and chargeback events using transactional data?
All business questions are documented in:
/docs/business_questions.md
The data come from Kaggle (Brazilian E-Commerce Public Dataset by Olist), a dataset widely used in studies and educational projects.
Process followed:
- Manual download of CSV files.
- Upload to Unity Catalog Volumes in Databricks, ensuring:
  - cloud storage,
  - versioning via UC,
  - standardized ingestion at the Bronze level.
No web scraping or sensitive data was used.
No internal or confidential data from real companies was used.
Evidence: Screenshots of the collection process are available in the /docs/images/databricks/ folder.
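The upload-and-ingest flow described above can be sketched as follows. This is a minimal sketch, not the project's notebook code: the catalog, schema, volume and table names (`gateway`, `bronze`, `raw`, `orders`), the file name and both helper functions are assumptions made for illustration, and `ingest_csv_to_bronze` expects the `spark` session that a Databricks notebook provides.

```python
def volume_path(catalog: str, schema: str, volume: str, filename: str) -> str:
    """Build a Unity Catalog Volume path: /Volumes/<catalog>/<schema>/<volume>/<file>."""
    return f"/Volumes/{catalog}/{schema}/{volume}/{filename}"


def ingest_csv_to_bronze(spark, csv_path: str, table_name: str) -> None:
    """Read a CSV without type inference and persist it as a Delta table."""
    df = (
        spark.read.format("csv")
        .option("header", "true")        # first row holds the column names
        .option("inferSchema", "false")  # keep every column as string in Bronze
        .load(csv_path)
    )
    df.write.format("delta").mode("overwrite").saveAsTable(table_name)


# Hypothetical names, for illustration only.
path = volume_path("gateway", "bronze", "raw", "olist_orders_dataset.csv")
print(path)
# In a Databricks notebook:
# ingest_csv_to_bronze(spark, path, "gateway.bronze.orders")
```

Passing `spark` as a parameter keeps the helper importable outside Databricks; only the final call needs a live session.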
A Lakehouse model with flat tables by concept was adopted:
Bronze:
- Storage of files exactly as they arrived.
- No cleaning, no inference, no standardization.
- Auditability guarantee.
Silver:
- Type standardization
- Deduplication
- Handling of nulls
- Correction of derived columns
- Relationship between entities (logical joins)
Gold:
- Business-oriented analytical tables
- KPIs for chargebacks, GMV, average ticket
- Models by payment method, seller and region
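As an illustration of the GMV and average-ticket KPIs, here is a minimal plain-Python sketch. It assumes GMV is the sum of all payment values and average ticket is GMV divided by the number of distinct orders; these definitions, the function name and the sample rows are assumptions, and the project's own Gold tables may define the metrics differently.

```python
from collections import defaultdict


def gmv_and_average_ticket(payments):
    """Return (GMV, average ticket) from rows with order_id and payment_value."""
    total_by_order = defaultdict(float)
    for row in payments:
        # An order can be paid in several parts, so sum them per order.
        total_by_order[row["order_id"]] += row["payment_value"]
    gmv = sum(total_by_order.values())
    average_ticket = gmv / len(total_by_order) if total_by_order else 0.0
    return gmv, average_ticket


# Made-up sample rows, for illustration only.
payments = [
    {"order_id": "o1", "payment_value": 100.0},
    {"order_id": "o1", "payment_value": 50.0},
    {"order_id": "o2", "payment_value": 50.0},
]
gmv, ticket = gmv_and_average_ticket(payments)
print(gmv, ticket)  # 200.0 100.0
```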
A Data Catalog was created containing:
- Column name
- Data type
- Expected domain
- Minimum and maximum values (numerical)
- Possible categories (categorical)
- Functional description
- Source layer
- Bronze → Silver → Gold lineage
Complete documentation: /docs/data_catalog.md
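One way to make such catalog entries checkable in code is sketched below. The `CatalogEntry` class, its field names and the example domains are hypothetical and do not reproduce the contents of /docs/data_catalog.md.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class CatalogEntry:
    column: str
    dtype: str
    description: str
    layer: str                                         # source layer
    value_range: Optional[Tuple[float, float]] = None  # numeric min and max
    categories: Optional[FrozenSet[str]] = None        # allowed categories

    def accepts(self, value) -> bool:
        """Check a value against the expected domain of this column."""
        if value is None:
            return False
        if self.value_range is not None:
            low, high = self.value_range
            return low <= value <= high
        if self.categories is not None:
            return value in self.categories
        return True


# Illustrative entries; the domains are assumptions, not the real catalog.
payment_type = CatalogEntry(
    column="payment_type", dtype="string", layer="silver",
    description="Payment method used for the order",
    categories=frozenset({"credit_card", "boleto", "voucher", "debit_card"}),
)
payment_value = CatalogEntry(
    column="payment_value", dtype="double", layer="silver",
    description="Amount paid", value_range=(0.0, 20000.0),
)
print(payment_type.accepts("boleto"), payment_value.accepts(-1.0))  # True False
```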
The load was structured in three main steps:
Bronze:
- Reading CSVs directly from the UC Volume
- Persistence in Delta
Silver:
- Normalization of column names
- Conversion of datetime types
- Correction of categorical columns
- Standardization of numeric fields
- Removal of duplicates
- Consolidation of related tables
Gold:
- Aggregated tables
- Operational and risk metrics
- Joins between orders, payments and chargebacks
Complete ETL documentation: /docs/etl.md
Execution evidence: Screenshots available in /docs/images/databricks/
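The Silver-style steps listed above (column-name normalization, datetime conversion, duplicate removal) can be illustrated with a small plain-Python stand-in. It is a sketch of the logic only, not the PySpark code the notebooks run; the function names, the sample rows and the timestamp format are assumptions.

```python
import re
from datetime import datetime


def normalize_column(name: str) -> str:
    """Lower-case a column name and collapse non-alphanumerics into underscores."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()


def to_silver(rows, key, datetime_columns, fmt="%Y-%m-%d %H:%M:%S"):
    """Normalize column names, parse datetimes and drop duplicated keys."""
    seen = set()
    cleaned = []
    for row in rows:
        record = {normalize_column(k): v for k, v in row.items()}
        if record[key] in seen:
            continue  # keep only the first occurrence of each key
        seen.add(record[key])
        for col in datetime_columns:
            raw = record.get(col)
            # Empty or missing timestamps become None instead of failing.
            record[col] = datetime.strptime(raw, fmt) if raw else None
        cleaned.append(record)
    return cleaned


# Made-up sample rows, for illustration only.
rows = [
    {"Order ID": "o1", "Order Purchase Timestamp": "2017-10-02 10:56:33"},
    {"Order ID": "o1", "Order Purchase Timestamp": "2017-10-02 10:56:33"},
    {"Order ID": "o2", "Order Purchase Timestamp": ""},
]
silver = to_silver(rows, key="order_id",
                   datetime_columns=["order_purchase_timestamp"])
print(len(silver))  # 2
```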
An analysis was performed of:
- missing values
- out-of-domain values
- inconsistencies between tables
- duplicated data
- format errors
Corrections were applied in the Silver layer, ensuring:
- Consistent and reliable data
- Correct data types
- Values within expected domains
- Integrity of relationships between tables
Evidence: Screenshots available in /docs/images/databricks/
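The kinds of checks listed above (missing values, out-of-domain values, duplicated data) could be expressed along these lines. This is a hedged sketch in plain Python: `quality_report`, its parameters and the sample rows are hypothetical, and the project's actual checks run in Databricks.

```python
from collections import Counter


def quality_report(rows, key, required, domains):
    """Count missing values, out-of-domain values and duplicated keys."""
    missing = Counter()
    out_of_domain = Counter()
    key_counts = Counter(row.get(key) for row in rows)
    for row in rows:
        for col in required:
            if row.get(col) in (None, ""):
                missing[col] += 1
        for col, allowed in domains.items():
            value = row.get(col)
            # Missing values are counted above, not as domain violations.
            if value not in (None, "") and value not in allowed:
                out_of_domain[col] += 1
    duplicated = sorted(k for k, n in key_counts.items()
                        if k is not None and n > 1)
    return {
        "missing": dict(missing),
        "out_of_domain": dict(out_of_domain),
        "duplicated_keys": duplicated,
    }


# Made-up sample rows and domain, for illustration only.
rows = [
    {"order_id": "o1", "payment_type": "credit_card", "payment_value": 10.0},
    {"order_id": "o1", "payment_type": "credit_card", "payment_value": 10.0},
    {"order_id": "o2", "payment_type": "pix", "payment_value": None},
]
report = quality_report(
    rows,
    key="order_id",
    required=["payment_type", "payment_value"],
    domains={"payment_type": {"credit_card", "boleto", "voucher", "debit_card"}},
)
print(report)
```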
The Gold analyses answer questions such as:
- What is the most used payment method by 100cep Gateway customers?
- What is the revenue history for the year 2017?
- What is the proportion of orders with and without chargeback requests?
- Which payment methods have the highest chargeback risk?
- Which states present the highest chargeback rates?
Detailed answers are in:
/docs/business_questions.md
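Questions such as chargeback rate by state or by payment method reduce to a grouped ratio. The sketch below shows that calculation in plain Python; the `has_chargeback` flag, the function name and the sample orders are assumptions and carry no real results from the project.

```python
from collections import defaultdict


def chargeback_rate_by(orders, group_column):
    """Share of orders with a chargeback, per value of group_column."""
    totals = defaultdict(int)
    chargebacks = defaultdict(int)
    for order in orders:
        group = order[group_column]
        totals[group] += 1
        if order["has_chargeback"]:
            chargebacks[group] += 1
    return {group: chargebacks[group] / totals[group] for group in totals}


# Made-up sample orders, for illustration only.
orders = [
    {"customer_state": "SP", "payment_type": "credit_card", "has_chargeback": True},
    {"customer_state": "SP", "payment_type": "boleto", "has_chargeback": False},
    {"customer_state": "RJ", "payment_type": "credit_card", "has_chargeback": False},
    {"customer_state": "RJ", "payment_type": "credit_card", "has_chargeback": False},
]
by_state = chargeback_rate_by(orders, "customer_state")
by_method = chargeback_rate_by(orders, "payment_type")
print(by_state)  # {'SP': 0.5, 'RJ': 0.0}
```

The same function answers both the per-state and the per-payment-method question by changing only `group_column`.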
- Credit card is the predominant payment method
- Chargeback rate varies significantly by state
- Correlation between payment method and chargeback risk
- Seasonal patterns in 2017 revenue
Final discussion about:
- goals achieved and not achieved;
- challenges faced;
- natural limitations of the MVP;
- improvements and next steps (streaming, automation, dashboards, monitoring).
Complete documentation: /docs/self_assessment.md
Felipe Pinheiro
Dataset: Brazilian E-Commerce Public Dataset by Olist
Author: Olist & André Sionek
DOI Citation: DOI
License: CC BY-NC-SA 4.0
