This repository contains the ETL (Extract, Transform, Load) process for analyzing data from the "Dunder Mifflin Paper Company" sales.xlsx file using Power BI. The Excel file has been pre-processed to include a "COGS" (Cost of Goods Sold) column and updated branch names from the series "The Office".
Before starting this repository, I had already been working in Power BI and had reached the point shown in the following image:
- The first thing I wanted to know was whether sales were correlated with the company's profits. The scatterplot shows that this is not the case.
- We might expect that the more we sell, the more profit we make; however, the graph also shows losses.
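The same check can be sketched outside Power BI. The snippet below is a minimal illustration, assuming a fact table with `Sales` and `Profit` columns; the sample values are invented for the example and are not taken from sales.xlsx. It computes the Pearson correlation between the two columns and counts the rows that are losses.

```python
import pandas as pd

# Hypothetical sample rows; the real values live in the FACT_Sales sheet.
fact_sales = pd.DataFrame(
    {
        "Sales": [100.0, 250.0, 400.0, 800.0, 1200.0],
        "Profit": [20.0, -30.0, 60.0, -150.0, 40.0],
    }
)


def sales_profit_summary(df: pd.DataFrame) -> dict:
    """Return the Sales/Profit Pearson correlation and the number of loss rows."""
    correlation = df["Sales"].corr(df["Profit"])  # Pearson is the default method
    loss_rows = int((df["Profit"] < 0).sum())  # rows where profit is negative
    return {"correlation": correlation, "loss_rows": loss_rows}


summary = sales_profit_summary(fact_sales)
print(summary)
```

A correlation close to zero together with a non-zero loss count would match what the scatterplot suggests.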
As data professionals, we should not assume anything or take our conclusions for granted. Next, I will show you the components of my project:
The goal is to demonstrate how Python and Power BI work together across the components of a Business Intelligence (BI) flow. This includes:
- Performing Exploratory Data Analysis (EDA) using Python to clean data and remove outliers in the Fact table.
- Following the EDA, employing ETL (Extract, Transform, Load) processes to prepare the data for advanced analysis and interactive visualizations in Power BI.
- Applying DAX (Data Analysis Expressions) in Microsoft Power BI to create business metrics and KPIs, enabling data-driven decision-making.
Here is a short extract from the Jupyter Notebook, using Python and its libraries (described in the Technology Stack section below).
Remember when I first detected the outliers in the Power BI visualization? This is the part where I handle them for each customer segment.
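As a rough illustration of per-segment outlier handling, the sketch below applies the common 1.5 x IQR rule within each segment. The notebook's actual method is not shown here, so the IQR rule, the column names `Segment` and `Sales`, and the sample values are all assumptions made for the example.

```python
import pandas as pd


def remove_outliers_by_segment(df, value_col="Sales", segment_col="Segment", k=1.5):
    """Drop rows whose value falls outside the IQR fences of their own segment."""
    grouped = df.groupby(segment_col)[value_col]
    # Broadcast each segment's quartiles back onto the original rows.
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    within_fences = df[value_col].between(q1 - k * iqr, q3 + k * iqr)
    return df[within_fences].reset_index(drop=True)


# Hypothetical data: each segment has one obviously extreme value.
fact_sales = pd.DataFrame(
    {
        "Segment": ["A"] * 5 + ["B"] * 5,
        "Sales": [10, 12, 11, 13, 500, 1000, 1100, 1050, 1020, 5],
    }
)

cleaned = remove_outliers_by_segment(fact_sales)
print(cleaned)
```

Computing the fences per segment matters here: a value of 1000 is ordinary in segment B but would be flagged if all rows were pooled with segment A.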
Data extraction, source consolidation, cleaning, and transformation.
- Open Power BI Desktop.
- Select "Get Data" and choose "Excel".
- Locate and select "sales.xlsx".
- Choose the table format sheets (for this case, "FACT_Sales", "DIM_SKU" and "DIM State_Branches").
Next, we will see that the dimension table "State_Branches" has no relationship to the fact table "Sales".
- Rename columns for State_Branches by using first rows as headers:
- Check in the Model View whether the relationship to Sales was created:
- In the Table View, inspect the fact table "Sales" and determine which columns we will create measures from:
- Apply changes and close Power Query Editor.
- Load transformed data into the Power BI data model.
- Establish relationships between tables if multiple tables are used.
- Create Measures for the columns previously marked in the red rectangle:
  - Quantity.
  - Sales.
  - COGS.
  - Profit.
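For readers who prefer to prototype these steps in the notebook first, here is a minimal pandas sketch of two of them: promoting the first row to headers and checking whether the fact table can relate to the dimension table. The key column `State`, the generic names `Column1`/`Column2`, and the sample rows are assumptions for illustration; Power Query performs the actual steps in this project.

```python
import pandas as pd


def promote_first_row_to_headers(df):
    """Use the first data row as column names, similar to 'Use First Row as Headers'."""
    promoted = df.iloc[1:].reset_index(drop=True)
    promoted.columns = df.iloc[0].astype(str).tolist()
    return promoted


def unmatched_keys(fact, dim, key):
    """Return fact-table key values that have no match in the dimension table."""
    dim_keys = dim[[key]].drop_duplicates()
    merged = fact.merge(dim_keys, on=key, how="left", indicator=True)
    return sorted(merged.loc[merged["_merge"] == "left_only", key].unique())


# Hypothetical data: the dimension sheet arrives with generic column names.
raw_branches = pd.DataFrame(
    [["State", "Branch"], ["Pennsylvania", "Scranton"], ["New York", "Utica"]],
    columns=["Column1", "Column2"],
)
fact_sales = pd.DataFrame(
    {"State": ["Pennsylvania", "New York", "Ohio"], "Sales": [100, 200, 50]}
)

state_branches = promote_first_row_to_headers(raw_branches)
missing = unmatched_keys(fact_sales, state_branches, key="State")
print(state_branches.columns.tolist(), missing)
```

An empty `missing` list would suggest that every fact row can be related to a dimension row; any values listed are worth reviewing before building the relationship.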
When to run the ETL depends on the specific needs of your data analysis. Typically, it is needed when:
- Integrating new data sources.
- Updating existing data.
- Developing new analyses or reports.
- Modifying the data structure.
- Performing data maintenance or cleaning.
Relationships, indicators, optimization.
DAX is a formula language that allows users to create custom calculations and expressions in Power BI.
It is similar to Excel formulas but is specifically designed for use in Power BI and other Microsoft BI tools. Some of the expressions used in this project are:
- Sales Measures:
  - Total Sales = SUM ( FACT_Sales[Sales] )
  - Sales Ranking = RANKX ( ALL ( 'DIM state_branches'[Branch] ), [Total Sales] )
  - Cumulative Total Sales = CALCULATE ( [Total Sales], TOPN ( [Sales Ranking], ALL ( 'DIM state_branches'[Branch] ), [Total Sales] ) )
  - % Sales Performance = [Cumulative Total Sales] / CALCULATE ( [Total Sales], ALL ( 'DIM state_branches'[Branch] ) )
- Time Intelligence:
  - DIM_Calendar = CALENDARAUTO()
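To make the logic of the sales measures easier to follow, the sketch below reproduces the same Pareto-style calculation in pandas. It is an approximation and not the report's implementation: it assumes the fact table already carries a `Branch` column (in the model, Branch lives in the dimension table), that no two branches tie on total sales, and it uses invented sample values.

```python
import pandas as pd


def pareto_by_branch(fact, branch_col="Branch", sales_col="Sales"):
    """Total, rank, cumulative total, and cumulative share of sales per branch."""
    totals = fact.groupby(branch_col, as_index=False)[sales_col].sum()
    totals = totals.rename(columns={sales_col: "Total Sales"})
    # Highest-selling branch first, mirroring the descending ranking.
    totals = totals.sort_values("Total Sales", ascending=False).reset_index(drop=True)
    totals["Sales Ranking"] = (
        totals["Total Sales"].rank(method="min", ascending=False).astype(int)
    )
    # Running total over the top-N branches, N being each branch's own rank.
    totals["Cumulative Total Sales"] = totals["Total Sales"].cumsum()
    totals["% Sales Performance"] = (
        totals["Cumulative Total Sales"] / totals["Total Sales"].sum()
    )
    return totals


# Hypothetical sample rows.
fact_sales = pd.DataFrame(
    {
        "Branch": ["Scranton", "Stamford", "Scranton", "Nashua"],
        "Sales": [300.0, 300.0, 200.0, 200.0],
    }
)

pareto = pareto_by_branch(fact_sales)
print(pareto)
```

Reading down the `% Sales Performance` column shows how many branches are needed to reach a given share of total sales, such as 80%.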
Data visualization, reports, dashboards, storytelling.
To determine which reports I wanted to visualize and how to develop their storytelling, I referred to the section in Eric Ries' book "The Lean Startup" that discusses the Toyota method of the 5 Whys, which helps to identify the root cause of a detected problem. In this case, I wanted to understand why "higher sales did not translate into higher profits".
- Why are sales and profit not correlated? Because more sales do not necessarily mean more profit.
- Why do more sales not necessarily mean more profit? Because some sales result in losses.
- Why do some sales result in losses? Because their costs are higher than the revenue generated from those sales.
- Why are their costs higher than the revenue? Because in some branches, operational costs are higher.
- Why are their operational costs higher? Maybe it has to do with some products that are complex to sell.
Based on the insights derived from the visualized reports and the storytelling process, the following actions are suggested:
- Analyze Cost Structure:
  - Conduct a detailed analysis of the cost structure to understand where cost optimizations can be made.
  - Identify areas where expenses are disproportionately high, despite higher sales.
- Improve Product Mix:
  - Temporarily withdraw products with excessively high costs.
  - Focus on promoting and selling profitable products.
  - Implement this strategy by prioritizing the branches that generate 80% of the company's revenue.
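One possible way to shortlist products for the "Improve Product Mix" step is sketched below. It assumes that profit is computed as Sales minus COGS and that a product is a withdrawal candidate when its aggregated COGS exceeds its aggregated Sales; the column name `SKU`, that threshold, and the sample rows are assumptions for illustration, not rules taken from the report.

```python
import pandas as pd


def flag_unprofitable_skus(fact, sku_col="SKU", sales_col="Sales", cogs_col="COGS"):
    """Aggregate sales and COGS per SKU and flag SKUs that lose money overall."""
    by_sku = fact.groupby(sku_col, as_index=False)[[sales_col, cogs_col]].sum()
    by_sku["Profit"] = by_sku[sales_col] - by_sku[cogs_col]  # assumed definition
    by_sku["Cost Ratio"] = by_sku[cogs_col] / by_sku[sales_col]
    by_sku["Withdraw Candidate"] = by_sku["Profit"] < 0
    # Most cost-heavy products first.
    return by_sku.sort_values("Cost Ratio", ascending=False).reset_index(drop=True)


# Hypothetical sample rows.
fact_sales = pd.DataFrame(
    {
        "SKU": ["A4 Ream", "Cardstock", "A4 Ream"],
        "Sales": [100.0, 80.0, 150.0],
        "COGS": [60.0, 120.0, 90.0],
    }
)

sku_report = flag_unprofitable_skus(fact_sales)
candidates = sku_report.loc[sku_report["Withdraw Candidate"], "SKU"].tolist()
print(sku_report)
print(candidates)
```

Filtering `fact_sales` to the priority branches before calling the function would restrict the shortlist to the branches that drive most of the revenue.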
By executing these data-driven actions, we can systematically address the root causes of the problem and drive the organization toward sustainable profitability.
- matplotlib.pyplot: For data visualization.
- numpy: For mathematical operations and array manipulation.
- pandas: For data manipulation and analysis.
- seaborn: For statistical data visualization.
- Microsoft Power BI: For interactive data visualization and dashboard creation.
- DAX Formatter: For formatting Data Analysis Expressions (DAX) queries.
To see the report in action, click here.
Note: The demo is also included in the report folder.
project
├── architecture
│   └── Dunder Mifflin data pipeline.png
├── data
│   ├── raw
│   │   ├── DIM sku.xlsx
│   │   ├── DIM state_branches.xlsx
│   │   └── FACT sales.xlsx
│   ├── processed
│   │   └── FACT sales.xlsx
│   └── ready
│       └── sales.xlsx
├── notebooks
│   └── main.ipynb
├── report
│   └── Dunder Mifflin Sales Report.pbix
└── README.md
For further information, reach me at andres.buelvas.diago.01@gmail.com














