This project is of hackathon participation that requires analyzing e-commerce sales data through data cleaning in Excel and SQL, performing SQL queries to extract data insights, conducting further analysis using Python, and visualizing the results with Power BI.
The dataset contains sales transactions from an e-commerce platform. It includes customer demographics, order details, pricing, shipping status, and order dates.
- Analyze total sales by region.
- Identify the product category generating the highest revenue.
- Calculate the average shipping fee by region.
- Examine how customer age impacts purchasing behavior.
- Determine the most popular product by gender.
- Assess the order fulfillment rate (delivered vs. returned).
- Identified missing values using Excel’s
COUNTBLANK()function. - Filled missing values for
AgeandRegionusing:where=IF(A2="", "Unknown", A2)A2represents the column with missing values.
- Converted all date values to
YYYY-MM-DDusing Excel’sTEXT()function:=TEXT(A2, "YYYY-MM-DD") - Standardized product categories and pricing formats.
After cleaning the dataset, SQL queries were used to extract insights.
ALTER TABLE SALES_CUSTOMER_INSIGHT
CHANGE `Product Name` Product_Name VARCHAR(255),
CHANGE `Unit Price` Unit_Price DECIMAL(10,2),
CHANGE `Total Price` Total_Price DECIMAL(10,2),
CHANGE `Shipping status` Shipping_Status VARCHAR(50);SELECT region, SUM(total_price) AS total_sales
FROM SALES_CUSTOMER_INSIGHT
GROUP BY region
ORDER BY total_sales DESC;
SELECT Category, SUM(Total_Price) AS Total_Revenue
FROM SALES_CUSTOMER_INSIGHT
GROUP BY Category
ORDER BY Total_Revenue DESC
LIMIT 1;
SELECT Region, AVG(Shipping_Fee) AS Avg_Shipping_Fee
FROM SALES_CUSTOMER_INSIGHT
GROUP BY Region
ORDER BY Avg_Shipping_Fee DESC;
SELECT Age, COUNT(*) AS Total_Purchases, SUM(Total_Price) AS Total_Spent
FROM SALES_CUSTOMER_INSIGHT
GROUP BY Age
ORDER BY Total_Spent DESC;
SELECT
CASE
WHEN Age BETWEEN 18 AND 25 THEN '18-25'
WHEN Age BETWEEN 26 AND 35 THEN '26-35'
WHEN Age BETWEEN 36 AND 50 THEN '36-50'
ELSE '50+'
END AS Age_Group,
COUNT(*) AS Total_Purchases,
SUM(Total_Price) AS Total_Spent
FROM SALES_CUSTOMER_INSIGHT
GROUP BY Age_Group
ORDER BY Total_Spent DESC;
SELECT Gender, Product_Name, COUNT(*) AS Purchase_Count
FROM SALES_CUSTOMER_INSIGHT
GROUP BY Gender, Product_Name
ORDER BY Gender, Purchase_Count DESC;
Alternative with Ranking:
SELECT Gender, Product_Name, Purchase_Count
FROM (
SELECT Gender, Product_Name, COUNT(*) AS Purchase_Count,
RANK() OVER (PARTITION BY Gender ORDER BY COUNT(*) DESC) AS ranking
FROM SALES_CUSTOMER_INSIGHT
GROUP BY Gender, Product_Name
) ranked
WHERE ranking = 1;
SELECT Shipping_Status, COUNT(*) AS Order_Count,
COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () AS Percentage
FROM SALES_CUSTOMER_INSIGHT
GROUP BY Shipping_Status;
SELECT
(COUNT(CASE WHEN Shipping_Status = 'Delivered' THEN 1 END) * 100.0 / COUNT(*)) AS Fulfillment_Rate,
(COUNT(CASE WHEN Shipping_Status = 'Returned' THEN 1 END) * 100.0 / COUNT(*)) AS Return_Rate
FROM SALES_CUSTOMER_INSIGHT;
ALTER TABLE SALES_CUSTOMER_INSIGHT
CHANGE `Customer ID` Customer_id VARCHAR(255);
SELECT *
FROM SALES_CUSTOMER_INSIGHT
ORDER BY CAST(SUBSTRING(Customer_ID, 6) AS UNSIGNED) ASC;
Analysis of a 1000-record sales dataset to identify key trends and patterns in customer purchasing behavior.
# Install required packages
pip install pandas numpy matplotlib seaborn# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load dataset
df = pd.read_csv("sales_cleaneddata_from_msql.csv")
# Check data basics
df.head()
df.isnull().sum() # No missing values
df.duplicated().sum() # No duplicates- 1000 sales records with customer info, product details, and shipping data
- Three product categories: Electronics (47.7%), Accessories (40.1%), Wearables (12.2%)
- Customer demographics: 52.4% Male, 47.6% Female; Ages 18-69 (avg: 46.9)
- Shipping statuses: In Transit (32.9%), Delivered (31.3%), Returned (30.8%)
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# df is my DataFrame
# Aggregate total sales per category
category_sales = df.groupby('Category')['Total_Price'].sum().sort_values(ascending=False)
# Plot bar chart
plt.figure(figsize=(10, 5))
sns.barplot(x=category_sales.index, y=category_sales.values, hue=category_sales.index, legend=False, palette='viridis') # Corrected line
plt.xticks(rotation=45)
plt.xlabel("Product Category")
plt.ylabel("Total Sales")
plt.title("Total Sales by Product Category")
plt.show()
# Most profitable products
df.groupby("Product_Name")["Total_Price"].sum().sort_values(ascending=False).head(5)
# Scatter plot
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x='Unit_Price', y='Quantity', alpha=0.7)
plt.xlabel("Unit Price")
plt.ylabel("Quantity Sold")
plt.title("Unit Price vs Quantity Sold")
plt.show()
import matplotlib.pyplot as plt
import seaborn as sns
# df is my DataFrame
shipping_counts = df['Shipping_Status'].value_counts()
plt.figure(figsize=(8, 5))
sns.barplot(x=shipping_counts.index, y=shipping_counts.values, hue=shipping_counts.index, legend=False, palette='coolwarm') # Corrected line
plt.xlabel("Shipping Status")
plt.ylabel("Number of Orders")
plt.title("Order Distribution by Shipping Status")
plt.show()
#Insight: Understand sales performance across different genders.
import seaborn as sns
import matplotlib.pyplot as plt
gender_sales = df.groupby('Gender')['Total_Price'].sum()
plt.figure(figsize=(6, 4))
sns.barplot(x=gender_sales.index, y=gender_sales.values, hue=gender_sales.index, legend=False, palette='pastel') # Corrected line
plt.xlabel("Gender")
plt.ylabel("Total Sales")
plt.title("Sales Distribution by Gender")
plt.show()
import seaborn as sns
import matplotlib.pyplot as plt
region_quantity = df.groupby('Region')['Quantity'].sum().sort_values(ascending=False)
plt.figure(figsize=(8, 5))
sns.barplot(x=region_quantity.index, y=region_quantity.values, hue=region_quantity.index, legend=False, palette='Set2') # Corrected line
plt.xlabel("Region")
plt.ylabel("Total Quantity Sold")
plt.title("Quantity Sold by Region")
plt.show()
import seaborn as sns
import matplotlib.pyplot as plt
category_avg_price = df.groupby('Category')['Unit_Price'].mean().sort_values(ascending=False)
plt.figure(figsize=(10, 5))
sns.barplot(x=category_avg_price.index, y=category_avg_price.values, hue=category_avg_price.index, legend=False, palette='plasma') # Corrected line
plt.xlabel("Category")
plt.ylabel("Average Unit Price")
plt.title("Average Unit Price by Category")
plt.show()
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 5))
plt.hist(df['Total_Price'], bins=20, color='skyblue', edgecolor='black')
plt.xlabel("Total Price")
plt.ylabel("Frequency")
plt.title("Total Price Distribution")
plt.show()
import pandas as pd
import matplotlib.pyplot as plt
# 'df' is my DataFrame with 'Order Date' and 'Total_Price' columns
# Making 'Order Date' to be in datetime format
df['Order Date'] = pd.to_datetime(df['Order Date'], errors='coerce')
# Set 'Order Date' as index
df.set_index('Order Date', inplace=True)
# Resample data monthly and plot sales trend
df.resample('M')['Total_Price'].sum().plot(title='Monthly Sales Trend', figsize=(10, 5), marker='o', color='b')
plt.ylabel("Total Sales")
plt.show()
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(6, 4))
sns.boxplot(y=df['Shipping_Fee'], color='lightgreen')
plt.ylabel("Shipping Fee")
plt.title("Shipping Fee Distribution")
plt.show()
import seaborn as sns
import matplotlib.pyplot as plt
#using seaborn to see the correlation
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x='Age', y='Total_Price', alpha=0.7)
plt.xlabel("Age")
plt.ylabel("Total Price")
plt.title("Correlation between Age and Total Price")
plt.show()
import seaborn as sns
import matplotlib.pyplot as plt
# Select only numeric columns
numeric_df = df.select_dtypes(include=['number'])
# Compute correlation matrix
corr_matrix = numeric_df.corr()
# Plot heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix")
plt.show()
import seaborn as sns
import matplotlib.pyplot as plt
# Boxplot for outlier detection
numerical_cols = ['Unit_Price', 'Quantity', 'Total_Price', 'Shipping_Fee']
plt.figure(figsize=(12, 6))
sns.boxplot(data=df[numerical_cols])
plt.title("Boxplot of Numerical Variables")
plt.show()
Using Power BI, we built an interactive dashboard that aggregates key metrics from our e-commerce sales data. The dashboard includes insights on total sales, revenue by product category, shipping status, and more. It is designed to be user-friendly and visually appealing, with various visualizations such as maps, bar charts, and pie charts. The dashboard has been published to Power BI Service, and a link to the live dashboard is provided below.
Dashboard Link: View Live Power BI Dashboard
Description:
This page provides an overall view of customer sales performance. It includes:
- Key Metrics: Total sales, total orders, and average order value.
- Visualizations: A combination of KPI cards and a bar chart displaying total sales by region.
- Insight: Quickly identifies regions with the highest sales, enabling targeted strategies.
Description:
This page focuses on customer demographics and purchasing behavior. It includes:
- Visualizations:
- Sales by age group using a clustered bar chart.
- Sales by gender using a pie chart.
- Insight: Highlights how different customer segments contribute to overall revenue, allowing for more targeted marketing efforts.
Description:
This page analyzes product-related metrics. It features:
- Visualizations:
- A bar chart displaying total sales by product category and region.
- A detailed view of top-performing products.
- Insight: Helps identify which product categories are driving revenue and how regional performance varies.
Description:
This page evaluates the shipping performance and operational efficiency. It includes:
- Visualizations:
- A line chart showing delivery date trends.
- A bar chart depicting delivery status by region.
- A pie chart representing shipping fees by region.
- Insight: Provides a comprehensive view of order fulfillment and shipping cost distribution, highlighting potential areas for logistics improvement.
From our comprehensive analysis of the e-commerce sales data, we identified key trends and insights:
- Sales Trends Over Time: January recorded the highest sales revenue, indicating strong demand at the beginning of the year.
- Top-Selling Product Category: Electronics emerged as the most sold product category, contributing significantly to overall revenue.
- Sales by Region: Certain regions generated higher sales, highlighting potential markets for expansion and targeted marketing.
- Customer Behavior:
- Younger age groups (26-35) demonstrated higher purchasing activity.
- Product preferences varied between genders, influencing sales trends.
- Shipping and Order Fulfillment:
- Most orders were successfully delivered.
- Shipping fees varied across regions, affecting purchase decisions.
Based on these insights, we propose the following strategies for business growth:
-
Leverage Seasonal Trends
- Prepare for high demand in January by optimizing inventory and marketing campaigns.
- Offer promotional discounts in off-peak months to boost sales.
-
Product Strategy Enhancement
- Focus on Electronics as a key product category and expand offerings based on customer demand.
- Optimize inventory levels to ensure availability of top-selling products.
-
Targeted Marketing and Customer Engagement
- Use customer age and gender preferences to personalize recommendations.
- Implement loyalty programs and targeted ads in high-performing regions.
-
Shipping and Logistics Optimization
- Improve delivery efficiency in regions with higher return rates.
- Adjust pricing strategies for shipping fees to enhance affordability and customer satisfaction.