The goal of this project is to segment the customers of an e-commerce platform using their online transaction history. The project consists of four Python scripts that extract (src/extract.py), preprocess (src/clean_data.py), and segment the customers using the preprocessed data (src/segment_data.py), plus a main.py that executes all of these steps (a minimal orchestration sketch follows the list below). It also includes four Jupyter notebooks that demonstrate the workflows for data exploration, data preprocessing, exploratory analysis, and customer segmentation:
- Data Exploration (SQL)
- Data Preprocessing (Python)
- Exploratory Data Analysis (Python - seaborn and matplotlib)
- Customer Segmentation and Analysis (Python - seaborn and matplotlib, yellowbrick, sklearn)
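As a rough illustration of how the three scripts are wired together, here is a minimal sketch of what main.py might look like. The module paths match the project layout, but the function names (`run_extract`, `run_clean`, `run_segment`) are assumptions for illustration, not the actual API.

```python
# Hypothetical sketch of how main.py chains the pipeline steps.
# The module paths come from the project layout; the function names
# are assumptions for illustration.
from src import extract, clean_data, segment_data

def main():
    raw = extract.run_extract()          # pull raw transactions from storage
    cleaned = clean_data.run_clean(raw)  # preprocess the raw records
    segment_data.run_segment(cleaned)    # assign customers to segments

if __name__ == "__main__":
    main()
```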
The dataset is the output of the ETL Pipeline project, in which the raw online transactions data was extracted from an Amazon Redshift warehouse, transformed, and loaded to S3 cloud object storage. It contains 399,841 transaction records from 1 December 2010 to 9 December 2011. The data has nine attributes:
- invoice
- stock_code
- description
- price
- quantity
- total_order_value
- invoice_date
- customer_id
- country
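For orientation, the snippet below loads the dataset and inspects its schema. The file name and location are assumptions for illustration, and total_order_value is presumably price multiplied by quantity.

```python
# A minimal sketch of loading the preprocessed dataset for inspection.
# The file path is an assumption; the columns match the attribute
# list above.
import pandas as pd

df = pd.read_csv("data/online_transactions_cleaned.csv",
                 parse_dates=["invoice_date"])

print(df.shape)   # expected: (399841, 9)
print(df.dtypes)

# total_order_value is presumably price * quantity per record
print(df[["price", "quantity", "total_order_value"]].head())
```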
To run the project, you will need:
- Python 3+
- A Python IDE or text editor, and a command-line interface
- Jupyter Notebook
- Copy the `.env.example` file to `.env` and fill out the environment variables.
- Install the requirements: `pip3 install -r requirements.txt`
- Run the main.py script:
  - On Windows: `python main.py`
  - On Mac: `python3 main.py`
- Alternatively, run the Jupyter notebooks in this order:
  - `1_data_preprocessing.ipynb`
  - `2_exploratory_data_analysis.ipynb`
  - `3_customer_segmentation_percentile_ranking.ipynb`
  - `4_customer_segmentation_kmeans.ipynb`
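To make the k-means step concrete, the sketch below clusters customers on RFM-style (recency, frequency, monetary) features with sklearn. The feature engineering here is illustrative and not necessarily the exact logic in src/segment_data.py or the notebooks; in the notebooks, the number of clusters is presumably chosen with an elbow analysis (yellowbrick), so k=4 below is a placeholder.

```python
# Illustrative k-means segmentation on RFM features.
# Assumptions: the input file path, and that recency/frequency/monetary
# are the clustering features.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_csv("data/online_transactions_cleaned.csv",
                 parse_dates=["invoice_date"])

# Build per-customer recency / frequency / monetary features
snapshot = df["invoice_date"].max()
rfm = df.groupby("customer_id").agg(
    recency=("invoice_date", lambda d: (snapshot - d.max()).days),
    frequency=("invoice", "nunique"),
    monetary=("total_order_value", "sum"),
)

# Scale features so no single dimension dominates the distance metric
X = StandardScaler().fit_transform(rfm)

# Fit k-means; k=4 is an arbitrary placeholder here
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
rfm["segment"] = kmeans.fit_predict(X)

# Inspect average RFM values per segment
print(rfm.groupby("segment").mean())
```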