karanbayns/Wine-Quality-Classification

Wine Quality Classification Analysis

Contributors

Aidan Hew

Karan Bains

SHUHANG LI

Project Summary

Wine Quality Classification is a reproducible project that classifies red wines by quality. It investigates whether physicochemical properties can reliably predict wine quality, through exploratory data analysis, model training, testing, and result visualization. The goal is to identify the key physicochemical features that affect wine quality and to build a model that performs well and generalizes to unseen data.

Usage

Setup

  1. Clone the repository to your local machine

Using the container image

  1. Run docker compose up -d to create the container. The output should look similar to the screenshot below.
[Screenshot: output of docker compose up -d]
  2. Run docker ps to check the status of the container you created.
[Screenshot: output of docker ps]
  3. Run docker logs wine-quality-classification-analysis-env-1 to view the container logs. Here wine-quality-classification-analysis-env-1 is the container name, which can be found in the docker ps output.
[Screenshot: output of docker logs]
  4. The output of step 3 includes URLs. Click the second URL to open the project in JupyterLab.
  5. You can now run the full analysis by following the instructions below.

Running the Full Analysis

The entire pipeline can be executed with one command:

make all

This will automatically:

  1. Download the raw data
  2. Process and split the data
  3. Perform EDA
  4. Train and evaluate the models
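
The four stages above can be pictured as an ordered list of script invocations. The sketch below is only an illustration, assuming that make all chains the four scripts described under Pipeline Details with the example paths shown there; the actual Makefile may use different targets, paths, or extra steps. The names pipeline_commands and run_pipeline are hypothetical helpers, not part of the repository.

```python
import subprocess
import sys

RAW_URL = (
    "https://raw.githubusercontent.com/prudhvinathreddymalla/"
    "Red-Wine-Dataset/refs/heads/master/winequality-red.csv"
)


def pipeline_commands(python=sys.executable):
    """Return the four stage commands in the order they must run (assumed order)."""
    return [
        [python, "src/read_csv.py", RAW_URL, "data/raw/raw_data.csv", "--delim", ";"],
        [python, "src/data_processing.py", "data/raw/raw_data.csv", "data/processed/"],
        [python, "src/eda.py", "data/processed/train_data.csv", "results/figures/"],
        [python, "src/analysis.py", "data/processed/train_data.csv",
         "data/processed/test_data.csv", "results/models/"],
    ]


def run_pipeline():
    """Run each stage in order; check=True stops at the first failing stage."""
    for cmd in pipeline_commands():
        subprocess.run(cmd, check=True)
```

Each stage reads what the previous one wrote, which is why the order matters: EDA and model training both depend on the train/test files produced by the processing step.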

Makefile Targets

Run EDA:

make eda

Train and Evaluate Models:

make analyze

Clean All Generated Files:

make clean

Pipeline Details

  1. Create the virtual environment. On macOS, run:
conda env create -f environment.yml
conda activate wine-quality

On other operating systems, run:

conda-lock install --name <environment_name>
conda activate <environment_name>

Run the following scripts in order:

Step 1: Download/Read Data (read_csv.py) Reads the raw data and saves it to a local file.

Arguments

  • path_read: URL or path to the input CSV.
  • path_save: File path (including filename) where the raw data should be saved.
  • --delim: (Optional) Delimiter of the input file (default: ,).

Example

python src/read_csv.py https://raw.githubusercontent.com/prudhvinathreddymalla/Red-Wine-Dataset/refs/heads/master/winequality-red.csv data/raw/raw_data.csv --delim ";"
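
To make the interface above concrete, here is a minimal sketch of the read-then-save logic using only the Python standard library. It is not the repository's implementation: the real read_csv.py probably relies on click and pandas (both listed under Dependencies). The function names read_rows and save_rows are hypothetical, and writing the output as a comma-delimited file is an assumption.

```python
import csv
import io
import urllib.request


def read_rows(path_read, delim=","):
    """Read a delimited file from a URL or a local path into a list of rows."""
    if path_read.startswith(("http://", "https://")):
        with urllib.request.urlopen(path_read) as resp:
            text = resp.read().decode("utf-8")
    else:
        with open(path_read, newline="", encoding="utf-8") as fh:
            text = fh.read()
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delim))


def save_rows(rows, path_save):
    """Write rows to path_save as comma-delimited CSV (assumed output format)."""
    with open(path_save, "w", newline="", encoding="utf-8") as fh:
        csv.writer(fh).writerows(rows)
```

With these helpers, the example command would correspond roughly to save_rows(read_rows(url, delim=";"), "data/raw/raw_data.csv").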

Step 2: Process Data (data_processing.py) Validates data schema, handles outliers/missing values, and splits data into train/test sets.

Arguments

  • path_read: Path to the raw input CSV.
  • path_save: Directory where train_data.csv and test_data.csv will be saved.

Example

python src/data_processing.py data/raw/raw_data.csv data/processed/
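
The cleaning and splitting idea can be sketched without any third-party library. This is an illustration under stated assumptions, not the script's actual logic: the real data_processing.py likely validates the schema with pandera and may treat outliers and missing values differently. The 80/20 split ratio, the seed, and the names drop_incomplete and train_test_split_rows are all hypothetical.

```python
import random


def drop_incomplete(rows):
    """Keep only rows (dicts) in which every field is non-empty."""
    return [r for r in rows if all(str(v).strip() != "" for v in r.values())]


def train_test_split_rows(rows, test_fraction=0.2, seed=123):
    """Shuffle a copy of rows with a fixed seed, then split off the test share."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled rows form the test set, the rest the train set.
    return shuffled[n_test:], shuffled[:n_test]
```

Fixing the seed keeps the split reproducible, so that EDA and model training always see the same training rows.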

Step 3: Exploratory Data Analysis (EDA) (eda.py) Generates summary statistics, correlation heatmaps, and distribution plots from the training data.

Arguments

  • path_read: Path to the training data CSV.
  • path_save: Directory where figures and tables will be saved.

Example

python src/eda.py data/processed/train_data.csv results/figures/
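
As an illustration of the numbers behind a correlation heatmap, the sketch below computes the Pearson correlation of each feature with the target using only the standard library. The actual eda.py presumably uses pandas and a plotting library; the function names here are hypothetical, and the target column name quality is an assumption about the CSV header.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlations_with_target(rows, target="quality"):
    """Correlate every non-target column with the target column."""
    ys = [float(r[target]) for r in rows]
    return {
        col: pearson([float(r[col]) for r in rows], ys)
        for col in rows[0]
        if col != target
    }
```

The rows argument is a list of dicts, such as the output of csv.DictReader over the training CSV.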

Step 4: Analysis (analysis.py) Trains Logistic Regression, Decision Tree, and Random Forest models. Outputs performance metrics and ROC curves.

Arguments

  • path_train: Path to the train data CSV.
  • path_test: Path to the test data CSV.
  • path_save: Directory where the model results (CSV and PNG) will be saved.

Example

python src/analysis.py data/processed/train_data.csv data/processed/test_data.csv results/models/

Important Note on Output Paths

  • For read_csv.py, the path_save argument must be a full file path (e.g., data/data.csv).
  • For data_processing.py, eda.py, and analysis.py, the path_save argument must be a directory (e.g., data/processed/), because the filenames are hardcoded within the scripts.
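
The difference between the two kinds of path_save can be illustrated with pathlib. This is a sketch of the behavior described above, not code from the repository: processed_outputs is a hypothetical helper, while the filenames train_data.csv and test_data.csv are the ones named for data_processing.py.

```python
from pathlib import Path


def processed_outputs(path_save):
    """Return the train/test file paths written inside the given directory."""
    out_dir = Path(path_save)
    return out_dir / "train_data.csv", out_dir / "test_data.csv"
```

Passing a file path where a directory is expected would nest the fixed filenames under it; for example, processed_outputs("data/processed/out.csv") points at data/processed/out.csv/train_data.csv, which is almost certainly not what was intended.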

Updating the container image

  1. Stop and remove the existing container with docker compose down.
  2. Pull the latest version of the images defined in docker-compose.yml with docker compose pull.
  3. Follow the steps in "Using the container image" above to start a container from the updated image.

Dependencies

  • click=8.3.1
  • pandas=2.2.2
  • scikit-learn=1.4.2
  • jupyter=1.1.1
  • python=3
  • numpy=1.26.4
  • altair=6.0.0
  • vl-convert-python=1.8.0
  • pandera=0.27.0
  • matplotlib=3.10.8
  • quarto=1.8.26
  • tabulate=0.9.0
  • pytest=9.0.2

License

The project is licensed under the MIT License and the CC BY-NC-ND 4.0 license. Details are in LICENSE.md.
