Wine Quality Classification Analysis

Contributors

Aidan Hew

Karan Bains

SHUHANG LI

Project Summary

Wine Quality Classification is a reproducible project for classifying red wines by quality. It investigates whether physicochemical properties can reliably predict wine quality, and covers exploratory data analysis, model training, testing, and result visualization. The goal is to identify the key physicochemical features affecting wine quality and to build a model that performs and generalizes well.

Usage

Setup

Clone the repository to your local machine

Using the container image

Run docker compose up -d to create and start the container.

Run docker ps to check the status of the container.

Run docker logs wine-quality-classification-analysis-env-1 to view the container logs (wine-quality-classification-analysis-env-1 is the container name, which can be found in the docker ps output).

The output of docker logs includes several URLs. Open the second URL to launch the project in JupyterLab.
Now you can run the full analysis by following the instructions below.
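
If you would rather not pick the URL out of the log output by eye, the lookup can be scripted. The following is a minimal sketch and not part of the repository: extract_urls and second_jupyter_url are hypothetical helper names, and the container name is the one used above.

```python
import re
import subprocess

# Matches http(s) URLs up to the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(log_text: str) -> list[str]:
    """Return every URL found in the log text, in order of appearance."""
    return URL_PATTERN.findall(log_text)


def second_jupyter_url(
    container: str = "wine-quality-classification-analysis-env-1",
) -> str:
    """Read the container logs and return the second URL they contain.

    Assumes docker is on PATH and the container is running.
    """
    result = subprocess.run(
        ["docker", "logs", container],
        capture_output=True,
        text=True,
        check=True,
    )
    # The server may log to stderr, so search both streams.
    urls = extract_urls(result.stdout + "\n" + result.stderr)
    if len(urls) < 2:
        raise RuntimeError("expected at least two URLs in the container logs")
    return urls[1]
```

With the container running, second_jupyter_url() should return the address to open in a browser.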

Running the Full Analysis

The entire pipeline can be executed with one command:

make all

This will automatically:

Download the raw data
Process and split the data
Perform EDA
Train and evaluate the models

Makefile Targets

Run EDA:

make eda

Train and Evaluate Models:

make analyze

Clean All Generated Files:

make clean

Pipeline Details

Create the virtual environment with the following commands (on macOS):

conda env create -f environment.yml
conda activate wine-quality

On other operating systems, use the following commands instead:

conda-lock install --name <environment_name>
conda activate <environment_name>

Run the following scripts in order:

Step 1: Download/Read Data (read_csv.py). Reads the raw data and saves it to a local file.

Arguments

path_read: URL or path to the input CSV.
path_save: File path (including filename) where the raw data should be saved.
--delim: (Optional) Delimiter of the input file (default: ,).

Example

python src/read_csv.py https://raw.githubusercontent.com/prudhvinathreddymalla/Red-Wine-Dataset/refs/heads/master/winequality-red.csv data/raw/raw_data.csv --delim ";"
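
To illustrate why the --delim option matters here: the source file is semicolon-separated, so it has to be parsed with that delimiter before it can be treated as an ordinary CSV. The sketch below uses only the standard library csv module as a stand-in and is not necessarily how read_csv.py is implemented; convert_delimiter is a hypothetical name, and writing the output comma-separated is an assumption.

```python
import csv
import io


def convert_delimiter(text: str, delim: str = ",") -> str:
    """Parse delimited text and re-serialise it as comma-separated CSV."""
    reader = csv.reader(io.StringIO(text), delimiter=delim)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


# Toy rows in the semicolon-separated layout (values are illustrative).
sample = '"fixed acidity";"pH";"quality"\n7.4;3.51;5\n'
print(convert_delimiter(sample, delim=";"))
```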

Step 2: Process Data (data_processing.py). Validates the data schema, handles outliers and missing values, and splits the data into train and test sets.

Arguments

path_read: Path to the raw input CSV.
path_save: Directory where train_data.csv and test_data.csv will be saved.

Example

python src/data_processing.py data/raw/raw_data.csv data/processed/
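
The train/test split in this step can be pictured as a seeded hold-out split. This is a plain-Python sketch for illustration only: the 80/20 ratio, the seed, and the name split_rows are assumptions, and data_processing.py may well use a library routine instead.

```python
import random


def split_rows(rows: list, test_fraction: float = 0.2, seed: int = 123):
    """Shuffle a copy of the rows with a fixed seed and split into (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                  # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_rows(list(range(10)))
print(len(train), len(test))
```

Fixing the seed is what makes the split repeatable: the same input always lands in the same train and test sets.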

Step 3: Exploratory Data Analysis (EDA) (eda.py). Generates summary statistics, correlation heatmaps, and distribution plots from the training data.

Arguments

path_read: Path to the training data CSV.
path_save: Directory where figures and tables will be saved.

Example

python src/eda.py data/processed/train_data.csv results/figures/
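
A correlation heatmap is a colour-coded grid of pairwise correlations between columns. As a rough sketch of the quantity being plotted (assuming Pearson correlation, and not the code in eda.py), the snippet below computes it by hand. The names pearson and correlation_table are hypothetical, and the numbers are toy values rather than rows from the dataset.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_table(columns: dict[str, list[float]]) -> dict[tuple[str, str], float]:
    """Correlation for every ordered pair of columns (the cells of a heatmap)."""
    return {(a, b): pearson(columns[a], columns[b]) for a in columns for b in columns}


# Toy values for illustration only.
toy = {"alcohol": [9.4, 9.8, 10.5, 11.2], "quality": [5.0, 5.0, 6.0, 7.0]}
table = correlation_table(toy)
print(round(table[("alcohol", "quality")], 3))
```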

Step 4: Analysis (analysis.py). Trains Logistic Regression, Decision Tree, and Random Forest models, and outputs performance metrics and ROC curves.

Arguments

path_train: Path to the train data CSV.
path_test: Path to the test data CSV.
path_save: Directory where the model results (CSV and PNG) will be saved.

Example

python src/analysis.py data/processed/train_data.csv data/processed/test_data.csv results/models/
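
For readers unfamiliar with ROC curves, the area under the curve (AUC) equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition. It is illustrative only: roc_auc is a hypothetical name, the labels and scores are made up, and analysis.py presumably relies on a library implementation.

```python
def roc_auc(labels: list[int], scores: list[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = positive class) and predicted scores.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why it is a convenient single number for comparing the three models.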

Important Note on Output Paths

For read_csv.py, the path_save argument must be a full file path (e.g., data/data.csv).
For data_processing.py, eda.py, and analysis.py, the path_save argument must be a directory (e.g., data/processed/), because the filenames are hardcoded within the scripts.
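
The path rules above and the step order can be captured in a short driver. This is a sketch under stated assumptions, not a file in the repository: build_commands and check_save_path are hypothetical names, the paths are the ones from the examples above, and treating any path that ends in .csv as a file path is a simplification.

```python
import shlex
from pathlib import PurePosixPath

RAW_URL = (
    "https://raw.githubusercontent.com/prudhvinathreddymalla/"
    "Red-Wine-Dataset/refs/heads/master/winequality-red.csv"
)


def check_save_path(script: str, path_save: str) -> None:
    """Enforce the convention: read_csv.py saves to a file, the others to a directory."""
    looks_like_file = PurePosixPath(path_save).suffix == ".csv"
    wants_file = script == "read_csv.py"
    if wants_file != looks_like_file:
        kind = "a file path" if wants_file else "a directory"
        raise ValueError(f"{script}: path_save must be {kind}, got {path_save!r}")


def build_commands() -> list[list[str]]:
    """Return the four pipeline commands in the order they must run."""
    steps = [
        ("read_csv.py", [RAW_URL, "data/raw/raw_data.csv", "--delim", ";"]),
        ("data_processing.py", ["data/raw/raw_data.csv", "data/processed/"]),
        ("eda.py", ["data/processed/train_data.csv", "results/figures/"]),
        ("analysis.py", ["data/processed/train_data.csv",
                         "data/processed/test_data.csv", "results/models/"]),
    ]
    commands = []
    for script, args in steps:
        # path_save is the last positional argument (before any --option).
        positional = args[: args.index("--delim")] if "--delim" in args else args
        check_save_path(script, positional[-1])
        commands.append(["python", f"src/{script}", *args])
    return commands


# Dry run: print the commands instead of executing them.
for command in build_commands():
    print(shlex.join(command))
```

To execute the steps rather than print them, each command list could be passed to subprocess.run(command, check=True), which stops the pipeline at the first failing script.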

Updating the container image

Stop and remove the existing container with docker compose down.
Pull the latest version of the images defined in docker-compose.yml with docker compose pull.
Follow the steps in "Using the container image" above to start a container from the updated image.

Dependencies

click=8.3.1
pandas=2.2.2
scikit-learn=1.4.2
jupyter=1.1.1
python=3
numpy=1.26.4
altair=6.0.0
vl-convert-python=1.8.0
pandera=0.27.0
matplotlib=3.10.8
quarto=1.8.26
tabulate=0.9.0
pytest=9.0.2

License

This project is licensed under the MIT License and the CC BY-NC-ND 4.0 license. See LICENSE.md for details.
