Aidan Hew
Karan Bains
SHUHANG LI
Wine Quality Classification is a reproducible project for classifying red wines by quality. It investigates whether physicochemical properties can reliably predict wine quality, and covers exploratory data analysis, model training, testing, and result visualization. The goal is to identify the key physicochemical features that affect wine quality and to build a model that performs well and generalizes.
- Clone the repository to your local machine
- Run `docker compose up -d` from the command line. This creates the container and starts it in the background.
- Run `docker ps` to see the status of the container you created.
- Run `docker logs wine-quality-classification-analysis-env-1` to view the container logs. Here `wine-quality-classification-analysis-env-1` is the name of the container, which can be found in the `docker ps` output.
- The output of the `docker logs` command includes URLs. Click the second URL to open the project in JupyterLab.
- Now you can run the full analysis by following the instructions below.
The entire pipeline can be executed with one command:
make all
This will automatically:
- Download the raw data
- Process and split the data
- Perform EDA
- Train and evaluate the models
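The order of these stages can be pictured as a plain sequence of script calls. The sketch below is only an assumption about that order: it reuses the example commands from the manual steps later in this README, and the helper names `pipeline_commands` and `run_pipeline` are hypothetical. The project's actual Makefile may define different targets, paths, or extra steps.

```python
import subprocess
import sys

# URL taken from the Step 1 example in this README.
RAW_URL = (
    "https://raw.githubusercontent.com/prudhvinathreddymalla/"
    "Red-Wine-Dataset/refs/heads/master/winequality-red.csv"
)


def pipeline_commands(python=sys.executable):
    """Return the four script calls in the order the stages are listed."""
    return [
        [python, "src/read_csv.py", RAW_URL, "data/raw/raw_data.csv", "--delim", ";"],
        [python, "src/data_processing.py", "data/raw/raw_data.csv", "data/processed/"],
        [python, "src/eda.py", "data/processed/train_data.csv", "results/figures/"],
        [python, "src/analysis.py", "data/processed/train_data.csv",
         "data/processed/test_data.csv", "results/models/"],
    ]


def run_pipeline(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in pipeline_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the run at the first failing stage.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

Calling `run_pipeline(dry_run=False)` from the repository root would execute the stages one after another, which is roughly what a Makefile dependency chain achieves, minus the skipping of up-to-date targets.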
Run EDA:
make eda
Train and Evaluate Models:
make analyze
Clean All Generated Files:
make clean
- Create the virtual environment with the following commands (if your laptop runs macOS):
conda env create -f environment.yml
conda activate wine-quality
If your laptop does not run macOS, use the following commands instead:
conda-lock install --name <environment_name>
conda activate <environment_name>
Run the following scripts in order:
Step 1: Download/Read Data (read_csv.py)
Reads the raw data and saves it to a local file.
Arguments
- `path_read`: URL or path to the input CSV.
- `path_save`: File path (including filename) where the raw data should be saved.
- `--delim`: (Optional) Delimiter of the input file (default: `,`).
Example
python src/read_csv.py https://raw.githubusercontent.com/prudhvinathreddymalla/Red-Wine-Dataset/refs/heads/master/winequality-red.csv data/raw/raw_data.csv --delim ";"
Step 2: Process Data (data_processing.py)
Validates data schema, handles outliers/missing values, and splits data into train/test sets.
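As a rough illustration of the split-and-save part of this step, here is a minimal sketch using pandas. It is not the project's `data_processing.py`: the function name `split_and_save`, the 80/20 split, and the seed are assumptions, and schema validation and outlier handling are left out because this README does not describe how they are done.

```python
from pathlib import Path

import pandas as pd


def split_and_save(path_read, path_save, test_frac=0.2, seed=123):
    """Drop incomplete rows, split into train/test, and write both CSVs."""
    df = pd.read_csv(path_read).dropna().reset_index(drop=True)

    # Sample the test rows, then keep every remaining row for training.
    test = df.sample(frac=test_frac, random_state=seed)
    train = df.drop(test.index)

    # path_save is a directory; the output filenames are fixed.
    out_dir = Path(path_save)
    out_dir.mkdir(parents=True, exist_ok=True)
    train.to_csv(out_dir / "train_data.csv", index=False)
    test.to_csv(out_dir / "test_data.csv", index=False)
    return train, test
```

Fixing the seed keeps the split reproducible across runs, which matters because the EDA and analysis steps both read the files written here.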
Arguments
- `path_read`: Path to the raw input CSV.
- `path_save`: Directory where `train_data.csv` and `test_data.csv` will be saved.
Example
python src/data_processing.py data/raw/raw_data.csv data/processed/
Step 3: Exploratory Data Analysis (EDA) (eda.py)
Generates summary statistics, correlation heatmaps, and distribution plots from the training data.
Arguments
- `path_read`: Path to the training data CSV.
- `path_save`: Directory where figures and tables will be saved.
Example
python src/eda.py data/processed/train_data.csv results/figures/
Step 4: Analysis (analysis.py)
Trains Logistic Regression, Decision Tree, and Random Forest models. Outputs performance metrics and ROC curves.
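To make the comparison concrete, the following sketch fits the three model families with scikit-learn and scores each by ROC AUC. It is not the project's `analysis.py`: the function name `compare_models`, the hyperparameters, and the binary target are assumptions, and the demo uses synthetic data rather than the wine dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def compare_models(X_train, y_train, X_test, y_test, seed=123):
    """Fit the three classifiers and return each one's test ROC AUC."""
    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "decision_tree": DecisionTreeClassifier(random_state=seed),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        # The positive-class probability is what the ROC curve is built from.
        proba = model.predict_proba(X_test)[:, 1]
        scores[name] = roc_auc_score(y_test, proba)
    return scores


if __name__ == "__main__":
    # Synthetic stand-in data (assumption); not the wine dataset.
    X, y = make_classification(n_samples=500, n_features=11, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    for name, auc in compare_models(X_tr, y_tr, X_te, y_te).items():
        print(f"{name}: ROC AUC = {auc:.3f}")
```

Scoring every model on the same held-out test set keeps the comparison fair, since none of the models sees those rows during fitting.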
Arguments
- `path_train`: Path to the train data CSV.
- `path_test`: Path to the test data CSV.
- `path_save`: Directory where the model results (CSV and PNG) will be saved.
Example
python src/analysis.py data/processed/train_data.csv data/processed/test_data.csv results/models/
Important Note on Output Paths!!
- For `read_csv.py`, the `path_save` argument must be a full file path (e.g., `data/data.csv`).
- For `data_processing.py`, `eda.py`, and `analysis.py`, the `path_save` argument must be a directory (e.g., `data/processed/`), because the filenames are hardcoded within the scripts.
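The difference between the two conventions can be sketched with `pathlib`. The helper `expected_outputs` below is hypothetical and only covers the two scripts whose output filenames this README states. The filenames written by `eda.py` and `analysis.py` are not listed here, so the sketch does not guess them.

```python
from pathlib import Path


def expected_outputs(script, path_save):
    """Return the file(s) a script is expected to write for path_save."""
    target = Path(path_save)
    if script == "read_csv.py":
        # path_save names the output file itself.
        return [target]
    if script == "data_processing.py":
        # path_save names a directory; the filenames are fixed by the script.
        return [target / "train_data.csv", target / "test_data.csv"]
    raise ValueError(f"no output convention sketched for {script}")


if __name__ == "__main__":
    print(expected_outputs("read_csv.py", "data/raw/raw_data.csv"))
    print(expected_outputs("data_processing.py", "data/processed/"))
```

Passing a directory to `read_csv.py`, or a filename to one of the other scripts, would therefore point the output at the wrong place.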
- Stop and remove the existing container with `docker compose down`.
- Pull the latest version of the images defined in `docker-compose.yml` with `docker compose pull`.
- Follow the steps in 'The way to use the container image' to start using the updated container image.
- click=8.3.1
- pandas=2.2.2
- scikit-learn=1.4.2
- jupyter=1.1.1
- python=3
- numpy=1.26.4
- altair=6.0.0
- vl-convert-python=1.8.0
- pandera=0.27.0
- matplotlib=3.10.8
- quarto=1.8.26
- tabulate=0.9.0
- pytest=9.0.2
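One way to confirm that an activated environment matches these pins is to compare them against the installed package metadata. The sketch below uses the standard library's `importlib.metadata`. The helper names `check_pins` and `mismatches` are hypothetical, and `python` and `quarto` are omitted on the assumption that they are not installed as Python distributions.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the dependency list above.
PINS = {
    "click": "8.3.1",
    "pandas": "2.2.2",
    "scikit-learn": "1.4.2",
    "jupyter": "1.1.1",
    "numpy": "1.26.4",
    "altair": "6.0.0",
    "vl-convert-python": "1.8.0",
    "pandera": "0.27.0",
    "matplotlib": "3.10.8",
    "tabulate": "0.9.0",
    "pytest": "9.0.2",
}


def check_pins(pins=PINS):
    """Map each package to (expected, installed); installed is None if absent."""
    report = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        report[name] = (expected, installed)
    return report


def mismatches(report):
    """Return the packages whose installed version differs from the pin."""
    return sorted(n for n, (exp, got) in report.items() if exp != got)


if __name__ == "__main__":
    for name in mismatches(check_pins()):
        print(f"version mismatch or missing: {name}")
```

An empty result from `mismatches(check_pins())` suggests the environment was built from the pinned specification.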
The project is licensed under the MIT License and the CC BY-NC-ND 4.0 license. Details are in LICENSE.md.