This project applies statistical analysis and data visualization techniques to a real-world white wine dataset using Python. The focus is on analyzing the alcohol content of wines by computing descriptive statistics, constructing frequency distributions, calculating confidence and tolerance intervals, and conducting hypothesis testing.
The dataset, sourced from Kaggle, contains ~5000 samples of white wine with the following attributes:
- fixed acidity
- volatile acidity
- citric acid
- residual sugar
- chlorides
- free sulfur dioxide
- total sulfur dioxide
- density
- pH
- sulphates
- alcohol
- quality
🔍 For this project, we focused only on the `alcohol` column to perform all statistical analyses and visualizations.
We aimed to:
- Calculate mean and variance of alcohol percentage (both manually and using Pandas)
- Create histograms and pie charts to visualize alcohol distribution
- Build a frequency distribution table using grouped bins
- Estimate a 95% confidence interval and a 95% tolerance interval for alcohol content
- Validate the tolerance interval using a 20% test split
- Perform a one-sample t-test to test if the mean alcohol percentage is significantly different from a given value
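The first objectives (mean, variance, and a grouped frequency table) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the data is a synthetic stand-in for the `alcohol` column, the bin edges are an assumption, and the manual variance uses the sample form (dividing by n - 1) so that it matches the pandas default.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset's `alcohol` column (values are made up).
rng = np.random.default_rng(0)
values = rng.normal(loc=10.5, scale=1.2, size=500).clip(8.0, 14.2)
alcohol = pd.Series(values, name="alcohol")

# Manual mean and sample variance (n - 1 in the denominator).
n = len(alcohol)
manual_mean = sum(alcohol) / n
manual_var = sum((x - manual_mean) ** 2 for x in alcohol) / (n - 1)

# pandas equivalents; Series.var() uses ddof=1 by default.
pandas_mean = alcohol.mean()
pandas_var = alcohol.var()

# Frequency distribution over grouped bins of width 1 (edges are an assumption).
bins = np.arange(8, 16)  # 8, 9, ..., 15
freq_table = (
    pd.cut(alcohol, bins=bins, right=False)  # intervals [8, 9), [9, 10), ...
    .value_counts(sort=False)                # keep bins in ascending order
    .rename("count")
    .to_frame()
)
freq_table["relative"] = freq_table["count"] / n

print(f"mean: manual={manual_mean:.4f}, pandas={pandas_mean:.4f}")
print(f"variance: manual={manual_var:.4f}, pandas={pandas_var:.4f}")
print(freq_table)
```

If the manual calculation divided by n instead (population variance), it would differ slightly from `Series.var()`; passing `ddof=0` to pandas would reconcile the two.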
- Python 3
- pandas – data manipulation
- numpy – numerical operations
- matplotlib – visualizations
- scipy.stats – statistical testing and interval calculations
| Step | Description |
| --- | --- |
| Data Cleaning | Removed missing values from dataset |
| Descriptive Stats | Computed mean & variance of alcohol |
| Visualization | Histogram of alcohol % and pie chart for ranges |
| Frequency Dist. | Created bins and counted samples in each |
| Confidence Interval | 95% CI of mean alcohol using t-distribution |
| Tolerance Interval | 95% range expected to cover future samples |
| Hypothesis Test | One-sample t-test to check if μ ≠ 10.5% |
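As an illustration of the confidence interval step, a 95% interval for the mean based on the t-distribution might be computed along these lines. The sample here is synthetic (not the Kaggle data), and the variable names are chosen only for this sketch.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the cleaned `alcohol` values (made-up data).
rng = np.random.default_rng(1)
alcohol = rng.normal(loc=10.5, scale=1.2, size=500)

n = alcohol.size
mean = alcohol.mean()
sem = alcohol.std(ddof=1) / np.sqrt(n)  # standard error of the mean

# Two-sided 95% interval: mean ± t_{0.975, n-1} * SEM.
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_low, ci_high = mean - t_crit * sem, mean + t_crit * sem

# Cross-check against SciPy's built-in interval helper.
ci_check = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)

print(f"95% CI for the mean: ({ci_low:.3f}, {ci_high:.3f})")
```

With several thousand samples the t critical value is very close to the normal value of about 1.96, so the interval is narrow and describes the mean, not the spread of individual wines.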
We tested the hypothesis:
- Null Hypothesis (H₀): Mean alcohol percentage = 10.5%
- Alternative Hypothesis (H₁): Mean alcohol percentage ≠ 10.5%
Based on the computed p-value and t-statistic, we determined whether to reject H₀.
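A minimal sketch of this test using `scipy.stats.ttest_1samp` is shown here. The data is synthetic, and the significance level of 0.05 is an assumption for illustration, since the threshold is not stated above.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the `alcohol` column (made-up data).
rng = np.random.default_rng(2)
alcohol = rng.normal(loc=10.5, scale=1.2, size=500)

mu0 = 10.5     # hypothesized mean under H0
alpha = 0.05   # assumed significance level

# Two-sided one-sample t-test of H0: mean == mu0.
t_stat, p_value = stats.ttest_1samp(alcohol, popmean=mu0)
reject_h0 = p_value < alpha

# The same statistic computed by hand, as a sanity check.
t_manual = (alcohol.mean() - mu0) / (alcohol.std(ddof=1) / np.sqrt(alcohol.size))

print(f"t = {t_stat:.3f}, p = {p_value:.4f}, reject H0: {reject_h0}")
```

The decision rule is to reject H₀ when the p-value falls below the chosen significance level; otherwise the data does not provide sufficient evidence that the mean differs from 10.5%.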
- Alcohol mean and variance were successfully calculated manually and with Python.
- Visualizations provided clear insight into alcohol distribution.
- Both 95% confidence and tolerance intervals were constructed.
- Hypothesis test showed whether alcohol % significantly differed from 10.5%.
- Over 90% of test data fell within the computed tolerance interval, validating its effectiveness.
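The tolerance interval check could look something like the sketch below. The method used in the project is not specified, so this assumes approximately normal data and uses Howe's approximation for a two-sided tolerance factor; the helper `tolerance_interval`, the synthetic data, and the 80/20 split by shuffling are all illustrative choices.

```python
import numpy as np
from scipy import stats

def tolerance_interval(sample, coverage=0.95, confidence=0.95):
    """Two-sided normal tolerance interval via Howe's approximation (hypothetical helper)."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    dof = n - 1
    z = stats.norm.ppf((1 + coverage) / 2)          # normal quantile for the coverage
    chi2_low = stats.chi2.ppf(1 - confidence, dof)  # lower chi-square quantile
    k = z * np.sqrt(dof * (1 + 1 / n) / chi2_low)   # tolerance factor
    mean, sd = sample.mean(), sample.std(ddof=1)
    return mean - k * sd, mean + k * sd

# Synthetic stand-in for the `alcohol` column (made-up data).
rng = np.random.default_rng(3)
alcohol = rng.normal(loc=10.5, scale=1.2, size=1000)

# Hold out 20% of the samples for validation.
shuffled = rng.permutation(alcohol)
split = int(0.8 * shuffled.size)
train, test = shuffled[:split], shuffled[split:]

low, high = tolerance_interval(train)
inside = np.mean((test >= low) & (test <= high))  # fraction of test data covered

print(f"tolerance interval: ({low:.3f}, {high:.3f}), test coverage: {inside:.1%}")
```

Because the interval must cover a proportion of individual observations, it is much wider than the confidence interval for the mean. If the real alcohol distribution is noticeably skewed, a nonparametric interval based on order statistics may be a better fit than this normal-theory version.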
- Wine Quality Dataset – Kaggle
- McKinney, W. (2010). Data Analysis with Python and Pandas. O'Reilly Media.