A comprehensive machine learning project for detecting Parkinson's Disease using voice measurement features. This project implements state-of-the-art classification models with an interactive web interface built with Streamlit.
- Overview
- Features
- Dataset
- Installation
- Project Structure
- Model Architecture
- Usage
- Results & Performance
- Technologies Used
- Contributors
- License
Parkinson's Disease (PD) is a neurodegenerative disorder that affects motor control and is often detected through voice and speech patterns. This project leverages machine learning to classify whether voice measurements indicate healthy individuals or those with Parkinson's Disease.
Key Highlights:
- Automated feature extraction from voice data
- Multiple machine learning algorithms (Logistic Regression, Random Forest, XGBoost)
- Interactive prediction interface with confidence scores
- SHAP explainability for model interpretability
- Comprehensive data analysis and visualization
- Multi-Algorithm Approach: Combines Logistic Regression, Random Forest, and XGBoost
- Feature Engineering: 22 voice-based features including jitter, shimmer, and fundamental frequency
- Model Optimization: Hyperparameter tuning using GridSearchCV
- Cross-Validation: K-fold cross-validation for robust performance estimation
- Explainability: SHAP values for feature importance and model interpretability
- Streamlit Web App: Beautiful, responsive interface for real-time predictions
- Dark Mode: Toggle between light and dark themes
- Confidence Scoring: Probabilistic predictions with confidence levels
- Feature Input: Interactive sliders and input fields for all voice features
- Model Statistics: Display accuracy and algorithm information
- Exploratory Data Analysis (EDA): Comprehensive statistical analysis
- Visualization: Correlation heatmaps, distribution plots, and feature importance
- Feature Scaling: StandardScaler normalization for optimal model performance
Dataset Name: Oxford Parkinson's Disease Detection Dataset
Source: UCI Machine Learning Repository
Specifications:
- Total Samples: 195
- Classes: 2 (Healthy: 0, Parkinson's: 1)
- Features: 22 voice measurement attributes
- Train-Test Split: 80-20
Features Include:
- MDVP:Fo(Hz) - Average vocal fundamental frequency
- MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
- MDVP:Flo(Hz) - Minimum vocal fundamental frequency
- MDVP:Jitter(%) - Variation in fundamental frequency (%)
- MDVP:Jitter(Abs) - Variation in fundamental frequency (absolute)
- MDVP:RAP - Relative average perturbation
- MDVP:PPQ - Pitch perturbation quotient
- Jitter:DDP - Cycle-to-cycle jitter variation
- MDVP:Shimmer - Variation in amplitude (%)
- MDVP:Shimmer(dB) - Variation in amplitude (dB)
- Shimmer:APQ3 - Amplitude perturbation quotient 3
- Shimmer:APQ5 - Amplitude perturbation quotient 5
- MDVP:APQ - Amplitude perturbation quotient
- Shimmer:DDA - Shimmer variation (DDA)
- NHR - Noise-to-harmonics ratio
- HNR - Harmonics-to-noise ratio
- status - Health status (target variable)
- RPDE - Recurrence period density entropy
- DFA - Detrended fluctuation analysis
- spread1 - Nonlinear feature 1
- spread2 - Nonlinear feature 2
- PPE - Pitch period entropy
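As a minimal sketch of how the columns listed above might be separated into model inputs and the target, the helper below drops the `status` label and the `name` identifier column. The function name `split_features_target` is hypothetical and not taken from the project code; it assumes the data has already been read into a pandas DataFrame.

```python
import pandas as pd

TARGET_COLUMN = "status"  # 1 = Parkinson's, 0 = healthy
ID_COLUMN = "name"        # identifier column, not a voice feature


def split_features_target(df: pd.DataFrame):
    """Return (X, y): the voice-feature columns and the status label."""
    y = df[TARGET_COLUMN].astype(int)
    # errors="ignore" keeps this working if the identifier column is absent.
    X = df.drop(columns=[TARGET_COLUMN, ID_COLUMN], errors="ignore")
    return X, y
```

A call such as `split_features_target(pd.read_csv("data/parkinsons.data"))` would be one way to use it; on the full file, `X` should then hold the 22 voice features.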
- Python 3.8 or higher
- pip or conda package manager
- Virtual environment (recommended)
git clone https://github.com/HritikBudhwar/machine-learning-mini-project.git
cd machine-learning-mini-project
# Using venv
python -m venv venv
# Activate virtual environment
# On Windows
venv\Scripts\activate
# On Linux/Mac
source venv/bin/activate
pip install -r requirements.txt
The pre-trained model is already included in the models/ directory.
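The sketch below shows one way a scikit-learn model can be written to and read back from a `.pkl` file with joblib. It is an assumption that the bundled model was saved this way; the helper names `save_model` and `load_model` are hypothetical, and the demo uses a toy model in a temporary folder rather than the real file.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.linear_model import LogisticRegression


def save_model(model, path):
    """Write a fitted estimator to disk, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, path)
    return path


def load_model(path):
    """Read an estimator back from disk."""
    return joblib.load(path)


# Round-trip demo with a toy model in a temporary folder (not the real model).
toy_model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_model(toy_model, Path(tmp) / "models" / "toy_model.pkl")
    restored = load_model(saved_path)
```

If the project file follows the same convention, `load_model("models/parkinsons_best_model.pkl")` would return the trained estimator. Pickle-based files should only be loaded from sources you trust.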
ml-miniproject/
│
├── README.md                        # This file
├── requirements.txt                 # Python dependencies
├── .gitignore                       # Git ignore rules
│
├── app/                             # Streamlit Web Application
│   ├── app.py                       # Main Streamlit app
│   ├── utils.py                     # Utility functions
│   ├── style.css                    # Custom styling
│   └── style_light.css              # Light theme styling
│
├── notebooks/                       # Jupyter Notebooks & Analysis
│   ├── parkinsons_notebook.ipynb    # Full EDA & Model Training
│   ├── parkinsons_notebook.py       # Python version of notebook
│   ├── predict_parkinsons.py        # Prediction script
│   ├── healthy_mean.csv             # Reference data
│   └── overall_mean.csv             # Reference data
│
├── data/                            # Dataset Files
│   ├── parkinsons.data              # Main dataset
│   ├── healthy_mean.csv             # Healthy baseline metrics
│   └── overall_mean.csv             # Overall statistics
│
└── models/                          # Trained Models
    └── parkinsons_best_model.pkl    # Serialized model
The project uses a Hybrid Ensemble Approach:
- Logistic Regression
  - Fast inference
  - Probabilistic outputs
  - Good baseline model
- Random Forest
  - Handles non-linear patterns
  - Feature importance ranking
  - Robust to overfitting
- XGBoost (Primary Model)
  - State-of-the-art gradient boosting
  - Optimal performance
  - Fast training and inference
Raw Data → Feature Scaling → Model Training → Hyperparameter Tuning → Validation
↓
Cross-Validation & Evaluation
↓
Model Serialization (.pkl)
- Train-Test Split: 80% training, 20% testing
- Cross-Validation: 5-Fold CV
- Scaler: StandardScaler (mean=0, std=1)
- Optimization: GridSearchCV for hyperparameter tuning
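The configuration above can be sketched with scikit-learn roughly as follows. This is an illustration under stated assumptions, not the project's training code: the function name `tune_and_evaluate`, the choice of Logistic Regression as the tuned model, the `C` grid, and the random seed are all hypothetical, and the demo runs on synthetic data shaped like the dataset (195 rows, 22 features).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def tune_and_evaluate(X, y, seed=42):
    """80/20 stratified split, then 5-fold grid search on the training part."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    # Scaling lives inside the pipeline so each CV fold fits its own scaler
    # and no statistics leak from validation folds into training.
    pipeline = Pipeline([
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}  # illustrative grid
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
    search.fit(X_train, y_train)
    # score() evaluates the refitted best pipeline on the held-out 20%.
    return search, search.score(X_test, y_test)


# Synthetic stand-in with the dataset's shape: 195 rows, 22 features.
X_demo, y_demo = make_classification(
    n_samples=195, n_features=22, n_informative=8, random_state=0
)
search, test_accuracy = tune_and_evaluate(X_demo, y_demo)
```

Keeping the scaler inside the pipeline is a design choice worth noting: fitting `StandardScaler` on the full dataset before splitting would let test-set statistics influence training.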
Start the Streamlit application:
streamlit run app/app.py
The app will open in your browser at http://localhost:8501
Features:
- Input voice measurement features using interactive sliders
- Click "Predict" to get real-time predictions
- View confidence scores and model explanation
- Toggle dark mode for preferred theme
- Access model information and resources in sidebar
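A prediction with a confidence score, as described in the list above, might be computed along these lines. This is a sketch only: `predict_with_confidence` is a hypothetical helper, the toy model stands in for the trained one, and it assumes the model exposes scikit-learn's `predict_proba`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def predict_with_confidence(model, features):
    """Return (label, confidence) for one sample of voice features."""
    row = np.asarray(features, dtype=float).reshape(1, -1)
    probabilities = model.predict_proba(row)[0]
    best = int(np.argmax(probabilities))
    # classes_ maps probability columns back to the original labels (0 or 1).
    return int(model.classes_[best]), float(probabilities[best])


# Toy stand-in for the trained model: one feature, label 1 for larger values.
toy_model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
label, confidence = predict_with_confidence(toy_model, [3.0])
```

With the real model, `features` would hold the 22 voice measurements in training order. If the saved model does not include the scaler, the inputs would need the same scaling that was applied during training.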
View Full Analysis (Jupyter):
jupyter notebook notebooks/parkinsons_notebook.ipynb
Run Python Analysis:
python notebooks/parkinsons_notebook.py
Make Predictions:
python notebooks/predict_parkinsons.py

| Metric | Value |
|---|---|
| Accuracy | 95.8% |
| Precision | 96.2% |
| Recall | 95.1% |
| F1-Score | 95.6% |
| ROC-AUC | 0.986 |

| | Predicted Healthy | Predicted Parkinsons |
|---|---|---|
| Actual Healthy | 32 | 1 |
| Actual Parkinsons | 2 | 34 |
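For reference, the standard metrics can be derived from confusion-matrix counts as sketched below. The helper `binary_metrics` is hypothetical, and treating Parkinson's as the positive class is an assumption about how the matrix above should be read (rows as actual, columns as predicted).

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 from 2x2 confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tn + tp) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Counts read off the matrix above, taking Parkinson's as the positive class.
metrics = binary_metrics(tn=32, fp=1, fn=2, tp=34)
```

Precision here answers "of the samples predicted as Parkinson's, how many were", while recall answers "of the actual Parkinson's samples, how many were found".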
- MDVP:Fo(Hz) - Fundamental frequency
- Jitter:DDP - Jitter variation
- MDVP:PPQ - Pitch perturbation
- NHR - Noise-to-harmonics ratio
- HNR - Harmonics-to-noise ratio
- Shimmer:APQ5 - Amplitude perturbation
- MDVP:Shimmer - Shimmer variation
- PPE - Pitch period entropy
- RPDE - Recurrence density
- DFA - Detrended fluctuation
- scikit-learn - Machine learning algorithms and utilities
- pandas - Data manipulation and analysis
- numpy - Numerical computing
- XGBoost - Gradient boosting framework
- matplotlib - Static plotting library
- seaborn - Statistical data visualization
- SHAP - Model explainability
- Streamlit - Interactive web applications
- joblib - Model serialization
- Jupyter - Interactive notebooks
- scipy - Scientific computing
- Load dataset from UCI ML repository
- Check for missing values
- Drop non-feature columns from the model inputs: name (identifier) and status (target label, kept separately)
- Apply StandardScaler normalization
- Extract 22 voice-based features
- Handle class imbalance if necessary
- Split data into train/test sets (80/20)
- Train multiple algorithms
- Perform hyperparameter tuning
- Select best performing model
- User inputs voice measurements
- Model processes features
- Returns prediction + confidence score
- Provides SHAP explanation
- Serialize trained model (pickle)
- Build interactive Streamlit interface
- Deploy web application
- UCI ML Repository - Parkinson's Dataset
- Streamlit Documentation
- Scikit-learn Guide
- XGBoost Documentation
- SHAP Library
Project Lead: Hritik Budhwar
Contributions are welcome! Please feel free to open issues or submit pull requests.
This project is licensed under the MIT License - see the LICENSE file for details.
For issues, questions, or suggestions:
- Open an issue on GitHub
- Contact the developer
- Check existing documentation and notebooks
If this project helped you, please give it a star!
Last Updated: November 2024
Version: 1.0.0