This project was conducted as part of the coursework for a Master's School program in Machine Learning/Data Science, focusing on model optimization.
The primary objective is to significantly improve the performance of an existing regression model (e.g., Random Forest or Decision Tree) by systematically adjusting its key internal settings, or hyperparameters, using advanced search techniques.
| 🏷️ Feature | Description |
|---|---|
| Dataset | Diabetes Dataset (from previous exercise) |
| Task | Hyperparameter Tuning for Regression |
| Method | Grid Search (GridSearchCV) or Random Search (RandomizedSearchCV) |
| Evaluation | Cross-Validation (CV) Score, $R^2$ |
- Select one model from the previous comparison exercise (e.g., Random Forest Regressor).
- Define a dictionary (`param_grid`) listing ranges for at least three relevant hyperparameters.
  - Example for Random Forest:
    - `n_estimators` (Number of trees)
    - `max_depth` (Maximum tree depth)
    - `min_samples_leaf` (Minimum samples required at a leaf node)
- Initialize the `GridSearchCV` or `RandomizedSearchCV` object.
  - Crucial Setting: Use Cross-Validation (`cv`) during the search (e.g., `cv=5`) to ensure the chosen parameters generalize well across different subsets of the training data.
- Fit the search object to the training data (`X_train, y_train`).
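
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes the "Diabetes Dataset" is scikit-learn's built-in `load_diabetes` data, and the split ratio, `random_state` values, and candidate values in `param_grid` are placeholders chosen for the example.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumption: the diabetes data is scikit-learn's built-in regression dataset.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Illustrative candidate values for three hyperparameters (not tuned choices).
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5],
}

# cv=5 scores every combination on five train/validation folds
# drawn from the training data only.
search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring="r2",
)
search.fit(X_train, y_train)

print(search.best_params_)
```

`RandomizedSearchCV` follows the same pattern, but takes `param_distributions` and an `n_iter` budget instead of trying every combination.
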
| Report Item | Description |
|---|---|
| Best Hyperparameters | The exact set of parameters found by the search that yielded the best average CV score. |
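| Best CV Score | The average performance score (e.g., $R^2$) across the CV folds for the best hyperparameters. |
| Tuned Model Test Score ($R^2$) | The final $R^2$ of the tuned model on the test set. |
| Untuned Model $R^2$ | The $R^2$ of the untuned model, used as the baseline for comparison. |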
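
One possible way to collect the report items is sketched below. It uses `RandomizedSearchCV` to show the second search method; the dataset loader, split, candidate values, and `n_iter=6` are assumptions made for illustration, and the `report` dictionary is a hypothetical container rather than part of any library.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Assumption: scikit-learn's built-in diabetes data and an 80/20 split.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Illustrative candidate values; the search samples n_iter combinations.
param_distributions = {
    "n_estimators": [50, 100, 150],
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": [1, 3, 5, 10],
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=6,
    cv=5,
    scoring="r2",
    random_state=0,
)
search.fit(X_train, y_train)

# Baseline: the same model class with default hyperparameters.
untuned = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# search.predict uses the best estimator, refit on the full training set.
report = {
    "best_params": search.best_params_,
    "best_cv_r2": search.best_score_,
    "tuned_test_r2": r2_score(y_test, search.predict(X_test)),
    "untuned_test_r2": r2_score(y_test, untuned.predict(X_test)),
}
for name, value in report.items():
    print(f"{name}: {value}")
```

The test set is touched only once, after the search has finished, so the reported test $R^2$ is not influenced by the tuning itself.
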
The final step is to analyze the effectiveness of the tuning process.
- Comparison: Compare the Tuned Model Test Score ($R^2$) to the Untuned Model $R^2$.
- Interpretation: If the tuned $R^2$ is clearly higher, tuning likely adjusted the model's complexity to fit the data patterns better without overfitting the noise.
- Observation: Examine the best hyperparameters found by the search.
  - Example Risk: If the best `max_depth` for a Decision Tree is found to be very high (e.g., 20 or `None`), this suggests the selected model may be overfitting, particularly if the CV setting was too lenient.
  - CV's Role: State how the use of Cross-Validation (`cv`) helped mitigate the risk of simply selecting parameters that only performed well on one arbitrary data split.
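
To make the overfitting risk and the role of CV concrete, the search results can be inspected per candidate. The sketch below assumes scikit-learn's built-in diabetes data and a Decision Tree with an illustrative list of depths. Setting `return_train_score=True` exposes the gap between training and validation $R^2$, and `std_test_score` shows how much the score varies from fold to fold.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Assumption: scikit-learn's built-in diabetes data and an 80/20 split.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 3, 5, 10, 20, None]},  # illustrative depths
    cv=5,
    scoring="r2",
    return_train_score=True,  # also record training-fold scores
)
search.fit(X_train, y_train)

results = search.cv_results_
for params, train, val, std in zip(
    results["params"],
    results["mean_train_score"],
    results["mean_test_score"],
    results["std_test_score"],
):
    depth = str(params["max_depth"])
    # A large train/validation gap is a typical sign of overfitting.
    print(
        f"max_depth={depth:>4}  train R2={train:.3f}  "
        f"CV R2={val:.3f} (+/- {std:.3f})  gap={train - val:.3f}"
    )
```

Because each candidate is scored as an average over five folds, a depth that happens to do well on a single split is less likely to be selected.
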
| Risk Indicator | Finding | Interpretation |
|---|---|---|
| Low Complexity | Best `max_depth` is very low (e.g., 3). | Risk of Underfitting (Model too simple). |
| High Complexity | Best `max_depth` is high (e.g., 15+). | Risk of Overfitting (Model learned too much training noise). |
| Optimal Balance | Performance improved, and complexity parameters are mid-range. | Successful Tuning (Optimal bias-variance trade-off achieved). |
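
The table can be turned into a quick sanity check. The helper below is hypothetical and not part of scikit-learn. Its thresholds (`low=3`, `high=15`) are taken from the examples in the table and are rough heuristics, not fixed rules. Treating `max_depth=None` as high complexity is also an assumption, since it lets trees grow without a depth limit.

```python
def complexity_risk(best_max_depth, tuned_r2, untuned_r2, low=3, high=15):
    """Map a search result onto the risk indicators in the table (heuristic)."""
    # None means unlimited depth, treated here as high complexity.
    if best_max_depth is None or best_max_depth >= high:
        return "High Complexity: risk of overfitting"
    if best_max_depth <= low:
        return "Low Complexity: risk of underfitting"
    if tuned_r2 > untuned_r2:
        return "Optimal Balance: successful tuning"
    return "Mid-range complexity, but no improvement over the untuned model"


# Example calls with made-up scores, for illustration only.
print(complexity_risk(3, tuned_r2=0.40, untuned_r2=0.38))
print(complexity_risk(None, tuned_r2=0.45, untuned_r2=0.38))
print(complexity_risk(8, tuned_r2=0.45, untuned_r2=0.38))
```

A flag from this check is a prompt to look at the train/validation gap, not a verdict: a shallow tree can be the right choice for a small dataset.
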