GMM test #12
Pull Request Overview
This PR replaces the kernel density estimator with a Gaussian Mixture Model (GMM) for generative modeling of EU survey data. The change aims to evaluate whether GMMs provide more stable results than the previous approach.
- Replaces KernelDensity with GaussianMixture from sklearn
- Adds component selection methodology with configurable parameters for GMM optimization
- Introduces validation-based model selection using BIC/AIC criteria
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/scripts/train_generative_model.py | Passes new GMM configuration parameters to the training function |
| src/european_values/generative_training.py | Core implementation change from KernelDensity to GaussianMixture with component selection |
| config/config.yaml | Adds GMM-specific configuration parameters with default values |
| plot_component_selection, | ||
| save_evaluation_results, | ||
| ) | ||
The import from the gmm_component_selection module will fail because this module is not included in the diff and likely doesn't exist yet. The functions evaluate_gmm_components, plot_component_selection, and save_evaluation_results are imported, but their implementations are missing.
| # --- Implementations for missing functions --- | |
| import matplotlib.pyplot as plt | |
| import numpy as np | |
| import pandas as pd | |
| from sklearn.mixture import GaussianMixture | |
| def evaluate_gmm_components( | |
|     X_train, | |
|     X_val, | |
|     max_components, | |
|     criterion="bic", | |
|     covariance_type="full", | |
|     random_state=None, | |
|     n_init=1, | |
|     reg_covar=1e-6, | |
| ): | |
|     """Evaluate GMMs with different numbers of components and select the best one.""" | |
|     scores = [] | |
|     n_components_range = range(1, max_components + 1) | |
|     for n in n_components_range: | |
|         gmm = GaussianMixture( | |
|             n_components=n, | |
|             covariance_type=covariance_type, | |
|             random_state=random_state, | |
|             n_init=n_init, | |
|             reg_covar=reg_covar, | |
|         ) | |
|         # Fit on the training split, then score the criterion on the validation split. | |
|         gmm.fit(X_train) | |
|         if criterion == "bic": | |
|             score = gmm.bic(X_val) | |
|         else: | |
|             score = gmm.aic(X_val) | |
|         scores.append(score) | |
|     # Lower BIC/AIC is better, so pick the component count with the minimum score. | |
|     optimal_n = n_components_range[int(np.argmin(scores))] | |
|     evaluation_results = { | |
|         "n_components": list(n_components_range), | |
|         "scores": scores, | |
|         "criterion": criterion, | |
|     } | |
|     return optimal_n, evaluation_results | |
| def plot_component_selection(evaluation_results, output_path=None): | |
|     """Plot the selection criterion scores for different numbers of GMM components.""" | |
|     plt.figure() | |
|     plt.plot( | |
|         evaluation_results["n_components"], | |
|         evaluation_results["scores"], | |
|         marker="o", | |
|     ) | |
|     plt.xlabel("Number of GMM components") | |
|     plt.ylabel(evaluation_results.get("criterion", "Score").upper()) | |
|     plt.title("GMM Component Selection") | |
|     # Save to disk when a path is given; otherwise show the figure interactively. | |
|     if output_path: | |
|         plt.savefig(output_path) | |
|     else: | |
|         plt.show() | |
|     plt.close() | |
| def save_evaluation_results(evaluation_results, output_path): | |
|     """Save evaluation results to a CSV file.""" | |
|     df = pd.DataFrame({ | |
|         "n_components": evaluation_results["n_components"], | |
|         "score": evaluation_results["scores"], | |
|     }) | |
|     df.to_csv(output_path, index=False) |
| selection_criterion: bic # 'bic' or 'aic' | ||
| covariance_type: diag # 'full', 'tied', 'diag', 'spherical' - using 'diag' for stability | ||
| n_init: 5 # Number of GMM initializations | ||
| reg_covar: 1e-5 # Regularization for covariance matrices (increased for extra stability) |
The default value for reg_covar in the function signature (1e-6) differs from the config file value (1e-5). This inconsistency could lead to confusion about which value is actually used.
| - reg_covar: 1e-5 # Regularization for covariance matrices (increased for extra stability) | |
| + reg_covar: 1e-6 # Regularization for covariance matrices (matches function default) |
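Which value takes effect depends on the call path: the values in config/config.yaml reach the model only when the caller passes them explicitly, as the file summary says src/scripts/train_generative_model.py does. Any other caller falls back to the signature defaults. Below is a minimal sketch of a check that surfaces this kind of drift, using only the standard library. `find_default_mismatches` and `train_stub` are hypothetical names, and the stub carries only the defaults quoted in this review, not the real training function.

```python
import inspect


def find_default_mismatches(func, config):
    """Return {name: (signature_default, config_value)} where the two differ."""
    mismatches = {}
    for name, param in inspect.signature(func).parameters.items():
        # Skip parameters without a default and parameters the config does not set.
        if param.default is inspect.Parameter.empty or name not in config:
            continue
        if param.default != config[name]:
            mismatches[name] = (param.default, config[name])
    return mismatches


# Hypothetical stand-in for the training function; only its defaults matter here.
def train_stub(seed, selection_criterion="bic", covariance_type="full", reg_covar=1e-6):
    return None


# Values copied from the config.yaml excerpt quoted in this review.
config = {"selection_criterion": "bic", "covariance_type": "diag", "reg_covar": 1e-5}

mismatches = find_default_mismatches(train_stub, config)
for name, (default, configured) in mismatches.items():
    print(f"{name}: signature default {default!r}, config value {configured!r}")
```

Run as a unit test, a check like this would flag the drift without forcing either side to change, which may be preferable if the larger `reg_covar` in the config is intentional.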
| seed: int, | ||
| n_components_max: int = 50, | ||
| selection_criterion: str = "bic", | ||
| covariance_type: str = "full", |
The default value for covariance_type in the function signature is 'full' but the config file sets it to 'diag'. This mismatch could cause unexpected behavior when the function is called without explicit parameters.
| - covariance_type: str = "full", | |
| + covariance_type: str = "diag", |
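The choice between 'full' and 'diag' is more than cosmetic, because it changes how many parameters each candidate model has, and BIC (−2·log-likelihood + k·ln n, with k free parameters and n samples) penalizes that count directly. The sketch below counts free parameters per covariance type using the standard formulas for Gaussian mixtures. `gmm_n_parameters` is a hypothetical helper, and the component and feature counts are illustrative, not taken from the survey data.

```python
import math


def gmm_n_parameters(n_components, n_features, covariance_type):
    """Count the free parameters of a Gaussian mixture model."""
    if covariance_type == "full":
        # One symmetric d x d matrix per component.
        cov = n_components * n_features * (n_features + 1) // 2
    elif covariance_type == "tied":
        # A single symmetric d x d matrix shared by all components.
        cov = n_features * (n_features + 1) // 2
    elif covariance_type == "diag":
        # One variance per feature per component.
        cov = n_components * n_features
    elif covariance_type == "spherical":
        # One variance per component.
        cov = n_components
    else:
        raise ValueError(f"unknown covariance_type: {covariance_type!r}")
    means = n_components * n_features
    weights = n_components - 1  # mixture weights sum to 1
    return cov + means + weights


def bic_penalty(n_params, n_samples):
    """Complexity term of BIC: k * ln(n)."""
    return n_params * math.log(n_samples)


# Illustrative sizes only: 10 components, 20 features.
full_params = gmm_n_parameters(10, 20, "full")  # 2309
diag_params = gmm_n_parameters(10, 20, "diag")  # 409
print(full_params, diag_params)
```

With these illustrative sizes the 'full' model carries more than five times as many parameters as the 'diag' model, so the same `n_components_max` sweep can select a different number of components depending on which default is in effect.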
This is just a sanity check to see whether GMMs produce more stable results than the kernel density estimator. The aim is not necessarily to merge this into main, only to check whether it produces better results.