GMM test by Haffi112 · Pull Request #12 · alexandrainst/european_values

Haffi112 · 2025-08-19T16:24:59Z

This is just a sanity check to see if GMMs produce more stable results than the kernel density estimator. The aim is not necessarily to merge this with main, but just to check if it produces better results.

Copilot

Pull Request Overview

This PR replaces the kernel density estimator with a Gaussian Mixture Model (GMM) for generative modeling of EU survey data. The change aims to evaluate whether GMMs provide more stable results than the previous approach.

Replaces KernelDensity with GaussianMixture from sklearn
Adds component selection methodology with configurable parameters for GMM optimization
Introduces validation-based model selection using BIC/AIC criteria
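
One way to make "more stable" measurable is to refit each estimator on bootstrap resamples and compare the spread of the mean held-out log-likelihood. The sketch below does this on synthetic data. The helper name `heldout_loglik_spread`, the bandwidth, the component count, and the data are all illustrative assumptions, not taken from this PR.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import KernelDensity


def heldout_loglik_spread(make_model, X_train, X_test, n_resamples=10, seed=0):
    """Return (std, mean) of the mean held-out log-likelihood over bootstrap refits."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_resamples):
        # Resample the training rows with replacement and refit from scratch.
        idx = rng.integers(0, len(X_train), size=len(X_train))
        model = make_model().fit(X_train[idx])
        # Both estimators expose score_samples(), returning per-sample log-density.
        scores.append(float(np.mean(model.score_samples(X_test))))
    return float(np.std(scores)), float(np.mean(scores))


# Synthetic stand-in data (hypothetical; not the survey data used in this PR).
rng = np.random.default_rng(42)
X = np.vstack(
    [rng.normal(-2.0, 1.0, size=(300, 4)), rng.normal(2.0, 1.0, size=(300, 4))]
)
rng.shuffle(X)
X_train, X_test = X[:450], X[450:]

kde_std, kde_mean = heldout_loglik_spread(
    lambda: KernelDensity(bandwidth=0.5), X_train, X_test
)
gmm_std, gmm_mean = heldout_loglik_spread(
    lambda: GaussianMixture(n_components=2, covariance_type="diag", random_state=0),
    X_train,
    X_test,
)
print(f"KDE: mean={kde_mean:.3f}, std={kde_std:.4f}")
print(f"GMM: mean={gmm_mean:.3f}, std={gmm_std:.4f}")
```

A smaller standard deviation at a comparable mean would indicate the more stable estimator for that dataset. The outcome depends on the data and hyperparameters, so the numbers from this toy setup say nothing about the survey data.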

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
src/scripts/train_generative_model.py	Passes new GMM configuration parameters to the training function
src/european_values/generative_training.py	Core implementation change from KernelDensity to GaussianMixture with component selection
config/config.yaml	Adds GMM-specific configuration parameters with default values


Copilot · 2025-08-19T16:25:57Z

+    plot_component_selection,
+    save_evaluation_results,
+)



The import of gmm_component_selection module will fail as this module is not included in the diff and likely doesn't exist yet. The functions evaluate_gmm_components, plot_component_selection, and save_evaluation_results are being imported but their implementation is missing.

Suggested change

# --- Implementations for missing functions ---
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture


def evaluate_gmm_components(
    X_train,
    X_val,
    max_components,
    criterion="bic",
    covariance_type="full",
    random_state=None,
    n_init=1,
    reg_covar=1e-6,
):
    """Evaluate GMMs with different numbers of components and select the best one."""
    scores = []
    n_components_range = range(1, max_components + 1)
    for n in n_components_range:
        gmm = GaussianMixture(
            n_components=n,
            covariance_type=covariance_type,
            random_state=random_state,
            n_init=n_init,
            reg_covar=reg_covar,
        )
        # Fit on the training split, then score the criterion on the validation split.
        gmm.fit(X_train)
        if criterion == "bic":
            score = gmm.bic(X_val)
        else:
            score = gmm.aic(X_val)
        scores.append(score)
    # Lower BIC/AIC is better, so pick the component count with the minimum score.
    optimal_n = n_components_range[int(np.argmin(scores))]
    evaluation_results = {
        "n_components": list(n_components_range),
        "scores": scores,
        "criterion": criterion,
    }
    return optimal_n, evaluation_results


def plot_component_selection(evaluation_results, output_path=None):
    """Plot the selection criterion scores for different numbers of GMM components."""
    plt.figure()
    plt.plot(
        evaluation_results["n_components"],
        evaluation_results["scores"],
        marker="o",
    )
    plt.xlabel("Number of GMM components")
    plt.ylabel(evaluation_results.get("criterion", "Score").upper())
    plt.title("GMM Component Selection")
    # Save to disk when a path is given; otherwise show the figure interactively.
    if output_path:
        plt.savefig(output_path)
    else:
        plt.show()
    plt.close()


def save_evaluation_results(evaluation_results, output_path):
    """Save evaluation results to a CSV file."""
    df = pd.DataFrame(
        {
            "n_components": evaluation_results["n_components"],
            "score": evaluation_results["scores"],
        }
    )
    df.to_csv(output_path, index=False)
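
For reference, the value returned by scikit-learn's `GaussianMixture.bic` is the usual `-2 * log-likelihood + k * ln(n)`, where `k` is the number of free parameters and `n` is the number of rows passed to `bic`. The sketch below recomputes it by hand on synthetic data. The helpers `n_free_parameters` and `manual_bic` are illustrative names, not part of this PR or of scikit-learn.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def n_free_parameters(n_components, n_features, covariance_type):
    """Count the free parameters of a GMM: covariances, means, and mixture weights."""
    cov_params = {
        "full": n_components * n_features * (n_features + 1) // 2,
        "tied": n_features * (n_features + 1) // 2,
        "diag": n_components * n_features,
        "spherical": n_components,
    }[covariance_type]
    mean_params = n_components * n_features
    weight_params = n_components - 1  # the weights sum to one
    return cov_params + mean_params + weight_params


def manual_bic(gmm, X):
    """Recompute BIC for a fitted GMM on the rows of X."""
    n_samples, n_features = X.shape
    k = n_free_parameters(gmm.n_components, n_features, gmm.covariance_type)
    # score() returns the mean per-sample log-likelihood, so scale by n_samples.
    total_loglik = gmm.score(X) * n_samples
    return -2.0 * total_loglik + k * np.log(n_samples)


# Synthetic stand-in data (hypothetical; not the survey data used in this PR).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 5))
X_val = rng.normal(size=(100, 5))

gmm = GaussianMixture(n_components=3, covariance_type="diag", random_state=0)
gmm.fit(X_train)

bic_manual = manual_bic(gmm, X_val)
bic_sklearn = gmm.bic(X_val)
print(bic_manual, bic_sklearn)
```

Because the penalty term uses the size of the data passed in, calling `bic(X_val)` on a model fitted to `X_train` penalizes complexity according to the validation set size rather than the training set size. That is worth keeping in mind when reading the selection curve.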

Copilot · 2025-08-19T16:25:58Z

+  selection_criterion: bic  # 'bic' or 'aic'
+  covariance_type: diag  # 'full', 'tied', 'diag', 'spherical' - using 'diag' for stability
+  n_init: 5  # Number of GMM initializations
+  reg_covar: 1e-5  # Regularization for covariance matrices (increased for extra stability)


The default value for reg_covar in the function signature (1e-6) differs from the config file value (1e-5). This inconsistency could lead to confusion about which value is actually used.

Suggested change

-  reg_covar: 1e-5  # Regularization for covariance matrices (increased for extra stability)
+  reg_covar: 1e-6  # Regularization for covariance matrices (matches function default)
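
To make the effect of `reg_covar` concrete: scikit-learn adds it to the diagonal of each estimated covariance, so it acts as a floor on the fitted variances. The following minimal sketch uses synthetic data with one deliberately degenerate feature. The data and the single-component setup are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data (hypothetical): two ordinary features and one constant feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = 0.0  # zero-variance column

reg_covar = 1e-5
gmm = GaussianMixture(
    n_components=1,
    covariance_type="diag",
    reg_covar=reg_covar,
    random_state=0,
).fit(X)

# For covariance_type="diag", covariances_ has shape (n_components, n_features).
variances = gmm.covariances_[0]
print(variances)
```

The fitted variance of the constant column equals `reg_covar`, while the other two stay near 1. Whether 1e-5 or 1e-6 is the better value depends on the scale of the features, so the choice matters most for near-degenerate dimensions.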

Copilot · 2025-08-19T16:25:58Z

+    seed: int,
+    n_components_max: int = 50,
+    selection_criterion: str = "bic",
+    covariance_type: str = "full",


The default value for covariance_type in the function signature is 'full' but the config file sets it to 'diag'. This mismatch could cause unexpected behavior when the function is called without explicit parameters.

Suggested change

-    covariance_type: str = "full",
+    covariance_type: str = "diag",
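
Mismatches like this one can be caught mechanically by comparing a function's signature defaults against the loaded config. The sketch below is one possible check. `train_stub` is a hypothetical stand-in that only mirrors the defaults quoted in this review, and `config` is a hand-written dict with the values from the config excerpt, not the project's actual loader.

```python
import inspect


def train_stub(
    seed,
    n_components_max=50,
    selection_criterion="bic",
    covariance_type="full",
    reg_covar=1e-6,
):
    """Hypothetical stand-in carrying the signature defaults quoted in the review."""


# Values copied from the config excerpt quoted in the review.
config = {
    "selection_criterion": "bic",
    "covariance_type": "diag",
    "n_init": 5,
    "reg_covar": 1e-5,
}


def find_default_mismatches(func, config):
    """Return {name: (signature_default, config_value)} for keys that disagree."""
    mismatches = {}
    for name, param in inspect.signature(func).parameters.items():
        # Skip parameters with no default or with no counterpart in the config.
        if param.default is inspect.Parameter.empty or name not in config:
            continue
        if param.default != config[name]:
            mismatches[name] = (param.default, config[name])
    return mismatches


mismatches = find_default_mismatches(train_stub, config)
print(mismatches)
```

A check of this kind could run as a unit test. An alternative design is to drop the signature defaults entirely so the config is the only source of truth.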

GMM test

9c890ff

Copilot AI review requested due to automatic review settings August 19, 2025 16:25

Copilot AI reviewed Aug 19, 2025

saattrupdan marked this pull request as draft August 19, 2025 18:26

Haffi112 wants to merge 1 commit into main from gmm_sanity_check
