Added print statements for probabilities without .max(axis=1), made various changes in training script, evaluation script and data processing script. #9
Merged
18 commits
b156496 Fix remaining line length issues in print statements (AnnikaSimonsen)
33700f1 Added print statements for probabilities without .max(axis=1), made v… (AnnikaSimonsen)
5261120 Update src/european_values/data_processing.py (AnnikaSimonsen)
437ef04 Update src/european_values/generative_training.py (AnnikaSimonsen)
a666c21 Update src/european_values/generative_training.py (AnnikaSimonsen)
58f0c6f Update src/european_values/generative_training.py (AnnikaSimonsen)
205725b Update src/scripts/evaluate_llm_benchmark.py (AnnikaSimonsen)
2def0f6 Update src/european_values/generative_training.py (AnnikaSimonsen)
96021dd Update src/scripts/evaluate_llm_benchmark.py (AnnikaSimonsen)
3136d90 Update src/scripts/train_generative_model.py (AnnikaSimonsen)
803b806 Update src/scripts/evaluate_llm_benchmark.py (AnnikaSimonsen)
a9aba95 Address review feedback: flexible data loading, evaluation fixes, and… (AnnikaSimonsen)
0bab4dc Update src/european_values/generative_training.py (AnnikaSimonsen)
6aae6a8 Update src/european_values/generative_training.py (AnnikaSimonsen)
429d343 Fix process_data tuple unpacking in classifier, survey, and plot scripts (AnnikaSimonsen)
ae00d74 Apply ruff formatting (AnnikaSimonsen)
243177f Clean up llm_evaluation.py and add flexible data loading to evaluate_… (AnnikaSimonsen)
012ded9 Update data loading patterns and fix tuple unpacking (AnnikaSimonsen)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,73 @@ | ||
| """LLM evaluation using trained GMM.""" | ||
| | ||
| from typing import Any, Dict, Tuple | ||
| | ||
| import joblib | ||
| import numpy as np | ||
| import pandas as pd | ||
| from sklearn.pipeline import Pipeline | ||
| | ||
| | ||
| def evaluate_with_gmm( | ||
|     responses: np.ndarray, gmm_pipeline: Pipeline | ||
| ) -> Tuple[float, np.ndarray]: | ||
|     """Evaluate responses using GMM pipeline.""" | ||
|     # Try using the pipeline directly first | ||
|     try: | ||
|         full_probabilities = gmm_pipeline.predict_proba(responses) | ||
|     except AttributeError: | ||
|         # Fallback: Pipeline doesn't have predict_proba, access GMM component | ||
|         gmm_model = gmm_pipeline.named_steps["gaussianmixture"] | ||
|         scaled_responses = gmm_pipeline.named_steps["minmaxscaler"].transform(responses) | ||
|         full_probabilities = gmm_model.predict_proba(scaled_responses) | ||
| | ||
|     print("First 5 rows of full probabilities:") | ||
|     print(full_probabilities[:5]) | ||
| | ||
|     probabilities = full_probabilities.max(axis=1) | ||
|     avg_probability = np.mean(probabilities) | ||
|     return avg_probability, probabilities | ||
| | ||
| | ||
| def evaluate_survey_data( | ||
|     survey_df: pd.DataFrame, gmm_model_path: str, region: str = "EU" | ||
| ) -> Dict[str, Any]: | ||
|     """Evaluate survey data with GMM.""" | ||
|     # Load pipeline directly (no need for wrapper function) | ||
|     gmm_pipeline = joblib.load(gmm_model_path) | ||
| | ||
|     # Try filtering by country_group first, then by country_code | ||
|     data = survey_df[survey_df["country_group"] == region] | ||
|     if len(data) == 0: | ||
|         data = survey_df[survey_df["country_code"] == region] | ||
|         if len(data) == 0: | ||
|             available_groups = survey_df["country_group"].unique() | ||
|             available_countries = survey_df["country_code"].unique() | ||
|             raise ValueError( | ||
|                 f"No data found for region '{region}'. " | ||
|                 f"Available groups: {list(available_groups)}, " | ||
|                 f"Available countries: {list(available_countries)}" | ||
|             ) | ||
| | ||
|     question_cols = [col for col in data.columns if col.startswith("question_")] | ||
|     responses = data[question_cols].values | ||
| | ||
|     # Since data should already be processed and imputed, we shouldn't need | ||
|     # additional NaN handling. But keep minimal safety check: | ||
|     if np.isnan(responses).any(): | ||
|         nan_count = np.isnan(responses).sum() | ||
|         print(f"Warning: Found {nan_count} NaN values in processed data") | ||
|         # Simple fallback: replace NaN with column means | ||
|         col_means = np.nanmean(responses, axis=0) | ||
|         nan_mask = np.isnan(responses) | ||
|         responses = np.where(nan_mask, col_means, responses) | ||
| | ||
|     # Evaluate | ||
|     avg_probability, sample_probabilities = evaluate_with_gmm(responses, gmm_pipeline) | ||
| | ||
|     return { | ||
|         "avg_probability": avg_probability, | ||
|         "sample_probabilities": sample_probabilities, | ||
|         "n_samples": len(data), | ||
|         "n_questions": len(question_cols), | ||
|     } | ||
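
To make the difference between the full probability matrix and its `.max(axis=1)` reduction concrete, here is a minimal sketch of what `evaluate_with_gmm` computes. It assumes a scikit-learn pipeline built with `make_pipeline(MinMaxScaler(), GaussianMixture(...))`, which would produce the step names `minmaxscaler` and `gaussianmixture` used in the fallback branch. The synthetic data, the two-component setting, and the variable names are illustrative assumptions, not the project's trained model.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for survey responses: two well-separated clusters.
responses = np.vstack(
    [
        rng.normal(loc=1.0, scale=0.2, size=(50, 3)),
        rng.normal(loc=4.0, scale=0.2, size=(50, 3)),
    ]
)

# make_pipeline derives step names from the lowercased class names.
pipeline = make_pipeline(
    MinMaxScaler(), GaussianMixture(n_components=2, random_state=0)
)
pipeline.fit(responses)

# Direct path: the pipeline scales the data, then asks the mixture for posteriors.
full_probabilities = pipeline.predict_proba(responses)  # (n_samples, n_components)

# Manual path, mirroring the fallback branch: scale first, then query the GMM.
scaled = pipeline.named_steps["minmaxscaler"].transform(responses)
fallback_probabilities = pipeline.named_steps["gaussianmixture"].predict_proba(scaled)

# Reduction used for the reported score: highest component posterior per row.
probabilities = full_probabilities.max(axis=1)
avg_probability = float(np.mean(probabilities))

print(full_probabilities[:5])
print(avg_probability)
```

Each row of `full_probabilities` holds posterior component memberships and sums to 1, so the per-row maximum can never fall below `1 / n_components`. Printing the full matrix alongside the reduced score, as this PR does, shows how the mass is spread across components rather than only the winning value.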
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,75 @@ | ||
| """Script to evaluate LLM benchmark.""" | ||
| | ||
| import logging | ||
| | ||
| import hydra | ||
| import pandas as pd | ||
| from omegaconf import DictConfig | ||
| | ||
| from european_values.data_loading import load_evs_trend_data, load_evs_wvs_data | ||
| from european_values.data_processing import process_data | ||
| from european_values.llm_evaluation import evaluate_survey_data | ||
| | ||
| logger = logging.getLogger("evaluate_llm") | ||
| | ||
| | ||
| @hydra.main(config_path="../../config", config_name="config", version_base=None) | ||
| def main(config: DictConfig) -> None: | ||
|     """Main evaluation function.""" | ||
|     # Load data - now supports both datasets like other scripts | ||
|     match (config.include_evs_trend, config.include_evs_wvs): | ||
|         case (True, True): | ||
|             logger.info("Loading EVS trend and EVS/WVS data...") | ||
|             evs_trend_df = load_evs_trend_data() | ||
|             evs_wvs_df = load_evs_wvs_data() | ||
|             df = pd.concat([evs_trend_df, evs_wvs_df], ignore_index=True) | ||
|         case (True, False): | ||
|             logger.info("Loading only EVS trend data...") | ||
|             df = load_evs_trend_data() | ||
|         case (False, True): | ||
|             logger.info("Loading only EVS/WVS data...") | ||
|             df = load_evs_wvs_data() | ||
|         case _: | ||
|             raise ValueError( | ||
|                 "At least one of `include_evs_trend` or `include_evs_wvs` must be True." | ||
|             ) | ||
|     # Process data but SKIP normalization (let pipeline handle it) | ||
|     df, _ = process_data(df=df, config=config, normalize=False) | ||
|     # Apply subset filtering | ||
|     if config.subset_csv is not None: | ||
|         subset_df = pd.read_csv(config.subset_csv) | ||
|         question_subset = ( | ||
|             subset_df.question.unique().tolist() | ||
|             if "question" in subset_df.columns | ||
|             else list({line.split(":")[0] for line in subset_df.index.tolist()}) | ||
|         ) | ||
|         question_cols_to_remove = [ | ||
|             col | ||
|             for col in df.columns | ||
|             if col.startswith("question_") and col not in question_subset | ||
|         ] | ||
|         df.drop(columns=question_cols_to_remove, inplace=True) | ||
|         logger.info(f"Using {len(question_subset)} questions from subset") | ||
|     # Set evaluation parameters | ||
|     region = config.evaluation.region | ||
|     model_path = config.evaluation.gmm_model_path | ||
|     # Run evaluation | ||
|     logger.info(f"Evaluating {region} data...") | ||
|     results = evaluate_survey_data(df, model_path, region) | ||
|     # Print results | ||
|     print(f"\n{'=' * 50}") | ||
|     print(f"EVALUATION RESULTS FOR {region}") | ||
|     print(f"{'=' * 50}") | ||
|     print(f"Samples: {results['n_samples']:,}") | ||
|     print(f"Questions: {results['n_questions']}") | ||
|     print(f"Average probability: {results['avg_probability']:.4f}") | ||
|     print( | ||
|         f"Probability range: [{results['sample_probabilities'].min():.4f}, " | ||
|         f"{results['sample_probabilities'].max():.4f}]" | ||
|     ) | ||
|     print(f"Probability mean: {results['sample_probabilities'].mean():.4f}") | ||
|     print(f"Probability std: {results['sample_probabilities'].std():.4f}") | ||
| | ||
| | ||
| if __name__ == "__main__": | ||
|     main() | ||
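
The subset-filtering step in `main` can be tried in isolation. The sketch below reproduces only the branch where the subset table has a `question` column. The toy frames and their column names (`question_a`, `question_b`, `question_c`) are made up for illustration and say nothing about the real survey schema or the layout of `config.subset_csv`.

```python
import pandas as pd

# Toy frame standing in for the processed survey data (hypothetical columns).
df = pd.DataFrame(
    {
        "country_code": ["DK", "FO"],
        "question_a": [1.0, 2.0],
        "question_b": [3.0, 4.0],
        "question_c": [5.0, 6.0],
    }
)

# Toy subset table with a "question" column; duplicates are allowed.
subset_df = pd.DataFrame({"question": ["question_a", "question_c", "question_a"]})

# Unique question names to keep, in order of first appearance.
question_subset = subset_df["question"].unique().tolist()

# Only columns with the "question_" prefix are candidates for removal,
# so metadata columns such as country_code always survive.
question_cols_to_remove = [
    col
    for col in df.columns
    if col.startswith("question_") and col not in question_subset
]

# Non-mutating variant of the drop used in the script.
filtered = df.drop(columns=question_cols_to_remove)
print(list(filtered.columns))
```

The script itself drops the columns with `inplace=True`. The copy-returning form is used here only so that the original frame stays available for comparison.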