Change log likelihood squashing by Haffi112 · Pull Request #11 · alexandrainst/european_values

Haffi112 · 2025-08-13T16:43:17Z

This replaces the linear scaling of the log-likelihoods with a sigmoid transform. Running the evaluation with the command

python src/scripts/evaluate_llm_benchmark.py subset_csv=data/processed/optimisation-davies-bouldin-penalty10/davies-bouldin-penalty10-eufocus-1000it.csv

yields the following results:

2025-08-13 16:36:02 ⋅ Log-likelihoods for Europe without EU:
	- Mean: -31.16
	- Std: 66.19
	- Min: -424.91
	- 10% quantile: -118.76
	- 90% quantile: 38.79
	- Max: 50.69
	- Mean normalised score: 61.24%
2025-08-13 16:37:14 ⋅ Log-likelihoods for EU:
	- Mean: 62.64
	- Std: 0.01
	- Min: 62.64
	- 10% quantile: 62.64
	- 90% quantile: 62.64
	- Max: 63.33
	- Mean normalised score: 99.64%
2025-08-13 16:37:41 ⋅ Log-likelihoods for Latin America:
	- Mean: -43.17
	- Std: 81.28
	- Min: -525.21
	- 10% quantile: -151.25
	- 90% quantile: 38.85
	- Max: 40.68
	- Mean normalised score: 55.72%
2025-08-13 16:37:46 ⋅ Log-likelihoods for Oceania:
	- Mean: -15.95
	- Std: 65.20
	- Min: -351.48
	- 10% quantile: -111.50
	- 90% quantile: 38.83
	- Max: 40.78
	- Mean normalised score: 71.79%
2025-08-13 16:38:20 ⋅ Log-likelihoods for East Asia:
	- Mean: -59.30
	- Std: 98.99
	- Min: -563.38
	- 10% quantile: -197.64
	- 90% quantile: 38.52
	- Max: 46.91
	- Mean normalised score: 52.57%
2025-08-13 16:38:30 ⋅ Log-likelihoods for Anglo-America:
	- Mean: -24.27
	- Std: 71.11
	- Min: -426.40
	- 10% quantile: -124.85
	- 90% quantile: 38.60
	- Max: 40.80
	- Mean normalised score: 68.00%
2025-08-13 16:38:35 ⋅ Log-likelihoods for North Africa:
	- Mean: -75.59
	- Std: 97.61
	- Min: -395.50
	- 10% quantile: -201.77
	- 90% quantile: 38.48
	- Max: 40.78
	- Mean normalised score: 40.85%
2025-08-13 16:38:42 ⋅ Log-likelihoods for Sub-Saharan Africa:
	- Mean: -70.64
	- Std: 105.96
	- Min: -454.15
	- 10% quantile: -218.66
	- 90% quantile: 38.28
	- Max: 40.51
	- Mean normalised score: 47.04%
2025-08-13 16:38:51 ⋅ Log-likelihoods for Middle East:
	- Mean: -91.37
	- Std: 108.05
	- Min: -504.32
	- 10% quantile: -224.41
	- 90% quantile: 38.38
	- Max: 40.57
	- Mean normalised score: 38.04%
2025-08-13 16:38:57 ⋅ Log-likelihoods for Central Asia:
	- Mean: -61.30
	- Std: 90.30
	- Min: -420.87
	- 10% quantile: -182.53
	- 90% quantile: 38.43
	- Max: 40.33
	- Mean normalised score: 47.08%

Copilot

Pull Request Overview

This PR replaces the linear normalization of log-likelihood scores with a sigmoid transformation to improve score distribution. The change aims to ensure EU countries maintain high scores (above 99%) while providing better differentiation across other country groups.

Key changes:

- Introduces a sigmoid transformation function with configurable parameters
- Replaces linear scaling ((log_likelihoods + 100) / 100) with sigmoid-based normalization
- Uses sklearn's FunctionTransformer to apply the sigmoid transformation
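
To make the difference between the two normalisations concrete, here is a minimal sketch, not the PR's actual code. The function names `linear_scale` and `sigmoid` are hypothetical, the sample inputs are the logged min / mean values for "Europe without EU" and the EU mean, and `alpha=0.05` and `center=-52.03` are assumed from the values reported later in this thread.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer


def linear_scale(log_likelihoods):
    """Old normalisation quoted above: (log_likelihoods + 100) / 100."""
    return (np.asarray(log_likelihoods, dtype=float) + 100.0) / 100.0


def sigmoid(log_likelihoods, alpha, center):
    """Logistic squashing: alpha sets the steepness, center maps to 0.5."""
    x = np.asarray(log_likelihoods, dtype=float)
    return 1.0 / (1.0 + np.exp(-alpha * (x - center)))


# Sample log-likelihoods copied from the logs above, as a single feature column.
log_likelihoods = np.array([[-424.91], [-31.16], [62.64]])

# alpha and center are assumptions for illustration (see the lead-in).
squasher = FunctionTransformer(sigmoid, kw_args={"alpha": 0.05, "center": -52.03})

linear_scores = linear_scale(log_likelihoods).ravel()
sigmoid_scores = squasher.fit_transform(log_likelihoods).ravel()

print("linear :", np.round(linear_scores, 4))
print("sigmoid:", np.round(sigmoid_scores, 4))
```

The raw linear formula leaves the unit interval at both ends for these inputs, while the sigmoid output always stays strictly between 0 and 1 and keeps the ordering of the inputs.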


saattrupdan · 2025-08-14T10:09:29Z

@Haffi112 As for the failing code check, you can run make check to get it fixed. When you run make install to install the repo, it should also install the pre-commit hooks, which fix up the code whenever you commit a change.

Haffi112 · 2025-08-14T18:54:09Z

Thanks Dan, I'll have a look again when I have the time. Do you want me to implement this so that it is fitted during training as well? If we want to fit it, then it's probably best to cache the log-likelihoods so that it is easy to recompute the mean scores when looking for ideal parameters.
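
A rough sketch of the caching idea, under stated assumptions: the log-likelihoods are computed once and stored on disk, and a hypothetical helper `mean_scores_for_grid` reloads them to re-evaluate mean scores for many (alpha, center) pairs. The random array merely stands in for the expensive model evaluation, and the file name and grid values are made up for illustration.

```python
import tempfile
from pathlib import Path

import numpy as np


def sigmoid(x, alpha, center):
    """Logistic squashing of log-likelihoods into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-alpha * (x - center)))


def mean_scores_for_grid(cache_path, alphas, centers):
    """Reload cached log-likelihoods and score every (alpha, center) pair."""
    log_likelihoods = np.load(cache_path)
    return {
        (alpha, center): float(sigmoid(log_likelihoods, alpha, center).mean())
        for alpha in alphas
        for center in centers
    }


# Stand-in for the expensive step: synthetic log-likelihoods, not real data.
rng = np.random.default_rng(0)
log_likelihoods = rng.normal(loc=-30.0, scale=60.0, size=500)

# Cache once; every later parameter search only reads this file.
cache_path = Path(tempfile.mkdtemp()) / "log_likelihoods.npy"
np.save(cache_path, log_likelihoods)

grid = mean_scores_for_grid(
    cache_path, alphas=[0.02, 0.05, 0.1], centers=[-60.0, -50.0, -40.0]
)
for (alpha, center), score in sorted(grid.items()):
    print(f"alpha={alpha:<5} center={center:<6} mean score={score:.3f}")
```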

saattrupdan · 2025-08-15T08:03:16Z

> Thanks Dan, I'll have a look again when I have the time. Do you want me to implement this so that it is fitted during training as well? If we want to fit it, then it's probably best to cache the log-likelihoods so that it is easy to recompute the mean scores when looking for ideal parameters.

I have a few hours now - I'll have a look 🙂

saattrupdan · 2025-08-16T06:42:08Z

@Haffi112 Now fits both the alpha and center to a new validation split.

The alpha parameter is set so that the "effective range" of the sigmoid curve corresponds to the range of the input log-likelihoods. If the range of the input data is 100, for instance, the alpha value is 0.1; if the input range is 200, alpha is 0.05. In our case, alpha becomes around 0.05, just like you found manually.
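
One rule that reproduces both examples above is alpha = 10 / range. The sketch below assumes that rule and assumes the range is measured from the minimum to the maximum input; neither detail is confirmed in the thread, and `alpha_from_range` is a hypothetical name.

```python
import numpy as np

# Assumed constant: sigmoid(+5) is about 0.993 and sigmoid(-5) about 0.007, so an
# input span of 10 / alpha covers nearly the whole output range of the curve.
EFFECTIVE_WIDTH = 10.0


def alpha_from_range(log_likelihoods):
    """Pick alpha so the sigmoid's effective range matches the input range."""
    x = np.asarray(log_likelihoods, dtype=float)
    input_range = x.max() - x.min()  # assumption: plain min-to-max range
    return EFFECTIVE_WIDTH / input_range


print(alpha_from_range([0.0, 100.0]))    # input range of 100
print(alpha_from_range([-150.0, 50.0]))  # input range of 200
```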

The center is then fitted so that the sigmoid-transformed log-likelihoods of the EU data are as close to 99% as possible. This is also fitted on the validation split, and results in a center value of -52.03, really close to your find as well!
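
The fitting procedure is not spelled out in the thread, so the following is only a sketch of one way to do it: with alpha fixed, the mean sigmoid score decreases monotonically as the center grows, so a bisection can find the center where the mean score hits the target. The name `fit_center`, the choice of the mean as the objective, and the synthetic data are all assumptions.

```python
import numpy as np


def sigmoid(x, alpha, center):
    """Logistic squashing of log-likelihoods into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-alpha * (x - center)))


def fit_center(log_likelihoods, alpha, target=0.99, iterations=100):
    """Bisect for the center at which the mean sigmoid score equals target."""
    x = np.asarray(log_likelihoods, dtype=float)
    # Bracket the solution: far left gives scores near 1, far right near 0.
    margin = 20.0 / alpha
    low, high = x.min() - margin, x.max() + margin
    for _ in range(iterations):
        mid = 0.5 * (low + high)
        if sigmoid(x, alpha, mid).mean() > target:
            low = mid   # scores too high: the center must move right
        else:
            high = mid  # scores too low: the center must move left
    return 0.5 * (low + high)


# Synthetic stand-in for validation log-likelihoods, not real data.
rng = np.random.default_rng(0)
validation_ll = rng.normal(loc=60.0, scale=5.0, size=1000)

center = fit_center(validation_ll, alpha=0.05)
print(f"center={center:.2f}, mean score={sigmoid(validation_ll, 0.05, center).mean():.4f}")
```

For a single input value the result matches the closed form center = x - ln(target / (1 - target)) / alpha, which is a quick sanity check on the bisection.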

Here are the log-likelihoods and scores for the train/val/test distribution, after training on the training/validation data:

2025-08-16 02:28:14 ⋅ Log-likelihoods for train:
	- Mean: 62.7517
	- Std: 0.0060
	- Min: 62.7517
	- 10% quantile: 62.7517
	- 90% quantile: 62.7517
	- Max: 63.4448
Mean score for train: 100%
2025-08-16 02:28:14 ⋅ Log-likelihoods for validation:
	- Mean: -5.0570
	- Std: 51.2834
	- Min: -292.5138
	- 10% quantile: -77.7987
	- 90% quantile: 39.3138
	- Max: 53.2671
Mean score for validation: 78%
2025-08-16 02:28:14 ⋅ Log-likelihoods for test:
	- Mean: -5.0312
	- Std: 52.9710
	- Min: -429.8224
	- 10% quantile: -75.1690
	- 90% quantile: 39.3126
	- Max: 57.1961
Mean score for test: 78%

After training on all of the EU data, here are the scores for different country groups:

2025-08-16 02:30:33 ⋅ Scores for Europe without EU:
	- Mean: 62%
	- Std: 37%
	- Min: 0%
	- 10% quantile: 4%
	- 90% quantile: 99%
	- Max: 99%

2025-08-16 02:32:39 ⋅ Scores for EU:
	- Mean: 100%
	- Std: 0%
	- Min: 100%
	- 10% quantile: 100%
	- 90% quantile: 100%
	- Max: 100%

2025-08-16 02:33:27 ⋅ Scores for Latin America:
	- Mean: 57%
	- Std: 40%
	- Min: 0%
	- 10% quantile: 1%
	- 90% quantile: 99%
	- Max: 99%

2025-08-16 02:33:35 ⋅ Scores for Oceania:
	- Mean: 72%
	- Std: 35%
	- Min: 0%
	- 10% quantile: 6%
	- 90% quantile: 99%
	- Max: 99%

2025-08-16 02:34:36 ⋅ Scores for East Asia:
	- Mean: 53%
	- Std: 43%
	- Min: 0%
	- 10% quantile: 0%
	- 90% quantile: 99%
	- Max: 99%

2025-08-16 02:34:53 ⋅ Scores for Anglo-America:
	- Mean: 69%
	- Std: 37%
	- Min: 0%
	- 10% quantile: 3%
	- 90% quantile: 99%
	- Max: 99%

2025-08-16 02:35:02 ⋅ Scores for North Africa:
	- Mean: 42%
	- Std: 44%
	- Min: 0%
	- 10% quantile: 0%
	- 90% quantile: 99%
	- Max: 99%

2025-08-16 02:35:14 ⋅ Scores for Sub-Saharan Africa:
	- Mean: 48%
	- Std: 45%
	- Min: 0%
	- 10% quantile: 0%
	- 90% quantile: 99%
	- Max: 99%

2025-08-16 02:35:29 ⋅ Scores for Middle East:
	- Mean: 39%
	- Std: 45%
	- Min: 0%
	- 10% quantile: 0%
	- 90% quantile: 99%
	- Max: 99%

2025-08-16 02:35:40 ⋅ Scores for Central Asia:
	- Mean: 48%
	- Std: 43%
	- Min: 0%
	- 10% quantile: 0%
	- 90% quantile: 99%
	- Max: 99%

Lastly, I added a new push_pipeline_to_hub script, which creates a Hugging Face repo containing the pipeline, to make it easier to download in EuroEval.

Haffi112 added 2 commits August 13, 2025 16:33

Updated normalization to use sigmoid

030881b

Updated normalization to use sigmoid

8a379aa

Copilot AI review requested due to automatic review settings August 13, 2025 16:43

Copilot AI reviewed Aug 13, 2025

Comment thread src/scripts/evaluate_llm_benchmark.py Outdated

Comment thread src/scripts/evaluate_llm_benchmark.py Outdated

Comment thread src/scripts/evaluate_llm_benchmark.py Outdated

saattrupdan requested changes Aug 14, 2025

Comment thread src/scripts/evaluate_llm_benchmark.py Outdated

saattrupdan assigned Haffi112 and saattrupdan Aug 15, 2025

saattrupdan added 2 commits August 15, 2025 16:36

feat: Optimise sigmoid log-likelihood transformation from data

baa666d

feat: Choose bandwidth=0.1 always, fit alpha and center separately

33de447

saattrupdan added 6 commits August 16, 2025 11:56

feat: Add push_pipeline_to_hub script

491d567

style: Unpack var

84a587e

fix: Use cloudpickle for saving the pipeline

1b606be

chore: Remove unused top_num_questions_in_subset

52d90b7

refactor: Move apply_subset_filtering into utils module

4040fd4

fix: Clip values to (0, 1)

50eafe7

saattrupdan merged commit 015dd48 into main Aug 18, 2025
2 checks passed

saattrupdan deleted the change_log_likelihood_squashing branch August 18, 2025 11:19

saattrupdan merged 10 commits into main from change_log_likelihood_squashing
