Change log likelihood squashing #11

Merged: saattrupdan merged 10 commits into main from change_log_likelihood_squashing on Aug 18, 2025

@Haffi112
Collaborator

This replaces the linear scaling of log_likelihoods with a sigmoid transform. Running the test with the command

python src/scripts/evaluate_llm_benchmark.py subset_csv=data/processed/optimisation-davies-bouldin-penalty10/davies-bouldin-penalty10-eufocus-1000it.csv

yields the following results:

2025-08-13 16:36:02 ⋅ Log-likelihoods for Europe without EU:
	- Mean: -31.16
	- Std: 66.19
	- Min: -424.91
	- 10% quantile: -118.76
	- 90% quantile: 38.79
	- Max: 50.69
	- Mean normalised score: 61.24%
2025-08-13 16:37:14 ⋅ Log-likelihoods for EU:
	- Mean: 62.64
	- Std: 0.01
	- Min: 62.64
	- 10% quantile: 62.64
	- 90% quantile: 62.64
	- Max: 63.33
	- Mean normalised score: 99.64%
2025-08-13 16:37:41 ⋅ Log-likelihoods for Latin America:
	- Mean: -43.17
	- Std: 81.28
	- Min: -525.21
	- 10% quantile: -151.25
	- 90% quantile: 38.85
	- Max: 40.68
	- Mean normalised score: 55.72%
2025-08-13 16:37:46 ⋅ Log-likelihoods for Oceania:
	- Mean: -15.95
	- Std: 65.20
	- Min: -351.48
	- 10% quantile: -111.50
	- 90% quantile: 38.83
	- Max: 40.78
	- Mean normalised score: 71.79%
2025-08-13 16:38:20 ⋅ Log-likelihoods for East Asia:
	- Mean: -59.30
	- Std: 98.99
	- Min: -563.38
	- 10% quantile: -197.64
	- 90% quantile: 38.52
	- Max: 46.91
	- Mean normalised score: 52.57%
2025-08-13 16:38:30 ⋅ Log-likelihoods for Anglo-America:
	- Mean: -24.27
	- Std: 71.11
	- Min: -426.40
	- 10% quantile: -124.85
	- 90% quantile: 38.60
	- Max: 40.80
	- Mean normalised score: 68.00%
2025-08-13 16:38:35 ⋅ Log-likelihoods for North Africa:
	- Mean: -75.59
	- Std: 97.61
	- Min: -395.50
	- 10% quantile: -201.77
	- 90% quantile: 38.48
	- Max: 40.78
	- Mean normalised score: 40.85%
2025-08-13 16:38:42 ⋅ Log-likelihoods for Sub-Saharan Africa:
	- Mean: -70.64
	- Std: 105.96
	- Min: -454.15
	- 10% quantile: -218.66
	- 90% quantile: 38.28
	- Max: 40.51
	- Mean normalised score: 47.04%
2025-08-13 16:38:51 ⋅ Log-likelihoods for Middle East:
	- Mean: -91.37
	- Std: 108.05
	- Min: -504.32
	- 10% quantile: -224.41
	- 90% quantile: 38.38
	- Max: 40.57
	- Mean normalised score: 38.04%
2025-08-13 16:38:57 ⋅ Log-likelihoods for Central Asia:
	- Mean: -61.30
	- Std: 90.30
	- Min: -420.87
	- 10% quantile: -182.53
	- 90% quantile: 38.43
	- Max: 40.33
	- Mean normalised score: 47.08%

Copilot AI review requested due to automatic review settings August 13, 2025 16:43

Copilot AI left a comment

Pull Request Overview

This PR replaces the linear normalization of log-likelihood scores with a sigmoid transformation to improve score distribution. The change aims to ensure EU countries maintain high scores (above 99%) while providing better differentiation across other country groups.

Key changes:

  • Introduces sigmoid transformation function with configurable parameters
  • Replaces linear scaling ((log_likelihoods + 100) / 100) with sigmoid-based normalization
  • Uses sklearn's FunctionTransformer to apply the sigmoid transformation
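
The difference between the two normalisations can be sketched as follows. This is a minimal illustration, not the code from the PR: the function names `linear_scale` and `sigmoid` are assumptions, the exact sigmoid parameterisation (slope `alpha`, midpoint `center`) is assumed, and the parameter values 0.05 and -52.03 are the ones reported later in this conversation. The sample inputs are the EU mean and the East Asia minimum from the logs above.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer


def linear_scale(log_likelihoods: np.ndarray) -> np.ndarray:
    """The previous normalisation quoted above: (log_likelihoods + 100) / 100."""
    return (log_likelihoods + 100) / 100


def sigmoid(log_likelihoods: np.ndarray, alpha: float, center: float) -> np.ndarray:
    """Logistic squashing into (0, 1); alpha sets the slope, center the 50% point."""
    return 1.0 / (1.0 + np.exp(-alpha * (log_likelihoods - center)))


# Assumed parameter values, taken from the figures discussed in this thread.
squasher = FunctionTransformer(sigmoid, kw_args={"alpha": 0.05, "center": -52.03})

# One column of log-likelihoods: EU mean, the assumed center, East Asia minimum.
samples = np.array([[62.64], [-52.03], [-563.38]])

linear_scores = linear_scale(samples)
sigmoid_scores = squasher.transform(samples)

for ll, lin, sig in zip(samples[:, 0], linear_scores[:, 0], sigmoid_scores[:, 0]):
    print(f"log-likelihood {ll:8.2f}: linear {lin:6.2f}, sigmoid {sig:.4f}")
```

Without clipping, the linear version leaves the [0, 1] interval for any input outside [-100, 0], whereas the sigmoid output always stays strictly between 0 and 1.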

4 comment threads on src/scripts/evaluate_llm_benchmark.py (outdated)

@saattrupdan
Member

@Haffi112 As for the failing code check, you can run make check to fix it. Running make install to install the repo should also install the pre-commit hooks, which fix up the code whenever you commit a change.

@Haffi112
Collaborator Author

Thanks Dan, I'll have a look again when I have the time. Do you want me to implement this so that it is fitted during training as well? If we want to fit it, then it's probably best to cache the log-likelihoods so that it is easy to recompute the mean scores when looking for ideal parameters.
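
A rough sketch of that caching idea, under stated assumptions: `log_likelihoods` is a hypothetical stand-in for the slow scoring pass, its values are made up, and an in-memory `functools.lru_cache` is used here only for brevity (an on-disk cache would be needed to reuse results across runs).

```python
import math
from functools import lru_cache

MODEL_CALLS = 0  # counts how often the expensive step actually runs


@lru_cache(maxsize=None)
def log_likelihoods(group: str) -> tuple[float, ...]:
    """Hypothetical stand-in for the slow scoring pass; the values are made up."""
    global MODEL_CALLS
    MODEL_CALLS += 1
    fake = {"EU": (62.6, 62.6, 63.3), "Other": (38.8, -31.2, -118.8)}
    return fake[group]


def mean_score(values: tuple[float, ...], alpha: float, center: float) -> float:
    """Mean sigmoid score of the cached values for one (alpha, center) candidate."""
    scores = [1.0 / (1.0 + math.exp(-alpha * (v - center))) for v in values]
    return sum(scores) / len(scores)


# Sweep candidate parameters; each group's log-likelihoods are computed only once.
results = {
    (alpha, center): {
        group: mean_score(log_likelihoods(group), alpha, center)
        for group in ("EU", "Other")
    }
    for alpha in (0.03, 0.05, 0.1)
    for center in (-60.0, -50.0, -40.0)
}

print(f"{len(results)} parameter pairs evaluated with {MODEL_CALLS} scoring passes")
```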

@saattrupdan
Member

> Thanks Dan, I'll have a look again when I have the time. Do you want me to implement this so that it is fitted during training as well? If we want to fit it, then it's probably best to cache the log-likelihoods so that it is easy to recompute the mean scores when looking for ideal parameters.

I have a few hours now - I'll have a look 🙂

@saattrupdan
Member

saattrupdan commented Aug 16, 2025

@Haffi112 It now fits both alpha and center on a new validation split.

The alpha parameter is set so that the "effective range" of the sigmoid curve corresponds to the range of the input log-likelihoods. If the range of the input data is 100, for instance, the alpha value is 0.1, as shown here:

[Image: Screenshot 2025-08-16 at 08 38 23]

If the input range is 200, alpha is 0.05. In our case, alpha becomes around 0.05, just like you found manually.
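
Both examples are consistent with the relation alpha = 10 / range. The sketch below assumes exactly that: the constant 10 is inferred from the two examples rather than taken from the code, the helper name `alpha_from_range` is hypothetical, and how the "range" of the log-likelihoods is measured is not stated in this thread, so it is passed in as a plain number.

```python
import math


def alpha_from_range(input_range: float, width: float = 10.0) -> float:
    """Slope chosen so that alpha * input_range == width (width=10 is inferred)."""
    return width / input_range


def sigmoid(x: float, alpha: float, center: float = 0.0) -> float:
    """Logistic function with slope alpha and midpoint center."""
    return 1.0 / (1.0 + math.exp(-alpha * (x - center)))


for input_range in (100.0, 200.0):
    alpha = alpha_from_range(input_range)
    low = sigmoid(-input_range / 2, alpha)   # score at the lower edge of the range
    high = sigmoid(input_range / 2, alpha)   # score at the upper edge of the range
    print(f"range={input_range:.0f} alpha={alpha:.2f} low={low:.4f} high={high:.4f}")
```

Under this assumption, the two edges of the input range always land at sigmoid(-5) and sigmoid(5), roughly 0.7% and 99.3%, which is one way to read "effective range".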

The center is then fitted so that the sigmoid scores of the EU data are as close to 99% as possible. This is also fitted on the validation split, and results in a center value of -52.03, really close to what you found as well!
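
One way such a fit could work is sketched below. The fitting code is not shown in this thread, so everything here is an assumption: the bisection search, the choice of the mean score as the quantity matched to 99%, the hypothetical helpers `mean_score` and `fit_center`, and the sample values (which are not the validation data).

```python
import math


def sigmoid(x: float, alpha: float, center: float) -> float:
    """Numerically stable logistic function."""
    z = alpha * (x - center)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mean_score(lls: list[float], alpha: float, center: float) -> float:
    """Mean sigmoid score over a list of log-likelihoods."""
    return sum(sigmoid(x, alpha, center) for x in lls) / len(lls)


def fit_center(lls: list[float], alpha: float, target: float = 0.99,
               tol: float = 1e-9) -> float:
    """Bisect on center; the mean score falls monotonically as center rises."""
    lo = min(lls) - 50.0 / alpha  # center far below the data: scores close to 1
    hi = max(lls) + 50.0 / alpha  # center far above the data: scores close to 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean_score(lls, alpha, mid) > target:
            lo = mid  # scores still too high, so the center can move up
        else:
            hi = mid
    return (lo + hi) / 2


# Illustrative inputs only.
eu_lls = [62.75, 62.75, 62.75, 63.44]
alpha = 0.05
center = fit_center(eu_lls, alpha)
print(f"center={center:.2f}, mean score={mean_score(eu_lls, alpha, center):.4f}")
```

The fitted center depends on which data the target is matched against, so this toy input is not expected to reproduce the -52.03 reported above.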

Here are the log-likelihoods and scores for the train/val/test distribution, after training on the training/validation data:

2025-08-16 02:28:14 ⋅ Log-likelihoods for train:
	- Mean: 62.7517
	- Std: 0.0060
	- Min: 62.7517
	- 10% quantile: 62.7517
	- 90% quantile: 62.7517
	- Max: 63.4448
Mean score for train: 100%
2025-08-16 02:28:14 ⋅ Log-likelihoods for validation:
	- Mean: -5.0570
	- Std: 51.2834
	- Min: -292.5138
	- 10% quantile: -77.7987
	- 90% quantile: 39.3138
	- Max: 53.2671
Mean score for validation: 78%
2025-08-16 02:28:14 ⋅ Log-likelihoods for test:
	- Mean: -5.0312
	- Std: 52.9710
	- Min: -429.8224
	- 10% quantile: -75.1690
	- 90% quantile: 39.3126
	- Max: 57.1961
Mean score for test: 78%

After training on all of the EU data, here are the scores for different country groups:

2025-08-16 02:30:33 ⋅ Scores for Europe without EU:
	- Mean: 62%
	- Std: 37%
	- Min: 0%
	- 10% quantile: 4%
	- 90% quantile: 99%
	- Max: 99%

2025-08-16 02:32:39 ⋅ Scores for EU:
	- Mean: 100%
	- Std: 0%
	- Min: 100%
	- 10% quantile: 100%
	- 90% quantile: 100%
	- Max: 100%

2025-08-16 02:33:27 ⋅ Scores for Latin America:
	- Mean: 57%
	- Std: 40%
	- Min: 0%
	- 10% quantile: 1%
	- 90% quantile: 99%
	- Max: 99%

2025-08-16 02:33:35 ⋅ Scores for Oceania:
	- Mean: 72%
	- Std: 35%
	- Min: 0%
	- 10% quantile: 6%
	- 90% quantile: 99%
	- Max: 99%

2025-08-16 02:34:36 ⋅ Scores for East Asia:
	- Mean: 53%
	- Std: 43%
	- Min: 0%
	- 10% quantile: 0%
	- 90% quantile: 99%
	- Max: 99%

2025-08-16 02:34:53 ⋅ Scores for Anglo-America:
	- Mean: 69%
	- Std: 37%
	- Min: 0%
	- 10% quantile: 3%
	- 90% quantile: 99%
	- Max: 99%

2025-08-16 02:35:02 ⋅ Scores for North Africa:
	- Mean: 42%
	- Std: 44%
	- Min: 0%
	- 10% quantile: 0%
	- 90% quantile: 99%
	- Max: 99%

2025-08-16 02:35:14 ⋅ Scores for Sub-Saharan Africa:
	- Mean: 48%
	- Std: 45%
	- Min: 0%
	- 10% quantile: 0%
	- 90% quantile: 99%
	- Max: 99%

2025-08-16 02:35:29 ⋅ Scores for Middle East:
	- Mean: 39%
	- Std: 45%
	- Min: 0%
	- 10% quantile: 0%
	- 90% quantile: 99%
	- Max: 99%

2025-08-16 02:35:40 ⋅ Scores for Central Asia:
	- Mean: 48%
	- Std: 43%
	- Min: 0%
	- 10% quantile: 0%
	- 90% quantile: 99%
	- Max: 99%

Lastly, added a new push_pipeline_to_hub script, which creates a Hugging Face repo with the pipeline in it, to make it easier to download in EuroEval.

@saattrupdan saattrupdan merged commit 015dd48 into main Aug 18, 2025
2 checks passed
@saattrupdan saattrupdan deleted the change_log_likelihood_squashing branch August 18, 2025 11:19
3 participants