This repository is the official implementation of the paper "Fine-Grained Uncertainty Decomposition in Large Language Models: A Spectral Approach", together with the baselines we compare against. All results and plots in the paper can be reproduced by running the code in this repository.
Paper arXiv: https://arxiv.org/abs/2509.22272
First, create a virtual environment with Python 3.12 and install all requirements using

```bash
pip install -r requirements.txt
```

For the code to run properly, you need to set the following environment variables before launching the experiments:

- `HF_CACHE`: path for storing Hugging Face models.
- `OPENAI_API_KEY`: OpenAI API key. Necessary for `GPT-4o` and `GPT-4.1` calls.
- `GROQ_API_KEY`: Groq API key. Necessary for `Llama 4 Maverick` calls.
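For example, a minimal shell setup could look like the following (the cache path and keys are placeholders to replace with your own):

```bash
# Create and activate a Python 3.12 virtual environment
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Environment variables expected by the experiment scripts
export HF_CACHE=/path/to/hf/cache   # placeholder: where Hugging Face models are stored
export OPENAI_API_KEY=sk-...        # your OpenAI key
export GROQ_API_KEY=gsk_...         # your Groq key
```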
Start by generating the input clarifications using GPT-4o. The second command simply copies the original inputs into the clarifications directory, so they can serve as input for the baselines that do not use clarification.
```bash
python generate_ensemble.py --config_path configs/generate_ensemble/clarification_zeroshot/AmbigInst-clarification_zeroshot.yaml
python generate_ensemble.py --config_path configs/generate_ensemble/no_ensembling/AmbigInst-no_ensembling.yaml
```
Run target model answer generation for the no-ensembling case (the baselines without clarification), using both Phi-4 14B and Llama 4 Maverick.

```bash
python generate_model_answers.py --config_path configs/generate_model_answers/no_ensembling/AmbigInst-no_ensembling-Llama_4_maverick.yaml
python generate_model_answers.py --config_path configs/generate_model_answers/no_ensembling/AmbigInst-no_ensembling.yaml
```
Repeat model answer generation for the clarification-based methods.

```bash
python generate_model_answers.py --config_path configs/generate_model_answers/clarification_zeroshot/AmbigInst-clarification_zeroshot-Llama4_maverick.yaml
python generate_model_answers.py --config_path configs/generate_model_answers/clarification_zeroshot/AmbigInst-clarification_zeroshot.yaml
```
Compute uncertainty with embeddings for all four configurations (clarification and no-ensembling, Phi-4 and Llama 4). This covers the uncertainty methods that use embeddings (predictive kernel entropy, spectral uncertainty).

```bash
python compute_uncertainty_w_embeddings.py --config_path configs/compute_uncertainty/w_embeddings/AmbigInst-no-llama4_maverick.yaml
python compute_uncertainty_w_embeddings.py --config_path configs/compute_uncertainty/w_embeddings/AmbigInst-no.yaml
python compute_uncertainty_w_embeddings.py --config_path configs/compute_uncertainty/w_embeddings/AmbigInst-zeroshot-llama4_maverick.yaml
python compute_uncertainty_w_embeddings.py --config_path configs/compute_uncertainty/w_embeddings/AmbigInst-zeroshot.yaml
```
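For intuition, embedding-based scores of this kind are spectral quantities computed from a similarity kernel over the sampled answers. Below is a minimal, generic sketch of one such score, the von Neumann entropy of a trace-normalized cosine-similarity kernel; the kernels and estimators actually used by the repo's methods differ (see the paper for the real definitions), and the function name and shapes here are illustrative only.

```python
import numpy as np

def spectral_entropy(embeddings: np.ndarray) -> float:
    """Von Neumann entropy of a trace-normalized cosine-similarity kernel.

    A generic spectral uncertainty score: higher entropy means the sampled
    answers spread over more distinct semantic directions.
    """
    # Row-normalize so that inner products are cosine similarities.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    K = X @ X.T                          # PSD Gram matrix of the answers
    rho = K / np.trace(K)                # normalize to trace 1 ("density matrix")
    eigvals = np.linalg.eigvalsh(rho)    # eigenvalues of a symmetric matrix
    eigvals = eigvals[eigvals > 1e-12]   # drop numerical zeros
    return float(-np.sum(eigvals * np.log(eigvals)))

# Toy usage: 10 sampled answers embedded in a 384-dimensional space.
rng = np.random.default_rng(0)
print(spectral_entropy(rng.normal(size=(10, 384))))
```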
Finally, compute uncertainty without embeddings for the same set of configurations. This covers the uncertainty methods that do not use embeddings (semantic entropy, kernel language entropy, input clarification ensembling).

```bash
python compute_uncertainty_wo_embeddings.py --config_path configs/compute_uncertainty/wo_embeddings/AmbigInst-no-llama4_maverick.yaml
python compute_uncertainty_wo_embeddings.py --config_path configs/compute_uncertainty/wo_embeddings/AmbigInst-no.yaml
python compute_uncertainty_wo_embeddings.py --config_path configs/compute_uncertainty/wo_embeddings/AmbigInst-zeroshot-llama4_maverick.yaml
python compute_uncertainty_wo_embeddings.py --config_path configs/compute_uncertainty/wo_embeddings/AmbigInst-zeroshot.yaml
```

Replace AmbigInst in the instructions above with AmbigQA and rerun.
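If you prefer to script the dataset substitution, a simple loop over the dataset names works, assuming the config files follow the `DATASET-<variant>.yaml` naming convention shown above (a sketch, not a script shipped with the repo):

```bash
# Hypothetical loop over the two aleatoric-uncertainty datasets.
for DATASET in AmbigInst AmbigQA; do
  python generate_ensemble.py --config_path "configs/generate_ensemble/clarification_zeroshot/${DATASET}-clarification_zeroshot.yaml"
  python generate_ensemble.py --config_path "configs/generate_ensemble/no_ensembling/${DATASET}-no_ensembling.yaml"
  # ...repeat the remaining steps with ${DATASET} substituted in the same way.
done
```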
Run the cells in the notebook `evaluation_aleatoric.ipynb` to generate the scores and plots for aleatoric uncertainty.
Start by generating the input clarifications using GPT-4o. As before, the second command simply copies the original inputs into the clarifications directory, so they can serve as input for the baselines that do not use clarification.
```bash
python generate_ensemble.py --config_path configs/generate_ensemble/clarification_zeroshot/TriviaQA-clarification_zeroshot.yaml
python generate_ensemble.py --config_path configs/generate_ensemble/no_ensembling/TriviaQA-no_ensembling.yaml
```
Run target model answer generation for the no-ensembling case (the baselines without clarification), using both Phi-4 14B and Llama 4 Maverick.

```bash
python generate_model_answers.py --config_path configs/generate_model_answers/no_ensembling/TriviaQA-no_ensembling-Llama_4_maverick.yaml
python generate_model_answers.py --config_path configs/generate_model_answers/no_ensembling/TriviaQA-no_ensembling.yaml
```
Repeat model answer generation for the clarification-based methods.

```bash
python generate_model_answers.py --config_path configs/generate_model_answers/clarification_zeroshot/TriviaQA-clarification_zeroshot-Llama4_maverick.yaml
python generate_model_answers.py --config_path configs/generate_model_answers/clarification_zeroshot/TriviaQA-clarification_zeroshot.yaml
```
Generate the standard model outputs for both Llama 4 Maverick and Phi-4 14B. These are the "most likely" answers of the target model to the original questions, and they are later used to build the ground truth for prediction correctness.

```bash
python generate_model_standard_answers_predictive.py --config_path configs/generate_model_standard_answers_predictive/TriviaQA-llama4_maverick.yaml
python generate_model_standard_answers_predictive.py --config_path configs/generate_model_standard_answers_predictive/TriviaQA-phi_4.yaml
```
Use the correctness judge to label each answer: GPT-4.1 decides whether the standard answer is correct or not.

```bash
python correctness_judge_predictive.py --config_path configs/correctness_judge/TriviaQA-llama4_maverick.yaml
python correctness_judge_predictive.py --config_path configs/correctness_judge/TriviaQA-phi_4.yaml
```
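The judging step follows the standard LLM-as-judge pattern. As a rough sketch of what such a call looks like (the function name and prompt wording are hypothetical; the repo's actual prompt and parsing live in `correctness_judge_predictive.py` and its configs):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_correctness(question: str, reference: str, answer: str) -> bool:
    """Hypothetical LLM-as-judge call; prompt wording is illustrative only."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {answer}\n"
        "Is the model answer correct? Reply with exactly 'yes' or 'no'."
    )
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```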
Compute uncertainty with embeddings for all four configurations (clarification and no-ensembling, Phi-4 and Llama 4). This covers the methods relying on embeddings (predictive kernel entropy, spectral uncertainty).

```bash
python compute_uncertainty_w_embeddings.py --config_path configs/compute_uncertainty/w_embeddings/TriviaQA-no-llama4_maverick.yaml
python compute_uncertainty_w_embeddings.py --config_path configs/compute_uncertainty/w_embeddings/TriviaQA-no.yaml
python compute_uncertainty_w_embeddings.py --config_path configs/compute_uncertainty/w_embeddings/TriviaQA-zeroshot-llama4_maverick.yaml
python compute_uncertainty_w_embeddings.py --config_path configs/compute_uncertainty/w_embeddings/TriviaQA-zeroshot.yaml
```
Then compute uncertainty without embeddings. This covers the methods that do not use embeddings (semantic entropy, kernel language entropy, and input clarification ensembling).

```bash
python compute_uncertainty_wo_embeddings.py --config_path configs/compute_uncertainty/wo_embeddings/TriviaQA-no-llama4_maverick.yaml
python compute_uncertainty_wo_embeddings.py --config_path configs/compute_uncertainty/wo_embeddings/TriviaQA-no.yaml
python compute_uncertainty_wo_embeddings.py --config_path configs/compute_uncertainty/wo_embeddings/TriviaQA-zeroshot-llama4_maverick.yaml
python compute_uncertainty_wo_embeddings.py --config_path configs/compute_uncertainty/wo_embeddings/TriviaQA-zeroshot.yaml
```

Replace TriviaQA in the instructions above with OpenNQ and rerun the full pipeline.
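The same scripting trick as in the aleatoric pipeline applies here; for instance, the final uncertainty computations for both datasets could be looped as follows (again assuming the `DATASET-<variant>.yaml` naming convention; not a script shipped with the repo):

```bash
# Hypothetical loop over the two predictive-uncertainty datasets
# and the four config variants used above.
for DATASET in TriviaQA OpenNQ; do
  for VARIANT in no no-llama4_maverick zeroshot zeroshot-llama4_maverick; do
    python compute_uncertainty_w_embeddings.py  --config_path "configs/compute_uncertainty/w_embeddings/${DATASET}-${VARIANT}.yaml"
    python compute_uncertainty_wo_embeddings.py --config_path "configs/compute_uncertainty/wo_embeddings/${DATASET}-${VARIANT}.yaml"
  done
done
```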
Run the cells in the notebook `evaluation_predictive.ipynb` to generate the scores and plots for total uncertainty evaluation.
If you find this work or code useful, please cite:
```bibtex
@article{walha2025fine,
  title={Fine-Grained Uncertainty Decomposition in Large Language Models: A Spectral Approach},
  author={Walha, Nassim and Gruber, Sebastian G and Decker, Thomas and Yang, Yinchong and Javanmardi, Alireza and H{\"u}llermeier, Eyke and Buettner, Florian},
  journal={arXiv preprint arXiv:2509.22272},
  year={2025}
}
```
Everything is licensed under the MIT License.
