This repo hosts the code for the paper *Evaluating LLMs on Real-World Forecasting Against Expert Forecasters*.
Use uv to set up your environment:

```bash
pip install uv
uv sync
```

You may also use pip or poetry to install the dependencies.
`newspipeline.py` takes each question and its open_date and retrieves the relevant articles published before that date from AskNews.
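The core of this retrieval step is filtering article candidates by publication date. A minimal sketch (the `pub_date` field name is an assumption for illustration, not the AskNews schema):

```python
from datetime import date

def articles_before_open(articles, open_date):
    """Keep only articles published strictly before the question's open date.

    `articles` is a list of dicts with an ISO-format `pub_date` key
    (hypothetical field name, for illustration only).
    """
    cutoff = date.fromisoformat(open_date)
    return [a for a in articles if date.fromisoformat(a["pub_date"]) < cutoff]
```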
`batch_prediction.py` is the orchestrator script for running multiple models with async/concurrent processing.
`models.py` contains all the code for calling models; prompts live in `prompts.py`.
```bash
python batch_prediction.py [OPTIONS]
```

- `--mode` Prediction mode: `direct`, `narrative`, `baseline`, or `both` (default: `direct`)
- `--model` Single model to run
- `--models` Comma-separated list of models to run
- `--runs` Number of runs per question (default: 5)
- `--dataset` Dataset to use: `aibq3` or `aibq4` (default: `aibq3`)

For o3, I used the batch API with a different script:

```bash
python batch_prediction_o3.py --dataset aibq3 --mode direct --runs 5
```

Supported models: claude-3.5-sonnet, claude-3.6-sonnet, gpt-4o, gpt-4o-mini, gpt-4.1, o4-mini, o3, deepseek-v3, deepseek-r1, qwen3-72b, qwen3-30b-a3b
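Since each question is run several times (`--runs`, default 5), the per-question probabilities are typically aggregated before scoring against the resolved outcome. A hedged sketch of averaging runs and computing a Brier score, a standard forecasting metric (this is illustrative, not the repo's evaluation code):

```python
def mean_probability(runs):
    """Average the probabilities from repeated runs on one question."""
    return sum(runs) / len(runs)

def brier_score(prob, outcome):
    """Squared error between a forecast probability and the 0/1 resolution.
    Lower is better; 0.25 is the score of an uninformative 0.5 forecast."""
    return (prob - outcome) ** 2
```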
- aibq3: `data_metaculus/metaculus_data_aibq3_wd.json` (main Q3 dataset)
- aibq4: `data_metaculus/metaculus_data_aibq4_subset_RT.json` (Q4 dataset with 130 questions that matches the Q3 dataset's category distribution)
- `extract_probabilities.py [file.json]` extracts probabilities from the JSON and writes them to a CSV
- `extract_probabilities_narrative.py [file.json]` extracts narrative probabilities from the JSON and writes them to a CSV
- `extract_probabilities_conf.py` extracts probabilities from `aibq3_predictions_conf_{model name}.json` and writes them to a CSV
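The JSON-to-CSV extraction amounts to flattening per-question prediction records into rows. A minimal sketch (the `question_id` and `probability` field names are guesses for illustration, not the repo's actual schema):

```python
import csv
import json

def json_to_csv(json_path, csv_path):
    """Flatten a list of prediction records into a two-column CSV.

    Assumes each record is a dict with `question_id` and `probability`
    keys (hypothetical field names, for illustration only).
    """
    with open(json_path) as f:
        records = json.load(f)
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["question_id", "probability"])
        for r in records:
            writer.writerow([r["question_id"], r["probability"]])
```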
`Narrativeprediction.py` asks the LLM to write a script of a conversation between Tetlock and Silver, set the day after the question's scheduled close date, in which they state the probability the model had calculated before the event.
Graphs folder contains visualizations
- `metaculus.py` scrapes all binary, resolved questions
- `metaculus_aibqi.py` scrapes all resolved questions from the aibq{i} tournament
- `metaculus_aibq3_wd.py` scrapes all resolved questions and background information
- `metaculus_data_binary_resolved_all.json` contains ALL binary, resolved questions from Metaculus
- `metaculus_data_binary_resolved_May24.json` contains ALL binary, resolved questions from Metaculus after May 1, 2024
- `metaculus_data_aibq3_wd.json` contains all binary, resolved questions from Metaculus in the Q3 LLM Benchmarking tournament
- `metaculus_data_aibq4_wd.json` contains all binary, resolved questions from Metaculus in the Q4 LLM Benchmarking tournament
- `classification.py` gets Gemini to categorize the questions
- `metaculus_data_aibq3_categories.csv` has all the categories
- `metaculus_data_path.json` exists because claude sonnet 3.5 (old) errored out in the first resolution
- `random_sample.py` gets the category distributions across the Q3 and Q4 tournaments and selects a representative subset by category
- `get_question_details.py` takes the question_ids in a file and gets the information associated with each id out of the JSON files
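The subset-selection step in `random_sample.py` is a stratified sample: each category contributes questions roughly in proportion to its share of the full set. A sketch of the idea (the `category` field name and the rounding logic are assumptions, not the actual script):

```python
import random
from collections import Counter

def stratified_subset(questions, n, seed=0):
    """Pick roughly n questions whose category mix approximates the full set's.

    `questions` is a list of dicts with a `category` key (an assumed field
    name); this is an illustrative sketch, not random_sample.py itself.
    """
    rng = random.Random(seed)
    counts = Counter(q["category"] for q in questions)
    total = len(questions)
    subset = []
    for cat, count in counts.items():
        pool = [q for q in questions if q["category"] == cat]
        # Allocate seats proportionally, keeping at least one per category.
        k = max(1, round(n * count / total))
        subset.extend(rng.sample(pool, min(k, len(pool))))
    return subset[:n]
```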
If the paper or codebase is helpful to your work, please cite:
```bibtex
@misc{lu2025evaluatingllmsrealworldforecasting,
  title={Evaluating LLMs on Real-World Forecasting Against Expert Forecasters},
  author={Janna Lu},
  year={2025},
  eprint={2507.04562},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2507.04562},
}
```