This repo hosts the code for the paper *Evaluating LLMs on Real-World Forecasting Against Expert Forecasters*.
Use uv to set up your environment:

```bash
pip install uv
uv sync
```

You may also use pip or poetry to install the dependencies.
`newspipeline.py` takes each question and its open_date and retrieves the relevant articles published before that date from AskNews.
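The core of this retrieval step is filtering article candidates by publication date. A minimal sketch (the `pub_date` field name is an assumption for illustration, not the AskNews schema):

```python
from datetime import date

def articles_before_open(articles, open_date):
    """Keep only articles published strictly before the question's open date.

    `articles` is a list of dicts with an ISO-format `pub_date` key
    (hypothetical field name, for illustration only).
    """
    cutoff = date.fromisoformat(open_date)
    return [a for a in articles if date.fromisoformat(a["pub_date"]) < cutoff]
```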
`batch_prediction.py` is the orchestrator script for running multiple models with async/concurrent processing.
`models.py` contains all the code for calling models; prompts live in `prompts.py`.
```bash
python batch_prediction.py [OPTIONS]
```

- `--mode` Prediction mode: `direct`, `narrative`, `baseline`, or `both` (default: `direct`)
- `--model` Single model to run
- `--models` Comma-separated list of models to run
- `--runs` Number of runs per question (default: 5)
- `--dataset` Dataset to use: `aibq3` or `aibq4` (default: `aibq3`)

For o3, I used the batch API with a different script:

```bash
python batch_prediction_o3.py --dataset aibq3 --mode direct --runs 5
```

Supported models: claude-3.5-sonnet, claude-3.6-sonnet, gpt-4o, gpt-4o-mini, gpt-4.1, o4-mini, o3, deepseek-v3, deepseek-r1, qwen3-72b, qwen3-30b-a3b
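Since each question is run several times (`--runs`, default 5), the per-question probabilities are typically aggregated before scoring against the resolved outcome. A hedged sketch of averaging runs and computing a Brier score, a standard forecasting metric (this is illustrative, not the repo's evaluation code):

```python
def mean_probability(runs):
    """Average the probabilities from repeated runs on one question."""
    return sum(runs) / len(runs)

def brier_score(prob, outcome):
    """Squared error between a forecast probability and the 0/1 resolution.
    Lower is better; 0.25 is the score of an uninformative 0.5 forecast."""
    return (prob - outcome) ** 2
```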
- aibq3: `data_metaculus/metaculus_data_aibq3_wd.json` (main Q3 dataset)
- aibq4: `data_metaculus/metaculus_data_aibq4_subset_RT.json` (Q4 dataset with 130 questions that matches the Q3 dataset's category distribution)
- `extract_probabilities.py [file.json]` extracts probabilities from the JSON and writes them to a CSV
- `extract_probabilities_narrative.py [file.json]` extracts narrative probabilities from the JSON and writes them to a CSV
- `extract_probabilities_conf.py` extracts probabilities from `aibq3_predictions_conf_{model name}.json` and writes them to a CSV
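The JSON-to-CSV extraction amounts to flattening per-question prediction records into rows. A minimal sketch (the `question_id` and `probability` field names are guesses for illustration, not the repo's actual schema):

```python
import csv
import json

def json_to_csv(json_path, csv_path):
    """Flatten a list of prediction records into a two-column CSV.

    Assumes each record is a dict with `question_id` and `probability`
    keys (hypothetical field names, for illustration only).
    """
    with open(json_path) as f:
        records = json.load(f)
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["question_id", "probability"])
        for r in records:
            writer.writerow([r["question_id"], r["probability"]])
```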
`Narrativeprediction.py` asks the LLM to write a script of a conversation between Tetlock and Silver, set the day after the question's scheduled close date, in which they state the probability the model had calculated before the event.
Graphs folder contains visualizations
- `metaculus.py` scrapes all binary, resolved questions
- `metaculus_aibqi.py` scrapes all resolved questions from the aibq{i} tournament
- `metaculus_aibq3_wd.py` scrapes all resolved questions and background information
- `metaculus_data_binary_resolved_all.json` contains ALL binary, resolved questions from Metaculus
- `metaculus_data_binary_resolved_May24.json` contains ALL binary, resolved questions from Metaculus after May 1, 2024
- `metaculus_data_aibq3_wd.json` contains all binary, resolved questions from Metaculus in the Q3 LLM Benchmarking tournament
- `metaculus_data_aibq4_wd.json` contains all binary, resolved questions from Metaculus in the Q4 LLM Benchmarking tournament
- `classification.py` gets Gemini to categorize the questions
- `metaculus_data_aibq3_categories.csv` has all the categories
- `metaculus_data_path.json` exists because claude sonnet 3.5 (old) errored out in the first resolution
- `random_sample.py` gets the category distributions across the Q3 and Q4 tournaments and selects a representative subset by category
- `get_question_details.py` takes the question_ids in a file and gets the information associated with each id out of the JSON files
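The subset-selection step in `random_sample.py` is a stratified sample: each category contributes questions roughly in proportion to its share of the full set. A sketch of the idea (the `category` field name and the rounding logic are assumptions, not the actual script):

```python
import random
from collections import Counter

def stratified_subset(questions, n, seed=0):
    """Pick roughly n questions whose category mix approximates the full set's.

    `questions` is a list of dicts with a `category` key (an assumed field
    name); this is an illustrative sketch, not random_sample.py itself.
    """
    rng = random.Random(seed)
    counts = Counter(q["category"] for q in questions)
    total = len(questions)
    subset = []
    for cat, count in counts.items():
        pool = [q for q in questions if q["category"] == cat]
        # Allocate seats proportionally, keeping at least one per category.
        k = max(1, round(n * count / total))
        subset.extend(rng.sample(pool, min(k, len(pool))))
    return subset[:n]
```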
If the paper or codebase is helpful to your work, please cite:
```bibtex
@misc{lu2025evaluatingllmsrealworldforecasting,
  title={Evaluating LLMs on Real-World Forecasting Against Expert Forecasters},
  author={Janna Lu},
  year={2025},
  eprint={2507.04562},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2507.04562},
}
```