
Forecasting for LLMs

This repo hosts the code for the paper Evaluating LLMs on Real-World Forecasting Against Expert Forecasters.

Get started

Use uv to set up your environment:

pip install uv
uv sync

You can also install the dependencies with pip or poetry.
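
For example, assuming the dependencies are declared in pyproject.toml (which uv sync implies), either of the following should work:

pip install -e .
# or
poetry install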

Getting predictions from LLMs

newspipeline.py takes each question and its open_date and retrieves the relevant articles published before that date from AskNews
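
A minimal sketch of that retrieval step, assuming the asknews-sdk Python client; the parameter names end_timestamp and method are assumptions and may differ from the repo's actual code:

from datetime import datetime, timezone
from asknews_sdk import AskNewsSDK  # assumed client: pip install asknews

sdk = AskNewsSDK(client_id="YOUR_ID", client_secret="YOUR_SECRET")

def articles_before(question_title: str, open_date: str, n: int = 10):
    # Only fetch articles published before the question's open_date.
    cutoff = int(datetime.fromisoformat(open_date)
                 .replace(tzinfo=timezone.utc).timestamp())
    return sdk.news.search_news(
        query=question_title,
        n_articles=n,
        end_timestamp=cutoff,  # assumed name for the date upper bound
        method="kw",           # assumed keyword-search mode
    )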

batch_prediction.py is the orchestrator script for running multiple models with async/concurrent processing

models.py contains all the code for calling the models. Prompts are located in prompts.py.
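
The concurrency pattern is roughly the sketch below; call_model is a stand-in for the real API call in models.py, and every name here is illustrative rather than the repo's actual API:

import asyncio

async def call_model(model: str, question: dict) -> float:
    # Stand-in for the real model call defined in models.py.
    await asyncio.sleep(0)
    return 0.5

async def predict(model: str, question: dict, sem: asyncio.Semaphore):
    # Bound in-flight requests so providers aren't rate-limited.
    async with sem:
        return await call_model(model, question)

async def run_batch(models: list[str], questions: list[dict], limit: int = 10):
    sem = asyncio.Semaphore(limit)
    tasks = [predict(m, q, sem) for m in models for q in questions]
    # gather() runs every (model, question) pair concurrently.
    return await asyncio.gather(*tasks)

# results = asyncio.run(run_batch(["gpt-4o", "o3"], questions))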

Available flags:

python batch_prediction.py [OPTIONS]

--mode          Prediction mode: direct, narrative, baseline, both (default: direct)
--model         Single model to run 
--models        Comma-separated list of models to run
--runs          Number of runs per question (default: 5)
--dataset       Dataset to use: aibq3, aibq4 (default: aibq3)
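
For example, to run two models concurrently on the Q3 dataset (the exact model identifiers accepted by --models are an assumption based on the list in the next section):

python batch_prediction.py --models gpt-4o,deepseek-r1 --mode direct --runs 5 --dataset aibq3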

For o3, I used the batch API with a different script:

python batch_prediction_o3.py --dataset aibq3 --mode direct --runs 5

Models tested in the paper

claude-3.5-sonnet, claude-3.6-sonnet, gpt-4o, gpt-4o-mini, gpt-4.1, o4-mini, o3, deepseek-v3, deepseek-r1, qwen3-72b, qwen3-30b-a3b

Datasets:

  • aibq3: data_metaculus/metaculus_data_aibq3_wd.json (main Q3 dataset)
  • aibq4: data_metaculus/metaculus_data_aibq4_subset_RT.json (a 130-question Q4 subset that matches the Q3 dataset's category distribution)
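
Both files load as ordinary JSON; a quick way to inspect them (assuming each file holds a list of question records):

import json

with open("data_metaculus/metaculus_data_aibq3_wd.json") as f:
    questions = json.load(f)

print(len(questions), "questions")
print(questions[0].keys())  # inspect the available fields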

Processing the data

  • extract_probabilities.py [file.json] gets probabilities from the JSON and writes them to a .csv
  • extract_probabilities_narrative.py [file.json] gets narrative probabilities from the JSON and writes them to a .csv
  • extract_probabilities_conf.py gets probabilities from aibq3_predictions_conf_{model name}.json and writes them to a .csv
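
Typical usage (the input filename here is hypothetical):

python extract_probabilities.py aibq3_predictions_gpt-4o.json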

narrativeprediction.py asks the LLM to write a script between Tetlock and Silver, set the day after the question's scheduled close date, in which they state the probability the models had calculated before the event.
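
A hypothetical sketch of that prompt framing, not the repo's actual prompt (which lives in prompts.py):

NARRATIVE_PROMPT = """It is {day_after_close}. Write a short script in which
Philip Tetlock and Nate Silver discuss the question below, and have them state
the probability their models had calculated before the event resolved.

Question: {question_title}
Background: {background}"""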

Graphs folder contains visualizations

Metaculus folder structure

  • metaculus.py scrapes all binary, resolved questions (see the API sketch after this list).
  • metaculus_aibqi.py scrapes all resolved questions from the aibq{i} tournament
  • metaculus_aibq3_wd.py scrapes all resolved Q3 questions along with their background information
  • metaculus_data_binary_resolved_all.json contains ALL binary, resolved questions from Metaculus
  • metaculus_data_binary_resolved_May24.json contains ALL binary, resolved questions from Metaculus after May 1, 2024
  • metaculus_data_aibq3_wd.json contains all binary, resolved questions from Metaculus in Q3 LLM Benchmarking tournament
  • metaculus_data_aibq4_wd.json contains all binary, resolved questions from Metaculus in Q4 LLM Benchmarking tournament
  • classification.py gets Gemini to categorize the questions
  • metaculus_data_aibq3_categories.csv has all the categories
  • metaculus_data_path.json exists because claude-3.5-sonnet (old) errored out in the first resolution
  • random_sample.py gets the category distributions across Q3 and Q4 tournaments and selects a representative subset by category
  • get_question_details.py takes the question_ids in a file and pulls the information associated with each id out of the JSON files
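
A minimal sketch of the scraping step in metaculus.py, assuming the public Metaculus API at https://www.metaculus.com/api2/questions/; the query parameter names and response shape here are assumptions, not verified against the repo:

import requests

URL = "https://www.metaculus.com/api2/questions/"

def fetch_resolved_binary(limit: int = 100):
    # Page through the API, keeping only resolved binary questions.
    params = {"forecast_type": "binary", "status": "resolved",
              "limit": limit, "offset": 0}
    questions = []
    while True:
        page = requests.get(URL, params=params).json()
        questions.extend(page["results"])
        if not page.get("next"):  # standard paginated response (assumed)
            break
        params["offset"] += limit
    return questions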

Citation

If the paper or codebase is helpful to your work, please cite:

@misc{lu2025evaluatingllmsrealworldforecasting,
      title={Evaluating LLMs on Real-World Forecasting Against Expert Forecasters}, 
      author={Janna Lu},
      year={2025},
      eprint={2507.04562},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.04562}, 
}
