SSM Benchmark: Evaluating LLMs on an Italian Medical Residency Admission Test


tl;dr:

  • 📚 SSM-exam-based benchmark, 2020–2025, multiple-choice format, multimodal
  • 🌀 Multiple-choice options are deterministically scrambled
  • 🚦 Scoring system: +1.0 correct, –0.25 wrong, 0.0 abstain
  • 🤖 0-shot OpenRouter API baseline included to test various LLMs; add your OPENROUTER_API_KEY to a .env file
  • 🩻 Handles image-based questions (use with models that support vision)
  • 📦 Install: pip install -r requirements.txt (Python 3.8+)
  • ▶️ Quick run: python inference/run_openrouter.py --model anthropic/claude-3-opus --years 2025
  • 🔎 Results: Written to results/ as JSON

Introduction

The SSM (Italian state exam for medical specializations, Selezione Scuole di Specializzazione Medica) is the official test, administered by the Italian Ministry of University and Research, that over 15,000 medical doctors take each year to compete for residency positions in Italy.

SSM Benchmark is a standardized evaluation framework for assessing the medical knowledge and reasoning capabilities of LLMs on Italian medical examination questions.

Dataset Overview

The benchmark includes SSM test questions from 2020 to 2025, covering various medical specialties and question types. Each question is multiple choice with five options (A–E), plus an optional abstention option (F).

Key Features:

  • Multimodal Support: Questions may include medical images (imaging, blood panels, flow charts, tables, etc.), around 10 images per exam year; a sketch of how images can be attached to a request follows this list
  • SSM Scoring System: Implements the official scoring rules (+1.0 correct, -0.25 wrong, 0.0 abstain)
  • Deterministic Scrambling: Choice order is scrambled with a fixed seed to mitigate position bias
  • Abstention Option: Models can choose option 'F' to abstain, avoiding the penalty for wrong answers
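
For the image-based questions, the PNGs listed in a question's images field need to be attached to the request. A minimal sketch of the message shape, assuming the OpenAI-style multimodal format that OpenRouter accepts for vision models (the repository's actual prompt assembly lives in inference/prompting.py):

import base64

def build_user_message(prompt: str, image_paths: list[str]) -> dict:
    # One text part, then one image part per attached file, inlined as a
    # base64 data URL (assumed format; check the OpenRouter docs for your model).
    parts = [{"type": "text", "text": prompt}]
    for path in image_paths:
        with open(path, "rb") as f:
            encoded = base64.b64encode(f.read()).decode("ascii")
        parts.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{encoded}"},
        })
    return {"role": "user", "content": parts}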

Repository Structure

.
├── README.md                                 
├── requirements.txt                          # Python dependencies
├── LIMITATIONS_AND_FAILURES.md               # Known limitations and issues
├── data/
│   ├── raw/                                  # Original PDF files (ssm2020.pdf, etc.)
│   ├── interim/                              # Extracted text artifacts
│   └── processed/                            # Final processed JSON datasets
│       ├── ssm2020.json
│       ├── ssm2021.json
│       ├── ...
│       └── images/                            # Extracted medical images by year
├── src/
│   ├── parser/                                # PDF extraction and parsing logic
│   │   ├── pdf_text.py
│   │   ├── ssm_pdf.py
│   │   └── ssm_pdf_legacy.py
│   └── schemas.py                             # Data models and validation
├── scripts/
│   ├── parse_pdf.py                           # Parse PDF to JSON
│   ├── parse_all.py                           # Batch parsing
│   └── validate_dataset.py                    # Validate JSON format
├── inference/                                  # Inference scripts and utilities
│   ├── run_openrouter.py                      # Main inference script for OpenRouter models
│   ├── dataset_loader.py                      # Dataset loading utilities
│   ├── prompting.py                           # Prompt construction
│   ├── scramble.py                            # Choice scrambling logic
│   └── openrouter_client.py                   # OpenRouter API client
├── evaluation/                                 # Evaluation scripts
│   └── evaluate_results.py                    # Evaluation script
└── results/                                   # Inference results (JSON files)

Installation and Dependencies

This project requires Python 3.8+.

Setup

  1. Clone the repository:
git clone https://github.com/alessandroliva/ssm_benchmark
cd ssm_benchmark
  2. Install dependencies:
pip install -r requirements.txt
  3. Set up environment variables (for the OpenRouter API):
export OPENROUTER_API_KEY="your_key_here"
  4. Alternatively, create a .env file containing:
OPENROUTER_API_KEY="your_key_here"
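
Either way, the key must be present in the process environment at runtime. A minimal sketch of loading it, assuming the python-dotenv package is available (see requirements.txt for the actual dependency list):

import os
from dotenv import load_dotenv

load_dotenv()  # no-op if there is no .env file; an exported variable still applies
api_key = os.environ["OPENROUTER_API_KEY"]  # raises KeyError if the key is missing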

Running the Benchmark

Inference

The inference scripts are located in the inference/ directory:

cd inference

Arguments

The inference script supports:

  • --model: OpenRouter model ID (e.g., openai/gpt-4o, anthropic/claude-3-opus)
  • --years: Years to include (can specify multiple: --years 2024 --years 2025)
  • --seed: Seed for deterministic choice scrambling (default: 14)
  • --limit: Limit number of questions (useful for testing)
  • --delay: Delay in seconds between API requests (default: 1.0)
  • --with-explanation: Ask the model to include brief explanations in the JSON output
  • --dataset-dir: Directory for processed JSON datasets (default: ../data/processed)
  • --out-dir: Directory to save results (default: ../results)
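
These flags map onto a fairly standard argparse setup. A sketch under that assumption (the authoritative definitions are in inference/run_openrouter.py):

import argparse

parser = argparse.ArgumentParser(description="Run an OpenRouter model on the SSM benchmark")
parser.add_argument("--model", required=True, help="OpenRouter model ID, e.g. openai/gpt-4o")
parser.add_argument("--years", type=int, action="append", help="repeat the flag to add years")
parser.add_argument("--seed", type=int, default=14, help="seed for deterministic scrambling")
parser.add_argument("--limit", type=int, default=None, help="cap the number of questions")
parser.add_argument("--delay", type=float, default=1.0, help="seconds between API requests")
parser.add_argument("--with-explanation", action="store_true", help="ask for brief explanations")
parser.add_argument("--dataset-dir", default="../data/processed")
parser.add_argument("--out-dir", default="../results")
args = parser.parse_args()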

Basic Usage

# Run a model on a single year
python run_openrouter.py --model anthropic/claude-3-opus --years 2025

# Run a model on more than one year
python run_openrouter.py --model openai/gpt-4o --years 2024 --years 2025

Rate Limiting and Partial Runs

# Limit number of questions for testing
python run_openrouter.py \
  --model mistralai/mistral-small \
  --years 2025 \
  --limit 10

# Adjust API rate limiting (2 second delay between requests)
python run_openrouter.py \
  --model openai/gpt-4o \
  --years 2025 \
  --delay 2.0
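
Internally, --delay just spaces out the API calls. A minimal sketch of such a loop against OpenRouter's OpenAI-compatible chat completions endpoint (the repository's real client lives in inference/openrouter_client.py):

import os
import time
import requests

API_URL = "https://openrouter.ai/api/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}

def ask(model: str, messages: list[dict], delay: float = 1.0) -> str:
    resp = requests.post(API_URL, headers=HEADERS,
                         json={"model": model, "messages": messages}, timeout=120)
    resp.raise_for_status()
    time.sleep(delay)  # corresponds to --delay: wait before the next request
    return resp.json()["choices"][0]["message"]["content"]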

Evaluation

The evaluation scripts are located in the evaluation/ directory:

cd evaluation

Basic Evaluation

# Evaluate results and compute SSM score
python evaluate_results.py ../results/openrouter_openai_gpt-4o_20260103_120000.json

Scoring Rules:

  • Correct Answer: +1.0 point
  • Wrong Answer: -0.25 points
  • Abstained (F) / Invalid: 0.0 points
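
Per question, these rules reduce to a small function. A sketch (the authoritative implementation is evaluation/evaluate_results.py):

def score_delta(predicted: str, ground_truth: str) -> float:
    # Abstentions ('F') and unparseable answers score zero.
    if predicted not in {"A", "B", "C", "D", "E"}:
        return 0.0
    return 1.0 if predicted == ground_truth else -0.25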

The evaluation script provides:

  • Overall accuracy
  • Accuracy on attempted questions (excluding abstentions)
  • Total SSM score
  • Breakdown of correct, wrong, and abstained answers
  • Top errors breakdown
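
A sketch of how those summary numbers fall out of a results file, assuming the field names from the Result Format section below:

import json

def summarize(results_path: str) -> dict:
    with open(results_path, encoding="utf-8") as f:
        run = json.load(f)
    results = run["results"]
    attempted = [r for r in results if r["predicted_answer"] in {"A", "B", "C", "D", "E"}]
    correct = sum(r["correct"] for r in results)
    return {
        "accuracy": correct / len(results),
        "accuracy_attempted": correct / len(attempted) if attempted else 0.0,
        "ssm_score": sum(r["score_delta"] for r in results),
        "correct": correct,
        "wrong": len(attempted) - correct,
        "abstained_or_invalid": len(results) - len(attempted),
    }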

Data Format

The processed dataset is a JSON array of question objects. Each object represents a single multiple-choice question:

{
  "id": "ssm2025861",
  "year": 2025,
  "source_pdf": "ssm2025.pdf",
  "question_number": 1,
  "prompt": "Un bambino di 5 anni nato a termine...",
  "choices": {
    "A": "toracotomia postero-laterale sinistra nel IV spazio intercostale",
    "B": "toracotomia postero-laterale sinistra nel VII spazio intercostale",
    "C": "sternotomia mediana longitudinale",
    "D": "toracotomia anteriore sinistra nel VI spazio intercostale",
    "E": "toracotomia postero-laterale destra nel V spazio intercostale"
  },
  "answer": "A",
  "images": ["data/processed/images/ssm2025/ssm2025885.png"]
}
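
A minimal sketch of loading and sanity-checking these files, assuming the schema above (the repository's own loader and data models are inference/dataset_loader.py and src/schemas.py):

import json
from pathlib import Path

def load_years(dataset_dir: str, years: list[int]) -> list[dict]:
    questions = []
    for year in years:
        with open(Path(dataset_dir) / f"ssm{year}.json", encoding="utf-8") as f:
            data = json.load(f)
        for q in data:
            # Light checks against the schema shown above.
            assert set(q["choices"]) == {"A", "B", "C", "D", "E"}
            assert q["answer"] in q["choices"]
        questions.extend(data)
    return questions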

Result Format

Inference results are saved as JSON files with the following structure:

{
  "model": "openai/gpt-4o",
  "run_seed": 42,
  "timestamp": "20260103_120000",
  "inference_mode": "0-shot",
  "with_explanation": false,
  "years": [2024, 2025],
  "results": [
    {
      "question_id": "ssm2025861",
      "year": 2025,
      "predicted_answer": "A",
      "ground_truth_original": "A",
      "ground_truth_scrambled": "C",
      "scrambled_choices": {
        "A": "...",
        "B": "...",
        "C": "original A answer",
        "D": "...",
        "E": "..."
      },
      "correct": true,
      "score_delta": 1.0,
      "raw_response": "..."
    }
  ],
  "summary": {
    "total_questions": 100,
    "correct": 75,
    "score": 68.75
  }
}
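
The scrambled fields are a seeded permutation of the choice letters: ground_truth_scrambled is wherever the original answer letter lands after shuffling. A sketch of the idea (per-question seeding here is an assumption; the actual logic is in inference/scramble.py):

import random

LETTERS = ["A", "B", "C", "D", "E"]

def scramble(choices: dict, answer: str, run_seed: int, question_id: str):
    # A per-question RNG keeps the permutation reproducible for a given run seed.
    rng = random.Random(f"{run_seed}:{question_id}")
    order = LETTERS[:]  # order[i] = original letter shown at position i
    rng.shuffle(order)
    scrambled = {LETTERS[i]: choices[orig] for i, orig in enumerate(order)}
    return scrambled, LETTERS[order.index(answer)]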

License

This work is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
See: https://creativecommons.org/licenses/by-nc/4.0/deed.en

Citation

If you find this repository useful for your work, please cite:

@misc{ssm_benchmark,
  title={SSM Benchmark: Evaluating LLMs on Italian Medical Specialty Examinations},
  author={Alessandro Oliva},
  year={2025},
  url={https://github.com/alessandroliva/ssm_benchmark}
}

Contact

For any queries or collaboration opportunities, please do not hesitate to contact me here
