SSM Benchmark: Evaluating LLMs on an Italian Medical Residency Admission Test


tl;dr:

  • 📚 SSM-exam-based benchmark, 2020–2025, multiple-choice format, multimodal
  • 🌀 Multiple-choice options are deterministically scrambled
  • 🚦 Scoring system: +1.0 correct, –0.25 wrong, 0.0 abstain
  • 🤖 0-shot OpenRouter API baseline included to test various LLMs; add your OPENROUTER_API_KEY to a .env file
  • 🩻 Handles image-based questions (use with models that support vision)
  • 📦 Install: pip install -r requirements.txt (Python 3.8+)
  • ▶️ Quick run: python inference/run_openrouter.py --model anthropic/claude-3-opus --years 2025
  • 🔎 Results: Written to results/ as JSON

Introduction

The SSM (Italian state exam for medical specializations, Selezione Scuole di Specializzazione Medica) is the official test, administered by the Italian Ministry of University and Research, that over 15,000 medical doctors take each year to compete for residency positions in Italy.

SSM Benchmark is a standardized evaluation framework for assessing the medical knowledge and reasoning capabilities of LLMs on Italian medical examination questions.

Dataset Overview

The benchmark includes SSM test questions from 2020 to 2025, covering various medical specialties and question types. Each question is multiple choice with five options (A–E), plus an optional abstention option (F).

Key Features:

  • Multimodal Support: Questions may include medical images (imaging, blood panels, flow charts, tables, etc.), around 10 images per exam year; a sketch of how images can be attached to a request follows this list
  • SSM Scoring System: Implements the official scoring rules (+1.0 correct, -0.25 wrong, 0.0 abstain)
  • Deterministic Scrambling: Choice order is scrambled with a fixed seed to mitigate position bias
  • Abstention Option: Models can choose option 'F' to abstain, avoiding the penalty for wrong answers
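
For the image-based questions, the PNGs listed in a question's images field need to be attached to the request. A minimal sketch of the message shape, assuming the OpenAI-style multimodal format that OpenRouter accepts for vision models (the repository's actual prompt assembly lives in inference/prompting.py):

import base64

def build_user_message(prompt: str, image_paths: list[str]) -> dict:
    # One text part, then one image part per attached file, inlined as a
    # base64 data URL (assumed format; check the OpenRouter docs for your model).
    parts = [{"type": "text", "text": prompt}]
    for path in image_paths:
        with open(path, "rb") as f:
            encoded = base64.b64encode(f.read()).decode("ascii")
        parts.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{encoded}"},
        })
    return {"role": "user", "content": parts}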

Repository Structure

.
├── README.md                                 
├── requirements.txt                          # Python dependencies
├── LIMITATIONS_AND_FAILURES.md               # Known limitations and issues
├── data/
│   ├── raw/                                  # Original PDF files (ssm2020.pdf, etc.)
│   ├── interim/                              # Extracted text artifacts
│   └── processed/                            # Final processed JSON datasets
│       ├── ssm2020.json
│       ├── ssm2021.json
│       ├── ...
│       └── images/                            # Extracted medical images by year
├── src/
│   ├── parser/                                # PDF extraction and parsing logic
│   │   ├── pdf_text.py
│   │   ├── ssm_pdf.py
│   │   └── ssm_pdf_legacy.py
│   └── schemas.py                             # Data models and validation
├── scripts/
│   ├── parse_pdf.py                           # Parse PDF to JSON
│   ├── parse_all.py                           # Batch parsing
│   └── validate_dataset.py                    # Validate JSON format
├── inference/                                  # Inference scripts and utilities
│   ├── run_openrouter.py                      # Main inference script for OpenRouter models
│   ├── dataset_loader.py                      # Dataset loading utilities
│   ├── prompting.py                           # Prompt construction
│   ├── scramble.py                            # Choice scrambling logic
│   └── openrouter_client.py                   # OpenRouter API client
├── evaluation/                                 # Evaluation scripts
│   └── evaluate_results.py                    # Evaluation script
└── results/                                   # Inference results (JSON files)

Installation and Dependencies

This project requires Python 3.8+.

Setup

  1. Clone the repository:
git clone https://github.com/alessandroliva/ssm_benchmark
cd ssm_benchmark
  2. Install dependencies:
pip install -r requirements.txt
  3. Set up environment variables (for the OpenRouter API):
export OPENROUTER_API_KEY="your_key_here"
  4. Alternatively, create a .env file containing:
OPENROUTER_API_KEY="your_key_here"
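
Either way, the key must be present in the process environment at runtime. A minimal sketch of loading it, assuming the python-dotenv package is available (see requirements.txt for the actual dependency list):

import os
from dotenv import load_dotenv

load_dotenv()  # no-op if there is no .env file; an exported variable still applies
api_key = os.environ["OPENROUTER_API_KEY"]  # raises KeyError if the key is missing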

Running the Benchmark

Inference

The inference scripts are located in the inference/ directory:

cd inference

Arguments

The inference script supports:

  • --model: OpenRouter model ID (e.g., openai/gpt-4o, anthropic/claude-3-opus)
  • --years: Years to include (can specify multiple: --years 2024 --years 2025)
  • --seed: Seed for deterministic choice scrambling (default: 14)
  • --limit: Limit number of questions (useful for testing)
  • --delay: Delay in seconds between API requests (default: 1.0)
  • --with-explanation: Ask the model to include brief explanations in the JSON output
  • --dataset-dir: Directory for processed JSON datasets (default: ../data/processed)
  • --out-dir: Directory to save results (default: ../results)
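
These flags map onto a fairly standard argparse setup. A sketch under that assumption (the authoritative definitions are in inference/run_openrouter.py):

import argparse

parser = argparse.ArgumentParser(description="Run an OpenRouter model on the SSM benchmark")
parser.add_argument("--model", required=True, help="OpenRouter model ID, e.g. openai/gpt-4o")
parser.add_argument("--years", type=int, action="append", help="repeat the flag to add years")
parser.add_argument("--seed", type=int, default=14, help="seed for deterministic scrambling")
parser.add_argument("--limit", type=int, default=None, help="cap the number of questions")
parser.add_argument("--delay", type=float, default=1.0, help="seconds between API requests")
parser.add_argument("--with-explanation", action="store_true", help="ask for brief explanations")
parser.add_argument("--dataset-dir", default="../data/processed")
parser.add_argument("--out-dir", default="../results")
args = parser.parse_args()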

Basic Usage

# Run a model on a single year
python run_openrouter.py --model anthropic/claude-3-opus --years 2025

# Run a model on more than one year
python run_openrouter.py --model openai/gpt-4o --years 2024 --years 2025

Rate Limiting and Partial Runs

# Limit number of questions for testing
python run_openrouter.py \
  --model mistralai/mistral-small \
  --years 2025 \
  --limit 10

# Adjust API rate limiting (2 second delay between requests)
python run_openrouter.py \
  --model openai/gpt-4o \
  --years 2025 \
  --delay 2.0
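
Internally, --delay just spaces out the API calls. A minimal sketch of such a loop against OpenRouter's OpenAI-compatible chat completions endpoint (the repository's real client lives in inference/openrouter_client.py):

import os
import time
import requests

API_URL = "https://openrouter.ai/api/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}

def ask(model: str, messages: list[dict], delay: float = 1.0) -> str:
    resp = requests.post(API_URL, headers=HEADERS,
                         json={"model": model, "messages": messages}, timeout=120)
    resp.raise_for_status()
    time.sleep(delay)  # corresponds to --delay: wait before the next request
    return resp.json()["choices"][0]["message"]["content"]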

Evaluation

The evaluation scripts are located in the evaluation/ directory:

cd evaluation

Basic Evaluation

# Evaluate results and compute SSM score
python evaluate_results.py ../results/openrouter_openai_gpt-4o_20260103_120000.json

Scoring Rules:

  • Correct Answer: +1.0 point
  • Wrong Answer: -0.25 points
  • Abstained (F) / Invalid: 0.0 points
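
Per question, these rules reduce to a small function. A sketch (the authoritative implementation is evaluation/evaluate_results.py):

def score_delta(predicted: str, ground_truth: str) -> float:
    # Abstentions ('F') and unparseable answers score zero.
    if predicted not in {"A", "B", "C", "D", "E"}:
        return 0.0
    return 1.0 if predicted == ground_truth else -0.25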

The evaluation script provides:

  • Overall accuracy
  • Accuracy on attempted questions (excluding abstentions)
  • Total SSM score
  • Breakdown of correct, wrong, and abstained answers
  • Top errors breakdown
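
A sketch of how those summary numbers fall out of a results file, assuming the field names from the Result Format section below:

import json

def summarize(results_path: str) -> dict:
    with open(results_path, encoding="utf-8") as f:
        run = json.load(f)
    results = run["results"]
    attempted = [r for r in results if r["predicted_answer"] in {"A", "B", "C", "D", "E"}]
    correct = sum(r["correct"] for r in results)
    return {
        "accuracy": correct / len(results),
        "accuracy_attempted": correct / len(attempted) if attempted else 0.0,
        "ssm_score": sum(r["score_delta"] for r in results),
        "correct": correct,
        "wrong": len(attempted) - correct,
        "abstained_or_invalid": len(results) - len(attempted),
    }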

Data Format

The processed dataset is a JSON array of question objects. Each object represents a single multiple-choice question:

{
  "id": "ssm2025861",
  "year": 2025,
  "source_pdf": "ssm2025.pdf",
  "question_number": 1,
  "prompt": "Un bambino di 5 anni nato a termine...",
  "choices": {
    "A": "toracotomia postero-laterale sinistra nel IV spazio intercostale",
    "B": "toracotomia postero-laterale sinistra nel VII spazio intercostale",
    "C": "sternotomia mediana longitudinale",
    "D": "toracotomia anteriore sinistra nel VI spazio intercostale",
    "E": "toracotomia postero-laterale destra nel V spazio intercostale"
  },
  "answer": "A",
  "images": ["data/processed/images/ssm2025/ssm2025885.png"]
}
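
A minimal sketch of loading and sanity-checking these files, assuming the schema above (the repository's own loader and data models are inference/dataset_loader.py and src/schemas.py):

import json
from pathlib import Path

def load_years(dataset_dir: str, years: list[int]) -> list[dict]:
    questions = []
    for year in years:
        with open(Path(dataset_dir) / f"ssm{year}.json", encoding="utf-8") as f:
            data = json.load(f)
        for q in data:
            # Light checks against the schema shown above.
            assert set(q["choices"]) == {"A", "B", "C", "D", "E"}
            assert q["answer"] in q["choices"]
        questions.extend(data)
    return questions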

Result Format

Inference results are saved as JSON files with the following structure:

{
  "model": "openai/gpt-4o",
  "run_seed": 42,
  "timestamp": "20260103_120000",
  "inference_mode": "0-shot",
  "with_explanation": false,
  "years": [2024, 2025],
  "results": [
    {
      "question_id": "ssm2025861",
      "year": 2025,
      "predicted_answer": "A",
      "ground_truth_original": "A",
      "ground_truth_scrambled": "C",
      "scrambled_choices": {
        "A": "...",
        "B": "...",
        "C": "original A answer",
        "D": "...",
        "E": "..."
      },
      "correct": true,
      "score_delta": 1.0,
      "raw_response": "..."
    }
  ],
  "summary": {
    "total_questions": 100,
    "correct": 75,
    "score": 68.75
  }
}
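
The scrambled fields are a seeded permutation of the choice letters: ground_truth_scrambled is wherever the original answer letter lands after shuffling. A sketch of the idea (per-question seeding here is an assumption; the actual logic is in inference/scramble.py):

import random

LETTERS = ["A", "B", "C", "D", "E"]

def scramble(choices: dict, answer: str, run_seed: int, question_id: str):
    # A per-question RNG keeps the permutation reproducible for a given run seed.
    rng = random.Random(f"{run_seed}:{question_id}")
    order = LETTERS[:]  # order[i] = original letter shown at position i
    rng.shuffle(order)
    scrambled = {LETTERS[i]: choices[orig] for i, orig in enumerate(order)}
    return scrambled, LETTERS[order.index(answer)]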

License

This work is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
See: https://creativecommons.org/licenses/by-nc/4.0/deed.en

Citation

If you find this repository useful for your work, please cite:

@misc{ssm_benchmark,
  title={SSM Benchmark: Evaluating LLMs on Italian Medical Specialty Examinations},
  author={Alessandro Oliva},
  year={2025},
  url={https://github.com/alessandroliva/ssm_benchmark}
}

Contact

For any queries or collaboration opportunities, please do not hesitate to contact me here
