- 📚 SSM exam-based benchmark, 2020–2025, multiple-choice format, multimodal
- 🌀 Multiple-choice options are auto-scrambled
- 🚦 Scoring system: +1.0 correct, –0.25 wrong, 0.0 abstain
- 🤖 0-shot OpenRouter API baseline included for testing various LLMs; add your `OPENROUTER_API_KEY` to a `.env` file
- 🩻 Handles image-based questions (use with models that support vision)
- 📦 Install: `pip install -r requirements.txt` (Python 3.8+)
- ▶️ Quick run: `python inference/run_openrouter.py --model anthropic/claude-3-opus --years 2025`
- 🔎 Results: written to `results/` as JSON
The SSM (Italian State Exam for Medical Specializations, Selezione Scuole di Specializzazione Medica) is the official test administered by the Italian Ministry of University that over 15,000 medical doctors take each year to gain access to specialty training positions in Italy.
SSM Benchmark is a standardized evaluation framework for assessing the medical knowledge and reasoning capabilities of LLMs on Italian medical examination questions.
The benchmark includes SSM test questions from 2020 to 2025, covering various medical specialties and question types. Each question is in multiple-choice format with five options (A–E), plus an optional abstention option (F).
Key Features:
- Multimodal Support: Questions may include medical images (imaging, blood panels, flow charts, tables, etc.), roughly 10 images per exam year
- SSM Scoring System: Implements the official scoring rules (+1.0 correct, -0.25 wrong, 0.0 abstain)
- Deterministic Scrambling: Choice order is scrambled with a fixed seed to prevent position bias (see the sketch after this list)
- Abstention Option: Models can choose option 'F' to abstain, avoiding the penalty for wrong answers
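As a rough illustration of the scrambling, here is a minimal sketch; the actual implementation lives in `inference/scramble.py`, and the function name and RNG keying below are assumptions rather than the repo's exact logic:

```python
import random
from typing import Dict, Tuple

LETTERS = ["A", "B", "C", "D", "E"]

def scramble_choices(choices: Dict[str, str], question_id: str,
                     seed: int = 14) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Shuffle choice order deterministically per (seed, question) pair."""
    rng = random.Random(f"{seed}:{question_id}")  # stable across runs
    originals = LETTERS[:]
    rng.shuffle(originals)
    # New letter -> original choice text, plus a map to relocate the ground truth.
    scrambled = {new: choices[old] for new, old in zip(LETTERS, originals)}
    original_to_scrambled = {old: new for new, old in zip(LETTERS, originals)}
    return scrambled, original_to_scrambled
```

Keying the RNG on both the seed and the question ID keeps the shuffle reproducible across runs while still varying it per question.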
```text
.
├── README.md
├── requirements.txt               # Python dependencies
├── LIMITATIONS_AND_FAILURES.md    # Known limitations and issues
├── data/
│   ├── raw/                       # Original PDF files (ssm2020.pdf, etc.)
│   ├── interim/                   # Extracted text artifacts
│   └── processed/                 # Final processed JSON datasets
│       ├── ssm2020.json
│       ├── ssm2021.json
│       ├── ...
│       └── images/                # Extracted medical images by year
├── src/
│   ├── parser/                    # PDF extraction and parsing logic
│   │   ├── pdf_text.py
│   │   ├── ssm_pdf.py
│   │   └── ssm_pdf_legacy.py
│   └── schemas.py                 # Data models and validation
├── scripts/
│   ├── parse_pdf.py               # Parse PDF to JSON
│   ├── parse_all.py               # Batch parsing
│   └── validate_dataset.py        # Validate JSON format
├── inference/                     # Inference scripts and utilities
│   ├── run_openrouter.py          # Main inference script for OpenRouter models
│   ├── dataset_loader.py          # Dataset loading utilities
│   ├── prompting.py               # Prompt construction
│   ├── scramble.py                # Choice scrambling logic
│   └── openrouter_client.py       # OpenRouter API client
├── evaluation/                    # Evaluation scripts
│   └── evaluate_results.py        # Evaluation script
└── results/                       # Inference results (JSON files)
```
This project requires Python 3.8+.
- Clone the repository:

```bash
git clone https://github.com/alessandroliva/ssm_benchmark
cd ssm_benchmark
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Set up environment variables (for the OpenRouter API):

```bash
export OPENROUTER_API_KEY="your_key_here"
```

- Alternatively, create a `.env` file:

```
OPENROUTER_API_KEY="your_key_here"
```
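If you opt for the `.env` file, the key has to be loaded at runtime. A minimal sketch, assuming `python-dotenv` is available (check `requirements.txt` for the actual dependency):

```python
import os
from dotenv import load_dotenv  # assumed dependency: python-dotenv

load_dotenv()  # reads OPENROUTER_API_KEY from a local .env file
api_key = os.environ["OPENROUTER_API_KEY"]
```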
The inference scripts are located in the `inference/` directory:

```bash
cd inference
```

The inference script supports the following options:

- `--model`: OpenRouter model ID (e.g., `openai/gpt-4o`, `anthropic/claude-3-opus`)
- `--years`: Years to include (can be specified multiple times: `--years 2024 --years 2025`)
- `--seed`: Seed for deterministic choice scrambling (default: 14)
- `--limit`: Limit the number of questions (useful for testing)
- `--delay`: Delay in seconds between API requests (default: 1.0)
- `--with-explanation`: Enable the model to provide brief explanations in the JSON output
- `--dataset-dir`: Directory containing the processed JSON datasets (default: `../data/processed`)
- `--out-dir`: Directory to save results (default: `../results`)
```bash
# Run a model on a single year
python run_openrouter.py --model anthropic/claude-3-opus --years 2025

# Run a model on more than one year
python run_openrouter.py --model openai/gpt-4o --years 2024 --years 2025

# Limit number of questions for testing
python run_openrouter.py \
    --model mistralai/mistral-small \
    --years 2025 \
    --limit 10

# Adjust API rate limiting (2-second delay between requests)
python run_openrouter.py \
    --model openai/gpt-4o \
    --years 2025 \
    --delay 2.0
```

The evaluation scripts are located in the `evaluation/` directory:

```bash
cd evaluation

# Evaluate results and compute the SSM score
python evaluate_results.py ../results/openrouter_openai_gpt-4o_20260103_120000.json
```

Scoring Rules:
- Correct Answer: +1.0 point
- Wrong Answer: -0.25 points
- Abstained (F) / Invalid: 0.0 points
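In code, the rule amounts to a small mapping from prediction to score delta. A minimal sketch; the authoritative logic lives in `evaluation/evaluate_results.py`, and `score_delta` is an illustrative name:

```python
def score_delta(predicted: str, ground_truth: str) -> float:
    """Official SSM scoring: +1.0 correct, -0.25 wrong, 0.0 abstain/invalid."""
    if predicted not in {"A", "B", "C", "D", "E"}:
        return 0.0  # abstained ('F') or invalid response
    return 1.0 if predicted == ground_truth else -0.25
```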
The evaluation script provides:
- Overall accuracy
- Accuracy on attempted questions (excluding abstentions)
- Total SSM score
- Breakdown of correct, wrong, and abstained answers
- Top errors breakdown
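As a rough illustration, these metrics can be recomputed from a saved results file. A sketch, not the script's exact implementation; field names follow the results schema shown below, and any non-A–E prediction is treated as an abstention:

```python
import json
import sys

VALID = {"A", "B", "C", "D", "E"}

def summarize(path: str) -> dict:
    """Recompute summary metrics from a saved results file."""
    with open(path, encoding="utf-8") as f:
        run = json.load(f)
    results = run["results"]
    attempted = [r for r in results if r["predicted_answer"] in VALID]
    correct = sum(1 for r in attempted if r["correct"])
    return {
        "accuracy": correct / len(results),
        "attempted_accuracy": correct / len(attempted) if attempted else 0.0,
        "ssm_score": sum(r["score_delta"] for r in results),
        "correct": correct,
        "wrong": len(attempted) - correct,
        "abstained": len(results) - len(attempted),
    }

if __name__ == "__main__":
    print(json.dumps(summarize(sys.argv[1]), indent=2))
```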
The processed dataset is a JSON array of question objects. Each object represents a single multiple-choice question:

```json
{
"id": "ssm2025861",
"year": 2025,
"source_pdf": "ssm2025.pdf",
"question_number": 1,
"prompt": "Un bambino di 5 anni nato a termine...",
"choices": {
"A": "toracotomia postero-laterale sinistra nel IV spazio intercostale",
"B": "toracotomia postero-laterale sinistra nel VII spazio intercostale",
"C": "sternotomia mediana longitudinale",
"D": "toracotomia anteriore sinistra nel VI spazio intercostale",
"E": "toracotomia postero-laterale destra nel V spazio intercostale"
},
"answer": "A",
"images": ["data/processed/images/ssm2025/ssm2025885.png"]
}
```
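A minimal loading sketch, assuming the schema above (the repo's actual loader is `inference/dataset_loader.py`; `load_year` is an illustrative name):

```python
import json
from pathlib import Path
from typing import Dict, List

def load_year(dataset_dir: str, year: int) -> List[Dict]:
    """Load one processed SSM dataset: a JSON array of question objects."""
    path = Path(dataset_dir) / f"ssm{year}.json"
    questions = json.loads(path.read_text(encoding="utf-8"))
    # Sanity check: every question carries exactly the five A-E choices.
    assert all(set(q["choices"]) == set("ABCDE") for q in questions)
    return questions
```

Inference results are saved as JSON files with the following structure:

```json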
{
"model": "openai/gpt-4o",
"run_seed": 42,
"timestamp": "20260103_120000",
"inference_mode": "0-shot",
"with_explanation": false,
"years": [2024, 2025],
"results": [
{
"question_id": "ssm2025861",
"year": 2025,
"predicted_answer": "A",
"ground_truth_original": "A",
"ground_truth_scrambled": "C",
"scrambled_choices": {
"A": "...",
"B": "...",
"C": "original A answer",
"D": "...",
"E": "..."
},
"correct": true,
"score_delta": 1.0,
"raw_response": "..."
}
],
"summary": {
"total_questions": 100,
"correct": 75,
"score": 68.75
}
}
```

This work is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
See: https://creativecommons.org/licenses/by-nc/4.0/deed.en
If you find this repository useful for your work, please cite:
```bibtex
@misc{ssm_benchmark,
title={SSM Benchmark: Evaluating LLMs on Italian Medical Specialty Examinations},
author={Alessandro Oliva},
year={2025},
url={https://github.com/alessandroliva/ssm_benchmark}
}
```

For any queries or collaboration opportunities, please do not hesitate to contact me here.