A comprehensive evaluation suite for testing Large Language Models (LLMs) on Fast Healthcare Interoperability Resources (FHIR) knowledge and capabilities.
FHIR-Workbench provides standardized benchmarks to evaluate LLM performance on healthcare interoperability tasks. This repository contains implementations for four key FHIR tasks:
- FHIR-QA: Tests general knowledge of FHIR concepts and standards through multiple-choice questions
- FHIR-RESTQA: Evaluates understanding of FHIR RESTful API operations, queries, and interactions through multiple-choice questions
- FHIR-ResourceID: Tests the ability to identify FHIR resource types based on their JSON structure and content (an illustrative resource is shown after this list)
- Note2FHIR: Assesses the ability to generate structured FHIR resources from patient clinical notes
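For orientation, the following is a minimal, hand-written example of a FHIR `Patient` resource in JSON; it is illustrative only and not taken from the evaluation data. FHIR-ResourceID asks a model to recognize the resource type from this kind of structure, and Note2FHIR asks it to produce such resources from clinical notes:

```json
{
  "resourceType": "Patient",
  "id": "example",
  "name": [
    {
      "family": "Chalmers",
      "given": ["Peter", "James"]
    }
  ],
  "gender": "male",
  "birthDate": "1974-12-25"
}
```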
All evaluation datasets are available on Hugging Face: FHIR-Workbench Collection
FHIR tasks are integrated with lm-eval-harness for standardized evaluation of open-source models.
```bash
# Clone this repository
git clone https://github.com/UMEssen/FHIR-Workbench.git
cd FHIR-Workbench

# Install lm-eval-harness with FHIR tasks
cd lm-evaluation-harness
pip install -e .
```
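As an optional sanity check, you can list the registered tasks and confirm the FHIR tasks are picked up; this assumes your lm-evaluation-harness version supports `--tasks list`:

```bash
# Should print the FHIR task names (fhir_qna, fhir_api, fhir_resource, fhir_generation) among the registered tasks
lm_eval --tasks list --include_path lm_eval/tasks/fhir/
```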
To evaluate an open-source model on all FHIR tasks:
```bash
lm_eval --model hf \
    --model_args pretrained=microsoft/phi-4 \
    --include_path lm_eval/tasks/fhir/ \
    --tasks fhir_qna,fhir_api,fhir_resource,fhir_generation \
    --output output \
    --log_samples \
    --apply_chat_template \
    --trust_remote_code
```
You can customize the evaluation by:
- Changing the model (`--model_args pretrained=...`)
- Selecting specific tasks (`--tasks ...`)
- Adjusting other parameters as needed (see the example below)
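For example, a run limited to the two multiple-choice tasks with a different model (the model name below is an illustrative choice, not a recommendation) might look like this:

```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct \
    --include_path lm_eval/tasks/fhir/ \
    --tasks fhir_qna,fhir_api \
    --output output \
    --apply_chat_template
```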
For proprietary models (OpenAI GPT, Google Gemini), use the provided scripts:
- `run_qna_proprietary.py`: For FHIR-QA and FHIR-RESTQA tasks
- `run_resource_proprietary.py`: For FHIR-ResourceID task
- `run_generation_proprietary.py`: For Note2FHIR task
Set your API key as an environment variable:
```bash
# For OpenAI models
export OPENAI_API_KEY=your_api_key

# For other providers, edit the script to use your API key and endpoint
```
```bash
# For FHIR-QA task
python run_qna_proprietary.py --dataset ikim-uk-essen/FHIR-QA

# For FHIR-RESTQA task
python run_qna_proprietary.py --dataset ikim-uk-essen/FHIR-RESTQA

# For FHIR-ResourceID task
python run_resource_proprietary.py

# For Note2FHIR task
python run_generation_proprietary.py
```
Each script accepts additional parameters such as `--batch-size` and `--concurrent` to control evaluation behavior.
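For example (the flag names are those listed above; the values here are arbitrary and should be tuned to your provider's rate limits):

```bash
# Evaluate FHIR-ResourceID with a smaller batch size and limited request concurrency
python run_resource_proprietary.py --batch-size 8 --concurrent 4
```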
We maintain a comprehensive leaderboard tracking performance of various LLMs on FHIR-specific tasks: FHIR-Workbench Leaderboard
The leaderboard currently includes evaluations of 16 models ranging from open-source models (7B-671B parameters) to closed-source commercial models. Models are ranked based on their average performance across all four FHIR tasks.
Have a FHIR-capable model you want to include in our leaderboard? Visit the leaderboard page and submit your Hugging Face model repository URL for evaluation.
If you use FHIR-Workbench in your research, please cite our paper: [Coming soon]
[License information]