A comprehensive evaluation suite for testing Large Language Models (LLMs) on Fast Healthcare Interoperability Resources (FHIR) knowledge and capabilities.
FHIR-Workbench provides standardized benchmarks to evaluate LLM performance on healthcare interoperability tasks. This repository contains implementations for four key FHIR tasks:
- FHIR-QA: Tests general knowledge of FHIR concepts and standards through multiple-choice questions
- FHIR-RESTQA: Evaluates understanding of FHIR RESTful API operations, queries, and interactions through multiple-choice questions
- FHIR-ResourceID: Tests the ability to identify FHIR resource types based on their JSON structure and content (an illustrative resource is shown after this list)
- Note2FHIR: Assesses the ability to generate structured FHIR resources from patient clinical notes
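For orientation, the following is a minimal, hand-written example of a FHIR `Patient` resource in JSON; it is illustrative only and not taken from the evaluation data. FHIR-ResourceID asks a model to recognize the resource type from this kind of structure, and Note2FHIR asks it to produce such resources from clinical notes:

```json
{
  "resourceType": "Patient",
  "id": "example",
  "name": [
    {
      "family": "Chalmers",
      "given": ["Peter", "James"]
    }
  ],
  "gender": "male",
  "birthDate": "1974-12-25"
}
```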
All evaluation datasets are available on Hugging Face: FHIR-Workbench Collection
FHIR tasks are integrated with lm-eval-harness for standardized evaluation of open-source models.
```bash
# Clone this repository
git clone https://github.com/UMEssen/FHIR-Workbench.git
cd FHIR-Workbench

# Install lm-eval-harness with FHIR tasks
cd lm-evaluation-harness
pip install -e .
```
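As an optional sanity check, you can list the registered tasks and confirm the FHIR tasks are picked up; this assumes your lm-evaluation-harness version supports `--tasks list`:

```bash
# Should print the FHIR task names (fhir_qna, fhir_api, fhir_resource, fhir_generation) among the registered tasks
lm_eval --tasks list --include_path lm_eval/tasks/fhir/
```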
To evaluate an open-source model on all FHIR tasks:
```bash
lm_eval --model hf \
    --model_args pretrained=microsoft/phi-4 \
    --include_path lm_eval/tasks/fhir/ \
    --tasks fhir_qna,fhir_api,fhir_resource,fhir_generation \
    --output output \
    --log_samples \
    --apply_chat_template \
    --trust_remote_code
```
You can customize the evaluation by:
- Changing the model (`--model_args pretrained=...`)
- Selecting specific tasks (`--tasks ...`)
- Adjusting other parameters as needed (see the example below)
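For example, a run limited to the two multiple-choice tasks with a different model (the model name below is an illustrative choice, not a recommendation) might look like this:

```bash
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct \
    --include_path lm_eval/tasks/fhir/ \
    --tasks fhir_qna,fhir_api \
    --output output \
    --apply_chat_template
```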
For proprietary models (OpenAI GPT, Google Gemini), use the provided scripts:
- `run_qna_proprietary.py`: For FHIR-QA and FHIR-RESTQA tasks
- `run_resource_proprietary.py`: For FHIR-ResourceID task
- `run_generation_proprietary.py`: For Note2FHIR task
Set your API key as an environment variable:
```bash
# For OpenAI models
export OPENAI_API_KEY=your_api_key

# For other providers, edit the script to use your API key and endpoint
```
```bash
# For FHIR-QA task
python run_qna_proprietary.py --dataset ikim-uk-essen/FHIR-QA

# For FHIR-RESTQA task
python run_qna_proprietary.py --dataset ikim-uk-essen/FHIR-RESTQA

# For FHIR-ResourceID task
python run_resource_proprietary.py

# For Note2FHIR task
python run_generation_proprietary.py
```
Each script accepts additional parameters such as `--batch-size` and `--concurrent` to control evaluation behavior.
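For example (the flag names are those listed above; the values here are arbitrary and should be tuned to your provider's rate limits):

```bash
# Evaluate FHIR-ResourceID with a smaller batch size and limited request concurrency
python run_resource_proprietary.py --batch-size 8 --concurrent 4
```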
We maintain a comprehensive leaderboard tracking performance of various LLMs on FHIR-specific tasks: FHIR-Workbench Leaderboard
The leaderboard currently includes evaluations of 16 models ranging from open-source models (7B-671B parameters) to closed-source commercial models. Models are ranked based on their average performance across all four FHIR tasks.
Have a FHIR-capable model you want to include in our leaderboard? Visit the leaderboard page and submit your Hugging Face model repository URL for evaluation.
If you use FHIR-Workbench in your research, please cite our paper: [Coming soon]
[License information]