Medical Data Toolkit

Medical Data Toolkit converts unstructured medical documents (PDFs, images, screenshots) into accurately structured electronic medical data in the HL7® FHIR® standard format. The toolkit does this using an LLM (external to the tool), specialized document schemas, and advanced extraction pipelines.

Vision

Enable information encoded in medical documents to be converted into digitally accessible data in FHIR®, the interoperability standard for electronic medical records.

Key Features

The system achieves high accuracy using the following core modules:

Medical Document Classification: Identifies the document type (e.g., lab report, prescription) to route it to the appropriate processor.
Structured Information Extraction: Extracts clinical data using per-document schemas to improve stability and correctness.
Advanced Medical Coding: Accurately maps concepts to standard LOINC codes.
Deterministic FHIR® Conversion: Rule-based conversion of structured JSON into FHIR® R4 resources.
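
As an illustration of what a rule-based conversion step can look like, the following minimal sketch maps one extracted lab result to a FHIR® R4 Observation. The input field names (`test_name`, `loinc_code`, `value`, `unit`) and the function name are assumptions made for this example; the toolkit's actual schemas and conversion rules live in `document_to_fhir/core/` and may differ.

```python
from typing import Any

LOINC_SYSTEM = "http://loinc.org"


def lab_result_to_observation(result: dict[str, Any]) -> dict[str, Any]:
    """Map one extracted lab result (hypothetical schema) to a FHIR R4 Observation."""
    observation: dict[str, Any] = {
        "resourceType": "Observation",
        "status": "final",
        "code": {
            "coding": [
                {
                    "system": LOINC_SYSTEM,
                    "code": result["loinc_code"],
                    "display": result["test_name"],
                }
            ],
            "text": result["test_name"],
        },
    }
    value = result["value"]
    # Numeric results become a valueQuantity; anything else is kept as text.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        quantity: dict[str, Any] = {"value": value}
        if result.get("unit"):
            quantity["unit"] = result["unit"]
        observation["valueQuantity"] = quantity
    else:
        observation["valueString"] = str(value)
    return observation


# Hand-written sample input, not real toolkit output.
example = {"test_name": "Glucose", "loinc_code": "2345-7", "value": 92, "unit": "mg/dL"}
obs = lab_result_to_observation(example)
```

Because the mapping is a pure function of the structured input, the same JSON always yields the same resource, which is the property the "deterministic" label refers to.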

Terminology & LOINC Mapping

Medical Data Toolkit addresses the high-cardinality challenge of LOINC mapping for lab reports using a multi-stage strategy:

Core-Analyte Prediction: Extracts the primary substance being measured (e.g., "Glucose") during initial extraction, achieving high recall.
Offline Knowledge Base: Uses a pre-computed knowledge base to populate the LOINC axes (Property, System, etc.). Constructing this knowledge base is a prerequisite for offline execution. Detailed instructions can be found in the LOINC README.
Signature Matching: Employs word signatures to handle word-order swaps and OCR noise.

These techniques help improve document conversion precision and do not require external API calls, making the toolkit suitable for offline deployment.
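
The README does not spell out the signature algorithm, so the following is only one plausible reading of "word signatures": each name is reduced to its sorted, lowercased tokens (which cancels word-order swaps), and signatures are then compared with a fuzzy ratio (which tolerates small OCR errors). The function names and the 0.8 threshold are assumptions for this sketch.

```python
import re
from difflib import SequenceMatcher
from typing import Optional


def word_signature(text: str) -> str:
    """Lowercase, drop punctuation, and sort tokens so word order no longer matters."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(sorted(words))


def best_match(query: str, candidates: list[str], threshold: float = 0.8) -> Optional[str]:
    """Return the candidate whose signature is closest to the query's, if close enough."""
    query_sig = word_signature(query)
    best_name: Optional[str] = None
    best_score = 0.0
    for name in candidates:
        score = SequenceMatcher(None, query_sig, word_signature(name)).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


known_names = ["Fasting Glucose", "Hemoglobin A1c"]
# "Fastinq" simulates an OCR misread of "Fasting"; the word order is also swapped.
match = best_match("Glucose, Fastinq", known_names)
```

Everything here runs locally on the standard library, in keeping with the offline deployment goal described above.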

Usage

The toolkit provides a Docker image that exposes a REST API. The API accepts the raw bytes of a PDF file or a picture of a medical document (JPEG or PNG) and returns a complete FHIR® bundle. The container can be deployed in a serverless environment (e.g., Cloud Run, GKE), deployed on a virtual machine, or run locally.
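
Since the API takes raw bytes, a client may want to confirm that a file is one of the accepted formats before uploading it. The sketch below checks the standard leading "magic" bytes of PDF, JPEG, and PNG files. It is a client-side convenience written for this example, not part of the toolkit, and `sniff_media_type` is a hypothetical helper name.

```python
from typing import Optional

# Leading byte signatures of the three accepted formats.
_SIGNATURES = (
    (b"%PDF-", "application/pdf"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
)


def sniff_media_type(data: bytes) -> Optional[str]:
    """Return the media type implied by the file's leading bytes, or None if unrecognized."""
    for magic, media_type in _SIGNATURES:
        if data.startswith(magic):
            return media_type
    return None
```

A client could call `sniff_media_type(raw_bytes)` and skip the upload when it returns `None`.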

Limitations

Transformation of handwritten medical documents to FHIR® is not supported.
The toolkit supports only the Diagnostic Report and Laboratory Report types of medical documents.
The toolkit supports only the FHIR® Implementation Guide for ABDM.

Toolkit Interface Workflow

Ingest: Client sends raw bytes of the document.
Process: The system classifies the document, extracts data, maps terminology, and generates FHIR® resources.
Respond: Returns the FHIR® bundle.

The current API is synchronous and optimized for fast processing of small files.
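
The Process step above can be pictured as a chain of stages, with the classifier's output selecting the extractor for that document type. The sketch below shows only that control flow: `run_pipeline` and its parameters are hypothetical names, and the lambdas are placeholder stubs rather than the toolkit's real stages.

```python
from typing import Any, Callable, Mapping


def run_pipeline(
    raw: bytes,
    classify: Callable[[bytes], str],
    extractors: Mapping[str, Callable[[bytes], dict[str, Any]]],
    map_terminology: Callable[[dict[str, Any]], dict[str, Any]],
    to_fhir: Callable[[dict[str, Any]], dict[str, Any]],
) -> dict[str, Any]:
    """Classify the document, route it to its extractor, code it, and build a bundle."""
    doc_type = classify(raw)
    if doc_type not in extractors:
        raise ValueError(f"Unsupported document type: {doc_type}")
    structured = extractors[doc_type](raw)
    coded = map_terminology(structured)
    return to_fhir(coded)


# Placeholder stages, for illustration only.
bundle = run_pipeline(
    b"%PDF-1.7 ...",
    classify=lambda raw: "lab_report",
    extractors={"lab_report": lambda raw: {"tests": [{"name": "Glucose"}]}},
    map_terminology=lambda data: {**data, "coded": True},
    to_fhir=lambda data: {"resourceType": "Bundle", "type": "collection", "entry": []},
)
```

Rejecting unknown document types before extraction mirrors the limitation that only certain report types are supported.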

Prerequisites

Models: Clients can use any LLM capable of extracting medical data from PDF and image files.
Environment: Serverless container execution environment or Docker.

Project Structure

src/: Contains the source code for the server and processing logic.
- rest_server.py: The Flask entry point.
- document_to_fhir/core/: Core logic for classification, extraction, and FHIR generation.
Dockerfile: For containerizing the application.

Getting Started Locally

#### 1. Clone the GitHub Repository

git clone https://github.com/Google-Health/medical-data-toolkit

#### 2. Build the Medical Data Toolkit Container

Execute from the directory containing the toolkit's Dockerfile.

docker build -t medical-data-toolkit-image .

#### 3. Run the Container

docker run --name medical-data-toolkit-container -p 8080:8080 -d medical-data-toolkit-image

#### 4. Call the Running Container

Example Client Usage (Console)

curl -X POST \
     --data-binary @medical_document.pdf \
     "http://127.0.0.1:8080/document_to_fhir"

Example Client Usage (Python)

import requests

# Assuming the server is running locally or at a specific address
url = "http://127.0.0.1:8080/document_to_fhir"  # Replace with actual endpoint when available

with open("sample_report.pdf", "rb") as f:
  pdf_bytes = f.read()

with requests.post(url, data=pdf_bytes) as response:
  if response.ok:
    print("FHIR Bundle:", response.json())
  else:
    print("Error:", response.text)
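
Once a bundle comes back, a client typically walks its entries to find the resources it cares about. The sketch below relies only on the standard FHIR® Bundle layout (`entry[].resource`). The helper name `iter_resources` is an assumption, and the sample bundle is hand-written for illustration rather than actual toolkit output.

```python
from typing import Any, Iterator


def iter_resources(bundle: dict[str, Any], resource_type: str) -> Iterator[dict[str, Any]]:
    """Yield every resource of the given type from a FHIR Bundle's entries."""
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") == resource_type:
            yield resource


# Hand-written sample; a real response would contain many more fields.
sample_bundle = {
    "resourceType": "Bundle",
    "type": "collection",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "p1"}},
        {"resource": {"resourceType": "Observation", "id": "o1", "code": {"text": "Glucose"}}},
        {"resource": {"resourceType": "Observation", "id": "o2", "code": {"text": "Hemoglobin"}}},
    ],
}

observation_names = [o["code"]["text"] for o in iter_resources(sample_bundle, "Observation")]
```

In the client above, `response.json()` would take the place of `sample_bundle`.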

#### 5. Stop the Container and Cleanup (Optional)

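# Caution: the prune step deletes ALL unused Docker data on this machine
# (stopped containers, unused images and networks, build cache).
docker kill medical-data-toolkit-container && \
docker system prune --all --force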

License

This project is licensed under the Apache 2.0 license; see LICENSE.
