Medical Data Toolkit converts unstructured medical documents (PDFs, images, screenshots) into accurately structured electronic medical data in the HL7® FHIR® standard format. The toolkit accomplishes this through the use of an LLM model (external to the tool), specialized document schemas, and advanced extraction pipelines.
Enable information encoded in medical documents to be converted into digitally accessible in the FHIR®, the interoperability standard for electronic medical records.
The system achieves high accuracy using the following core modules:
- Medical Document Classification: Identifies the document type (e.g., lab report, prescription) to route it to the appropriate processor.
- Structured Information Extraction: Extracts clinical data using per-document schemas to help stability and correctness.
- Advanced Medical Coding: Accurately maps concepts to standard LOINC codes.
- Deterministic FHIR® Conversion: Rule-based conversion of structured JSON into FHIR® R4 resources.
Medical Data Toolkit addresses the high-cardinality challenge of LOINC mapping for lab reports using a multi-stage strategy:
- Core-Analyte Prediction: Extracts the primary substance being measured (e.g., "Glucose") during initial extraction, achieving high recall.
- Offline Knowledge Base: Uses pre-computed knowledge base to populate the LOINC axes (Property, System, etc.). Constructing this Knowledge Base is a prerequisite for offline execution. Detailed instructions can be found in the LOINC README.
- Signature Matching: Employs word signatures to handle word-order swaps and OCR noise.
These techniques help improve document conversion precision and do not require external API calls, making the toolkit suitable for offline deployment.
The toolkit provides a Docker image which exposes REST API that accepts the raw bytes of a PDF file or picture of a medical document (i.e., JPEG, PNG) and returns a completed FHIR® bundle. The docker can be deployed within a serverless environment (e.g., Cloud Run, GKE), deployed within a virtual machine, or executed locally.
- Transformation of hand written medical documents to FHIR® is not supported.
- The toolkit supports Diagnostic Report or Laboratory Report type of medical documents.
- The toolkit supports FHIR® Implementation Guide for ABDM.
- Ingest: Client sends raw bytes of the document.
- Process: The system classifies the document, extracts data, maps terminology, and generates FHIR® resources.
- Respond: Returns the FHIR® bundle.
The current API is synchronous and optimized for processing small files fast.
- Models: Clients can use any LLM model capable of extracting medical data from PDF and image files.
- Environment: Serverless container execution environment or Docker.
src/: Contains the source code for the server and processing logic.rest_server.py: The Flask entry point.document_to_fhir/core/: Core logic for classification, extraction, and FHIR generation.
Dockerfile: For containerizing the application.
####1. Clone the GitHub Repository
git clone https://github.com/Google-Health/medical-data-toolkit
####2. Build the Medical Data Toolkit Container
Execute from the directory containing the toolkits Dockerfile.
docker build -t medical-data-toolkit-image .
####3. Run the Container
docker run --name medical-data-toolkit-container -p 8080:8080 -d medical-data-toolkit-image
####4. Call the Running Container
Example Client Usage (Console)
curl -X POST \
--data-binary @medical_document.pdf \
"http://127.0.0.1:8080/document_to_fhir"
Example Client Usage (Python)
import requests
# Assuming the server is running locally or at a specific address
url = "http://127.0.0.1:8080/document_to_fhir" # Replace with actual endpoint when available
with open("sample_report.pdf", "rb") as f:
pdf_bytes = f.read()
with requests.post(url, data=pdf_bytes) as response:
if response.ok:
print("FHIR Bundle:", response.json())
else:
print("Error:", response.text)####5. Stop the Container and Cleanup (Optional)
docker kill medical-data-toolkit-container && \
docker system prune --all --force
This project is licensed under the Apache 2.0 license, see LICENSE.