Skip to content

openmrs/openmrs-module-chartsearchai

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

403 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Chart Search AI Module

An OpenMRS module that lets clinicians ask natural language questions about a patient's chart and get answers with source citations.

For project background, community discussion, and roadmap, see the wiki project page.

Table of Contents

Requirements

  • Java 11+
  • OpenMRS Platform 2.8.0+
  • Webservices REST module 2.44.0+
  • 6GB+ RAM recommended (for local LLM inference with the default MedGemma 1.5 4B model; not required when using a remote LLM)
  • Elasticsearch 8.14+ (optional, for the hybrid retrieval pipeline; the default embedding and Lucene pipelines require no external services)

Setup

1. Build

mvn package

The .omod file is in omod/target/.

2. Download the LLM model (local mode only)

Skip this step if you plan to use a remote LLM (see LLM engine below).

Download MedGemma 1.5 4B (Q4_K_M quantization) in GGUF format (~2.5GB) from Hugging Face.

Place the .gguf file inside the OpenMRS application data directory (e.g., <openmrs-application-data-directory>/chartsearchai/). Model paths are resolved relative to this directory for security.

Recommended models for local inference:

Model RAM Needed Chat Template Download
Llama 3.2 3B ~6GB total llama3 GGUF
MedGemma 1.5 4B (default) ~6–8GB total gemma GGUF
Gemma 4 E4B ~6–8GB total gemma GGUF
Llama 3.3 8B ~10GB total llama3 GGUF
Gemma 3 12B ~12GB total gemma GGUF
Mistral Nemo 12B ~12GB total mistral GGUF
Gemma 4 31B ~20–24GB total gemma GGUF

To switch models, update chartsearchai.llm.modelFilePath and chartsearchai.llm.chatTemplate — no rebuild needed. See Evaluated models for a full comparison of all models tested, including size trade-offs and licensing.

3. Download the embedding model

If embedding pre-filtering is enabled (default), download the all-MiniLM-L6-v2 ONNX model (~90MB) from Hugging Face. You need both model.onnx and vocab.txt from the repository.

Place them alongside the LLM model (e.g., <openmrs-application-data-directory>/chartsearchai/).

4. Install

Copy the .omod file into the modules folder of the OpenMRS application data directory (e.g., <openmrs-application-data-directory>/modules/). The module will be loaded on the next OpenMRS startup.

5. Configure

Set these global properties in Admin > Settings:

LLM engine

Property Default Description
chartsearchai.llm.engine local LLM inference engine: local runs a GGUF model in-process via llama.cpp; remote calls an OpenAI-compatible API

Local engine (default) — requires a downloaded GGUF model file (see step 2):

Property Description
chartsearchai.llm.modelFilePath Relative path (within the OpenMRS application data directory) to the .gguf model file, e.g. chartsearchai/medgemma-1.5-4b-it-Q4_K_M.gguf

Remote engine — set chartsearchai.llm.engine to remote and configure:

Property Where Description
chartsearchai.llm.remote.endpointUrl Global property Chat completions endpoint URL (e.g. http://localhost:11434/v1/chat/completions for Ollama, http://gpu-server:8000/v1/chat/completions for vLLM, https://api.openai.com/v1/chat/completions for OpenAI)
chartsearchai.llm.remote.apiKey openmrs-runtime.properties API key for authentication (sent as Bearer token). Stored in runtime properties instead of the database for security. For self-hosted servers that don't require auth, set to any non-empty value (e.g. none)
chartsearchai.llm.remote.modelName Global property Model identifier (e.g. llama3.3 for Ollama, meta-llama/Llama-3.3-8B-Instruct for vLLM, gpt-4o for OpenAI)

The API key is read from openmrs-runtime.properties (not from the database) so it is never exposed in the Admin UI or database backups. Add it to your runtime properties file:

chartsearchai.llm.remote.apiKey=sk-your-api-key-here

The remote engine works with any server that implements the OpenAI chat completions API format, including self-hosted inference servers (vLLM, Ollama, text-generation-inference) and cloud providers (OpenAI, Azure OpenAI, Google AI, Anthropic via proxy). Self-hosted servers keep patient data on-premise while still benefiting from GPU-accelerated inference. No GGUF model download is needed when using the remote engine.

Retrieval pipeline

Property Default Description
chartsearchai.embedding.preFilter true When true, uses the selected retrieval pipeline to narrow patient records to the most relevant ones before sending to the LLM. Set to false to send the full chart instead
chartsearchai.retrieval.pipeline embedding Selects the retrieval pipeline: embedding (default) uses vector similarity via an ONNX model with custom scoring; lucene uses Apache Lucene BM25 text search; hybrid combines Lucene BM25 and embedding kNN search using Reciprocal Rank Fusion (RRF) — same quality as the Elasticsearch pipeline but with no external services required; elasticsearch uses Elasticsearch hybrid search combining BM25 text and kNN vector search via RRF (requires Elasticsearch 8.14+ configured in OpenMRS). All require preFilter to be true. Records are indexed automatically on first access. Changing this setting takes effect on the next query

Embedding pipeline tuning

These settings apply when chartsearchai.retrieval.pipeline is embedding (the default). The Elasticsearch pipeline also uses scoreGapMultiplier, minScoreGap, gapValidationCosineThreshold, keywordWeight, and similarityRatio in its post-retrieval filter pipeline. They have no effect on the Lucene or hybrid pipelines.

Property Default Description
chartsearchai.embedding.topK 10 Maximum number of records sent to the LLM per query. When the query mentions a specific clinical type (e.g., "medications", "allergies", "lab results"), all records of that type are included regardless of topK, and remaining slots are filled with contextual records from other types. For other queries, topK is applied only when some candidates lack keyword matches; when every candidate has a keyword match, topK is bypassed because gap detection and ratio filtering already identified the relevant cluster. Type detection uses keyword matching — for example, "medications" and "drugs" both match drug orders, while "blood pressure" and "bp" both match observations
chartsearchai.embedding.similarityRatio 0.80 Minimum similarity score as a fraction of the top result's score. Records scoring below this ratio are excluded even if within the topK limit. Must be between 0 and 1
chartsearchai.embedding.scoreGapMultiplier 2.5 Controls adaptive topK by detecting natural cluster boundaries in similarity scores. Higher values include more records; lower values cut more aggressively. Set to a very large value (e.g. 999) to disable gap detection
chartsearchai.embedding.minScoreGap 0.10 Minimum absolute gap between consecutive similarity scores required for the adaptive cutoff detector to trigger. Prevents premature cutting when a relatively large gap (compared to a tight cluster's running average) is still small in absolute terms. Only applies when gap detection is active
chartsearchai.embedding.gapValidationCosineThreshold 0.47 Cosine similarity threshold for validating whether a detected gap is intra-topic or inter-topic. When the average cosine between records above and below the gap meets or exceeds this value, the gap is considered intra-topic and the cut is skipped. Must be between 0 and 1
chartsearchai.embedding.keywordWeight 0.3 Additive keyword bonus weight in the hybrid retrieval formula: finalScore = semanticScore + weight × keywordScore. Keyword overlap can only increase the score, never decrease it. Set to 0 to disable keyword matching
chartsearchai.embedding.typeBoostFactor 1.0 Score multiplier applied to records whose resource type matches the query intent (e.g., drug orders when the query is about medications). Set to 1.0 to disable type boosting (default). Values like 1.21.5 provide moderate boosting. Must be between 1.0 and 3.0
chartsearchai.embedding.queryPrefix (empty) Prefix prepended to the user query before embedding. Leave empty for models like all-MiniLM-L6-v2 that were not trained with instruction prefixes. Set to search_query: or Represent this sentence for searching relevant passages: for models that support instruction-aware queries (e.g., BGE)
chartsearchai.embedding.maxSequenceLength 256 Maximum WordPiece token sequence length for embedding input. Increase when using models that support longer contexts (e.g., 512 for BGE models). Must be between 32 and 8192
chartsearchai.embedding.modelFilePath Required when using the embedding, hybrid, or elasticsearch pipeline. Relative path to the ONNX model file (all-MiniLM-L6-v2), e.g. chartsearchai/all-MiniLM-L6-v2.onnx. Not needed for the Lucene pipeline
chartsearchai.embedding.vocabFilePath Required when using the embedding, hybrid, or elasticsearch pipeline. Relative path to the WordPiece vocab.txt file, e.g. chartsearchai/vocab.txt. Not needed for the Lucene pipeline

LLM tuning

Property Default Description
chartsearchai.llm.systemPrompt (built-in clinical prompt) System prompt that guides how the LLM responds — e.g. answering only the question asked, using only the provided patient records, citing records by number, naming what is missing when records lack relevant information (e.g. "There are no records about diabetes in this patient's chart"), keeping answers concise, and returning structured JSON
chartsearchai.llm.timeoutSeconds 120 Maximum seconds to wait for LLM inference before timing out
chartsearchai.llm.chatTemplate gemma (Local engine only) Chat template for formatting prompts. Presets: llama3, mistral, phi3, chatml, gemma. Set to auto to use the model's built-in GGUF chat template. Or a custom template string with {system} and {user} placeholders
chartsearchai.llm.idleTimeoutMinutes 30 (Local engine only) Minutes of inactivity after which the LLM model is unloaded from memory to free RAM. The model is automatically reloaded on the next query. Set to 0 to keep the model loaded indefinitely

Rate limiting and caching

Property Default Description
chartsearchai.rateLimitPerMinute 10 Maximum queries per user per minute. Set to 0 to disable
chartsearchai.cacheTtlMinutes 0 Minutes to cache identical (patient, question) answers. Set to 0 to disable (default)

Audit

Property Default Description
chartsearchai.auditLogRetentionDays 90 Audit log entries older than this are purged daily. Set to 0 to retain all

6. Grant privileges

Privilege Purpose
AI Query Patient Data Execute chart search queries
View AI Audit Logs Access the audit log endpoint

7. Indexing

When chartsearchai.embedding.preFilter is true (default), patient records are automatically indexed on first chart access for whichever retrieval pipeline is active. Subsequent data changes trigger automatic re-indexing via AOP hooks on encounter, obs, condition, diagnosis, allergy, order, program enrollment, medication dispense, and patient merge operations.

Embedding pipeline (default): Uses an ONNX embedding model for vector similarity search. A bulk backfill task ("Chart Search AI - Embedding Backfill") is available in Admin > Scheduler > Manage Scheduler to pre-index all patients. The default model is all-MiniLM-L6-v2 (general-purpose, 384 dimensions). Any BERT-based ONNX embedding model can be used as a drop-in replacement by updating chartsearchai.embedding.modelFilePath and chartsearchai.embedding.vocabFilePath. Embedding dimensions are auto-detected from the model output, so models with any dimension size work without code changes. After switching models, existing embeddings are incompatible — run the backfill task to re-index all patients with the new model.

Lucene pipeline (chartsearchai.retrieval.pipeline=lucene): Uses Apache Lucene BM25 text search with English stemming. No ONNX model files are required. The Lucene index is stored at <openmrs-application-data-directory>/chartsearchai/lucene-index/ and is built automatically on first patient access. This pipeline is simpler to set up (no model download needed) and may be preferred for environments where the ONNX model is unavailable.

Hybrid pipeline (chartsearchai.retrieval.pipeline=hybrid): Combines Lucene BM25 text search with embedding kNN semantic search using Reciprocal Rank Fusion (RRF), the same algorithm used by the Elasticsearch pipeline. Provides Elasticsearch-quality hybrid retrieval without requiring any external services — everything runs in-process. Requires the ONNX embedding model (same as the embedding pipeline) for the kNN side. Both the Lucene index and embedding vectors are built automatically on first patient access. This is the best option when you want hybrid BM25+semantic search quality but don't have an Elasticsearch cluster.

Elasticsearch pipeline (chartsearchai.retrieval.pipeline=elasticsearch): Uses Elasticsearch hybrid search combining BM25 text search with kNN dense vector search via Reciprocal Rank Fusion (RRF). Requires Elasticsearch 8.14+ configured in OpenMRS (set hibernate.search.backend.uris in runtime properties). Also requires the ONNX embedding model (same as the embedding pipeline) to compute vectors for the kNN side of the hybrid search. Patient records are indexed into a shared chartsearchai-patient-records Elasticsearch index with both text and embedding vector fields. The RRF algorithm fuses rankings from both signals — this means queries like "any cancer?" can find semantic matches (e.g. Kaposi sarcoma) via kNN even when the literal term is absent from the records, while also benefiting from BM25's lexical matching. If Elasticsearch is not available at query time, the pipeline automatically falls back to the embedding pipeline. After switching embedding models, delete the chartsearchai-patient-records index from Elasticsearch — it will be recreated with the new model's dimensions on the next patient access.

Choosing a pipeline:

Consideration Embedding (default) Lucene Hybrid Elasticsearch
External dependencies ONNX model files only None ONNX model files only Elasticsearch 8.14+ cluster + ONNX model files
Semantic matching (e.g., "cancer" finds "Kaposi sarcoma") Yes No Yes (via kNN) Yes (via kNN)
Absent-data detection (returns "no records about X" instead of false positives) Yes (z-score gate) No No Yes (via post-filter pipeline)
Type-aware auto-expand (e.g., "any conditions?" returns all conditions) Yes No No No
Adaptive result filtering (gap detection, similarity ratio) Yes No No Yes (post-retrieval filter pipeline)
Keyword matching Yes (hybrid scoring) Yes (BM25 with stemming) Yes (BM25 + kNN via RRF) Yes (BM25 + kNN via RRF)
Tunable parameters Many (topK, similarityRatio, scoreGapMultiplier, keywordWeight, etc.) Few (topK only) Few (topK only) Few (topK only; scoring delegated to Elasticsearch)
Compute location In-process (JVM) In-process (JVM) In-process (JVM) Elasticsearch cluster
Graceful fallback N/A (default) Falls back to full chart on error Falls back to full chart on error Falls back to embedding pipeline

The embedding pipeline is recommended for most deployments — it runs entirely in-process, has the most sophisticated filtering (z-score gate for absent-data detection, gap detection for adaptive result cutoff, type-aware expansion), and requires no external services. The Lucene pipeline is the simplest option when the ONNX model is unavailable, but lacks semantic understanding. The hybrid pipeline combines Lucene BM25 with embedding kNN via RRF, but benchmarks on the 153-record eval dataset show it underperforms the embedding pipeline (0.659 avg recall vs 0.748) because its fixed-size topK output cannot adapt: it always returns exactly topK records, failing on adversarial queries (can't return empty) and broad queries like blood pressure where more than topK records are relevant. The embedding pipeline's adaptive filtering (gap detection, floor gates, type-aware expansion) handles these cases. The Elasticsearch pipeline is best when you already have an ES cluster in your infrastructure and want to offload retrieval compute. ES results are post-filtered through the same scoring and gap detection pipeline as the embedding pipeline, so queries like "any cancer?" return only genuinely relevant records (e.g. Kaposi sarcoma) rather than the full RRF result set.

Testing the Elasticsearch pipeline locally

The module auto-detects whether the backend is Elasticsearch or OpenSearch and adapts its queries accordingly. OpenSearch is recommended because RRF is free; Elasticsearch requires a paid Platinum or Enterprise subscription for RRF.

To test the Elasticsearch pipeline with the OpenMRS SDK:

1. Start OpenSearch 2.19+ (recommended) or Elasticsearch 8.14+ with Docker:

OpenSearch (RRF is free):

docker run -d --name opensearch \
  -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "DISABLE_SECURITY_PLUGIN=true" \
  opensearchproject/opensearch:2.19.0

Install the analysis-phonetic plugin (required by the OpenMRS platform for Soundex-based person name search):

docker exec opensearch bin/opensearch-plugin install analysis-phonetic
docker restart opensearch
Alternatively, use Elasticsearch (requires paid license for RRF)
docker run -d --name elasticsearch \
  -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  elasticsearch:8.17.2
docker exec elasticsearch bin/elasticsearch-plugin install analysis-phonetic
docker restart elasticsearch

Start a 30-day trial to enable RRF:

curl -X POST 'http://localhost:9200/_license/start_trial?acknowledge=true'

Verify it's running: curl http://localhost:9200/_cluster/health

2. Configure OpenMRS to use Elasticsearch:

Add to your OpenMRS runtime properties file (e.g., ~/openmrs/openmrs-runtime.properties):

hibernate.search.backend.type=elasticsearch
hibernate.search.backend.analysis.configurer=elasticsearchConfig
hibernate.search.backend.uris=http://localhost:9200
hibernate.search.backend.discovery.enabled=false

Notes:

  • The analysis.configurer must match the backend type — use elasticsearchConfig for Elasticsearch and luceneConfig for Lucene (the default). If you see Unknown filter type [phonetic] errors, the analysis-phonetic plugin is missing from your Elasticsearch instance.
  • Set discovery.enabled=false when running a single local node. When enabled, Hibernate Search may discover and connect to internal Docker network IPs (e.g., 172.17.x.x) that are unreachable from the host, causing Timeout connecting errors.

Or if using the SDK with Docker, pass the environment variable when running the server:

OMRS_SEARCH=elasticsearch mvn openmrs-sdk:run

3. Set the retrieval pipeline:

In Admin > Settings, set:

Property Value
chartsearchai.retrieval.pipeline elasticsearch

Also ensure the ONNX embedding model and vocab files are configured (same as the default embedding pipeline).

4. Query a patient — records are indexed automatically on first access. To verify indexing, check the ES index:

curl http://localhost:9200/chartsearchai-patient-records/_count

Elasticsearch unavailability

If Elasticsearch is unreachable (not running, network issue, misconfigured URI), the module continues to work normally:

  • Startup: The module starts successfully without checking Elasticsearch connectivity. The client is created lazily on first use.
  • Queries: Each query calls GET /_cluster/health to check availability. If the check fails, the query automatically falls back to the embedding pipeline. No error is returned to the caller — users still get search results.
  • Indexing: When patient data changes (new obs, conditions, orders, etc.), the module attempts to re-index in Elasticsearch. If the connection fails, the error is logged and swallowed — the data change proceeds normally.
  • Recovery: There is no retry or circuit-breaker logic. Each request independently checks availability, so if Elasticsearch comes back online, the next query automatically uses it.

In short, the Elasticsearch pipeline is a best-effort enhancement. The module never fails because of Elasticsearch — it silently degrades to the embedding pipeline and silently recovers when Elasticsearch becomes available again.

5. To reset and re-index, delete the ES index:

curl -X DELETE http://localhost:9200/chartsearchai-patient-records

Records will be re-indexed on the next patient access.

Query behavior

Absent-data detection

When the embedding pipeline is active and a query has no keyword matches in the patient's records (e.g., asking "any cancer?" for a patient with no cancer-related records), the system uses a z-score gate to detect whether the top semantic match is a genuine result or just noise. If the patient has 30+ records and the best semantic score is not a statistical outlier (z-score < 1.5), the query returns "There are no records about [topic] in this patient's chart" instead of false positives. This prevents the system from returning unrelated records that happen to have slightly elevated similarity scores.

Recency cap

Questions with numeric recency constraints are automatically detected and honored. For example, "last 3 blood pressure readings" or "most recent 5 lab results" will cap the results per concept group to the specified number, keeping only the most recent measurements. This applies across all retrieval pipelines.

Input validation

Questions are checked against common prompt injection patterns (e.g., "ignore previous instructions", "you are now", "system prompt:") and rejected with HTTP 400 if matched. This is a defense-in-depth measure — the primary protection is the GBNF grammar that constrains LLM output to a fixed JSON structure regardless of prompt content. Normal clinical questions containing words like "ignore" or "instructions" in non-adversarial contexts (e.g., "What instructions were given at discharge?") are not affected.

API

Search

POST /ws/rest/v1/chartsearchai/search
Content-Type: application/json

{
  "patient": "patient-uuid-here",
  "question": "What medications is this patient on?"
}

Response:

{
  "answer": "The patient is currently on Metformin [1] and Lisinopril [3]...",
  "disclaimer": "This response is AI-generated and may not be accurate...",
  "questionId": "42",
  "references": [
    { "index": 3, "resourceType": "order", "resourceId": 789, "date": "2025-03-15" },
    { "index": 1, "resourceType": "order", "resourceId": 456, "date": "2025-01-10" }
  ]
}

questionId is a string identifier for this query, used to submit feedback (see below). It is omitted if audit logging fails.

Streaming search (SSE)

For real-time token-by-token streaming:

POST /ws/rest/v1/chartsearchai/search/stream
Content-Type: application/json
Accept: text/event-stream

{
  "patient": "patient-uuid-here",
  "question": "What medications is this patient on?"
}

SSE events:

Event Description
token A chunk of the answer text as it is generated
done Final JSON with the complete answer, references (sorted most recent first, with index, resourceType, resourceId, date), questionId, and disclaimer
error Error message if something goes wrong

Feedback

Submit user feedback (thumbs up/down) for an AI response. Requires the "AI Query Patient Data" privilege.

POST /ws/rest/v1/chartsearchai/feedback
Content-Type: application/json

{
  "questionId": "42",
  "rating": "positive",
  "comment": "Accurate and helpful"
}
Field Required Description
questionId Yes The questionId from the search response
rating Yes "positive" or "negative"
comment No Optional text (max 500 characters, truncated if longer)

Users can only submit feedback on their own queries. Submitting again overwrites the previous feedback.

Audit log

Requires the "View AI Audit Logs" privilege.

GET /ws/rest/v1/chartsearchai/auditlog?patient=...&user=...&fromDate=...&toDate=...&startIndex=0&limit=50

All query parameters are optional. fromDate and toDate are epoch milliseconds. Returns paginated results ordered by most recent first, with a totalCount for pagination. Each entry includes rating and feedbackComment fields (null if no feedback was submitted).

Patient access control

By default, any user with the "AI Query Patient Data" privilege can query any patient. To add patient-level restrictions (e.g., location-based or care-team-based), provide a custom Spring bean that implements the PatientAccessCheck interface:

<bean id="chartSearchAi.patientAccessCheck"
      class="com.example.LocationBasedPatientAccessCheck"/>

This overrides the default permissive implementation.

Evals

The project includes an eval framework that tests retrieval quality, citation accuracy, absent-data detection, and prompt injection resistance without requiring a running LLM or external services.

Running evals

mvn test -pl api -Dtest="*EvalTest"

Or run a specific suite:

mvn test -pl api -Dtest="RetrievalQualityEvalTest"
mvn test -pl api -Dtest="CitationEvalTest"
mvn test -pl api -Dtest="AbsentDataEvalTest"
mvn test -pl api -Dtest="PromptInjectionEvalTest"

Adding cases

Each suite is driven by a JSON dataset in api/src/test/resources/eval/. To add a case, append an entry to the relevant file:

File What it tests
retrieval-eval-dataset.json Query → expected record indices (recall@30)
citation-eval-dataset.json Simulated LLM JSON → expected citation indices (F1)
absent-data-eval-dataset.json Query → expected keywords in "no records" answer
prompt-injection-eval-dataset.json Adversarial payload → special tokens stripped

Metrics report

Each run appends per-case and summary metrics to api/target/eval-results.csv for tracking regressions over time.

Evaluated models

The following models were evaluated for local inference via java-llama.cpp (Q4_K_M quantization, GGUF format). All figures are approximate and depend on hardware.

Model Params File Size Total RAM Context Window CPU Speed Chat Template
Qwen 2.5 1.5B 1.5B ~1GB ~2GB 32K tokens ~40–50 tok/s chatml
Gemma 3 1B 1B ~0.7GB ~2GB 32K tokens ~40–50 tok/s gemma
Gemma 3n E2B E2B (5B total) ~1.5GB ~3GB 32K tokens ~25–35 tok/s gemma
Gemma 4 E2B E2B (2.3B eff) ~1.5GB ~3–5GB 128K tokens ~25–35 tok/s gemma
Llama 3.2 3B 3B ~2GB ~6GB 128K tokens ~20–30 tok/s llama3
Phi-3 Mini 3.8B 3.8B ~2GB ~4GB 4K tokens ~15–25 tok/s phi3
Gemma 3 4B 4B ~2.5GB ~6–8GB 128K tokens ~10–20 tok/s gemma
Gemma 3n E4B E4B (8B total) ~2.5GB ~3–5GB 32K tokens ~15–25 tok/s gemma
Gemma 4 E4B E4B (4.5B eff) ~2.5GB ~6–8GB 128K tokens ~10–20 tok/s gemma
MedGemma 1.5 4B (default) 4B ~2.5GB ~6–8GB 128K tokens ~10–20 tok/s gemma
MedGemma 4B 4B ~2.5GB ~6–8GB 128K tokens ~10–20 tok/s gemma
Mistral 7B 7B ~4GB ~8GB 32K tokens ~10–15 tok/s mistral
Qwen 2.5 7B 7B ~4GB ~8GB 128K tokens ~8–12 tok/s chatml
Llama 3.3 8B 8B ~4.5GB ~10GB 128K tokens ~8–12 tok/s llama3
Gemma 2 9B Instruct 9B ~5GB ~10GB 8K tokens ~5–10 tok/s gemma
Gemma 3 12B 12B ~7GB ~12GB 128K tokens ~4–8 tok/s gemma
Mistral Nemo 12B 12B ~7GB ~12GB 128K tokens ~4–8 tok/s mistral
Phi-3-Medium 14B 14B ~8GB ~14GB 4K tokens ~3–6 tok/s phi3
Qwen 2.5 14B 14B ~8GB ~14GB 128K tokens ~3–6 tok/s chatml
Gemma 4 26B MoE 26B (3.8B active) ~15GB ~18–22GB 256K tokens ~3–6 tok/s gemma
Gemma 3 27B 27B ~16.5GB ~20–24GB 128K tokens ~1–2 tok/s gemma
MedGemma 27B Text 27B ~16.5GB ~20–24GB 128K tokens ~1–2 tok/s gemma
Gemma 4 31B 31B ~18GB ~22–26GB 256K tokens ~1–2 tok/s gemma

Model size guidance

  • 1–2B models (Gemma 3 1B, Gemma 3n E2B, Gemma 4 E2B): Ultra-low-resource or on-device deployments. Gemma 3n and Gemma 4 "E" models use Per-Layer Embeddings (PLE) for memory efficiency — E2B runs in as little as ~3GB RAM. Weaker reasoning but fast inference. Gemma 4 E2B offers 128K context; Gemma 3 1B and 3n E2B are limited to 32K.
  • 3B models (Llama 3.2 3B): Most deployable in low-resource settings but weaker instruction following — may produce verbose or hedging responses.
  • 4B models (MedGemma 1.5 4B, Gemma 4 E4B): Recommended default tier. MedGemma 1.5 4B provides medical-domain fine-tuning with improved medical imaging support. Gemma 4 E4B is a strong general-purpose alternative under the permissive Apache 2.0 license. Both offer 128K context and ~10–20 tok/s CPU inference at ~6–8GB total RAM.
  • 8B models (Llama 3.3 8B): Significantly better general reasoning and instruction following than 4B, feasible on 10GB RAM.
  • 12B models (Gemma 3 12B, Mistral Nemo 12B): Best sub-15B options for clinical Q&A. Gemma 3 12B offers 128K context with strong reasoning. Mistral Nemo 12B has strong medical text comprehension.
  • 14B models (Qwen 2.5 14B, Phi-3-Medium 14B): Best CPU-viable response quality, but slower (~2–4 tok/s) and need 14–16GB RAM.
  • 26–31B models (Gemma 4 26B MoE, Gemma 4 31B, MedGemma 27B Text): Highest quality tier. Gemma 4 26B MoE activates only 3.8B parameters per token, offering faster inference than dense models at this size. Gemma 4 31B Dense offers the best general reasoning under Apache 2.0. MedGemma 27B Text is the medical-domain specialist. All require ~20GB+ RAM and are practical mainly with GPU acceleration.

A server running OpenMRS typically uses 1–2GB for the JVM heap. A 4GB machine is insufficient — the smallest viable model requires at least 3–4GB on its own.

Licensing notes

  • Gemma 4 (Google): Apache 2.0 license — fully permissive, no usage restrictions. The first Gemma family release under a standard open-source license.
  • Gemma 3, Gemma 3n (Google): Gemma Terms of Use — custom license that permits commercial use but reserves Google's right to terminate access for policy violations. More restrictive than Apache 2.0.
  • Gemma 2 (Google): Gemma Terms of Use.
  • MedGemma (Google): Health AI Developer Foundations Terms — more restrictive than Gemma. Requires validation before clinical deployment. Applies to both MedGemma 1.5 4B and MedGemma 27B Text.
  • Llama 3.x (Meta): Free for research and commercial use under the Llama 3.2 Community License. Not technically "open source" by OSI definition — the only meaningful restriction is that products with over 700M monthly active users require a separate license.
  • Mistral (Mistral AI): Apache 2.0 license.
  • Phi-3 (Microsoft): MIT license — fully permissive with no usage restrictions.
  • Qwen 2.5 (Alibaba): Apache 2.0 license. Developed by a Chinese company subject to China's data laws — while GGUF models run locally with no data leaving the machine, some organizations may have compliance concerns.

See docs/adr.md (Decision 10) for detailed per-model analysis, trade-off discussion, and architectural rationale.

Architecture

See docs/adr.md for architectural decisions and design rationale.

License

This project is licensed under the MPL 2.0.

MedGemma is licensed under the Health AI Developer Foundations License, Copyright (C) Google LLC. All Rights Reserved.

Gemma 4 is licensed under the Apache 2.0 License.

Gemma 3 and Gemma 3n are licensed under the Gemma Terms of Use, Copyright (C) Google LLC. All Rights Reserved.

Llama 3.3 is licensed under the Llama 3.2 Community License, Copyright (C) Meta Platforms, Inc. All Rights Reserved.

About

AI-powered chart search module for OpenMRS

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages