Defer context.unload() until after QA (keep the terminology cache from being overwritten mid-build) by jmandel · Pull Request #1318 · HL7/fhir-ig-publisher

jmandel · 2026-06-04T16:05:30Z

Problem

context.unload() is called in the generate phase's "Reclaiming memory…" block (PublisherGenerator), but QA (ValidationPresenter) runs afterwards in Publisher.createIg() and still uses the terminology context. unload() flushes the tx cache to disk and then clears the in-memory caches; QA then repopulates fresh, partial caches, and TerminologyCache's write-coalescing save() writes those back — overwriting the complete cache files unload() had just written.

Net effect: most persisted terminology results are discarded at the end of every build, so warm/subsequent builds re-query the tx server for answers the previous build already computed. Measured on US mCODE: a cold build performs ~2,600 persistent stores but only ~230 survive on disk; the SNOMED bucket grows to ~1,029 entries, is flushed in full, then overwritten down to ~86.

Fix

Move pf.context.unload() from the generate-phase reclaim block to the end of createIg() (after QA — the last terminology consumer). The final flush then captures everything and the persisted cache stays complete. The only cost is holding the terminology context through QA, which QA needs anyway.

Impact (measured)

We profiled cold-vs-warm terminology traffic across a random sample of published HL7 IGs. Warm-cache effectiveness (the fraction of cold tx round-trips a warm build avoids) was poor precisely for terminology-heavy clinical IGs:

light terminology: registry-protocols 100%, smart-scheduling-links 97%, smart-app-launch 89%
heavy terminology: US-Core 50%, physical-activity 44%, CARIN-BB 36%, IPS 30%, PQ-CMC-FDA 28%, mCODE 21%, ICHOM-breast-cancer 18%

The low numbers are explained by this bug. Demonstration on mCODE (baseline vs this fix, cold then warm, identical inputs, requests captured via a logging proxy):

	warm tx calls	on-disk cache entries (after cold)	QA result
baseline	2,169	234	8 errors / 117 warnings
with this fix	69	2,532	8 errors / 117 warnings

Warm terminology-server calls drop ~97%, the complete cache persists (~10× more entries on disk), and validation output is unchanged.

Notes

Minimal change — moves one line (plus a comment and a release note).
The underlying TerminologyCache.save() full-file overwrite is also a latent fragility (a partial re-save after unload() clobbers the complete file); that could be separately hardened in org.hl7.fhir.core, but this publisher change fixes the root cause.
Marked draft pending review and a regression test.

grahamegrieve · 2026-06-04T19:23:36Z

QA (ValidationPresenter) runs afterwards in Publisher.createIg() and still uses the terminology context.

really? where? how to reproduce this?

jmandel · 2026-06-04T21:29:39Z

Thanks @grahamegrieve — here's a complete, runnable reproduction: published 2.2.8 vs the fix, on a fresh mCODE clone, no proxy. Metric = tx.fhir.org round-trips, counted with grep -c x-request-id output/qa-tx.html (every server response carries an x-request-id).

mCODE (fresh clone, no proxy)	cold tx	warm tx	warm build time	txcache after cold	QA
published 2.2.8	1280	1075	6m 44s	5.0 MB / 254 entries	6 err / 117 warn
2.2.8 + fix	1210	63	4m 29s	14 MB / 2555 entries	6 err / 117 warn

Warm: 1075 → 63 calls (~94% fewer) and 6m44s → 4m29s (~33% faster — all of it in the terminology phases: generate −2m16s, validation −1m03s, narrative −25s). QA is byte-identical, so it's pure cache reuse, not changed output.

Reproduce (simple commands)

# --- bug: published publisher ---
curl -sL -o publisher.jar \
  https://github.com/HL7/fhir-ig-publisher/releases/download/2.2.8/publisher.jar
git clone --depth 1 https://github.com/HL7/fhir-mCODE-ig mcode      # SUSHI compiles clean
rm -rf mcode/input-cache/txcache
java -jar publisher.jar -ig mcode; grep -c x-request-id mcode/output/qa-tx.html   # 1280  cold
java -jar publisher.jar -ig mcode; grep -c x-request-id mcode/output/qa-tx.html   # 1075  warm

# --- fix: build it and repeat ---
git clone -b repro/defer-unload-on-2.2.8 \
  https://github.com/jmandel/fhir-ig-publisher fixpub
( cd fixpub && mvn -q -DskipTests clean package )
FIX=fixpub/org.hl7.fhir.publisher.cli/target/org.hl7.fhir.publisher.cli-2.2.8.jar
rm -rf mcode/input-cache/txcache
java -jar "$FIX" -ig mcode; grep -c x-request-id mcode/output/qa-tx.html          # 1210  cold
java -jar "$FIX" -ig mcode; grep -c x-request-id mcode/output/qa-tx.html          # 63    warm

What's happening (correction to my original description)

The late terminology consumer isn't QA / ValidationPresenter — it's genCombinedPackage() (PublisherGenerator.java:468), which runs in the same generate() method after the unload() at PublisherGenerator.java:379. unload() flushes the complete cache to disk and then clear()s it; genCombinedPackage's value-set $expands repopulate a partial cache whose coalesced save() overwrites the complete .cache files. Net result on disk: 254 of the 2555 entries the build computed survive (~90% discarded), so the next build reloads the partial cache and re-queries the rest. The fix just moves unload() to the end of createIg(), after both genCombinedPackage and QA — same final flush, but nothing runs after it to overwrite the cache.

One caveat on building against master

I couldn't run mCODE on a master build: current master (core 6.9.10-SNAPSHOT) hard-aborts publishing any R4 IG that emits ttl with "Turtle serialization for R4 is not supported in this build" (org.hl7.fhir.r5.elementmodel.TurtleParser.compose, before any of this cache code runs). So the numbers above come from the identical 2-line change applied on the 2.2.8 release base (which builds against released core 6.9.8, no auth, no Turtle issue). Flagging that R4-Turtle abort separately in case it isn't intended — it's unrelated to this PR.

grahamegrieve · 2026-06-04T21:41:36Z

The actual problem is that genCombinedPackage() should not be called after unload, and it should be an error if the TerminologyCache is used after unload()

genCombinedPackage() (in generator.generate()) and QA both use the terminology context, but context.unload() was called earlier, in the generate phase's "reclaim memory" block. unload() flushes the complete tx cache to disk then clears it; genCombinedPackage()'s value-set expansions then repopulate a partial cache whose coalesced save() overwrites the complete .cache files, so warm builds re-query everything already computed (mCODE: warm tx calls 1075 -> 63, build 6m44s -> 4m29s, QA unchanged). Move unload() to the end of createIg(), after genCombinedPackage and QA. Companion core guard (makes post-unload tx-cache use a hard error so this class of ordering bug fails loudly): hapifhir/org.hl7.fhir.core#2473 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jmandel · 2026-06-05T00:49:22Z

Done — updated this branch and opened the core guard as a separate PR.

This branch (#1318): the rewrite is the same (move pf.context.unload() to the end of createIg(), after every terminology consumer). I corrected the code comments to attribute the late consumer to genCombinedPackage() rather than QA, and cross-referenced the core PR.

Core guard — hapifhir/org.hl7.fhir.core#2473 (draft): implements your "it should be an error if the TerminologyCache is used after unload()". unload() sets an unloaded flag; the four cache entry points (get/cacheExpansion, get/cacheValidation) throw if used afterward; unit test added (existing 42 TerminologyCache tests still pass, on master).

I made it an AssertionError rather than a RuntimeException on purpose: as a RuntimeException it was swallowed by the per-expansion catch (Exception) in genCombinedPackage — 102 logged errors but a "successful" build with a degraded expansions package. An AssertionError escapes that catch and fails the build, with a stack that pinpoints the culprit:

java.lang.AssertionError: Terminology cache used after unload()...
  at TerminologyCache.getExpansion(...)
  at BaseWorkerContext.expandVS(...)
  at PackageReGenerator.generateExpansionPackageInner(...)
  at PublisherGenerator.genCombinedPackage(...)
  at Publisher.createIg(...)

Verified end-to-end on mCODE: buggy ordering + guard → hard fail (above); this branch + guard → builds clean (guard never fires). Since no IG content can trigger it (it's purely a code-ordering invariant), failing hard is safe and turns any future stray post-unload() path into an immediate build failure.

Sequencing: #2473 is a draft because the publisher should adopt it together with this PR — on its own it would hard-fail every terminology-bearing R4 build until this ordering fix is in.

github-project-automation Bot added this to IG Publisher issue board Jun 4, 2026

github-project-automation Bot moved this to Inbox in IG Publisher issue board Jun 4, 2026

jmandel mentioned this pull request Jun 5, 2026

Make using the TerminologyCache after unload() a hard error hapifhir/org.hl7.fhir.core#2473

Closed

jmandel force-pushed the fix/defer-context-unload-until-after-qa branch from 33ddc8d to fb90a5d Compare June 5, 2026 00:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Defer context.unload() until after QA (keep the terminology cache from being overwritten mid-build)#1318

Defer context.unload() until after QA (keep the terminology cache from being overwritten mid-build)#1318
jmandel wants to merge 1 commit into
HL7:masterfrom
jmandel:fix/defer-context-unload-until-after-qa

jmandel commented Jun 4, 2026

Uh oh!

grahamegrieve commented Jun 4, 2026

Uh oh!

jmandel commented Jun 4, 2026

Uh oh!

grahamegrieve commented Jun 4, 2026 •

edited

Loading

Uh oh!

jmandel commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jmandel commented Jun 4, 2026

Problem

Fix

Impact (measured)

Notes

Uh oh!

grahamegrieve commented Jun 4, 2026

Uh oh!

jmandel commented Jun 4, 2026

Reproduce (simple commands)

What's happening (correction to my original description)

One caveat on building against master

Uh oh!

grahamegrieve commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jmandel commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

grahamegrieve commented Jun 4, 2026 •

edited

Loading