Defer context.unload() until after QA (keep the terminology cache from being overwritten mid-build) #1318
really? where? how to reproduce this?
Thanks @grahamegrieve — here's a complete, runnable reproduction: published 2.2.8 vs the fix, on a fresh mCODE clone, no proxy. Metric = tx.fhir.org round-trips, counted with
Warm: 1075 → 63 calls (~94% fewer) and 6m44s → 4m29s (~33% faster, all of it in the terminology phases:
Reproduce (simple commands)
# --- bug: published publisher ---
curl -sL -o publisher.jar \
https://github.com/HL7/fhir-ig-publisher/releases/download/2.2.8/publisher.jar
git clone --depth 1 https://github.com/HL7/fhir-mCODE-ig mcode # SUSHI compiles clean
rm -rf mcode/input-cache/txcache
java -jar publisher.jar -ig mcode; grep -c x-request-id mcode/output/qa-tx.html # 1280 cold
java -jar publisher.jar -ig mcode; grep -c x-request-id mcode/output/qa-tx.html # 1075 warm
# --- fix: build it and repeat ---
git clone -b repro/defer-unload-on-2.2.8 \
https://github.com/jmandel/fhir-ig-publisher fixpub
( cd fixpub && mvn -q -DskipTests clean package )
FIX=fixpub/org.hl7.fhir.publisher.cli/target/org.hl7.fhir.publisher.cli-2.2.8.jar
rm -rf mcode/input-cache/txcache
java -jar "$FIX" -ig mcode; grep -c x-request-id mcode/output/qa-tx.html # 1210 cold
java -jar "$FIX" -ig mcode; grep -c x-request-id mcode/output/qa-tx.html # 63 warm
What's happening (correction to my original description)
The late terminology consumer isn't QA /
One caveat on building against master
I couldn't run mCODE on a
The actual problem is that genCombinedPackage() should not be called after unload, and it should be an error if the TerminologyCache is used after unload() |
genCombinedPackage() (in generator.generate()) and QA both use the terminology context, but context.unload() was called earlier, in the generate phase's "reclaim memory" block. unload() flushes the complete tx cache to disk, then clears it; genCombinedPackage()'s value-set expansions then repopulate a partial cache whose coalesced save() overwrites the complete .cache files, so warm builds re-query everything already computed (mCODE: warm tx calls 1075 -> 63, build 6m44s -> 4m29s, QA unchanged).

Move unload() to the end of createIg(), after genCombinedPackage and QA.

Companion core guard (makes post-unload tx-cache use a hard error so this class of ordering bug fails loudly): hapifhir/org.hl7.fhir.core#2473

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
33ddc8d to fb90a5d
Done: updated this branch and opened the core guard as a separate PR.
This branch (#1318): the rewrite is the same (move
Core guard, hapifhir/org.hl7.fhir.core#2473 (draft): implements your "it should be an error if the
I made it an
Verified end-to-end on mCODE: buggy ordering + guard → hard fail (above); this branch + guard → builds clean (guard never fires). Since no IG content can trigger it (it's purely a code-ordering invariant), failing hard is safe and turns any future stray post-
Sequencing: #2473 is a draft because the publisher should adopt it together with this PR; on its own it would hard-fail every terminology-bearing R4 build until this ordering fix is in.
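To illustrate the kind of guard discussed here, the following is a minimal sketch of a cache that fails loudly when touched after unload(). It is not the code in #2473: the class name `GuardedCacheSketch`, its methods, and the choice of `IllegalStateException` are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a terminology cache; not the real TerminologyCache API.
final class GuardedCacheSketch {
    private final Map<String, String> entries = new HashMap<>();
    private boolean unloaded = false;

    void store(String key, String value) {
        checkLive("store");
        entries.put(key, value);
    }

    String lookup(String key) {
        checkLive("lookup");
        return entries.get(key);
    }

    void unload() {
        // A real implementation would flush to disk here before clearing.
        entries.clear();
        unloaded = true;
    }

    // Assumed exception type: any use after unload() is a code-ordering bug.
    private void checkLive(String operation) {
        if (unloaded) {
            throw new IllegalStateException(
                "terminology cache " + operation + " after unload()");
        }
    }

    public static void main(String[] args) {
        GuardedCacheSketch cache = new GuardedCacheSketch();
        cache.store("k", "v");
        cache.unload();
        try {
            cache.store("late", "v"); // simulates a stray post-unload consumer
        } catch (IllegalStateException e) {
            System.out.println("failed loudly: " + e.getMessage());
        }
    }
}
```

With a guard of this shape, a misplaced unload() shows up as an immediate failure at the first late cache access rather than as a silently truncated cache on disk.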
Problem
context.unload() is called in the generate phase's "Reclaiming memory…" block (PublisherGenerator), but QA (ValidationPresenter) runs afterwards in Publisher.createIg() and still uses the terminology context. unload() flushes the tx cache to disk and then clears the in-memory caches; QA then repopulates fresh, partial caches, and TerminologyCache's write-coalescing save() writes those back, overwriting the complete cache files unload() had just written.
Net effect: most persisted terminology results are discarded at the end of every build, so warm/subsequent builds re-query the tx server for answers the previous build already computed. Measured on US mCODE: a cold build performs ~2,600 persistent stores but only ~230 survive on disk; the SNOMED bucket grows to ~1,029 entries, is flushed in full, then overwritten down to ~86.
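The overwrite can be reproduced in miniature. The sketch below is a toy model, not the publisher's code: `ToyCache`, the key names, and the entry counts are hypothetical, and the on-disk file is simulated with a map. It only assumes what is described above, namely that save() is a full overwrite and unload() is a flush followed by a clear.

```java
import java.util.HashMap;
import java.util.Map;

final class TxCacheOverwriteDemo {
    // Hypothetical stand-in for a terminology cache; not the real TerminologyCache API.
    static final class ToyCache {
        final Map<String, String> memory = new HashMap<>();
        final Map<String, String> disk; // simulates one on-disk .cache file

        ToyCache(Map<String, String> disk) { this.disk = disk; }

        void store(String key, String value) { memory.put(key, value); }

        // Full-file overwrite: the disk copy becomes exactly what is in memory.
        void save() {
            disk.clear();
            disk.putAll(memory);
        }

        // Flush everything to disk, then drop the in-memory entries.
        void unload() {
            save();
            memory.clear();
        }
    }

    // Returns the number of entries left on disk after one simulated build.
    static int build(boolean unloadBeforeLateConsumer) {
        Map<String, String> disk = new HashMap<>();
        ToyCache cache = new ToyCache(disk);
        for (int i = 0; i < 10; i++) {
            cache.store("validate-" + i, "ok"); // main-phase terminology results
        }
        if (unloadBeforeLateConsumer) {
            cache.unload(); // early "reclaim memory": disk is complete, memory is empty
        }
        cache.store("expand-0", "ok"); // a late consumer repopulates the cache
        cache.save();                  // the partial cache overwrites the file
        if (!unloadBeforeLateConsumer) {
            cache.unload();            // deferred unload: the final flush sees everything
        }
        return disk.size();
    }

    public static void main(String[] args) {
        System.out.println("early unload, entries on disk:    " + build(true));
        System.out.println("deferred unload, entries on disk: " + build(false));
    }
}
```

In this toy model the early-unload ordering leaves only the late consumer's entry on disk, while the deferred ordering keeps all eleven.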
Fix
Move pf.context.unload() from the generate-phase reclaim block to the end of createIg() (after QA, the last terminology consumer). The final flush then captures everything and the persisted cache stays complete. The only cost is holding the terminology context through QA, which QA needs anyway.
Impact (measured)
We profiled cold-vs-warm terminology traffic across a random sample of published HL7 IGs. Warm-cache effectiveness (the fraction of cold tx round-trips a warm build avoids) was poor precisely for terminology-heavy clinical IGs:
The low numbers are explained by this bug. Demonstration on mCODE (baseline vs this fix, cold then warm, identical inputs, requests captured via a logging proxy):
Warm terminology-server calls drop ~97%, the complete cache persists (~10× more entries on disk), and validation output is unchanged.
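For reference, warm-cache effectiveness as defined above (the fraction of cold tx round-trips a warm build avoids) is a one-line calculation. The sketch below applies it to the cold/warm counts from the reproduction commands in this thread, not to the proxy-based measurements; the class and method names are made up for illustration.

```java
import java.util.Locale;

final class WarmCacheEffectiveness {
    // Fraction of cold-build tx round-trips that the warm build avoids.
    static double avoidedFraction(int coldCalls, int warmCalls) {
        return 1.0 - (double) warmCalls / coldCalls;
    }

    public static void main(String[] args) {
        // Counts taken from the grep -c x-request-id results in the reproduction.
        double published = avoidedFraction(1280, 1075); // published 2.2.8
        double fixed = avoidedFraction(1210, 63);       // build with the fix
        System.out.println(String.format(Locale.ROOT,
            "published: %.1f%% avoided, fix: %.1f%% avoided",
            published * 100, fixed * 100));
    }
}
```

On those counts the published build avoids roughly 16% of its cold round-trips when warm, and the fixed build roughly 95%.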
Notes
TerminologyCache.save() full-file overwrite is also a latent fragility (a partial re-save after unload() clobbers the complete file); that could be separately hardened in org.hl7.fhir.core, but this publisher change fixes the root cause.
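One possible shape for that hardening, shown only as a sketch under assumptions, is to merge the in-memory entries into what is already persisted instead of replacing the file. `MergingSaveSketch` and `mergeForSave` are hypothetical names, the file contents are modelled as maps, and nothing here reflects how org.hl7.fhir.core actually stores its cache.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class MergingSaveSketch {
    // Keep everything already persisted; in-memory entries add to or replace it.
    static Map<String, String> mergeForSave(Map<String, String> onDisk,
                                            Map<String, String> inMemory) {
        Map<String, String> merged = new TreeMap<>(onDisk);
        merged.putAll(inMemory);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> onDisk = new HashMap<>();
        onDisk.put("validate-1", "ok");
        onDisk.put("validate-2", "ok");

        Map<String, String> inMemory = new HashMap<>();
        inMemory.put("expand-1", "ok"); // partial cache rebuilt after an unload

        // A merging save keeps the earlier entries instead of clobbering them.
        System.out.println(mergeForSave(onDisk, inMemory).keySet());
    }
}
```

A merge like this would trade an extra read of the existing file for protection against partial re-saves; whether that cost is acceptable is a separate design question from the ordering fix.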