The integration test added in PR #262 (integration/cjk_drop_sfnt_roundtrip_test.go) is gated on runtime.GOOS == "darwin" because STHeiti Light is the only universally available large-CJK TTC fixture on dev hosts. Linux and Windows CI run the synthetic 25k-group format-12 cmap test (font/cmap_test.go::TestParseCmapFormat12LargeGroupCount) but never exercise the full HTML → PDF → ToUnicode CMap → text extraction round trip on a real CJK font.
Goal
Check in a small (~5KB) synthetic CJK font under font/testdata/ so the round-trip test runs unconditionally on every host.
Why this gap matters
A regression that breaks the embedded subset's /ToUnicode CMap construction (silently producing a usable-looking PDF whose text extraction returns garbage) currently slips through Linux and Windows CI. The macOS test catches it, but darwin runners aren't required for merge. Closing the gap means the regression-pin runs on every push everywhere.
Constraints on the fixture
The fixture needs to satisfy several constraints simultaneously, which is why this isn't trivial:
- CJK glyph coverage: must map at least 5-10 CJK codepoints (e.g., U+4E2D, U+534E, U+4EBA — characters from "中华人") to real glyph IDs. Without CJK glyphs the fallback chain doesn't trigger and the test isn't exercising what it claims to test.
- Small enough to check in: ideally ≤5 KB. Any real CJK font (msyh.ttc 14MB, NotoSansSC-Regular.otf 10MB, even Source Han Sans subsets at 100KB) is too large.
- Apache 2.0-compatible license: any checked-in font binary inherits redistribution terms. Most CJK fonts are SIL Open Font License (compatible) or proprietary (not compatible). A custom-built font dodges this.
- Triggers the recovery path: the original bug was sfnt's maxCmapSegments=20000 rejection. The fixture doesn't strictly need to exceed that limit (the synthetic-cmap unit test already covers it), but it should carry enough cmap entries to make a realistic CJK round trip.
Proposed approach: build the fixture programmatically
A custom-built minimal TrueType font with:
- 10 CJK codepoints, each mapping to a unique glyph
- Each glyph is a simple square box outline (M 0 0 L 100 0 L 100 100 L 0 100 Z); there is no aesthetic concern, since the test asserts on text extraction, not rendering quality
- The minimum tables required by Folio's parser: head, hhea, maxp, cmap (format 4 with 10 entries, which fits comfortably under any sfnt limit), name, hmtx, glyf, loca, OS/2, post
This is ~150 LOC of binary construction (the inverse of what Folio's font/ parsers do). Total checked-in size: 2-5 KB. License: ours, since we built it.
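To give a feel for the binary construction involved, here is a minimal sketch of encoding the square box outline as a simple glyph in glyf-table layout. The function name buildSquareGlyph and the choice to pad to a 4-byte boundary are assumptions made for illustration, not part of any existing Folio code.

```go
package main

import "fmt"

// buildSquareGlyph (hypothetical helper) encodes a one-contour simple glyph:
// a square with corners (0,0), (size,0), (size,size), (0,size), in that order.
func buildSquareGlyph(size int16) []byte {
	var b []byte
	// put appends a big-endian signed 16-bit value.
	put := func(v int16) { b = append(b, byte(uint16(v)>>8), byte(v)) }

	put(1)    // numberOfContours
	put(0)    // xMin
	put(0)    // yMin
	put(size) // xMax
	put(size) // yMax
	put(3)    // endPtsOfContours[0]: index of the last point
	put(0)    // instructionLength: no hinting instructions

	// One flag per point. 0x01 = on-curve; with the short-vector and
	// same/positive bits clear, each coordinate is a signed 16-bit delta.
	for i := 0; i < 4; i++ {
		b = append(b, 0x01)
	}
	// Coordinates are stored as deltas from the previous point.
	for _, dx := range []int16{0, size, 0, -size} {
		put(dx)
	}
	for _, dy := range []int16{0, 0, size, 0} {
		put(dy)
	}
	// Pad so the next glyph starts on a 4-byte boundary; this keeps the
	// offsets even, which the short loca format requires.
	for len(b)%4 != 0 {
		b = append(b, 0)
	}
	return b
}

func main() {
	g := buildSquareGlyph(100)
	fmt.Printf("glyph record: %d bytes\n", len(g))
}
```

Since every CJK codepoint in the fixture can point at a glyph of this shape, the glyf table stays in the low hundreds of bytes even with 10 distinct glyph IDs.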
The build script could live under font/testdata/build_cjk_fixture.go (with //go:build ignore) and produce font/testdata/synthetic_cjk.ttf. CI doesn't run the script; the .ttf is committed.
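One piece such a build script would need is the format 4 cmap subtable. The sketch below uses one segment per codepoint plus the required 0xFFFF terminator. The names buildCmapFormat4 and lookup are hypothetical, and assigning glyph IDs 1..n in ascending codepoint order is an assumption of this example. It expects distinct BMP codepoints below 0xFFFF.

```go
package main

import (
	"fmt"
	"math/bits"
	"sort"
)

// buildCmapFormat4 (hypothetical helper) encodes a cmap format 4 subtable
// with one segment per codepoint. Glyph 0 is left for .notdef.
func buildCmapFormat4(codepoints []uint16) []byte {
	cps := append([]uint16(nil), codepoints...)
	sort.Slice(cps, func(i, j int) bool { return cps[i] < cps[j] })

	segCount := len(cps) + 1                      // +1 for the 0xFFFF terminator
	entrySelector := bits.Len(uint(segCount)) - 1 // floor(log2(segCount))
	searchRange := 2 << entrySelector             // 2 * 2^entrySelector

	var b []byte
	// put appends the low 16 bits of v, big-endian.
	put := func(v int) { b = append(b, byte(v>>8), byte(v)) }

	put(4)                        // format
	put(16 + 8*segCount)          // subtable length in bytes
	put(0)                        // language
	put(2 * segCount)             // segCountX2
	put(searchRange)              // searchRange
	put(entrySelector)            // entrySelector
	put(2*segCount - searchRange) // rangeShift

	for _, cp := range cps { // endCode[]
		put(int(cp))
	}
	put(0xFFFF)
	put(0)                   // reservedPad
	for _, cp := range cps { // startCode[]
		put(int(cp))
	}
	put(0xFFFF)
	for i, cp := range cps { // idDelta[]: glyphID - codepoint, mod 65536
		put(i + 1 - int(cp))
	}
	put(1)                          // terminator: 0xFFFF + 1 wraps to glyph 0
	for i := 0; i < segCount; i++ { // idRangeOffset[]: all zero, delta-only mapping
		put(0)
	}
	return b
}

// lookup resolves a codepoint against a subtable built above. It ignores
// idRangeOffset, which is only valid because the builder always writes zero.
func lookup(sub []byte, cp uint16) uint16 {
	u16 := func(off int) uint16 { return uint16(sub[off])<<8 | uint16(sub[off+1]) }
	segX2 := int(u16(6))
	endOff, startOff, deltaOff := 14, 16+segX2, 16+2*segX2
	for i := 0; i < segX2; i += 2 {
		if cp <= u16(endOff+i) {
			if cp < u16(startOff+i) {
				return 0 // falls in a gap between segments
			}
			return cp + u16(deltaOff+i) // uint16 addition wraps mod 65536
		}
	}
	return 0
}

func main() {
	sub := buildCmapFormat4([]uint16{0x4E2D, 0x534E, 0x4EBA}) // 中, 华, 人
	fmt.Println(len(sub), lookup(sub, 0x4E2D), lookup(sub, 0x4EBA), lookup(sub, 0x534E))
}
```

With 10 codepoints this yields 11 segments and a 104-byte subtable, far below the 20000-segment limit mentioned above.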
Alternative: use Source Han Sans Region (subset)
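Source Han Sans ships region-specific fonts at ~10MB each, but a pyftsubset to 50 codepoints could produce a ~30KB fixture. Note that current Source Han Sans releases are under the SIL Open Font License 1.1 rather than Apache 2.0, so the subset would carry those terms (compatible, per the license constraint above). Tradeoff: requires a one-time external tool, and the resulting bytes are tied to a specific Source Han Sans version. The custom-built option is simpler long-term.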
Scope estimate
~150 LOC of fixture builder + ~10 LOC of test integration + a ~5KB binary. Total: ~1 day of focused work.
Acceptance criteria
- font/testdata/synthetic_cjk.ttf (or .ttc) exists and is ≤10KB
- integration/cjk_drop_sfnt_roundtrip_test.go no longer gates on runtime.GOOS, runs on Linux and Windows CI
- html.ConvertFull → document.Save → reader.Parse → page.ExtractText → byte-perfect equality with input
Related
This was deferred during PR #262's review; flagged again in the post-merge audit. Not blocking any user-reported issue today (the production code paths are correct on every host I've manually tested), but it's the only remaining no-CI-coverage branch in the recovery pipeline.