test: synthetic CJK font fixture under font/testdata for cross-platform round-trip CI #281

@carlos7ags

Description

The integration test added in PR #262 (integration/cjk_drop_sfnt_roundtrip_test.go) is gated to runtime.GOOS == "darwin" because STHeiti Light is the only universally-available large-CJK TTC fixture on dev hosts. Linux and Windows CI run the synthetic 25k-group format-12 cmap test (font/cmap_test.go::TestParseCmapFormat12LargeGroupCount) but never exercise the full HTML → PDF → ToUnicode CMap → text extraction round trip on a real CJK font.

Goal

Check in a small (~5KB) synthetic CJK font under font/testdata/ so the round-trip test runs unconditionally on every host.

Why this gap matters

A regression that breaks the embedded subset's /ToUnicode CMap construction (silently producing a usable-looking PDF whose text extraction returns garbage) currently slips through Linux and Windows CI. The macOS test catches it, but darwin runners aren't required for merge. Closing the gap means the regression-pin runs on every push everywhere.

Constraints on the fixture

The fixture needs to satisfy several constraints simultaneously, which is why this isn't trivial:

  1. CJK glyph coverage: must map at least 5-10 CJK codepoints (e.g., U+4E2D, U+534E, U+4EBA — characters from "中华人") to real glyph IDs. Without CJK glyphs the fallback chain doesn't trigger and the test isn't exercising what it claims to test.
  2. Small enough to check in: ideally ≤5 KB. Any real CJK font (msyh.ttc 14MB, NotoSansSC-Regular.otf 10MB, even Source Han Sans subsets at 100KB) is too large.
  3. Apache 2.0-compatible license: any checked-in font binary inherits redistribution terms. Most CJK fonts are SIL Open Font License (compatible) or proprietary (not compatible). A custom-built font dodges this.
  4. Triggers the recovery path: the original bug was sfnt's maxCmapSegments=20000 rejection. The fixture doesn't strictly need to exceed that limit (the synthetic-cmap unit test already covers it), but it should carry enough cmap entries to make the round trip realistic for CJK text.

Proposed approach: build the fixture programmatically

A custom-built minimal TrueType font with:

  • 10 CJK codepoints, each mapping to a unique glyph
  • Each glyph is a simple square box outline (M 0 0 L 100 0 L 100 100 L 0 100 Z) — no aesthetic concern; the test asserts on text extraction, not rendering quality
  • The minimum tables required by Folio's parser: head, hhea, maxp, cmap (format 4 with 10 entries — fits comfortably under any sfnt limit), name, hmtx, glyf, loca, OS/2, post
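As a rough illustration of the box outline above, the following sketch encodes a single closed contour as a TrueType simple-glyph record, which is what each entry in glyf would hold. The names point and encodeSimpleGlyph are hypothetical and not part of Folio; the sketch assumes all points are on-curve and stores every coordinate delta as a 16-bit value rather than using the short-vector flags.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// point is one on-curve outline point in font units (hypothetical helper type).
type point struct{ x, y int16 }

// encodeSimpleGlyph serializes one closed contour of on-curve points as a
// TrueType simple glyph, padded with zeros to a 4-byte boundary.
func encodeSimpleGlyph(pts []point) []byte {
	if len(pts) == 0 {
		return nil // an empty glyph is a zero-length glyf entry
	}
	xMin, yMin, xMax, yMax := pts[0].x, pts[0].y, pts[0].x, pts[0].y
	for _, p := range pts[1:] {
		if p.x < xMin {
			xMin = p.x
		}
		if p.x > xMax {
			xMax = p.x
		}
		if p.y < yMin {
			yMin = p.y
		}
		if p.y > yMax {
			yMax = p.y
		}
	}

	be := binary.BigEndian
	var b []byte
	// Glyph header: numberOfContours followed by the bounding box.
	for _, v := range []int16{1, xMin, yMin, xMax, yMax} {
		b = be.AppendUint16(b, uint16(v))
	}
	b = be.AppendUint16(b, uint16(len(pts)-1)) // endPtsOfContours[0]
	b = be.AppendUint16(b, 0)                  // instructionLength
	for range pts {
		b = append(b, 0x01) // flag: ON_CURVE_POINT, 16-bit deltas
	}
	// Coordinates are stored as deltas from the previous point.
	var prev int16
	for _, p := range pts {
		b = be.AppendUint16(b, uint16(p.x-prev))
		prev = p.x
	}
	prev = 0
	for _, p := range pts {
		b = be.AppendUint16(b, uint16(p.y-prev))
		prev = p.y
	}
	for len(b)%4 != 0 {
		b = append(b, 0)
	}
	return b
}

func main() {
	square := []point{{0, 0}, {100, 0}, {100, 100}, {0, 100}}
	g := encodeSimpleGlyph(square)
	fmt.Printf("%d bytes: % x\n", len(g), g)
}
```

Since all ten glyphs share the same outline, a builder could reuse one encoded record and simply repeat it in glyf, with loca pointing at each copy.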

This is ~150 LOC of binary construction (the inverse of what Folio's font/ parsers do). Total checked-in size: 2-5 KB. License: ours, since we built it.
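To give a sense of what the binary construction involves, here is a minimal sketch of the cmap piece: a format-4 subtable with one segment per codepoint plus the mandatory 0xFFFF terminator. buildCmapFormat4 and lookupFormat4 are hypothetical names, not Folio functions, and the ten characters in main are only illustrative (the issue names just 中, 华, and 人). The lookup assumes every idRangeOffset is zero, which holds for tables produced by this builder but not for format-4 subtables in general.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"math/bits"
	"sort"
)

// buildCmapFormat4 emits a format-4 cmap subtable with one single-codepoint
// segment per mapping entry, followed by the required 0xFFFF end segment.
func buildCmapFormat4(mapping map[rune]uint16) ([]byte, error) {
	codes := make([]int, 0, len(mapping))
	for r := range mapping {
		if r < 0 || r >= 0xFFFF {
			return nil, errors.New("format 4 only covers U+0000..U+FFFE")
		}
		codes = append(codes, int(r))
	}
	sort.Ints(codes) // segments must be ordered by endCode

	segCount := len(codes) + 1
	entrySelector := bits.Len(uint(segCount)) - 1 // floor(log2(segCount))
	searchRange := 2 << entrySelector
	length := 16 + 8*segCount

	be := binary.BigEndian
	b := make([]byte, 0, length)
	header := []int{4, length, 0, 2 * segCount, searchRange, entrySelector, 2*segCount - searchRange}
	for _, v := range header {
		b = be.AppendUint16(b, uint16(v))
	}
	for _, c := range codes { // endCode[]
		b = be.AppendUint16(b, uint16(c))
	}
	b = be.AppendUint16(b, 0xFFFF)
	b = be.AppendUint16(b, 0) // reservedPad
	for _, c := range codes { // startCode[]
		b = be.AppendUint16(b, uint16(c))
	}
	b = be.AppendUint16(b, 0xFFFF)
	for _, c := range codes { // idDelta[]: glyph = (code + delta) mod 65536
		b = be.AppendUint16(b, mapping[rune(c)]-uint16(c))
	}
	b = be.AppendUint16(b, 1)                  // 0xFFFF + 1 wraps to glyph 0
	b = append(b, make([]byte, 2*segCount)...) // idRangeOffset[]: all zero
	return b, nil
}

// lookupFormat4 resolves a rune against a subtable built above.
// It assumes idRangeOffset is zero for every segment.
func lookupFormat4(sub []byte, r rune) uint16 {
	if r < 0 || r > 0xFFFF {
		return 0
	}
	be := binary.BigEndian
	segCount := int(be.Uint16(sub[6:])) / 2
	endOff := 14
	startOff := endOff + 2*segCount + 2
	deltaOff := startOff + 2*segCount
	c := uint16(r)
	for i := 0; i < segCount; i++ {
		if c > be.Uint16(sub[endOff+2*i:]) {
			continue
		}
		if c < be.Uint16(sub[startOff+2*i:]) {
			return 0 // falls in a gap between segments
		}
		return c + be.Uint16(sub[deltaOff+2*i:])
	}
	return 0
}

func main() {
	mapping := map[rune]uint16{}
	gid := uint16(1) // glyph 0 is reserved for .notdef
	for _, r := range "中华人民共和国你好世" {
		mapping[r] = gid
		gid++
	}
	sub, err := buildCmapFormat4(mapping)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes; U+4E2D -> glyph %d\n", len(sub), lookupFormat4(sub, '中'))
}
```

One segment per codepoint is wasteful for contiguous ranges, but with around ten scattered codepoints it keeps the builder trivial and the subtable only a little over 100 bytes.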

The build script could live under font/testdata/build_cjk_fixture.go (with //go:build ignore) and produce font/testdata/synthetic_cjk.ttf. CI doesn't run the script; the .ttf is committed.
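The final step of such a build script would be assembling the individual tables into one sfnt file. The sketch below shows one way that might look, assuming the tables have already been serialized and that the head table was built with its checkSumAdjustment field (bytes 8 to 11) set to zero. assembleSFNT and checksum are hypothetical names, and the table contents in main are placeholders rather than valid font data.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
	"sort"
)

var be = binary.BigEndian

// checksum sums the data as big-endian uint32 words, zero-padding the tail.
func checksum(b []byte) uint32 {
	var sum uint32
	for i := 0; i < len(b); i += 4 {
		var word [4]byte
		copy(word[:], b[i:])
		sum += be.Uint32(word[:])
	}
	return sum
}

// assembleSFNT writes the offset table, the tag-sorted table records, and the
// 4-byte-aligned table data, then patches head.checkSumAdjustment.
// Every tag is assumed to be exactly four bytes long.
func assembleSFNT(tables map[string][]byte) []byte {
	tags := make([]string, 0, len(tables))
	for tag := range tables {
		tags = append(tags, tag)
	}
	sort.Strings(tags)

	n := len(tags)
	entrySelector := bits.Len(uint(n)) - 1 // floor(log2(n)), assumes n >= 1
	searchRange := 16 << entrySelector

	var out []byte
	out = be.AppendUint32(out, 0x00010000) // sfntVersion for TrueType outlines
	out = be.AppendUint16(out, uint16(n))
	out = be.AppendUint16(out, uint16(searchRange))
	out = be.AppendUint16(out, uint16(entrySelector))
	out = be.AppendUint16(out, uint16(16*n-searchRange))

	offset := 12 + 16*n
	headOffset := -1
	for _, tag := range tags {
		data := tables[tag]
		out = append(out, tag...)
		out = be.AppendUint32(out, checksum(data))
		out = be.AppendUint32(out, uint32(offset))
		out = be.AppendUint32(out, uint32(len(data)))
		if tag == "head" {
			headOffset = offset
		}
		offset += (len(data) + 3) &^ 3
	}
	for _, tag := range tags {
		out = append(out, tables[tag]...)
		for len(out)%4 != 0 {
			out = append(out, 0)
		}
	}
	if headOffset >= 0 && len(tables["head"]) >= 12 {
		// Chosen so that the whole-file checksum becomes 0xB1B0AFBA.
		be.PutUint32(out[headOffset+8:], 0xB1B0AFBA-checksum(out))
	}
	return out
}

func main() {
	// Placeholder bytes only; a real builder would pass serialized tables.
	tables := map[string][]byte{
		"head": make([]byte, 54),
		"cmap": {1, 2, 3, 4, 5},
	}
	font := assembleSFNT(tables)
	fmt.Printf("%d bytes, file checksum %#x\n", len(font), checksum(font))
}
```

Sorting the tags and deriving every offset from the table lengths keeps the output deterministic, so regenerating the committed .ttf from the script should produce identical bytes.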

Alternative: use Source Han Sans Region (subset)

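Source Han Sans ships region-specific fonts at ~10MB each, but a pyftsubset to 50 codepoints could produce a ~30KB fixture. The subset would carry the font's own license (SIL OFL 1.1 for current Source Han Sans releases, which constraint 3 treats as compatible) rather than Apache 2.0. Tradeoff: requires a one-time external tool, and the resulting bytes are tied to a specific Source Han Sans version. The custom-built option is simpler long-term.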

Scope estimate

~150 LOC of fixture builder + ~10 LOC of test integration + a ~5KB binary. Total: ~1 day of focused work.

Acceptance criteria

  • font/testdata/synthetic_cjk.ttf (or .ttc) exists and is ≤10KB
  • License is Apache 2.0 or equivalent (custom-built, NOTICE updated if needed)
  • integration/cjk_drop_sfnt_roundtrip_test.go no longer gates on runtime.GOOS, runs on Linux and Windows CI
  • The fixture exercises the full PDF round trip: render Chinese text through html.ConvertFull → document.Save → reader.Parse → page.ExtractText, then assert byte-perfect equality with the input
  • CHANGELOG entry noting cross-platform CI coverage for the recovery path
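For the byte-perfect equality check, a failure message that pinpoints where the extracted text diverges tends to be more useful than a bare mismatch, since a broken ToUnicode CMap often corrupts only some characters. A minimal sketch of such a helper follows; firstMismatch is a hypothetical name and the strings in main are made-up stand-ins for the test's input and extracted text.

```go
package main

import "fmt"

// firstMismatch returns the byte offset of the first difference between
// want and got, or -1 when the two strings are byte-identical.
func firstMismatch(want, got string) int {
	n := len(want)
	if len(got) < n {
		n = len(got)
	}
	for i := 0; i < n; i++ {
		if want[i] != got[i] {
			return i
		}
	}
	if len(want) != len(got) {
		return n // one string is a strict prefix of the other
	}
	return -1
}

func main() {
	want := "中华人"
	got := "中人" // stand-in for text extraction that dropped a character
	if i := firstMismatch(want, got); i >= 0 {
		fmt.Printf("extracted text diverges at byte %d: want %q, got %q\n", i, want, got)
	}
}
```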

Related

This was deferred during PR #262's review; flagged again in the post-merge audit. Not blocking any user-reported issue today (the production code paths are correct on every host I've manually tested), but it's the only remaining no-CI-coverage branch in the recovery pipeline.

Metadata

Labels: enhancement
