Performance issue #10

@eroux

Description

Note: written by AI, reviewed by hand

PDF extraction is very slow on Dolpopa-'Dzam-Thang-7-p1-644-with-folios.pdf, distributed at https://jonangdharma.org/download/#DD

The PDF was produced by Word, which subsets fonts per page range, so
the same logical font (Microsoft Himalaya) appears in the document as
58 distinct Type0 subset xrefs (AAAA01+Microsoft-Himalaya,
BBBB02+…, …). For each one, collect_font_merges does:

  1. parse the existing /ToUnicode stream (regex over beginbfchar/beginbfrange blocks),
  2. load microsofthimalaya.json (1478 entries),
  3. merge: any DB GID not in the existing CMap counts as changed,
  4. write back: apply_font_merges_to_doc calls
    doc.update_stream(tu_xref, _build_tounicode_type0(merged)),
    re-encoding all 1478 entries into a fresh CMap stream and asking
    PyMuPDF to splice it into the document.
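
For reference, step 1 can be sketched roughly as below. This is a minimal illustration, not the project's _parse_tounicode: the name parse_tounicode and its return shape ({gid: unicode string}) are assumptions, and it expects even-length hex destinations encoded as UTF-16BE.

```python
import re

# Hypothetical stand-in for the project's parser; names are illustrative only.
_BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
_BFRANGE_BLOCK = re.compile(r"beginbfrange(.*?)endbfrange", re.S)
_BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")
_BFRANGE_ENTRY = re.compile(
    r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*(?:<([0-9A-Fa-f]+)>|\[(.*?)\])", re.S
)
_HEX = re.compile(r"<([0-9A-Fa-f]+)>")


def _decode(hex_str):
    """Decode a hex destination string as UTF-16BE text (assumes even length)."""
    return bytes.fromhex(hex_str).decode("utf-16-be", errors="replace")


def parse_tounicode(cmap_text):
    """Return {gid: unicode_string} extracted from the text of a ToUnicode CMap."""
    mapping = {}
    for block in _BFCHAR_BLOCK.findall(cmap_text):
        for src, dst in _BFCHAR_ENTRY.findall(block):
            mapping[int(src, 16)] = _decode(dst)
    for block in _BFRANGE_BLOCK.findall(cmap_text):
        for lo, hi, dst, arr in _BFRANGE_ENTRY.findall(block):
            lo, hi = int(lo, 16), int(hi, 16)
            if dst:
                # Single start value: the destination increments across the range.
                base, width = int(dst, 16), len(dst)
                for i in range(hi - lo + 1):
                    mapping[lo + i] = _decode(f"{base + i:0{width}X}")
            else:
                # Array form: one destination per source code, in order.
                for i, item in enumerate(_HEX.findall(arr)[: hi - lo + 1]):
                    mapping[lo + i] = _decode(item)
    return mapping
```

Even a parser this small runs once per subset, so its cost grows with the size of the stream it is handed.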

For Dolpopa-7 the existing ToUnicode of each subset already has correct
mappings for the ~10–80 GIDs that subset uses, but step 3 still finds
~1400 "missing" GIDs (the ones from the rest of the font that this
subset never references) and step 4 unconditionally rewrites the
stream. That's 58 stream rewrites, each ~50–60 KB, on a 100-page PDF.
Net effect: a no-op semantically, but ~4 s of work per subset and
~230 s for the document.

Confirmed by cProfile: virtually all of the time is split between
_build_tounicode_type0 / update_stream (step 4) and _parse_tounicode
(step 1, parses the same kind of stream we just rewrote on the
previous page).
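
A measurement like the one above can be reproduced with a small wrapper around the standard-library profiler. The helper name profile_call is an assumption made for illustration; the function to profile and its arguments are whatever entry point the tool exposes.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(top)
    return result, buf.getvalue()
```

Sorting by cumulative time puts the callers that dominate the run at the top of the report, which should be enough to see the split between the rewrite and the parse steps.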

Suggested fixes

In rough order of impact:

  1. Don't count "changed" if the upgrade only touches GIDs the font
    never uses.
    Walk the content streams once per page (PyMuPDF gives
    page-level GID usage cheaply via page.get_text("rawdict") or
    doc.get_page_text(..., flags=…)) to compute the set of
    referenced GIDs per font xref, then in _merge only treat a
    GID as changed if it's referenced and the existing entry
    differs from db_map. For Dolpopa-7 this would drop "changed"
    from ~1400 to ~0 per subset, skipping update_stream entirely.

  2. Cache the loaded lookup JSON across patch_doc calls. The
    current cache lives inside collect_font_merges, so on a 25-PDF
    batch the same microsofthimalaya.json is parsed 25 times. A
    module-level (or function-arg) cache keyed by (path, mtime)
    would erase that cost. Same applies to the _build_db_index
    call.
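
A module-level cache for item 2 could look roughly like this. It is a sketch only: load_lookup and _LOOKUP_CACHE are hypothetical names, and the cache is keyed by absolute path with the mtime stored alongside, so a file edited on disk is reparsed.

```python
import json
import os

# Hypothetical module-level cache: {absolute_path: (mtime_ns, parsed_data)}.
_LOOKUP_CACHE = {}


def load_lookup(path):
    """Parse a lookup JSON once per file version; reparse only if mtime changes."""
    key = os.path.abspath(path)
    mtime = os.stat(key).st_mtime_ns
    cached = _LOOKUP_CACHE.get(key)
    if cached is not None and cached[0] == mtime:
        return cached[1]
    with open(key, encoding="utf-8") as fh:
        data = json.load(fh)
    _LOOKUP_CACHE[key] = (mtime, data)
    return data
```

Callers get the same object back on every hit, so they should treat it as read-only. The same pattern would fit whatever _build_db_index returns, if that result depends only on the file contents.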

Item (1) on its own probably eliminates >90 % of the cost in the
Dolpopa-7 case and lets the tool be applied unconditionally to a
corpus.
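
The merge rule from item (1) might be sketched as follows. This is not the existing _merge: the function name, its signature, and the way the referenced set is obtained are all assumptions, and the maps are taken to be plain {gid: unicode string} dicts.

```python
def merge_referenced(existing, db_map, referenced):
    """Merge db_map into existing, flagging only GIDs the subset actually uses.

    existing, db_map: {gid: unicode_string}.
    referenced: set of GIDs seen in the content streams for this font xref.
    Returns (merged, changed_gids); an empty changed set means no rewrite is needed.
    """
    merged = dict(existing)
    changed = set()
    for gid in referenced:
        target = db_map.get(gid)
        if target is not None and existing.get(gid) != target:
            merged[gid] = target
            changed.add(gid)
    return merged, changed
```

With this rule a caller can skip update_stream whenever changed is empty, which is the expected outcome for subsets whose existing mappings are already correct.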
