Performance issue

**Note:** written by AI, reviewed by hand

PDF extraction is very slow on `Dolpopa-'Dzam-Thang-7-p1-644-with-folios.pdf`, distributed at https://jonangdharma.org/download/#DD

The PDF was produced by Word, which subsets fonts per page range, so
the same logical font (Microsoft Himalaya) appears in the document as
**89 distinct Type0 subset xrefs** (`AAAA01+Microsoft-Himalaya`,
`BBBB02+…`, …). For each one, the patching code (`collect_font_merges`,
then `apply_font_merges_to_doc`) does:

1. parse the existing `/ToUnicode` stream (regex over `beginbfchar`/`beginbfrange` blocks),
2. load `microsofthimalaya.json` (1478 entries),
3. merge: any DB GID not in the existing CMap counts as `changed`,
4. `apply_font_merges_to_doc` calls `doc.update_stream(tu_xref, _build_tounicode_type0(merged))`,
   re-encoding all 1478 entries into a fresh CMap stream and asking
   PyMuPDF to splice it into the document.
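As a rough illustration of steps 1 and 3, the sketch below parses a ToUnicode stream with regexes and merges a lookup table into it. It is not the project's code: `parse_tounicode_sketch` and `merge_sketch` are hypothetical names, the parser ignores the array form of `bfrange`, and the real `_merge` may count `changed` differently.

```python
import re

# Blocks and entries of a ToUnicode CMap, matched on raw bytes.
_BFCHAR_BLOCK = re.compile(rb"beginbfchar(.*?)endbfchar", re.S)
_BFRANGE_BLOCK = re.compile(rb"beginbfrange(.*?)endbfrange", re.S)
_PAIR = re.compile(rb"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")
_TRIPLE = re.compile(
    rb"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>"
)


def _hex_to_text(hex_digits: bytes) -> str:
    """Decode a CMap destination such as b'0F40' (UTF-16BE) to a str."""
    return bytes.fromhex(hex_digits.decode("ascii")).decode("utf-16-be")


def parse_tounicode_sketch(stream: bytes) -> dict[int, str]:
    """Hypothetical stand-in for step 1: GID -> text from bfchar/bfrange."""
    mapping: dict[int, str] = {}
    for block in _BFCHAR_BLOCK.findall(stream):
        for src, dst in _PAIR.findall(block):
            mapping[int(src, 16)] = _hex_to_text(dst)
    for block in _BFRANGE_BLOCK.findall(stream):
        # Simplification: only the <lo> <hi> <start> form is handled.
        for lo, hi, dst in _TRIPLE.findall(block):
            start = int(dst, 16)
            for offset, gid in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[gid] = chr(start + offset)
    return mapping


def merge_sketch(existing: dict[int, str], db_map: dict[int, str]):
    """Hypothetical stand-in for step 3: every missing or differing DB
    entry is counted as changed, whether or not the subset uses it."""
    merged = dict(existing)
    changed = 0
    for gid, text in db_map.items():
        if merged.get(gid) != text:
            merged[gid] = text
            changed += 1
    return merged, changed
```

Under this counting rule, a subset whose CMap covers only the handful of GIDs it uses will report nearly every entry of the lookup table as `changed`.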

For Dolpopa-7 the existing ToUnicode of each subset already has correct
mappings for the ~10–80 GIDs that subset uses, but step 3 still finds
~1400 "missing" GIDs (the ones from the rest of the font that this
subset never references) and step 4 unconditionally rewrites the
stream. That's 58 stream rewrites, each ~50–60 KB, on a 100-page PDF.
Net effect: a no-op semantically, but ~4 s of work per subset and
~230 s for the document.

Confirmed by `cProfile`: virtually all of the time is split between
`_build_tounicode_type0` / `update_stream` (step 4) and `_parse_tounicode`
(step 1, parses the same kind of stream we just rewrote on the
previous page).
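For anyone who wants to reproduce the measurement, a profile like this can be collected with the standard library alone. The helper below is a minimal sketch; `profile_call` is a hypothetical name, and the function passed in would be whatever entry point drives the patching (its exact signature is not shown in this issue).

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func under cProfile and return (result, text report).

    The report lists the `top` entries sorted by cumulative time, which
    is where functions like the CMap builder or parser would show up.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```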

## Suggested fixes

In rough order of impact:

1. **Don't count "changed" if the upgrade only touches GIDs the font
   never uses.** Walk the content streams once per page to compute the
   set of *referenced* GIDs per font xref, then in `_merge` only treat
   a GID as `changed` if it's referenced **and** the existing entry
   differs from `db_map`. PyMuPDF's `page.get_texttrace()` reports a
   glyph ID for each character and looks like the cheapest source;
   `page.get_text("rawdict")` and `doc.get_page_text(..., flags=…)`
   return Unicode characters rather than GIDs, so they are probably
   not sufficient on their own. For Dolpopa-7 this would drop
   "changed" from ~1400 to ~0 per subset, skipping `update_stream`
   entirely.

2. **Cache the loaded lookup JSON across `patch_doc` calls.** The
   current cache lives inside `collect_font_merges`, so on a 25-PDF
   batch the same `microsofthimalaya.json` is parsed 25 times. A
   module-level (or function-arg) cache keyed by `(path, mtime)`
   would erase that cost. Same applies to the `_build_db_index`
   call.
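A minimal sketch of fix 1, assuming the set of referenced GIDs for a font xref has already been collected elsewhere. `merge_counting_referenced` is a hypothetical name, not the existing `_merge`; the merged map still receives every DB entry, but only referenced GIDs count toward `changed`.

```python
def merge_counting_referenced(
    existing: dict[int, str],
    db_map: dict[int, str],
    referenced: set[int],
):
    """Merge db_map into existing, counting only referenced GIDs.

    Returns (merged, changed). A caller would skip the stream rewrite
    when changed == 0, because no glyph that the subset actually draws
    gets a different Unicode mapping.
    """
    merged = dict(existing)
    changed = 0
    for gid, text in db_map.items():
        if merged.get(gid) == text:
            continue  # already correct, nothing to do
        merged[gid] = text
        if gid in referenced:
            changed += 1  # a glyph that is really used would change
    return merged, changed
```

The design choice here is to keep the merged map complete, so that when a rewrite is needed it is still a full upgrade; the alternative of merging only referenced GIDs would produce smaller streams but is a larger behavior change.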

Item (1) on its own probably eliminates >90 % of the cost in the
Dolpopa-7 case and lets the tool be applied unconditionally to a
corpus.


