Note: written by AI, reviewed by hand
PDF extraction very slow on Dolpopa-'Dzam-Thang-7-p1-644-with-folios.pdf distributed on https://jonangdharma.org/download/#DD
The PDF is produced by Word and font subsetting is per-page-range, so
the same logical font (Microsoft Himalaya) appears in the document as
89 distinct Type0 subset xrefs (AAAA01+Microsoft-Himalaya,
BBBB02+…, …). For each one, collect_font_merges performs steps 1–3 and
apply_font_merges_to_doc performs step 4:
1. parse the existing /ToUnicode stream (regex over
   beginbfchar/beginbfrange blocks),
2. load microsofthimalaya.json (1478 entries),
3. merge: any DB GID not in the existing CMap counts as changed,
4. call doc.update_stream(tu_xref, _build_tounicode_type0(merged)),
   re-encoding all 1478 entries into a fresh CMap stream and asking
   PyMuPDF to splice it into the document.
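For readers unfamiliar with the stream format, step 1 can be pictured roughly as follows. This is a minimal sketch, not the tool's actual _parse_tounicode: the function name parse_tounicode and the regexes are assumptions, and it handles only the simple bfrange form (a single BMP start value, no array destinations).

```python
import re

# Hypothetical stand-in for _parse_tounicode; the real implementation may differ.
_BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
_BFRANGE_BLOCK = re.compile(r"beginbfrange(.*?)endbfrange", re.S)
_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")
_TRIPLE = re.compile(
    r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>"
)


def parse_tounicode(cmap_text: str) -> dict[int, str]:
    """Return {gid: unicode_string} for the bfchar/bfrange entries."""
    mapping: dict[int, str] = {}
    # bfchar: one source code mapped to one UTF-16BE encoded string.
    for block in _BFCHAR_BLOCK.findall(cmap_text):
        for src, dst in _PAIR.findall(block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    # bfrange: consecutive source codes mapped to consecutive code points.
    # Simplification: array destinations ([<..> <..>]) are not handled.
    for block in _BFRANGE_BLOCK.findall(cmap_text):
        for lo, hi, dst in _TRIPLE.findall(block):
            start = int(dst, 16)
            for offset, gid in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[gid] = chr(start + offset)
    return mapping
```

Even in this reduced form the cost scales with the number of entries in the stream, which is why re-parsing a freshly rewritten 1478-entry CMap is noticeably slower than parsing the original 10–80-entry one.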
For Dolpopa-7 the existing ToUnicode of each subset already has correct
mappings for the ~10–80 GIDs that subset uses, but step 3 still finds
~1400 "missing" GIDs (the ones from the rest of the font that this
subset never references) and step 4 unconditionally rewrites the
stream. That's 58 stream rewrites, each ~50–60 KB, on a 100-page PDF.
Net effect: a no-op semantically, but ~4 s of work per subset and
~230 s for the document.
Confirmed by cProfile: virtually all of the time is split between
_build_tounicode_type0 / update_stream (step 4) and _parse_tounicode
(step 1, parses the same kind of stream we just rewrote on the
previous page).
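The measurement can be reproduced with something like the helper below. It is only a sketch: profile_call is a hypothetical wrapper, and which entry point to pass in (for example patch_doc) depends on how the tool is invoked.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, text report of the top entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so callers such as update_stream rank near the top.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```

Sorting by cumulative time is a deliberate choice here: it attributes the cost of the CMap rebuild to the function that triggers it rather than to low-level string operations.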
Suggested fixes
In rough order of impact:
1. Don't count "changed" if the upgrade only touches GIDs the font
   never uses. Walk the content streams once per page to compute the
   set of referenced GIDs per font xref (PyMuPDF's
   page.get_texttrace() reports a glyph ID for each character, whereas
   page.get_text("rawdict") and doc.get_page_text(..., flags=…)
   return decoded characters only), then in _merge only treat a
   GID as changed if it's referenced and the existing entry
   differs from db_map. For Dolpopa-7 this would drop "changed"
   from ~1400 to ~0 per subset, skipping update_stream entirely.
2. Cache the loaded lookup JSON across patch_doc calls. The
   current cache lives inside collect_font_merges, so on a 25-PDF
   batch the same microsofthimalaya.json is parsed 25 times. A
   module-level (or function-arg) cache keyed by (path, mtime)
   would erase that cost. Same applies to the _build_db_index
   call.
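The first fix could look roughly like this. It is a sketch under assumptions: merge_referenced is a hypothetical replacement for the comparison inside _merge, and used_gids is assumed to have been collected separately for the font subset in question.

```python
def merge_referenced(existing, db_map, used_gids):
    """Merge db_map into existing, but only for GIDs the subset actually uses.

    existing  -- {gid: str} parsed from the subset's current /ToUnicode
    db_map    -- {gid: str} loaded from the lookup JSON
    used_gids -- iterable of GIDs referenced by the page content (assumed input)
    Returns (merged, changed) where changed counts the entries that differ.
    """
    merged = dict(existing)
    changed = 0
    for gid in used_gids:
        new_value = db_map.get(gid)
        # Unreferenced GIDs are never visited, so they cannot count as changed.
        if new_value is not None and existing.get(gid) != new_value:
            merged[gid] = new_value
            changed += 1
    return merged, changed
```

The caller would then skip doc.update_stream whenever changed is 0, which is the expected outcome for every Dolpopa-7 subset whose existing mappings are already correct.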
Item (1) on its own probably eliminates >90 % of the cost in the
Dolpopa-7 case and lets the tool be applied unconditionally to a
corpus.