MAINT: Converge on one shared Font class for text extraction and appearance streams #3583

PJBrs · 2025-12-23T15:43:55Z

EDITED ON 2 JANUARY

This PR introduces one dedicated Font class for all text extraction code and appearance streams. This should be a basis for adding new font resources to appearance streams, in relation to bug #3514.

Changes:

Adds new font class to replace the one in the text extraction code
Uses resulting font class in the layout mode text extraction code page.py
Uses resulting font class in the Appearance Stream code
Includes character widths for string encodings, which were previously omitted
Adds default width to character widths
Uses the resulting font class in the non-layout mode text extraction code in place of cmap
Removes obsolete code from cmap.py

This PR got rather big. First, I noticed that the original font class did not parse widths for some string-encoded fonts; then I noticed that it didn't set a default width either. Fixing both, however, resulted in rather low coverage. To resolve the low coverage, I also ported the non-layout mode text extraction code to the new Font class. This was a lot more work, but now coverage is very good, without the need to add more tests! The new font class works fine with the current appearance stream code and both the original and the layout mode text extraction code.

The current version - January 2 - is ready for review. I'm specifically looking for feedback on:

Is this PR too big? If so, I can logically limit it to, for instance, the first seven patches. This excludes replacing the cmaps in text extraction, and therefore results in lower test coverage (for the time being).

Small note for reviewing - the diff stat for this PR is rather big, but the relevant changes are mostly limited to _font.py and the _get_actual_text_widths method in pypdf/_text_extraction/_text_extractor.py (012826d). The rest is only restructuring, mostly renaming and re-typing variables.

Some further advantages of this PR:

This PR adds functionality to the Font class for the layout mode text extraction code: parsing character widths for Type1 and TrueType fonts with string encodings
This patch removes code duplication in the sense that build_font_width_map did more or less the same as the original Font class
Increases robustness because all text extraction code (original and layout mode) and the appearance stream code use the same basis for font information
Net removal of more than 150 lines of code (see below)
Text extraction is faster with the new code (see further below)
Improves readability, because the original cmap tuple information can now be accessed as font.encoding, font.character_map, font.space_width and font_resource, instead of cmap[0], cmap[1], cmap[2] and cmap[3].
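
To illustrate the readability point, here is a minimal sketch contrasting positional tuple access with named attributes. `FontInfo` and the sample values are hypothetical stand-ins, not pypdf's actual `Font` class; the field order simply follows the list item above.

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical stand-in for the old per-font tuple, in the order described
# above: encoding, character map, space width, font resource.
old_cmap = ("charmap", {"\x01": "A"}, 250.0, {"/Subtype": "/Type1"})


@dataclass
class FontInfo:
    """Illustrative container only; not pypdf's actual Font class."""

    encoding: Any
    character_map: dict
    space_width: float
    font_resource: dict


font = FontInfo(*old_cmap)

# Positional access hides the meaning of each slot ...
space_width_old_style = old_cmap[2]
# ... while attribute access documents itself.
space_width_new_style = font.space_width
```

Both variables hold the same value; the difference is only in how readable the call site is.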

$ git diff --stat origin/main
 pypdf/_cmap.py                                             | 198 --------------------------------------------------------------------------------------------------------
 pypdf/_font.py                                             | 225 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------------
 pypdf/_page.py                                             |  52 ++++++++++++----------------
 pypdf/_text_extraction/__init__.py                         |  30 +++++++---------
 pypdf/_text_extraction/_layout_mode/__init__.py            |   2 +-
 pypdf/_text_extraction/_layout_mode/_fixed_width_page.py   |   2 +-
 pypdf/_text_extraction/_layout_mode/_font.py               |  65 -----------------------------------
 pypdf/_text_extraction/_layout_mode/_text_state_manager.py |   4 +--
 pypdf/_text_extraction/_layout_mode/_text_state_params.py  |  10 +++---
 pypdf/_text_extraction/_text_extractor.py                  | 144 ++++++++++++++++++++++++++++++++++------------------------------------------
 pypdf/generic/_appearance_stream.py                        |  63 +++++++++++++++++-----------------
 resources/010-pdflatex-forms.txt                           |   2 +-
 resources/multicolumn-lorem-ipsum.txt                      |  80 +++++++++++++++++++++---------------------
 tests/test_cmap.py                                         |   8 +++--
 tests/test_text_extraction.py                              |  33 +++++++++---------
 15 files changed, 382 insertions(+), 536 deletions(-)

time (for i in $(seq 10); do pytest3 tests/test_text_extraction.py ; done)
Old:
real    0m56.213s
user    0m45.754s
sys     0m10.463s

New:
real    0m52.308s
user    0m42.508s
sys     0m9.802s

codecov · 2025-12-23T23:21:42Z

Codecov Report

❌ Patch coverage is 99.35484% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 97.31%. Comparing base (97d47a0) to head (7101294).
⚠️ Report is 3 commits behind head on main.

Files with missing lines	Patch %	Lines
pypdf/generic/_appearance_stream.py	93.33%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3583      +/-   ##
==========================================
+ Coverage   97.30%   97.31%   +0.01%     
==========================================
  Files          56       55       -1     
  Lines        9838     9770      -68     
  Branches     1790     1780      -10     
==========================================
- Hits         9573     9508      -65     
+ Misses        157      155       -2     
+ Partials      108      107       -1


This patch ports the AppearanceStream to the new Font class. It hardly differs from the original code, except that it makes sure a default width is set in the character widths for the 14 Adobe core fonts. This is not strictly necessary at this point, but it will be once the Font class sets the default width itself and other code begins to depend on that.

This patch ensures that character widths are collected correctly also for fonts that have encoding defined as a string.

Previously, character widths were not computed for Type1 and TrueType fonts when encoding was a string. For one test, that is, test_text_extraction_layout_mode in tests/test_workflows.py, this meant that all character widths were treated as one space width. Now, the real widths are used, which changes the output of the test significantly, but in keeping with the intended output. This patch implements the new output. To be sure, I counted the number of newlines and words in both versions, and they are exactly the same, so no spaces were accidentally omitted between words in the new version, nor were they added, since the new version has fewer spaces than the old one.
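
The sanity check described above (same number of newlines and words in both outputs) could be scripted roughly as follows. This is only a sketch; the function name and the sample strings are made up for illustration and are not taken from the test suite.

```python
def same_lines_and_words(old_text: str, new_text: str) -> bool:
    """Return True if both texts have equal newline and word counts."""
    same_newlines = old_text.count("\n") == new_text.count("\n")
    # str.split() without arguments splits on any run of whitespace, so
    # differing amounts of padding between words do not affect the count.
    same_words = len(old_text.split()) == len(new_text.split())
    return same_newlines and same_words


# Made-up sample: the old output has wider gaps, the new one is tighter.
old_output = "Lorem    ipsum   dolor\nsit     amet\n"
new_output = "Lorem ipsum dolor\nsit amet\n"
print(same_lines_and_words(old_output, new_output))
```

A check like this only confirms that no word boundaries appeared or disappeared; it says nothing about whether the horizontal layout is right.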

This patch makes sure that the Font class has a way to compute space_width.

The FontDescriptor code deals with fonts by type. After having dealt with Type1, MMType1, Type3 and TrueType fonts, it is not necessary to check if the remaining fonts are CID or composite fonts, they all are.

This patch ports the layout mode text extraction code to the new font class. This introduces one test failure, which itself appears to derive from a misconception about space width in the original Font class.

Previously, a layout mode font was initialized in _page.py as follows:

    fonts[font_name] = _layout_mode.Font(*cmap, font_dict)

*cmap, in this case, was the return value of build_char_map, which consists of:

- Font sub-type;
- Space_width criteria (50% of width);
- Encoding map;
- Character-map; and
- Font-dictionary

Notice that build_char_map does _not_ return the width of a space, but the width of _half_ a space. However, if we look at the arguments to the layout mode Font class, clearly the class expects to be passed the full width of a space. This is also clear from the word_width method in the layout mode Font class, which substitutes a missing width with 2 * space_width. It follows that the layout mode Font class _expected_ to be passed a full space width, but really was only passed the width of half a space.

When porting to the new Font class, this becomes problematic when calculating text width, because the new Font class uses self.character_widths["default"] as a fallback for a missing width, which is approximately (and in many cases exactly) the width of two spaces. This in turn causes problems with text extraction in cases where the width of a space itself is missing ("The Crazy Ones"), and cases where a font with a missing character width is calculated wider than before.

For the first issue, this patch introduces a work-around that also exists in the conventional text extraction code, that is, dealing with missing space width separately. For the second issue - this causes one test to fail: the test_layout_mode_text_state in tests/test_text_extraction.py. This is entirely due to the existence of a unicode private range character in the file.

…t_mode_text_state

For several reasons, the output of the test_layout_mode_text_state test has changed significantly with changing to the new Font class. Here's why:

1. The original layout mode Font class set a space width that was actually half a space wide in reality. In computing word width, a default fallback value was used of "self.space_width * 2", which in reality was just the width of one space.
2. The new Font class uses "self.character_widths["default"]" as a fallback value for calculating word width. This value is calculated as follows:
   - If a missing width is defined in a Font's font descriptor, set that as default width
   - Else if the width of a space is defined in a Font's character widths and it is not zero, set the width of two spaces as default width
   - Else calculate the average of all character widths and set that as default width

For the document in test_layout_mode_text_state, this results in very different default character widths. In the original Font class, it set a space width of 125, and used 250 as a fallback width. With the new Font class, it reads a value of 1000 from missing width in a font descriptor. The document contains one character from a private unicode range, the width of which is not defined. This character appears a number of times throughout the document. As a result, this character's width is calculated much wider with the new code than with the old code. In all other respects, though, the output is the same.

So, the test_layout_mode_text_state's test goal - seeing whether a font change within a q context is addressed correctly - still holds. The expected output of this test is stored as a user attachment on github. Instead of replacing the document, just remove the space characters from the rendered output and check the result. This makes the test pass while keeping its intended purpose.
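
The three-step choice of the default width described in this commit message can be sketched as a small standalone function. This is not the code from _font.py; the function name, its signature and the final 0.0 fallback for an empty width table are assumptions made for illustration.

```python
from typing import Optional


def choose_default_width(
    character_widths: dict,
    missing_width: Optional[float] = None,
) -> float:
    """Pick a fallback width following the three rules described above."""
    # Rule 1: a missing width from the font descriptor wins.
    if missing_width is not None:
        return missing_width
    # Rule 2: otherwise use two space widths, if a non-zero space width exists.
    space = character_widths.get(" ", 0.0)
    if space != 0:
        return 2 * space
    # Rule 3: otherwise average all known character widths.
    if character_widths:
        return sum(character_widths.values()) / len(character_widths)
    # Assumption for this sketch only: nothing is known, so return 0.0.
    return 0.0


# Values loosely modelled on the numbers quoted in the commit message.
print(choose_default_width({" ": 250.0}, missing_width=1000.0))
print(choose_default_width({" ": 250.0}))
print(choose_default_width({"a": 400.0, "b": 600.0}))
```

With a missing width of 1000 in the font descriptor, rule 1 applies and the space width is never consulted, which is why the private-range character in the test document ends up so much wider than before.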

The compute_font_width method is no longer used and therefore obsolete.

This adds a warning to FontDescriptor that replicates a warning originally in the build_font_width_map method in cmap.py. In tests/test_cmap.py, test_function_in_font_widths specifically tests for this warning. Adding this warning to FontDescriptor for the same problem case, the test keeps fulfilling its purpose, but now for the new Font class.

This patch stops collecting character maps, space widths and encodings to the TextExtractor, keeping only the font resource that is necessary in the TextExtractor class. All the other aspects are now covered with the Font class. Incidentally, this should reduce the number of times that font widths are collected during text extraction, which used to be once for every font resource (for collecting space width) and again during text extraction. Now it is only once, when the fonts are collected in page.py.

After moving the text extraction code to the font class, which collects its own font width map, this code is not needed anymore.

This removes three methods that have become obsolete since porting the non-layout text extraction code to the Font class.

The test for iss1533 was based on the old build_char_map code. Now that that code is removed, port the test to the new Font class, which should cover the underlying issue just the same.

This does not cause a circular import anymore after refactoring.

PJBrs · 2026-01-05T19:23:23Z

@stefan6419846 Thanks for your review! I addressed most of your points.

I did make a couple of additional changes. In the Font class, I changed this:

@@ -317,12 +322,5 @@ class Font:
     def text_width(self, text: str = "") -> float:
         """Sum of character widths specified in PDF font for the supplied text."""
         return sum(
-            [self.character_widths.get(char, self.space_width) for char in text], 0.0
+            [self.character_widths.get(char, self.character_widths["default"]) for char in text], 0.0
         )

The underlying logic is as follows - when the width of a character is missing, you should fall back to the default width, not the space width. As a rule, assume that the default width is roughly the width of two spaces.
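
A small standalone sketch of the effect of this change, using a module-level function instead of the Font method and made-up widths (none of these numbers come from a real font):

```python
def text_width(text: str, character_widths: dict, fallback: float) -> float:
    """Sum the widths of all characters, using fallback for unknown ones."""
    return sum((character_widths.get(char, fallback) for char in text), 0.0)


# Hypothetical width table; "\ue000" (a private use character) is not in it.
widths = {"a": 500.0, " ": 250.0, "default": 500.0}
text = "a\ue000a"

# Old behaviour: unknown characters count as one space width.
old_style = text_width(text, widths, fallback=widths[" "])
# New behaviour: unknown characters count as the default width.
new_style = text_width(text, widths, fallback=widths["default"])
print(old_style, new_style)
```

The only difference between the two results is the width assigned to the one character that is missing from the table.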

The new fallback causes problems in the layout mode font class, which, as I mentioned earlier, assumed that what was passed as half a space's width actually represented a full space width. That's why the original (wrong) code worked, while the new code caused a bit of trouble. So I added the following, which also exists in the non-layout mode code:

@@ -117,8 +117,14 @@ class TextStateParams:
 
     def word_tx(self, word: str, TD_offset: float = 0.0) -> float:
         """Horizontal text displacement for any word according this text state"""
+        width: float = 0.0
+        for char in word:
+            if char == " ":
+                width += self.font.space_width
+            else:
+                width += self.font.text_width(char)
         return (
-            (self.font_size * ((self.font.text_width(word) - TD_offset) / 1000.0))
+            (self.font_size * ((width - TD_offset) / 1000.0))
             + self.Tc
             + word.count(" ") * self.Tw
         ) * (self.Tz / 100.0)

This fixes all tests, except one, where I think the changed rendering is actually correct. See this commit's message for explanation: 89cc310. In short, there is one important way in which the new layout mode code deviates from the old code, and that's the fallback width in the Font text_width code. This changes rendering in case we encounter a character with unknown width that is not a space.
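
For readers who want to try the displacement formula from the word_tx diff above in isolation, here is a hedged standalone version. It takes plain numbers and a dictionary instead of the TextStateParams and Font objects, so the parameter names and defaults are assumptions of this sketch, not the actual signature.

```python
def word_tx(
    word: str,
    char_widths: dict,
    space_width: float,
    default_width: float,
    font_size: float,
    Tc: float = 0.0,
    Tw: float = 0.0,
    Tz: float = 100.0,
    td_offset: float = 0.0,
) -> float:
    """Horizontal displacement of a word, mirroring the formula in the diff."""
    width = 0.0
    for char in word:
        if char == " ":
            # Spaces are handled separately, so a missing space width never
            # falls back to the (much wider) default width.
            width += space_width
        else:
            width += char_widths.get(char, default_width)
    return (
        (font_size * ((width - td_offset) / 1000.0))
        + Tc
        + word.count(" ") * Tw
    ) * (Tz / 100.0)


# Made-up numbers: two glyphs of 500 units and one space of 250 units.
print(word_tx("a b", {"a": 500.0, "b": 500.0}, 250.0, 1000.0, font_size=10.0))
```

Here the total glyph width is 1250 units, which at font size 10 gives a displacement of 12.5 text space units when character spacing, word spacing and horizontal scaling are left at their defaults.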

Finally, in the non-layout extraction code, I noticed two things:

I was still passing around space_width while this was already part of the font class.
I did not take into account user-provided space width anymore.

I removed all that passing space width around, and added the following in _page.py, to restore the old behaviour:

@@ -1715,6 +1715,9 @@ class PageObject(DictionaryObject):
                     font_resource_object = cast(DictionaryObject, font_resources_dict[font_resource].get_object())
                     font_resources[font_resource] = font_resource_object
                     fonts[font_resource] = Font.from_font_resource(font_resource_object)
+                    # Override space width, if applicable
+                    if fonts[font_resource].character_widths.get(" ", 0) == 0:
+                        fonts[font_resource].space_width = space_width
                 except (AttributeError, TypeError):
                     pass
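
The override in the diff above can be read as "only use the user-provided space width when the font itself has no usable space width". A minimal sketch of that rule, with a hypothetical `SimpleFont` stand-in rather than the real Font class:

```python
from dataclasses import dataclass, field


@dataclass
class SimpleFont:
    """Hypothetical stand-in for the real Font class."""

    character_widths: dict = field(default_factory=dict)
    space_width: float = 0.0


def apply_user_space_width(font: SimpleFont, user_space_width: float) -> SimpleFont:
    """Override space_width only if the font defines no non-zero space width."""
    if font.character_widths.get(" ", 0) == 0:
        font.space_width = user_space_width
    return font


with_space = apply_user_space_width(SimpleFont({" ": 278.0}, 278.0), 200.0)
without_space = apply_user_space_width(SimpleFont({"a": 556.0}, 0.0), 200.0)
print(with_space.space_width, without_space.space_width)
```

The font that knows its own space width keeps it, while the font without one takes the user-provided value.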

PJBrs · 2026-01-06T11:56:03Z

Forget my comment that this is faster. It's just the same as it was.

This reverts "ENH: TextExtractor: Separate old and new text for width calculation" and embeds the font widths calculation within the _handle_tj() method in _text_extractor.py and in get_display_str() in _text_extraction/__init__.py instead. This way, we get the character widths within the same loop in which we collect the unicode characters, without the need to keep track of old and new text, and having to add or separate these later on. Also, it actually takes so little code that this hardly justified the _get_actual_text_widths that did this before.
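
The idea of collecting characters and their widths in one loop can be sketched as follows. This is not the code of _handle_tj() or get_display_str(); the function name, the use of a plain dictionary as character map, and keeping unmapped codes unchanged are assumptions made for illustration.

```python
def decode_with_widths(codes: str, character_map: dict, character_widths: dict):
    """Map each code to its character and sum the widths in the same pass."""
    chars = []
    total_width = 0.0
    for code in codes:
        # Assumption for this sketch: unmapped codes are kept unchanged.
        char = character_map.get(code, code)
        chars.append(char)
        total_width += character_widths.get(char, character_widths["default"])
    return "".join(chars), total_width


# Made-up character map and widths.
cmap = {"\x01": "H", "\x02": "i"}
widths = {"H": 700.0, "i": 300.0, "default": 500.0}
print(decode_with_widths("\x01\x02\x03", cmap, widths))
```

Because the width is accumulated while decoding, there is no need to remember which part of the text was already measured.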

PJBrs · 2026-01-06T18:38:04Z

@stefan6419846 Thanks for your additional comments! I think that I have addressed all of them. Please let me know if you have any other comments.

stefan6419846

Thanks for your patience.

PJBrs marked this pull request as draft December 23, 2025 15:44

PJBrs force-pushed the fontwork branch from a458987 to f71eece Compare December 23, 2025 23:12

PJBrs force-pushed the fontwork branch 8 times, most recently from 93accb3 to 3be4e39 Compare December 30, 2025 18:56

PJBrs force-pushed the fontwork branch 5 times, most recently from eeb8357 to 2db3744 Compare January 2, 2026 13:20

PJBrs marked this pull request as ready for review January 2, 2026 13:29

PJBrs force-pushed the fontwork branch from 2db3744 to 6aad5ab Compare January 2, 2026 17:22

PJBrs commented Jan 2, 2026

pypdf/_text_extraction/_text_extractor.py (outdated)

stefan6419846 reviewed Jan 5, 2026

PJBrs added 10 commits January 5, 2026 20:05

ENH: Add Font class

505325b

ENH: Also collect character widths when encoding is a string

4a7e34b

ENH: FontDescriptor: Add default width to character widths

4baf7ee

ENH: Use space width from own calculations

6f2ad9a

MAINT: FontDescriptor: Remove superfluous if condition

628fe59

MAINT: Remove specific Font class from layout mode

52f53e2

PJBrs added 14 commits January 5, 2026 20:05

MAINT: TextExtractor: use font instead of cmap in get_text_operands

a4a03a3

MAINT: TextExtractor: Use font character map in get_display_str

5dd01d7

MAINT: _cmap.py: Remove compute_font_width

db6e38b

ENH: Placeholder commit Type3 Font Descriptor

bd1cf64

MAINT: Text extraction init: remove cmap

66fea16

MAINT: TextExtractor: Remove cmap attribute

68b9ecd

MAINT: _cmap.py: Remove build_font_width_map

b8ad572

MAINT: _cmap.py: Remove unused code

83a8827

MAINT: _cmap.py: Remove get_actual_str_key method

9d74219

MAINT: _cmap.py: Remove compute_space_width

aa3cab1

MAINT: test_cmaps.py: Port test for iss1533 to new Font code

889194e

MAINT: font.py: Import _cmap's get_encoding in the normal way

1663759

PJBrs force-pushed the fontwork branch from 8cffdf7 to 1663759 Compare January 5, 2026 19:08

ENH: TextExtractor: Separate old and new text for width calculation

2b1e18c

stefan6419846 reviewed Jan 6, 2026

pypdf/_text_extraction/_layout_mode/_text_state_params.py (outdated)

stefan6419846 reviewed Jan 6, 2026

pypdf/_text_extraction/_text_extractor.py (outdated)

stefan6419846 reviewed Jan 6, 2026

pypdf/generic/_appearance_stream.py (outdated)

PJBrs added 4 commits January 6, 2026 19:17

MAINT: _layout_mode: Rename TD_offset td_offset

e825e11

MAINT: _generate_appearance_stream_data: Don't type Optional

3d2b714

MAINT: _page.py: Update comment that mentioned old Font class

7101294

stefan6419846 approved these changes Jan 7, 2026

stefan6419846 merged commit d9ce594 into py-pdf:main Jan 7, 2026
18 checks passed

michelcrypt4d4mus pushed a commit to michelcrypt4d4mus/pdfalyzer that referenced this pull request Jan 15, 2026

Set max version of pypdf to below 6.6.0 because of breaking [change](…

993a73c

…py-pdf/pypdf#3583) with `_cmap.build_char_map()` from

michelcrypt4d4mus mentioned this pull request Jan 15, 2026

pdfalyzer does not work with pypdf 6.6.0 michelcrypt4d4mus/pdfalyzer#32

Open
