ENH: Wrap and align text in flattened PDF forms #3465

PJBrs · 2025-09-13T14:47:27Z

This patch implements text wrapping and alignment in appearance streams.

My biggest doubt is in the formatting dict that I added.

The scale_text method was vibe-coded, as well as the code for right-aligned text and centered text, but they both work great.

The result offers a good basis for text wrapping. I did notice, however, that the results with pdftk are better. In the future, it would be nice to read the annotation border info from the annotation itself instead of just adding some padding here and there (which is the case now). I also notice there's an annotation option called "comb" that is not taken into account. Then there is annotation text colour... Finally, pdftk takes into account the font bounding box / ascent in deciding scaled font size.

For now, however, this PR "finishes" PDF flattening in the sense that it correctly wraps long texts and aligns them as intended.

Related but not fixed here: #2153
I think this does fix the alignment part of #1919

codecov · 2025-09-13T14:56:42Z

Codecov Report

❌ Patch coverage is 99.09910% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.11%. Comparing base (9fad9ff) to head (4ccfa3a).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pypdf/generic/_appearance_stream.py | 98.83% | 1 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3465      +/-   ##
==========================================
+ Coverage   97.09%   97.11%   +0.02%     
==========================================
  Files          56       57       +1     
  Lines        9658     9753      +95     
  Branches     1748     1767      +19     
==========================================
+ Hits         9377     9472      +95     
  Misses        168      168              
  Partials      113      113


pypdf/_writer.py

PJBrs · 2025-09-26T11:20:53Z

This is a reworked version on top of #3466

Not for review right now.

This patch lets the _update_field_annotation method return an appearance stream instead of None, so that this method can be separated out of _writer.py later on.

Add a couple of comments to the update_page_form_fields method, and change the flatten command later on. Underlying logic: First set the field value, then get its appearance stream, and, if it has one, flatten it if appropriate.

This patch introduces a new module - appearance_stream - and copies two methods from _writer to this new module. Currently, these methods are needed to develop an appearance stream for a text annotation. They are:

- update_field_annotation (renamed from _update_field_annotation)
- generate_appearance_stream

update_field_annotation was a PdfWriter method, which means that the current code needs some refactoring, since it now has a circular import of PdfWriter. Other than changing self to writer in update_field_annotation, and changing the code in PdfWriter to call update_field_annotation from _appearance_stream, this patch changes nothing.

In a future change, we might want to make a class TextAppearanceStream based on generate_appearance_stream, with .from_annotation(Annotation) as a class method (based on update_field_annotation). scale_text would also be a method in this class.

This patch introduces the TextAppearanceStream class, with .from_text_annotation as a class method to instantiate it from a text annotation. It includes the code from generate_appearance_stream and _update_field_annotation.

Code in _appearance_stream used various rather cryptic variable names that, for some coders, made it hard to understand what the code was doing. This patch tries to clarify those variable names to make it easier to understand what's going on, and make it easier later on to add functionality. Overview of the changes:

- txt --> text
- sel --> selection
- da --> default_appearance
- font_full_rev --> font_glyph_byte_map
- rct --> rect
- enc_line --> encoded_line
- af --> acro_form
- dr --> document_resources / document_font_resources
- font_res --> font_resource

Furthermore, I undid some abbreviated imports:

- AnnotationDictionaryAttributes no longer as AA
- FieldDictionaryAttributes no longer as FA

This patch removes the variable name "font_height", because it means the same thing as font size. I think that font_height was introduced previously to distinguish between a font size found in an annotation's default appearance and the size set by a user. To be consistent, also use the variable user_font_name when it pertains to a user choice, and font_name for a font name found in a default appearance.

This patch adds more comments, especially to the from_text_annotation method, in the hope that this will later ease further refactoring.

This patch aims to make a couple of variables and associated imports more readable by writing them out in full instead of having very short abbreviations.

This patch makes the code for producing the appearance stream data into a separate method.

The y_offset calculation occurs very early on in the code, necessitating carrying it across various methods. This patch simplifies that logic.

This moves parsing the multiline field flag to the place where the other field flags are parsed, and moves the consequences for font size elsewhere.
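As a rough illustration of what parsing the field flags involves, the sketch below tests individual bits of the /Ff integer. The bit positions are the ones the PDF specification lists for text fields (multiline is bit position 13, comb is bit position 25); the helper name and the returned dict are hypothetical and are not taken from this PR.

```python
# Bit positions from the PDF specification's text-field flag table.
MULTILINE = 1 << 12  # bit position 13
COMB = 1 << 24       # bit position 25


def parse_text_field_flags(field_flags: int) -> dict[str, bool]:
    """Hypothetical helper: report which text-field flags are set in /Ff."""
    return {
        "multiline": bool(field_flags & MULTILINE),
        "comb": bool(field_flags & COMB),
    }
```

Parsing all flags in one place like this keeps the later decision (wrap or only scale) independent of how the flag integer was read.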

Instead of passing around default appearance, construct it from given font name, size and color. Also, having a default appearance as an argument for a text stream appearance seems less "natural" than just passing font name, size and color. This patch also represents a small number of simplifications that improve test coverage.
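To show what constructing a default appearance from its parts could look like, here is a minimal sketch. It assumes the usual /DA layout (font name and size followed by Tf, then a colour operator: g for gray, rg for RGB, k for CMYK). The function name and signature are made up for this example and may differ from what the patch does.

```python
def build_default_appearance(font_name: str, font_size: float, color=(0.0,)) -> str:
    """Hypothetical helper: assemble a /DA string such as '/Helv 12 Tf 0 g'."""
    # The number of colour components selects the colour operator.
    operators = {1: "g", 3: "rg", 4: "k"}
    if len(color) not in operators:
        raise ValueError("color must have 1 (gray), 3 (RGB) or 4 (CMYK) components")
    components = " ".join(f"{component:g}" for component in color)
    return f"{font_name} {font_size:g} Tf {components} {operators[len(color)]}"
```

For example, `build_default_appearance("/Helv", 0, (1, 0, 0))` describes red, auto-sized Helvetica text.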

Move the font resource parsing code to TextAppearanceStream, in the hope that, later, one might be able to generate a TextAppearanceStream directly. I wonder, though, where the necessary font resource would come from.

mypy complained that the .from_font_resource method's return type is Optional[FontDescriptor]. Change the code to not confuse mypy.

This adds a method to calculate the width of a text string. This method can later be used to wrap text at a certain length. Code blatantly copied from the _font.py file in the text extractor code.
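The idea behind such a width calculation can be sketched as follows, assuming glyph widths are given in thousandths of a text-space unit, as is usual for font metrics. The function name, the plain dict of widths and the fallback width are assumptions for illustration only; the actual method in this PR works on a FontDescriptor.

```python
def text_width(text: str, font_size: float, char_widths: dict[str, int],
               default_width: int = 500) -> float:
    """Hypothetical helper: width of `text` in points at `font_size`.

    `char_widths` maps characters to glyph widths in 1/1000 text-space
    units; unknown characters fall back to `default_width`.
    """
    total = sum(char_widths.get(char, default_width) for char in text)
    return total * font_size / 1000
```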

This patch adds a method to scale and wrap text, depending on whether or not text is allowed to be wrapped. It takes a couple of arguments, including the text string itself, field width and height, font size, a FontDescriptor with character widths, and a bool specifying whether or not text is allowed to wrap. It returns the text as a list of tuples, each containing the length of a line and its contents, together with the font size used for these lines and lengths.
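One way to picture this is a greedy word wrap combined with a loop that shrinks the font until everything fits. The sketch below is not the scale_text code from this PR: the function names, the starting size, the step, the minimum size and the leading factor are all assumptions chosen for the example, and widths come from a plain dict instead of a FontDescriptor.

```python
def line_width(text, font_size, char_widths, default_width=500):
    # Glyph widths are in 1/1000 text-space units.
    return sum(char_widths.get(c, default_width) for c in text) * font_size / 1000


def wrap_lines(text, max_width, font_size, char_widths):
    """Greedy word wrap; returns a list of (width, line) tuples."""
    lines = []
    for paragraph in text.split("\n"):
        current = ""
        for word in paragraph.split(" "):
            candidate = f"{current} {word}" if current else word
            if current and line_width(candidate, font_size, char_widths) > max_width:
                lines.append(current)
                current = word
            else:
                current = candidate
        lines.append(current)
    return [(line_width(line, font_size, char_widths), line) for line in lines]


def scale_and_wrap(text, field_width, field_height, char_widths, multiline,
                   start_size=12.0, min_size=4.0, step=0.5, leading=1.2):
    """Shrink the font until the (optionally wrapped) text fits the field."""
    font_size = start_size
    while True:
        if multiline:
            lines = wrap_lines(text, field_width, font_size, char_widths)
        else:
            lines = [(line_width(text, font_size, char_widths), text)]
        fits_width = all(width <= field_width for width, _ in lines)
        fits_height = len(lines) * font_size * leading <= field_height
        # Stop when the text fits, or give up at the minimum size.
        if (fits_width and fits_height) or font_size <= min_size:
            return lines, font_size
        font_size -= step
```

With wrapping allowed, a long text keeps its size and is split over several lines; without it, only the font size changes.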

This patch scales and/or wraps text that does not fit into a text field unaltered, under the condition that font size was set to 0 in the default appearance stream. We only wrap text if the multiline bit was set in the corresponding annotation's field flags, otherwise we just scale the font until it fits. We move the escaping of parentheses below, so that it does not interfere with calculating the width of a text string.
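Why the order matters: escaping adds backslashes that are not drawn, so measuring an already escaped string would overestimate its width. A minimal sketch, with hypothetical helper names and a fixed glyph width standing in for real font metrics:

```python
def escape_pdf_literal(text: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def measure_then_escape(text: str, font_size: float,
                        glyph_width: int = 500) -> tuple[float, str]:
    # Measure the visible characters first ...
    width = len(text) * glyph_width * font_size / 1000
    # ... and only then add the backslashes needed inside "( ... ) Tj".
    return width, escape_pdf_literal(text)
```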

Make sure that we always have Helvetica as a viable font resource, for which we surely have all necessary font metrics needed for text wrapping.

This patch changes the TextAppearanceStream code so that it can deal with right alignment and centered text. Note that both require correct font metrics in order to work.
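The dependence on font metrics becomes clear when writing out the horizontal offset: for anything other than left alignment, the line width has to be known. The sketch below uses the quadding values from the PDF specification (/Q 0 = left, 1 = centered, 2 = right); the function name and the fixed padding are assumptions, not code from this PR.

```python
def aligned_x_offset(line_width: float, field_width: float,
                     alignment: int, padding: float = 2.0) -> float:
    """Hypothetical helper: x position at which a line should start."""
    if alignment == 1:  # centered
        return (field_width - line_width) / 2
    if alignment == 2:  # right-aligned
        return field_width - line_width - padding
    return padding  # left-aligned (default)
```

If the line width is wrong because glyph widths are missing, centered and right-aligned text will visibly drift, whereas left-aligned text is unaffected.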

We need the info that is in CORE_FONT_METRICS, and that is the same information as in _default_fonts_space_width anyway. So this patch removes a bit of redundancy.

Add tests for the TextStreamAppearance.

stefan6419846 reviewed Sep 13, 2025

pypdf/_writer.py (outdated)

stefan6419846 reviewed Sep 13, 2025

pypdf/_writer.py (outdated)

PJBrs marked this pull request as draft September 15, 2025 12:02

PJBrs force-pushed the wrap branch from 438cc37 to 6af7101 Compare September 26, 2025 10:56

PJBrs added 23 commits October 6, 2025 17:34

MAINT: _writer: let _update_field_annotation return appearance stream

20fae82

This patch lets the _update_field_annotation method return an appearance stream instead of None, so that this method can be separated out of _writer.py later on.

MAINT: _writer: refactor update_page_form_fields

9cdda3e

Add a couple of comments to the update_page_form_fields method, and change the flatten command later on. Underlying logic: First set the field value, then get its appearance stream, and, if it has one, flatten it if appropriate.

MAINT: Turn the appearance stream code into a class

63beca7

This patch introduces the TextAppearanceStream class, with .from_text_annotation as a class method to instantiate it from a text annotation. It includes the code from generate_appearance_stream and _update_field_annotation.

MAINT: _appearance_stream: More comments

d1a5421

This patch adds more comments, especially to the from_text_annotation method, in the hope that this will later ease further refactoring.

MAINT: _writer.py: Make some variables more readable

82ea683

This patch aims to make a couple of variables and associated imports more readable by writing them out in full instead of having very short abbreviations.

MAINT: _appearance_stream: Factor out generation of text appearance

8cd2820

This patch makes the code for producing the appearance stream data into a separate method.

MAINT: _appearance_stream: Move y_offset calculation

7ac510a

The y_offset calculation occurs very early on in the code, necessitating carrying it across various methods. This patch simplifies that logic.

MAINT: _appearance_stream: Move multiline parsing

c5f6e51

This moves parsing the multiline field flag to the place where the other field flags are parsed, and moves the consequences for font size elsewhere.

MAINT: _appearance_stream: Move font_resource parsing

fba4195

Move the font resource parsing code to TextAppearanceStream, in the hope that, later, one might be able to generate a TextAppearanceStream directly. I wonder, though, where the necessary font resource would come from.

MAINT: _appearance_stream: Document all methods

aa8b1f1

ROB: _font: Always returns a FontDescriptor; fix typing

8b910df

mypy complained that the .from_font_resource method's return type is Optional[FontDescriptor]. Change the code to not confuse mypy.

ENH: _font: Add method to calculate text width

a801d0d

This adds a method to calculate the width of a text string. This method can later be used to wrap text at a certain length. Code blatantly copied from the _font.py file in the text extractor code.

ROB: TextAppearanceStream: Add default font resource

f443ab6

Make sure that we always have Helvetica as a viable font resource, for which we surely have all necessary font metrics needed for text wrapping.

ENH: TextAppearanceStream: Add right alignment and centering

21e1cff

This patch changes the TextAppearanceStream code so that it can deal with right alignment and centered text. Note that both require correct font metrics in order to work.

MAINT: TextAppearanceStream: Don't use _default_fonts_space_width

3d00f94

We need the info that is in CORE_FONT_METRICS, and that is the same information as in _default_fonts_space_width anyway. So this patch removes a bit of redundancy.

ENH: tests: _appearance_stream

5ea72b0

Add tests for the TextStreamAppearance.

ENH: docs: Add documentation about flattening a PDF form

4ccfa3a

PJBrs force-pushed the wrap branch from 6af7101 to 4ccfa3a Compare October 6, 2025 18:54
