MAINT: Refactor _writer code into _appearance_stream #3466

PJBrs · 2025-09-15T13:57:51Z

This PR is intended to result in a class TextAppearanceStream, based on generate_appearance_stream, with .from_annotation(Annotation) as a class method (based on update_field_annotation). A future method scale_text would also be a method in this class.

This removes some code from _writer and makes text appearances more modular. Also, it paves the way for adding code for improving text wrapping etc. later on.

stefan6419846 · 2025-09-15T14:27:40Z

As already indicated in the other PR, please make this part of the generic directory instead of the root one.

codecov · 2025-09-15T14:35:23Z

Codecov Report

❌ Patch coverage is 98.62069% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.09%. Comparing base (9fad9ff) to head (aa8b1f1).

Files with missing lines	Patch %	Lines
pypdf/generic/_appearance_stream.py	98.00%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #3466   +/-   ##
=======================================
  Coverage   97.09%   97.09%           
=======================================
  Files          56       57    +1     
  Lines        9658     9681   +23     
  Branches     1748     1753    +5     
=======================================
+ Hits         9377     9400   +23     
  Misses        168      168           
  Partials      113      113

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

PJBrs · 2025-09-15T15:32:05Z

OK, I moved the module to generic.

Before I take this forward further, can you please comment on the general direction that I proposed here? At what point would this be sufficient to move forward with?

Definitely I need to avoid the circular import of PdfWriter that mypy rightly complains about. But should this also create the TextAppearanceStream class? Initialisible with an annotation? Or is the current basis enough for now?

EDIT

I any case, it would be nice to split update_field_annotation in two parts, one that analysis the annotation for all its formatting, and another that takes an appearance stream and turns it into an appearance dict object for the annotation.

But please share what direction you see for this refactoring!

PJBrs · 2025-09-16T05:43:54Z

The PR now represents exactly what I had in mind.

One question remains, does the new TextAppearanceStream class indeed warrant its own module, or should it be part of another module. Note that the other PR would quickly increase its size, and further potential change would only let this module grow. Please advise.

stefan6419846

Thanks for the PR. For me, having this new generic module is completely fine.

I have added some remarks about potential improvements.

pypdf/generic/_appearance_stream.py

stefan6419846 · 2025-09-16T15:21:21Z

pypdf/generic/_appearance_stream.py

+
+    def __init__(
+        self,
+        txt: str = "",


Please replace all abbreviations by proper names which do not require guessing or similar.

OK... so like:
txt -> text
sel -> selection
da -> default_appearance
af -> acro_form
font_full_rev -> {I don't even know what this is intended to mean}
rct -> rect {as is everywhere else)

I can certainly do that.

Please note that these are just the variable names from before the refactoring. Although I also agree that a lot of other improvement seems possible here. The question is what to include in this PR...

This is done, across the board, but only in _appearance_stream.py. I did not change the same variables in _writer.py.

While I am aware that this are the old names, while we are refactoring (and especially creating new data structures), it is desirable to improve naming as well, like here.

I'll check the closely associated writer code as well, but only in the update_page_form_field_values method.

I added a separate patch to improve readability of variables in the _writer.py code. I might have gone too far in that one, please advise. I could simply drop it again, no problem.

From my perspective, as a rather novice coder, all these abbreviations really hurt readability, but the reverse might be true from the perspective of an experienced coder,

pypdf/generic/_appearance_stream.py

pypdf/_writer.py

stefan6419846 · 2025-09-16T15:34:07Z

pypdf/_writer.py

+                        appearance_stream_obj = TextStreamAppearance.from_text_annotation(
+                            af, parent_annotation, annotation
+                        )
+                    # Add the appearance stream object


We might be able to move this to the new class as well.

Oh, I would need to study more to see what you mean here!

PJBrs · 2025-09-19T08:52:17Z

@stefan6419846 I tried to refactor more of the code and clarify more. I think that I addressed almost all of your above comments.

What I did not do: refactor the .from_text_annotation code into separate private methods. I did add a couple of comments to that end, and I also simplified the code a bit.

Finally, my changes reduced test coverage a little bit.

Next step:
I think I should re-improve test coverage.
I could refactor the .from_text_annotation code into separate private methods, but that would require more specific field flag parsing code:

Multiline flag
Comb flag
Rich text flag

(We don't have code that deals with the latter two cases.)

It would quite complicate all the changes I already made, and perhaps not improve the readabilty of this PR any further (might make it too big).

PJBrs · 2025-09-19T14:18:03Z

WAIT @stefan6419846

I went slightly the wrong direction. I don't think that default_appearance should be an argument to the TextAppearance class. It should instead be font_name, font_size, font_color. Don't review yet.

PJBrs · 2025-09-20T12:43:58Z

@stefan6419846 So, I did some final refactoring to try and put all code in its most "logical" place, while making the arguments to TextStreamAppearance as logical and straightforward as possible.

Despite this PR not being entirely finished, I marked it as ready for review because I now did a lot of refactoring. The present code is beginning to resemble a rewrite of what was originally there.

ToDos:

Write some documentation
Add a bit more testing
Move some existing tests that mainly pertain to the appearance stream to a separate test module test_appearance_stream

Other than that, I do not think any more refactoring would be possible at this junction, barring perhaps subdividing the from_text_annotation method into a series of private methods. However, the current refactoring in this PR, I think, was necessary before even considering this, because it appeared the code was somewhat all over the place.

Two things to note:

The new code does not pass around a default_appearance string anymore. Instead, it reconstructs it from font_name, font_size and font_color. As a consequence, with font being a float, the default appearance now shows font size always with at least on decimal. I've changed affected tests accordingly. I can also remove the decimal, and change a different test accordingly.
One patch (9f494dc) focuses on readability in _writer.py. Let me know if that one is alright.

EDIT
Please let me know whether moving existing tests to a new module would be necessary in this PR...

PJBrs · 2025-09-21T08:34:42Z

I added documentation, made the code a little bit simpler.

Two final things - I'm not sure whether the code that adds the font resource is in the right place, and I wonder whether we need to move some tests to a dedicated tests/test_appearance_stream.py file. Other than that, I think this is ready.

EDIT

Thinking about the font thing. This PR should ideally support two case. The first: initialising a TextStreamAppearance from an annotation. The second: initialising a TextStreamAppearance directly, even though that has no use case yet.

For the first case, associated font resources will normally be one of the 14 standard fonts, or an embedded font. The code supports setting a font name, but this implies that it only works in case of an embedded font that is already associated with the acro_form resources. The second case can hardly work, because it requires setting a font character map and there's only one bit of code that does that, and it's in the from_text_annotation code.

This poses various questions. First, about from_text_annotation, I wonder what use it is to manually set a font name if there is no associated font dictionary. I assume (but haven't tested) that this would only work with a system font. Second, for the case of initialising a TextStreamAppearance directly, one ought to be able to supply a font resource dictionary for a font embedded in the pdf. Otherwise, I don't see how it would work. But this also means that setting a font resource dictionary ought to be part of the TextStreamAppearance init method, and not the from_text_annotation method.

What to do.

Check whether setting a font for a non-embedded font works.
If so, find out whether pypdf might read font descriptor information from a system font

In any case, for this PR, it suggests that at least the font resource ought to be an argument to the TextStreamAppearance init method, and the same holds for embedding the font resource in the text stream appearance.

Nonetheless, that's only one more patch to this PR, and it does not change any code anymore.

EDIT 2

Passing a non-embedded font just does nothing. I'm assuming that, from an annotation, one would need to find font resources elsewhere, not in the acroform dictionary. And if you can't find a font, then it would need to be embedded. This means that, currently, changing a font name can only be expected to do something when you also have added a corresponding font resource to the acroform first.

#2770

EDIT 3 NOT FOR THIS PR - if we are a bit more serious about finding and parsing font resources, then from_text_annotation needs document resources as an argument. I'll keep it at that :-)

EDIT 4 DEFINITELY NOT FOR THIS PR - it turns out to be very easy to describe a font resource, and it would be trivial to offer the option to set whichever of the 14 base fonts by name with the current code. Furthermore, I tested a bit more, and it's also not so hard to add a font resource for a non-embedded font, although one would need to parse an external font file and generate a font resource from that.

PJBrs · 2025-09-21T19:35:06Z

Done!

... unless you'd like me to move some tests.

PJBrs · 2025-10-01T06:58:25Z

@stefan6419846 I worry a bit that I've made this PR too complex to review. Please advise!

In the meantime, I redid some of the patches to make them more coherent. This does not affect the changes in total, but it clarifies what each individual patch in this PR tries to accomplish. It also makes it easier to drop a couple of patches to simplifiy this PR.

I dropped one patch, that changed the location of the code that escapes parentheses (https://github.com/py-pdf/pypdf/pull/3466/files#diff-62512f011b84139b80ae75e1ee2b5caf32661f593564500a6ae8a358c88e7a9bR281) . The more I think about it, the more I'm convinced that that code cannot work as intendend. Especially when applied to selection values, it does escape parentheses in the user choice, but not in the list of choices. Also, it would again escape parentheses that are already escaped. I decided to just leave that bit of code alone for now.

In any case, please let me know if you can proceed with reviewing this code, or what simplifications (or other changes) I still need to make.

stefan6419846 · 2025-10-01T07:04:01Z

I am currently rather busy with other stuff - most of my latest changes here where rather trivial or required for my own use cases and thus prioritized from my side. Hopefully, I will be able to look into the changes again this month.

PJBrs · 2025-10-02T16:39:26Z

I am currently rather busy with other stuff - most of my latest changes here where rather trivial or required for my own use cases and thus prioritized from my side. Hopefully, I will be able to look into the changes again this month.

Take care! In the meantime, I'm only changing this PR to simplify for review. Although I don't think I can do anymore in that regard.

This patch lets the _update_field_annotation method return an appearance stream instead of None, so that this method can be separated out of _writer.py later on.

Add a couple of comments to the update_page_form_fields method, and change the flatten command later on. Underlying logic: First set the field value, then get its appearance stream, and, if it has one, flatten it if appropriate.

This patch introduces a new module - appearance_stream - and copies two methods from _writer to this new module. Currently, these methods are needed to develop an appearance stream for a text annotation. They are: update_field_annotation (renamed from _update_field_annotation) generate_appearance_stream The update_field_annotation was a PdfWriter method, which means that the current code needs some refactoring, since it now has a circular import of PdfWriter. Other than changing self to writer in update_field_annotation, and changing the code in PdfWriter to call update_field_annotation from _appearance_stream, this patch changes nothing. In a future change, we might want to make a class TextAppearanceStream based on generate_appearance_stream, with .from_annotation(Annotation) as a class method (based on update_field_annotaion). scale_text would also be a method in this class.

This patch introduces the TextAppearanceStream class, with .from_text_annotation as a class method to instantiate it from a text annotation. It includes the code from generate_appearance_stream and _update_field_annotation.

Code in _appearance_stream used various rather cryptic variable names that, for some coders, made it hard to understand what the code was doing. This patch tries to clarify those variable names to make it easier to understand what's going on, and make it easier later on to add functionality. Overview of the changes: txt --> text sel --> selection da --> default_appearance font_full_rev --> font_glyph_byte_map rct --> rect enc_line --> encoded_line af --> acro_form dr --> document_resources / document_font_resources font_res --> font_resource Furthermore, I undid some abbreviated imports: - AnnotationDictionaryAttributes no longer as AA - FieldDictionaryAttributes no longer as FA

This patch removes the variable name "font_height", because it means the same thing as font size. I think that font_height was introduced previously to distinguish between a font size found in an annotation's default appearance and the size set by a user. To be consistent, also use the variable user_font_name when it pertains to a user choice, and font_name for a font name found in a default appearance.

This patch adds more comments, especially to the from_text_annotation method, in the hope that this will later ease further refactoring.

This patch aims to make a couple of variables and associated imports more readable by writing them out in full instead of having very short abbreviations.

This patch makes the code for producing the appearance stream data into a separate method.

The y_offset calculation occurs very early on in the code, necessitating carrying it across various methods. This patch simplifies that logic.

This moves parsing the multiline field flag to the place where the other field flags are parsed, and moves the consequences for font size elsewhere.

Instead of passing around default appearance, construct it from given font name, size and color. Also, having a default appearance as an argument for a text stream appearance seems less "natural" than just passing font name, size and color. This patch also represents a small number of simplifications that improve test coverage.

Move the font resource parsing code to TextAppearanceStream, in the hope that, later, one might be able to generate a TextAppearanceStream directly. I wonder, though, where the necessary font resource would come from.

PJBrs force-pushed the refactor branch from 393e464 to c4aa15b Compare September 15, 2025 14:26

PJBrs marked this pull request as draft September 15, 2025 15:15

PJBrs force-pushed the refactor branch from c4aa15b to 79f84cb Compare September 15, 2025 15:29

PJBrs force-pushed the refactor branch 2 times, most recently from dd22a58 to 2660a8a Compare September 15, 2025 20:46

PJBrs marked this pull request as ready for review September 16, 2025 05:41

stefan6419846 requested changes Sep 16, 2025

View reviewed changes

PJBrs force-pushed the refactor branch from 2660a8a to 79cbaab Compare September 17, 2025 13:01

PJBrs marked this pull request as draft September 17, 2025 13:01

PJBrs force-pushed the refactor branch 2 times, most recently from 28574fa to cf6588f Compare September 19, 2025 08:30

PJBrs requested a review from stefan6419846 September 19, 2025 08:46

PJBrs force-pushed the refactor branch 2 times, most recently from fa67555 to 52e4a6e Compare September 20, 2025 12:02

PJBrs marked this pull request as ready for review September 20, 2025 12:30

PJBrs force-pushed the refactor branch from 651b415 to fdb4dbf Compare September 21, 2025 08:22

PJBrs mentioned this pull request Sep 26, 2025

ENH: Wrap and align text in flattened PDF forms #3465

Draft

PJBrs force-pushed the refactor branch from d40ec2b to db20841 Compare October 1, 2025 06:42

PJBrs force-pushed the refactor branch from db20841 to 830ecd3 Compare October 2, 2025 16:25

PJBrs added 14 commits October 6, 2025 17:34

MAINT: _writer: let _update_field_annotation return appearance stream

20fae82

This patch lets the _update_field_annotation method return an appearance stream instead of None, so that this method can be separated out of _writer.py later on.

MAINT: _writer: refactor update_page_form_fields

9cdda3e

Add a couple of comments to the update_page_form_fields method, and change the flatten command later on. Underlying logic: First set the field value, then get its appearance stream, and, if it has one, flatten it if appropriate.

MAINT: Turn the appearance stream code into a class

63beca7

This patch introduces the TextAppearanceStream class, with .from_text_annotation as a class method to instantiate it from a text annotation. It includes the code from generate_appearance_stream and _update_field_annotation.

MAINT: _appearance_stream: More comments

d1a5421

This patch adds more comments, especially to the from_text_annotation method, in the hope that this will later ease further refactoring.

MAINT: _writer.py: Make some variables more readable

82ea683

This patch aims to make a couple of variables and associated imports more readable by writing them out in full instead of having very short abbreviations.

MAINT: _appearance_stream: Factor out generation of text appearance

8cd2820

This patch makes the code for producing the appearance stream data into a separate method.

MAINT: _appearance_stream: Move y_offset calculation

7ac510a

The y_offset calculation occurs very early on in the code, necessitating carrying it across various methods. This patch simplifies that logic.

MAINT: _appearance_stream: Move multiline parsing

c5f6e51

This moves parsing the multiline field flag to the place where the other field flags are parsed, and moves the consequences for font size elsewhere.

MAINT: _appearance_stream: Move font_resource parsing

fba4195

Move the font resource parsing code to TextAppearanceStream, in the hope that, later, one might be able to generate a TextAppearanceStream directly. I wonder, though, where the necessary font resource would come from.

MAINT: _appearance_stream: Document all methods

aa8b1f1

PJBrs force-pushed the refactor branch from 830ecd3 to aa8b1f1 Compare October 6, 2025 19:00

MAINT: Refactor _writer code into _appearance_stream #3466

Are you sure you want to change the base?

MAINT: Refactor _writer code into _appearance_stream #3466

Uh oh!

Conversation

PJBrs commented Sep 15, 2025

Uh oh!

stefan6419846 commented Sep 15, 2025

Uh oh!

codecov bot commented Sep 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

PJBrs commented Sep 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

PJBrs commented Sep 16, 2025

Uh oh!

stefan6419846 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

PJBrs commented Sep 19, 2025

Uh oh!

PJBrs commented Sep 19, 2025

Uh oh!

PJBrs commented Sep 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

PJBrs commented Sep 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

PJBrs commented Sep 21, 2025

Uh oh!

PJBrs commented Oct 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stefan6419846 commented Oct 1, 2025

Uh oh!

PJBrs commented Oct 2, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov bot commented Sep 15, 2025 •

edited

Loading

PJBrs commented Sep 15, 2025 •

edited

Loading

PJBrs commented Sep 20, 2025 •

edited

Loading

PJBrs commented Sep 21, 2025 •

edited

Loading

PJBrs commented Oct 1, 2025 •

edited

Loading