Skip to content

Conversation

PJBrs
Copy link
Contributor

@PJBrs PJBrs commented Sep 15, 2025

This PR is intended to result in a class TextAppearanceStream, based on generate_appearance_stream, with .from_annotation(Annotation) as a class method (based on update_field_annotation). A future method scale_text would also be a method in this class.

This removes some code from _writer and makes text appearances more modular. Also, it paves the way for adding code for improving text wrapping etc. later on.

@stefan6419846
Copy link
Collaborator

As already indicated in the other PR, please make this part of the generic directory instead of the root one.

Copy link

codecov bot commented Sep 15, 2025

Codecov Report

❌ Patch coverage is 98.62069% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.09%. Comparing base (9fad9ff) to head (aa8b1f1).

Files with missing lines Patch % Lines
pypdf/generic/_appearance_stream.py 98.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3466   +/-   ##
=======================================
  Coverage   97.09%   97.09%           
=======================================
  Files          56       57    +1     
  Lines        9658     9681   +23     
  Branches     1748     1753    +5     
=======================================
+ Hits         9377     9400   +23     
  Misses        168      168           
  Partials      113      113           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@PJBrs PJBrs marked this pull request as draft September 15, 2025 15:15
@PJBrs
Copy link
Contributor Author

PJBrs commented Sep 15, 2025

OK, I moved the module to generic.

Before I take this forward further, can you please comment on the general direction that I proposed here? At what point would this be sufficient to move forward with?

Definitely I need to avoid the circular import of PdfWriter that mypy rightly complains about. But should this also create the TextAppearanceStream class? Initialisible with an annotation? Or is the current basis enough for now?

EDIT

I any case, it would be nice to split update_field_annotation in two parts, one that analysis the annotation for all its formatting, and another that takes an appearance stream and turns it into an appearance dict object for the annotation.

But please share what direction you see for this refactoring!

@PJBrs PJBrs force-pushed the refactor branch 2 times, most recently from dd22a58 to 2660a8a Compare September 15, 2025 20:46
@PJBrs PJBrs marked this pull request as ready for review September 16, 2025 05:41
@PJBrs
Copy link
Contributor Author

PJBrs commented Sep 16, 2025

The PR now represents exactly what I had in mind.

One question remains, does the new TextAppearanceStream class indeed warrant its own module, or should it be part of another module. Note that the other PR would quickly increase its size, and further potential change would only let this module grow. Please advise.

Copy link
Collaborator

@stefan6419846 stefan6419846 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. For me, having this new generic module is completely fine.

I have added some remarks about potential improvements.


def __init__(
self,
txt: str = "",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please replace all abbreviations by proper names which do not require guessing or similar.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK... so like:
txt -> text
sel -> selection
da -> default_appearance
af -> acro_form
font_full_rev -> {I don't even know what this is intended to mean}
rct -> rect {as is everywhere else)

I can certainly do that.

Please note that these are just the variable names from before the refactoring. Although I also agree that a lot of other improvement seems possible here. The question is what to include in this PR...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done, across the board, but only in _appearance_stream.py. I did not change the same variables in _writer.py.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I am aware that this are the old names, while we are refactoring (and especially creating new data structures), it is desirable to improve naming as well, like here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll check the closely associated writer code as well, but only in the update_page_form_field_values method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a separate patch to improve readability of variables in the _writer.py code. I might have gone too far in that one, please advise. I could simply drop it again, no problem.

From my perspective, as a rather novice coder, all these abbreviations really hurt readability, but the reverse might be true from the perspective of an experienced coder,

appearance_stream_obj = TextStreamAppearance.from_text_annotation(
af, parent_annotation, annotation
)
# Add the appearance stream object
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might be able to move this to the new class as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I would need to study more to see what you mean here!

@PJBrs PJBrs marked this pull request as draft September 17, 2025 13:01
@PJBrs PJBrs force-pushed the refactor branch 2 times, most recently from 28574fa to cf6588f Compare September 19, 2025 08:30
@PJBrs PJBrs requested a review from stefan6419846 September 19, 2025 08:46
@PJBrs
Copy link
Contributor Author

PJBrs commented Sep 19, 2025

@stefan6419846 I tried to refactor more of the code and clarify more. I think that I addressed almost all of your above comments.

What I did not do: refactor the .from_text_annotation code into separate private methods. I did add a couple of comments to that end, and I also simplified the code a bit.

Finally, my changes reduced test coverage a little bit.

Next step:
I think I should re-improve test coverage.
I could refactor the .from_text_annotation code into separate private methods, but that would require more specific field flag parsing code:

  • Multiline flag
  • Comb flag
  • Rich text flag

(We don't have code that deals with the latter two cases.)

It would quite complicate all the changes I already made, and perhaps not improve the readabilty of this PR any further (might make it too big).

@PJBrs
Copy link
Contributor Author

PJBrs commented Sep 19, 2025

WAIT @stefan6419846

I went slightly the wrong direction. I don't think that default_appearance should be an argument to the TextAppearance class. It should instead be font_name, font_size, font_color. Don't review yet.

@PJBrs PJBrs force-pushed the refactor branch 2 times, most recently from fa67555 to 52e4a6e Compare September 20, 2025 12:02
@PJBrs PJBrs marked this pull request as ready for review September 20, 2025 12:30
@PJBrs
Copy link
Contributor Author

PJBrs commented Sep 20, 2025

@stefan6419846 So, I did some final refactoring to try and put all code in its most "logical" place, while making the arguments to TextStreamAppearance as logical and straightforward as possible.

Despite this PR not being entirely finished, I marked it as ready for review because I now did a lot of refactoring. The present code is beginning to resemble a rewrite of what was originally there.

ToDos:

  1. Write some documentation
  2. Add a bit more testing
  3. Move some existing tests that mainly pertain to the appearance stream to a separate test module test_appearance_stream

Other than that, I do not think any more refactoring would be possible at this junction, barring perhaps subdividing the from_text_annotation method into a series of private methods. However, the current refactoring in this PR, I think, was necessary before even considering this, because it appeared the code was somewhat all over the place.

Two things to note:

  1. The new code does not pass around a default_appearance string anymore. Instead, it reconstructs it from font_name, font_size and font_color. As a consequence, with font being a float, the default appearance now shows font size always with at least on decimal. I've changed affected tests accordingly. I can also remove the decimal, and change a different test accordingly.
  2. One patch (9f494dc) focuses on readability in _writer.py. Let me know if that one is alright.

EDIT
Please let me know whether moving existing tests to a new module would be necessary in this PR...

@PJBrs
Copy link
Contributor Author

PJBrs commented Sep 21, 2025

I added documentation, made the code a little bit simpler.

Two final things - I'm not sure whether the code that adds the font resource is in the right place, and I wonder whether we need to move some tests to a dedicated tests/test_appearance_stream.py file. Other than that, I think this is ready.

EDIT

Thinking about the font thing. This PR should ideally support two case. The first: initialising a TextStreamAppearance from an annotation. The second: initialising a TextStreamAppearance directly, even though that has no use case yet.

For the first case, associated font resources will normally be one of the 14 standard fonts, or an embedded font. The code supports setting a font name, but this implies that it only works in case of an embedded font that is already associated with the acro_form resources. The second case can hardly work, because it requires setting a font character map and there's only one bit of code that does that, and it's in the from_text_annotation code.

This poses various questions. First, about from_text_annotation, I wonder what use it is to manually set a font name if there is no associated font dictionary. I assume (but haven't tested) that this would only work with a system font. Second, for the case of initialising a TextStreamAppearance directly, one ought to be able to supply a font resource dictionary for a font embedded in the pdf. Otherwise, I don't see how it would work. But this also means that setting a font resource dictionary ought to be part of the TextStreamAppearance init method, and not the from_text_annotation method.

What to do.

  1. Check whether setting a font for a non-embedded font works.
  2. If so, find out whether pypdf might read font descriptor information from a system font

In any case, for this PR, it suggests that at least the font resource ought to be an argument to the TextStreamAppearance init method, and the same holds for embedding the font resource in the text stream appearance.

Nonetheless, that's only one more patch to this PR, and it does not change any code anymore.

EDIT 2

Passing a non-embedded font just does nothing. I'm assuming that, from an annotation, one would need to find font resources elsewhere, not in the acroform dictionary. And if you can't find a font, then it would need to be embedded. This means that, currently, changing a font name can only be expected to do something when you also have added a corresponding font resource to the acroform first.

#2770

EDIT 3 NOT FOR THIS PR - if we are a bit more serious about finding and parsing font resources, then from_text_annotation needs document resources as an argument. I'll keep it at that :-)

EDIT 4 DEFINITELY NOT FOR THIS PR - it turns out to be very easy to describe a font resource, and it would be trivial to offer the option to set whichever of the 14 base fonts by name with the current code. Furthermore, I tested a bit more, and it's also not so hard to add a font resource for a non-embedded font, although one would need to parse an external font file and generate a font resource from that.

@PJBrs
Copy link
Contributor Author

PJBrs commented Sep 21, 2025

Done!

... unless you'd like me to move some tests.

@PJBrs
Copy link
Contributor Author

PJBrs commented Oct 1, 2025

@stefan6419846 I worry a bit that I've made this PR too complex to review. Please advise!

In the meantime, I redid some of the patches to make them more coherent. This does not affect the changes in total, but it clarifies what each individual patch in this PR tries to accomplish. It also makes it easier to drop a couple of patches to simplifiy this PR.

I dropped one patch, that changed the location of the code that escapes parentheses (https://github.com/py-pdf/pypdf/pull/3466/files#diff-62512f011b84139b80ae75e1ee2b5caf32661f593564500a6ae8a358c88e7a9bR281) . The more I think about it, the more I'm convinced that that code cannot work as intendend. Especially when applied to selection values, it does escape parentheses in the user choice, but not in the list of choices. Also, it would again escape parentheses that are already escaped. I decided to just leave that bit of code alone for now.

In any case, please let me know if you can proceed with reviewing this code, or what simplifications (or other changes) I still need to make.

@stefan6419846
Copy link
Collaborator

I am currently rather busy with other stuff - most of my latest changes here where rather trivial or required for my own use cases and thus prioritized from my side. Hopefully, I will be able to look into the changes again this month.

@PJBrs
Copy link
Contributor Author

PJBrs commented Oct 2, 2025

I am currently rather busy with other stuff - most of my latest changes here where rather trivial or required for my own use cases and thus prioritized from my side. Hopefully, I will be able to look into the changes again this month.

Take care! In the meantime, I'm only changing this PR to simplify for review. Although I don't think I can do anymore in that regard.

PJBrs added 14 commits October 6, 2025 17:34
This patch lets the _update_field_annotation method return an
appearance stream instead of None, so that this method can be
separated out of _writer.py later on.
Add a couple of comments to the update_page_form_fields method,
and change the flatten command later on. Underlying logic:
First set the field value, then get its appearance stream, and,
if it has one, flatten it if appropriate.
This patch introduces a new module - appearance_stream - and
copies two methods from _writer to this new module. Currently,
these methods are needed to develop an appearance stream for a
text annotation. They are:
update_field_annotation (renamed from _update_field_annotation)
generate_appearance_stream

The update_field_annotation was a PdfWriter method, which means
that the current code needs some refactoring, since it now has a
circular import of PdfWriter.

Other than changing self to writer in update_field_annotation,
and changing the code in PdfWriter to call update_field_annotation
from _appearance_stream, this patch changes nothing.

In a future change, we might want to make a class
TextAppearanceStream based on generate_appearance_stream, with
.from_annotation(Annotation) as a class method (based on
update_field_annotaion). scale_text would also be a method in
this class.
This patch introduces the TextAppearanceStream class,
with .from_text_annotation as a class method to instantiate
it from a text annotation. It includes the code from
generate_appearance_stream and _update_field_annotation.
Code in _appearance_stream used various rather cryptic variable names
that, for some coders, made it hard to understand what the code was
doing. This patch tries to clarify those variable names to make it
easier to understand what's going on, and make it easier later on
to add functionality. Overview of the changes:
txt             --> text
sel             --> selection
da              --> default_appearance
font_full_rev   --> font_glyph_byte_map
rct             --> rect
enc_line        --> encoded_line
af              --> acro_form
dr              --> document_resources / document_font_resources
font_res        --> font_resource

Furthermore, I undid some abbreviated imports:
 - AnnotationDictionaryAttributes no longer as AA
 - FieldDictionaryAttributes no longer as FA
This patch removes the variable name "font_height", because it
means the same thing as font size. I think that font_height was
introduced previously to distinguish between a font size found
in an annotation's default appearance and the size set by a
user.

To be consistent, also use the variable user_font_name when it
pertains to a user choice, and font_name for a font name found
in a default appearance.
This patch adds more comments, especially to the
from_text_annotation method, in the hope that
this will later ease further refactoring.
This patch aims to make a couple of variables and associated imports
more readable by writing them out in full instead of having very
short abbreviations.
This patch makes the code for producing the appearance stream data
into a separate method.
The y_offset calculation occurs very early on in the code,
necessitating carrying it across various methods. This
patch simplifies that logic.
This moves parsing the multiline field flag to the place
where the other field flags are parsed, and moves the
consequences for font size elsewhere.
Instead of passing around default appearance, construct it
from given font name, size and color. Also, having a default
appearance as an argument for a text stream appearance seems
less "natural" than just passing font name, size and color.

This patch also represents a small number of simplifications
that improve test coverage.
Move the font resource parsing code to TextAppearanceStream,
in the hope that, later, one might be able to generate a
TextAppearanceStream directly. I wonder, though, where the
necessary font resource would come from.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants