-
Notifications
You must be signed in to change notification settings - Fork 1.5k
MAINT: Refactor _writer code into _appearance_stream #3466
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
As already indicated in the other PR, please make this part of the |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3466 +/- ##
=======================================
Coverage 97.09% 97.09%
=======================================
Files 56 57 +1
Lines 9658 9681 +23
Branches 1748 1753 +5
=======================================
+ Hits 9377 9400 +23
Misses 168 168
Partials 113 113 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
OK, I moved the module to generic. Before I take this forward further, can you please comment on the general direction that I proposed here? At what point would this be sufficient to move forward with? Definitely I need to avoid the circular import of PdfWriter that mypy rightly complains about. But should this also create the TextAppearanceStream class? Initialisible with an annotation? Or is the current basis enough for now? EDIT I any case, it would be nice to split update_field_annotation in two parts, one that analysis the annotation for all its formatting, and another that takes an appearance stream and turns it into an appearance dict object for the annotation. But please share what direction you see for this refactoring! |
dd22a58
to
2660a8a
Compare
The PR now represents exactly what I had in mind. One question remains, does the new TextAppearanceStream class indeed warrant its own module, or should it be part of another module. Note that the other PR would quickly increase its size, and further potential change would only let this module grow. Please advise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. For me, having this new generic module is completely fine.
I have added some remarks about potential improvements.
pypdf/generic/_appearance_stream.py
Outdated
|
||
def __init__( | ||
self, | ||
txt: str = "", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please replace all abbreviations by proper names which do not require guessing or similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK... so like:
txt -> text
sel -> selection
da -> default_appearance
af -> acro_form
font_full_rev -> {I don't even know what this is intended to mean}
rct -> rect {as is everywhere else)
I can certainly do that.
Please note that these are just the variable names from before the refactoring. Although I also agree that a lot of other improvement seems possible here. The question is what to include in this PR...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is done, across the board, but only in _appearance_stream.py. I did not change the same variables in _writer.py.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I am aware that this are the old names, while we are refactoring (and especially creating new data structures), it is desirable to improve naming as well, like here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll check the closely associated writer code as well, but only in the update_page_form_field_values method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a separate patch to improve readability of variables in the _writer.py code. I might have gone too far in that one, please advise. I could simply drop it again, no problem.
From my perspective, as a rather novice coder, all these abbreviations really hurt readability, but the reverse might be true from the perspective of an experienced coder,
appearance_stream_obj = TextStreamAppearance.from_text_annotation( | ||
af, parent_annotation, annotation | ||
) | ||
# Add the appearance stream object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might be able to move this to the new class as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I would need to study more to see what you mean here!
28574fa
to
cf6588f
Compare
@stefan6419846 I tried to refactor more of the code and clarify more. I think that I addressed almost all of your above comments. What I did not do: refactor the .from_text_annotation code into separate private methods. I did add a couple of comments to that end, and I also simplified the code a bit. Finally, my changes reduced test coverage a little bit. Next step:
(We don't have code that deals with the latter two cases.) It would quite complicate all the changes I already made, and perhaps not improve the readabilty of this PR any further (might make it too big). |
WAIT @stefan6419846 I went slightly the wrong direction. I don't think that default_appearance should be an argument to the TextAppearance class. It should instead be font_name, font_size, font_color. Don't review yet. |
fa67555
to
52e4a6e
Compare
@stefan6419846 So, I did some final refactoring to try and put all code in its most "logical" place, while making the arguments to TextStreamAppearance as logical and straightforward as possible. Despite this PR not being entirely finished, I marked it as ready for review because I now did a lot of refactoring. The present code is beginning to resemble a rewrite of what was originally there. ToDos:
Other than that, I do not think any more refactoring would be possible at this junction, barring perhaps subdividing the from_text_annotation method into a series of private methods. However, the current refactoring in this PR, I think, was necessary before even considering this, because it appeared the code was somewhat all over the place. Two things to note:
EDIT |
I added documentation, made the code a little bit simpler. Two final things - I'm not sure whether the code that adds the font resource is in the right place, and I wonder whether we need to move some tests to a dedicated tests/test_appearance_stream.py file. Other than that, I think this is ready. EDIT Thinking about the font thing. This PR should ideally support two case. The first: initialising a TextStreamAppearance from an annotation. The second: initialising a TextStreamAppearance directly, even though that has no use case yet. For the first case, associated font resources will normally be one of the 14 standard fonts, or an embedded font. The code supports setting a font name, but this implies that it only works in case of an embedded font that is already associated with the acro_form resources. The second case can hardly work, because it requires setting a font character map and there's only one bit of code that does that, and it's in the from_text_annotation code. This poses various questions. First, about from_text_annotation, I wonder what use it is to manually set a font name if there is no associated font dictionary. I assume (but haven't tested) that this would only work with a system font. Second, for the case of initialising a TextStreamAppearance directly, one ought to be able to supply a font resource dictionary for a font embedded in the pdf. Otherwise, I don't see how it would work. But this also means that setting a font resource dictionary ought to be part of the TextStreamAppearance init method, and not the from_text_annotation method. What to do.
In any case, for this PR, it suggests that at least the font resource ought to be an argument to the TextStreamAppearance init method, and the same holds for embedding the font resource in the text stream appearance. Nonetheless, that's only one more patch to this PR, and it does not change any code anymore. EDIT 2 Passing a non-embedded font just does nothing. I'm assuming that, from an annotation, one would need to find font resources elsewhere, not in the acroform dictionary. And if you can't find a font, then it would need to be embedded. This means that, currently, changing a font name can only be expected to do something when you also have added a corresponding font resource to the acroform first. EDIT 3 NOT FOR THIS PR - if we are a bit more serious about finding and parsing font resources, then from_text_annotation needs document resources as an argument. I'll keep it at that :-) EDIT 4 DEFINITELY NOT FOR THIS PR - it turns out to be very easy to describe a font resource, and it would be trivial to offer the option to set whichever of the 14 base fonts by name with the current code. Furthermore, I tested a bit more, and it's also not so hard to add a font resource for a non-embedded font, although one would need to parse an external font file and generate a font resource from that. |
Done! ... unless you'd like me to move some tests. |
@stefan6419846 I worry a bit that I've made this PR too complex to review. Please advise! In the meantime, I redid some of the patches to make them more coherent. This does not affect the changes in total, but it clarifies what each individual patch in this PR tries to accomplish. It also makes it easier to drop a couple of patches to simplifiy this PR. I dropped one patch, that changed the location of the code that escapes parentheses (https://github.com/py-pdf/pypdf/pull/3466/files#diff-62512f011b84139b80ae75e1ee2b5caf32661f593564500a6ae8a358c88e7a9bR281) . The more I think about it, the more I'm convinced that that code cannot work as intendend. Especially when applied to selection values, it does escape parentheses in the user choice, but not in the list of choices. Also, it would again escape parentheses that are already escaped. I decided to just leave that bit of code alone for now. In any case, please let me know if you can proceed with reviewing this code, or what simplifications (or other changes) I still need to make. |
I am currently rather busy with other stuff - most of my latest changes here where rather trivial or required for my own use cases and thus prioritized from my side. Hopefully, I will be able to look into the changes again this month. |
Take care! In the meantime, I'm only changing this PR to simplify for review. Although I don't think I can do anymore in that regard. |
This patch lets the _update_field_annotation method return an appearance stream instead of None, so that this method can be separated out of _writer.py later on.
Add a couple of comments to the update_page_form_fields method, and change the flatten command later on. Underlying logic: First set the field value, then get its appearance stream, and, if it has one, flatten it if appropriate.
This patch introduces a new module - appearance_stream - and copies two methods from _writer to this new module. Currently, these methods are needed to develop an appearance stream for a text annotation. They are: update_field_annotation (renamed from _update_field_annotation) generate_appearance_stream The update_field_annotation was a PdfWriter method, which means that the current code needs some refactoring, since it now has a circular import of PdfWriter. Other than changing self to writer in update_field_annotation, and changing the code in PdfWriter to call update_field_annotation from _appearance_stream, this patch changes nothing. In a future change, we might want to make a class TextAppearanceStream based on generate_appearance_stream, with .from_annotation(Annotation) as a class method (based on update_field_annotaion). scale_text would also be a method in this class.
This patch introduces the TextAppearanceStream class, with .from_text_annotation as a class method to instantiate it from a text annotation. It includes the code from generate_appearance_stream and _update_field_annotation.
Code in _appearance_stream used various rather cryptic variable names that, for some coders, made it hard to understand what the code was doing. This patch tries to clarify those variable names to make it easier to understand what's going on, and make it easier later on to add functionality. Overview of the changes: txt --> text sel --> selection da --> default_appearance font_full_rev --> font_glyph_byte_map rct --> rect enc_line --> encoded_line af --> acro_form dr --> document_resources / document_font_resources font_res --> font_resource Furthermore, I undid some abbreviated imports: - AnnotationDictionaryAttributes no longer as AA - FieldDictionaryAttributes no longer as FA
This patch removes the variable name "font_height", because it means the same thing as font size. I think that font_height was introduced previously to distinguish between a font size found in an annotation's default appearance and the size set by a user. To be consistent, also use the variable user_font_name when it pertains to a user choice, and font_name for a font name found in a default appearance.
This patch adds more comments, especially to the from_text_annotation method, in the hope that this will later ease further refactoring.
This patch aims to make a couple of variables and associated imports more readable by writing them out in full instead of having very short abbreviations.
This patch makes the code for producing the appearance stream data into a separate method.
The y_offset calculation occurs very early on in the code, necessitating carrying it across various methods. This patch simplifies that logic.
This moves parsing the multiline field flag to the place where the other field flags are parsed, and moves the consequences for font size elsewhere.
Instead of passing around default appearance, construct it from given font name, size and color. Also, having a default appearance as an argument for a text stream appearance seems less "natural" than just passing font name, size and color. This patch also represents a small number of simplifications that improve test coverage.
Move the font resource parsing code to TextAppearanceStream, in the hope that, later, one might be able to generate a TextAppearanceStream directly. I wonder, though, where the necessary font resource would come from.
This PR is intended to result in a class TextAppearanceStream, based on generate_appearance_stream, with .from_annotation(Annotation) as a class method (based on update_field_annotation). A future method scale_text would also be a method in this class.
This removes some code from _writer and makes text appearances more modular. Also, it paves the way for adding code for improving text wrapping etc. later on.