Unable to merge a PDF with a buggy annotation #3211

Open
@mlorant

Description

I want to merge two PDFs using PdfWriter: one generated with wkhtmltopdf and one uploaded by one of my users (it seems to come from an electronic signature tool, but I can't be sure). Because of the document's content I can't share it publicly, and I have not been able to create another document that reproduces the problem.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-6.8.0-55-generic-x86_64-with-glibc2.39

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.4.0, crypt_provider=('cryptography', '42.0.2'), PIL=10.0.1

Code + PDF

This is a minimal, complete example that shows the issue:

from io import BytesIO

from pypdf import PdfWriter


def merge_pdf_files(pdf_files):
    """Append each input PDF to one writer and return the merged bytes."""
    output = BytesIO()
    merger = PdfWriter()
    for pdf_file in pdf_files:
        merger.append(pdf_file)
    merger.write(output)
    merger.close()
    return output.getvalue()

merge_pdf_files(['myfirstfile.pdf'])  # outputs a valid PDF file without errors

# Both of the next lines raise the same exception shown below
merge_pdf_files(['myfirstfile.pdf', 'user-uploaded-file.pdf'])
merge_pdf_files(['user-uploaded-file.pdf'])

Passing excluded_fields=('/Annots',) to the append function silences the exception and returns a valid PDF, but that's not a durable solution.

Traceback

This is the complete traceback I see:

File ~/mycode/myproject/some/utils/pdf.py:150, in merge_pdf_files(pdf_files)
    148 merger = PdfWriter()
    149 for pdf_file in pdf_files:
--> 150     merger.append(pdf_file)
    151 
    152 merger.write(output)

File ~/.virtualenvs/.../lib/python3.12/site-packages/pypdf/_writer.py:2621, in PdfWriter.append(self, fileobj, outline_item, pages, import_outline, excluded_fields)
   2612     self.merge(
   2613         None,
   2614         fileobj,
   (...)
   2618         excluded_fields,
   2619     )
   2620 else:  # if isinstance(outline_item,str):
-> 2621     self.merge(
   2622         None,
   2623         fileobj,
   2624         outline_item,
   2625         pages,
   2626         import_outline,
   2627         excluded_fields,
   2628     )

File ~/.virtualenvs/.../lib/python3.12/site-packages/pypdf/_writer.py:2779, in PdfWriter.merge(self, position, fileobj, outline_item, pages, import_outline, excluded_fields)
   2777 if "/Annots" not in excluded_fields:
   2778     for pag in srcpages.values():
-> 2779         lst = self._insert_filtered_annotations(
   2780             pag.original_page.get("/Annots", ()), pag, srcpages, reader
   2781         )
   2782         if len(lst) > 0:
   2783             pag[NameObject("/Annots")] = lst

File ~/.virtualenvs/.../lib/python3.12/site-packages/pypdf/_writer.py:2972, in PdfWriter._insert_filtered_annotations(self, annots, page, pages, reader)
   2970 else:
   2971     print(annots, ano, cast("DictionaryObject", ano["/A"]))
-> 2972     d = cast("DictionaryObject", ano["/A"])["/D"]
   2973     if isinstance(d, NullObject):
   2974         continue

File ~/.virtualenvs/.../lib/python3.12/site-packages/pypdf/generic/_data_structures.py:478, in DictionaryObject.__getitem__(self, key)
    477 def __getitem__(self, key: Any) -> PdfObject:
--> 478     return dict.__getitem__(self, key).get_object()

KeyError: '/D'

The problem exists in both pypdf==4.0.1 and pypdf==5.4.0 (the first is the version we currently run in production; I tried the latest one locally before submitting this issue). Even though I can't share the buggy PDF, here are some insights:

# In pypdf/_writer.py, at line 2971, just before the exception seen above
print(annots)
[
   IndirectObject(82, 0, 139166691950560), 
   IndirectObject(84, 0, 139166691950560), 
   IndirectObject(86, 0, 139166691950560), 
   IndirectObject(88, 0, 139166691950560), 
   IndirectObject(90, 0, 139166691950560), 
   IndirectObject(92, 0, 139166691950560), 
   IndirectObject(94, 0, 139166691950560), 
   IndirectObject(96, 0, 139166691950560), 
   IndirectObject(98, 0, 139166691950560)
] 

print(ano) 
{
   '/A': IndirectObject(85, 0, 139166691950560), 
   '/BS': {'/S': '/S', '/Type': '/Border', '/W': 0}, 
   '/Border': [0, 0, 0], 
   '/H': '/I', 
   '/Rect': [68.6001, 653.405, 526.2, 671.054], 
   '/StructParent': 9, 
   '/Subtype': '/Link', 
   '/Type': '/Annot'
} 
print(ano['/A'])
{'/S': '/GoTo'}

I don't know how the PDF format is constructed, but /D is missing from the action dictionary, so I would replace the current check with the following:

d = cast("DictionaryObject", ano["/A"]).get("/D")
if not d or isinstance(d, NullObject):
    continue
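To illustrate the intent of that guard outside of pypdf, here is a minimal, self-contained sketch. It is not pypdf's code: plain dicts stand in for DictionaryObject, the NullObject class is a local stand-in, and goto_destinations is a hypothetical helper name used only for this example.

```python
# Stand-in for pypdf's NullObject (assumption: only used for isinstance checks here).
class NullObject:
    pass


def goto_destinations(annots):
    """Collect the /D destination of each annotation, skipping incomplete ones."""
    kept = []
    for ano in annots:
        action = ano.get("/A")
        if action is None:
            continue
        # .get() returns None instead of raising KeyError when /D is absent.
        d = action.get("/D")
        if d is None or isinstance(d, NullObject):
            continue
        kept.append(d)
    return kept


annots = [
    {"/Subtype": "/Link", "/A": {"/S": "/GoTo"}},  # the faulty shape from this issue
    {"/Subtype": "/Link", "/A": {"/S": "/GoTo", "/D": NullObject()}},
    {"/Subtype": "/Link", "/A": {"/S": "/GoTo", "/D": "chapter1"}},
]
print(goto_destinations(annots))  # ['chapter1']
```

The sketch tests `d is None` rather than `not d`, so that only a missing or null /D is skipped; `not d` would also skip any other falsy value, which may or may not be what the maintainers want.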

I remain available if you need more data about the faulty PDF.

Labels: PdfWriter (the PdfWriter component is affected), is-robustness-issue (from a user's perspective, this is about robustness), workflow-merge (from a user's perspective, merging is the affected feature/workflow)
