Description
I want to merge two PDFs using PdfWriter: one generated using wkhtmltopdf and one uploaded by one of my users (it seems to come from an electronic signature tool, but I can't be sure). Because of the document's content I can't share it publicly, and I have not been able to create another document that reproduces the problem.
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Linux-6.8.0-55-generic-x86_64-with-glibc2.39
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.4.0, crypt_provider=('cryptography', '42.0.2'), PIL=10.0.1
Code + PDF
This is a minimal, complete example that shows the issue:
from io import BytesIO

from pypdf import PdfWriter

def merge_pdf_files(pdf_files):
    # Append every input PDF to one writer and return the merged bytes
    output = BytesIO()
    merger = PdfWriter()
    for pdf_file in pdf_files:
        merger.append(pdf_file)

    merger.write(output)
    merger.close()
    return output.getvalue()

merge_pdf_files(['myfirstfile.pdf'])  # outputs a valid PDF file without errors
# Both of the next lines raise the same exception shown below
merge_pdf_files(['myfirstfile.pdf', 'user-uploaded-file.pdf'])
merge_pdf_files(['user-uploaded-file.pdf'])
Adding the excluded_fields=('/Annots',) argument to the append call silences the exception and returns a valid PDF, but that's not a durable solution.
Traceback
This is the complete traceback I see:
File ~/mycode/myproject/some/utils/pdf.py:150, in merge_pdf_files(pdf_files)
148 merger = PdfWriter()
149 for pdf_file in pdf_files:
--> 150 merger.append(pdf_file)
151
152 merger.write(output)
File ~/.virtualenvs/.../lib/python3.12/site-packages/pypdf/_writer.py:2621, in PdfWriter.append(self, fileobj, outline_item, pages, import_outline, excluded_fields)
2612 self.merge(
2613 None,
2614 fileobj,
(...)
2618 excluded_fields,
2619 )
2620 else: # if isinstance(outline_item,str):
-> 2621 self.merge(
2622 None,
2623 fileobj,
2624 outline_item,
2625 pages,
2626 import_outline,
2627 excluded_fields,
2628 )
File ~/.virtualenvs/.../lib/python3.12/site-packages/pypdf/_writer.py:2779, in PdfWriter.merge(self, position, fileobj, outline_item, pages, import_outline, excluded_fields)
2777 if "/Annots" not in excluded_fields:
2778 for pag in srcpages.values():
-> 2779 lst = self._insert_filtered_annotations(
2780 pag.original_page.get("/Annots", ()), pag, srcpages, reader
2781 )
2782 if len(lst) > 0:
2783 pag[NameObject("/Annots")] = lst
File ~/.virtualenvs/.../lib/python3.12/site-packages/pypdf/_writer.py:2972, in PdfWriter._insert_filtered_annotations(self, annots, page, pages, reader)
2970 else:
2971 print(annots, ano, cast("DictionaryObject", ano["/A"]))
-> 2972 d = cast("DictionaryObject", ano["/A"])["/D"]
2973 if isinstance(d, NullObject):
2974 continue
File ~/.virtualenvs/.../lib/python3.12/site-packages/pypdf/generic/_data_structures.py:478, in DictionaryObject.__getitem__(self, key)
477 def __getitem__(self, key: Any) -> PdfObject:
--> 478 return dict.__getitem__(self, key).get_object()
KeyError: '/D'
The problem exists in both pypdf==4.0.1 and pypdf==5.4.0 (the first is the version we currently use in production; I tried the latest one locally before submitting this issue). Even though I can't share the buggy PDF, here are some details:
# In pypdf/_writer.py, at line 2971, just before the exception seen above
print(annots)
[
IndirectObject(82, 0, 139166691950560),
IndirectObject(84, 0, 139166691950560),
IndirectObject(86, 0, 139166691950560),
IndirectObject(88, 0, 139166691950560),
IndirectObject(90, 0, 139166691950560),
IndirectObject(92, 0, 139166691950560),
IndirectObject(94, 0, 139166691950560),
IndirectObject(96, 0, 139166691950560),
IndirectObject(98, 0, 139166691950560)
]
print(ano)
{
'/A': IndirectObject(85, 0, 139166691950560),
'/BS': {'/S': '/S', '/Type': '/Border', '/W': 0},
'/Border': [0, 0, 0],
'/H': '/I',
'/Rect': [68.6001, 653.405, 526.2, 671.054],
'/StructParent': 9,
'/Subtype': '/Link',
'/Type': '/Annot'
}
print(ano['/A'])
{'/S': '/GoTo'}
I don't know how the PDF format is structured, but /D is missing from the action dictionary, so I would replace the check with the following:
d = cast("DictionaryObject", ano["/A"]).get("/D")
if not d or isinstance(d, NullObject):
    continue
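To illustrate the behaviour I expect from that guard, here is a rough sketch that uses plain Python dicts in place of pypdf objects. That substitution is an assumption made only so the snippet runs without the library. The `annotations` list, the `destinations` helper, and the `"page3"` value are hypothetical, and `None` stands in for `NullObject`.

```python
# Plain-dict stand-ins for the annotation objects printed above.
# Assumption: the real pypdf dictionaries support .get() like a dict does.
annotations = [
    {"/Subtype": "/Link", "/A": {"/S": "/GoTo"}},                 # no /D, as in the faulty PDF
    {"/Subtype": "/Link", "/A": {"/S": "/GoTo", "/D": "page3"}},  # hypothetical valid destination
    {"/Subtype": "/Link", "/A": {"/S": "/GoTo", "/D": None}},     # None stands in for NullObject
]


def destinations(annots):
    """Return the /D value of every action that actually carries one."""
    found = []
    for ano in annots:
        d = ano["/A"].get("/D")  # .get() returns None instead of raising KeyError
        if not d:
            continue  # missing or null destination: skip this annotation
        found.append(d)
    return found


print(destinations(annotations))
```

With direct indexing (`ano["/A"]["/D"]`) the first entry raises `KeyError: '/D'`, which matches the traceback above; with `.get()` it is skipped like a null destination.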
I remain available if you need more data about the faulty PDF.