Open
Description
Some PDFs from West Lindsey fail to import, as they contain Chinese text, which smalot/pdfparser fails to decode properly.
Example: Risk Strategy 2025.pdf
The text is on the last page: "欲了解更多信息,请致电" ("For more information, please call"). Decoding fails after the first two characters, leaving "欲了9�+�"p2���^�T�G�". That string then fails to insert into the database, with the error: "Incorrect string value: '\x83+\xA9".
This needs fixing in smalot/pdfparser itself, but we could also handle the failure better, for example by removing the broken text so the rest of the document saves, and logging that we did so.
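A minimal sketch of the "remove the broken text and log it" idea, written in Python purely for illustration (the importer itself is not shown in this issue, and a real fix would live in the project's own language). It assumes the extracted text is available as raw bytes and that the database rejects byte sequences that are not valid UTF-8. The function name `strip_undecodable` and the logger name `pdf_import` are hypothetical.

```python
import logging

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("pdf_import")


def strip_undecodable(raw: bytes, source: str = "document") -> str:
    """Decode raw as UTF-8, dropping any bytes that are not valid UTF-8.

    Logs a warning when bytes were dropped, so the data loss is traceable.
    """
    # errors="ignore" silently skips invalid byte sequences.
    cleaned = raw.decode("utf-8", errors="ignore")

    # Compare byte lengths to work out how much was discarded.
    dropped = len(raw) - len(cleaned.encode("utf-8"))
    if dropped:
        logger.warning("Dropped %d undecodable byte(s) from %s", dropped, source)

    return cleaned
```

This only removes bytes that are invalid UTF-8. Characters that were mis-decoded into valid ASCII (such as the "9" and "p2" in the example above) would remain, so the saved text may still contain some noise until the parser itself is fixed.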