CJK text handling

Some PDFs from West Lindsey fail to import because they contain Chinese text that smalot/pdfparser does not decode correctly.

Example: [Risk Strategy 2025.pdf](https://github.com/user-attachments/files/23257259/Risk.Strategy.2025.pdf)

The text is on the last page: " 欲了解更多信息，请致电". Decoding fails after the first two chars, leaving this: "欲了9�+�"p2��^�T�G�", which fails to insert into the database, with the error: "Incorrect string value: '\x83+\xA9".

This needs fixing in pdfparser itself, but we could also handle the failure more gracefully, for example by removing the broken text so the rest of the document still saves, and logging that we did so.
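A minimal sketch of that fallback, written in Python purely for illustration (the importer itself uses a PHP library, so the real change would be the equivalent in that codebase). It assumes the extracted text is available as raw bytes and that the database rejects the row because of invalid UTF-8 sequences; the function name `scrub_invalid_utf8` and the logger name are hypothetical. Note that it only drops the undecodable bytes, so stray ASCII left inside a mis-decoded run (such as the `+` in the example above) would survive.

```python
import logging

# Hypothetical logger name, used for illustration only.
logger = logging.getLogger("pdf_import")


def scrub_invalid_utf8(raw: bytes, source: str = "") -> str:
    """Decode raw as UTF-8, dropping any bytes that are not valid UTF-8.

    Valid input is returned unchanged. If anything had to be dropped,
    a warning is logged so the affected document can be reviewed later.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # errors="ignore" skips undecodable bytes instead of raising.
        cleaned = raw.decode("utf-8", errors="ignore")
        dropped = len(raw) - len(cleaned.encode("utf-8"))
        logger.warning(
            "Dropped %d invalid byte(s) from %s",
            dropped,
            source or "<unknown document>",
        )
        return cleaned
```

Calling this on the extracted text just before the insert would let the rest of the document save while leaving a log entry that points at the file to re-check once the decoding problem is fixed upstream.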

CJK text handling #47