CJK text handling #47

@rupertj

Description

Some PDFs from West Lindsey fail to import, as they contain Chinese text, which smalot/pdfparser fails to decode properly.

Example: Risk Strategy 2025.pdf

The text is on the last page: "欲了解更多信息,请致电" ("For more information, please call"). Decoding fails after the first two characters, leaving "欲了9�+�"p2���^�T�G�". That string then fails to insert into the database, with the error: "Incorrect string value: '\x83+\xA9".

This needs fixing in the pdfparser, but we could also handle the failure better - maybe by removing the broken text so the rest of the document saves (and logging that we did so).
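
As a rough illustration of the "remove the broken text and log it" fallback, here is a minimal, language-neutral sketch in Python. It is not the project's code and not part of smalot/pdfparser: the function name `scrub_invalid_utf8`, the logger name `pdf_import`, and the assumption that the extracted text is available as raw bytes are all hypothetical.

```python
import logging

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("pdf_import")


def scrub_invalid_utf8(raw: bytes, source: str = "") -> str:
    """Decode raw as UTF-8, dropping any bytes that are not valid UTF-8.

    If anything had to be dropped, a warning is logged so the affected
    document can be found later.
    """
    try:
        # Fast path: the text is already valid UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fallback: silently skip undecodable bytes, then report how many.
        cleaned = raw.decode("utf-8", errors="ignore")
        dropped = len(raw) - len(cleaned.encode("utf-8"))
        logger.warning(
            "Dropped %d undecodable byte(s) from %s",
            dropped,
            source or "<unknown>",
        )
        return cleaned


# Two valid characters followed by the bytes quoted in the database error.
sample = "欲了".encode("utf-8") + b"\x83+\xa9" + b" rest"
print(scrub_invalid_utf8(sample, "Risk Strategy 2025.pdf"))  # prints "欲了+ rest"
```

This only strips bytes that are not valid UTF-8, which should be enough to let the insert succeed. It does not detect mis-decoded characters that happen to be valid (such as the "9" and "p2" in the example above), so some garbage may remain until the parser itself is fixed.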
