feat/Understand and correctly parse ligatures #3471

Open
@jocubeit

Description

Is your feature request related to a problem? Please describe.

Yes. When parsing a PDF that uses a font with ligatures (single glyphs that represent two or more characters), the ligature is not recognised by either the fast or the hi-res strategy.

In the example PDF (attached), the letters fi are represented by a single conjoined character by the font.

When using the fast strategy, the fi ligature is replaced with \u0000. For example, the word verifies becomes veri\u0000es and the word specification becomes speci\u0000cation.

When using the hi-res strategy the fi ligature is omitted entirely. For example: the word verifies becomes veries and the word specification becomes specication.

Describe the solution you'd like

I understand it is probably not possible to make the fast strategy recognise the ligature.

It would be nice if the hi-res OCR strategy could infer ligatures. It may not get them right every time, but that's probably better than not recognising them at all.

Describe alternatives you've considered

I tried the hi-res strategy thinking OCR might work, but it didn't.

Additional context

I have attached the source PDF as an example, along with the two JSON result files (fast.json and hi-res.json). In the PDF, every instance of the character combination fi is rendered as a single-character ligature, which results in suboptimal output.

Passwordless-Authentication.pdf
fast.json
hi-res.json
