feat/Understand and correctly parse ligatures

**Is your feature request related to a problem? Please describe.**

Yes. When parsing a PDF that uses a font with ligatures (two or more characters rendered as a single glyph), the ligature is not recognised by either the `fast` or the `hi-res` strategy.

In the attached example PDF, the font renders the letters `fi` as a single conjoined character.

When using the `fast` strategy, the `fi` ligature is replaced with `\u0000`. For example, the word `verifies` becomes `veri\u0000es` and the word `specification` becomes `speci\u0000cation`.

When using the `hi-res` strategy the `fi` ligature is omitted entirely. For example: the word `verifies` becomes `veries` and the word `specification` becomes `specication`.
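To help confirm the `fast` symptom in extracted text, here is a minimal sketch that lists the tokens containing the `\u0000` placeholder. It works on a plain string; the helper name `find_nul_words` and the sample sentence are made up for illustration and are not part of any library.

```python
NUL = "\u0000"  # placeholder that appears where the ligature should be


def find_nul_words(text: str) -> list[str]:
    """Return the whitespace-delimited tokens that contain a NUL placeholder."""
    # str.split() does not treat NUL as whitespace, so damaged words stay whole.
    return [token for token in text.split() if NUL in token]


# Hypothetical sample mimicking the `fast` output described above.
sample = "The server veri\u0000es the speci\u0000cation."
damaged = find_nul_words(sample)
print([word.encode("unicode_escape").decode() for word in damaged])
```

This only detects the `fast` case. The `hi-res` case leaves no marker behind, so it cannot be found this way.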

**Describe the solution you'd like**

I understand it is probably not possible to make the `fast` strategy understand the ligature.

It would be nice if the `hi-res` OCR strategy could infer any ligatures. It may not get them right every time, but that is probably better than dropping them entirely.
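To illustrate what "inferring" a ligature could look like, here is a rough sketch of a dictionary-based guess, written as post-processing on extracted words. It is not how the library works. The `VOCABULARY` set, the `LIGATURES` list, and the function names are all assumptions made for the example; a real implementation would need a proper word list and would have to handle case and punctuation.

```python
from typing import Iterator

NUL = "\u0000"
# Common Latin ligatures (an assumed list; fonts may define others).
LIGATURES = ("ffi", "ffl", "fi", "fl", "ff")
# Hypothetical tiny vocabulary; a real one would be a full word list.
VOCABULARY = {"verifies", "specification"}


def candidates(word: str) -> Iterator[str]:
    """Yield possible restorations of a word that lost a ligature."""
    if NUL in word:
        # `fast`-style damage: the placeholder marks where the ligature was.
        # Note: every NUL in the word is replaced with the same ligature.
        for lig in LIGATURES:
            yield word.replace(NUL, lig)
    else:
        # `hi-res`-style damage: the ligature vanished, so try every position.
        for i in range(len(word) + 1):
            for lig in LIGATURES:
                yield word[:i] + lig + word[i:]


def repair(word: str, vocabulary: set[str]) -> str:
    """Return the restored word if exactly one candidate is a known word."""
    if word in vocabulary:
        return word
    matches = {c for c in candidates(word) if c in vocabulary}
    # Leave the word untouched when the guess is missing or ambiguous.
    return matches.pop() if len(matches) == 1 else word


print(repair("veri\u0000es", VOCABULARY))
print(repair("specication", VOCABULARY))
```

The guess is only applied when it is unambiguous, which matches the spirit of "may not get it right every time": words with no match or several matches are returned unchanged.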

**Describe alternatives you've considered**

I tried the `hi-res` strategy thinking OCR might work, but it didn't.

**Additional context**

I have attached the source PDF as an example, along with the two JSON result files (`fast.json` and `hi-res.json`). Every instance of the character combination `fi` is rendered as a single-character ligature, which results in suboptimal output.

[Passwordless-Authentication.pdf](https://github.com/user-attachments/files/16465246/Passwordless-Authentication.pdf)
[fast.json](https://github.com/user-attachments/files/16465243/fast.json)
[hi-res.json](https://github.com/user-attachments/files/16465244/hi-res.json)



feat/Understand and correctly parse ligatures #3471
