[SPARKNLP-1161] Adding features to PDF Reader by danilojsl · Pull Request #14596 · JohnSnowLabs/spark-nlp

danilojsl · 2025-06-04T21:47:45Z

Description

This PR introduces two new configurable parameters to the PdfToText transformer and PDF Reader to enrich PDF parsing, along with a new output column for fault tolerance:

extractCoordinates: When enabled, outputs spatial metadata (position and dimensions) for each character in the PDF. The results are stored in a new column as a positions array of structured page coordinate mappings.
normalizeLigatures: When extractCoordinates is enabled, normalizes ligature characters (e.g., ﬁ, ﬂ, œ) to their decomposed forms (fi, fl, oe), so that these typographic ligatures are not treated as distinct characters in downstream text analysis.
exception: A new output column for fault tolerance that captures and logs any processing error encountered while handling an individual PDF document.
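The ligature normalization described above can be illustrated outside of Spark NLP. The following is a minimal sketch, not the PR's implementation: the `LIGATURES` table and the `normalize_ligatures` helper are hypothetical names, and the table covers only the three ligatures named in the description. It uses an explicit mapping because Unicode NFKC normalization decomposes ﬁ and ﬂ but leaves œ unchanged.

```python
import unicodedata

# Hypothetical mapping for illustration only; the PR does not list
# which ligatures the normalizeLigatures option actually covers.
LIGATURES = {
    "\ufb01": "fi",  # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",  # LATIN SMALL LIGATURE FL
    "\u0153": "oe",  # LATIN SMALL LIGATURE OE
}

# str.translate accepts a table that maps one character to a longer string.
_TABLE = str.maketrans(LIGATURES)


def normalize_ligatures(text: str) -> str:
    """Replace each known ligature with its decomposed letter sequence."""
    return text.translate(_TABLE)


raw = "\ufb01nancial work\ufb02ow, \u0153uvre"
clean = normalize_ligatures(raw)

# A plain substring search fails on the raw text and succeeds after normalization.
print("workflow" in raw, "workflow" in clean)

# NFKC alone handles the fi ligature but not the oe ligature.
print(unicodedata.normalize("NFKC", "\ufb01") == "fi")
print(unicodedata.normalize("NFKC", "\u0153") == "\u0153")
```

An explicit table keeps the behaviour predictable: only the listed characters change, whereas NFKC would also rewrite unrelated compatibility characters such as superscripts and full-width forms.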

Together, these changes provide the following benefits:

Fine-grained coordinate mapping for each character enables spatial reasoning and layout-aware models.
Ligature normalization improves text consistency and downstream linguistic accuracy, aligning extracted data with model expectations and training datasets.
Batch jobs are not interrupted by a single corrupt or malformed PDF.
Detailed error messages are recorded per document, supporting granular debugging and post-analysis.
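The fault-tolerance behaviour in the last two points can be sketched in plain Python. This is an illustration under stated assumptions, not the code added by the PR: `extract_all`, `fake_extract`, and the row layout (`path`, `text`, `exception`) are hypothetical, and `fake_extract` merely stands in for a real PDF parser.

```python
from typing import Callable, Dict, List, Optional


def extract_all(
    docs: Dict[str, bytes], extract: Callable[[bytes], str]
) -> List[Dict[str, Optional[str]]]:
    """Run `extract` on every document and record failures per row."""
    rows: List[Dict[str, Optional[str]]] = []
    for path, content in docs.items():
        try:
            rows.append({"path": path, "text": extract(content), "exception": None})
        except Exception as err:
            # One bad document produces an error row instead of aborting the batch.
            message = f"{type(err).__name__}: {err}"
            rows.append({"path": path, "text": None, "exception": message})
    return rows


def fake_extract(content: bytes) -> str:
    """Hypothetical stand-in for a PDF parser; it only checks the file header."""
    if not content.startswith(b"%PDF-"):
        raise ValueError("missing %PDF- header")
    return content.decode("latin-1")


rows = extract_all(
    {"ok.pdf": b"%PDF-1.7 hello", "bad.pdf": b"not a pdf"},
    fake_extract,
)
for row in rows:
    print(row["path"], row["exception"])
```

Keeping the error next to the document that caused it lets a later step filter on the exception field to find the failed inputs without rerunning the whole batch.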

Motivation and Context

Many downstream NLP tasks, such as entity recognition, layout analysis, and table extraction, require the precise position of text elements in PDFs. Previously, the PdfToText transformer and PDF Reader provided only linear text extraction, which discarded this spatial metadata.
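As a rough illustration of why per-character positions matter for layout-aware tasks, the sketch below rebuilds text lines from character boxes. Everything here is an assumption for the sake of the example: the `CharBox` fields, the `group_into_lines` helper, and the convention that `y` grows down the page are hypothetical and do not reflect the schema of the positions array produced by this PR.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CharBox:
    """Hypothetical per-character record: glyph plus position and size."""
    char: str
    x: float
    y: float
    width: float
    height: float


def group_into_lines(boxes: List[CharBox], y_tolerance: float = 2.0) -> List[str]:
    """Group characters whose y values are close, then order each line by x."""
    lines: List[List[CharBox]] = []
    for box in sorted(boxes, key=lambda b: (b.y, b.x)):
        # Compare against the first box of the current line (assumed baseline).
        if lines and abs(lines[-1][0].y - box.y) <= y_tolerance:
            lines[-1].append(box)
        else:
            lines.append([box])
    return ["".join(b.char for b in sorted(line, key=lambda b: b.x)) for line in lines]


# Characters arrive in arbitrary order; coordinates restore the reading order.
boxes = [
    CharBox("y", 16.0, 120.0, 6.0, 10.0),
    CharBox("H", 10.0, 100.0, 6.0, 10.0),
    CharBox("B", 10.0, 120.0, 6.0, 10.0),
    CharBox("i", 16.0, 100.5, 3.0, 10.0),
]
print(group_into_lines(boxes))
```

The same idea extends to column and table detection, where gaps along `x` rather than `y` separate the groups.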

Additionally, typographic ligatures (like ﬁ, ﬂ, or œ) can lead to inconsistent tokenization and entity boundary errors when not normalized. These characters often distort string matching and model predictions in document processing pipelines.

How Has This Been Tested?

Screenshots (if appropriate):

Types of changes

Bug fix (non-breaking change which fixes an issue)
Code improvements with no or little impact
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have read the CONTRIBUTING page.
I have added tests to cover my changes.
All new and existing tests passed.

* [SPARKNLP-1161] Adding extractCoordinates and normalizeLigatures to PDF reader
* [SPARKNLP-1161] Updating PDF reader Demo notebook [skip test]
* [SPARKNLP-1161] Fix typos in PDF reader Demo notebook [skip test]
* [SPARKNLP-1162] Adding exceptions log column
* [SPARKNLP-1161] Updating demo notebook

[SPARKNLP-1161] Adding extractCoordinates and normalizeLigatures to P…

a5f8a10

…DF reader

danilojsl self-assigned this Jun 4, 2025

danilojsl requested review from DevinTDHa and maziyarpanahi June 4, 2025 21:48

[SPARKNLP-1161] Updating PDF reader Demo notebook [skip test]

064e174

DevinTDHa reviewed Jun 6, 2025

Comment thread examples/python/reader/SparkNLP_PDF_Reader_Demo.ipynb

Comment thread src/main/scala/com/johnsnowlabs/reader/util/pdf/CustomStripper.java

DevinTDHa changed the base branch from master to release/603-release-candidate June 6, 2025 15:21

[SPARKNLP-1161] Fix typos in PDF reader Demo notebook [skip test]

940268b

DevinTDHa marked this pull request as draft June 10, 2025 10:56

[SPARKNLP-1162] Adding exceptions log column

460e33a

danilojsl changed the base branch from release/603-release-candidate to master June 11, 2025 22:56

danilojsl added the enhancement label Jun 11, 2025

[SPARKNLP-1161] Updating demo notebook

5359d08

DevinTDHa changed the base branch from master to release/604-release-candidate June 23, 2025 10:01

DevinTDHa marked this pull request as ready for review June 23, 2025 10:02

DevinTDHa merged commit 280e98f into release/604-release-candidate Jun 23, 2025
4 of 6 checks passed

DevinTDHa mentioned this pull request Jun 24, 2025

Spark NLP 6.0.4 Release #14611

Merged

10 tasks

DevinTDHa merged 5 commits into release/604-release-candidate from feature/SPARKNLP-1161-Adding-extractCoordinates-and-normalizeLigatures-parameters-to-PDF-Reader

2 participants
