[SPARKNLP-1161] Adding features to PDF Reader #14596
Merged: 280e98f into release/604-release-candidate (4 of 6 checks passed)
DevinTDHa reviewed on Jun 6, 2025
DevinTDHa pushed a commit that referenced this pull request on Jun 30, 2025:
* [SPARKNLP-1161] Adding extractCoordinates and normalizeLigatures to PDF reader
* [SPARKNLP-1161] Updating PDF reader Demo notebook [skip test]
* [SPARKNLP-1161] Fix typos in PDF reader Demo notebook [skip test]
* [SPARKNLP-1162] Adding exceptions log column
* [SPARKNLP-1161] Updating demo notebook
Description
This PR introduces two new configurable parameters to the PdfToText transformer and PDF Reader to enrich PDF parsing:
* extractCoordinates: When enabled, outputs spatial metadata (text position and dimensions) per character in the PDF. Outputs are stored in a new column as a positions array containing structured page coordinate mappings.
* normalizeLigatures: When extractCoordinates is enabled, this option ensures ligature characters (e.g., ﬁ, ﬂ, œ) are normalized to their decomposed forms (fi, fl, oe). This prevents these typographic ligatures from being interpreted as distinct characters in downstream text analysis.
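To make the effect of ligature normalization concrete, here is a minimal standalone sketch in plain Python. It is not the code added by this PR: the function name `normalize_ligatures` and the `EXTRA_LIGATURES` table are hypothetical, and the set of characters handled is an assumption based only on the examples given above.

```python
import unicodedata

# Hypothetical table: characters that Unicode normalization leaves untouched
# but that the description above lists as ligatures (œ -> oe).
EXTRA_LIGATURES = {"\u0153": "oe", "\u0152": "OE"}


def normalize_ligatures(text: str) -> str:
    """Replace typographic ligatures with their plain-letter sequences."""
    out = []
    for ch in text:
        if ch in EXTRA_LIGATURES:
            out.append(EXTRA_LIGATURES[ch])
        elif "\ufb00" <= ch <= "\ufb06":
            # Latin presentation-form ligatures (ff, fi, fl, ffi, ffl, ...)
            # have compatibility decompositions, so NFKC expands them.
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)


sample = "\ufb01nd the \ufb02ow of \u0153uvre"
print(normalize_ligatures(sample))
```

NFKC is applied only to the ligature code points here, so other characters in the text are passed through unchanged.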
exception: New Output Column for Fault Tolerance
A new exception column has been introduced to capture and log any processing errors encountered when handling individual PDF documents.
This enhancement ensures:
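The general pattern behind a per-document exception column can be sketched in plain Python. This is an illustration under stated assumptions, not the PR's implementation: `ParsedDoc`, `parse_all`, and `toy_parse` are hypothetical names, and the real reader operates on Spark rows rather than Python lists.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class ParsedDoc:
    path: str
    text: Optional[str]       # extracted text, or None on failure
    exception: Optional[str]  # error description, or None on success


def parse_all(
    docs: Iterable[Tuple[str, bytes]],
    parse: Callable[[bytes], str],
) -> List[ParsedDoc]:
    """Parse every document, recording failures instead of aborting the batch."""
    rows = []
    for path, content in docs:
        try:
            rows.append(ParsedDoc(path, parse(content), None))
        except Exception as err:
            rows.append(ParsedDoc(path, None, f"{type(err).__name__}: {err}"))
    return rows


def toy_parse(content: bytes) -> str:
    # Stand-in for a real PDF parser (hypothetical): reject non-PDF input.
    if not content.startswith(b"%PDF"):
        raise ValueError("not a PDF")
    return content.decode("latin-1")


rows = parse_all([("a.pdf", b"%PDF-1.7 hello"), ("b.pdf", b"garbage")], toy_parse)
for row in rows:
    print(row.path, row.exception)
```

In this sketch a malformed file yields a row with an empty text field and a populated exception field, so the remaining documents are still processed.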
Motivation and Context
Many downstream NLP tasks, such as entity recognition, layout analysis, and table extraction, require precise positional context of text elements in PDFs. Previously, these components provided only linear text extraction, losing valuable spatial metadata.
Additionally, typographic ligatures (like ﬁ, ﬂ, or œ) can lead to inconsistent tokenization and entity boundary errors when not normalized. These characters often distort string matching and model predictions in document processing pipelines.
How Has This Been Tested?
Screenshots (if appropriate):
Types of changes
Checklist: