Description
Version
26.1.2
Which installation method(s) does this occur on?
No response
Describe the bug.
Input: A PDF file with more than one page, where some pages contain only images or tables.
If nv-ingest is configured to perform only text extraction (no OCR, table, chart, or image processing), pages that contain only images yield no text, so their embeddings come out as N/A.
Starting with version 26.1.2, a parquet file is validated to contain only valid values (no N/A). If pdf_split_page_count: 32 is used and some pages contain only images, the entire parquet file upload is rejected. Previous versions did not behave this way.
As a result, ingestion is slower, because we must set pdf_split_page_count: 1 to avoid having the whole file rejected over a single page without text.
We should have an option to disable validation of parquet file values before upload. This issue does not surface with the ingestor API, because nv-ingest filters out invalid values before returning results. With S3 upload, however, the whole file is rejected.
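To make the request concrete, here is a minimal sketch of the two behaviors in plain Python. It is not nv-ingest code: `prepare_for_upload`, the `strict` flag, and the `embedding` key are hypothetical names chosen for illustration, and the non-strict path simply assumes the same "drop invalid rows" behavior described for the ingestor API.

```python
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def is_valid_embedding(value: Any) -> bool:
    """Return True only for a non-empty sequence of real numbers (no None, no NaN)."""
    if not isinstance(value, (list, tuple)) or len(value) == 0:
        return False
    return all(
        isinstance(x, (int, float)) and not isinstance(x, bool) and x == x  # x == x is False for NaN
        for x in value
    )


def prepare_for_upload(
    records: List[Record],
    strict: bool = True,
    embedding_key: str = "embedding",  # assumed column name, not confirmed by nv-ingest
) -> Tuple[List[Record], int]:
    """Return (rows to upload, number of rows dropped).

    strict=True mimics the behavior reported in this issue: one invalid row
    rejects the whole batch. strict=False is the requested opt-out: invalid
    rows are dropped and the rest is kept.
    """
    valid = [r for r in records if is_valid_embedding(r.get(embedding_key))]
    dropped = len(records) - len(valid)
    if strict and dropped:
        raise ValueError(f"{dropped} row(s) have no valid embedding; upload rejected")
    return valid, dropped


if __name__ == "__main__":
    # Page 2 stands in for an image-only page with no extracted text.
    pages = [
        {"page": 1, "embedding": [0.1, 0.2]},
        {"page": 2, "embedding": None},
        {"page": 3, "embedding": [0.3, 0.4]},
    ]
    kept, dropped = prepare_for_upload(pages, strict=False)
    print(f"kept {len(kept)} row(s), dropped {dropped}")
```

With an opt-out like this, a 32-page split containing one image-only page would lose only that page's row instead of the whole file, so pdf_split_page_count would not need to be lowered to 1.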
Minimum reproducible example
Relevant log output
Other/Misc.
No response