[BUG]: A single N/A embedding value causes the whole file to be rejected on parquet S3 upload. #1549

@veerkumar

Version

26.1.2

Which installation method(s) does this occur on?

No response

Describe the bug.

Input: A PDF file with more than one page, where some pages contain only images or tables.

When nv-ingest is configured for text extraction only (no OCR, table, chart, or image processing), pages that contain only images yield no text, so their embeddings come out as N/A.

Starting with version 26.1.2, the parquet file is validated before upload and must contain only valid values (no N/A). With pdf_split_page_count: 32, if some pages contain only images, the upload of the entire parquet file is rejected. Previous versions did not reject the file.

As a workaround, we have to set pdf_split_page_count: 1 so that a single page without text does not cause the whole file to be rejected, but this makes ingestion slower.

There should be an option to disable the validation of parquet file values before upload. The issue does not surface with the Ingestor API, because nv-ingest filters out invalid values before returning results. With S3 upload, however, the whole file is rejected.
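To make the request concrete, here is a minimal sketch of the behavior I have in mind. It is not nv-ingest code: the function names, the `skip_invalid` flag, and the record layout (a dict with an `embedding` key) are all assumptions made up for illustration. The idea is that rows with a missing or invalid embedding are dropped before the parquet file is written, and the whole-file rejection only happens when the caller asks for strict checking.

```python
import math
from typing import Any, Iterable


def has_valid_embedding(record: dict[str, Any]) -> bool:
    """Return True if the record has a non-empty embedding of finite numbers.

    The "embedding" key is a hypothetical field name used for this sketch.
    """
    embedding = record.get("embedding")
    if not isinstance(embedding, (list, tuple)) or len(embedding) == 0:
        return False  # covers None, "N/A" strings, and empty vectors
    return all(
        isinstance(x, (int, float)) and not isinstance(x, bool) and math.isfinite(x)
        for x in embedding
    )


def select_rows_for_upload(
    records: Iterable[dict[str, Any]], skip_invalid: bool = True
) -> tuple[list[dict[str, Any]], int]:
    """Split out the rows that are safe to upload.

    skip_invalid is a hypothetical option: when True, invalid rows are
    dropped and counted; when False, any invalid row rejects the batch.
    """
    rows = list(records)
    valid = [r for r in rows if has_valid_embedding(r)]
    dropped = len(rows) - len(valid)
    if dropped and not skip_invalid:
        raise ValueError(f"{dropped} row(s) have an invalid embedding")
    return valid, dropped


if __name__ == "__main__":
    pages = [
        {"page": 1, "embedding": [0.1, 0.2]},
        {"page": 2, "embedding": None},  # image-only page, no text extracted
    ]
    valid, dropped = select_rows_for_upload(pages)
    print(f"uploading {len(valid)} row(s), dropped {dropped}")
```

With something like this, an image-only page inside a 32-page chunk would cost one dropped row instead of the whole upload, which is roughly what the Ingestor API return path already does.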

Minimum reproducible example

Relevant log output

Other/Misc.

No response

Labels

bug