feat/Guard against excessive memory usage when partitioning PDFs #2129


Description

@flash1293

Is your feature request related to a problem? Please describe.
When partitioning PDFs with the OCR strategy, some files cause the process to allocate a large amount of memory that isn't available in all environments (e.g. when running on Google Cloud Run with limited resources).

For example, partitioning the following 23 MB PDF causes memory usage of more than 10 GB: https://drive.google.com/file/d/1lr-Pwh3QTVfdY4F6R-fk4tVU9FNSK27p/view?usp=sharing

Describe the solution you'd like

Unstructured should employ sensible defaults to avoid this kind of situation (e.g. a maximum size for a page when rendered in memory). This limit could also be configurable as an optional argument on the partitioning method.

In cases where this isn't feasible, the partitioning method should raise a descriptive exception so the caller can handle the situation gracefully instead of crashing the process.

The most important aspect is providing a way to limit the amount of memory unstructured will use during partitioning.
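A guard like the one described above could be sketched as follows. This is illustrative only, not unstructured's API: the function names, the 300 DPI default, and the `MAX_RENDER_BYTES` cap are all hypothetical. The idea is to estimate each page's rasterized size from its dimensions (which can be read cheaply, e.g. from `page.mediabox` with pypdf) and raise a descriptive exception before any rendering happens.

```python
MAX_RENDER_BYTES = 500 * 1024**2  # hypothetical default cap: ~500 MiB per page


class PageTooLargeError(ValueError):
    """Raised when a page would need too much memory to rasterize."""


def estimated_render_bytes(width_pt, height_pt, dpi=300, channels=3):
    """Bytes needed to hold one page as an uncompressed bitmap.

    PDF dimensions are in points (72 per inch); rendering at `dpi`
    yields a (width_pt/72*dpi) x (height_pt/72*dpi) pixel image.
    """
    width_px = width_pt / 72 * dpi
    height_px = height_pt / 72 * dpi
    return int(width_px * height_px * channels)


def check_page_sizes(pages_pt, dpi=300, max_bytes=MAX_RENDER_BYTES):
    """Fail fast with a descriptive error before any page is rasterized.

    `pages_pt` is a list of (width_pt, height_pt) tuples, e.g. read from
    the PDF's page media boxes.
    """
    for i, (w, h) in enumerate(pages_pt):
        est = estimated_render_bytes(w, h, dpi)
        if est > max_bytes:
            raise PageTooLargeError(
                f"page {i} would need ~{est // 1024**2} MiB at {dpi} DPI "
                f"(limit {max_bytes // 1024**2} MiB)"
            )
```

A normal US Letter page (612 x 792 pt) at 300 DPI needs roughly 24 MiB as an RGB bitmap, so such a guard would only fire on pathologically large pages.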

Describe alternatives you've considered

Alternatively, partitioning can be run in a separate, memory-limited process controlled by an orchestrating process. If the partitioning process runs out of memory, the orchestrating process can handle the failure gracefully.
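A minimal sketch of that orchestration, assuming a POSIX system (`resource.RLIMIT_AS` and the `fork` start method are not available on Windows). `run_with_memory_limit` and the default cap are hypothetical; the callable passed in would wrap unstructured's partitioning call.

```python
import multiprocessing
import queue as queue_lib
import resource


def _limited_worker(fn, args, max_bytes, result_q):
    """Child process: cap the address space, then run `fn(*args)`."""
    # Allocations beyond max_bytes now raise MemoryError instead of
    # growing until the OS OOM killer steps in.
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))
    try:
        result_q.put(("ok", fn(*args)))
    except MemoryError:
        result_q.put(("oom", None))


def run_with_memory_limit(fn, args=(), max_bytes=2 * 1024**3, timeout=600):
    """Run `fn` in a child process capped at `max_bytes` of virtual memory.

    Raises MemoryError on breach so the caller can degrade gracefully
    instead of having its own process killed.
    """
    ctx = multiprocessing.get_context("fork")  # fork: `fn` needn't be picklable
    result_q = ctx.Queue()
    proc = ctx.Process(target=_limited_worker, args=(fn, args, max_bytes, result_q))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()
        proc.join()
        raise TimeoutError("partitioning timed out")
    try:
        status, result = result_q.get(timeout=1)
    except queue_lib.Empty:
        # Child died without reporting back (e.g. killed outright).
        raise MemoryError(f"child exceeded {max_bytes} bytes") from None
    if status == "oom":
        raise MemoryError(f"child exceeded {max_bytes} bytes")
    return result
```

The caller would then wrap the partitioning call in a zero-argument function, catch `MemoryError`, and e.g. retry with a cheaper strategy or skip the document.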
