Commit 961c8d5

feat: use block matrix to reduce peak memory usage for matmul (#3947)
This PR targets the most memory-expensive operation when partitioning PDFs and images: deduplicating pdfminer elements. On large pages the number of elements can exceed 10k, which generates multiple 10k x 10k double-precision float matrices during deduplication and pushes peak memory usage close to 13 GB.

![Screenshot 2025-03-06 at 3 22 52 PM](https://github.com/user-attachments/assets/fdc26806-947b-4b5a-9d8e-4faeb0179b9f)

This PR breaks the computation down by computing partial IOU. More precisely, it computes IOU for 2000 elements at a time against all elements, which reduces peak memory usage by nearly an order of magnitude, to around 1.6 GB.

![image](https://github.com/user-attachments/assets/e7b9f149-2b6a-4fc9-83c7-652e20849b76)

The block size is configurable according to the user's preferred peak memory usage; it is set through the environment variable `UNST_MATMUL_MEMORY_CAP_IN_GB`.
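
As a rough back-of-the-envelope check on the figures above (not taken from the PR's own measurements), the size of one dense matrix can be computed directly. The helper name `dense_matrix_gib` is hypothetical, and the sketch assumes 8-byte `float64` entries.

```python
import numpy as np


def dense_matrix_gib(n_rows: int, n_cols: int, dtype=np.float64) -> float:
    """Size in GiB of one dense n_rows x n_cols matrix of the given dtype."""
    return n_rows * n_cols * np.dtype(dtype).itemsize / 2**30


# One full pairwise matrix for 10k elements versus one 2000-row block.
full = dense_matrix_gib(10_000, 10_000)
block = dense_matrix_gib(2_000, 10_000)

print(f"full matrix: {full:.2f} GiB")    # full matrix: 0.75 GiB
print(f"2000-row block: {block:.2f} GiB")  # 2000-row block: 0.15 GiB
```

Each intermediate matrix is five times smaller in the blocked form, so a computation that holds several such intermediates at once shrinks accordingly.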
1 parent 19373de commit 961c8d5

3 files changed: +14 -6 lines changed

CHANGELOG.md

+2 -1
@@ -1,11 +1,12 @@
-## 0.16.24-dev5
+## 0.16.24

 ### Enhancements

 - **Support dynamic partitioner file type registration**. Use `create_file_type` to create new file type that can be handled
 in unstructured and `register_partitioner` to enable registering your own partitioner for any file type.

 - **`extract_image_block_types` now also works for CamelCase element type names**. Previously `NarrativeText` and similar CamelCase element types can't be extracted using the mentioned parameter in `partition`. Now figures for those elements can be extracted like `Image` and `Table` elements
+- **use block matrix to reduce peak memory usage for pdf/image partition**.

 ### Features

unstructured/__version__.py

+1 -1
@@ -1 +1 @@
-__version__ = "0.16.24-dev5"  # pragma: no cover
+__version__ = "0.16.24"  # pragma: no cover

unstructured/partition/pdf_image/pdfminer_processing.py

+11 -4
@@ -1,5 +1,6 @@
 from __future__ import annotations

+import os
 from typing import TYPE_CHECKING, Any, BinaryIO, Iterable, List, Optional, Union, cast

 import numpy as np
@@ -708,10 +709,16 @@ def remove_duplicate_elements(
 ) -> TextRegions:
     """Removes duplicate text elements extracted by PDFMiner from a document layout."""

-    iou = boxes_self_iou(elements.element_coords, threshold)
-    # this is equivalent of finding those rows where `not iou[i, i + 1 :].any()`, i.e., any element
-    # that has no overlap above the threshold with any other elements
-    return elements.slice(~np.triu(iou, k=1).any(axis=1))
+    coords = elements.element_coords
+    # experiments show 2e3 is the block size that constrains the peak memory around 1Gb for this
+    # function; that accounts for all the intermediate matrices allocated and memory for storing
+    # final results
+    memory_cap_in_gb = os.getenv("UNST_MATMUL_MEMORY_CAP_IN_GB", 1)
+    n_split = np.ceil(coords.shape[0] / 2e3 / memory_cap_in_gb)
+    splits = np.array_split(coords, n_split, axis=0)
+
+    ious = [~np.triu(boxes_iou(split, coords, threshold), k=1).any(axis=1) for split in splits]
+    return elements.slice(np.concatenate(ious))


 def aggregate_embedded_text_by_block(
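
The block-wise idea in the hunk above can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the library's implementation: `iou_above` and `keep_mask_blockwise` are hypothetical names, boxes are assumed to be `(x1, y1, x2, y2)` rows, and the 0.5 default threshold is a placeholder. Two details are handled explicitly here. `os.getenv` returns a string whenever the variable is set, so the sketch converts the value with `float`. And `np.triu` measures its diagonal offset from the first row of the array it is given, so a block that starts at global row `start` uses `k=start + 1` to mask the same entries as `k=1` on the full square matrix.

```python
import os

import numpy as np


def iou_above(block: np.ndarray, boxes: np.ndarray, threshold: float) -> np.ndarray:
    """Boolean (len(block), len(boxes)) matrix, True where IOU > threshold.

    Hypothetical stand-in for a pairwise IOU helper; boxes are (x1, y1, x2, y2).
    """
    x1 = np.maximum(block[:, None, 0], boxes[None, :, 0])
    y1 = np.maximum(block[:, None, 1], boxes[None, :, 1])
    x2 = np.minimum(block[:, None, 2], boxes[None, :, 2])
    y2 = np.minimum(block[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_block = (block[:, 2] - block[:, 0]) * (block[:, 3] - block[:, 1])
    area_boxes = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = area_block[:, None] + area_boxes[None, :] - inter
    # Compare inter > threshold * union instead of dividing, so zero-area
    # boxes cannot trigger a division by zero.
    return inter > threshold * union


def keep_mask_blockwise(
    coords: np.ndarray, threshold: float = 0.5, rows_per_gb: int = 2000
) -> np.ndarray:
    """True for each box with no later box overlapping it above the threshold."""
    n = coords.shape[0]
    if n == 0:
        return np.zeros(0, dtype=bool)
    # The environment value arrives as a string when set, so parse it.
    cap_gb = float(os.getenv("UNST_MATMUL_MEMORY_CAP_IN_GB", "1"))
    block_rows = max(1, int(rows_per_gb * cap_gb))
    keep = []
    for start in range(0, n, block_rows):
        block = coords[start : start + block_rows]
        overlaps = iou_above(block, coords, threshold)
        # Local row i is global row start + i; keep only columns j > start + i.
        keep.append(~np.triu(overlaps, k=start + 1).any(axis=1))
    return np.concatenate(keep)
```

Only one `(block_rows, n)` matrix family is alive at a time, which is what bounds the peak. On small inputs the result can be checked against the full-matrix form `~np.triu(iou_above(coords, coords, t), k=1).any(axis=1)`.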
