@badGarnet badGarnet commented Feb 4, 2025

This PR rewrites the logic in unstructured_inference that merges the extracted layout with the inferred layout, using vectorized operations. The goals are to:

  • vectorize the operation to improve memory and CPU efficiency
  • apply the logic uniformly so that ordering is no longer a factor (the unstructured_inference version uses nested loops and modifies the inner loop's contents on the fly, so the order of the outer loop, which is the order of the extracted elements, affects the merging results)
  • rewrite the loop as clear steps with explicit rules
  • set the stage for follow-up improvements

While this PR aims to reproduce the existing behavior as closely as possible, it is not an exact replica of the looped version. Because order is no longer a factor, some extracted elements that previously were not considered part of a larger inferred element (due to suboptimal processing order) are now properly merged. This led to a change in one ingest test. For example, the change shows that we now properly merge the section number with the section title into the full title element.
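The order-independent matching can be pictured as a pairwise overlap matrix computed with broadcasting instead of nested loops. The following is an illustrative sketch only; the function and variable names are invented and do not come from the PR:

```python
import numpy as np

def pairwise_iou(boxes_a: np.ndarray, boxes_b: np.ndarray) -> np.ndarray:
    """Return an (len(boxes_a), len(boxes_b)) IoU matrix for (x1, y1, x2, y2) boxes.

    Broadcasting scores every extracted/inferred pair at once, so no result
    depends on the order in which elements are visited.
    """
    a = boxes_a[:, None, :]  # shape (Na, 1, 4)
    b = boxes_b[None, :, :]  # shape (1, Nb, 4)
    inter_w = np.clip(np.minimum(a[..., 2], b[..., 2]) - np.maximum(a[..., 0], b[..., 0]), 0, None)
    inter_h = np.clip(np.minimum(a[..., 3], b[..., 3]) - np.maximum(a[..., 1], b[..., 1]), 0, None)
    inter = inter_w * inter_h
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    # Guard against zero-area degenerate boxes when dividing
    return inter / np.maximum(area_a + area_b - inter, 1e-12)
```

Each merging rule can then be expressed as a boolean reduction over such a matrix, applied to all pairs simultaneously.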

Test:

Since the goal of this refactor is to preserve as much of the existing behavior as possible, we rely on the existing tests. As mentioned above, the one file whose output changed during the ingest test is a net positive change.

@badGarnet badGarnet marked this pull request as ready for review February 7, 2025 03:04
Contributor

@pawel-kmiecik pawel-kmiecik left a comment

LGTM!
Just a nit: I'd add type hints to all functions and some docstrings, especially for the _merge and _mark functions

Contributor

@plutasnyy plutasnyy left a comment

For visibility, here is the previous version https://github.com/Unstructured-IO/unstructured-inference/blob/cd9ea8b846cd7ae26c81c49c52278d3504e941ea/unstructured_inference/inference/layoutelement.py#L221

Do we understand the impact of this change on metrics and how much speed/memory we gained?

Honestly, this logic is quite complex and I can't confidently say that I fully understand it ;) I did my best and I really appreciate all the comments and efforts to make it easier to follow!

I think each specific rule could be a separate function with a docstring showing, in a slightly more visual way, what it does.

I hope we have lots of tests there 🤞

I added some questions, but I think none of them are blockers. This task was quite challenging, good job @badGarnet ! 🌷

}
},
{
"type": "FigureCaption",
Contributor

Is there any specific reason we started detecting more objects of the type FigureCaption? Are they special in any way?

Collaborator Author

great catch; this is actually a bug in the refactoring; fixing that



# TODO (yao): move those vector math utils into a separate submodule to avoid import issues
def boxes_iou(
Contributor

nit: maybe this could be renamed to `are_boxes_overlapping`; typing would be nice here
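A sketch of the renamed, typed variant this nit suggests, assuming (x1, y1, x2, y2) box coordinates; this is illustrative only and not the PR's actual implementation:

```python
import numpy as np

def are_boxes_overlapping(boxes_a: np.ndarray, boxes_b: np.ndarray) -> np.ndarray:
    """Return a boolean (len(boxes_a), len(boxes_b)) matrix that is True where
    box i of boxes_a strictly overlaps box j of boxes_b."""
    a = boxes_a[:, None, :]
    b = boxes_b[None, :, :]
    return (
        (np.minimum(a[..., 2], b[..., 2]) > np.maximum(a[..., 0], b[..., 0]))
        & (np.minimum(a[..., 3], b[..., 3]) > np.maximum(a[..., 1], b[..., 1]))
    )
```

With the strict `>`, boxes that merely share an edge are not considered overlapping.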


# ==== RULE 1: any inferred box that is almost the same as an extracted image box, inferred is
# removed
# NOTE (yao): what if od model detects table but pdfminer says image -> we would lose the table
Contributor

I assume that if the text is extractable, it is easier for pdfminer to detect the table than for the OD model; but this is just an assumption

Collaborator Author

right; if the text is extractable, in general this is not a problem. It is only a problem when a table is in a container that pdfminer marks as an image, which can happen.

Comment on lines +259 to +261
# ==== RULE 2. if there is an inferred region almost the same as the extracted text region ->
# keep the inferred region and remove the extracted region; here we put more trust in the OD
# model than in pdfminer for bounding boxes
Contributor

Was this decision made for a particular reason? Did we observe worse bboxes from pdfminer?
FYI, I know the scope of this change is purely a code refactor ;)

Collaborator Author

Not sure, to be honest. For the moment we generally observe that the OD model produces pretty good bounding boxes and has a definitely richer ontology of element types, so this is more a statement that "we keep the OD model's element type"
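For illustration, Rule 2 amounts to a boolean reduction over an IoU-style matrix. The function name and the default threshold below are invented for this sketch and are not taken from the PR:

```python
import numpy as np

def extracted_keep_mask(iou: np.ndarray, same_region_threshold: float = 0.75) -> np.ndarray:
    """Given an (n_extracted, n_inferred) IoU matrix, return a boolean mask over
    the extracted regions that is False wherever an extracted region is almost
    the same as some inferred region (Rule 2 keeps the inferred element there)."""
    return ~(iou > same_region_threshold).any(axis=1)
```

The element types of the surviving inferred regions are the ones that end up in the merged layout, which is the "keep the OD model's element type" statement above.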

extracted_layout: LayoutElements,
inferred_layout: LayoutElements,
same_region_threshold: float,
) -> tuple[np.ndarray, np.ndarray]:
Contributor

isn't the type just np.ndarray? That is the type returned by boxes_iou.
This function modifies inferred_layout; I would probably prefer that it return a modified copy. I believe the reasoning here was to avoid copying the entire inferred_layout, since it may be huge, right?
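A minimal sketch of the trade-off being discussed, assuming boxes live in a plain (n, 4) coordinate array; the names here are invented, not the PR's:

```python
import numpy as np

def grow_box_inplace(coords: np.ndarray, index: int, other: np.ndarray) -> None:
    """Expand coords[index] in place to the minimum (x1, y1, x2, y2) box that
    contains both coords[index] and `other`; no copy of coords is made."""
    coords[index, :2] = np.minimum(coords[index, :2], other[:2])
    coords[index, 2:] = np.maximum(coords[index, 2:], other[2:])
```

Returning a modified copy would instead start with `coords = coords.copy()`, which gives a cleaner functional interface at the cost of duplicating a potentially large array per call.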

extracted_to_iter[matches] = False
inferred_to_iter[inferred_index] = False
# then expand inferred box by all the extracted boxes
# FIXME (yao): this part is broken at the moment
Contributor

Should it be there?

Collaborator Author

oops; need to delete

Contributor

is this part still broken based on the comment?

from unstructured_inference.inference.layoutelement import LayoutElements
from unstructured_inference.inference.layoutelement import (
merge_inferred_layout_with_extracted_layout as merge_inferred_with_extracted_page,
)
Contributor

@tbs17 tbs17 Feb 7, 2025

I see that previously we were importing the function merge_inferred_layout_with_extracted_layout as merge_inferred_with_extracted_page, which means this relied on the unstructured-inference package.

Now, after removing this import, it seems we decouple unstructured-inference from unstructured and move the merging logic here into unstructured. Will we remove the old merging logic in unstructured-inference to keep them in sync? The affected use case would be some other repo relying on unstructured-inference for that function instead of importing it from unstructured.

Collaborator Author

I think the long-term goal is to fold inference into unstructured, as we see the current setup is slowing down development. And as this refactor suggests, many functions really need both repos together to be properly defined (e.g., types defined in inference)

badGarnet and others added 3 commits February 7, 2025 12:30
}
},
{
"type": "Image",
Collaborator Author

now the two side by side photos are recognized as two images instead of just one

Collaborator Author

@badGarnet badGarnet Feb 7, 2025

page8_layout_final

@badGarnet badGarnet enabled auto-merge February 7, 2025 19:58
@badGarnet badGarnet added this pull request to the merge queue Feb 7, 2025
have expanded due to merging.
"""
# in theory one extracted region __should__ match at most one inferred region, given inferred
# regions cannot overlap; so the first match here __should__ also be the only match
Contributor

maybe it's rare to see inferred regions overlap, but if that does happen, how much of the code below would still hold? Maybe a test for such an edge case?
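One way to construct that edge case in a test: when two inferred regions overlap each other, a single extracted box can be a subregion of both, so "the first match is the only match" no longer holds. The helper below is invented for illustration and is not the PR's subregion check:

```python
import numpy as np

def is_subregion(inner: np.ndarray, outer: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Boolean (len(inner), len(outer)) matrix: True where more than `threshold`
    of inner box i's area lies inside outer box j."""
    a = inner[:, None, :]
    b = outer[None, :, :]
    iw = np.clip(np.minimum(a[..., 2], b[..., 2]) - np.maximum(a[..., 0], b[..., 0]), 0, None)
    ih = np.clip(np.minimum(a[..., 3], b[..., 3]) - np.maximum(a[..., 1], b[..., 1]), 0, None)
    inner_area = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    return (iw * ih) / np.maximum(inner_area, 1e-12) > threshold

# Two overlapping inferred regions, with one extracted box contained in both:
extracted = np.array([[4.0, 4.0, 6.0, 6.0]])
inferred = np.array([[0.0, 0.0, 7.0, 7.0], [3.0, 3.0, 10.0, 10.0]])
matches = is_subregion(extracted, inferred)
# matches[0] is [True, True]: the first match is not the only match.
```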

inferred_layout.slice([inferred_index]),
*[extracted_layout.slice([match]) for match in matches],
)
inferred_to_proc[inferred_to_proc] = inferred_to_iter
Contributor

@tbs17 tbs17 Feb 7, 2025

is the use of inferred_to_iter and extracted_to_iter necessary? I got confused when I first saw that inferred_to_proc and extracted_to_proc were updated in place rather than returned, but it could be just me.

Would the below work?

def _merge_extracted_that_are_subregion_of_inferred_text(
    extracted_layout,
    inferred_layout,
    extracted_is_subregion_of_inferred,
    extracted_to_proc,
    inferred_to_proc,
):
    """Merge extracted regions that are subregions of inferred text regions."""
    for inferred_index, inferred_row in enumerate(extracted_is_subregion_of_inferred.T):
        matches = np.where(inferred_row)[0]
        if not matches.size:
            continue
        # Mark matched regions as processed
        extracted_to_proc[matches] = False
        inferred_to_proc[inferred_index] = False
        # Expand inferred bounding box to encompass matched extracted regions
        inferred_layout.element_coords[[inferred_index]] = _minimum_containing_coords(
            inferred_layout.slice([inferred_index]),
            *[extracted_layout.slice([match]) for match in matches],
        )
    return inferred_layout

Merged via the queue into main with commit 723c074 Feb 7, 2025
41 checks passed
@badGarnet badGarnet deleted the feat/vectorize-layout-merging branch February 7, 2025 20:59
temp-adelyn pushed a commit to temp-adelyn/unstructured that referenced this pull request Mar 3, 2025