@badGarnet badGarnet commented Feb 4, 2025

This PR rewrites the logic in unstructured_inference that merges the extracted layout with the inferred layout, using vectorized operations. The goals are to:

  • vectorize the operation to improve memory and CPU efficiency
  • apply the logic uniformly so that ordering is no longer a factor (the unstructured_inference version uses nested loops and modifies the inner loop's contents on the fly, so the order of the outer loop, which is the order of the extracted elements, affects the merging results)
  • rewrite the loop as clear steps with explicit rules
  • set the stage for follow-up improvements

While this PR aims to reproduce the existing behavior as closely as possible, it is not an exact replica of the looped version. Because order is no longer a factor, some extracted elements that previously were not considered part of a larger inferred element (due to suboptimal processing order) are now properly merged. This led to a change in one ingest test. For example, the change shows that we now properly merge the section number with the section title into the full title element.
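The order-independent matching can be pictured as a pairwise overlap matrix computed with broadcasting instead of nested loops. The following is an illustrative sketch only; the function and variable names are invented and do not come from the PR:

```python
import numpy as np

def pairwise_iou(boxes_a: np.ndarray, boxes_b: np.ndarray) -> np.ndarray:
    """Return an (len(boxes_a), len(boxes_b)) IoU matrix for (x1, y1, x2, y2) boxes.

    Broadcasting scores every extracted/inferred pair at once, so no result
    depends on the order in which elements are visited.
    """
    a = boxes_a[:, None, :]  # shape (Na, 1, 4)
    b = boxes_b[None, :, :]  # shape (1, Nb, 4)
    inter_w = np.clip(np.minimum(a[..., 2], b[..., 2]) - np.maximum(a[..., 0], b[..., 0]), 0, None)
    inter_h = np.clip(np.minimum(a[..., 3], b[..., 3]) - np.maximum(a[..., 1], b[..., 1]), 0, None)
    inter = inter_w * inter_h
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    # Guard against zero-area degenerate boxes when dividing
    return inter / np.maximum(area_a + area_b - inter, 1e-12)
```

Each merging rule can then be expressed as a boolean reduction over such a matrix, applied to all pairs simultaneously.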

Test:

Since the goal of this refactor is to preserve as much of the existing behavior as possible, we rely on the existing tests. As mentioned above, the one file whose output changed during the ingest test is a net positive change.

@badGarnet badGarnet marked this pull request as ready for review February 7, 2025 03:04
Contributor

@pawel-kmiecik pawel-kmiecik left a comment

LGTM!
Just a nit: I'd add type hints to all functions and some docstrings, especially for the _merge and _mark functions

Contributor

@plutasnyy plutasnyy left a comment

For visibility, here is the previous version https://github.com/Unstructured-IO/unstructured-inference/blob/cd9ea8b846cd7ae26c81c49c52278d3504e941ea/unstructured_inference/inference/layoutelement.py#L221

Do we understand the impact of this change on metrics and how much speed/memory we gained?

Honestly, this logic is quite complex and I can't confidently say that I fully understand it ;) I did my best and I really appreciate all the comments and efforts to make it easier to follow!

I think each specific rule could be a separate function with a docstring showing, in a slightly more visual way, what it does.

I hope we have lots of tests there 🤞

I added some questions, but I think none of them are blockers. This task was quite challenging, good job @badGarnet ! 🌷

}
},
{
"type": "FigureCaption",
Contributor

Is there any specific reason we started detecting more objects of the type FigureCaption? Are they special in any way?

Collaborator Author

great catch; this is actually a bug in the refactoring; fixing that



# TODO (yao): move those vector math utils into a separate submodule to avoid import issues
def boxes_iou(
Contributor

nit: maybe this could be renamed to `are_boxes_overlapping`; typing would be nice here
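A sketch of the renamed, typed variant this nit suggests, assuming (x1, y1, x2, y2) box coordinates; this is illustrative only and not the PR's actual implementation:

```python
import numpy as np

def are_boxes_overlapping(boxes_a: np.ndarray, boxes_b: np.ndarray) -> np.ndarray:
    """Return a boolean (len(boxes_a), len(boxes_b)) matrix that is True where
    box i of boxes_a strictly overlaps box j of boxes_b."""
    a = boxes_a[:, None, :]
    b = boxes_b[None, :, :]
    return (
        (np.minimum(a[..., 2], b[..., 2]) > np.maximum(a[..., 0], b[..., 0]))
        & (np.minimum(a[..., 3], b[..., 3]) > np.maximum(a[..., 1], b[..., 1]))
    )
```

With the strict `>`, boxes that merely share an edge are not considered overlapping.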


# ==== RULE 1: any inferred box that is almost the same as an extracted image box, inferred is
# removed
# NOTE (yao): what if od model detects table but pdfminer says image -> we would lose the table
Contributor

I assume that if the text is extractable, it is easier for pdfminer to detect the table than for the OD model; but this is just an assumption

Collaborator Author

right; if the text is extractable, in general this is not a problem. It is only a problem when a table is in a container that pdfminer marks as an image, which can happen.

Comment on lines +259 to +261
# ==== RULE 2. if there is an inferred region almost the same as the extracted text region ->
# keep the inferred region and remove the extracted region; here we put more trust in the OD
# model than in pdfminer for bounding boxes
Contributor

Was this decision made for a particular reason? Did we observe worse bboxes from pdfminer?
FYI, I know the scope of this change is purely a code refactor ;)

Collaborator Author

Not sure, to be honest. For the moment we generally observe that the OD model produces pretty good bounding boxes and has a definitely richer ontology of element types, so this is more a statement that "we keep the OD model's element type"
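For illustration, Rule 2 amounts to a boolean reduction over an IoU-style matrix. The function name and the default threshold below are invented for this sketch and are not taken from the PR:

```python
import numpy as np

def extracted_keep_mask(iou: np.ndarray, same_region_threshold: float = 0.75) -> np.ndarray:
    """Given an (n_extracted, n_inferred) IoU matrix, return a boolean mask over
    the extracted regions that is False wherever an extracted region is almost
    the same as some inferred region (Rule 2 keeps the inferred element there)."""
    return ~(iou > same_region_threshold).any(axis=1)
```

The element types of the surviving inferred regions are the ones that end up in the merged layout, which is the "keep the OD model's element type" statement above.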

extracted_layout: LayoutElements,
inferred_layout: LayoutElements,
same_region_threshold: float,
) -> tuple[np.ndarray, np.ndarray]:
Contributor

isn't the type just np.ndarray? That is the type returned by boxes_iou.
This function modifies inferred_layout; I would probably prefer that it return a modified copy. I believe the reasoning here was to avoid copying the entire inferred_layout, since it may be huge, right?
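A minimal sketch of the trade-off being discussed, assuming boxes live in a plain (n, 4) coordinate array; the names here are invented, not the PR's:

```python
import numpy as np

def grow_box_inplace(coords: np.ndarray, index: int, other: np.ndarray) -> None:
    """Expand coords[index] in place to the minimum (x1, y1, x2, y2) box that
    contains both coords[index] and `other`; no copy of coords is made."""
    coords[index, :2] = np.minimum(coords[index, :2], other[:2])
    coords[index, 2:] = np.maximum(coords[index, 2:], other[2:])
```

Returning a modified copy would instead start with `coords = coords.copy()`, which gives a cleaner functional interface at the cost of duplicating a potentially large array per call.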

extracted_to_iter[matches] = False
inferred_to_iter[inferred_index] = False
# then expand inferred box by all the extracted boxes
# FIXME (yao): this part is broken at the moment
Contributor

Should it be there?

Collaborator Author

oops; need to delete

Contributor

is this part still broken based on the comment?

from unstructured_inference.inference.layoutelement import LayoutElements
from unstructured_inference.inference.layoutelement import (
merge_inferred_layout_with_extracted_layout as merge_inferred_with_extracted_page,
)
Contributor

@tbs17 tbs17 Feb 7, 2025

I see that previously we were importing the function merge_inferred_layout_with_extracted_layout as merge_inferred_with_extracted_page, which means this relied on the unstructured-inference package.

Now, after removing this import, it seems we decouple unstructured-inference from unstructured and move the merging logic here into unstructured. Will we remove the old merging logic in unstructured-inference to keep them in sync? The affected use case would be some other repo relying on unstructured-inference for that function instead of importing it from unstructured.

Collaborator Author

I think the long-term goal is to fold inference into unstructured, as we see the current setup is slowing down development. And as this refactor suggests, many functions really need both repos together to be properly defined (e.g., types defined in inference)

badGarnet and others added 3 commits February 7, 2025 12:30
}
},
{
"type": "Image",
Collaborator Author

now the two side by side photos are recognized as two images instead of just one

Collaborator Author

@badGarnet badGarnet Feb 7, 2025

page8_layout_final

@badGarnet badGarnet enabled auto-merge February 7, 2025 19:58
@badGarnet badGarnet added this pull request to the merge queue Feb 7, 2025
have expanded due to merging.
"""
# in theory one extracted region __should__ match at most one inferred region, given inferred
# regions cannot overlap; so the first match here __should__ also be the only match
Contributor

maybe it's rare to see inferred regions overlap, but if that does happen, how much of the code below would still hold? Maybe a test for such an edge case?
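One way to construct that edge case in a test: when two inferred regions overlap each other, a single extracted box can be a subregion of both, so "the first match is the only match" no longer holds. The helper below is invented for illustration and is not the PR's subregion check:

```python
import numpy as np

def is_subregion(inner: np.ndarray, outer: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Boolean (len(inner), len(outer)) matrix: True where more than `threshold`
    of inner box i's area lies inside outer box j."""
    a = inner[:, None, :]
    b = outer[None, :, :]
    iw = np.clip(np.minimum(a[..., 2], b[..., 2]) - np.maximum(a[..., 0], b[..., 0]), 0, None)
    ih = np.clip(np.minimum(a[..., 3], b[..., 3]) - np.maximum(a[..., 1], b[..., 1]), 0, None)
    inner_area = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    return (iw * ih) / np.maximum(inner_area, 1e-12) > threshold

# Two overlapping inferred regions, with one extracted box contained in both:
extracted = np.array([[4.0, 4.0, 6.0, 6.0]])
inferred = np.array([[0.0, 0.0, 7.0, 7.0], [3.0, 3.0, 10.0, 10.0]])
matches = is_subregion(extracted, inferred)
# matches[0] is [True, True]: the first match is not the only match.
```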

inferred_layout.slice([inferred_index]),
*[extracted_layout.slice([match]) for match in matches],
)
inferred_to_proc[inferred_to_proc] = inferred_to_iter
Contributor

@tbs17 tbs17 Feb 7, 2025

is the use of inferred_to_iter and extracted_to_iter necessary? I got confused when I first saw that inferred_to_proc and extracted_to_proc were updated in place rather than returned, but it could be just me.

Would the below work?

def _merge_extracted_that_are_subregion_of_inferred_text(
    extracted_layout,
    inferred_layout,
    extracted_is_subregion_of_inferred,
    extracted_to_proc,
    inferred_to_proc,
):
    """Merge extracted regions that are subregions of inferred text regions."""
    for inferred_index, inferred_row in enumerate(extracted_is_subregion_of_inferred.T):
        matches = np.where(inferred_row)[0]
        if not matches.size:
            continue
        # Mark matched regions as processed
        extracted_to_proc[matches] = False
        inferred_to_proc[inferred_index] = False
        # Expand inferred bounding box to encompass matched extracted regions
        inferred_layout.element_coords[[inferred_index]] = _minimum_containing_coords(
            inferred_layout.slice([inferred_index]),
            *[extracted_layout.slice([match]) for match in matches],
        )
    return inferred_layout

Merged via the queue into main with commit 723c074 Feb 7, 2025
41 checks passed
@badGarnet badGarnet deleted the feat/vectorize-layout-merging branch February 7, 2025 20:59
temp-adelyn pushed a commit to temp-adelyn/unstructured that referenced this pull request Mar 3, 2025