Commit 85bfb1b
Feat/vectorize layout merging (Unstructured-IO#3900)
This PR rewrites the logic in `unstructured_inference` that merges the
extracted layout with the inferred layout, using vectorized operations.
The goals are to:
- vectorize the operation to improve memory and CPU efficiency
- apply the logic uniformly so that order is not a factor (the
`unstructured_inference` version uses loops and modifies the content of
the inner loop on the fly, so the order of the outer loop, which is the
order of the extracted elements, becomes a factor in determining the
merging results)
- rewrite the loop into clear steps with clear rules
- set the stage for follow-up improvements
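To illustrate the first two goals, here is a minimal sketch of a vectorized, order-independent containment check written with NumPy broadcasting. It is not the code from this PR: the function names `pairwise_intersection_area` and `subsumed_by`, the `(x1, y1, x2, y2)` box layout, and the `0.9` threshold are all assumptions made for the example.

```python
import numpy as np


def pairwise_intersection_area(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Return an (N, M) matrix of intersection areas.

    `a` is (N, 4) and `b` is (M, 4); each row is (x1, y1, x2, y2)
    with x1 <= x2 and y1 <= y2 (an assumed layout for this sketch).
    """
    # Broadcasting (N, 1) against (1, M) yields every pair at once.
    x1 = np.maximum(a[:, None, 0], b[None, :, 0])
    y1 = np.maximum(a[:, None, 1], b[None, :, 1])
    x2 = np.minimum(a[:, None, 2], b[None, :, 2])
    y2 = np.minimum(a[:, None, 3], b[None, :, 3])
    # Negative widths or heights mean no overlap, so clip them to zero.
    return np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)


def subsumed_by(
    extracted: np.ndarray, inferred: np.ndarray, threshold: float = 0.9
) -> np.ndarray:
    """For each extracted box, return the index of the inferred box that
    covers at least `threshold` of its area, or -1 if there is none."""
    if inferred.shape[0] == 0:
        return np.full(len(extracted), -1)
    inter = pairwise_intersection_area(extracted, inferred)
    area = (extracted[:, 2] - extracted[:, 0]) * (extracted[:, 3] - extracted[:, 1])
    # Fraction of each extracted box that lies inside each inferred box.
    frac = inter / np.maximum(area, 1e-9)[:, None]
    best = frac.argmax(axis=1)
    contained = frac[np.arange(len(extracted)), best] >= threshold
    return np.where(contained, best, -1)


# Hypothetical page: a section number, its title text, and a paragraph.
extracted = np.array(
    [[10, 10, 20, 20], [25, 10, 90, 20], [10, 200, 90, 220]], dtype=float
)
# One inferred title region spanning both the number and the title text.
inferred = np.array([[8, 8, 95, 22]], dtype=float)

owner = subsumed_by(extracted, inferred)
```

Every extracted box is compared against the same, unmodified set of inferred boxes, so permuting the rows of `extracted` only permutes the rows of the result. In this example `owner` maps the first two boxes to inferred region `0` and leaves the paragraph unassigned (`-1`).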
While this PR aims to reproduce the existing behavior as closely as
possible, it is not an exact replica of the looped version. Because order
is no longer a factor, some extracted elements that were previously not
considered part of a larger inferred element (because the processing
order was not optimal) are now properly merged. This led to changes in
one ingest test. For example, the change shows that we now properly merge
the section number with the section title into the full title
element.
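The following toy example sketches how mutating state inside a loop can make the result depend on processing order, and how deciding against an unmodified snapshot avoids that. The rules here (tightening a region when an extracted box covers at least half of it) are invented for illustration and are not the actual rules of `unstructured_inference` or of this PR; `looped_merge` and `snapshot_merge` are hypothetical names.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def area(b: Box) -> float:
    return (b[2] - b[0]) * (b[3] - b[1])


def inside(inner: Box, outer: Box) -> bool:
    return (
        inner[0] >= outer[0] and inner[1] >= outer[1]
        and inner[2] <= outer[2] and inner[3] <= outer[3]
    )


def looped_merge(
    extracted: List[Box], inferred: List[Box], tighten_at: float = 0.5
) -> Tuple[List[Box], List[Box]]:
    """Toy loop that rewrites `regions` while iterating over `extracted`."""
    regions = list(inferred)
    standalone = []
    for box in extracted:
        for i, region in enumerate(regions):
            if inside(box, region):
                if area(box) >= tighten_at * area(region):
                    # Mutation: later boxes are tested against the tight box.
                    regions[i] = box
                break
        else:
            # No region contained this box, so it stays a separate element.
            standalone.append(box)
    return regions, standalone


def snapshot_merge(
    extracted: List[Box], inferred: List[Box]
) -> Tuple[List[Box], List[Box]]:
    """Order-independent variant: test every box against the original regions."""
    standalone = [b for b in extracted if not any(inside(b, r) for r in inferred)]
    return list(inferred), standalone


title_region: Box = (8, 8, 95, 22)   # inferred section title
number: Box = (10, 10, 20, 20)       # extracted section number
title_text: Box = (25, 10, 90, 20)   # extracted title text

title_first = looped_merge([title_text, number], [title_region])
number_first = looped_merge([number, title_text], [title_region])
```

When the title text is processed first, the looped version tightens the region before the section number is examined, so the number ends up as a standalone element; when the number comes first, it is absorbed. `snapshot_merge` treats both boxes as part of the title region in either order.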
## Test:
Since the goal of this refactor is to preserve as much existing behavior
as possible, we rely on existing tests. As mentioned above, the one file
whose output changed during the ingest test is a net positive change.
---------
Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: badGarnet <[email protected]>
1 parent 3886dd4 commit 85bfb1b
File tree
6 files changed, +436 -34 lines changed
- test_unstructured_ingest/expected-structured-output
- google-drive
- local-single-file-with-pdf-infer-table-structure
- unstructured
- partition/pdf_image