Commit 3973a30

Feat: Add pdfminer parameters configuration (#3918)
This pull request adds the ability to configure multiple pdfminer parameters, in a way that is simple to extend to additional parameters. One of the parameters overrides the default from the `LAParams` config class.

Example:

```python3
partition(
    filename=example_doc_path("pdf/layout-parser-paper-fast.pdf"),
    pdfminer_line_margin=1.123,
    pdfminer_char_margin=None,
    pdfminer_line_overlap=0.0123,
    pdfminer_word_margin=3.21,
)
assert pdfminer_mock.call_args.kwargs == {
    "line_margin": 1.123,
    "line_overlap": 0.0123,
    "word_margin": 3.21,
}
```

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: plutasnyy <[email protected]>
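The example in the commit message suggests that each `pdfminer_*` keyword is forwarded under its pdfminer name and that options left as `None` are omitted. A minimal sketch of that mapping follows; `to_laparams_kwargs` and `PDFMINER_PREFIX` are hypothetical names used for illustration, not the code this commit adds:

```python
from typing import Any, Dict

# Hypothetical constant: the prefix used by the new partition() keyword arguments.
PDFMINER_PREFIX = "pdfminer_"


def to_laparams_kwargs(**partition_kwargs: Any) -> Dict[str, Any]:
    """Keep only the pdfminer_* options that are set, with the prefix stripped.

    Illustrative helper only; the real implementation may differ.
    """
    return {
        name[len(PDFMINER_PREFIX):]: value
        for name, value in partition_kwargs.items()
        if name.startswith(PDFMINER_PREFIX) and value is not None
    }


kwargs = to_laparams_kwargs(
    filename="example.pdf",  # unrelated options are ignored
    pdfminer_line_margin=1.123,
    pdfminer_char_margin=None,  # None means "keep the pdfminer default"
    pdfminer_line_overlap=0.0123,
    pdfminer_word_margin=3.21,
)
print(kwargs)
```

A dictionary built this way could then be unpacked into pdfminer's `LAParams(**kwargs)`, which accepts `line_overlap`, `char_margin`, `line_margin` and `word_margin` as keyword arguments.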
1 parent b521bce commit 3973a30

13 files changed

+188
-109
lines changed

CHANGELOG.md

+3-1
@@ -1,10 +1,12 @@
-## 0.16.21-dev4
+## 0.16.21-dev5
 
 ### Enhancements
 - **Use password** to load PDF with all modes
 
 - **use vectorized logic to merge inferred and extracted layouts**. Using the new `LayoutElements` data structure and numpy library to refactor the layout merging logic to improve compute performance as well as making logic more clear
 
+- **Add PDF Miner configuration** Now PDF Miner can be configured via `pdfminer_line_overlap`, `pdfminer_word_margin`, `pdfminer_line_margin` and `pdfminer_char_margin` parameters added to partition method.
+
 ### Features
 
 ### Fixes

test_unstructured/partition/pdf_image/test_pdfminer_processing.py

+20
@@ -1,5 +1,8 @@
+from unittest.mock import patch
+
 import numpy as np
 import pytest
+from pdfminer.layout import LAParams
 from PIL import Image
 from unstructured_inference.constants import Source as InferenceSource
 from unstructured_inference.inference.elements import (
@@ -11,6 +14,7 @@
 from unstructured_inference.inference.layout import DocumentLayout, LayoutElement, PageLayout
 
 from test_unstructured.unit_utils import example_doc_path
+from unstructured.partition.auto import partition
 from unstructured.partition.pdf_image.pdfminer_processing import (
     _validate_bbox,
     aggregate_embedded_text_by_block,
@@ -242,3 +246,19 @@ def test_process_file_with_pdfminer():
     assert len(layout)
     assert "LayoutParser: A Unified Toolkit for Deep\n" in layout[0].texts
     assert links[0][0]["url"] == "https://layout-parser.github.io"
+
+
+@patch("unstructured.partition.pdf_image.pdfminer_utils.LAParams", return_value=LAParams())
+def test_laprams_are_passed_from_partition_to_pdfminer(pdfminer_mock):
+    partition(
+        filename=example_doc_path("pdf/layout-parser-paper-fast.pdf"),
+        pdfminer_line_margin=1.123,
+        pdfminer_char_margin=None,
+        pdfminer_line_overlap=0.0123,
+        pdfminer_word_margin=3.21,
+    )
+    assert pdfminer_mock.call_args.kwargs == {
+        "line_margin": 1.123,
+        "line_overlap": 0.0123,
+        "word_margin": 3.21,
+    }
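
The test above patches `LAParams` at the place where it is looked up and then inspects `call_args.kwargs` on the mock. The same pattern can be shown in isolation with only the standard library. In this sketch, `FakeLAParams`, `fake_pdfminer_utils` and `init_layout_params` are hypothetical stand-ins and do not exist in the unstructured code base:

```python
from types import SimpleNamespace
from unittest.mock import patch


class FakeLAParams:
    """Hypothetical stand-in for pdfminer's LAParams."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


# Hypothetical stand-in for the module that looks up LAParams.
fake_pdfminer_utils = SimpleNamespace(LAParams=FakeLAParams)


def init_layout_params(**params):
    """Drop unset options, then build the params object via the namespace."""
    kwargs = {key: value for key, value in params.items() if value is not None}
    # Resolved at call time, so a patched attribute is picked up here.
    return fake_pdfminer_utils.LAParams(**kwargs)


def run_check():
    """Patch the constructor and return the keyword arguments it received."""
    with patch.object(
        fake_pdfminer_utils, "LAParams", return_value=FakeLAParams()
    ) as laparams_mock:
        init_layout_params(
            line_margin=1.123,
            char_margin=None,
            line_overlap=0.0123,
            word_margin=3.21,
        )
    return laparams_mock.call_args.kwargs


recorded = run_check()
print(recorded)
```

Patching the attribute on the object that performs the lookup, rather than on the defining module, is what lets the mock observe the call.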

test_unstructured_ingest/expected-structured-output/biomed-api/65/11/main.PMC6312790.pdf.json

+2-2
@@ -513,7 +513,7 @@
     "type": "Title"
   },
   {
-    "element_id": "be270e13c935334fa3b17b13066d639b",
+    "element_id": "9764a7d0d48e56e28ae267d6fe521036",
     "metadata": {
       "data_source": {},
       "filetype": "application/pdf",
@@ -522,7 +522,7 @@
       ],
       "page_number": 2
     },
-    "text": "The results of the experiment are presented in this session. The results obtained from weight loss method for stainless steel Type 316 immersed in 0.5 M H2SO4 solution in the absence and presence of different concentrations of egg shell powder (ES) are presented in Figs. 1–3 respectively. It can be seen clearly from these Figures that the efficiency of egg shell powder increase with the inhibitor con- centration, The increase in its efficiency could be as a result of increase in the constituent molecule",
+    "text": "The results of the experiment are presented in this session. The results obtained from weight loss method for stainless steel Type 316 immersed in 0.5 M H2SO4 solution in the absence and presence of different concentrations of egg shell powder (ES) are presented in Figs.1–3 respectively. It can be seen clearly from these Figures that the efficiency of egg shell powder increase with the inhibitor con- centration, The increase in its efficiency could be as a result of increase in the constituent molecule",
     "type": "NarrativeText"
   },
   {

test_unstructured_ingest/expected-structured-output/biomed-api/75/29/main.PMC6312793.pdf.json

+16-16
@@ -465,7 +465,7 @@
     "type": "Title"
   },
   {
-    "element_id": "0cc9334df550d1730f2d468941a38225",
+    "element_id": "02c4df0e110486afd2bd74245e7d93d9",
     "metadata": {
       "data_source": {},
       "filetype": "application/pdf",
@@ -474,14 +474,14 @@
       ],
       "links": [
         {
-          "start_index": 386,
+          "start_index": 383,
           "text": "https :// orlib . uqcloud . net /",
           "url": "https://orlib.uqcloud.net/"
         }
       ],
       "page_number": 2
     },
-    "text": "Subject area Operations research More specific subject area Vehicle scheduling Type of data Tables, text files How data were acquired Artificially generated by a C þ þ program on Intels Xeons CPU E5– 2670 v2 with Linux operating system. Data format Raw Experimental factors Sixty randomly generated instances of the MDVSP with the number of depots in (8, 12, 16) and the number of trips in (1500, 2000, 2500, 3000) Experimental features Randomly generated instances Data source location IITB-Monash Research Academy, IIT Bombay, Powai, Mumbai, India. Data accessibility Data can be downloaded from https://orlib.uqcloud.net/ Related research article Kulkarni, S., Krishnamoorthy, M., Ranade, A., Ernst, A.T. and Patil, R., 2018. A new formulation and a column generation-based heuristic for the multiple depot vehicle scheduling problem. Transportation Research Part B: Methodological, 118, pp. 457–487 [3].",
+    "text": "Subject area Operations research More specific subject area Vehicle scheduling Type of data Tables, text files How data were acquired Artificially generated by a þ program on Intels Xeons CPU E5– 2670 v2 with Linux operating system. Data format Raw Experimental factors Sixty randomly generated instances of the MDVSP with the number of depots in (8,12,16) and the number of trips in (1500, 2000, 2500, 3000) Experimental features Randomly generated instances Data source location IITB-Monash Research Academy, IIT Bombay, Powai, Mumbai, India. Data accessibility Data can be downloaded from https://orlib.uqcloud.net/ Related research article Kulkarni, S., Krishnamoorthy, M., Ranade, A., Ernst, A.T. and Patil, R., 2018. A new formulation and a column generation-based heuristic for the multiple depot vehicle scheduling problem. Transportation Research Part B: Methodological, 118, pp. 457–487 [3].",
     "type": "Table"
   },
   {
@@ -576,7 +576,7 @@
     "type": "Title"
   },
   {
-    "element_id": "683993fc4592941bf8b06173870aa63c",
+    "element_id": "1f3d79f338b86fbfcfa7054f11de28f0",
     "metadata": {
       "data_source": {},
       "filetype": "application/pdf",
@@ -585,14 +585,14 @@
       ],
       "links": [
         {
-          "start_index": 611,
+          "start_index": 609,
           "text": "https :// orlib . uqcloud . net",
           "url": "https://orlib.uqcloud.net"
         }
       ],
       "page_number": 2
     },
-    "text": "The dataset contains 60 different problem instances of the multiple depot vehicle scheduling pro- blem (MDVSP). Each problem instance is provided in a separate file. Each file is named as ‘RN-m-n-k.dat’, where ‘m’, ‘n’, and ‘k’ denote the number of depots, the number of trips, and the instance number for the size, ‘ðm; nÞ’, respectively. For example, the problem instance, ‘RN-8–1500-01.dat’, is the first problem instance with 8 depots and 1500 trips. For the number of depots, m, we used three values, 8, 12, and 16. The four values for the number of trips, n, are 1500, 2000, 2500, and 3000. For each size, ðm; nÞ, five instances are provided. The dataset can be downloaded from https://orlib.uqcloud.net. For each problem instance, the following information is provided:",
+    "text": "The dataset contains 60 different problem instances of the multiple depot vehicle scheduling pro- blem (MDVSP). Each problem instance is provided in a separate file. Each file is named as ‘RN-m-n-k.dat’, where ‘m’, ‘n’, and ‘k’ denote the number of depots, the number of trips, and the instance number for the size, ‘ðm;nÞ’, respectively. For example, the problem instance, ‘RN-8–1500-01.dat’, is the first problem instance with 8 depots and 1500 trips. For the number of depots, m, we used three values, 8,12, and 16. The four values for the number of trips, n, are 1500, 2000, 2500, and 3000. For each size, ðm;nÞ, five instances are provided. The dataset can be downloaded from https://orlib.uqcloud.net. For each problem instance, the following information is provided:",
     "type": "NarrativeText"
   },
   {
@@ -661,7 +661,7 @@
     "type": "UncategorizedText"
   },
   {
-    "element_id": "96ca028aef61c1fd98c9f0232a833498",
+    "element_id": "39943e8e76f7ddd879284cf782cac2f4",
     "metadata": {
       "data_source": {},
       "filetype": "application/pdf",
@@ -670,7 +670,7 @@
       ],
       "page_number": 2
     },
-    "text": "For each trip i A 1; 2; …; n, a start time, ts i , an end time, te i , a start location, ls i , and an end location, le i , and",
+    "text": "For each trip iA1;2;…;n, a start time, ts i, an end time, te i , a start location, ls i, and an end location, le i , and",
     "type": "NarrativeText"
   },
   {
@@ -726,7 +726,7 @@
     "type": "NarrativeText"
   },
   {
-    "element_id": "2bd550b209c7c06c42966aad21822ea5",
+    "element_id": "9698643b7f3d779d8a5fdb13dffef106",
     "metadata": {
       "data_source": {},
       "filetype": "application/pdf",
@@ -735,7 +735,7 @@
       ],
       "page_number": 3
     },
-    "text": "and end location of the trip. A long trip is about 3–5 h in duration and has the same start and end location. For all instances, m r l and the locations 1; …; m correspond to depots, while the remaining locations only appear as trip start and end locations.",
+    "text": "and end location of the trip. A long trip is about 3–5 h in duration and has the same start and end location. For all instances, mrl and the locations 1;…;m correspond to depots, while the remaining locations only appear as trip start and end locations.",
     "type": "NarrativeText"
   },
   {
@@ -804,7 +804,7 @@
     "type": "NarrativeText"
   },
   {
-    "element_id": "9d3f44c51fe13ebdf6b9511859e4f1b7",
+    "element_id": "02146cfa4d68e86d868e99acab4f7c42",
     "metadata": {
       "data_source": {},
       "filetype": "application/pdf",
@@ -813,7 +813,7 @@
       ],
       "page_number": 3
     },
-    "text": "For each instance size ðm; nÞ, Table 1 provides the average of the number of locations, the number of times, the number of vehicles, and the number of possible empty travels, over five instances. The number of locations includes m distinct locations for depots and the number of locations at which various trips start or end. The number of times includes the start and the end time of the planning horizon and the start/end times for the trips. The number of vehicles is the total number of vehicles from all the depots. The number of possible empty travels is the number of possible connections between trips that require a vehicle travelling empty between two consecutive trips in a schedule.",
+    "text": "For each instance size ðm;nÞ, Table 1 provides the average of the number of locations, the number of times, the number of vehicles, and the number of possible empty travels, over five instances. The number of locations includes m distinct locations for depots and the number of locations at which various trips start or end. The number of times includes the start and the end time of the planning horizon and the start/end times for the trips. The number of vehicles is the total number of vehicles from all the depots. The number of possible empty travels is the number of possible connections between trips that require a vehicle travelling empty between two consecutive trips in a schedule.",
     "type": "NarrativeText"
   },
   {
@@ -830,7 +830,7 @@
     "type": "NarrativeText"
   },
   {
-    "element_id": "d9904b5393369c5204af83b64035802a",
+    "element_id": "fc4b1e0c5bb8b330e2160f6615975401",
     "metadata": {
       "data_source": {},
       "filetype": "application/pdf",
@@ -839,7 +839,7 @@
       ],
       "page_number": 3
     },
-    "text": "The dataset also includes a program ‘GenerateInstance.cpp’ that can be used to generate new instances. The program takes three inputs, the number of depots ðmÞ, the number of trips ðnÞ, and the number of instances for each size ðm; nÞ.",
+    "text": "The dataset also includes a program ‘GenerateInstance.cpp’ that can be used to generate new instances. The program takes three inputs, the number of depots ðmÞ, the number of trips ðnÞ, and the number of instances for each size ðm;nÞ.",
     "type": "NarrativeText"
   },
   {
@@ -934,7 +934,7 @@
     "type": "NarrativeText"
   },
   {
-    "element_id": "17e17590003c0f514220c453f88da6b7",
+    "element_id": "86e18db80eab89d0556c22321732e4e7",
     "metadata": {
       "data_source": {},
       "filetype": "application/pdf",
@@ -943,7 +943,7 @@
       ],
       "page_number": 4
     },
-    "text": "Number of Number of columns in Description lines each line 1 3 The number of depots, the number of trips, and the number of locations. 1 m The number of vehicles rd at each depot d. n 4 One line for each trip, i ¼ 1; 2; …; n. Each line provides the start location ls i , the start i , the end location le time ts i and the end time te i for the corresponding trip. l l Each element, δij; where i; j A 1; 2; …; l, refers to the travel time between location i and location j.",
+    "text": "Number of Number of columns in Description lines each line 1 3 The number of depots, the number of trips, and the number of locations. 1 m The number of vehicles rd at each depot d. n 4 One line for each trip, i ¼ 1;2;…;n. Each line provides the start location ls i, the start i, the end location le time ts i and the end time te i for the corresponding trip. l l Each element, δij; where i;jA1;2;…;l, refers to the travel time between location i and location j.",
     "type": "Table"
   },
   {

test_unstructured_ingest/expected-structured-output/google-drive/recalibrating-risk-report.pdf.json

+2-2
@@ -1799,8 +1799,8 @@
   },
   {
     "type": "Image",
-    "element_id": "1b93c33208a85ba6d2a69d23babd6def",
-    "text": "25 24.6 20 18.4 e 15 10 5 4.6 2.8 0 C oal Oil Bio m ass N atural gas 0.07 Wind 0.04 H ydropo w er 0.02 S olar 0.01 N uclear ",
+    "element_id": "c0a86e51afb417a3b057d7cf101bbed6",
+    "text": "25 24.6 20 18.4 e 15 10 5 4.6 2.8 0 Coal Oil Bio m ass Natural gas 0.07 Wind 0.04 Hydropower 0.02 Solar 0.01 Nuclear ",
     "metadata": {
       "filetype": "application/pdf",
       "languages": [
