Commit 9e5ff22
fix: Correctly patch pdfminer to avoid unnecessarily and unsuccessfully repairing PDFs with long content streams, causing needless and endless OCR (#3822)
Fixes: #3815
Verified on my very large documents that it doesn't unnecessarily and
unsuccessfully "repair" them.
You may or may not wish to keep the version check in `patch_psparser`.
Since ~you're pinning the version of pdfminer.six and since it isn't
guaranteed that the bug in question will be fixed in the next
pdfminer.six release (but it is rather serious, so I should hope so),
then perhaps you just want to unconditionally patch it.~ it seems like
pinning of versions is only operative when running from Docker (good!)
so never mind! Keep that version check!
Also corrected an import so that if you do feel like using a newer
version of pdfminer.six, it won't break on you.
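
A minimal sketch of the version-gated patching idea discussed above, not the code from this commit. Every name here (`patch_if_affected`, `affected_versions`, the stand-in `fake_lib` and `newer_lib` objects, `fixed_parse`) is hypothetical, and the stand-ins only mimic a library exposing `__version__`; nothing below calls the real pdfminer.six API.

```python
from types import SimpleNamespace


def patch_if_affected(module, attr: str, replacement, affected_versions: set[str]) -> bool:
    """Replace module.<attr> only when module.__version__ is a known-affected release.

    Returns True when the patch was applied, False when it was skipped.
    """
    version = getattr(module, "__version__", None)
    if version not in affected_versions:
        # Unknown or newer release: leave the upstream behaviour untouched.
        return False
    setattr(module, attr, replacement)
    return True


def fixed_parse(data):
    """Hypothetical replacement for the buggy function."""
    return "fixed"


# Stand-ins for a third-party library at two versions (assumed, illustrative only).
fake_lib = SimpleNamespace(__version__="1.0.0", parse=lambda data: "buggy")
newer_lib = SimpleNamespace(__version__="1.1.0", parse=lambda data: "upstream")

applied_old = patch_if_affected(fake_lib, "parse", fixed_parse, {"1.0.0"})
applied_new = patch_if_affected(newer_lib, "parse", fixed_parse, {"1.0.0"})
```

Gating on an explicit set of affected versions means an unpinned, newer install keeps its upstream behaviour, which is the trade-off the message weighs against patching unconditionally.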
---------
Authored-by: David Huggins-Daines <[email protected]>
1 parent e230364 commit 9e5ff22
File tree
6 files changed, +87 -18 lines changed
- test_unstructured/partition/pdf_image
- unstructured
- partition
- pdf_image
- patches