Commit d64c57d
feat: consider rotated text as low fidelity (#4190)
This PR updates the function `is_text_embedded`:
- it now considers whether chars are invisible or rotated (and, as a
result, includes some refactoring of variable names)
- rotated text elements can have a character order that differs from the
natural reading order -> if fed into downstream applications such as
text embedding, the element loses its semantic meaning
- as a result, this update flags text with too many rotated characters
as only partially embedded: its source is technically embedded, but it
may need post-processing to be useful
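
For illustration only, the check described above can be sketched as follows. This is not the code from this commit: the `Char` class, the `is_rotated` helper, the function name `is_text_embedded_sketch`, the 0.5 threshold, and the handling of empty input are all assumptions made for the example. The only outside fact relied on is that a PDF text matrix `(a, b, c, d, e, f)` encodes rotation, with the angle given by `atan2(b, a)`.

```python
import math
from dataclasses import dataclass


@dataclass
class Char:
    """Hypothetical stand-in for a character extracted from a PDF."""

    text: str
    matrix: tuple  # PDF text matrix (a, b, c, d, e, f)
    invisible: bool = False


def is_rotated(char: Char, tol_degrees: float = 1.0) -> bool:
    """Return True if the text matrix rotates the glyph by more than the tolerance."""
    a, b = char.matrix[0], char.matrix[1]
    angle = math.degrees(math.atan2(b, a))
    return abs(angle) > tol_degrees


def is_text_embedded_sketch(chars: list, threshold: float = 0.5) -> bool:
    """Treat text as embedded only if few characters are invisible or rotated.

    The threshold value and the empty-input behavior are assumptions.
    """
    if not chars:
        return False
    low_fidelity = sum(1 for c in chars if c.invisible or is_rotated(c))
    return low_fidelity / len(chars) < threshold


upright = Char("a", (1, 0, 0, 1, 0, 0))
rotated = Char("b", (0, 1, -1, 0, 0, 0))  # rotated by 90 degrees

mostly_upright = [upright] * 9 + [rotated]
mostly_rotated = [rotated] * 3 + [upright]
print(is_text_embedded_sketch(mostly_upright))
print(is_text_embedded_sketch(mostly_rotated))
```

Counting invisible and rotated characters together as one "low fidelity" ratio mirrors the description in the commit message; whether the real function uses a single ratio or separate ones is not stated here.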
---------
Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: badGarnet <[email protected]>
1 parent 138661a commit d64c57d
File tree
6 files changed, +48 -29 lines changed
- test_unstructured_ingest/expected-structured-output/local-single-file-with-pdf-infer-table-structure
- test_unstructured/partition/pdf_image
- unstructured
- partition
- pdf_image
- utils
Lines changed: 8 additions & 5 deletions
Lines changed: 1 addition & 9 deletions