Commit 344202f
feat: detect language for PDFs (#4051)
The `@apply_metadata` decorator already contains logic to detect the
language of element text (at either the document or the element level).
Update PDF partitioning, and later image partitioning, to use this
decorator so that each element reports accurate languages.
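As a rough illustration of the pattern described above, the following minimal sketch shows how a decorator can post-process a partitioner's output to attach languages at either the document or the element level. Everything here is hypothetical: `Element`, `toy_detect`, `apply_languages`, and `partition_stub` are stand-ins invented for this example and are not the library's actual `@apply_metadata` implementation or its detector.

```python
from dataclasses import dataclass
from functools import wraps
from typing import Callable, List, Optional


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""
    text: str
    languages: Optional[List[str]] = None


def toy_detect(text: str) -> List[str]:
    """Toy detector (assumption): flags French by a few marker words."""
    french_markers = {"le", "la", "les", "est", "et", "une", "des"}
    words = set(text.lower().split())
    return ["fra"] if words & french_markers else ["eng"]


def apply_languages(func: Callable[..., List[Element]]) -> Callable[..., List[Element]]:
    """Hypothetical decorator that fills in `languages` after partitioning."""

    @wraps(func)
    def wrapper(*args, detect_language_per_element: bool = False, **kwargs):
        elements = func(*args, **kwargs)
        if detect_language_per_element:
            # Element level: detect on each element's own text.
            for el in elements:
                el.languages = toy_detect(el.text)
        else:
            # Document level: detect once on the combined text and reuse it.
            doc_langs = toy_detect(" ".join(el.text for el in elements))
            for el in elements:
                el.languages = list(doc_langs)
        return elements

    return wrapper


@apply_languages
def partition_stub(path: str) -> List[Element]:
    """Hypothetical partitioner that ignores `path` and returns fixed elements."""
    return [Element("La base est une analyse"), Element("OLAP cube")]
```

With this sketch, calling `partition_stub("dummy.pdf")` assigns the same document-level result to every element, while passing `detect_language_per_element=True` lets short elements in another language receive their own result.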
Test
```
from unstructured.partition.auto import partition


def test_partition_pdf():
    pdf_path = "example-docs/language-docs/fr_olap.pdf"
    # Optionally pass `detect_language_per_element=True` to detect per element.
    elements = partition(pdf_path)
    print(f"Number of elements partitioned: {len(elements)}")

    # Check that elements are returned.
    assert len(elements) > 0, "No elements were partitioned from the PDF."

    # Print the detected language(s) for each element.
    for element in elements:
        print(element)
        print(element.metadata.languages)
        print("-------------------------------")


test_partition_pdf()
```
---------
Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: shreyanid <[email protected]>
1 parent 2ffaf6f commit 344202f
File tree
13 files changed (+129, -63 lines changed)
- example-docs/language-docs
- test_unstructured_ingest
- expected-structured-output
- local-single-file-with-encoding
- pdf-fast-reprocess/azure
- src
- test_unstructured/partition
- pdf_image
- unstructured
- partition
- common