Commit 8653c59
Fix html_as_text appearing in every element metadata (#319)
This PR fixes the issue described in Unstructured-IO/unstructured#2463.
With this change, `text_as_html` is only included for elements that are HTML strings (that is, elements that contain HTML tags).
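The rule above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the helper names `contains_html_tags` and `build_metadata` are hypothetical, and the regular expression is only a simple heuristic for "contains HTML tags", not a real HTML parser.

```python
import re
from typing import Any, Dict, Optional

# Heuristic: an opening or closing tag such as <table>, </td>, or <br/>.
# Assumption for illustration only; the commit may detect HTML differently.
_HTML_TAG = re.compile(r"</?[A-Za-z][A-Za-z0-9]*(\s[^<>]*)?/?>")


def contains_html_tags(text: Optional[str]) -> bool:
    """Return True if `text` holds at least one HTML-like tag."""
    return bool(text) and _HTML_TAG.search(text) is not None


def build_metadata(page_number: int, html_candidate: Optional[str]) -> Dict[str, Any]:
    """Hypothetical helper: add `text_as_html` only for HTML strings."""
    metadata: Dict[str, Any] = {"page_number": page_number}
    if contains_html_tags(html_candidate):
        metadata["text_as_html"] = html_candidate
    return metadata


# A plain-text element gets no `text_as_html` key; a table element keeps it.
plain = build_metadata(1, "10")
table = build_metadata(1, "<table><tbody><tr><td>10</td></tr></tbody></table>")
```

With this sketch, the key is omitted entirely for non-HTML elements instead of being set to an empty or plain-text value, which matches the two outputs shown in this description.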
Example output for a **non**-HTML element:
```json
{
  "element_id": "4a44dc15364204a80fe80e9039455cc1",
  "metadata": {
    "coordinates": {
      "layout_height": 3301,
      "layout_width": 2550,
      "points": [
        [170, 13],
        [170, 140],
        [427, 140],
        [427, 13]
      ],
      "system": "PixelSpace"
    },
    "file_directory": "/home/ubuntu/Documents",
    "filename": "purchasing-payment-policy-10.pdf",
    "filetype": "application/pdf",
    "languages": ["eng"],
    "last_modified": "2024-02-02T11:49:38",
    "page_number": 1,
    "parent_id": "e3b0c44298fc1c149afbf4c8996fb924"
  },
  "text": "10",
  "type": "UncategorizedText"
}
```
Example output for an HTML element:
```json
{
  "element_id": "398766f59dd6b37bd38b6d612159cd3e",
  "metadata": {
    "coordinates": {
      "layout_height": 3301,
      "layout_width": 2550,
      "points": [
        [433, 2180],
        [433, 2181],
        [2290, 2181],
        [2290, 2180]
      ],
      "system": "PixelSpace"
    },
    "file_directory": "/home/ubuntu/Documents",
    "filename": "purchasing-payment-policy-10.pdf",
    "filetype": "application/pdf",
    "languages": ["eng"],
    "last_modified": "2024-02-02T11:49:38",
    "page_number": 1,
    "text_as_html": "<table><tbody><tr><td></td><td> Subject Matter Expert / Department</td><td> Contract Review Responsibility</td><td></td></tr><tbody></table>"
  },
  "text": "Subject Matter Expert / Department Contract Review Responsibility",
  "type": "Table"
}
```
1 parent ed5f2c2 commit 8653c59
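Because `text_as_html` is no longer present on non-HTML elements, code that reads this output should probably treat the key as optional. A minimal sketch, assuming the elements are loaded from JSON shaped like the two example outputs in this description; `html_or_text` is a hypothetical helper name and the sample data is abbreviated for illustration.

```python
import json
from typing import Any, Dict


def html_or_text(element: Dict[str, Any]) -> str:
    """Prefer the HTML rendering when present, else fall back to plain text."""
    html = element.get("metadata", {}).get("text_as_html")
    return html or element.get("text", "")


# Abbreviated, made-up sample: one non-HTML element and one table element.
elements = json.loads("""
[
  {"type": "UncategorizedText", "text": "10", "metadata": {"page_number": 1}},
  {"type": "Table", "text": "A B",
   "metadata": {"page_number": 1,
                "text_as_html": "<table><tr><td>A</td><td>B</td></tr></table>"}}
]
""")

rendered = [html_or_text(e) for e in elements]
```

Using `dict.get` instead of direct indexing avoids a `KeyError` on elements such as `UncategorizedText`, which carry no `text_as_html` after this fix.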
File tree

4 files changed, +38 -8 lines changed

- test_unstructured_inference/models
- unstructured_inference
- models