Commit 5bb95b5

Fix parsing table cells (#3904)
This PR:
- Fixes the removal of HTML tags that exist inside `<td>` cells.
- Applies whitespace stripping everywhere in the same consistent way. A stripping function limited to table cells was hard to implement in an easy and straightforward way (you can't modify `descendants` in place), so instead of patching only table cells I added stripping everywhere. This is why some tests needed small edits that remove one whitespace character in each tag. I believe this won't cause any problems for downstream tasks.

Tested HTML:
```html
<table class="Table">
  <tbody>
    <tr>
      <td colspan="2">
        Some text
      </td>
      <td>
        <input checked="" class="Checkbox" type="checkbox"/>
      </td>
    </tr>
  </tbody>
</table>
```

Before & After
```html
'<table class="Table" id="..."> <tbody> <tr> <td colspan="2">Some text</td><td></td></tr></tbody></table>'
'<table class="Table" id="..."><tbody><tr><td colspan="2">Some text</td><td><input checked="" type="checkbox"/></td></tr></tbody></table>'
```
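The "strip everywhere" idea described above can be illustrated with a minimal sketch. It is not the code from this PR: it uses the standard library's `xml.etree.ElementTree` as a stand-in parser (an assumption made so the example is self-contained), and `strip_whitespace` and `TESTED_HTML` are hypothetical names. It only shows that trimming every text node uniformly removes inter-tag whitespace while leaving child elements such as `<input>` inside the cell.

```python
import xml.etree.ElementTree as ET

# The tested HTML from the commit message (well-formed, so ElementTree can parse it).
TESTED_HTML = """
<table class="Table">
  <tbody>
    <tr>
      <td colspan="2">
        Some text
      </td>
      <td>
        <input checked="" class="Checkbox" type="checkbox"/>
      </td>
    </tr>
  </tbody>
</table>
"""


def strip_whitespace(root):
    """Trim surrounding whitespace from every text and tail node (hypothetical helper)."""
    # Take a snapshot of the nodes first, then edit them, rather than
    # editing while walking the tree.
    for node in list(root.iter()):
        # Whitespace-only strings become None so they are not serialized.
        node.text = (node.text or "").strip() or None
        node.tail = (node.tail or "").strip() or None
    return root


root = ET.fromstring(TESTED_HTML.strip())
cleaned = ET.tostring(strip_whitespace(root), encoding="unicode")
print(cleaned)
```

Note that trimming tails this aggressively also removes spaces between inline siblings, which is consistent with the small whitespace changes the tests needed, but a real implementation may treat inline text more carefully.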
1 parent 451ad97 commit 5bb95b5

11 files changed: +581 -120 lines changed

CHANGELOG.md

+3 -2
@@ -1,13 +1,14 @@
-## 0.16.19-dev2
+## 0.16.19-dev3
 
 ### Enhancements
 
 ### Features
 
 ### Fixes
-- **fix a bug where table extraction is skipped when it shouldn't**. Pages with just one table as its content or starts with a table misses table extraction. The routing logic is now fixed.
+- **Fix a bug where table extraction is skipped when it shouldn't**. Pages with just one table as its content or starts with a table misses table extraction. The routing logic is now fixed.
 - **Correct deprecated `ruff` invocation in `make tidy`**. This will future-proof it or avoid surprises if someone happens to upgrade Ruff.
 - **Remove upper bound constraint on python version** in setup.py. Python3.13 is not yet officially supported, but allow users to try.
+- **Fixes removing HTML elements from the inside of table cells** in html partition v=2.0. The HTML partitioner now correctly preserves HTML elements from the inside of table cells.
 
 ## 0.16.17
 
test_unstructured/documents/unstructured_json_output/example.json

+46 -6
@@ -4,6 +4,10 @@
 "metadata": {
 "category_depth": 0,
 "filename": "example.pdf",
+"filetype": "text/html",
+"languages": [
+"eng"
+],
 "page_number": 1,
 "parent_id": "897a8a47377c4ad6aab839a929879537",
 "text_as_html": "<div class=\"Page\" data-page-number=\"1\" id=\"3a6b156a81764e17be128264241f8136\" />"
@@ -16,6 +20,10 @@
 "metadata": {
 "category_depth": 1,
 "filename": "example.pdf",
+"filetype": "text/html",
+"languages": [
+"eng"
+],
 "page_number": 1,
 "parent_id": "3a6b156a81764e17be128264241f8136",
 "text_as_html": "<header class=\"Header\" id=\"45b3d0053468484ba1c7b53998115412\" />"
@@ -28,9 +36,13 @@
 "metadata": {
 "category_depth": 2,
 "filename": "example.pdf",
+"filetype": "text/html",
+"languages": [
+"eng"
+],
 "page_number": 1,
 "parent_id": "45b3d0053468484ba1c7b53998115412",
-"text_as_html": "<h1 class=\"Title\" id=\"c95473e8a3704fc2b418697f9fddb27b\">Header </h1>"
+"text_as_html": "<h1 class=\"Title\" id=\"c95473e8a3704fc2b418697f9fddb27b\">Header</h1>"
 },
 "text": "Header",
 "type": "Title"
@@ -40,9 +52,13 @@
 "metadata": {
 "category_depth": 2,
 "filename": "example.pdf",
+"filetype": "text/html",
+"languages": [
+"eng"
+],
 "page_number": 1,
 "parent_id": "45b3d0053468484ba1c7b53998115412",
-"text_as_html": "<time class=\"CalendarDate\" id=\"379cbfdc16d44bd6a59e6cfabe6438d5\">Date: October 30, 2023 </time>"
+"text_as_html": "<time class=\"CalendarDate\" id=\"379cbfdc16d44bd6a59e6cfabe6438d5\">Date: October 30, 2023</time>"
 },
 "text": "Date: October 30, 2023",
 "type": "UncategorizedText"
@@ -52,9 +68,13 @@
 "metadata": {
 "category_depth": 1,
 "filename": "example.pdf",
+"filetype": "text/html",
+"languages": [
+"eng"
+],
 "page_number": 1,
 "parent_id": "3a6b156a81764e17be128264241f8136",
-"text_as_html": "<form class=\"Form\" id=\"637c2f6935fb4353a5f73025ce04619d\"> <label class=\"FormField\" for=\"company-name\" id=\"50027cccbe1948c9853ce0de37b635c2\">From field name </label><input class=\"FormFieldValue\" id=\"0032242af75c4b37984ea7fea9aac74c\" value=\"Example value\" /></form>"
+"text_as_html": "<form class=\"Form\" id=\"637c2f6935fb4353a5f73025ce04619d\"><label class=\"FormField\" for=\"company-name\" id=\"50027cccbe1948c9853ce0de37b635c2\">From field name</label><input class=\"FormFieldValue\" id=\"0032242af75c4b37984ea7fea9aac74c\" value=\"Example value\" /></form>"
 },
 "text": "From field name Example value",
 "type": "UncategorizedText"
@@ -64,6 +84,10 @@
 "metadata": {
 "category_depth": 1,
 "filename": "example.pdf",
+"filetype": "text/html",
+"languages": [
+"eng"
+],
 "page_number": 1,
 "parent_id": "3a6b156a81764e17be128264241f8136",
 "text_as_html": "<section class=\"Section\" id=\"592422373ed741b68a077e2003f8ed81\" />"
@@ -76,9 +100,13 @@
 "metadata": {
 "category_depth": 2,
 "filename": "example.pdf",
+"filetype": "text/html",
+"languages": [
+"eng"
+],
 "page_number": 1,
 "parent_id": "592422373ed741b68a077e2003f8ed81",
-"text_as_html": "<table class=\"Table\" id=\"dc3792d4422e444f90876b56d0cfb20d\"> <thead> <tr> <th>Description</th><th>Row header</th></tr></thead><tbody> <tr> <td>Value description</td><td>50 $ (1.32 %)</td></tr></tbody></table>"
+"text_as_html": "<table class=\"Table\" id=\"dc3792d4422e444f90876b56d0cfb20d\"><thead><tr><th>Description</th><th>Row header</th></tr></thead><tbody><tr><td>Value description</td><td><span>50 $</span><span>(1.32 %)</span></td></tr></tbody></table>"
 },
 "text": "Description Row header Value description 50 $ (1.32 %)",
 "type": "Table"
@@ -88,6 +116,10 @@
 "metadata": {
 "category_depth": 1,
 "filename": "example.pdf",
+"filetype": "text/html",
+"languages": [
+"eng"
+],
 "page_number": 1,
 "parent_id": "3a6b156a81764e17be128264241f8136",
 "text_as_html": "<section class=\"Section\" id=\"1032242af75c4b37984ea7fea9aac74c\" />"
@@ -100,9 +132,13 @@
 "metadata": {
 "category_depth": 2,
 "filename": "example.pdf",
+"filetype": "text/html",
+"languages": [
+"eng"
+],
 "page_number": 1,
 "parent_id": "1032242af75c4b37984ea7fea9aac74c",
-"text_as_html": "<h2 class=\"Subtitle\" id=\"2a4e2c4a689f4f9a8c180b6b521e45c3\">2. Subtitle </h2>"
+"text_as_html": "<h2 class=\"Subtitle\" id=\"2a4e2c4a689f4f9a8c180b6b521e45c3\">2. Subtitle</h2>"
 },
 "text": "2. Subtitle",
 "type": "Title"
@@ -112,9 +148,13 @@
 "metadata": {
 "category_depth": 2,
 "filename": "example.pdf",
+"filetype": "text/html",
+"languages": [
+"eng"
+],
 "page_number": 1,
 "parent_id": "1032242af75c4b37984ea7fea9aac74c",
-"text_as_html": "<p class=\"NarrativeText\" id=\"5591f7a4df01447e82515ce45f686fbe\">Paragraph text </p>"
+"text_as_html": "<p class=\"NarrativeText\" id=\"5591f7a4df01447e82515ce45f686fbe\">Paragraph text</p>"
 },
 "text": "Paragraph text",
 "type": "NarrativeText"

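In the updated fixture, each element's `text` field still matches the visible text of its `text_as_html`, even though the table cell now keeps its `<span>` children. The sketch below is one way to check that relationship for the Table element shown in the diff. It is an illustration only, not how the library derives `text`: `TextCollector` and `visible_text` are hypothetical names, and joining the stripped text chunks with single spaces is an assumption that happens to hold for this fixture.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect non-empty, stripped text chunks from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def visible_text(fragment):
    """Return the fragment's text chunks joined by single spaces (assumed convention)."""
    parser = TextCollector()
    parser.feed(fragment)
    parser.close()
    return " ".join(parser.chunks)


# The Table element from the updated example.json, reduced to the two fields compared here.
table_element = {
    "text": "Description Row header Value description 50 $ (1.32 %)",
    "text_as_html": (
        '<table class="Table" id="dc3792d4422e444f90876b56d0cfb20d">'
        "<thead><tr><th>Description</th><th>Row header</th></tr></thead>"
        "<tbody><tr><td>Value description</td>"
        "<td><span>50 $</span><span>(1.32 %)</span></td></tr></tbody></table>"
    ),
}

extracted = visible_text(table_element["text_as_html"])
print(extracted == table_element["text"])
```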