5 changes: 3 additions & 2 deletions CHANGELOG.md
@@ -1,13 +1,14 @@
-## 0.16.19-dev2
+## 0.16.19-dev3

 ### Enhancements

 ### Features

 ### Fixes
-- **fix a bug where table extraction is skipped when it shouldn't**. Pages with just one table as its content or starts with a table misses table extraction. The routing logic is now fixed.
+- **Fix a bug where table extraction is skipped when it shouldn't**. Pages with just one table as its content or starts with a table misses table extraction. The routing logic is now fixed.
 - **Correct deprecated `ruff` invocation in `make tidy`**. This will future-proof it or avoid surprises if someone happens to upgrade Ruff.
 - **Remove upper bound constraint on python version** in setup.py. Python3.13 is not yet officially supported, but allow users to try.
+- **Fixes removing HTML elements from the inside of table cells** in html partition v=2.0. The HTML partitioner now correctly preserves HTML elements from the inside of table cells.

 ## 0.16.17

52 changes: 46 additions & 6 deletions test_unstructured/documents/unstructured_json_output/example.json
@@ -4,6 +4,10 @@
 "metadata": {
 "category_depth": 0,
 "filename": "example.pdf",
+"filetype": "text/html",
+"languages": [
+"eng"
+],
 "page_number": 1,
 "parent_id": "897a8a47377c4ad6aab839a929879537",
 "text_as_html": "<div class=\"Page\" data-page-number=\"1\" id=\"3a6b156a81764e17be128264241f8136\" />"
@@ -16,6 +20,10 @@
 "metadata": {
 "category_depth": 1,
 "filename": "example.pdf",
+"filetype": "text/html",
+"languages": [
+"eng"
+],
 "page_number": 1,
 "parent_id": "3a6b156a81764e17be128264241f8136",
 "text_as_html": "<header class=\"Header\" id=\"45b3d0053468484ba1c7b53998115412\" />"
@@ -28,9 +36,13 @@
 "metadata": {
 "category_depth": 2,
 "filename": "example.pdf",
+"filetype": "text/html",
+"languages": [
+"eng"
+],
 "page_number": 1,
 "parent_id": "45b3d0053468484ba1c7b53998115412",
-"text_as_html": "<h1 class=\"Title\" id=\"c95473e8a3704fc2b418697f9fddb27b\">Header </h1>"
+"text_as_html": "<h1 class=\"Title\" id=\"c95473e8a3704fc2b418697f9fddb27b\">Header</h1>"
 },
 "text": "Header",
 "type": "Title"
@@ -40,9 +52,13 @@
 "metadata": {
 "category_depth": 2,
 "filename": "example.pdf",
+"filetype": "text/html",
+"languages": [
+"eng"
+],
 "page_number": 1,
 "parent_id": "45b3d0053468484ba1c7b53998115412",
-"text_as_html": "<time class=\"CalendarDate\" id=\"379cbfdc16d44bd6a59e6cfabe6438d5\">Date: October 30, 2023 </time>"
+"text_as_html": "<time class=\"CalendarDate\" id=\"379cbfdc16d44bd6a59e6cfabe6438d5\">Date: October 30, 2023</time>"
 },
 "text": "Date: October 30, 2023",
 "type": "UncategorizedText"
@@ -52,9 +68,13 @@
 "metadata": {
 "category_depth": 1,
 "filename": "example.pdf",
+"filetype": "text/html",
+"languages": [
+"eng"
+],
 "page_number": 1,
 "parent_id": "3a6b156a81764e17be128264241f8136",
-"text_as_html": "<form class=\"Form\" id=\"637c2f6935fb4353a5f73025ce04619d\"> <label class=\"FormField\" for=\"company-name\" id=\"50027cccbe1948c9853ce0de37b635c2\">From field name </label><input class=\"FormFieldValue\" id=\"0032242af75c4b37984ea7fea9aac74c\" value=\"Example value\" /></form>"
+"text_as_html": "<form class=\"Form\" id=\"637c2f6935fb4353a5f73025ce04619d\"><label class=\"FormField\" for=\"company-name\" id=\"50027cccbe1948c9853ce0de37b635c2\">From field name</label><input class=\"FormFieldValue\" id=\"0032242af75c4b37984ea7fea9aac74c\" value=\"Example value\" /></form>"
 },
 "text": "From field name Example value",
 "type": "UncategorizedText"
@@ -64,6 +84,10 @@
 "metadata": {
 "category_depth": 1,
 "filename": "example.pdf",
+"filetype": "text/html",
+"languages": [
+"eng"
+],
 "page_number": 1,
 "parent_id": "3a6b156a81764e17be128264241f8136",
 "text_as_html": "<section class=\"Section\" id=\"592422373ed741b68a077e2003f8ed81\" />"
@@ -76,9 +100,13 @@
 "metadata": {
 "category_depth": 2,
 "filename": "example.pdf",
+"filetype": "text/html",
+"languages": [
+"eng"
+],
 "page_number": 1,
 "parent_id": "592422373ed741b68a077e2003f8ed81",
-"text_as_html": "<table class=\"Table\" id=\"dc3792d4422e444f90876b56d0cfb20d\"> <thead> <tr> <th>Description</th><th>Row header</th></tr></thead><tbody> <tr> <td>Value description</td><td>50 $ (1.32 %)</td></tr></tbody></table>"
+"text_as_html": "<table class=\"Table\" id=\"dc3792d4422e444f90876b56d0cfb20d\"><thead><tr><th>Description</th><th>Row header</th></tr></thead><tbody><tr><td>Value description</td><td><span>50 $</span><span>(1.32 %)</span></td></tr></tbody></table>"
 },
 "text": "Description Row header Value description 50 $ (1.32 %)",
 "type": "Table"
@@ -88,6 +116,10 @@
 "metadata": {
 "category_depth": 1,
 "filename": "example.pdf",
+"filetype": "text/html",
+"languages": [
+"eng"
+],
 "page_number": 1,
 "parent_id": "3a6b156a81764e17be128264241f8136",
 "text_as_html": "<section class=\"Section\" id=\"1032242af75c4b37984ea7fea9aac74c\" />"
@@ -100,9 +132,13 @@
 "metadata": {
 "category_depth": 2,
 "filename": "example.pdf",
+"filetype": "text/html",
+"languages": [
+"eng"
+],
 "page_number": 1,
 "parent_id": "1032242af75c4b37984ea7fea9aac74c",
-"text_as_html": "<h2 class=\"Subtitle\" id=\"2a4e2c4a689f4f9a8c180b6b521e45c3\">2. Subtitle </h2>"
+"text_as_html": "<h2 class=\"Subtitle\" id=\"2a4e2c4a689f4f9a8c180b6b521e45c3\">2. Subtitle</h2>"
 },
 "text": "2. Subtitle",
 "type": "Title"
@@ -112,9 +148,13 @@
 "metadata": {
 "category_depth": 2,
 "filename": "example.pdf",
+"filetype": "text/html",
+"languages": [
+"eng"
+],
 "page_number": 1,
 "parent_id": "1032242af75c4b37984ea7fea9aac74c",
-"text_as_html": "<p class=\"NarrativeText\" id=\"5591f7a4df01447e82515ce45f686fbe\">Paragraph text </p>"
+"text_as_html": "<p class=\"NarrativeText\" id=\"5591f7a4df01447e82515ce45f686fbe\">Paragraph text</p>"
 },
 "text": "Paragraph text",
 "type": "NarrativeText"
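Reviewer note: the updated `example.json` fixture keeps the nested `<span>` elements inside the table cell while the element's `text` field is unchanged. A minimal sketch of how that relationship could be sanity-checked, using only the Python standard library. The `visible_text` helper and `_TextCollector` class are hypothetical names introduced for illustration; they are not part of `unstructured` and are not how the partitioner itself derives `text`.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the non-empty text nodes of an HTML fragment in document order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Ignore whitespace-only nodes; trim padding around real text.
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def visible_text(html_fragment: str) -> str:
    """Hypothetical helper: join the fragment's text nodes with single spaces."""
    collector = _TextCollector()
    collector.feed(html_fragment)
    collector.close()
    return " ".join(collector.chunks)


# Values copied from the Table element in the updated fixture.
table_html = (
    '<table class="Table" id="dc3792d4422e444f90876b56d0cfb20d"><thead><tr>'
    "<th>Description</th><th>Row header</th></tr></thead><tbody><tr>"
    "<td>Value description</td><td><span>50 $</span><span>(1.32 %)</span></td>"
    "</tr></tbody></table>"
)
expected_text = "Description Row header Value description 50 $ (1.32 %)"

assert visible_text(table_html) == expected_text
```

The same check passes for the pre-change markup as well (for example `Header </h1>` with its trailing space), since the helper trims each text node, so it only confirms that the preserved `<span>` elements do not alter the flattened text.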