Commit 8787857

Merge branch 'main' into main
2 parents 9ac325e + 3b718ec commit 8787857

44 files changed, 1881 additions and 1342 deletions

.gitignore

Lines changed: 1 addition & 1 deletion
@@ -190,7 +190,7 @@ tags
 # Persistent undo
 [._]*.un~
 
-.DS_Store
+*.DS_Store
 
 # Ruff cache
 .ruff_cache/
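
The effect of widening the pattern can be approximated with Python's `fnmatch`. This is only a rough stand-in for gitignore matching on a bare filename: git's own matcher has extra rules for slashes and `**` that the sketch ignores, and the filenames are made up for illustration.

```python
from fnmatch import fnmatchcase

OLD_PATTERN = ".DS_Store"
NEW_PATTERN = "*.DS_Store"

# Hypothetical filenames, chosen only to illustrate the difference.
names = [".DS_Store", "backup.DS_Store", "DS_Store.txt"]


def matched(pattern: str) -> list:
    """Return the names that the glob pattern matches (basename only)."""
    return [name for name in names if fnmatchcase(name, pattern)]


old_matches = matched(OLD_PATTERN)
new_matches = matched(NEW_PATTERN)
print("old:", old_matches)
print("new:", new_matches)
```

Under this approximation the new pattern still covers a plain `.DS_Store` and additionally catches names that merely end in `.DS_Store`.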

CHANGELOG.md

Lines changed: 65 additions & 2 deletions
@@ -1,8 +1,71 @@
+## 0.16.12-dev0
+
+### Enhancements
+
+- **Prepare auto-partitioning for pluggable partitioners**. Move toward a uniform partitioner call signature so a custom or override partitioner can be registered without code changes.
+
+### Features
+
+### Fixes
+
+## 0.16.11
+
+### Enhancements
+
+- **Enhance quote standardization tests** with additional Unicode scenarios
+- **Relax table segregation rule in chunking.** Previously a `Table` element was always segregated into its own pre-chunk such that the `Table` appeared alone in a chunk or was split into multiple `TableChunk` elements, but never combined with `Text`-subtype elements. Allow table elements to be combined with other elements in the same chunk when space allows.
+- **Compute chunk length based solely on `element.text`.** Previously `.metadata.text_as_html` was also considered and since it is always longer that the text (due to HTML tag overhead) it was the effective length criterion. Remove text-as-html from the length calculation such that text-length is the sole criterion for sizing a chunk.
+
+### Features
+
+### Fixes
+
+- Fix ipv4 regex to correctly include up to three digit octets.
+
+## 0.16.10
+
+### Enhancements
+
+### Features
+
+### Fixes
+
+- **Fix original file doctype detection** from cct converted file paths for metrics calculation.
+
+## 0.16.9
+
+### Enhancements
+
+### Features
+
+### Fixes
+
+- **Fix NLTK Download** to not download from unstructured S3 Bucket
+
+## 0.16.8
+
+### Enhancements
+- **Metrics: Weighted table average is optional**
+
+### Features
+
+### Fixes
+
+## 0.16.7
+
+### Enhancements
+- **Add image_alt_mode to partition_html** Adds an `image_alt_mode` parameter to `partition_html()` to control how alt text is extracted from images in HTML documents for `html_parser_version=v2` . The parameter can be set to `to_text` to extract alt text as text from `<img>` html tags
+
+### Features
+
+### Fixes
+
+
 ## 0.16.6
 
 ### Enhancements
-- **Every <table> tag is considered to be ontology.Table** Added special handling for tables in HTML partitioning. This change is made to improve the accuracy of table extraction from HTML documents.
-- **Every HTML has default ontology class assigned** When parsing HTML to ontology each defined HTML in the Ontology has assigned default ontology class. This way it is possible to assign ontology class instead of UncategorizedText when the HTML tag is predicted correctly without class assigned class
+- **Every `<table>` tag is considered to be ontology.Table** Added special handling for tables in HTML partitioning (`html_parser_version=v2`. This change is made to improve the accuracy of table extraction from HTML documents.
+- **Every HTML has default ontology class assigned** When parsing HTML with `html_parser_version=v2` to ontology each defined HTML in the Ontology has assigned default ontology class. This way it is possible to assign ontology class instead of UncategorizedText when the HTML tag is predicted correctly without class assigned class
 - **Use (number of actual table) weighted average for table metrics** In evaluating table metrics the mean aggregation now uses the actual number of tables in a document to weight the metric scores
 
 ### Features
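
The 0.16.11 entry on chunk length can be illustrated with a small sketch. It is not the library's implementation: `FakeElement`, `length_before`, and `length_after` are hypothetical names, and the sketch assumes the old criterion effectively took the larger of the text length and the HTML length, which is how the changelog entry reads.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeElement:
    """Hypothetical stand-in for an element; not the library's class."""

    text: str
    text_as_html: Optional[str] = None


def length_before(element: FakeElement) -> int:
    # Assumed old behaviour: the longer of text and HTML governs the size.
    html_len = len(element.text_as_html or "")
    return max(len(element.text), html_len)


def length_after(element: FakeElement) -> int:
    # New behaviour per the changelog entry: only the text counts.
    return len(element.text)


table = FakeElement(
    text="a b",
    text_as_html="<table><tr><td>a</td><td>b</td></tr></table>",
)
print(length_before(table), length_after(table))
```

Because the HTML carries tag overhead, the assumed old length is dominated by `text_as_html`, while the new length depends on the text alone, leaving more room to combine a table with neighbouring elements.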

Makefile

Lines changed: 4 additions & 0 deletions
@@ -198,6 +198,10 @@ test-extra-pypandoc:
 test-extra-xlsx:
 	PYTHONPATH=. CI=$(CI) ${PYTHON} -m pytest test_unstructured/partition/test_xlsx.py
 
+.PHONY: test-text-extraction-evaluate
+test-text-extraction-evaluate:
+	PYTHONPATH=. CI=$(CI) ${PYTHON} -m pytest test_unstructured/metrics/test_text_extraction.py
+
 ## check: runs linters (includes tests)
 .PHONY: check
 check: check-ruff check-black check-flake8 check-version
Lines changed: 146 additions & 0 deletions
@@ -0,0 +1,146 @@
+# pyright: reportPrivateUsage=false
+
+"""
+Script to render HTML from unstructured elements.
+NOTE: This script is not intended to be used as a module.
+NOTE: For now script is only intended to be used with elements generated with
+`partition_html(html_parser_version=v2)`
+TODO: It was noted that unstructured_elements_to_ontology func always returns a single page
+This script is using helper functions to handle multiple pages.
+"""
+
+import argparse
+import logging
+import os
+import select
+import sys
+from collections import defaultdict
+from typing import List, Sequence
+
+from bs4 import BeautifulSoup
+
+from unstructured.documents import elements
+from unstructured.partition.html.transformations import unstructured_elements_to_ontology
+from unstructured.staging.base import elements_from_json
+
+# Configure logging
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
+logger = logging.getLogger(__name__)
+
+
+def extract_document_div(html_content: str) -> str:
+    pos = html_content.find(">")
+    if pos != -1:
+        return html_content[: pos + 1]
+    logger.error("No '>' found in the HTML content.")
+    raise ValueError("No '>' found in the HTML content.")
+
+
+def extract_page_div(html_content: str) -> str:
+    soup = BeautifulSoup(html_content, "html.parser")
+    page_divs = soup.find_all("div", class_="Page")
+    if len(page_divs) != 1:
+        logger.error(
+            "Expected exactly one <div> element with class 'Page'. Found %d.", len(page_divs)
+        )
+        raise ValueError("Expected exactly one <div> element with class 'Page'.")
+    return str(page_divs[0])
+
+
+def fold_document_div(
+    html_document_start: str, html_document_end: str, html_per_page: List[str]
+) -> str:
+    html_document = html_document_start
+    for page_html in html_per_page:
+        html_document += page_html
+    html_document += html_document_end
+    return html_document
+
+
+def group_elements_by_page(
+    unstructured_elements: Sequence[elements.Element],
+) -> Sequence[Sequence[elements.Element]]:
+    pages_dict = defaultdict(list)
+
+    for element in unstructured_elements:
+        page_number = element.metadata.page_number
+        pages_dict[page_number].append(element)
+
+    pages_list = list(pages_dict.values())
+    return pages_list
+
+
+def rendered_html(*, filepath: str | None = None, text: str | None = None) -> str:
+    """Renders HTML from a JSON file with unstructured elements.
+
+    Args:
+        filepath (str): path to JSON file with unstructured elements.
+
+    Returns:
+        str: HTML content.
+    """
+    if filepath is None and text is None:
+        logger.error("Either filepath or text must be provided.")
+        raise ValueError("Either filepath or text must be provided.")
+    if filepath is not None and text is not None:
+        logger.error("Both filepath and text cannot be provided.")
+        raise ValueError("Both filepath and text cannot be provided.")
+    if filepath is not None:
+        logger.info("Rendering HTML from file: %s", filepath)
+    else:
+        logger.info("Rendering HTML from text.")
+
+    unstructured_elements = elements_from_json(filename=filepath, text=text)
+    unstructured_elements_per_page = group_elements_by_page(unstructured_elements)
+    # parsed_ontology = unstructured_elements_to_ontology(unstructured_elements)
+    parsed_ontology_per_page = [
+        unstructured_elements_to_ontology(elements) for elements in unstructured_elements_per_page
+    ]
+    html_per_page = [parsed_ontology.to_html() for parsed_ontology in parsed_ontology_per_page]
+
+    html_document_start = extract_document_div(html_per_page[0])
+    html_document_end = "</div>"
+    html_per_page = [extract_page_div(page) for page in html_per_page]
+
+    return fold_document_div(html_document_start, html_document_end, html_per_page)
+
+
+def _main():
+    if os.getenv("PROCESS_FROM_STDIN") == "true":
+        logger.info("Processing from STDIN (PROCESS_FROM_STDIN is set to 'true')")
+        if select.select([sys.stdin], [], [], 0.1)[0]:
+            content = sys.stdin.read()
+            html = rendered_html(text=content)
+            sys.stdout.write(html)
+        else:
+            logger.error("No input provided via STDIN. Exiting.")
+            sys.exit(1)
+    else:
+        logger.info("Processing from command line arguments")
+        parser = argparse.ArgumentParser(description="Render HTML from unstructured elements.")
+        parser.add_argument(
+            "filepath", help="Path to JSON file with unstructured elements.", type=str
+        )
+        parser.add_argument(
+            "--outdir",
+            help="Path to directory where the rendered html will be stored.",
+            type=str,
+            default=None,
+            nargs="?",
+        )
+        args = parser.parse_args()
+
+        html = rendered_html(filepath=args.filepath)
+        if args.outdir is None:
+            args.outdir = os.path.dirname(args.filepath)
+        os.makedirs(args.outdir, exist_ok=True)
+        outpath = os.path.join(
+            args.outdir, os.path.basename(args.filepath).replace(".json", ".html")
+        )
+        with open(outpath, "w") as f:
+            f.write(html)
+        logger.info("HTML rendered and saved to: %s", outpath)
+
+
+if __name__ == "__main__":
+    _main()
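
The group-then-fold flow of the script above can be tried without installing `unstructured`. The sketch below is a simplified stand-in, not the script itself: plain dictionaries replace element objects, the `Document` class name on the outer `<div>` is an assumption, and `group_by_page` and `fold` are hypothetical helper names.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_page(records: List[dict]) -> List[List[dict]]:
    """Group records by 'page_number', keeping first-seen page order."""
    pages: Dict[int, List[dict]] = defaultdict(list)
    for record in records:
        pages[record["page_number"]].append(record)
    return list(pages.values())


def fold(document_start: str, pages_html: List[str], document_end: str = "</div>") -> str:
    """Wrap the per-page fragments in a single outer document element."""
    return document_start + "".join(pages_html) + document_end


# Hypothetical records standing in for unstructured elements.
records = [
    {"page_number": 1, "text": "Intro"},
    {"page_number": 2, "text": "Details"},
    {"page_number": 1, "text": "More intro"},
]

grouped = group_by_page(records)
pages_html = [
    '<div class="Page">' + " ".join(r["text"] for r in page) + "</div>"
    for page in grouped
]
html = fold('<div class="Document">', pages_html)
print(html)
```

Grouping relies on dictionary insertion order, so pages appear in the order their first element is seen rather than sorted by page number; the same holds for `group_elements_by_page` in the script.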

scripts/user/u-tables-inspect.sh

Lines changed: 2 additions & 0 deletions
@@ -45,6 +45,8 @@ jq -c '.[] | select(.type == "Table") | .metadata.text_as_html' "$JSON_FILE" | w
 HTML_CONTENT=${HTML_CONTENT%\"}
 # add a border and padding to clearly see cell definition
 # shellcheck disable=SC2001
+HTML_CONTENT=$(echo "$HTML_CONTENT" | sed 's/<table /<table border="1" cellpadding="10" /')
+# shellcheck disable=SC2001
 HTML_CONTENT=$(echo "$HTML_CONTENT" | sed 's/<table>/<table border="1" cellpadding="10">/')
 # add newlines for readability in the html
 # shellcheck disable=SC2001
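
The order of the two `sed` substitutions matters, which a Python sketch of the same logic makes easier to see. It assumes the table HTML sits on a single line, since `sed` without the `g` flag replaces the first match per line while `str.replace(..., 1)` replaces the first match in the whole string. `add_table_border` is a hypothetical helper name.

```python
BORDER_ATTRS = 'border="1" cellpadding="10"'


def add_table_border(html: str) -> str:
    """Mirror the two sed substitutions: first match only, same order."""
    # 1) <table with existing attributes: insert the new attributes first.
    html = html.replace("<table ", f"<table {BORDER_ATTRS} ", 1)
    # 2) Bare <table>: untouched by step 1, which only matches "<table ".
    html = html.replace("<table>", f"<table {BORDER_ATTRS}>", 1)
    return html


bare = add_table_border("<table><tr></tr></table>")
with_attrs = add_table_border('<table class="x"></table>')
print(bare)
print(with_attrs)
```

Running the bare `<table>` substitution first would produce a tag that begins with `<table `, so the attribute-preserving substitution would then match it and add the border attributes a second time.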
