Commit 4175948

Merge branch 'main' into chore/bump-inference
2 parents 3298ab3 + 6ba376a commit 4175948

4 files changed: +41 -25 lines changed

CHANGELOG.md

Lines changed: 13 additions & 2 deletions
@@ -1,8 +1,18 @@
-## 0.15.14-dev14
+## 0.15.15-dev0
 
 ### Enhancements
 
-* **Bump `unstructured-inference` to 0.7.38** and upgrade other dependencies
+* **Bump `unstructured-inference` to 0.7.39** and upgrade other dependencies
+
+### Features
+
+* **Add (but do not install) a new post-partitioning decorator to handle metadata added for all file-types, like `.filename`, `.filetype` and `.languages`.** This will be installed in a closely following PR to replace the four currently being used for this purpose.
+
+### Fixes
+
+## 0.15.14
+
+### Enhancements
 
 ### Features
 
@@ -20,6 +30,7 @@
 * **Remove obsolete min_partition/max_partition args from TXT and EML.** The legacy `min_partition` and `max_partition` parameters were an initial rough implementation of chunking but now interfere with chunking and are unused. Remove those parameters from `partition_text()` and `partition_email()`.
 * **Remove double-decoration on EML and MSG.** Refactor these partitioners to rely on the new `@apply_metadata()` decorator operating on partitioners they delegate to (TXT, HTML, and all others for attachments) and remove direct decoration from EML and MSG.
 * **Remove double-decoration for PPT.** Remove decorators from the delegating PPT partitioner.
+* **Quick-fix CI error in auto test-filetype.** Better fix to follow shortly.
 
 ## 0.15.13
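The changelog entry above describes a post-partitioning decorator that sets metadata common to all file types (`.filename`, `.filetype`, `.languages`). The commit does not show that decorator, so the following is only a minimal sketch of the general pattern, not the library's implementation. The names `Metadata`, `Element`, `apply_common_metadata`, and `partition_lines` are hypothetical stand-ins invented for illustration.

```python
import functools
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Metadata:
    """Hypothetical stand-in for per-element metadata."""

    filename: Optional[str] = None
    filetype: Optional[str] = None
    languages: List[str] = field(default_factory=list)


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""

    text: str
    metadata: Metadata = field(default_factory=Metadata)


def apply_common_metadata(filetype: str) -> Callable:
    """Hypothetical decorator factory: stamp shared metadata on every element
    after the wrapped partitioner has returned."""

    def decorator(partition_fn: Callable[..., List[Element]]) -> Callable[..., List[Element]]:
        @functools.wraps(partition_fn)
        def wrapper(filename: str, languages: Optional[List[str]] = None, **kwargs) -> List[Element]:
            elements = partition_fn(filename, **kwargs)
            for element in elements:
                # Post-partitioning step: the partitioner itself never sets these.
                element.metadata.filename = filename
                element.metadata.filetype = filetype
                element.metadata.languages = list(languages or [])
            return elements

        return wrapper

    return decorator


@apply_common_metadata(filetype="text/plain")
def partition_lines(filename: str, text: str = "") -> List[Element]:
    """Toy partitioner: one element per non-blank line of `text`."""
    return [Element(line) for line in text.splitlines() if line.strip()]
```

Under these assumptions, a delegating partitioner would stay undecorated and call the decorated one, so the metadata is written exactly once per element.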

test_unstructured/partition/test_auto.py

Lines changed: 26 additions & 22 deletions
@@ -1207,35 +1207,39 @@ def test_auto_partition_overwrites_any_filetype_applied_by_file_specific_partiti
 
 
 @pytest.mark.parametrize(
-    "file_type",
+    ("file_name", "file_type"),
     [
-        t
-        for t in FileType
-        if t
-        not in (
-            FileType.EMPTY,
-            FileType.JSON,
-            FileType.UNK,
-            FileType.WAV,
-            FileType.XLS,
-            FileType.ZIP,
-        )
-        and t.partitioner_shortname != "image"
+        ("stanley-cups.csv", FileType.CSV),
+        ("simple.doc", FileType.DOC),
+        ("simple.docx", FileType.DOCX),
+        ("fake-email.eml", FileType.EML),
+        ("simple.epub", FileType.EPUB),
+        ("fake-html.html", FileType.HTML),
+        ("README.md", FileType.MD),
+        ("fake-email.msg", FileType.MSG),
+        ("simple.odt", FileType.ODT),
+        ("pdf/DA-1p.pdf", FileType.PDF),
+        ("fake-power-point.ppt", FileType.PPT),
+        ("simple.pptx", FileType.PPTX),
+        ("README.rst", FileType.RST),
+        ("fake-doc.rtf", FileType.RTF),
+        ("stanley-cups.tsv", FileType.TSV),
+        ("fake-text.txt", FileType.TXT),
+        ("tests-example.xls", FileType.XLSX),
+        ("stanley-cups.xlsx", FileType.XLSX),
+        ("factbook.xml", FileType.XML),
     ],
 )
-def test_auto_partition_applies_the_correct_filetype_for_all_filetypes(file_type: FileType):
+def test_auto_partition_applies_the_correct_filetype_for_all_filetypes(
+    file_name: str, file_type: FileType
+):
+    file_path = example_doc_path(file_name)
     partition_fn_name = file_type.partitioner_function_name
     module = import_module(file_type.partitioner_module_qname)
     partition_fn = getattr(module, partition_fn_name)
 
-    # -- partition the first example-doc with the extension for this filetype --
-    elements: list[Element] = []
-    doc_path = example_doc_path("pdf") if file_type == FileType.PDF else example_doc_path("")
-    extensions = file_type._extensions
-    for file in pathlib.Path(doc_path).iterdir():
-        if file.is_file() and file.suffix in extensions:
-            elements = partition_fn(str(file))
-            break
+    # -- partition the example-doc for this filetype --
+    elements = partition_fn(file_path)
 
     assert elements
     assert all(
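
The hunk above replaces a directory scan, which partitioned whichever matching file `iterdir()` yielded first, with an explicit table of (file name, file type) pairs, so each case names the exact input it runs on. A minimal, self-contained sketch of that parametrization pattern follows. The `FileType` enum and `guess_filetype()` here are hypothetical stand-ins invented for illustration, not the project's own definitions.

```python
import enum
import pathlib

import pytest


class FileType(enum.Enum):
    """Hypothetical stand-in for the project's FileType enum."""

    CSV = ".csv"
    TXT = ".txt"
    XML = ".xml"


def guess_filetype(file_name: str) -> FileType:
    """Hypothetical detector: map a file name to a FileType by its suffix."""
    return FileType(pathlib.Path(file_name).suffix)


@pytest.mark.parametrize(
    ("file_name", "file_type"),
    [
        # Each tuple pins one named input to its expected type.
        ("stanley-cups.csv", FileType.CSV),
        ("fake-text.txt", FileType.TXT),
        ("factbook.xml", FileType.XML),
    ],
)
def test_guess_filetype_matches_expected(file_name: str, file_type: FileType):
    assert guess_filetype(file_name) is file_type
```

With pairs listed this way, pytest generates one test ID per tuple, so a failure report identifies the specific file involved.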

test_unstructured_ingest/test-ingest-dest.sh

Lines changed: 1 addition & 0 deletions
@@ -65,6 +65,7 @@ tests_to_ignore=(
   'dropbox.sh'
   'sharepoint.sh'
   'databricks-volumes.sh'
+  'vectara.sh'
 )
 
 for test in "${all_tests[@]}"; do

unstructured/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.15.14-dev14"  # pragma: no cover
+__version__ = "0.15.15-dev0"  # pragma: no cover
