Commit 6ba8135

fix: check ole storage content to differentiate filetypes (#3581)
### Summary

Updates the file detection logic for OLE files to check the storage content of the file to more reliably differentiate between DOC, PPT, XLS and MSG files. This corrects a bug that caused file type detection to be incorrect in cases where the `filetype` library guessed an incorrect MIME type, such as `'application/vnd.ms-excel'` for a `.msg` file.

As part of this work, the `"msg"` extra was removed because the `python-oxmsg` package is now a base dependency.

### Testing

Use a test `.msg` file for which `filetype.guess_mime` returns `'application/vnd.ms-excel'`.

```python
from unstructured.file_utils.filetype import detect_filetype

filename = "test-file.msg"

detect_filetype(filename=filename)  # result should be FileType.MSG
```
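The idea of checking storage content can be illustrated with a minimal sketch. It is not the code from this commit: the helper names `is_ole_header` and `classify_ole_streams` and the string labels are hypothetical, and the mapping assumes the well-known root-level stream names used by legacy Word, PowerPoint, Excel and Outlook files. All OLE compound files share the same 8-byte signature, which is why a check on the file header alone cannot tell these formats apart.

```python
from typing import Iterable, Optional

# Signature found at the start of every OLE compound file, whatever it contains.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# Assumed mapping from a characteristic root-level stream name to a file type label.
# The labels are plain strings here; the real library uses its own FileType enum.
STREAM_TO_FILETYPE = {
    "WordDocument": "DOC",
    "PowerPoint Document": "PPT",
    "Workbook": "XLS",
    "__properties_version1.0": "MSG",
}


def is_ole_header(head: bytes) -> bool:
    """Return True when the leading bytes carry the OLE compound-file signature."""
    return head.startswith(OLE_MAGIC)


def classify_ole_streams(stream_names: Iterable[str]) -> Optional[str]:
    """Return the label for the first characteristic stream name found, else None."""
    for name in stream_names:
        file_type = STREAM_TO_FILETYPE.get(name)
        if file_type is not None:
            return file_type
    return None
```

In practice the stream names would come from an OLE reader, for example the entries returned by `olefile`'s `OleFileIO.listdir()`. Returning `None` for an unrecognized container leaves room for a fallback such as the file extension.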
1 parent ddb6cb6 commit 6ba8135

53 files changed, with 171 additions and 149 deletions

.github/workflows/ci.yml

Lines changed: 1 addition & 1 deletion
@@ -215,7 +215,7 @@ jobs:
     strategy:
       matrix:
         python-version: ["3.10"]
-        extra: ["csv", "docx", "odt", "markdown", "pypandoc", "msg", "pdf-image", "pptx", "xlsx"]
+        extra: ["csv", "docx", "odt", "markdown", "pypandoc", "pdf-image", "pptx", "xlsx"]
     runs-on: ubuntu-latest
     env:
       NLTK_DATA: ${{ github.workspace }}/nltk_data

CHANGELOG.md

Lines changed: 2 additions & 1 deletion
@@ -1,4 +1,4 @@
-## 0.15.9-dev1
+## 0.15.9
 
 ### Enhancements
 
@@ -8,6 +8,7 @@
 
 ### Fixes
 
+* **Check storage contents for OLE file type detection** Updates `detect_filetype` to check the content of OLE files to more reliable differentiate DOC, PPT, XLS, and MSG files. As part of this, the `"msg"` extra was removed because the `python-oxmsg` package is now a base dependency.
 * **Fix disk space leaks and Windows errors when accessing file.name on a NamedTemporaryFile** Uses of `NamedTemporaryFile(..., delete=False)` and/or uses of `file.name` of NamedTemporaryFiles have been replaced with TemporaryFileDirectory to avoid a known issue: https://docs.python.org/3/library/tempfile.html#tempfile.NamedTemporaryFile
 
 ## 0.15.8

Makefile

Lines changed: 1 addition & 11 deletions
@@ -83,10 +83,6 @@ install-pypandoc:
 install-markdown:
 	python3 -m pip install -r requirements/extra-markdown.txt
 
-.PHONY: install-msg
-install-msg:
-	python3 -m pip install -r requirements/extra-msg.txt
-
 .PHONY: install-pdf-image
 install-pdf-image:
 	python3 -m pip install -r requirements/extra-pdf-image.txt
@@ -100,7 +96,7 @@ install-xlsx:
 	python3 -m pip install -r requirements/extra-xlsx.txt
 
 .PHONY: install-all-docs
-install-all-docs: install-base install-csv install-docx install-epub install-odt install-pypandoc install-markdown install-msg install-pdf-image install-pptx install-xlsx
+install-all-docs: install-base install-csv install-docx install-epub install-odt install-pypandoc install-markdown install-pdf-image install-pptx install-xlsx
 
 .PHONY: install-all-ingest
 install-all-ingest:
@@ -343,12 +339,6 @@ test-extra-epub:
 test-extra-markdown:
 	PYTHONPATH=. CI=$(CI) pytest test_unstructured/partition/test_md.py
 
-.PHONY: test-extra-msg
-test-extra-msg:
-	# NOTE(scanny): exclude attachment test because partitioning attachments requires other extras
-	PYTHONPATH=. CI=$(CI) pytest test_unstructured/partition/test_msg.py \
-		-k "not test_partition_msg_can_process_attachments"
-
 .PHONY: test-extra-odt
 test-extra-odt:
 	PYTHONPATH=. CI=$(CI) pytest test_unstructured/partition/test_odt.py

requirements/base.in

Lines changed: 1 addition & 0 deletions
@@ -21,3 +21,4 @@ unstructured-client
 wrapt
 tqdm
 psutil
+python-oxmsg
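Because `python-oxmsg` moves into the base requirements, a plain install should make it available without the former `"msg"` extra. A quick way to confirm this in an environment is sketched below. The helper `is_importable` is hypothetical, and the import name `oxmsg` for the `python-oxmsg` distribution is an assumption that should be checked against the package's own documentation.

```python
from importlib.util import find_spec


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return find_spec(module_name) is not None


if __name__ == "__main__":
    # "oxmsg" is assumed to be the import name of the python-oxmsg distribution.
    print("oxmsg importable:", is_importable("oxmsg"))
```

The helper is meant for top-level module names only; for a dotted name, `find_spec` imports the parent package first and raises if that parent is missing.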

requirements/base.txt

Lines changed: 10 additions & 3 deletions
@@ -10,7 +10,7 @@ backoff==2.2.1
     # via -r ./base.in
 beautifulsoup4==4.12.3
     # via -r ./base.in
-certifi==2024.7.4
+certifi==2024.8.30
     # via
     #   httpcore
     #   httpx
@@ -23,7 +23,9 @@ charset-normalizer==3.3.2
     #   requests
     #   unstructured-client
 click==8.1.7
-    # via nltk
+    # via
+    #   nltk
+    #   python-oxmsg
 dataclasses-json==0.6.7
     # via
     #   -r ./base.in
@@ -70,6 +72,8 @@ nltk==3.9.1
     # via -r ./base.in
 numpy==1.26.4
     # via -r ./base.in
+olefile==0.47
+    # via python-oxmsg
 orderly-set==5.2.2
     # via deepdiff
 packaging==24.1
@@ -86,6 +90,8 @@ python-iso639==2024.4.27
     # via -r ./base.in
 python-magic==0.4.27
     # via -r ./base.in
+python-oxmsg==0.0.1
+    # via -r ./base.in
 rapidfuzz==3.9.6
     # via -r ./base.in
 regex==2024.7.24
@@ -120,6 +126,7 @@ typing-extensions==4.12.2
     #   anyio
     #   emoji
     #   pypdf
+    #   python-oxmsg
     #   typing-inspect
     #   unstructured-client
 typing-inspect==0.9.0
@@ -128,7 +135,7 @@ typing-inspect==0.9.0
     #   unstructured-client
 unstructured-client==0.25.5
     # via -r ./base.in
-urllib3==1.26.19
+urllib3==1.26.20
     # via
     #   -c ././deps/constraints.txt
     #   requests

requirements/dev.txt

Lines changed: 3 additions & 3 deletions
@@ -34,7 +34,7 @@ bleach==6.1.0
     # via nbconvert
 build==1.2.1
     # via pip-tools
-certifi==2024.7.4
+certifi==2024.8.30
     # via
     #   -c ./base.txt
     #   -c ./test.txt
@@ -130,7 +130,7 @@ jsonschema[format-nongpl]==3.2.0
     #   jupyter-events
     #   jupyterlab-server
     #   nbformat
-jupyter==1.1.0
+jupyter==1.1.1
     # via -r ./dev.in
 jupyter-client==7.4.9
     # via
@@ -370,7 +370,7 @@ typing-extensions==4.12.2
     #   -c ./test.txt
     #   anyio
     #   ipython
-urllib3==1.26.19
+urllib3==1.26.20
     # via
     #   -c ././deps/constraints.txt
     #   -c ./base.txt

requirements/extra-msg.in

Lines changed: 0 additions & 4 deletions
This file was deleted.

requirements/extra-msg.txt

Lines changed: 0 additions & 18 deletions
This file was deleted.

requirements/extra-paddleocr.txt

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -10,7 +10,7 @@ anyio==4.4.0
1010
# httpx
1111
astor==0.8.1
1212
# via paddlepaddle
13-
certifi==2024.7.4
13+
certifi==2024.8.30
1414
# via
1515
# -c ./base.txt
1616
# httpcore
@@ -170,7 +170,7 @@ typing-extensions==4.12.2
170170
# paddlepaddle
171171
unstructured-paddleocr==2.8.1.0
172172
# via -r ./extra-paddleocr.in
173-
urllib3==1.26.19
173+
urllib3==1.26.20
174174
# via
175175
# -c ././deps/constraints.txt
176176
# -c ./base.txt

requirements/extra-pdf-image.txt

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -8,7 +8,7 @@ antlr4-python3-runtime==4.9.3
88
# via omegaconf
99
cachetools==5.5.0
1010
# via google-auth
11-
certifi==2024.7.4
11+
certifi==2024.8.30
1212
# via
1313
# -c ./base.txt
1414
# requests
@@ -279,7 +279,7 @@ unstructured-inference==0.7.36
279279
# via -r ./extra-pdf-image.in
280280
unstructured-pytesseract==0.3.13
281281
# via -r ./extra-pdf-image.in
282-
urllib3==1.26.19
282+
urllib3==1.26.20
283283
# via
284284
# -c ././deps/constraints.txt
285285
# -c ./base.txt
