Commit 6ba8135

fix: check ole storage content to differentiate filetypes (#3581)
### Summary

Updates the file detection logic for OLE files to check the storage content of the file to more reliably differentiate between DOC, PPT, XLS and MSG files. This corrects a bug that caused file type detection to be incorrect in cases where the `filetype` library guessed an incorrect MIME type, such as `'application/vnd.ms-excel'` for a `.msg` file.

As part of this work, the `"msg"` extra was removed because the `python-oxmsg` package is now a base dependency.

### Testing

Use a test `.msg` file that returns `'application/vnd.ms-excel'` from `filetype.guess_mime`:

```python
from unstructured.file_utils.filetype import detect_filetype

filename = "test-file.msg"

detect_filetype(filename=filename)  # result should be FileType.MSG
```
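To illustrate the idea in the summary, here is a minimal sketch of telling OLE-based formats apart by their top-level streams and storages rather than by a guessed MIME type. It is not the code from this commit: the function names `classify_ole_entries` and `detect_ole_type` are hypothetical, and the marker names are the ones these formats conventionally use, assumed here for illustration. It relies on the `olefile` package, which this commit pulls in through `python-oxmsg`.

```python
from typing import Iterable, Optional

# Top-level stream/storage names that conventionally identify each format.
# Assumption for illustration: this table is not taken from the commit.
_OLE_MARKERS = (
    ("__properties_version1.0", "msg"),
    ("WordDocument", "doc"),
    ("PowerPoint Document", "ppt"),
    ("Workbook", "xls"),
    ("Book", "xls"),  # older Excel 5.0/95 workbooks
)


def classify_ole_entries(top_level_names: Iterable[str]) -> Optional[str]:
    """Map the top-level entry names of an OLE container to a file type."""
    # OLE entry names are case-insensitive, so compare in lowercase.
    names = {name.lower() for name in top_level_names}
    for marker, file_type in _OLE_MARKERS:
        if marker.lower() in names:
            return file_type
    return None


def detect_ole_type(path: str) -> Optional[str]:
    """Return "msg", "doc", "ppt" or "xls", or None if the type is unknown."""
    import olefile  # third-party; imported lazily so the classifier stays dependency-free

    if not olefile.isOleFile(path):
        return None
    with olefile.OleFileIO(path) as ole:
        # Each entry is a list of path components; the first is the top-level name.
        entries = ole.listdir(streams=True, storages=True)
    return classify_ole_entries(entry[0] for entry in entries)
```

Keeping the classification separate from the file access makes the rule easy to check without sample files, for example `classify_ole_entries(["__properties_version1.0"])` returns `"msg"`.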
1 parent ddb6cb6 commit 6ba8135


53 files changed, +171 -149 lines changed

.github/workflows/ci.yml

+1 -1
@@ -215,7 +215,7 @@ jobs:
 strategy:
 matrix:
 python-version: ["3.10"]
-extra: ["csv", "docx", "odt", "markdown", "pypandoc", "msg", "pdf-image", "pptx", "xlsx"]
+extra: ["csv", "docx", "odt", "markdown", "pypandoc", "pdf-image", "pptx", "xlsx"]
 runs-on: ubuntu-latest
 env:
 NLTK_DATA: ${{ github.workspace }}/nltk_data

CHANGELOG.md

+2 -1
@@ -1,4 +1,4 @@
-## 0.15.9-dev1
+## 0.15.9
 
 ### Enhancements
 
@@ -8,6 +8,7 @@
 
 ### Fixes
 
+* **Check storage contents for OLE file type detection** Updates `detect_filetype` to check the content of OLE files to more reliably differentiate DOC, PPT, XLS, and MSG files. As part of this, the `"msg"` extra was removed because the `python-oxmsg` package is now a base dependency.
 * **Fix disk space leaks and Windows errors when accessing file.name on a NamedTemporaryFile** Uses of `NamedTemporaryFile(..., delete=False)` and/or uses of `file.name` of NamedTemporaryFiles have been replaced with TemporaryFileDirectory to avoid a known issue: https://docs.python.org/3/library/tempfile.html#tempfile.NamedTemporaryFile
 
 ## 0.15.8

Makefile

+1 -11
@@ -83,10 +83,6 @@ install-pypandoc:
 install-markdown:
 python3 -m pip install -r requirements/extra-markdown.txt
 
-.PHONY: install-msg
-install-msg:
-python3 -m pip install -r requirements/extra-msg.txt
-
 .PHONY: install-pdf-image
 install-pdf-image:
 python3 -m pip install -r requirements/extra-pdf-image.txt
@@ -100,7 +96,7 @@ install-xlsx:
 python3 -m pip install -r requirements/extra-xlsx.txt
 
 .PHONY: install-all-docs
-install-all-docs: install-base install-csv install-docx install-epub install-odt install-pypandoc install-markdown install-msg install-pdf-image install-pptx install-xlsx
+install-all-docs: install-base install-csv install-docx install-epub install-odt install-pypandoc install-markdown install-pdf-image install-pptx install-xlsx
 
 .PHONY: install-all-ingest
 install-all-ingest:
@@ -343,12 +339,6 @@ test-extra-epub:
 test-extra-markdown:
 PYTHONPATH=. CI=$(CI) pytest test_unstructured/partition/test_md.py
 
-.PHONY: test-extra-msg
-test-extra-msg:
-# NOTE(scanny): exclude attachment test because partitioning attachments requires other extras
-PYTHONPATH=. CI=$(CI) pytest test_unstructured/partition/test_msg.py \
--k "not test_partition_msg_can_process_attachments"
-
 .PHONY: test-extra-odt
 test-extra-odt:
 PYTHONPATH=. CI=$(CI) pytest test_unstructured/partition/test_odt.py

requirements/base.in

+1
@@ -21,3 +21,4 @@ unstructured-client
 wrapt
 tqdm
 psutil
+python-oxmsg

requirements/base.txt

+10 -3
@@ -10,7 +10,7 @@ backoff==2.2.1
 # via -r ./base.in
 beautifulsoup4==4.12.3
 # via -r ./base.in
-certifi==2024.7.4
+certifi==2024.8.30
 # via
 # httpcore
 # httpx
@@ -23,7 +23,9 @@ charset-normalizer==3.3.2
 # requests
 # unstructured-client
 click==8.1.7
-# via nltk
+# via
+# nltk
+# python-oxmsg
 dataclasses-json==0.6.7
 # via
 # -r ./base.in
@@ -70,6 +72,8 @@ nltk==3.9.1
 # via -r ./base.in
 numpy==1.26.4
 # via -r ./base.in
+olefile==0.47
+# via python-oxmsg
 orderly-set==5.2.2
 # via deepdiff
 packaging==24.1
@@ -86,6 +90,8 @@ python-iso639==2024.4.27
 # via -r ./base.in
 python-magic==0.4.27
 # via -r ./base.in
+python-oxmsg==0.0.1
+# via -r ./base.in
 rapidfuzz==3.9.6
 # via -r ./base.in
 regex==2024.7.24
@@ -120,6 +126,7 @@ typing-extensions==4.12.2
 # anyio
 # emoji
 # pypdf
+# python-oxmsg
 # typing-inspect
 # unstructured-client
 typing-inspect==0.9.0
@@ -128,7 +135,7 @@ typing-inspect==0.9.0
 # unstructured-client
 unstructured-client==0.25.5
 # via -r ./base.in
-urllib3==1.26.19
+urllib3==1.26.20
 # via
 # -c ././deps/constraints.txt
 # requests

requirements/dev.txt

+3 -3
@@ -34,7 +34,7 @@ bleach==6.1.0
 # via nbconvert
 build==1.2.1
 # via pip-tools
-certifi==2024.7.4
+certifi==2024.8.30
 # via
 # -c ./base.txt
 # -c ./test.txt
@@ -130,7 +130,7 @@ jsonschema[format-nongpl]==3.2.0
 # jupyter-events
 # jupyterlab-server
 # nbformat
-jupyter==1.1.0
+jupyter==1.1.1
 # via -r ./dev.in
 jupyter-client==7.4.9
 # via
@@ -370,7 +370,7 @@ typing-extensions==4.12.2
 # -c ./test.txt
 # anyio
 # ipython
-urllib3==1.26.19
+urllib3==1.26.20
 # via
 # -c ././deps/constraints.txt
 # -c ./base.txt

requirements/extra-msg.in

-4
This file was deleted.

requirements/extra-msg.txt

-18
This file was deleted.

requirements/extra-paddleocr.txt

+2 -2
@@ -10,7 +10,7 @@ anyio==4.4.0
 # httpx
 astor==0.8.1
 # via paddlepaddle
-certifi==2024.7.4
+certifi==2024.8.30
 # via
 # -c ./base.txt
 # httpcore
@@ -170,7 +170,7 @@ typing-extensions==4.12.2
 # paddlepaddle
 unstructured-paddleocr==2.8.1.0
 # via -r ./extra-paddleocr.in
-urllib3==1.26.19
+urllib3==1.26.20
 # via
 # -c ././deps/constraints.txt
 # -c ./base.txt

requirements/extra-pdf-image.txt

+2 -2
@@ -8,7 +8,7 @@ antlr4-python3-runtime==4.9.3
 # via omegaconf
 cachetools==5.5.0
 # via google-auth
-certifi==2024.7.4
+certifi==2024.8.30
 # via
 # -c ./base.txt
 # requests
@@ -279,7 +279,7 @@ unstructured-inference==0.7.36
 # via -r ./extra-pdf-image.in
 unstructured-pytesseract==0.3.13
 # via -r ./extra-pdf-image.in
-urllib3==1.26.19
+urllib3==1.26.20
 # via
 # -c ././deps/constraints.txt
 # -c ./base.txt

requirements/huggingface.txt

+2 -2
@@ -4,7 +4,7 @@
 #
 # pip-compile ./huggingface.in
 #
-certifi==2024.7.4
+certifi==2024.8.30
 # via
 # -c ./base.txt
 # requests
@@ -103,7 +103,7 @@ typing-extensions==4.12.2
 # -c ./base.txt
 # huggingface-hub
 # torch
-urllib3==1.26.19
+urllib3==1.26.20
 # via
 # -c ././deps/constraints.txt
 # -c ./base.txt

requirements/ingest/airtable.txt

+2 -2
@@ -6,7 +6,7 @@
 #
 annotated-types==0.7.0
 # via pydantic
-certifi==2024.7.4
+certifi==2024.8.30
 # via
 # -c ./ingest/../base.txt
 # requests
@@ -36,7 +36,7 @@ typing-extensions==4.12.2
 # pyairtable
 # pydantic
 # pydantic-core
-urllib3==1.26.19
+urllib3==1.26.20
 # via
 # -c ./ingest/../base.txt
 # -c ./ingest/../deps/constraints.txt

requirements/ingest/astradb.txt

+2 -2
@@ -14,7 +14,7 @@ cassandra-driver==3.29.1
 # via cassio
 cassio==0.1.8
 # via astrapy
-certifi==2024.7.4
+certifi==2024.8.30
 # via
 # -c ./ingest/../base.txt
 # httpcore
@@ -91,7 +91,7 @@ typing-extensions==4.12.2
 # via
 # -c ./ingest/../base.txt
 # anyio
-urllib3==1.26.19
+urllib3==1.26.20
 # via
 # -c ./ingest/../base.txt
 # -c ./ingest/../deps/constraints.txt

requirements/ingest/azure-cognitive-search.txt

+2 -2
@@ -10,7 +10,7 @@ azure-core==1.30.2
 # via azure-search-documents
 azure-search-documents==11.5.1
 # via -r ./ingest/azure-cognitive-search.in
-certifi==2024.7.4
+certifi==2024.8.30
 # via
 # -c ./ingest/../base.txt
 # requests
@@ -38,7 +38,7 @@ typing-extensions==4.12.2
 # -c ./ingest/../base.txt
 # azure-core
 # azure-search-documents
-urllib3==1.26.19
+urllib3==1.26.20
 # via
 # -c ./ingest/../base.txt
 # -c ./ingest/../deps/constraints.txt

requirements/ingest/azure.txt

+2 -2
@@ -27,7 +27,7 @@ azure-identity==1.17.1
 # via adlfs
 azure-storage-blob==12.22.0
 # via adlfs
-certifi==2024.7.4
+certifi==2024.8.30
 # via
 # -c ./ingest/../base.txt
 # requests
@@ -94,7 +94,7 @@ typing-extensions==4.12.2
 # azure-core
 # azure-identity
 # azure-storage-blob
-urllib3==1.26.19
+urllib3==1.26.20
 # via
 # -c ./ingest/../base.txt
 # -c ./ingest/../deps/constraints.txt

requirements/ingest/box.txt

+2 -2
@@ -10,7 +10,7 @@ boxfs==0.3.0
 # via -r ./ingest/box.in
 boxsdk[jwt]==3.13.0
 # via boxfs
-certifi==2024.7.4
+certifi==2024.8.30
 # via
 # -c ./ingest/../base.txt
 # requests
@@ -51,7 +51,7 @@ six==1.16.0
 # via
 # -c ./ingest/../base.txt
 # python-dateutil
-urllib3==1.26.19
+urllib3==1.26.20
 # via
 # -c ./ingest/../base.txt
 # -c ./ingest/../deps/constraints.txt

requirements/ingest/chroma.txt

+2 -2
@@ -24,7 +24,7 @@ build==1.2.1
 # via chromadb
 cachetools==5.5.0
 # via google-auth
-certifi==2024.7.4
+certifi==2024.8.30
 # via
 # -c ./ingest/../base.txt
 # httpcore
@@ -268,7 +268,7 @@ typing-extensions==4.12.2
 # starlette
 # typer
 # uvicorn
-urllib3==1.26.19
+urllib3==1.26.20
 # via
 # -c ./ingest/../base.txt
 # -c ./ingest/../deps/constraints.txt

requirements/ingest/clarifai.txt

+2 -2
@@ -4,7 +4,7 @@
 #
 # pip-compile ./ingest/clarifai.in
 #
-certifi==2024.7.4
+certifi==2024.8.30
 # via
 # -c ./ingest/../base.txt
 # requests
@@ -74,7 +74,7 @@ tqdm==4.66.5
 # clarifai
 tritonclient==2.41.1
 # via clarifai
-urllib3==1.26.19
+urllib3==1.26.20
 # via
 # -c ./ingest/../base.txt
 # -c ./ingest/../deps/constraints.txt

requirements/ingest/confluence.txt

+3 -3
@@ -4,13 +4,13 @@
 #
 # pip-compile ./ingest/confluence.in
 #
-atlassian-python-api==3.41.14
+atlassian-python-api==3.41.15
 # via -r ./ingest/confluence.in
 beautifulsoup4==4.12.3
 # via
 # -c ./ingest/../base.txt
 # atlassian-python-api
-certifi==2024.7.4
+certifi==2024.8.30
 # via
 # -c ./ingest/../base.txt
 # requests
@@ -45,7 +45,7 @@ soupsieve==2.6
 # via
 # -c ./ingest/../base.txt
 # beautifulsoup4
-urllib3==1.26.19
+urllib3==1.26.20
 # via
 # -c ./ingest/../base.txt
 # -c ./ingest/../deps/constraints.txt

requirements/ingest/databricks-volumes.txt

+2 -2
@@ -6,7 +6,7 @@
 #
 cachetools==5.5.0
 # via google-auth
-certifi==2024.7.4
+certifi==2024.8.30
 # via
 # -c ./ingest/../base.txt
 # requests
@@ -34,7 +34,7 @@ requests==2.32.3
 # databricks-sdk
 rsa==4.9
 # via google-auth
-urllib3==1.26.19
+urllib3==1.26.20
 # via
 # -c ./ingest/../base.txt
 # -c ./ingest/../deps/constraints.txt
