Description
Describe the bug
I have a document exported from Confluence, which downloads as a .doc file. Partitioning this file fails because the file type cannot be detected. This occurs when the file content is passed as an in-memory byte stream (similar to how the unstructured Python client SDK uploads it), and not when the opened file is passed directly.
When using unstructured directly, partitioning fails with "unstructured.partition.common.UnsupportedFileFormatError: Partitioning is not supported for the FileType.UNK file type." When using the client SDK, it fails with "expected str, bytes or os.PathLike object, not int".
To Reproduce
from io import BytesIO
from unstructured.partition.auto import partition
with open("Test.doc", "rb") as f:
    # Wrap the bytes in BytesIO instead of passing the file handle directly,
    # mirroring how the client SDK sends the stream as an uploaded file.
    elements = partition(file=BytesIO(f.read()))
Expected behavior
The file type should be detected as .doc, and partition should return the document's elements.
Screenshots
NA
Environment Info
Python version: 3.10.15
unstructured version: 0.17.5
unstructured-inference version: 0.8.10
Additional context
The .doc file exported by Confluence does not contain magic bytes, so OLE file detection fails and all other detection steps fail as well. When reaching line , it fails because the file name returned is random and is sometimes an integer.
Since this is the last attempt to identify the extension, can we use the metadata file name before doing this check?
PR #3786 proposes a potential fix for this.
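To illustrate the idea of consulting the metadata file name first, here is a minimal sketch. It is not the code from the PR and not unstructured's actual detection logic. The helper name extension_from_names and its parameters are hypothetical and exist only for this example. It prefers the caller-supplied metadata file name, falls back to the stream's name attribute only when that is a string, and ignores integer names such as file descriptors.

```python
import os
from io import BytesIO
from typing import IO, Optional


def extension_from_names(
    file: IO[bytes], metadata_filename: Optional[str] = None
) -> Optional[str]:
    """Hypothetical helper: guess an extension from available file names.

    The metadata file name is tried first; the stream's own `name` is used
    only if it is a string. Non-string names (for example an integer file
    descriptor) are skipped instead of raising a TypeError.
    """
    stream_name = getattr(file, "name", None)
    for candidate in (metadata_filename, stream_name):
        if isinstance(candidate, str):
            ext = os.path.splitext(candidate)[1].lower()
            if ext:
                return ext
    return None


# A BytesIO wrapper has no usable name, so only the metadata name helps.
stream = BytesIO(b"not a real doc")
print(extension_from_names(stream, metadata_filename="Test.doc"))

# Simulate a stream whose name is an integer, as described above.
stream.name = 7
print(extension_from_names(stream))
```

With a check like this, a stream whose name is an integer would fall through to "unknown" rather than raising "expected str, bytes or os.PathLike object, not int", and a supplied metadata file name would be enough to identify the .doc extension.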