bug/Wrongly detected fileType for exported documents

**Describe the bug**
I have a document exported from Confluence, which is downloaded as a .doc file. Partitioning this file fails because the file type cannot be detected. This occurs when the file is passed as a wrapped byte stream (similar to how the unstructured Python client SDK sends it), not when the opened file is passed directly.

Partitioning fails with the message "unstructured.partition.common.UnsupportedFileFormatError: Partitioning is not supported for the FileType.UNK file type." when using unstructured directly, and with "expected str, bytes or os.PathLike object, not int" when using the client SDK.

**To Reproduce**
```python
from io import BytesIO
from unstructured.partition.auto import partition

with open("Test.doc", "rb") as f:
    # Not directly sending the stream, but sending it as wrapped bytes,
    # like how the client SDK sends the stream as an uploaded file
    elements = partition(file=BytesIO(f.read()))
```

**Expected behavior**
The file type should be detected as .doc, and `partition` should return the partitioned elements.

**Screenshots**
NA

**Environment Info**
Python version: 3.10.15
unstructured version: 0.17.5
unstructured-inference version: 0.8.10


**Additional context**
The .doc format exported by Confluence does not contain magic bytes, so OLE file detection fails, as do all other detection steps. On reaching [this line](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/file_utils/filetype.py#L355), detection fails because the file name returned is random and is sometimes an integer.
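
To check whether a given export really lacks the OLE signature, something like the following may help. It is a minimal diagnostic sketch, not part of unstructured: `has_ole_signature` is a hypothetical helper name, and the only fact it relies on is the standard 8-byte signature that begins OLE2 compound files (the container used by legacy binary .doc files).

```python
from io import BytesIO

# Signature that begins every OLE2 compound file (legacy binary .doc/.xls/.ppt).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def has_ole_signature(stream) -> bool:
    """Hypothetical helper: True if the stream starts with the OLE2 magic bytes."""
    pos = stream.tell()          # remember where the caller was
    stream.seek(0)
    head = stream.read(len(OLE_MAGIC))
    stream.seek(pos)             # restore the caller's position
    return head == OLE_MAGIC


if __name__ == "__main__":
    # Replace the sample bytes with the content of the exported file.
    sample = BytesIO(b"plain text content, no OLE header")
    print(has_ole_signature(sample))
```

If this returns `False` for the exported file, content-based detection has nothing to match on and only the file name can identify it as .doc.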

Since this is the last attempt to identify the extension, can we use the metadata file name before doing this check?
This PR proposes a potential fix: https://github.com/Unstructured-IO/unstructured/pull/3786
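
To illustrate the idea only (this is not the code from the PR, and `guess_extension` and its parameters are hypothetical names), a fallback could prefer a caller-supplied metadata file name and trust the stream's `name` attribute only when it is a string. Some temporary or spooled file objects expose an integer file descriptor as `name`, and passing that to `os.path.splitext` raises the "expected str, bytes or os.PathLike object, not int" error quoted above.

```python
import os
from io import BytesIO
from typing import IO, Optional


def guess_extension(file: IO[bytes], metadata_filename: Optional[str] = None) -> str:
    """Hypothetical helper: pick a file extension for a file-like object."""
    # Prefer the name supplied by the caller, since it reflects the real file.
    if metadata_filename:
        return os.path.splitext(metadata_filename)[1].lower()

    # Fall back to the stream's own name, but only if it is a string.
    # An int here is a file descriptor, not a path.
    name = getattr(file, "name", None)
    if isinstance(name, str):
        return os.path.splitext(name)[1].lower()

    return ""  # extension unknown


if __name__ == "__main__":
    print(guess_extension(BytesIO(b""), metadata_filename="Test.doc"))  # prints ".doc"
```

With a check like this, a `BytesIO` without a name yields an empty extension instead of an exception, and supplying the original file name is enough to recover `.doc`.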


bug/Wrongly detected fileType for exported documents #3980

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

bug/Wrongly detected fileType for exported documents #3980

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions