bug/Wrongly detected fileType for exported documents #3980

Open
@srisudarsan

Describe the bug
I have a document exported from Confluence, which is downloaded as a .doc file. Partitioning this file fails because the file type cannot be detected. This occurs when the file content is sent as a wrapped byte stream (similar to how the unstructured Python client SDK sends it), and not when the open file object is passed directly.

File partition fails with the message "unstructured.partition.common.UnsupportedFileFormatError: Partitioning is not supported for the FileType.UNK file type." when using unstructured directly, and with "expected str, bytes or os.PathLike object, not int" when using the client SDK.

To Reproduce
from io import BytesIO
from unstructured.partition.auto import partition

with open("Test.doc", "rb") as f:
    # Not sending the stream directly, but wrapping the bytes the way the client SDK sends an uploaded file
    elements = partition(file=BytesIO(f.read()))

Expected behavior
The extension should be detected as .doc and should return partitions.

Screenshots
NA

Environment Info
Python version: 3.10.15
unstructured version: 0.17.5
unstructured-inference version: 0.8.10

Additional context
The .doc format exported by Confluence does not contain the magic bytes, so OLE file detection fails and all other detection steps fail as well. On reaching line , it fails because the file name returned is random and is sometimes an integer.
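
Both error messages are consistent with a stream whose `name` attribute is either missing or not a path. The sketch below uses only the standard library and is not unstructured's actual detection code; `extension_from_stream` and `FdNamedStream` are hypothetical names made up for illustration. It shows that a `BytesIO` carries no `name` at all, and that `os.path.splitext` rejects an integer such as the file descriptor that some temporary-file objects expose as `name`.

```python
import os
from io import BytesIO


def extension_from_stream(file):
    """Hypothetical helper: derive an extension from a stream's `name`, if any."""
    name = getattr(file, "name", None)
    if name is None:
        return None  # e.g. BytesIO carries no file name at all
    # Raises TypeError when `name` is an int (a file descriptor), not a path.
    return os.path.splitext(name)[1].lower()


class FdNamedStream(BytesIO):
    """Stand-in for a temp-file-like object whose `name` is an integer."""
    name = 7


# A plain BytesIO gives the detector nothing to work with.
print(extension_from_stream(BytesIO(b"data")))

# An integer `name` reproduces the TypeError reported for the client SDK path.
try:
    extension_from_stream(FdNamedStream(b"data"))
except TypeError as exc:
    print(f"TypeError: {exc}")
```

In either case the stream itself cannot supply the extension, so it has to come from somewhere else.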

Since this is the last attempt to identify the extension, can we use the metadata file name before doing this check?
This PR proposes a potential fix: #3786
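
The ordering suggested above could look roughly like the following. This is only a sketch of the idea, not the code from the PR or from unstructured: `guess_extension` and its `metadata_filename` argument are hypothetical, and the content check is reduced to the OLE compound-file signature.

```python
import os
from io import BytesIO
from typing import IO, Optional

# Signature at the start of every OLE compound file (legacy .doc, .xls, .ppt, ...).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def guess_extension(
    file: IO[bytes], metadata_filename: Optional[str] = None
) -> Optional[str]:
    """Hypothetical detector: content first, then metadata name, then stream name."""
    # 1. Content-based check; restore the stream position afterwards.
    pos = file.tell()
    head = file.read(len(OLE_MAGIC))
    file.seek(pos)
    if head == OLE_MAGIC:
        return ".doc"  # simplification: other OLE-based formats share this signature

    # 2. Caller-supplied file name, consulted before the stream's own name.
    if metadata_filename:
        ext = os.path.splitext(metadata_filename)[1].lower()
        if ext:
            return ext

    # 3. Last resort: the stream's `name`, used only if it is path-like.
    name = getattr(file, "name", None)
    if isinstance(name, (str, bytes, os.PathLike)):
        return os.path.splitext(os.fsdecode(name))[1].lower() or None
    return None


# A wrapped stream without OLE magic bytes still resolves via the metadata name.
print(guess_extension(BytesIO(b"not an OLE file"), metadata_filename="Test.doc"))
```

Checking the type of `name` in step 3 means an integer file descriptor yields "unknown" rather than raising a `TypeError`.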

Labels: bug
