bug/a text file cannot be loaded from a ZipExtFile #4097

@sebovzeoueb

Describe the bug

I have some code that takes uploaded files and passes them to the LangChain UnstructuredLoader, which, as the error log below shows, calls Unstructured's partition function. When the uploaded file is a zip file, I use Python's built-in zipfile module to open its contents as file-like objects. I've tried several different text files with the same result. In all cases I'm passing a file-like object into Unstructured.

  • Uploading the text file directly: success
  • Uploading a zip file containing PDF, DOCX, PNG etc.: success
  • Uploading a zip file containing the working text file: failure, with the following traceback:
2025-09-23 16:50:09   File "/shabti/.venv/lib/python3.12/site-packages/unstructured/partition/auto.py", line 292, in partition
2025-09-23 16:50:09     elements = partition(filename=filename, file=file, **partitioning_kwargs)
2025-09-23 16:50:09                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-09-23 16:50:09   File "/shabti/.venv/lib/python3.12/site-packages/unstructured/partition/common/metadata.py", line 162, in wrapper
2025-09-23 16:50:09     elements = func(*args, **kwargs)
2025-09-23 16:50:09                ^^^^^^^^^^^^^^^^^^^^^
2025-09-23 16:50:09   File "/shabti/.venv/lib/python3.12/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
2025-09-23 16:50:09     elements = func(*args, **kwargs)
2025-09-23 16:50:09                ^^^^^^^^^^^^^^^^^^^^^
2025-09-23 16:50:09   File "/shabti/.venv/lib/python3.12/site-packages/unstructured/partition/text.py", line 81, in partition_text
2025-09-23 16:50:09     encoding, file_text = read_txt_file(file=file, encoding=encoding)
2025-09-23 16:50:09                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-09-23 16:50:09   File "/shabti/.venv/lib/python3.12/site-packages/unstructured/file_utils/encoding.py", line 146, in read_txt_file
2025-09-23 16:50:09     formatted_encoding, file_text = detect_file_encoding(file=file)
2025-09-23 16:50:09                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-09-23 16:50:09   File "/shabti/.venv/lib/python3.12/site-packages/unstructured/file_utils/encoding.py", line 70, in detect_file_encoding
2025-09-23 16:50:09     byte_data = convert_to_bytes(file)
2025-09-23 16:50:09                 ^^^^^^^^^^^^^^^^^^^^^^
2025-09-23 16:50:09   File "/shabti/.venv/lib/python3.12/site-packages/unstructured/partition/common/common.py", line 386, in convert_to_bytes
2025-09-23 16:50:09     raise ValueError("Invalid file-like object type")
2025-09-23 16:50:09 ValueError: Invalid file-like object type
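
The ValueError at the end of the trace suggests that convert_to_bytes decides what to do based on the concrete type of the file object. The sketch below only inspects what zipfile hands back for an archive member. It does not call Unstructured at all, and the idea that the library checks for types such as io.BytesIO or io.BufferedReader is an assumption that the log does not confirm. The archive and the member name note.txt are made up for illustration.

```python
import io
import zipfile

# Build a small zip archive in memory so the check needs no files on disk.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("note.txt", "hello from inside the zip\n")
archive.seek(0)

with zipfile.ZipFile(archive) as my_zip:
    with my_zip.open("note.txt") as member:
        # Record the concrete type and a few isinstance checks against
        # common binary stream classes from the io module.
        member_type = type(member)
        is_bytesio = isinstance(member, io.BytesIO)
        is_buffered_reader = isinstance(member, io.BufferedReader)
        is_buffered_base = isinstance(member, io.BufferedIOBase)

print(member_type.__name__, is_bytesio, is_buffered_reader, is_buffered_base)
```

The member is a zipfile.ZipExtFile: a binary stream, but neither an io.BytesIO nor an io.BufferedReader. If the type check in convert_to_bytes is limited to classes like those, that would be consistent with PDF and DOCX members working (they may never reach this code path) while text members fail.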

To Reproduce

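import zipfile

# Assumed import path; the report above only names the class.
from langchain_unstructured import UnstructuredLoader

# `file` is the uploaded zip archive (a path or a binary file-like object).
with zipfile.ZipFile(file) as my_zip:
    for info in my_zip.infolist():
        # my_zip.open() returns a zipfile.ZipExtFile, which is passed on as-is.
        loader = UnstructuredLoader(
            file=my_zip.open(info),
            strategy="auto",
            chunking_strategy="by_title",
            metadata_filename=info.filename,
        )
        pages = loader.load()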
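
A possible workaround, sketched under the assumption that Unstructured accepts an io.BytesIO where it rejects a ZipExtFile: read each member fully into memory and wrap the bytes before handing them to the loader. The helper name iter_zip_members_as_bytesio and the demo archive are hypothetical, and the snippet has not been run against Unstructured itself.

```python
import io
import zipfile


def iter_zip_members_as_bytesio(zip_source):
    """Yield (filename, BytesIO) for each regular file in the archive.

    Hypothetical helper: each member is read fully into memory, so this
    is only suitable for archives whose members fit comfortably in RAM.
    """
    with zipfile.ZipFile(zip_source) as my_zip:
        for info in my_zip.infolist():
            if info.is_dir():
                continue  # directory entries have no content to load
            yield info.filename, io.BytesIO(my_zip.read(info))


# Demo with an in-memory archive so the sketch runs without any uploads.
demo_archive = io.BytesIO()
with zipfile.ZipFile(demo_archive, "w") as zf:
    zf.writestr("docs/note.txt", "hello\n")
demo_archive.seek(0)

members = list(iter_zip_members_as_bytesio(demo_archive))
for filename, buffer in members:
    # In the real code, `buffer` would replace my_zip.open(info) as the
    # file= argument, with metadata_filename=filename kept as before.
    print(filename, type(buffer).__name__, len(buffer.getvalue()))
```

This trades memory for compatibility: the whole member is copied into a buffer instead of being streamed from the archive.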

Expected behavior
The text file inside the zip archive loads successfully, just as it does when uploaded directly.

Environment Info
Please run python scripts/collect_env.py and paste the output here.

I can't find where this collect_env.py is in my installation.

I created a Docker image based on astral/uv:python3.12-trixie-slim with unstructured[all-docs]>=0.18.14 in my Python dependencies. I have installed all the recommended system dependencies except libmagic as I am also having some issues with that.
