Skip to content

fix read write tmpfile on windows #323

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 8 additions & 4 deletions unstructured_inference/inference/layout.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@

import os
import tempfile
import random
from pathlib import PurePath
from typing import BinaryIO, Collection, List, Optional, Union, cast

Expand Down Expand Up @@ -351,11 +352,14 @@ def process_data_with_model(
) -> DocumentLayout:
"""Processes pdf file in the form of a file handler (supporting a read method) into a
DocumentLayout by using a model identified by model_name."""
with tempfile.NamedTemporaryFile() as tmp_file:
tmp_file.write(data.read())
tmp_file.flush() # Make sure the file is written out
Copy link
Contributor

@pawel-kmiecik pawel-kmiecik Mar 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of creating a separate file, could you try with this smaller change:

with tempfile.NamedTemporaryFile() as tmp_file:
        tmp_file.write(data.read())
        tmp_file.flush() 
        tmp_file.seek(0)
        layout = process_file_with_model(...)

Unfortunately I don't have a windows machine to check myself.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, probably it's not this issue - I'd try to create tempfile with options:

tempfile.NamedTemporaryFile(delete=True, delete_on_close=False)

According to windows recommendations from here:
https://docs.python.org/3/library/tempfile.html

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, probably it's not this issue - I'd try to create tempfile with options:

tempfile.NamedTemporaryFile(delete=True, delete_on_close=False)

According to windows recommendations from here: https://docs.python.org/3/library/tempfile.html

Note that this only exists for Python 3.12, the docs for even Python 3.11 do not have the delete_on_close parameter: https://docs.python.org/3.11/library/tempfile.html#tempfile.NamedTemporaryFile

The reason I'm commenting is because I'm also running into this issue on windows

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And wound it defeat the purpose of a temp file to leave it on device after things have been done with it?
I guess you could do something like this if you dont want to use a temp folder

def process_data_with_model(
    data: BinaryIO,
    model_name: Optional[str],
    **kwargs,
) -> DocumentLayout:
    """Processes pdf file in the form of a file handler (supporting a read method) into a
    DocumentLayout by using a model identified by model_name."""
    file_name = ''
    with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
        tmp_file.write(data.read())
        tmp_file.flush()  # Make sure the file is written out
        file_name = tmp_file.name
        
    try:
        layout = process_file_with_model(
            file_name,
            model_name,
            **kwargs,
        )
    finally:
        os.remove(file_name)

    return layout

with tempfile.TemporaryDirectory() as td:
f_name = os.path.join(td, ''.join(random.choices("abcdefghijklmnopqrstuvwxyz0123456789_", k=8)))
with open(f_name, "wb") as tmp_file:
tmp_file.write(data.read() if hasattr(data, "read") else data)
tmp_file.flush()

layout = process_file_with_model(
tmp_file.name,
f_name,
model_name,
**kwargs,
)
Expand Down