legacy office doc type conversion is not thread-safe in a container setup with Rocky Linux (potentially in general) #3763

Open
@cwang

Describe the bug

The convert_office_doc function, which converts legacy file types such as .ppt and .doc to their modern equivalents (.pptx and .docx respectively), is NOT thread-safe. When the conversion subprocess is spawned from a thread, it randomly returns exit code 1 without soffice actually performing the conversion. I see this in a container setup with Rocky Linux base images.

See

To Reproduce
Take a bundle of legacy office docs, such as a few .doc and a few .ppt files, and call the partition function on them from a thread pool. Randomly, one of the docs fails to get converted, so the whole partition call for that file fails. The failure is not tied to one particular file: it can hit any legacy file in the pack, which suggests to me that this is not a file issue but an issue with running the subprocess from threads.
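A minimal sketch of the kind of harness that could be used to observe this. The helper name `partition_concurrently` and its parameters are hypothetical (not part of unstructured); it simply runs a caller-supplied function over every path in a thread pool and records which paths raised.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def partition_concurrently(
    paths: Iterable[str],
    partition_fn: Callable[[str], object],
    max_workers: int = 4,
) -> Dict[str, Exception]:
    """Run partition_fn on every path in a thread pool.

    Returns a mapping of path -> exception for the paths that failed.
    (Hypothetical helper for illustration only.)
    """
    failures: Dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which future belongs to which file.
        futures = {pool.submit(partition_fn, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                future.result()
            except Exception as exc:  # collect the failure instead of aborting the run
                failures[path] = exc
    return failures
```

With unstructured installed, this could be called as `partition_concurrently(files, lambda p: partition(filename=p))`, where `partition` is imported from `unstructured.partition.auto`. Running it several times over the same pack of files should show whether the failing file changes between runs.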

Expected behavior
The legacy-to-modern office file conversion should always work, whether or not it is called from multiple threads.

Screenshots
N/A

Environment Info
I've reproduced this issue with a wide range of Rocky base images and Python 3.10/3.11/3.12.

Additional context
My workaround is to pick out all the legacy office docs from a pack of documents and put them in a single thread, so that they are always processed sequentially. It's not ideal, but maybe it should be mentioned in the OSS docs if no fix is coming any time soon?
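A rough sketch of that workaround, under assumptions: `LEGACY_SUFFIXES`, `split_legacy`, and `process_all` are hypothetical names, the suffix set only covers the two formats named in this issue, and `process_fn` stands in for whatever per-file call is used (for example a wrapper around partition).

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Tuple

# Assumption: only the formats named in this issue; extend as needed.
LEGACY_SUFFIXES = {".doc", ".ppt"}


def split_legacy(paths: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split paths into (legacy, modern) based on the file suffix."""
    legacy: List[str] = []
    modern: List[str] = []
    for path in paths:
        if Path(path).suffix.lower() in LEGACY_SUFFIXES:
            legacy.append(path)
        else:
            modern.append(path)
    return legacy, modern


def process_all(
    paths: Iterable[str],
    process_fn: Callable[[str], object],
    max_workers: int = 4,
) -> Dict[str, object]:
    """Process modern files in parallel and legacy files one at a time."""
    legacy, modern = split_legacy(paths)
    results: Dict[str, object] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # A single task walks through all legacy files in order, so at most
        # one legacy conversion is in flight at any time.
        legacy_future = pool.submit(lambda: [(p, process_fn(p)) for p in legacy])
        # Modern files are spread across the remaining workers.
        results.update(zip(modern, pool.map(process_fn, modern)))
        results.update(legacy_future.result())
    return results
```

This keeps parallelism for the modern formats while serializing only the files that need conversion. Any exception raised by `process_fn` propagates to the caller.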

Labels

- doc: Related to Microsoft Word (.doc) legacy file format
- enhancement: New feature or request
- ppt: Related to Microsoft PowerPoint (.ppt) legacy file format
