@YooshiJay

Below is part of the error report:

File d:\programs\anaconda3\envs\ragv3-env\lib\site-packages\unstructured\partition\doc.py:74, in partition_doc(filename, file, metadata_filename, metadata_last_modified, libre_office_filter, **kwargs)
     70         f.write(file.read())
     72 # -- convert the .doc file to .docx. The resulting file takes the same base-name as the
     73 # -- source file and is written to `target_dir`.
---> 74 convert_office_doc(
     75     source_file_path,
     76     target_dir,
     77     target_format="docx",
     78     target_filter=libre_office_filter,
     79 )
     81 # -- compute the path of the resulting .docx document --
     82 _, filename_no_path = os.path.split(os.path.abspath(source_file_path))

File d:\programs\anaconda3\envs\ragv3-env\lib\site-packages\unstructured\partition\common\common.py:299, in convert_office_doc(input_filename, output_directory, target_format, target_filter, wait_for_soffice_ready_time_out)
    297 sleep_time = 0.1
    298 output = subprocess.run(command, capture_output=True)
--> 299 message = output.stdout.decode().strip()
    300 # we can't rely on returncode unfortunately because on macOS it would return 0 even when the
    301 # command failed to run; instead we have to rely on the stdout being empty as a sign of the
    302 # process failed
    303 while (wait_time < wait_for_soffice_ready_time_out) and (message == ""):

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd6 in position 60: invalid continuation byte

The root cause is that my document is written in Chinese, so its encoding is not UTF-8. In fact, when I simply change the failing line to `message = output.stdout.decode("gbk").strip()`, it works!

So I added a step that checks the document's encoding. Hope it helps! :)
(This is my first PR, so I hope I didn't do anything wrong.)
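
For illustration only, here is a minimal sketch of the kind of fallback decoding described above. It is not the actual diff in this PR: the helper name `decode_output` and its `encodings` parameter are hypothetical, and it assumes that trying UTF-8 first and then the system's preferred encoding is acceptable for the captured stdout bytes.

```python
import locale
from typing import Optional, Sequence


def decode_output(raw: bytes, encodings: Optional[Sequence[str]] = None) -> str:
    """Decode captured subprocess output without raising UnicodeDecodeError.

    Hypothetical helper: tries each candidate encoding in order and returns
    the first successful result.
    """
    if encodings is None:
        # Assumption: UTF-8 first, then the locale's preferred encoding
        # (often a GBK code page on a Chinese-language Windows system).
        encodings = ("utf-8", locale.getpreferredencoding(False))
    for enc in encodings:
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding; try the next candidate.
            continue
    # Last resort: keep going with replacement characters instead of failing.
    return raw.decode("utf-8", errors="replace")
```

With a helper like this, the failing line could read `message = decode_output(output.stdout).strip()`. Passing the encodings explicitly, for example `decode_output(raw, encodings=("utf-8", "gbk"))`, keeps the behavior independent of the machine's locale.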

@cragwolfe
Contributor

@YooshiJay, thanks for contributing this PR. Is there any chance you have a 1-page .doc that has the issue and could be used in a unit test?
