Commit c99b84c

detect encoding method
1 parent 3b718ec commit c99b84c

2 files changed: +5 -1 lines changed

Diff for: requirements/base.in

+1
@@ -22,3 +22,4 @@ tqdm
 psutil
 python-oxmsg
 html5lib
+chardet

Diff for: unstructured/partition/common/common.py

+4 -1
@@ -1,5 +1,6 @@
 from __future__ import annotations
 
+import chardet
 import numbers
 import subprocess
 from io import BufferedReader, BytesIO, TextIOWrapper
@@ -296,7 +297,9 @@ def convert_office_doc(
     wait_time = 0
     sleep_time = 0.1
     output = subprocess.run(command, capture_output=True)
-    message = output.stdout.decode().strip()
+    detected_encoding = chardet.detect(output.stdout)
+    encoding = detected_encoding['encoding'] or 'utf-8' # Default to utf-8 if detection fails
+    message = output.stdout.decode(encoding).strip()
     # we can't rely on returncode unfortunately because on macOS it would return 0 even when the
     # command failed to run; instead we have to rely on the stdout being empty as a sign of the
     # process failed
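
The change above decodes the command's stdout with whatever encoding chardet reports, falling back to utf-8 when detection returns nothing. The sketch below is not part of this commit. It shows one way the same step could also tolerate a wrong guess or a codec name Python does not recognise. The names `decode_process_output`, `_chardet_guess`, `guess` and `fallback` are hypothetical and do not appear in the repository; only `chardet.detect` and `bytes.decode` are real calls.

```python
from __future__ import annotations

from typing import Callable, Optional


def _chardet_guess(raw: bytes) -> Optional[str]:
    """Return chardet's best guess, or None if chardet is missing or unsure."""
    try:
        import chardet  # third-party; this commit adds it to requirements/base.in
    except ImportError:
        return None
    return chardet.detect(raw)["encoding"]


def decode_process_output(
    raw: bytes,
    guess: Callable[[bytes], Optional[str]] = _chardet_guess,
    fallback: str = "utf-8",
) -> str:
    """Decode subprocess output, tolerating a wrong or missing encoding guess.

    Hypothetical helper: `guess` is injectable so the behaviour can be
    exercised without depending on chardet's heuristics.
    """
    if not raw:
        # Empty stdout stays empty; the caller treats "" as "not ready yet".
        return ""
    encoding = guess(raw) or fallback
    try:
        return raw.decode(encoding).strip()
    except (UnicodeDecodeError, LookupError):
        # The guess was wrong, or named a codec Python does not know:
        # decode with the fallback and replace undecodable bytes.
        return raw.decode(fallback, errors="replace").strip()
```

Under these assumptions, the line in `convert_office_doc` would become `message = decode_process_output(output.stdout)`. Empty output would still decode to an empty string, which the surrounding comments describe as the signal that the process failed.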
