fix: handle broken unicode #164

Merged
efiop merged 1 commit into main from ruslan/fix-decode-error on Feb 17, 2025

@efiop
Contributor

@efiop efiop commented Feb 17, 2025

When working with Arabic text in the output, I sometimes see errors like this:

Exception in thread Thread-3 (_reader):
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.12.4/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/root/.pyenv/versions/3.12.4/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/venv/lib/python3.12/site-packages/isolate/backends/common.py", line 153, in _reader
    forward_lines(fd)
  File "/opt/venv/lib/python3.12/site-packages/isolate/backends/common.py", line 115, in forward_lines
    raw_data = stream.read()
               ^^^^^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 4095: unexpected end of data

which breaks the rest of the output. This happens because we are reading from a non-blocking pipe, which can hand us an incomplete UTF-8 sequence (a multi-byte character split across two reads). The resulting UnicodeDecodeError kills the reader thread, so the job appears stuck.

E.g.

Mac ➜  isolate git:(ruslan/fix-decode-error) ✗  cat test_unicode.py 
import os

# Create a pipe
r, w = os.pipe()
os.set_blocking(r, False)

# Write partial UTF-8 bytes
os.write(w, b'\xe2')  # This is an incomplete UTF-8 sequence

# Try to read it
with open(r) as f:
    print(f.read())  # This will raise the UnicodeDecodeError
Mac ➜  isolate git:(ruslan/fix-decode-error) ✗  python test_unicode.py
Traceback (most recent call last):
  File "/Users/efiop/git/efiop/isolate/test_unicode.py", line 12, in <module>
    print(f.read())  # This will raise the UnicodeDecodeError
          ^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 0: unexpected end of data
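
For illustration only (this is not the change made in this PR), here is a minimal sketch of one way to tolerate characters that arrive split across reads, assuming the reader works on raw bytes rather than a text-mode file. The helper name `decode_chunks` is hypothetical. It uses the standard library's incremental decoder, which buffers an incomplete trailing sequence until the next chunk arrives.

```python
import codecs


def decode_chunks(chunks):
    """Decode an iterable of byte chunks, tolerating characters split across chunks.

    Hypothetical helper for illustration; not part of isolate.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    pieces = []
    for chunk in chunks:
        # final=False: an incomplete trailing sequence is buffered, not raised.
        pieces.append(decoder.decode(chunk, final=False))
    # Flush at end of stream: bytes still buffered here are genuinely broken
    # and are replaced with U+FFFD because of errors="replace".
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)


# Arabic letters take two bytes each in UTF-8, so cutting at byte 3
# splits the second character across two chunks.
data = "مرحبا".encode("utf-8")
print(decode_chunks([data[:3], data[3:]]))
```

With this approach a lone `b'\xe2'` at the end of one read is held back and only turns into a replacement character if the stream ends without the rest of the sequence.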

@efiop efiop requested review from chamini2 and cmlad February 17, 2025 14:38
@chamini2
Member

OK, found this interesting and tested something

import os

# Create a pipe
r, w = os.pipe()
os.set_blocking(r, False)

# Write partial UTF-8 bytes
os.write(w, b'\xe2')  # This is an incomplete UTF-8 sequence

# Try to read it
with open('out', 'wb') as w:
    with open(r, encoding='utf-8', errors='surrogateescape') as f:
        w.write(f.read().encode('utf-8'))
❯ python t/encoding.py
Traceback (most recent call last):
  File "/Users/matteo/Projects/fal-ai/cloud/t/encoding.py", line 13, in <module>
    w.write(f.read().encode('utf-8'))
            ^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce2' in position 0: surrogates not allowed

This still fails for me. Do you see different behavior, or am I doing something wrong?
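
A small sketch of why the snippet above fails, using only built-in `bytes.decode` and `str.encode`: `surrogateescape` on decode maps the undecodable byte to a lone surrogate, and that surrogate can only be encoded again if the encode step also uses `surrogateescape` (or some other non-strict handler). The variable names are illustrative.

```python
raw = b"\xe2"  # incomplete UTF-8 sequence, as in the repro

# Decoding with surrogateescape does not raise; the bad byte 0xE2 becomes
# the lone surrogate U+DCE2.
text = raw.decode("utf-8", errors="surrogateescape")

# Strict encoding rejects lone surrogates, which is the error seen above.
try:
    text.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Encoding with the same handler restores the original byte exactly.
round_trip = text.encode("utf-8", errors="surrogateescape")

# A lossy alternative: "replace" on encode substitutes "?" for the surrogate.
lossy = text.encode("utf-8", errors="replace")

print(repr(text), strict_failed, round_trip, lossy)
```

So `surrogateescape` only helps when every later encode step also opts in; otherwise the error just moves from the read side to the write side.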

@efiop efiop force-pushed the ruslan/fix-decode-error branch from d961aee to 4198c54 February 17, 2025 14:49
@efiop efiop changed the title from "fix: use 'surrogateescape' for encoding errors" to "fix: handle broken unicode" Feb 17, 2025
@efiop
Contributor Author

efiop commented Feb 17, 2025

@chamini2 Sorry, I was testing it out while the test ran here. I've replaced it with a different error-handling policy.
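
The comment does not say which policy was chosen, so the following is only a sketch of how the standard decode error handlers behave when passed to `open()` on a pipe. The helper `read_pipe` is hypothetical, and the write end is closed before reading so the read reaches EOF deterministically.

```python
import os


def read_pipe(errors):
    """Write a truncated UTF-8 sequence to a pipe and read it back as text.

    Hypothetical helper for illustration; `errors` is any handler name
    accepted by open().
    """
    r, w = os.pipe()
    os.write(w, b"ok \xe2")  # ends in an incomplete UTF-8 sequence
    os.close(w)  # close the write end so read() sees EOF
    with open(r, encoding="utf-8", errors=errors) as f:
        return f.read()


# None of these raise; they differ in what replaces the bad byte.
for handler in ("replace", "backslashreplace", "ignore"):
    print(handler, repr(read_pipe(handler)))
```

`replace` yields U+FFFD, `backslashreplace` keeps the byte value visible as the text `\xe2`, and `ignore` drops it silently. Unlike `surrogateescape`, all three produce strings that encode back to UTF-8 without error.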

@efiop efiop merged commit 12708ac into main Feb 17, 2025
6 checks passed
@efiop efiop deleted the ruslan/fix-decode-error branch February 17, 2025 15:55