Skip to content
Open
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
27 changes: 25 additions & 2 deletions vunit/sim_if/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@
import sys
import os
from os import environ, listdir, pathsep
import locale
import subprocess
from pathlib import Path
from typing import List
Expand Down Expand Up @@ -360,14 +361,36 @@ def check_output(command, env=None):
"""
Wrapper arround subprocess.check_output
"""
def _decode(data: bytes) -> str:
"""Decode tool output robustly across platforms.

Some simulators on Windows emit output in a legacy code page (e.g. cp1252),
which can raise UnicodeDecodeError if decoded as strict UTF-8.
"""

encodings_to_try = (
"utf-8",
"utf-8-sig",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the string is encoded with utf-8-sig, utf-8 decoding will pass without errors. You do get some extra characters though:

>>> "hello".encode("utf-8-sig").decode("utf-8")
    '\ufeffhello'

If utf-8 is decoded as utf-8-sig, it will work most of the time as the decoder recognizes that the initial expected bytes are missing.

>>> "hello".encode("utf-8").decode("utf-8-sig")
    'hello'

The exception is if the utf-8 string starts with the bytes expected at the beginning of a utf-8-sig string. Rare but fully legal.

To be strict I suggest just using utf-8. Worst case is you get som extra characters being printed

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. That is a right comment and I will drop utf-8-sig from the chain:

locale.getpreferredencoding(False) or "utf-8",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will locale.getpreferredencoding(False) ever return None/empy string such that or "utf-8" makes a difference? If it does, we end up trying utf-8 again.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. locale.getpreferredencoding(False) never returns None It could theoretically return "" on a broken system, but that's extremely unlikely.
I will remove or "utf-8" and deduplicate the list using dict.fromkeys (preserves insertion order).

"cp1252",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cp1252 is a byte encoding so it will never fail and the code will never reach line 384. If the encoding was something else, there will be an information loss when the string is printed. I suggest you remove this line such that line 384 is reached.

I also suggest that you use errors="backslashreplace". That way, there is no information loss in the printed string. The user can copy it, convert back to bytes, and then try different decodings until the correct one is found.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. right points. I will do.

)

for encoding in encodings_to_try:
try:
return data.decode(encoding)
except UnicodeDecodeError:
continue

return data.decode("utf-8", errors="replace")

try:
output = subprocess.check_output( # pylint: disable=unexpected-keyword-arg
command, env=env, stderr=subprocess.STDOUT
)
except subprocess.CalledProcessError as err:
err.output = err.output.decode("utf-8")
err.output = _decode(err.output)
raise err
return output.decode("utf-8")
return _decode(output)


def check_executable(simulator_name, prefix, executable_name):
Expand Down