linter: Report non-printable characters as a syntax error#812
Conversation
Prevent yamllint crash upon unescaped (embedded) non-printable
characters:
$ printf 'key: val\000ue\n' | yamllint -
…
yaml.reader.ReaderError: unacceptable character #x0000: special characters are not allowed
in "<unicode string>", position 8
PyYAML raises a ReaderError for control characters such as NUL or DEL.
It is a YAMLError but, unlike the parser/scanner errors yamllint already
handles, not a MarkedYAMLError, so it escaped get_syntax_error() and
crashed the whole run. The same input also broke the token generator,
since BaseLoader() itself raises the error before any token is produced.
Catch ReaderError in get_syntax_error() and derive its line/column from
the flat buffer position, and tolerate it in token_or_comment_generator()
the same way ScannerError is, so cosmetic rules still run on the
printable lines. Such input is now reported as a regular syntax error.
|
Hello @sarathfrancis90, thanks for contributing! This change makes sense to me. @Jayman2000 and @jmknoble you contributed to the latest code that handles input encoding and non-printable characters: what do you think? |
|
Makes sense to me as well |
adrienverge
left a comment
There was a problem hiding this comment.
Sorry, my pending comment from yesterday wasn't posted, here it is ↓
| # BaseLoader() can already raise (e.g. a ReaderError on non-printable | ||
| # characters), so construct it inside the try too. Any such failure is | ||
| # surfaced separately as a syntax error by the linter. | ||
| yaml_loader = yaml.BaseLoader(buffer) |
There was a problem hiding this comment.
To better compartmentalise the logic, and to avoid side-effect, is it possible to use this code instead?
try:
yaml_loader = yaml.BaseLoader(buffer)
except yaml.reader.ReaderError:
# Failures like ReaderError on non-printable characters are surfaced
# separately as a syntax error by the linter, so ignore them here.
return
try:Construct the BaseLoader in a dedicated try/except so a ReaderError on non-printable input is handled on its own and the token loop keeps a separate try, which better compartmentalises the logic and avoids the side-effect of mixing the loader construction with the scanning loop.
|
Good call @adrienverge — moved the BaseLoader construction into its own try/except that returns on ReaderError, so the scanning loop is now in a separate block. Reads cleaner. Thanks! |
Well, my first reaction to seeing this pull request was: “Does YAML’s specification actually disallow the use on non-printable characters?” After doing some research, it looks like the answer is “yes”. Chapter 5.1 of revision 1.2.2 of the YAML Specification says:
So I think that it is definitely correct for yamllint to report this problem as a syntax error. I also think that the way that yamllint currently handles non-printable characters is subpar and that this pull request improves the situation: $ git switch --detach 30a25fe087e31d0345be0ffed4360e4651a44b6e # This is the tip of the master branch at the moment.
Previous HEAD position was 9da11cf parser: Move ReaderError handling into its own block
HEAD is now at 30a25fe build: Use 'dev' dependency group for PEP 735 compliance
$ printf 'key0: "looooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooong line"\nkey1: "non\000printable line"\n' | yamllint -
Traceback (most recent call last):
File "/home/jayman/Documents/Home/VC/Git/Partially mine/yamllint/venv314/bin/yamllint", line 8, in <module>
sys.exit(run())
~~~^^
File "/home/jayman/Documents/Home/VC/Git/Partially mine/yamllint/repo/yamllint/cli.py", line 241, in run
prob_level = show_problems(problems, 'stdin', args_format=args.format,
no_warn=args.no_warnings)
File "/home/jayman/Documents/Home/VC/Git/Partially mine/yamllint/repo/yamllint/cli.py", line 102, in show_problems
for problem in problems:
^^^^^^^^
File "/home/jayman/Documents/Home/VC/Git/Partially mine/yamllint/repo/yamllint/linter.py", line 199, in _run
syntax_error = get_syntax_error(buffer)
File "/home/jayman/Documents/Home/VC/Git/Partially mine/yamllint/repo/yamllint/linter.py", line 178, in get_syntax_error
list(yaml.parse(buffer, Loader=yaml.BaseLoader))
~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jayman/Documents/Home/VC/Git/Partially mine/yamllint/venv314/lib/python3.14/site-packages/yaml/__init__.py", line 44, in parse
loader = Loader(stream)
File "/home/jayman/Documents/Home/VC/Git/Partially mine/yamllint/venv314/lib/python3.14/site-packages/yaml/loader.py", line 14, in __init__
Reader.__init__(self, stream)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
File "/home/jayman/Documents/Home/VC/Git/Partially mine/yamllint/venv314/lib/python3.14/site-packages/yaml/reader.py", line 74, in __init__
self.check_printable(stream)
~~~~~~~~~~~~~~~~~~~~^^^^^^^^
File "/home/jayman/Documents/Home/VC/Git/Partially mine/yamllint/venv314/lib/python3.14/site-packages/yaml/reader.py", line 143, in check_printable
raise ReaderError(self.name, position, ord(character),
'unicode', "special characters are not allowed")
yaml.reader.ReaderError: unacceptable character #x0000: special characters are not allowed
in "<unicode string>", position 96
$ git switch --detach 9da11cf08b96455a66fc613325d8fe162711b3a3 # This is the tip of this pull request’s branch at the moment.
Previous HEAD position was 30a25fe build: Use 'dev' dependency group for PEP 735 compliance
HEAD is now at 9da11cf parser: Move ReaderError handling into its own block
$ printf 'key0: "looooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooong line"\nkey1: "non\000printable line"\n' | yamllint -
stdin
1:81 error line too long (85 > 80 characters) (line-length)
2:11 error syntax error: special characters are not allowed (syntax)
$The error message produced by this pull request’s branch is more helpful and easier to understand than the error message produced by yamllint’s That being said, it looks like this pull request has a minor flaw: $ git switch --detach 30a25fe087e31d0345be0ffed4360e4651a44b6e # This is the tip of the master branch at the moment.
Previous HEAD position was 9da11cf parser: Move ReaderError handling into its own block
HEAD is now at 30a25fe build: Use 'dev' dependency group for PEP 735 compliance
$ printf 'unrelated_syntax_error: "The quick brown fox jumps over' | yamllint -
stdin
1:1 warning missing document start "---" (document-start)
1:56 error syntax error: found unexpected end of stream (syntax)
$ git switch --detach 9da11cf08b96455a66fc613325d8fe162711b3a3 # This is the tip of this pull request’s branch at the moment.
Previous HEAD position was 30a25fe build: Use 'dev' dependency group for PEP 735 compliance
HEAD is now at 9da11cf parser: Move ReaderError handling into its own block
$ printf 'unrelated_syntax_error: "The quick brown fox jumps over' | yamllint -
stdin
1:1 warning missing document start "---" (document-start)
1:56 error syntax error: found unexpected end of stream (syntax)
$ printf 'key: val\000ue\n' | yamllint -
stdin
1:9 error syntax error: special characters are not allowed (syntax)
$In both the |
|
Good catch @Jayman2000, and thanks for digging into the spec too — you're right that it's inconsistent. I looked into whether the non-printable case could also emit the
This holds even when the bad character is buried deep in an otherwise-valid document: the entire string is checked up front, so there's never a token stream to run token rules against. The only ways I found to make The syntax error itself is still reported correctly (with the right line/column), and the crash is gone, which was the goal of this PR. @adrienverge, would you prefer to keep this PR scoped to the crash fix and treat the |
|
Nice catch @Jayman2000, and clear analysis @sarathfrancis90 👍 If there was an easy and consistent solution, I would be all for it. But given the way PyYAML behaves and how complex it would be to implement a perfect behavior, I'm not sure it's worth digging more. Most users would see the first error, fix the non-printable character, then re-run yamllint and then see "normal" errors. I believe this PR is already a good move on its own: no more crash. @Jayman2000 what do you think? |
I agree that it’s probably not worth digging more. I don’t think that the minor flaw that I had noticed earlier should be a blocker that prevents this pull request from being merged. |
|
Thank you @sarathfrancis90 @Jayman2000 @jmknoble, let's merge 👍 |
Linting a file that contains an unescaped non-printable character (a NUL byte, DEL, or another control char) crashes yamllint with a raw traceback instead of reporting a problem:
PyYAML raises
ReaderErrorfor these characters. It is aYAMLErrorbut, unlike the scanner/parser errors yamllint already catches, not aMarkedYAMLError, so it slipped pastget_syntax_error(). The same input also broketoken_or_comment_generator(), becauseBaseLoader()raises during construction before any token is produced.This is the embedded-special-character case that #703 deliberately left raising (it fixed the backslash-escaped variant for
quoted-strings). I catchReaderErroringet_syntax_error()and compute its line/column from the flat buffer position, and tolerate it in the token generator the same wayScannerErroris. The character is now reported as an ordinary syntax error, and cosmetic rules still run on the printable lines before it.I updated the three
quoted-stringscases that previously asserted the crash, and added tests intests/test_syntax_errors.py. Full suite, flake8 and ruff pass.