linter: Report non-printable characters as a syntax error #812

Merged: adrienverge merged 2 commits into adrienverge:master from sarathfrancis90:fix/reader-error-non-printable on Jun 18, 2026

@sarathfrancis90

Contributor

Linting a file that contains an unescaped non-printable character (a NUL byte, DEL, or another control char) crashes yamllint with a raw traceback instead of reporting a problem:

    $ printf 'key: val\000ue\n' | yamllint -
    ...
    yaml.reader.ReaderError: unacceptable character #x0000: special characters are not allowed
      in "<unicode string>", position 8

PyYAML raises ReaderError for these characters. It is a YAMLError but, unlike the scanner/parser errors yamllint already catches, not a MarkedYAMLError, so it slipped past get_syntax_error(). The same input also broke token_or_comment_generator(), because BaseLoader() raises during construction before any token is produced.

This is the embedded-special-character case that #703 deliberately left raising (it fixed the backslash-escaped variant for quoted-strings). I catch ReaderError in get_syntax_error() and compute its line/column from the flat buffer position, and tolerate it in the token generator the same way ScannerError is. The character is now reported as an ordinary syntax error, and cosmetic rules still run on the printable lines before it.
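
The line/column derivation described above can be sketched roughly as follows. This is a minimal illustration rather than the code from this PR: the helper name `position_to_line_column` is hypothetical, and it assumes the reported position is a 0-based index into the decoded string and that lines end with `\n`.

```python
def position_to_line_column(buffer, position):
    """Convert a flat 0-based index into a 1-based (line, column) pair.

    Hypothetical helper for illustration; assumes '\n' line endings.
    """
    # Every newline before the position pushes the character one line down.
    line = buffer.count('\n', 0, position) + 1
    # rfind() returns -1 when there is no earlier newline, which makes the
    # subtraction yield a 1-based column on the first line as well.
    last_newline = buffer.rfind('\n', 0, position)
    column = position - last_newline
    return line, column


print(position_to_line_column('key: val\x00ue\n', 8))  # (1, 9)
```

With the input from the report above (NUL at flat position 8), this gives line 1, column 9.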

I updated the three quoted-strings cases that previously asserted the crash, and added tests in tests/test_syntax_errors.py. Full suite, flake8 and ruff pass.

Prevent yamllint crash upon unescaped (embedded) non-printable
characters:

    $ printf 'key: val\000ue\n' | yamllint -
    …
    yaml.reader.ReaderError: unacceptable character #x0000: special characters are not allowed
      in "<unicode string>", position 8

PyYAML raises a ReaderError for control characters such as NUL or DEL.
It is a YAMLError but, unlike the parser/scanner errors yamllint already
handles, not a MarkedYAMLError, so it escaped get_syntax_error() and
crashed the whole run. The same input also broke the token generator,
since BaseLoader() itself raises the error before any token is produced.

Catch ReaderError in get_syntax_error() and derive its line/column from
the flat buffer position, and tolerate it in token_or_comment_generator()
the same way ScannerError is, so cosmetic rules still run on the
printable lines. Such input is now reported as a regular syntax error.
@adrienverge

Owner

Hello @sarathfrancis90, thanks for contributing!

This change makes sense to me.

@Jayman2000 and @jmknoble you contributed to the latest code that handles input encoding and non-printable characters: what do you think?
Unless you spot a problem, I'd like to merge this one.

@jmknoble

Contributor

Makes sense to me as well

@adrienverge adrienverge left a comment

Owner

Sorry, my pending comment from yesterday wasn't posted, here it is ↓

Comment thread yamllint/parser.py Outdated
Comment on lines +124 to +127
    # BaseLoader() can already raise (e.g. a ReaderError on non-printable
    # characters), so construct it inside the try too. Any such failure is
    # surfaced separately as a syntax error by the linter.
    yaml_loader = yaml.BaseLoader(buffer)

Owner

To better compartmentalise the logic, and to avoid side effects, is it possible to use this code instead?

    try:
        yaml_loader = yaml.BaseLoader(buffer)
    except yaml.reader.ReaderError:
        # Failures like ReaderError on non-printable characters are surfaced
        # separately as a syntax error by the linter, so ignore them here.
        return

    try:

Construct the BaseLoader in a dedicated try/except so a ReaderError on
non-printable input is handled on its own and the token loop keeps a
separate try, which better compartmentalises the logic and avoids the
side-effect of mixing the loader construction with the scanning loop.
@sarathfrancis90

Contributor Author

Good call @adrienverge — moved the BaseLoader construction into its own try/except that returns on ReaderError, so the scanning loop is now in a separate block. Reads cleaner. Thanks!

@Jayman2000

Contributor

> @Jayman2000 and @jmknoble you contributed to the latest code that handles input encoding and non-printable characters: what do you think?

Well, my first reaction to seeing this pull request was: “Does YAML’s specification actually disallow the use of non-printable characters?” After doing some research, it looks like the answer is “yes”. Chapter 5.1 of revision 1.2.2 of the YAML Specification says:

> To ensure readability, YAML streams use only the printable subset of the Unicode character set. The allowed character range explicitly excludes the C0 control block x00-x1F (except for TAB x09, LF x0A and CR x0D which are allowed), DEL x7F, the C1 control block x80-x9F (except for NEL x85 which is allowed), the surrogate block xD800-xDFFF, xFFFE and xFFFF.
>
> On input, a YAML processor must accept all characters in this printable subset.
>
> On output, a YAML processor must only produce only characters in this printable subset. Characters outside this set must be presented using escape sequences. In addition, any allowed characters known to be non-printable should also be escaped.
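
The character ranges in the quoted passage can be written as a regular expression. The sketch below is derived only from that passage and is an illustration, not PyYAML's or yamllint's actual check, which may differ in detail. The names `NON_PRINTABLE` and `first_non_printable` are chosen for this example.

```python
import re

# Matches any character outside the printable subset quoted above:
# TAB, LF, CR, x20-x7E, NEL (x85), xA0-xD7FF, xE000-xFFFD, x10000-x10FFFF.
# This excludes the C0 and C1 control blocks (apart from the allowed
# exceptions), DEL, the surrogate block, xFFFE and xFFFF.
NON_PRINTABLE = re.compile(
    '[^\t\n\r\x20-\x7e\x85\xa0-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]'
)


def first_non_printable(text):
    """Return the 0-based index of the first disallowed character, or None."""
    match = NON_PRINTABLE.search(text)
    return match.start() if match else None


print(first_non_printable('key: val\x00ue\n'))  # 8
print(first_non_printable('tab\tis fine\r\n'))  # None
```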

So I think that it is definitely correct for yamllint to report this problem as a syntax error. I also think that the way that yamllint currently handles non-printable characters is subpar and that this pull request improves the situation:

    $ git switch --detach 30a25fe087e31d0345be0ffed4360e4651a44b6e  # This is the tip of the master branch at the moment.
    Previous HEAD position was 9da11cf parser: Move ReaderError handling into its own block
    HEAD is now at 30a25fe build: Use 'dev' dependency group for PEP 735 compliance
    $ printf 'key0: "looooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooong line"\nkey1: "non\000printable line"\n' | yamllint -
    Traceback (most recent call last):
      File "/home/jayman/Documents/Home/VC/Git/Partially mine/yamllint/venv314/bin/yamllint", line 8, in <module>
        sys.exit(run())
                 ~~~^^
      File "/home/jayman/Documents/Home/VC/Git/Partially mine/yamllint/repo/yamllint/cli.py", line 241, in run
        prob_level = show_problems(problems, 'stdin', args_format=args.format,
                                   no_warn=args.no_warnings)
      File "/home/jayman/Documents/Home/VC/Git/Partially mine/yamllint/repo/yamllint/cli.py", line 102, in show_problems
        for problem in problems:
                       ^^^^^^^^
      File "/home/jayman/Documents/Home/VC/Git/Partially mine/yamllint/repo/yamllint/linter.py", line 199, in _run
        syntax_error = get_syntax_error(buffer)
      File "/home/jayman/Documents/Home/VC/Git/Partially mine/yamllint/repo/yamllint/linter.py", line 178, in get_syntax_error
        list(yaml.parse(buffer, Loader=yaml.BaseLoader))
        ~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/jayman/Documents/Home/VC/Git/Partially mine/yamllint/venv314/lib/python3.14/site-packages/yaml/__init__.py", line 44, in parse
        loader = Loader(stream)
      File "/home/jayman/Documents/Home/VC/Git/Partially mine/yamllint/venv314/lib/python3.14/site-packages/yaml/loader.py", line 14, in __init__
        Reader.__init__(self, stream)
        ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
      File "/home/jayman/Documents/Home/VC/Git/Partially mine/yamllint/venv314/lib/python3.14/site-packages/yaml/reader.py", line 74, in __init__
        self.check_printable(stream)
        ~~~~~~~~~~~~~~~~~~~~^^^^^^^^
      File "/home/jayman/Documents/Home/VC/Git/Partially mine/yamllint/venv314/lib/python3.14/site-packages/yaml/reader.py", line 143, in check_printable
        raise ReaderError(self.name, position, ord(character),
                'unicode', "special characters are not allowed")
    yaml.reader.ReaderError: unacceptable character #x0000: special characters are not allowed
      in "<unicode string>", position 96
    $ git switch --detach 9da11cf08b96455a66fc613325d8fe162711b3a3  # This is the tip of this pull request’s branch at the moment.
    Previous HEAD position was 30a25fe build: Use 'dev' dependency group for PEP 735 compliance
    HEAD is now at 9da11cf parser: Move ReaderError handling into its own block
    $ printf 'key0: "looooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooong line"\nkey1: "non\000printable line"\n' | yamllint -
    stdin
      1:81      error    line too long (85 > 80 characters)  (line-length)
      2:11      error    syntax error: special characters are not allowed (syntax)

    $

The error message produced by this pull request’s branch is more helpful and easier to understand than the error message produced by yamllint’s master branch.

That being said, it looks like this pull request has a minor flaw:

    $ git switch --detach 30a25fe087e31d0345be0ffed4360e4651a44b6e  # This is the tip of the master branch at the moment.
    Previous HEAD position was 9da11cf parser: Move ReaderError handling into its own block
    HEAD is now at 30a25fe build: Use 'dev' dependency group for PEP 735 compliance
    $ printf 'unrelated_syntax_error: "The quick brown fox jumps over' | yamllint -
    stdin
      1:1       warning  missing document start "---"  (document-start)
      1:56      error    syntax error: found unexpected end of stream (syntax)

    $ git switch --detach 9da11cf08b96455a66fc613325d8fe162711b3a3  # This is the tip of this pull request’s branch at the moment.
    Previous HEAD position was 30a25fe build: Use 'dev' dependency group for PEP 735 compliance
    HEAD is now at 9da11cf parser: Move ReaderError handling into its own block
    $ printf 'unrelated_syntax_error: "The quick brown fox jumps over' | yamllint -
    stdin
      1:1       warning  missing document start "---"  (document-start)
      1:56      error    syntax error: found unexpected end of stream (syntax)

    $ printf 'key: val\000ue\n' | yamllint -
    stdin
      1:9       error    syntax error: special characters are not allowed (syntax)

    $

In both the master branch and in this pull request’s branch, we get a ‘missing document start "---"’ warning when we lint a YAML stream that has a different type of syntax error. I would expect yamllint to also give us a ‘missing document start "---"’ warning when we lint a YAML stream that has a non-printable character–syntax error, but that doesn’t seem to work at the moment.

@sarathfrancis90

Contributor Author

Good catch @Jayman2000, and thanks for digging into the spec too — you're right that it's inconsistent.

I looked into whether the non-printable case could also emit the document-start warning, and unfortunately it can't without faking input. The difference comes down to when PyYAML raises:

  • A "normal" syntax error is a ScannerError, raised partway through scanning. By that point the scanner has already produced a StreamStartToken (and usually a few more), so yamllint's token-based rules — including document-start — get to run before the error surfaces.
  • A non-printable character is a ReaderError, and PyYAML validates printability eagerly in Reader.__init__ (check_printable() scans the whole buffer at construction time, before self.buffer is even assigned). So yaml.BaseLoader(buffer) raises before get_token() is ever reachable — the scanner never runs and no token is produced, not even StreamStartToken.

This holds even when the bad character is buried deep in an otherwise-valid document: the entire string is checked up front, so there's never a token stream to run token rules against. The only ways I found to make document-start fire here would be to synthesize a fake StreamStartToken or to truncate the buffer at the bad character and re-parse the prefix — both misrepresent the real document, so I'd rather not.
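
The timing difference described above can be shown with a toy model. Nothing below is PyYAML code: `lazy_tokens`, `EagerReader` and `collect` are hypothetical stand-ins, with `'!bad'` playing the role of an ordinary scanner error and a NUL character playing the role of non-printable input.

```python
def lazy_tokens(text):
    """Scanner-like generator: fails only once it reaches the bad input."""
    yield 'stream-start'
    for word in text.split():
        if word == '!bad':
            raise ValueError('scan error')
        yield word


class EagerReader:
    """Reader-like class: validates the whole text at construction time."""

    def __init__(self, text):
        if '\x00' in text:
            raise ValueError('special characters are not allowed')
        self.text = text

    def tokens(self):
        return lazy_tokens(self.text)


def collect(make_tokens):
    """Gather the tokens a token-based rule would see before the error."""
    seen = []
    try:
        for token in make_tokens():
            seen.append(token)
    except ValueError:
        pass
    return seen


print(collect(lambda: lazy_tokens('a !bad b')))           # ['stream-start', 'a']
print(collect(lambda: EagerReader('a \x00 b').tokens()))  # []
```

In the lazy case a token-based rule still sees `stream-start` before the error. In the eager case the constructor raises first, so there is nothing for such a rule to inspect.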

The syntax error itself is still reported correctly (with the right line/column), and the crash is gone, which was the goal of this PR. @adrienverge, would you prefer to keep this PR scoped to the crash fix and treat the document-start consistency as a separate follow-up? Happy to go whichever way you'd like.

@adrienverge

Owner

Nice catch @Jayman2000, and clear analysis @sarathfrancis90 👍

If there was an easy and consistent solution, I would be all for it. But given the way PyYAML behaves and how complex it would be to implement a perfect behavior, I'm not sure it's worth digging more. Most users would see the first error, fix the non-printable character, then re-run yamllint and then see "normal" errors. I believe this PR is already a good move on its own: no more crash. @Jayman2000 what do you think?

@Jayman2000

Contributor

> If there was an easy and consistent solution, I would be all for it. But given the way PyYAML behaves and how complex it would be to implement a perfect behavior, I'm not sure it's worth digging more. Most users would see the first error, fix the non-printable character, then re-run yamllint and then see "normal" errors. I believe this PR is already a good move on its own: no more crash. @Jayman2000 what do you think?

I agree that it’s probably not worth digging more. I don’t think that the minor flaw that I had noticed earlier should be a blocker that prevents this pull request from being merged.

@adrienverge

Owner

Thank you @sarathfrancis90 @Jayman2000 @jmknoble, let's merge 👍

@adrienverge adrienverge merged commit 122853f into adrienverge:master Jun 18, 2026
7 checks passed