Doesn't process unassigned codepoints #103

@arp242

A few "valid" cases from toml-test currently fail:

FAIL valid/comment/nonascii
     System.InvalidOperationException: The document has errors: (1,17) : error : The character `�` is an invalid UTF8 character
 
        at Tomlyn.Model.TomlTable.From(DocumentSyntax documentSyntax)
        at Tomlyn.Toml.ToModel(DocumentSyntax syntax)
        at TomlynDecoder.Main(String[] args) in /home/martin/code/Toml/toml-test-matrix/src/cs-tomlyn/cs-tomlyn-decoder/cs-tomlyn-decoder.cs:line 17
 
     Exit 1

     input sent to parser-cmd:
       # ~ � ÿ ퟿  � 𐀀 �

     output from parser-cmd (stderr):
       System.InvalidOperationException: The document has errors: (1,17) : error : The character `�` is an invalid UTF8 character

          at Tomlyn.Model.TomlTable.From(DocumentSyntax documentSyntax)
          at Tomlyn.Toml.ToModel(DocumentSyntax syntax)
          at TomlynDecoder.Main(String[] args) in /home/martin/code/Toml/toml-test-matrix/src/cs-tomlyn/cs-tomlyn-decoder/cs-tomlyn-decoder.cs:line 17

       Exit 1

     want:
          

FAIL valid/key/quoted-unicode
     System.InvalidOperationException: The document has errors: (4,81) : error : Invalid Unicode scalar value [10FFFF]
     (6,16) : error : The character `�` is an invalid UTF8 character
     (7,18) : error : The character `�` is an invalid UTF8 character
 
        at Tomlyn.Model.TomlTable.From(DocumentSyntax documentSyntax)
        at Tomlyn.Toml.ToModel(DocumentSyntax syntax)
        at TomlynDecoder.Main(String[] args) in /home/martin/code/Toml/toml-test-matrix/src/cs-tomlyn/cs-tomlyn-decoder/cs-tomlyn-decoder.cs:line 17
 
     Exit 1

     input sent to parser-cmd:

       "\u0000" = "null"
       '\u0000' = "different key"
       "\u0008 \u000c \U00000041 \u007f \u0080 \u00ff \ud7ff \ue000 \uffff \U00010000 \U0010ffff" = "escaped key"

       "~ � ÿ ퟿  � 𐀀 �" = "basic key"
       'l ~ � ÿ ퟿  � 𐀀 �' = "literal key"

     output from parser-cmd (stderr):
       System.InvalidOperationException: The document has errors: (4,81) : error : Invalid Unicode scalar value [10FFFF]
       (6,16) : error : The character `�` is an invalid UTF8 character
       (7,18) : error : The character `�` is an invalid UTF8 character

          at Tomlyn.Model.TomlTable.From(DocumentSyntax documentSyntax)
          at Tomlyn.Toml.ToModel(DocumentSyntax syntax)
          at TomlynDecoder.Main(String[] args) in /home/martin/code/Toml/toml-test-matrix/src/cs-tomlyn/cs-tomlyn-decoder/cs-tomlyn-decoder.cs:line 17

       Exit 1

     want:
          

FAIL valid/string/quoted-unicode
     System.InvalidOperationException: The document has errors: (2,105) : error : Invalid Unicode scalar value [10FFFF]
     (5,31) : error : The character `�` is an invalid UTF8 character
     (6,33) : error : The character `�` is an invalid UTF8 character
 
        at Tomlyn.Model.TomlTable.From(DocumentSyntax documentSyntax)
        at Tomlyn.Toml.ToModel(DocumentSyntax syntax)
        at TomlynDecoder.Main(String[] args) in /home/martin/code/Toml/toml-test-matrix/src/cs-tomlyn/cs-tomlyn-decoder/cs-tomlyn-decoder.cs:line 17
 
     Exit 1

     input sent to parser-cmd:

       escaped_string = "\u0000 \u0008 \u000c \U00000041 \u007f \u0080 \u00ff \ud7ff \ue000 \uffff \U00010000 \U0010ffff"
       not_escaped_string = '\u0000 \u0008 \u000c \U00000041 \u007f \u0080 \u00ff \ud7ff \ue000 \uffff \U00010000 \U0010ffff'

       basic_string = "~ � ÿ ퟿  � 𐀀 �"
       literal_string = '~ � ÿ ퟿  � 𐀀 �'

     output from parser-cmd (stderr):
       System.InvalidOperationException: The document has errors: (2,105) : error : Invalid Unicode scalar value [10FFFF]
       (5,31) : error : The character `�` is an invalid UTF8 character
       (6,33) : error : The character `�` is an invalid UTF8 character

          at Tomlyn.Model.TomlTable.From(DocumentSyntax documentSyntax)
          at Tomlyn.Toml.ToModel(DocumentSyntax syntax)
          at TomlynDecoder.Main(String[] args) in /home/martin/code/Toml/toml-test-matrix/src/cs-tomlyn/cs-tomlyn-decoder/cs-tomlyn-decoder.cs:line 17

       Exit 1

     want:
          

I'm not sure why they don't fail with the existing toml-test integration. I ran these with a little binary I wrote so I can use Tomlyn with the toml-test tool: https://github.com/toml-lang/toml-test-matrix/blob/main/scripts/cs-tomlyn-decoder.cs (built with: https://github.com/toml-lang/toml-test-matrix/blob/main/parsers/cs-tomlyn.zsh#L8). Can reproduce with:

% toml-test ./cs-tomlyn-decoder/bin/Release/net8.0/cs-tomlyn-decoder

As for the errors:

  • "invalid UTF-8 character" is not really accurate, as it's valid UTF-8; it's just that U+FFFF is not currently assigned in Unicode. Looking at the code, we can probably just remove the CharHelper.IsValidUnicodeScalarValue(c) call in Tomlyn.Parsing.CheckCharacter()?

  • The "Invalid Unicode scalar value" error is similar: U+10FFFF is not currently assigned in Unicode. There is no requirement that \u... and \U... escapes only encode currently assigned codepoints. Can probably just remove the IsValidUnicodeScalarValue() call here too?
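
To illustrate both points, here is a minimal sketch in Python (not Tomlyn's code) of what a scalar-value check needs to reject. It assumes the usual Unicode definition: a scalar value is any code point in U+0000..U+10FFFF except the surrogate range U+D800..U+DFFF. The helper name `is_unicode_scalar_value` is hypothetical and only mirrors the name used in the issue; the sample code points are taken from the failing inputs above.

```python
def is_unicode_scalar_value(cp: int) -> bool:
    """True for any code point in U+0000..U+10FFFF that is not a surrogate.

    Hypothetical helper: whether a code point is assigned to a character
    plays no role in this check.
    """
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


# Code points that appear in the failing toml-test inputs.
samples = [0x0000, 0x0080, 0xD7FF, 0xE000, 0xFFFF, 0x10000, 0x10FFFF]

for cp in samples:
    assert is_unicode_scalar_value(cp)
    # Each one round-trips through UTF-8, so the encoded bytes are well-formed.
    encoded = chr(cp).encode("utf-8")
    assert encoded.decode("utf-8") == chr(cp)

# Surrogates and out-of-range values are the only inputs that should be rejected.
assert not is_unicode_scalar_value(0xD800)
assert not is_unicode_scalar_value(0xDFFF)
assert not is_unicode_scalar_value(0x110000)
```

Under that definition, U+FFFF and U+10FFFF both pass, which matches what the "valid" toml-test cases expect.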
