Skip to content

feat(yaml_parser): lex plain token #5870

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

vohoanglong0107
Copy link
Contributor

@vohoanglong0107 vohoanglong0107 commented May 3, 2025

Summary

Allow the YAML lexer to lex plain token.

Test Plan

Added new tests for the lexer.

@github-actions github-actions bot added the A-Parser Area: parser label May 3, 2025
Copy link

codspeed-hq bot commented May 3, 2025

CodSpeed Performance Report

Merging #5870 will not alter performance

Comparing vohoanglong0107:lex_plain_token_with_context (50d8ce3) with main (14551ba)

Summary

✅ 95 untouched benchmarks

@vohoanglong0107 vohoanglong0107 force-pushed the lex_plain_token_with_context branch from d324c06 to 9627d10 Compare May 3, 2025 15:32
@vohoanglong0107 vohoanglong0107 marked this pull request as ready for review May 3, 2025 23:35
@vohoanglong0107 vohoanglong0107 requested review from a team May 3, 2025 23:36
Copy link
Contributor

@dyc3 dyc3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love to see work on yaml! While I have mostly compliments, I do have some things that need to be addressed.

Comment on lines +361 to +374
// https://yaml.org/spec/1.2.2/#rule-ns-char
fn is_non_space_char(c: u8) -> bool {
!is_space(c) && !is_break(c)
}

// https://yaml.org/spec/1.2.2/#rule-s-white
fn is_space(c: u8) -> bool {
c == b' ' || c == b'\t'
}

// https://yaml.org/spec/1.2.2/#rule-b-char
fn is_break(c: u8) -> bool {
c == b'\n' || c == b'\r'
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the attention to detail here with the spec links. It's nice that these things are well defined.

Comment on lines 331 to 339
// Inside block key context
BlockKey,
// Outside flow context
FlowOut,
// Inside flow context
FlowIn,
// Inside flow key context
FlowKey,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be doc comments, and these should have some code samples to visually explain what these mean.

@@ -12,17 +12,16 @@ use std::time::Duration;
// Assert the result of lexing a piece of source code,
// and make sure the tokens yielded are fully lossless and the source can be reconstructed from only the tokens
macro_rules! assert_lex {
($src:expr, $($kind:ident:$len:expr $(,)?)*) => {{
($context:expr, $src:expr, $($kind:ident:$len:expr $(,)?)*) => {{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you've taken inspiration from the HTML lexer tests. :)

@@ -76,7 +75,7 @@ fn losslessness(string: String) -> bool {
let mut lexer = YamlLexer::from_str(&cloned);
let mut tokens = vec![];

while lexer.next_token(()) != EOF {
while lexer.next_token(YamlLexContext::default()) != EOF {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change correct? I'm getting mixed signals from other occurrences of this test for other languages.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, could you elaborate on that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essentially, the change doesn't seem correct. We don't do this in other lexer tests, which means there's something incorrect in the implementation of the lexer

source
}

fn next_non_trivia_token(&mut self, first_token: bool) {
fn next_non_trivia_token(&mut self, first_token: bool, context: YamlLexContext) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the equivalent for this in the html parser has these arguments reversed: context: HtmlLexContext, first_token: bool and it would be better if we stay consistent. (so if the html parser is actually the one at fault we would change it to match the other languages, of course)

Comment on lines 76 to 80
fn bump(&mut self) {
if self.current() != EOF {
self.next_non_trivia_token(false)
self.next_non_trivia_token(false, YamlLexContext::Regular)
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really related to this PR, but this should be a call to bump_with_context

fn bump(&mut self) {
    self.bump_with_context(HtmlLexContext::Regular)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I have another PR in store which implements BumpWithContext for the Token Source. Let me address this then.

Comment on lines 375 to 376

fn is_indicator(c: u8) -> bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fn is_indicator(c: u8) -> bool {
// https://yaml.org/spec/1.2.2/#53-indicator-characters
fn is_indicator(c: u8) -> bool {

Would it be useful to add the YAML spec link here as well?

@vohoanglong0107 vohoanglong0107 requested a review from dyc3 May 4, 2025 05:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Parser Area: parser
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants