feature/implement lexer - part 2 #3
base: main
Conversation
fn backtrack_if_needed<F, T>(&mut self, f: F) -> Option<T>
where
    F: FnOnce(&mut Self) -> Option<T>,
{
    let saved = self.position;
    let result = f(self);
    if result.is_none() {
        self.position = saved;
    }
    result
}
I feel like this is a parser job, not a lexer job
No, I don't think so. The paper implies these are still lexical tokens (the grammar doesn't have as much ambiguity).
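(For readers following the thread: below is a minimal, self-contained sketch of the save/restore pattern that the backtrack_if_needed helper above implements. The Lexer fields and the toy try_read_digits rule are assumptions for illustration, not code from this PR.)

struct Lexer {
    input: Vec<char>,
    position: usize,
}

impl Lexer {
    // Same helper as in the diff above: run a trial rule, rewind on failure.
    fn backtrack_if_needed<F, T>(&mut self, f: F) -> Option<T>
    where
        F: FnOnce(&mut Self) -> Option<T>,
    {
        let saved = self.position;
        let result = f(self);
        if result.is_none() {
            self.position = saved;
        }
        result
    }

    // Toy rule: consume a nonempty run of ASCII digits.
    fn try_read_digits(&mut self) -> Option<String> {
        let start = self.position;
        while self.position < self.input.len() && self.input[self.position].is_ascii_digit() {
            self.position += 1;
        }
        if self.position == start {
            None
        } else {
            Some(self.input[start..self.position].iter().collect())
        }
    }
}

fn main() {
    let mut lx = Lexer { input: "A1".chars().collect(), position: 0 };
    // "A1" does not start with a digit: the rule fails, and the helper
    // restores position to 0 so another rule can be tried.
    assert!(lx.backtrack_if_needed(Lexer::try_read_digits).is_none());
    assert_eq!(lx.position, 0);
}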
// Could be a cell reference like $A$1 or range like $A:$B or $1:$10
// Try cell/vertical range first
if let Some(token) = self.try_read_cell_or_vertical_range() {
    Ok(token)
} else if let Some(token) = self.try_read_horizontal_range() {
    Ok(token)
} else {
    // Invalid $ usage
    let c = self.current().unwrap_or('$');
    Err(LexerError::UnexpectedChar(c))
}
}
Some(c) if c.is_ascii_uppercase() => {
    // Try cell/vertical range first (e.g., A1, A:Z)
    if let Some(token) = self.try_read_cell_or_vertical_range() {
        Ok(token)
    } else {
        // Try identifier for TRUE/FALSE
        let ident = self.read_identifier();
        match ident.to_uppercase().as_str() {
            "TRUE" => Ok(Token::Bool(true)),
            "FALSE" => Ok(Token::Bool(false)),
            _ => Err(LexerError::UnexpectedChar(c)),
        }
    }
Actually yeah I feel like this is a parser job
Or maybe not, but I think the backtracking can be removed by doing some sort of precedence, see: https://github.com/spreadsheetlab/XLParser/blob/master/src/XLParser/ExcelFormulaGrammar.cs
^ The above library references the paper too / the paper also references that library.
I think that code is just defining some grammar and letting some other library do the lexing and parsing:
var assembly = typeof(ExcelFormulaGrammar).GetTypeInfo().Assembly;
I think my naming may be bad? By the "backtrack" function name, I was just implementing something like context management in Python, where the position gets reset if there was no match.
But yes, the time complexity is higher. In exchange we could get code like this, which would be a little more maintainable and maybe easier to reason about for priorities:
fn parse_reference(&mut self) -> Token {
    if let Some(cell) = self.try_read_cell() {
        return cell;
    }
    if let Some(name) = self.try_read_named_range() {
        return name;
    }
    // <more tokens depending on priorities>
    // ...
}
If we are trying for pure performance, I can refactor it, though.
Not sure what the best way is; leaning towards cleaner code. The last time I did this, I just wrote regexes and passed them to a library for lexing.
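(A sketch of what that regex-and-library approach could look like in Rust, using the regex crate; the crate choice and the exact patterns, adapted from the token table in the PR description below, are assumptions rather than anything in this PR.)

// Requires the `regex` crate as a dependency.
use regex::Regex;

fn main() {
    // Patterns mirror the token table in the PR description,
    // anchored to match a whole candidate string.
    let cell = Regex::new(r"^\$?[A-Z]+\$?[1-9][0-9]*$").unwrap();
    let horizontal_range = Regex::new(r"^\$?[0-9]+:\$?[0-9]+$").unwrap();
    let vertical_range = Regex::new(r"^\$?[A-Z]+:\$?[A-Z]+$").unwrap();

    assert!(cell.is_match("$A$1"));
    assert!(horizontal_range.is_match("$1:$10"));
    assert!(vertical_range.is_match("$A:$B"));
    assert!(!cell.is_match("$A:$B")); // a range, not a single cell
}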
What is in the PR
Implements part 2 of the lexer, from this paper.
Tokens in the table implemented in this PR:
- $? [A-Z]+ $? [1-9][0-9]* (cell reference, e.g. $A$1)
- $? [0-9]+ : $? [0-9]+ (horizontal range, e.g. $1:$10)
- $? [A-Z]+ : $? [A-Z]+ (vertical range, e.g. $A:$B)
How is it tested
For each entry, added unit tests in tests/lexer/test_**.rs.
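For illustration, here is a self-contained sketch of the kind of check such a test exercises, for the first pattern in the table. The is_cell_reference helper is hypothetical, written only so the test is runnable on its own; the PR's actual tests presumably run against its Token type, which isn't shown in this excerpt.

// Hand-rolled recognizer for $? [A-Z]+ $? [1-9][0-9]* (hypothetical helper).
fn is_cell_reference(s: &str) -> bool {
    let mut chars = s.chars().peekable();
    if chars.peek() == Some(&'$') {
        chars.next(); // optional '$' before the column letters
    }
    let mut letters = 0;
    while matches!(chars.peek(), Some(c) if c.is_ascii_uppercase()) {
        chars.next();
        letters += 1;
    }
    if letters == 0 {
        return false; // at least one column letter is required
    }
    if chars.peek() == Some(&'$') {
        chars.next(); // optional '$' before the row number
    }
    match chars.next() {
        Some(c) if ('1'..='9').contains(&c) => {} // row starts with 1-9
        _ => return false,
    }
    chars.all(|c| c.is_ascii_digit()) // any remaining chars must be digits
}

#[cfg(test)]
mod tests {
    use super::is_cell_reference;

    #[test]
    fn recognizes_cell_references() {
        assert!(is_cell_reference("A1"));
        assert!(is_cell_reference("$A$1"));
        assert!(is_cell_reference("AB42"));
        assert!(!is_cell_reference("A0")); // row numbers start at 1
        assert!(!is_cell_reference("$A:$B")); // a range, not a cell
    }
}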