
Conversation

@yarkhinephyo
Member

What is in the PR

Implements part 2 of the lexer, from this paper.

Tokens in the table that are implemented in this PR (see the regex sketch after the list):

  • CELL - cell reference: $? [A-Z]+ $? [1-9][0-9]* (e.g. $A$1)
  • HORIZONTAL-RANGE - range of rows: $? [0-9]+ : $? [0-9]+ (e.g. $1:$10)
  • VERTICAL-RANGE - range of columns: $? [A-Z]+ : $? [A-Z]+ (e.g. $A:$B)
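
To make the patterns concrete, the same three tokens can be written with the regex crate. This is an illustration only; the PR's lexer scans characters by hand rather than using regex:

use regex::Regex;

fn main() {
    // The three token patterns from the table, anchored to whole strings.
    let cell = Regex::new(r"^\$?[A-Z]+\$?[1-9][0-9]*$").unwrap();
    let horizontal_range = Regex::new(r"^\$?[0-9]+:\$?[0-9]+$").unwrap();
    let vertical_range = Regex::new(r"^\$?[A-Z]+:\$?[A-Z]+$").unwrap();

    assert!(cell.is_match("$A$1"));
    assert!(cell.is_match("AB12"));
    assert!(horizontal_range.is_match("$1:$10"));
    assert!(vertical_range.is_match("$A:$B"));
    assert!(!cell.is_match("A0")); // row numbers start at 1: [1-9][0-9]*
}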

How is it tested

For each token, unit tests were added in tests/lexer/test_**.rs and run with:

cargo test
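
For illustration, one such test might look roughly like this; the Lexer::new/tokenize API and the Token::Cell shape are assumptions for the sketch, not necessarily the PR's actual names:

// Hypothetical test shape; the lexer API and token layout are assumed.
#[test]
fn lexes_absolute_cell_reference() {
    let tokens = Lexer::new("$A$1").tokenize().unwrap();
    assert_eq!(tokens, vec![Token::Cell { col: "A".into(), row: 1 }]);
}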

Comment on lines +10 to +20
/// Run `f`; if it returns `None`, restore the saved lexer position so a
/// failed attempt consumes no input.
fn backtrack_if_needed<F, T>(&mut self, f: F) -> Option<T>
where
    F: FnOnce(&mut Self) -> Option<T>,
{
    let saved = self.position;
    let result = f(self);
    if result.is_none() {
        self.position = saved;
    }
    result
}
Member

I feel like this is a parser job, not a lexer job

Member Author
@yarkhinephyo Jan 31, 2026

No, I don't think so. The paper implies these are still lexical tokens (the grammar ends up with less ambiguity that way).

Comment on lines +396 to +420
    // Could be a cell reference like $A$1 or a range like $A:$B or $1:$10.
    // Try cell/vertical range first.
    if let Some(token) = self.try_read_cell_or_vertical_range() {
        Ok(token)
    } else if let Some(token) = self.try_read_horizontal_range() {
        Ok(token)
    } else {
        // Invalid $ usage: report the offending character.
        let c = self.current().unwrap_or('$');
        Err(LexerError::UnexpectedChar(c))
    }
}
Some(c) if c.is_ascii_uppercase() => {
    // Try cell/vertical range first (e.g., A1, A:Z).
    if let Some(token) = self.try_read_cell_or_vertical_range() {
        Ok(token)
    } else {
        // Fall back to an identifier, for the TRUE/FALSE literals.
        let ident = self.read_identifier();
        match ident.to_uppercase().as_str() {
            "TRUE" => Ok(Token::Bool(true)),
            "FALSE" => Ok(Token::Bool(false)),
            _ => Err(LexerError::UnexpectedChar(c)),
        }
    }
Member

Actually yeah I feel like this is a parser job

Member

Or maybe not, but I think the backtracking can be removed with some sort of precedence ordering, see: https://github.com/spreadsheetlab/XLParser/blob/master/src/XLParser/ExcelFormulaGrammar.cs

^The above library references the paper too, and the paper references that library.

Member Author

I think that code is just defining a grammar and letting another library do the lexing and parsing:

var assembly = typeof(ExcelFormulaGrammar).GetTypeInfo().Assembly;

Member Author

I think my naming may be bad? With the "backtrack" function name, I was just implementing something like a context manager in Python, where the position gets reset if there was no match.
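
Concretely, something like this standalone sketch; the toy Lexer and try_read_digits are just illustrations, not the PR's types:

struct Lexer {
    input: Vec<char>,
    position: usize,
}

impl Lexer {
    // Same helper as in the diff above, repeated so this sketch compiles.
    fn backtrack_if_needed<F, T>(&mut self, f: F) -> Option<T>
    where
        F: FnOnce(&mut Self) -> Option<T>,
    {
        let saved = self.position;
        let result = f(self);
        if result.is_none() {
            self.position = saved; // undo any partial consumption
        }
        result
    }

    // Toy matcher: consume one or more ASCII digits, or fail without moving.
    fn try_read_digits(&mut self) -> Option<String> {
        self.backtrack_if_needed(|lx| {
            let start = lx.position;
            while lx.position < lx.input.len() && lx.input[lx.position].is_ascii_digit() {
                lx.position += 1;
            }
            if lx.position == start {
                None // no digits: the wrapper restores `position`
            } else {
                Some(lx.input[start..lx.position].iter().collect())
            }
        })
    }
}

fn main() {
    let mut lx = Lexer { input: "A1".chars().collect(), position: 0 };
    assert_eq!(lx.try_read_digits(), None); // 'A' is not a digit
    assert_eq!(lx.position, 0); // the failed attempt consumed nothing
}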

But yes, the time complexity is higher. In exchange we could get code like the following, which would be a little more maintainable and maybe easier to reason about when setting priorities:

fn parse_reference(&mut self) -> Token {
    if let Some(cell) = self.try_read_cell() {
        return cell;
    }

    if let Some(name) = self.try_read_named_range() {
        return name;
    }

    // <more tokens depending on priorities>
    ...
}

If we are going for pure performance, I can refactor it, though.

Member Author

Not sure what the best way is; I'm leaning towards cleaner code. The last time I did this, it was just writing regexes and passing them to a library for lexing.
