Lexical and Syntax Analysis for a Toy Programming Language
This project implements the first stage of a compiler front-end using GCC version 11.4.0 on Ubuntu 22.04 (via WSL). It includes the design and implementation of:
- A lexical analyzer to tokenize source code efficiently
- A syntax analyzer (predictive parser) that verifies the syntactic structure of input programs using a parse table
- A basic comment remover for pre-processing
- Support utilities including automated computation of FIRST and FOLLOW sets and parse tree generation
- Efficiently loads source code from the file
fpusing a twin buffer mechanism. - Reduces I/O overhead by buffering a fixed-size block of code into memory.
- Maintains the file pointer for sequential access and future reads.
- Reads the buffered character stream to recognize and tokenize lexemes.
- Returns relevant token information encapsulated in a
tokenInfostruct. - Detects and reports lexical errors with line numbers.
- Removes comments from the source file and writes the cleaned output to
cleanFile. - Used once via the driver to showcase functionality.
- Lexer does not rely on the cleaned file—comments are ignored during tokenization.
- Computes the FIRST and FOLLOW sets for the provided grammar.
- Automates the process when possible. Manual entry is supported if required.
- Builds a predictive parse table using computed FIRST and FOLLOW sets.
- Parses the input file using top-down predictive parsing.
- Constructs and returns a parse tree.
- Displays:
- Detailed syntax errors with line numbers.
- Confirmation message:
"Input source code is syntactically correct..........."if no syntax errors are found.
- Prints the parse tree inorder to
outfilein the following format:
lexeme CurrentNode lineno tokenName valueIfNumber parentNodeSymbol isLeafNode NodeSymbol
-------- ------------ ------ ------------ ------------- ---------------- ---------- -----------
id --- 2 ID --- <var> yes ID
---- --- --- --- --- ROOT no <program>
lexerDef.h: Data structures forlexer.clexer.h: Function declarations forlexer.cparserDef.h: Data structures for grammar, parse tree, etc.parser.h: Function declarations forparser.cdriver.c: Drives the flow of the compiler front-end
Ensure you're using GCC 11.4.0 under Ubuntu 22.04 (WSL).
make./stage1exe testfile.txt output.txttestfile.txt: Input source code fileoutput.txt: File where parse tree is printed
- Twin buffer for optimized lexical analysis
- Token data structure with detailed metadata
- FIRST and FOLLOW computation (auto/manual)
- Predictive parsing using a parsing table
- Inorder parse tree generation
- Modular code design following clean compiler architecture
- The
removeComments()function is a utility only for demonstration; the lexer processes the original source file. - This is Stage 1 of a multi-stage compiler pipeline. Future enhancements may include semantic analysis, intermediate code generation, and optimization.
After a successful syntax check:
Input source code is syntactically correct...........
Output parse tree (partial):
read --- 1 READ --- <ioStmt> yes READ
---- --- --- --- --- <stmt> no <ioStmt>
Developed as part of the Curiosity inside me.