A comprehensive lexical analyzer and syntax error detector for C source code files. Lexical-Analyzer performs tokenization and identifies common syntax errors in a single pass, providing detailed error reporting with line numbers.
Lexical-Analyzer reads C source code files and performs two critical functions:
- Lexical Analysis: Breaks down source code into tokens (keywords, operators, identifiers, constants, etc.)
- Error Detection: Identifies syntax errors like unclosed strings, mismatched brackets, missing semicolons, and more
The analyzer handles comments, preprocessor directives, and provides comprehensive error messages to help developers quickly locate and fix issues.
- Complete Tokenization: Recognizes keywords, operators, identifiers, constants, and special characters
- Multi-Error Detection: Identifies 7 types of common syntax errors
- Comment Handling: Properly processes both single-line (`//`) and multi-line (`/* */`) comments across multiple lines
- Preprocessor Support: Skips preprocessor directives during analysis
- Line-by-Line Processing: Memory-efficient streaming approach
- Cross-Platform Line Endings: Handles both CRLF (Windows `\r\n`) and LF (Unix/Linux `\n`) formats by stripping `\r` characters during tokenization
- Detailed Error Reporting: Provides error type, line number, and description
- Single-Pass Analysis: Performs both lexical analysis and error detection in one pass
The analyzer is cross-platform compatible and handles different line ending conventions:
- LF (Line Feed, `\n`): Unix/Linux/macOS standard
- CRLF (Carriage Return + Line Feed, `\r\n`): Windows standard
Implementation Details:
- The `getNextToken()` function in `lexer.c` explicitly skips `\r` characters during whitespace processing
- The tokenizer treats `\r` as whitespace and continues to the next token
- This ensures consistent behavior across different operating systems
- Files created on any platform can be analyzed without conversion
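The `\r`-skipping behavior described above can be pictured with a minimal sketch. The helper name `skip_separators` is illustrative and is not the project's actual `getNextToken()` internals:

```c
#include <string.h>

/* Illustrative sketch: advance past whitespace, treating '\r' like any
 * other separator so CRLF input tokenizes the same as LF input. */
static const char *skip_separators(const char *p) {
    while (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n')
        p++;
    return p;
}
```

Because `\r` is consumed alongside spaces and tabs, a Windows-edited file needs no conversion before analysis.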
Lexical-Analyzer identifies the following syntax errors:
| Error Type | Description | Example | Status |
|---|---|---|---|
| Unclosed String | String literal missing closing quote | `"Hello World` | ✅ Implemented |
| Unclosed Character | Character literal missing closing quote | `'a` | ✅ Implemented |
| Invalid Character Literal | Multiple characters or invalid escape sequences | `'abc'`, `'\z'` | ✅ Implemented |
| Missing Closing Brace | Unmatched opening `{` | `void func() {` | ✅ Implemented |
| Missing Closing Parenthesis | Unmatched opening `(` | `if (x > 5` | ✅ Implemented |
| Mismatched Bracket | Unmatched opening `[` | `int arr[10;` | ✅ Implemented |
| Missing Semicolon | Statement without terminating `;` | `int x = 5` | ✅ Implemented |
From error_detection_rules.txt:
✅ Fully Implemented:
- Unclosed string literals
- Unclosed character literals
- Missing closing brace `}`
- Missing closing parenthesis `)`
- Mismatched brackets `]`
- Missing semicolons `;` (with smart detection for control structures)
- Invalid character literals (too many chars, empty literals)
- Valid escape sequences: `\n`, `\t`, `\r`, `\b`, `\f`, `\a`, `\v`, `\\`, `\'`, `\"`, `\0`
❌ Not Implemented:
- Invalid identifiers (e.g., `2cool`, `$var`, `int#var`)
- Invalid number formats (e.g., `12.12.12`, `0xGHI`)
- Variable/function redeclaration detection
- Type conflict detection
- Keywords used as identifiers (e.g., `int for = 5;`)
- Typos in keywords (e.g., `retrun`, `els`)
- Preprocessor syntax validation
- Invalid escape sequences (explicitly excluded per requirements)
- Unclosed multi-line comments (explicitly excluded per requirements)
Note on Semicolon Detection: The analyzer intelligently handles semicolon requirements:
- Detects missing semicolons after regular statements
- Correctly skips checking for statements followed by braces (e.g., `if`, `while`, `for`, function definitions)
- Handles both same-line and next-line brace patterns
- Skips empty lines and lines with only braces
Keywords: `int`, `float`, `return`, `if`, `else`, `while`, `for`, `do`, `break`, `continue`, `char`, `double`, `void`, `switch`, `case`, `default`, `const`, `static`, `sizeof`, `struct`
Operators:
- Single Character: `+ - * / % = ! < > | & ^`
- Multi-Character: `== != <= >= && || ++ -- += -= *= /= %= &= |= ^= << >>`
Special Characters: `, ; { } ( ) [ ]`
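Multi-character operators are recognized by looking one character ahead and preferring the longer match, so `<=` becomes one token rather than `<` followed by `=`. The sketch below illustrates that idea; the table and function name are assumptions, not the project's exact code:

```c
#include <string.h>

/* Illustrative greedy operator matching: check two-character operators
 * before falling back to the single-character set. */
static const char *two_char_ops[] = {
    "==", "!=", "<=", ">=", "&&", "||", "++", "--", "+=", "-=",
    "*=", "/=", "%=", "&=", "|=", "^=", "<<", ">>", NULL
};

/* Returns the operator length at p: 2, 1, or 0 (no operator). */
static int operator_len(const char *p) {
    for (int i = 0; two_char_ops[i]; i++)
        if (p[0] == two_char_ops[i][0] && p[1] == two_char_ops[i][1])
            return 2;
    if (p[0] != '\0' && strchr("+-*/%=!<>|&^", p[0]) != NULL)
        return 1;
    return 0;
}
```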
String Literals: `"Hello, World!"` (including strings with escape sequences and format specifiers)
Character Literals:
- Single characters: `'a'`, `'Z'`, `'5'`
- Valid escape sequences: `'\n'`, `'\t'`, `'\r'`, `'\b'`, `'\f'`, `'\a'`, `'\v'`, `'\\'`, `'\''`, `'\"'`, `'\0'`
Numbers:
- Integers: `42`, `0`, `123`
- Floating Point: `3.14`, `.99`, `10.5`
- Octal Numbers: `012`, `0777`
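A small classifier sketch for these numeric formats, assuming the rules above (integers, floats with an optional leading dot, and `0`-prefixed octal); hex and binary are deliberately rejected, matching the analyzer's stated scope. The `classify_number` name and enum are illustrative:

```c
#include <ctype.h>

/* Illustrative number classification: NUM_NONE for anything the
 * analyzer would not accept as a numeric constant. */
typedef enum { NUM_NONE, NUM_INT, NUM_FLOAT, NUM_OCTAL } NumKind;

static NumKind classify_number(const char *s) {
    int digits = 0, dot = 0, i = 0;
    if (s[0] == '.') { dot = 1; i = 1; }       /* leading-dot float: .99 */
    for (; s[i]; i++) {
        if (isdigit((unsigned char)s[i])) digits++;
        else if (s[i] == '.' && !dot) dot = 1;
        else return NUM_NONE;                  /* second dot or stray char */
    }
    if (digits == 0) return NUM_NONE;
    if (dot) return NUM_FLOAT;
    if (s[0] == '0' && digits > 1) return NUM_OCTAL;   /* 012, 0777 */
    return NUM_INT;
}
```

A malformed literal like `12.12.12` falls out at the second dot, which is how the analyzer's "invalid number format" gap would be closed.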
Valid identifiers must:
- Start with a letter (A-Z, a-z) or underscore `_`
- Be followed by letters, digits (0-9), or underscores
- Not be a keyword
- Examples: `count`, `_temp`, `value123`, `totalSum`
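The three identifier rules translate directly into code. This is a sketch under the stated rules; `is_valid_identifier` is an illustrative name, not a function from the project:

```c
#include <ctype.h>
#include <string.h>

/* The 20 C keywords the analyzer recognizes. */
static const char *keywords[] = {
    "int", "float", "return", "if", "else", "while", "for", "do",
    "break", "continue", "char", "double", "void", "switch", "case",
    "default", "const", "static", "sizeof", "struct"
};

/* Illustrative identifier check: letter or '_' first, then alphanumerics
 * or '_', and the result must not be a keyword. */
static int is_valid_identifier(const char *s) {
    if (!isalpha((unsigned char)s[0]) && s[0] != '_') return 0;
    for (int i = 1; s[i]; i++)
        if (!isalnum((unsigned char)s[i]) && s[i] != '_') return 0;
    for (size_t k = 0; k < sizeof keywords / sizeof keywords[0]; k++)
        if (strcmp(s, keywords[k]) == 0) return 0;   /* keyword, not id */
    return 1;
}
```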
- Single-line: `// comment text`
- Multi-line: `/* comment text */` (tracked across multiple lines)
Preprocessor directives starting with # are skipped during error checking and tokenization.
From lexical_rules.txt:
Keywords: All 20 C keywords recognized
Operators:
- ✅ Single character operators: `+ - * / % = ! < > | & ^`
- ✅ Two character operators: `==`, `!=`, `<=`, `>=`, `&&`, `||`, `++`, `--`, `+=`, `-=`, `*=`, `/=`, `%=`, `&=`, `|=`, `^=`, `<<`, `>>`
- ❌ Three character operators: None implemented (not required)
Special Characters: ✅ All implemented: `, ; { } ( ) [ ]`
Constants:
- ✅ String literals: `"Hello world"`
- ✅ Strings with escape sequences/format specifiers: `"num is %d\n"`
- ✅ Character literals: `'m'`
- ✅ Integers: `10`
- ✅ Floating point: `10.99`, `.99`
- ✅ Octal numbers: `012`
- ❌ Signed numbers: `-10`, `+10` (treated as operator + number)
- ❌ Hexadecimal: `0xA` (not implemented)
- ❌ Binary: `0b1010`, `0B1010` (not implemented)
Identifiers:
- ✅ Must start with letter or underscore
- ✅ Followed by alphanumeric or underscore
- ✅ Cannot be a keyword
- ✅ No special characters allowed
Preprocessor Directives:
- ✅ Basic detection and skipping: `#include`, `#define`, etc.
- ✅ Handles preprocessor directives with spaces
- ❌ Keyword validation for preprocessor directives (not checked)
Comments:
- ✅ Single-line comments: `//`
- ✅ Multi-line comments: `/* */` (with cross-line tracking)
- Hexadecimal constants (`0xA`, `0xFF`)
- Binary constants (`0b1010`, `0B1010`)
- Signed number literals (negative/positive signs are treated as separate operator tokens)
- Three character operators
- Preprocessor directive keyword validation
- GCC Compiler or any C99-compatible compiler
- Make (optional, for using Makefile)
- Basic command-line knowledge
git clone https://github.com/SLADE0261/Lexical-Analyzer.git
cd Lexical-Analyzer
Option 1: Using GCC directly
gcc main.c lexer.c error_det.c -o lexiscan
Option 2: Using Make (if Makefile provided)
make
Option 3: Manual Compilation
gcc -c main.c -o main.o
gcc -c lexer.c -o lexer.o
gcc -c error_det.c -o error_det.o
gcc main.o lexer.o error_det.o -o lexiscan
Usage: `./lexiscan <filename.c>`
Analyze a C source file:
./lexiscan program.c
Sample Output (No Errors):
KEYWORD : int
IDENTIFIER : main
SPECIAL_CHARACTER : (
SPECIAL_CHARACTER : )
SPECIAL_CHARACTER : {
KEYWORD : return
CONSTANT : 0
SPECIAL_CHARACTER : ;
SPECIAL_CHARACTER : }
Sample Output (With Error):
ERROR at line 5: Unclosed string literal
Lexical-Analyzer/
├── main.c # Main program entry point and orchestration
├── lexer.c # Lexical analysis implementation
├── lexer.h # Lexer function declarations and token types
├── error_det.c # Error detection implementation
├── error_det.h # Error detection function declarations
└── README.md # This file
- Coordinates the analysis process
- Reads input file line by line
- Integrates error detection and tokenization
- Outputs results to console
- Performs final validation (e.g., unmatched braces at EOF)
- Implements token recognition logic
- Handles multi-character operators (checks two-character combinations)
- Processes string and character literals
- Skips single-line (`//`) and multi-line (`/* */`) comments during tokenization
- Handles CRLF and LF line endings by treating `\r` as whitespace
- Categorizes tokens into appropriate types
- Maintains line number tracking for error reporting
- Tracks bracket/brace/parenthesis matching with counters
- Validates string and character literals
- Checks for missing semicolons with intelligent control structure detection
- Handles multi-line comment tracking across lines
- Removes comments for clean analysis
- Validates character literal content (length, escape sequences)
- Skips preprocessor directives during error checking
#include <stdio.h>
int main() {
int x = 10;
printf("Value: %d\n", x);
return 0;
}
Run `./lexiscan sample.c`:
KEYWORD : int
IDENTIFIER : main
SPECIAL_CHARACTER : (
SPECIAL_CHARACTER : )
SPECIAL_CHARACTER : {
KEYWORD : int
IDENTIFIER : x
OPERATOR : =
CONSTANT : 10
SPECIAL_CHARACTER : ;
IDENTIFIER : printf
SPECIAL_CHARACTER : (
CONSTANT : "Value: %d\n"
SPECIAL_CHARACTER : ,
IDENTIFIER : x
SPECIAL_CHARACTER : )
SPECIAL_CHARACTER : ;
KEYWORD : return
CONSTANT : 0
SPECIAL_CHARACTER : ;
SPECIAL_CHARACTER : }
int main() {
printf("Hello World; // Missing closing quote
return 0;
}
Output: ERROR at line 2: Unclosed string literal
int main() {
int x = 5 // Missing semicolon
return 0;
}
Output: ERROR at line 2: Missing semicolon ';' at end of statement
int main() {
if (x > 0) {
printf("Positive");
// Missing closing brace for main
}
Output: ERROR at end of file: Missing 1 closing brace(s) '}'
int main() {
char c = 'ab'; // Too many characters
return 0;
}
Output: ERROR at line 2: Invalid character literal - too many characters
int main() {
if (x > 0) { // No error - brace follows
printf("Positive");
}
while (y < 10) // No error - brace on next line
{
y++;
}
int z = 5 // Error - missing semicolon
}
Output: ERROR at line 10: Missing semicolon ';' at end of statement
Single-line comments (//):
- Detected when `//` is encountered outside strings/character literals
- Entire remaining line is ignored from the `//` onwards
- Comments within strings are preserved as part of the string
Multi-line comments (/* */):
- Tracked with a static `in_multiline_comment` flag
- State persists across multiple lines
- Properly handles `/*` sequences appearing inside an already-open comment block (C comments do not nest)
- Opening `/*` sets the flag, closing `*/` clears it
- All content between markers is ignored during tokenization
- Comments within strings are preserved
Implementation in code:
- `removeComments()` function in `error_det.c` strips comments while preserving string/char content
- `skipCommentsAndPreprocessors()` function in `lexer.c` handles comment detection during tokenization
- Both functions respect string and character literal boundaries
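The cross-line state tracking can be sketched as follows. This is a simplified illustration in the spirit of `removeComments()`, with assumed names, and it omits the string-literal protection the real function performs:

```c
#include <string.h>

/* Persists across calls, like the analyzer's static flag. */
static int in_comment = 0;

/* Copy line into out with comments removed; a /* that opens on one
 * line stays open (via in_comment) until a later line closes it. */
static void strip_comments(const char *line, char *out) {
    size_t j = 0;
    for (size_t i = 0; line[i]; i++) {
        if (in_comment) {
            if (line[i] == '*' && line[i + 1] == '/') { in_comment = 0; i++; }
        } else if (line[i] == '/' && line[i + 1] == '*') {
            in_comment = 1; i++;
        } else if (line[i] == '/' && line[i + 1] == '/') {
            break;                      /* rest of line is a comment */
        } else {
            out[j++] = line[i];
        }
    }
    out[j] = '\0';
}
```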
- Lines starting with `#` (after trimming whitespace) are identified as preprocessor directives
- Preprocessor lines are completely skipped during error checking
- Includes, defines, and macros are not analyzed for syntax errors
- The tokenizer skips the entire line when `#` is encountered
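The trim-then-check rule amounts to a few lines; `is_preprocessor_line` is an illustrative helper, not the project's actual function:

```c
#include <ctype.h>

/* Illustrative check: a line is a preprocessor directive if its first
 * non-whitespace character is '#'. */
static int is_preprocessor_line(const char *line) {
    while (isspace((unsigned char)*line))
        line++;
    return *line == '#';
}
```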
Valid escape sequences in character literals:
- `\n` - Newline
- `\t` - Tab (horizontal)
- `\r` - Carriage return
- `\b` - Backspace
- `\f` - Form feed
- `\a` - Alert (bell)
- `\v` - Vertical tab
- `\\` - Backslash
- `\'` - Single quote
- `\"` - Double quote
- `\0` - Null character
Validation: The validateCharLiteral() function checks:
- Character literal has proper quote wrapping
- Single characters are always valid
- Two-character sequences starting with
\must be valid escape sequences - Anything else is flagged as an error
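Those four rules can be condensed into a length check plus an escape-table lookup. A sketch in the spirit of `validateCharLiteral()`, with an assumed signature (`lit` is the full literal including quotes):

```c
#include <string.h>

/* Illustrative validation: 1 if the literal is acceptable, 0 otherwise. */
static int char_literal_ok(const char *lit) {
    size_t n = strlen(lit);
    if (n < 3 || lit[0] != '\'' || lit[n - 1] != '\'')
        return 0;                                  /* bad quote wrapping */
    if (n == 3)
        return 1;                                  /* single character */
    if (n == 4 && lit[1] == '\\')                  /* escape sequence */
        return strchr("ntrbfav\\'\"0", lit[2]) != NULL;
    return 0;                                      /* too many characters */
}
```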
Implementation:
- Maintains three separate counters: `brace_count`, `paren_count`, `bracket_count`
- Counters are static variables persisting across function calls
- Only counts brackets/braces/parentheses outside strings and character literals
Logic:
- Opening bracket (`{`, `(`, `[`) → Increment respective counter
- Closing bracket (`}`, `)`, `]`) → Decrement respective counter
- If a counter goes negative → Extra closing bracket detected (immediate error)
- At end of file, if a counter is > 0 → Missing closing brackets (reported by `checkFinalBraceCount()`)
Per-line tracking:
- Also maintains `line_brace_count`, `line_paren_count`, `line_bracket_count`
- Used to help determine if a line should require a semicolon
The analyzer uses intelligent semicolon detection:
- Skip if line has braces: Lines that open a brace block don't need semicolons
- Check next line: Uses `isStatementWithBrace()` to peek ahead for opening braces
- Control structure detection: Statements ending with `)` followed by `{` don't need semicolons
- Content validation: Empty lines or lines with only braces are skipped
- Preprocessor exemption: Lines starting with `#` are ignored
Algorithm flow:
IF line has content AND no semicolon AND no brace opened:
IF NOT (line ends with ')' and next line has '{'):
IF line has actual code (not just whitespace/braces):
ERROR: Missing semicolon
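The algorithm flow above can be expressed as a line-level check. All names here (`needs_semicolon`, `has_code`, `last_nonspace`) are illustrative, not the analyzer's actual functions:

```c
#include <ctype.h>
#include <string.h>

/* 1 if the line contains something other than whitespace and braces. */
static int has_code(const char *s) {
    for (; *s; s++)
        if (!isspace((unsigned char)*s) && *s != '{' && *s != '}')
            return 1;
    return 0;
}

/* Last non-whitespace character of the line, or '\0' if none. */
static char last_nonspace(const char *s) {
    char c = '\0';
    for (; *s; s++)
        if (!isspace((unsigned char)*s)) c = *s;
    return c;
}

/* A line needs ';' only if it carries real code, opens no brace block,
 * and is not a control header whose '{' sits on the next line. */
static int needs_semicolon(const char *line, const char *next_line) {
    if (!has_code(line) || line[0] == '#') return 0;
    if (strchr(line, ';') || strchr(line, '{')) return 0;
    if (last_nonspace(line) == ')' && next_line && strchr(next_line, '{'))
        return 0;
    return 1;
}
```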
- Single File Analysis: Analyzes one file at a time
- No Semantic Analysis: Only checks syntax, not logic or type correctness
- C Source Only: Optimized for C source files
- Limited Preprocessor Support: Skips preprocessor directives entirely without validating syntax
- No Type Checking: Does not validate variable types or declarations
- No Hexadecimal/Binary: Does not recognize hex (`0x`) or binary (`0b`) number formats
- No Signed Literals: Treats `-10` as two tokens (operator `-` and constant `10`)
- No Redeclaration Detection: Does not track variable/function declarations
- No Keyword Typo Detection: Does not suggest corrections for misspelled keywords
- No Scope Analysis: Does not validate identifier usage or scope rules
| Issue | Possible Cause | Solution |
|---|---|---|
| Compilation error | Missing source files | Ensure all .c and .h files are in directory |
| "Could not open file" | Wrong file path | Check file path and permissions |
| No output | Empty file | Verify the input file has content |
| False semicolon errors | Complex control flow | Check for proper brace placement after control statements |
| Undetected errors | Error type not implemented | Refer to "Error Detection Implementation Status" section |
Planned Features:
- Hexadecimal number support (`0xABC`)
- Binary number support (`0b1010`)
- Symbol table generation
- Syntax tree construction
- Semantic analysis support
- Multiple file analysis
- Invalid identifier detection
- Keyword typo suggestions
- Variable redeclaration detection
- Type conflict detection
- Configuration file for custom keywords/operators
- JSON/XML output format option
- Integration with IDE plugins
- Warning levels (errors vs warnings)
- Code metrics (lines of code, complexity)
- Preprocessor macro expansion
Contributions are welcome! Please follow these guidelines:
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
- Follow K&R C coding style
- Use meaningful variable names
- Comment complex logic
- Keep functions focused and concise
Create test files in a tests/ directory:
# Test successful analysis
./lexiscan tests/valid_program.c
# Test error detection
./lexiscan tests/unclosed_string.c
./lexiscan tests/missing_semicolon.c
./lexiscan tests/mismatched_braces.c
./lexiscan tests/invalid_char_literal.c
Valid Code Tests:
- Programs with all token types
- Complex expressions with operators
- Multi-line comments
- Mixed line endings (CRLF and LF)
- Control structures with various brace styles
Error Detection Tests:
- Each error type individually
- Multiple errors in one file
- Edge cases (empty files, comment-only files)
- Nested structures with mismatched brackets
This project is open source and available for educational purposes.
Krishnakant C. Pore
- Compiler Design principles and techniques
- Dragon Book (Compilers: Principles, Techniques, and Tools)
- C programming community
- Open source lexer implementations for reference
GitHub: https://github.com/SLADE0261/Lexical-Analyzer
★ If you find this project useful, please consider giving it a star on GitHub!