Lexical-Analyzer

A lexical analyzer and syntax error detector for C source files, written in C. Lexical-Analyzer tokenizes the input and identifies common syntax errors in a single pass, reporting each error with its line number.

Overview

Lexical-Analyzer reads C source code files and performs two critical functions:

Lexical Analysis: Breaks down source code into tokens (keywords, operators, identifiers, constants, etc.)
Error Detection: Identifies syntax errors like unclosed strings, mismatched brackets, missing semicolons, and more

The analyzer handles comments, preprocessor directives, and provides comprehensive error messages to help developers quickly locate and fix issues.

Features

Complete Tokenization: Recognizes keywords, operators, identifiers, constants, and special characters
Multi-Error Detection: Identifies 7 types of common syntax errors
Comment Handling: Properly processes both single-line (//) and multi-line (/* */) comments across multiple lines
Preprocessor Support: Skips preprocessor directives during analysis
Line-by-Line Processing: Memory-efficient streaming approach
Cross-Platform Line Endings: Handles both CRLF (Windows \r\n) and LF (Unix/Linux \n) formats by stripping \r characters during tokenization
Detailed Error Reporting: Provides error type, line number, and description
Single-Pass Analysis: Performs both lexical analysis and error detection in one pass

Line Ending Handling

The analyzer is cross-platform compatible and handles different line ending conventions:

LF (Line Feed - \n): Unix/Linux/macOS standard
CRLF (Carriage Return + Line Feed - \r\n): Windows standard

Implementation Details:

The getNextToken() function in lexer.c explicitly skips \r characters during whitespace processing
The tokenizer treats \r as whitespace and continues to the next token
This ensures consistent behavior across different operating systems
Files created on any platform can be analyzed without conversion
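
The whitespace handling described above can be sketched as follows. This is a minimal illustration, not the project's code: the helper name skip_whitespace and its signature are assumptions, and the real getNextToken() in lexer.c may be organized differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from lexer.c): returns a pointer to the first
 * character of `p` that is not a space, tab, newline, or carriage return.
 * Treating '\r' like any other whitespace makes CRLF and LF input
 * tokenize identically. */
const char *skip_whitespace(const char *p)
{
    assert(p != NULL);
    while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r') {
        p++;
    }
    return p;
}
```

Called on "\r\n  int", the helper returns a pointer to the 'i' of int, which is the same result it gives for "\n  int".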

Detected Errors

Lexical-Analyzer identifies the following syntax errors:

| Error Type | Description | Example | Status |
| --- | --- | --- | --- |
| Unclosed String | String literal missing closing quote | `"Hello World` | ✅ Implemented |
| Unclosed Character | Character literal missing closing quote | `'a` | ✅ Implemented |
| Invalid Character Literal | Multiple characters or invalid escape sequences | `'abc'`, `'\z'` | ✅ Implemented |
| Missing Closing Brace | Unmatched opening `{` | `void func() {` | ✅ Implemented |
| Missing Closing Parenthesis | Unmatched opening `(` | `if (x > 5` | ✅ Implemented |
| Mismatched Bracket | Unmatched opening `[` | `int arr[10;` | ✅ Implemented |
| Missing Semicolon | Statement without terminating `;` | `int x = 5` | ✅ Implemented |

Error Detection Implementation Status

From error_detection_rules.txt:

✅ Fully Implemented:

Unclosed string literals
Unclosed character literals
Missing closing brace }
Missing closing parenthesis )
Mismatched brackets ]
Missing semicolons ; (with smart detection for control structures)
Invalid character literals (too many chars, empty literals)
Valid escape sequences: \n, \t, \r, \b, \f, \a, \v, \\, \', \", \0

❌ Not Implemented:

Invalid identifiers (e.g., 2cool, $var, int#var)
Invalid number formats (e.g., 12.12.12, 0xGHI)
Variable/function redeclaration detection
Type conflict detection
Keywords used as identifiers (e.g., int for = 5;)
Typos in keywords (e.g., retrun, els)
Preprocessor syntax validation
Invalid escape sequences outside character literals, such as in string literals (explicitly excluded per requirements)
Unclosed multi-line comments (explicitly excluded per requirements)

Note on Semicolon Detection: The analyzer intelligently handles semicolon requirements:

Detects missing semicolons after regular statements
Correctly skips checking for statements followed by braces (e.g., if, while, for, function definitions)
Handles both same-line and next-line brace patterns
Skips empty lines and lines with only braces

Token Types

Keywords (20)

int, float, return, if, else, while, for, do, break, continue,
char, double, void, switch, case, default, const, static, sizeof, struct

Operators

Single Character: + - * / % = ! < > | & ^

Multi-Character: == != <= >= && || ++ -- += -= *= /= %= &= |= ^= << >>
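
One way to recognize these operators is to try the two-character forms first and fall back to the single-character set. The sketch below assumes that longest-match strategy; operator_length is a hypothetical name, not a function from lexer.c.

```c
#include <assert.h>
#include <string.h>

/* The two-character operators listed above. */
static const char *const TWO_CHAR_OPS[] = {
    "==", "!=", "<=", ">=", "&&", "||", "++", "--", "+=",
    "-=", "*=", "/=", "%=", "&=", "|=", "^=", "<<", ">>"
};

/* Hypothetical helper: returns the length (2, 1, or 0) of the operator
 * that starts at `p`, preferring the longer match. */
int operator_length(const char *p)
{
    assert(p != NULL);
    if (p[0] == '\0') {
        return 0;
    }
    if (p[1] != '\0') {
        size_t n = sizeof TWO_CHAR_OPS / sizeof TWO_CHAR_OPS[0];
        for (size_t i = 0; i < n; i++) {
            if (p[0] == TWO_CHAR_OPS[i][0] && p[1] == TWO_CHAR_OPS[i][1]) {
                return 2;
            }
        }
    }
    /* Fall back to the single-character operator set. */
    return strchr("+-*/%=!<>|&^", p[0]) != NULL ? 1 : 0;
}
```

Because only two-character combinations are checked, an input such as >>= comes out as >> followed by =.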

Special Characters

, ; { } ( ) [ ]

Constants

String Literals: "Hello, World!" (including strings with escape sequences and format specifiers)

Character Literals:

Single characters: 'a', 'Z', '5'
Valid escape sequences: '\n', '\t', '\r', '\b', '\f', '\a', '\v', '\\', '\'', '\"', '\0'

Numbers:

Integers: 42, 0, 123
Floating Point: 3.14, .99, 10.5
Octal Numbers: 012, 0777

Identifiers

Valid identifiers must:

Start with letter (A-Z, a-z) or underscore _
Be followed by letters, digits (0-9), or underscores
Not be a keyword
Examples: count, _temp, value123, totalSum
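
These rules translate directly into a short check. The sketch below uses the 20 keywords listed earlier; is_keyword and is_valid_identifier are hypothetical names chosen for illustration and may not match the functions in lexer.c.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

static const char *const KEYWORDS[] = {
    "int", "float", "return", "if", "else", "while", "for", "do", "break",
    "continue", "char", "double", "void", "switch", "case", "default",
    "const", "static", "sizeof", "struct"
};

/* Returns 1 if `word` is one of the supported keywords. */
int is_keyword(const char *word)
{
    assert(word != NULL);
    size_t n = sizeof KEYWORDS / sizeof KEYWORDS[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(word, KEYWORDS[i]) == 0) {
            return 1;
        }
    }
    return 0;
}

/* Returns 1 if `word` starts with a letter or underscore, continues with
 * letters, digits, or underscores, and is not a keyword. */
int is_valid_identifier(const char *word)
{
    assert(word != NULL);
    if (!(isalpha((unsigned char)word[0]) || word[0] == '_')) {
        return 0;
    }
    for (const char *p = word + 1; *p != '\0'; p++) {
        if (!(isalnum((unsigned char)*p) || *p == '_')) {
            return 0;
        }
    }
    return !is_keyword(word);
}
```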

Comments

Single-line: // comment text
Multi-line: /* comment text */ (tracked across multiple lines)

Preprocessor Directives

Preprocessor directives starting with # are skipped during error checking and tokenization.

Lexical Rules Implementation Status

From lexical_rules.txt:

✅ Fully Implemented

Keywords: All 20 supported keywords recognized (a subset of the full C keyword set; see Token Types)

Operators:

✅ Single character operators: + - * / % = ! < > | & ^
✅ Two character operators: ==, !=, <=, >=, &&, ||, ++, --, +=, -=, *=, /=, %=, &=, |=, ^=, <<, >>
❌ Three character operators: None implemented (not required)

Special Characters: ✅ All implemented: , ; { } ( ) [ ]

Constants:

✅ String literals: "Hello world"
✅ Strings with escape sequences/format specifiers: "num is %d\n"
✅ Character literals: 'm'
✅ Integers: 10
✅ Floating point: 10.99, .99
✅ Octal numbers: 012
❌ Signed numbers: -10, +10 (tokenized as a sign operator followed by an unsigned number)
❌ Hexadecimal: 0xA (not implemented)
❌ Binary: 0b1010, 0B1010 (not implemented)

Identifiers:

✅ Must start with letter or underscore
✅ Followed by alphanumeric or underscore
✅ Cannot be a keyword
✅ No special characters allowed

Preprocessor Directives:

✅ Basic detection and skipping: #include, #define, etc.
✅ Handles preprocessor directives with spaces
❌ Keyword validation for preprocessor directives (not checked)

Comments:

✅ Single-line comments: //
✅ Multi-line comments: /* */ (with cross-line tracking)

❌ Not Implemented

Hexadecimal constants (0xA, 0xFF)
Binary constants (0b1010, 0B1010)
Signed number literals (negative/positive signs are treated as separate operator tokens)
Three character operators
Preprocessor directive keyword validation

Prerequisites

GCC Compiler or any C99-compatible compiler
Make (optional, for using Makefile)
Basic command-line knowledge

Installation

Clone the Repository

git clone https://github.com/SLADE0261/Lexical-Analyzer.git
cd Lexical-Analyzer

Compile the Project

Option 1: Using GCC directly

gcc main.c lexer.c error_det.c -o lexiscan

Option 2: Using Make (if Makefile provided)

make

Option 3: Manual Compilation

gcc -c main.c -o main.o
gcc -c lexer.c -o lexer.o
gcc -c error_det.c -o error_det.o
gcc main.o lexer.o error_det.o -o lexiscan

Usage

Basic Syntax

./lexiscan <filename.c>

Example Usage

Analyze a C source file:

./lexiscan program.c

Sample Output (No Errors):

KEYWORD              : int
IDENTIFIER           : main
SPECIAL_CHARACTER    : (
SPECIAL_CHARACTER    : )
SPECIAL_CHARACTER    : {
KEYWORD              : return
CONSTANT             : 0
SPECIAL_CHARACTER    : ;
SPECIAL_CHARACTER    : }

Sample Output (With Error):

ERROR at line 5: Unclosed string literal
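
Both output shapes above can be produced with ordinary printf-style formatting. The sketch below is an assumption inferred from the column alignment of the sample output, not taken from the source; format_token and format_error are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical formatter: pads the token type to 20 columns so the
 * colons line up, e.g. "KEYWORD              : int". */
int format_token(char *out, size_t out_size, const char *type, const char *lexeme)
{
    assert(out != NULL && type != NULL && lexeme != NULL);
    return snprintf(out, out_size, "%-20s : %s", type, lexeme);
}

/* Hypothetical formatter for a per-line error message. */
int format_error(char *out, size_t out_size, int line_no, const char *message)
{
    assert(out != NULL && message != NULL);
    return snprintf(out, out_size, "ERROR at line %d: %s", line_no, message);
}
```

Both helpers return the value of snprintf, so a caller can detect truncation by comparing the result against the buffer size.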

Project Structure

Lexical-Analyzer/
├── main.c              # Main program entry point and orchestration
├── lexer.c             # Lexical analysis implementation
├── lexer.h             # Lexer function declarations and token types
├── error_det.c         # Error detection implementation
├── error_det.h         # Error detection function declarations
└── README.md           # This file

Detailed Component Description

main.c

Coordinates the analysis process
Reads input file line by line
Integrates error detection and tokenization
Outputs results to console
Performs final validation (e.g., unmatched braces at EOF)

lexer.c / lexer.h

Implements token recognition logic
Handles multi-character operators (checks two-character combinations)
Processes string and character literals
Skips single-line (//) and multi-line (/* */) comments during tokenization
Handles CRLF and LF line endings by treating \r as whitespace
Categorizes tokens into appropriate types
Maintains line number tracking for error reporting

error_det.c / error_det.h

Tracks bracket/brace/parenthesis matching with counters
Validates string and character literals
Checks for missing semicolons with intelligent control structure detection
Handles multi-line comment tracking across lines
Removes comments for clean analysis
Validates character literal content (length, escape sequences)
Skips preprocessor directives during error checking

Example Workflow

Test File: `sample.c`

#include <stdio.h>

int main() {
    int x = 10;
    printf("Value: %d\n", x);
    return 0;
}

Running Analysis

./lexiscan sample.c

Expected Output

KEYWORD              : int
IDENTIFIER           : main
SPECIAL_CHARACTER    : (
SPECIAL_CHARACTER    : )
SPECIAL_CHARACTER    : {
KEYWORD              : int
IDENTIFIER           : x
OPERATOR             : =
CONSTANT             : 10
SPECIAL_CHARACTER    : ;
IDENTIFIER           : printf
SPECIAL_CHARACTER    : (
CONSTANT             : "Value: %d\n"
SPECIAL_CHARACTER    : ,
IDENTIFIER           : x
SPECIAL_CHARACTER    : )
SPECIAL_CHARACTER    : ;
KEYWORD              : return
CONSTANT             : 0
SPECIAL_CHARACTER    : ;
SPECIAL_CHARACTER    : }

Error Examples

Example 1: Unclosed String

int main() {
    printf("Hello World;  // Missing closing quote
    return 0;
}

Output: ERROR at line 2: Unclosed string literal

Example 2: Missing Semicolon

int main() {
    int x = 5  // Missing semicolon
    return 0;
}

Output: ERROR at line 2: Missing semicolon ';' at end of statement

Example 3: Mismatched Braces

int main() {
    if (x > 0) {
        printf("Positive");
    // Missing closing brace for main
}

Output: ERROR at end of file: Missing 1 closing brace(s) '}'

Example 4: Invalid Character Literal

int main() {
    char c = 'ab';  // Too many characters
    return 0;
}

Output: ERROR at line 2: Invalid character literal - too many characters

Example 5: Smart Semicolon Detection

int main() {
    if (x > 0) {  // No error - brace follows
        printf("Positive");
    }
    
    while (y < 10)  // No error - brace on next line
    {
        y++;
    }
    
    int z = 5  // Error - missing semicolon
}

Output: ERROR at line 11: Missing semicolon ';' at end of statement

Technical Implementation Details

Comment Handling

Single-line comments (//):

Detected when // is encountered outside strings/character literals
Entire remaining line is ignored from the // onwards
Comments within strings are preserved as part of the string

Multi-line comments (/* */):

Tracked with a static in_multiline_comment flag
State persists across multiple lines
Comments do not nest, as in standard C: a /* inside an open comment is ignored, and the first */ closes the block
Opening /* sets the flag, closing */ clears it
All content between markers is ignored during tokenization
Comments within strings are preserved

Implementation in code:

removeComments() function in error_det.c strips comments while preserving string/char content
skipCommentsAndPreprocessors() function in lexer.c handles comment detection during tokenization
Both functions respect string and character literal boundaries
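
A comment stripper that respects literal boundaries can be sketched as below. This is not the project's removeComments(): strip_comments and its parameters are hypothetical, and the multi-line state is passed in through a pointer rather than kept in a static flag, so the function is easier to test in isolation.

```c
#include <assert.h>
#include <string.h>

/* Copies `line` into `out` without comment text. `*in_block` carries the
 * "inside a multi-line comment" state from one line to the next. */
void strip_comments(const char *line, char *out, size_t out_size, int *in_block)
{
    assert(line != NULL && out != NULL && in_block != NULL && out_size > 0);
    size_t n = 0;
    char quote = '\0';              /* '"' or '\'' while inside a literal */

    for (size_t i = 0; line[i] != '\0' && n + 1 < out_size; i++) {
        char c = line[i];
        if (*in_block) {
            if (strncmp(line + i, "*/", 2) == 0) {
                *in_block = 0;
                i++;                /* skip the '/' of the closing marker */
            }
            continue;               /* comment text is dropped */
        }
        if (quote != '\0') {
            out[n++] = c;
            if (c == '\\' && line[i + 1] != '\0' && n + 1 < out_size) {
                out[n++] = line[++i];   /* keep the escaped character */
            } else if (c == quote) {
                quote = '\0';
            }
            continue;
        }
        if (strncmp(line + i, "//", 2) == 0) {
            break;                  /* rest of the line is a comment */
        }
        if (strncmp(line + i, "/*", 2) == 0) {
            *in_block = 1;
            i++;                    /* skip the '*' of the opening marker */
            continue;
        }
        if (c == '"' || c == '\'') {
            quote = c;
        }
        out[n++] = c;
    }
    out[n] = '\0';
}
```

The caller initializes the state variable to 0 before the first line and passes the same variable for every subsequent line of the file.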

Preprocessor Handling

Lines starting with # (after trimming whitespace) are identified as preprocessor directives
Preprocessor lines are completely skipped during error checking
Includes, defines, and macros are not analyzed for syntax errors
The tokenizer skips the entire line when # is encountered

Escape Sequence Support

Valid escape sequences in character literals:

\n - Newline
\t - Tab (horizontal)
\r - Carriage return
\b - Backspace
\f - Form feed
\a - Alert (bell)
\v - Vertical tab
\\ - Backslash
\' - Single quote
\" - Double quote
\0 - Null character

Validation: The validateCharLiteral() function checks:

Character literal has proper quote wrapping
Single characters are always valid
Two-character sequences starting with \ must be valid escape sequences
Anything else is flagged as an error
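
The checks listed above can be sketched as follows. This is an illustrative stand-in, not the project's validateCharLiteral(): the name is_valid_char_literal and the convention of passing the whole literal, quotes included, are assumptions. The sketch also rejects a lone backslash or quote between the quotes, which is slightly stricter than "single characters are always valid".

```c
#include <assert.h>
#include <string.h>

/* `lit` is the whole literal including both quotes, e.g. "'a'" or "'\\n'".
 * Returns 1 if it is a valid character literal, 0 otherwise. */
int is_valid_char_literal(const char *lit)
{
    assert(lit != NULL);
    size_t len = strlen(lit);

    /* Must be wrapped in quotes and must not be empty (''). */
    if (len < 3 || lit[0] != '\'' || lit[len - 1] != '\'') {
        return 0;
    }
    /* One character between the quotes. */
    if (len == 3) {
        return lit[1] != '\\' && lit[1] != '\'';
    }
    /* Backslash plus one character: must be a supported escape. */
    if (len == 4 && lit[1] == '\\') {
        return strchr("ntrbfav\\'\"0", lit[2]) != NULL;
    }
    return 0;   /* too many characters */
}
```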

Bracket Matching Algorithm

Implementation:

Maintains three separate counters: brace_count, paren_count, bracket_count
Counters are static variables persisting across function calls
Only counts brackets/braces/parentheses outside strings and character literals

Logic:

Opening bracket ({, (, [) → Increment respective counter
Closing bracket (}, ), ]) → Decrement respective counter
If counter goes negative → Extra closing bracket detected (immediate error)
At end of file, if counter > 0 → Missing closing brackets (reported by checkFinalBraceCount())

Per-line tracking:

Also maintains line_brace_count, line_paren_count, line_bracket_count
Used to help determine if a line should require a semicolon

Semicolon Detection Logic

The analyzer uses intelligent semicolon detection:

Skip if line has braces: Lines that open a brace block don't need semicolons
Check next line: Uses isStatementWithBrace() to peek ahead for opening braces
Control structure detection: Statements ending with ) followed by { don't need semicolons
Content validation: Empty lines or lines with only braces are skipped
Preprocessor exemption: Lines starting with # are ignored

Algorithm flow:

IF line has content AND no semicolon AND no brace opened:
    IF NOT (line ends with ')' and next line has '{'):
        IF line has actual code (not just whitespace/braces):
            ERROR: Missing semicolon
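
The flow above can be turned into a small C function. The sketch below is a simplified illustration, not the project's implementation: is_missing_semicolon and its helpers are hypothetical names, the line is assumed to be comment-free, and the peek at the next line stands in for isStatementWithBrace().

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* First non-whitespace character of `s`, or '\0' if there is none. */
char first_significant_char(const char *s)
{
    while (isspace((unsigned char)*s)) {
        s++;
    }
    return *s;
}

/* Last non-whitespace character of `s`, or '\0' if there is none. */
char last_significant_char(const char *s)
{
    size_t len = strlen(s);
    while (len > 0 && isspace((unsigned char)s[len - 1])) {
        len--;
    }
    return len > 0 ? s[len - 1] : '\0';
}

/* Returns 1 if `line` looks like a statement that lacks its semicolon.
 * `next_line` may be NULL when there is no following line. */
int is_missing_semicolon(const char *line, const char *next_line)
{
    assert(line != NULL);
    char first = first_significant_char(line);
    char last = last_significant_char(line);

    if (first == '\0' || first == '#') {
        return 0;                       /* blank line or preprocessor */
    }
    if (strspn(line, " \t\r\n{}") == strlen(line)) {
        return 0;                       /* only braces and whitespace */
    }
    if (last == ';' || strchr(line, '{') != NULL) {
        return 0;                       /* terminated, or opens a block */
    }
    if (last == ')' && next_line != NULL &&
        first_significant_char(next_line) == '{') {
        return 0;                       /* brace on the next line */
    }
    return 1;
}
```

As a simplification, this sketch would also flag a brace-less control statement such as an if whose body sits on the following line.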

Limitations

Single File Analysis: Analyzes one file at a time
No Semantic Analysis: Only checks syntax, not logic or type correctness
C Source Only: Optimized for C source files
Limited Preprocessor Support: Skips preprocessor directives entirely without validating syntax
No Type Checking: Does not validate variable types or declarations
No Hexadecimal/Binary: Does not recognize hex (0x) or binary (0b) number formats
No Signed Literals: Treats -10 as two tokens (operator - and constant 10)
No Redeclaration Detection: Does not track variable/function declarations
No Keyword Typo Detection: Does not suggest corrections for misspelled keywords
No Scope Analysis: Does not validate identifier usage or scope rules

Common Issues and Solutions

| Issue | Possible Cause | Solution |
| --- | --- | --- |
| Compilation error | Missing source files | Ensure all .c and .h files are in the same directory |
| "Could not open file" | Wrong file path | Check the file path and permissions |
| No output | Empty file | Verify that the input file has content |
| False semicolon errors | Complex control flow | Check for proper brace placement after control statements |
| Undetected errors | Error type not implemented | Refer to the "Error Detection Implementation Status" section |

Future Enhancements

Planned Features:

Contributing

Contributions are welcome! Please follow these guidelines:

Fork the repository
Create a feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

Code Style

Follow K&R C coding style
Use meaningful variable names
Comment complex logic
Keep functions focused and concise

Testing

Create test files in a tests/ directory:

# Test successful analysis
./lexiscan tests/valid_program.c

# Test error detection
./lexiscan tests/unclosed_string.c
./lexiscan tests/missing_semicolon.c
./lexiscan tests/mismatched_braces.c
./lexiscan tests/invalid_char_literal.c

Suggested Test Cases

Valid Code Tests:

Programs with all token types
Complex expressions with operators
Multi-line comments
Mixed line endings (CRLF and LF)
Control structures with various brace styles

Error Detection Tests:

Each error type individually
Multiple errors in one file
Edge cases (empty files, comment-only files)
Nested structures with mismatched brackets

License

This project is open source and available for educational purposes.

Author

Krishnakant C. Pore

Acknowledgments

Compiler Design principles and techniques
Dragon Book (Compilers: Principles, Techniques, and Tools)
C programming community
Open source lexer implementations for reference

Repository

GitHub: https://github.com/SLADE0261/Lexical-Analyzer

★ If you find this project useful, please consider giving it a star on GitHub!
