Lexical-Analyzer

A comprehensive C lexical analyzer and syntax error detector for C source code files. Lexical-Analyzer performs tokenization and identifies common syntax errors in a single pass, providing detailed error reporting with line numbers.

Overview

Lexical-Analyzer reads C source code files and performs two critical functions:

  1. Lexical Analysis: Breaks down source code into tokens (keywords, operators, identifiers, constants, etc.)
  2. Error Detection: Identifies syntax errors like unclosed strings, mismatched brackets, missing semicolons, and more

The analyzer handles comments, preprocessor directives, and provides comprehensive error messages to help developers quickly locate and fix issues.

Features

  • Complete Tokenization: Recognizes keywords, operators, identifiers, constants, and special characters
  • Multi-Error Detection: Identifies 7 types of common syntax errors
  • Comment Handling: Properly processes both single-line (//) and multi-line (/* */) comments across multiple lines
  • Preprocessor Support: Skips preprocessor directives during analysis
  • Line-by-Line Processing: Memory-efficient streaming approach
  • Cross-Platform Line Endings: Handles both CRLF (Windows \r\n) and LF (Unix/Linux \n) formats by stripping \r characters during tokenization
  • Detailed Error Reporting: Provides error type, line number, and description
  • Single-Pass Analysis: Performs both lexical analysis and error detection in one pass

Line Ending Handling

The analyzer is cross-platform compatible and handles different line ending conventions:

  • LF (Line Feed - \n): Unix/Linux/macOS standard
  • CRLF (Carriage Return + Line Feed - \r\n): Windows standard

Implementation Details:

  • The getNextToken() function in lexer.c explicitly skips \r characters during whitespace processing
  • The tokenizer treats \r as whitespace and continues to the next token
  • This ensures consistent behavior across different operating systems
  • Files created on any platform can be analyzed without conversion
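As a rough illustration of the whitespace handling described above, the sketch below skips `\r` together with the other blank characters. It is not taken from lexer.c; the helper name `skip_whitespace` and its signature are assumptions made for this example.

```c
#include <stddef.h>

/* Hypothetical helper (not the project's actual code): advance past
 * spaces, tabs, newlines and carriage returns, and return the index of
 * the first character that could start a token. */
size_t skip_whitespace(const char *line, size_t pos)
{
    while (line[pos] == ' ' || line[pos] == '\t' ||
           line[pos] == '\n' || line[pos] == '\r') {
        pos++;
    }
    return pos;
}
```

Because `\r` is consumed like any other blank, a line ending in `\r\n` and a line ending in `\n` yield the same token stream.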

Detected Errors

Lexical-Analyzer identifies the following syntax errors:

Error Type | Description | Example | Status
--- | --- | --- | ---
Unclosed String | String literal missing closing quote | "Hello World | ✅ Implemented
Unclosed Character | Character literal missing closing quote | 'a | ✅ Implemented
Invalid Character Literal | Multiple characters or invalid escape sequences | 'abc', '\z' | ✅ Implemented
Missing Closing Brace | Unmatched opening { | void func() { | ✅ Implemented
Missing Closing Parenthesis | Unmatched opening ( | if (x > 5 | ✅ Implemented
Mismatched Bracket | Unmatched opening [ | int arr[10; | ✅ Implemented
Missing Semicolon | Statement without terminating ; | int x = 5 | ✅ Implemented

Error Detection Implementation Status

From error_detection_rules.txt:

✅ Fully Implemented:

  • Unclosed string literals
  • Unclosed character literals
  • Missing closing brace }
  • Missing closing parenthesis )
  • Mismatched brackets ]
  • Missing semicolons ; (with smart detection for control structures)
  • Invalid character literals (too many chars, empty literals)
  • Valid escape sequences: \n, \t, \r, \b, \f, \a, \v, \\, \', \", \0

❌ Not Implemented:

  • Invalid identifiers (e.g., 2cool, $var, int#var)
  • Invalid number formats (e.g., 12.12.12, 0xGHI)
  • Variable/function redeclaration detection
  • Type conflict detection
  • Keywords used as identifiers (e.g., int for = 5;)
  • Typos in keywords (e.g., retrun, els)
  • Preprocessor syntax validation
  • Invalid escape sequences (explicitly excluded per requirements)
  • Unclosed multi-line comments (explicitly excluded per requirements)

Note on Semicolon Detection: The analyzer intelligently handles semicolon requirements:

  • Detects missing semicolons after regular statements
  • Correctly skips checking for statements followed by braces (e.g., if, while, for, function definitions)
  • Handles both same-line and next-line brace patterns
  • Skips empty lines and lines with only braces

Token Types

Keywords (20)

int, float, return, if, else, while, for, do, break, continue,
char, double, void, switch, case, default, const, static, sizeof, struct

Operators

Single Character: + - * / % = ! < > | & ^

Multi-Character: == != <= >= && || ++ -- += -= *= /= %= &= |= ^= << >>
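One common way to recognize these operators is to try a two-character match first and fall back to a single character. The sketch below shows that idea under stated assumptions: the function name `operator_length` and the table `TWO_CHAR_OPS` are hypothetical and are not taken from lexer.c.

```c
#include <stddef.h>
#include <string.h>

/* The two-character operators listed above. */
static const char *const TWO_CHAR_OPS[] = {
    "==", "!=", "<=", ">=", "&&", "||", "++", "--", "+=",
    "-=", "*=", "/=", "%=", "&=", "|=", "^=", "<<", ">>"
};

/* Hypothetical helper: length of the operator starting at s
 * (2 or 1), or 0 if s does not start with a known operator. */
size_t operator_length(const char *s)
{
    size_t count = sizeof TWO_CHAR_OPS / sizeof TWO_CHAR_OPS[0];

    if (s[0] == '\0')
        return 0;

    /* Prefer the longer match so "==" is not split into "=" "=". */
    if (s[1] != '\0') {
        for (size_t i = 0; i < count; i++) {
            if (s[0] == TWO_CHAR_OPS[i][0] && s[1] == TWO_CHAR_OPS[i][1])
                return 2;
        }
    }

    /* Fall back to the single-character operators. */
    return strchr("+-*/%=!<>|&^", s[0]) != NULL ? 1 : 0;
}
```

With this approach an input such as `<<=` is read as `<<` followed by `=`, which matches the note elsewhere in this README that three-character operators are not implemented.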

Special Characters

, ; { } ( ) [ ]

Constants

String Literals: "Hello, World!" (including strings with escape sequences and format specifiers)

Character Literals:

  • Single characters: 'a', 'Z', '5'
  • Valid escape sequences: '\n', '\t', '\r', '\b', '\f', '\a', '\v', '\\', '\'', '\"', '\0'

Numbers:

  • Integers: 42, 0, 123
  • Floating Point: 3.14, .99, 10.5
  • Octal Numbers: 012, 0777
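The numeric forms above can all be scanned as "digits, optionally followed by a dot and more digits". The following sketch illustrates that rule only; `number_length` is a hypothetical name, and the real scanner in lexer.c may differ.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: length of the numeric constant starting at s,
 * or 0 if s does not start with one. Covers 42, 012, 3.14 and .99;
 * hexadecimal and binary prefixes are deliberately not handled. */
size_t number_length(const char *s)
{
    size_t i = 0;
    bool has_digits = false;

    /* Integer part (also covers octal such as 012). */
    while (isdigit((unsigned char)s[i])) {
        i++;
        has_digits = true;
    }

    /* Optional fraction: a dot that is followed by at least one digit. */
    if (s[i] == '.' && isdigit((unsigned char)s[i + 1])) {
        i++;
        while (isdigit((unsigned char)s[i]))
            i++;
        has_digits = true;
    }

    return has_digits ? i : 0;
}
```

A malformed input such as `12.12.12` is scanned here as `12.12` followed by `.12`, so no error is raised, consistent with invalid number formats being listed as not implemented.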

Identifiers

Valid identifiers must:

  • Start with letter (A-Z, a-z) or underscore _
  • Be followed by letters, digits (0-9), or underscores
  • Not be a keyword
  • Examples: count, _temp, value123, totalSum
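The rules above translate fairly directly into code. The sketch below is an illustration, not the project's implementation; the names `is_keyword` and `is_valid_identifier` are assumptions, and the keyword table simply repeats the 20 keywords listed in this README.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* The 20 keywords listed in the "Keywords" section. */
static const char *const KEYWORDS[] = {
    "int", "float", "return", "if", "else", "while", "for", "do", "break",
    "continue", "char", "double", "void", "switch", "case", "default",
    "const", "static", "sizeof", "struct"
};

/* Hypothetical helper: true if word is one of the listed keywords. */
bool is_keyword(const char *word)
{
    size_t count = sizeof KEYWORDS / sizeof KEYWORDS[0];
    for (size_t i = 0; i < count; i++) {
        if (strcmp(word, KEYWORDS[i]) == 0)
            return true;
    }
    return false;
}

/* Hypothetical helper: applies the three identifier rules in order. */
bool is_valid_identifier(const char *word)
{
    /* Rule 1: first character is a letter or underscore. */
    if (!(isalpha((unsigned char)word[0]) || word[0] == '_'))
        return false;

    /* Rule 2: remaining characters are letters, digits or underscores. */
    for (size_t i = 1; word[i] != '\0'; i++) {
        if (!(isalnum((unsigned char)word[i]) || word[i] == '_'))
            return false;
    }

    /* Rule 3: the word must not be a keyword. */
    return !is_keyword(word);
}
```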

Comments

  • Single-line: // comment text
  • Multi-line: /* comment text */ (tracked across multiple lines)

Preprocessor Directives

Preprocessor directives starting with # are skipped during error checking and tokenization.

Lexical Rules Implementation Status

From lexical_rules.txt:

✅ Fully Implemented

Keywords: All 20 keywords listed above are recognized (a subset of the full C keyword set)

Operators:

  • ✅ Single character operators: + - * / % = ! < > | & ^
  • ✅ Two character operators: ==, !=, <=, >=, &&, ||, ++, --, +=, -=, *=, /=, %=, &=, |=, ^=, <<, >>
  • ❌ Three character operators: None implemented (not required)

Special Characters: ✅ All implemented: , ; { } ( ) [ ]

Constants:

  • ✅ String literals: "Hello world"
  • ✅ Strings with escape sequences/format specifiers: "num is %d\n"
  • ✅ Character literals: 'm'
  • ✅ Integers: 10
  • ✅ Floating point: 10.99, .99
  • ✅ Octal numbers: 012
  • ❌ Signed numbers: -10, +10 (treated as an operator token followed by a number token)
  • ❌ Hexadecimal: 0xA (not implemented)
  • ❌ Binary: 0b1010, 0B1010 (not implemented)

Identifiers:

  • ✅ Must start with letter or underscore
  • ✅ Followed by alphanumeric or underscore
  • ✅ Cannot be a keyword
  • ✅ No special characters allowed

Preprocessor Directives:

  • ✅ Basic detection and skipping: #include, #define, etc.
  • ✅ Handles preprocessor directives with spaces
  • ❌ Keyword validation for preprocessor directives (not checked)

Comments:

  • ✅ Single-line comments: //
  • ✅ Multi-line comments: /* */ (with cross-line tracking)

❌ Not Implemented

  • Hexadecimal constants (0xA, 0xFF)
  • Binary constants (0b1010, 0B1010)
  • Signed number literals (negative/positive signs are treated as separate operator tokens)
  • Three character operators
  • Preprocessor directive keyword validation

Prerequisites

  • GCC Compiler or any C99-compatible compiler
  • Make (optional, for using Makefile)
  • Basic command-line knowledge

Installation

Clone the Repository

git clone https://github.com/SLADE0261/Lexical-Analyzer.git
cd Lexical-Analyzer

Compile the Project

Option 1: Using GCC directly

gcc main.c lexer.c error_det.c -o lexiscan

Option 2: Using Make (only if a Makefile is provided; none is listed under Project Structure)

make

Option 3: Manual Compilation

gcc -c main.c -o main.o
gcc -c lexer.c -o lexer.o
gcc -c error_det.c -o error_det.o
gcc main.o lexer.o error_det.o -o lexiscan

Usage

Basic Syntax

./lexiscan <filename.c>

Example Usage

Analyze a C source file:

./lexiscan program.c

Sample Output (No Errors):

KEYWORD              : int
IDENTIFIER           : main
SPECIAL_CHARACTER    : (
SPECIAL_CHARACTER    : )
SPECIAL_CHARACTER    : {
KEYWORD              : return
CONSTANT             : 0
SPECIAL_CHARACTER    : ;
SPECIAL_CHARACTER    : }

Sample Output (With Error):

ERROR at line 5: Unclosed string literal

Project Structure

Lexical-Analyzer/
├── main.c              # Main program entry point and orchestration
├── lexer.c             # Lexical analysis implementation
├── lexer.h             # Lexer function declarations and token types
├── error_det.c         # Error detection implementation
├── error_det.h         # Error detection function declarations
└── README.md           # This file

Detailed Component Description

main.c

  • Coordinates the analysis process
  • Reads input file line by line
  • Integrates error detection and tokenization
  • Outputs results to console
  • Performs final validation (e.g., unmatched braces at EOF)

lexer.c / lexer.h

  • Implements token recognition logic
  • Handles multi-character operators (checks two-character combinations)
  • Processes string and character literals
  • Skips single-line (//) and multi-line (/* */) comments during tokenization
  • Handles CRLF and LF line endings by treating \r as whitespace
  • Categorizes tokens into appropriate types
  • Maintains line number tracking for error reporting

error_det.c / error_det.h

  • Tracks bracket/brace/parenthesis matching with counters
  • Validates string and character literals
  • Checks for missing semicolons with intelligent control structure detection
  • Handles multi-line comment tracking across lines
  • Removes comments for clean analysis
  • Validates character literal content (length, escape sequences)
  • Skips preprocessor directives during error checking

Example Workflow

Test File: sample.c

#include <stdio.h>

int main() {
    int x = 10;
    printf("Value: %d\n", x);
    return 0;
}

Running Analysis

./lexiscan sample.c

Expected Output

KEYWORD              : int
IDENTIFIER           : main
SPECIAL_CHARACTER    : (
SPECIAL_CHARACTER    : )
SPECIAL_CHARACTER    : {
KEYWORD              : int
IDENTIFIER           : x
OPERATOR             : =
CONSTANT             : 10
SPECIAL_CHARACTER    : ;
IDENTIFIER           : printf
SPECIAL_CHARACTER    : (
CONSTANT             : "Value: %d\n"
SPECIAL_CHARACTER    : ,
IDENTIFIER           : x
SPECIAL_CHARACTER    : )
SPECIAL_CHARACTER    : ;
KEYWORD              : return
CONSTANT             : 0
SPECIAL_CHARACTER    : ;
SPECIAL_CHARACTER    : }

Error Examples

Example 1: Unclosed String

int main() {
    printf("Hello World;  // Missing closing quote
    return 0;
}

Output: ERROR at line 2: Unclosed string literal

Example 2: Missing Semicolon

int main() {
    int x = 5  // Missing semicolon
    return 0;
}

Output: ERROR at line 2: Missing semicolon ';' at end of statement

Example 3: Mismatched Braces

int main() {
    if (x > 0) {
        printf("Positive");
    // Missing closing brace for main
}

Output: ERROR at end of file: Missing 1 closing brace(s) '}'

Example 4: Invalid Character Literal

int main() {
    char c = 'ab';  // Too many characters
    return 0;
}

Output: ERROR at line 2: Invalid character literal - too many characters

Example 5: Smart Semicolon Detection

int main() {
    if (x > 0) {  // No error - brace follows
        printf("Positive");
    }
    
    while (y < 10)  // No error - brace on next line
    {
        y++;
    }
    
    int z = 5  // Error - missing semicolon
}

Output: ERROR at line 11: Missing semicolon ';' at end of statement

Technical Implementation Details

Comment Handling

Single-line comments (//):

  • Detected when // is encountered outside strings/character literals
  • Everything from the // to the end of the line is ignored
  • A // inside a string literal is kept as part of the string, not treated as a comment

Multi-line comments (/* */):

  • Tracked with a static in_multiline_comment flag
  • State persists across multiple lines
  • A /* that appears inside an already open comment is ignored; C block comments do not nest, so the first */ closes the block
  • Opening /* sets the flag, closing */ clears it
  • All content between markers is ignored during tokenization
  • Comment markers inside string literals are kept as string content

Implementation in code:

  • removeComments() function in error_det.c strips comments while preserving string/char content
  • skipCommentsAndPreprocessors() function in lexer.c handles comment detection during tokenization
  • Both functions respect string and character literal boundaries
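The comment-stripping behavior described above can be sketched as a small state machine. This is an illustration under stated assumptions, not the code of removeComments(): the function name `strip_comments` is hypothetical, and the cross-line state is passed in explicitly through `in_block` rather than kept in a static flag.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch: remove comments from line in place and return
 * the new length. *in_block carries the "inside a block comment" state
 * from one line to the next. */
size_t strip_comments(char *line, bool *in_block)
{
    size_t out = 0;
    char quote = '\0';  /* '"' or '\'' while inside a literal, else 0 */

    for (size_t i = 0; line[i] != '\0'; i++) {
        char c = line[i];

        if (*in_block) {
            /* Inside a block comment: drop everything up to the closing marker. */
            if (c == '*' && line[i + 1] == '/') {
                *in_block = false;
                i++;  /* skip the '/' */
            }
            continue;
        }

        if (quote != '\0') {
            /* Inside a string or character literal: copy verbatim. */
            line[out++] = c;
            if (c == '\\' && line[i + 1] != '\0')
                line[out++] = line[++i];  /* keep the escaped character */
            else if (c == quote)
                quote = '\0';
            continue;
        }

        if (c == '"' || c == '\'') {
            quote = c;
            line[out++] = c;
        } else if (c == '/' && line[i + 1] == '/') {
            break;  /* single-line comment: ignore the rest of the line */
        } else if (c == '/' && line[i + 1] == '*') {
            *in_block = true;
            i++;  /* skip the '*' */
        } else {
            line[out++] = c;
        }
    }

    line[out] = '\0';
    return out;
}
```

A caller would keep one `bool` per file, initialize it to `false`, and pass its address for every line so that a comment opened on one line is still active on the next.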

Preprocessor Handling

  • Lines starting with # (after trimming whitespace) are identified as preprocessor directives
  • Preprocessor lines are completely skipped during error checking
  • Includes, defines, and macros are not analyzed for syntax errors
  • The tokenizer skips the entire line when # is encountered
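The detection rule in the first bullet can be sketched in a few lines. The name `is_preprocessor_line` is hypothetical and is used here only for illustration.

```c
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical helper: true if the first non-blank character of the
 * line is '#', i.e. the line is a preprocessor directive. */
bool is_preprocessor_line(const char *line)
{
    while (isspace((unsigned char)*line))
        line++;
    return *line == '#';
}
```

A line-reading loop could call this helper first and skip both error checking and tokenization whenever it returns true.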

Escape Sequence Support

Valid escape sequences in character literals:

  • \n - Newline
  • \t - Tab (horizontal)
  • \r - Carriage return
  • \b - Backspace
  • \f - Form feed
  • \a - Alert (bell)
  • \v - Vertical tab
  • \\ - Backslash
  • \' - Single quote
  • \" - Double quote
  • \0 - Null character

Validation: The validateCharLiteral() function checks:

  1. Character literal has proper quote wrapping
  2. Single characters are always valid
  3. Two-character sequences starting with \ must be valid escape sequences
  4. Anything else is flagged as an error
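The four checks can be expressed as one short function. The sketch below is not the source of validateCharLiteral(); the name `char_literal_is_valid` is an assumption, and it takes the literal text including both quotes. As an extra assumption, a lone backslash or quote between the quotes is rejected, since such text would really be an unterminated literal.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical sketch: lit is the literal text including both quotes. */
bool char_literal_is_valid(const char *lit)
{
    size_t len = strlen(lit);

    /* Check 1: wrapped in single quotes, and not the empty literal ''. */
    if (len < 3 || lit[0] != '\'' || lit[len - 1] != '\'')
        return false;

    /* Check 2: a single ordinary character. */
    if (len == 3)
        return lit[1] != '\\' && lit[1] != '\'';

    /* Check 3: a backslash followed by one of the supported escapes. */
    if (len == 4 && lit[1] == '\\')
        return strchr("ntrbfav\\'\"0", lit[2]) != NULL;

    /* Check 4: anything else (for example 'ab') is an error. */
    return false;
}
```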

Bracket Matching Algorithm

Implementation:

  • Maintains three separate counters: brace_count, paren_count, bracket_count
  • Counters are static variables persisting across function calls
  • Only counts brackets/braces/parentheses outside strings and character literals

Logic:

  1. Opening bracket ({, (, [) → Increment respective counter
  2. Closing bracket (}, ), ]) → Decrement respective counter
  3. If counter goes negative → Extra closing bracket detected (immediate error)
  4. At end of file, if counter > 0 → Missing closing brackets (reported by checkFinalBraceCount())

Per-line tracking:

  • Also maintains line_brace_count, line_paren_count, line_bracket_count
  • Used to help determine if a line should require a semicolon

Semicolon Detection Logic

The analyzer uses intelligent semicolon detection:

  1. Skip if line has braces: Lines that open a brace block don't need semicolons
  2. Check next line: Uses isStatementWithBrace() to peek ahead for opening braces
  3. Control structure detection: Statements ending with ) followed by { don't need semicolons
  4. Content validation: Empty lines or lines with only braces are skipped
  5. Preprocessor exemption: Lines starting with # are ignored

Algorithm flow:

IF line has content AND no semicolon AND no brace opened:
    IF NOT (line ends with ')' and next line has '{'):
        IF line has actual code (not just whitespace/braces):
            ERROR: Missing semicolon
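The algorithm flow above could be sketched in C as follows. This is a simplified illustration and not the project's implementation: `missing_semicolon`, `first_char`, and `last_char` are hypothetical names, the line is assumed to be comment-free already, and "brace opened" is approximated by checking whether the line ends in a brace.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* First non-blank character of s, or '\0' if s is blank. */
static char first_char(const char *s)
{
    while (isspace((unsigned char)*s))
        s++;
    return *s;
}

/* Last non-blank character of s, or '\0' if s is blank. */
static char last_char(const char *s)
{
    size_t n = strlen(s);
    while (n > 0 && isspace((unsigned char)s[n - 1]))
        n--;
    return n > 0 ? s[n - 1] : '\0';
}

/* Hypothetical sketch: true if line looks like a statement that lacks
 * its terminating ';'. next_line may be NULL at end of file. */
bool missing_semicolon(const char *line, const char *next_line)
{
    char first = first_char(line);
    char last = last_char(line);

    if (first == '\0' || first == '#')
        return false;  /* empty line or preprocessor directive */
    if (last == ';' || last == '{' || last == '}')
        return false;  /* terminated, or a brace line */
    if (last == ')' && next_line != NULL && first_char(next_line) == '{')
        return false;  /* control structure with brace on the next line */
    return true;
}
```

A sketch this simple also flags a brace-less `if (x > 0)` followed by a single statement, which is the kind of false semicolon error mentioned under Common Issues and Solutions.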

Limitations

  1. Single File Analysis: Analyzes one file at a time
  2. No Semantic Analysis: Only checks syntax, not logic or type correctness
  3. C Source Only: Optimized for C source files
  4. Limited Preprocessor Support: Skips preprocessor directives entirely without validating syntax
  5. No Type Checking: Does not validate variable types or declarations
  6. No Hexadecimal/Binary: Does not recognize hex (0x) or binary (0b) number formats
  7. No Signed Literals: Treats -10 as two tokens (operator - and constant 10)
  8. No Redeclaration Detection: Does not track variable/function declarations
  9. No Keyword Typo Detection: Does not suggest corrections for misspelled keywords
  10. No Scope Analysis: Does not validate identifier usage or scope rules

Common Issues and Solutions

Issue | Possible Cause | Solution
--- | --- | ---
Compilation error | Missing source files | Ensure all .c and .h files are in directory
"Could not open file" | Wrong file path | Check file path and permissions
No output | Empty file | Verify the input file has content
False semicolon errors | Complex control flow | Check for proper brace placement after control statements
Undetected errors | Error type not implemented | Refer to "Error Detection Implementation Status" section

Future Enhancements

Planned Features:

  • Hexadecimal number support (0xABC)
  • Binary number support (0b1010)
  • Symbol table generation
  • Syntax tree construction
  • Semantic analysis support
  • Multiple file analysis
  • Invalid identifier detection
  • Keyword typo suggestions
  • Variable redeclaration detection
  • Type conflict detection
  • Configuration file for custom keywords/operators
  • JSON/XML output format option
  • Integration with IDE plugins
  • Warning levels (errors vs warnings)
  • Code metrics (lines of code, complexity)
  • Preprocessor macro expansion

Contributing

Contributions are welcome! Please follow these guidelines:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Code Style

  • Follow K&R C coding style
  • Use meaningful variable names
  • Comment complex logic
  • Keep functions focused and concise

Testing

Create test files in a tests/ directory:

# Test successful analysis
./lexiscan tests/valid_program.c

# Test error detection
./lexiscan tests/unclosed_string.c
./lexiscan tests/missing_semicolon.c
./lexiscan tests/mismatched_braces.c
./lexiscan tests/invalid_char_literal.c

Suggested Test Cases

Valid Code Tests:

  • Programs with all token types
  • Complex expressions with operators
  • Multi-line comments
  • Mixed line endings (CRLF and LF)
  • Control structures with various brace styles

Error Detection Tests:

  • Each error type individually
  • Multiple errors in one file
  • Edge cases (empty files, comment-only files)
  • Nested structures with mismatched brackets

License

This project is open source and available for educational purposes.

Author

Krishnakant C. Pore

Acknowledgments

  • Compiler Design principles and techniques
  • Dragon Book (Compilers: Principles, Techniques, and Tools)
  • C programming community
  • Open source lexer implementations for reference

Repository

GitHub: https://github.com/SLADE0261/Lexical-Analyzer


★ If you find this project useful, please consider giving it a star on GitHub!
