Line-watch is a command-line interface (CLI) utility, written in Python, that emulates and extends the core functionality of tools like grep. What sets it apart is its entirely custom-built regular expression engine. The project began as an exploration of fundamental computer science concepts, including compiler design principles, lexical analysis, parsing techniques, and finite automata theory, with the aim of building a self-contained and efficient solution for text pattern matching.
It provides a practical tool for searching text while also demonstrating how a regex engine can be constructed from the ground up. The engine's core is a recursive backtracking algorithm: it tries each way a pattern element could match and backs up when a later element fails, which simulates the behavior of a Non-deterministic Finite Automaton (NFA).
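To make the backtracking idea concrete, here is a minimal sketch of a recursive matcher. It is not the code in `engine.py`: the function names are hypothetical, and it only handles literals, `.`, `^`, a trailing `$`, and the `*`, `+`, `?` quantifiers (no escapes, sets, or `{n,m}`).

```python
def match_here(pattern: str, text: str) -> bool:
    """Return True if `pattern` matches at the very start of `text`."""
    if not pattern:
        return True  # nothing left to match
    if pattern == "$":
        return text == ""  # end anchor: only matches at end of input
    if len(pattern) >= 2 and pattern[1] in "*+?":
        return match_quantified(pattern[0], pattern[1], pattern[2:], text)
    if text and pattern[0] in (".", text[0]):
        return match_here(pattern[1:], text[1:])
    return False


def match_quantified(atom: str, quant: str, rest: str, text: str) -> bool:
    """Greedy repetition: consume as much as allowed, then back off."""
    min_count = 1 if quant == "+" else 0
    max_count = 1 if quant == "?" else len(text)
    count = 0
    while count < max_count and count < len(text) and atom in (".", text[count]):
        count += 1
    # Backtracking step: give characters back one at a time until the
    # remainder of the pattern matches or the minimum is violated.
    while count >= min_count:
        if match_here(rest, text[count:]):
            return True
        count -= 1
    return False


def search(pattern: str, text: str) -> bool:
    """Return True if `pattern` matches anywhere in `text` (grep-style)."""
    if pattern.startswith("^"):
        return match_here(pattern[1:], text)
    return any(match_here(pattern, text[i:]) for i in range(len(text) + 1))
```

For example, `search("colou?r", "colour")` succeeds, while `search("^start", "The line starts here.")` fails because the anchor pins the match to the first character.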
Line-watch offers a solid set of features, covering many common regular expression syntaxes:
- Extensive Regex Syntax Coverage:
  - Literal Character Matching: Exact identification of character sequences.
  - Wildcard (`.`): Matches any single character (excluding newline).
  - Anchors (`^`, `$`):
    - `^`: Asserts position at the beginning of a line.
    - `$`: Asserts position at the end of a line.
  - Character Classes:
    - `\d`: Matches any decimal digit (0-9).
    - `\w`: Matches any "word" character (alphanumeric `[a-z, A-Z, 0-9]` and underscore `_`).
  - Character Sets (`[]`):
    - `[abc]`: Matches any single character from a specified explicit set.
    - `[^abc]`: Matches any single character not in the specified set (negated character class).
    - `[a-z]`: Supports character ranges for concise definition (e.g., `[a-z]`, `[0-9]`).
  - Quantifiers: Control over the repetition of the preceding element:
    - `*`: Matches zero or more occurrences.
    - `+`: Matches one or more occurrences.
    - `?`: Matches zero or one occurrence (optional).
    - `{n}`: Matches exactly `n` occurrences.
    - `{n,}`: Matches `n` or more occurrences.
    - `{n,m}`: Matches between `n` and `m` (inclusive) occurrences.
  - Escaped Special Characters: Handles matching of special regex characters as literals (e.g., `\.` to match a literal dot, `\*` to match a literal asterisk).
- Robust Command-Line Interface (CLI): Provides a user-friendly interface for performing pattern searches against direct input strings or by processing content from specified files.
- Comprehensive Unit Testing: Supported by a rigorous test suite using `pytest` and `unittest`, ensuring reliability and validating the engine's behavior across diverse scenarios, including complex patterns and edge cases.
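As an illustration of the character-set and counted-quantifier syntax listed above, the following sketch shows one way such components could be parsed and checked. The helper names (`parse_char_set`, `char_in_set`, `parse_bounds`) are hypothetical and are not taken from `utils.py`; escapes inside sets are not handled.

```python
def parse_char_set(body: str):
    """Parse the text between '[' and ']' into (negated, singles, ranges)."""
    negated = body.startswith("^")
    if negated:
        body = body[1:]
    singles, ranges = set(), []
    i = 0
    while i < len(body):
        # "a-z" style range: a character, a dash, and a closing character.
        if i + 2 < len(body) and body[i + 1] == "-":
            ranges.append((body[i], body[i + 2]))
            i += 3
        else:
            singles.add(body[i])
            i += 1
    return negated, singles, ranges


def char_in_set(ch: str, body: str) -> bool:
    """Return True if `ch` is accepted by the set written as [body]."""
    negated, singles, ranges = parse_char_set(body)
    hit = ch in singles or any(low <= ch <= high for low, high in ranges)
    return hit != negated  # a negated set inverts the result


def parse_bounds(spec: str):
    """Turn '{n}', '{n,}' or '{n,m}' into (min, max); max is None if open."""
    inner = spec[1:-1]
    if "," not in inner:
        n = int(inner)
        return n, n
    low, high = inner.split(",", 1)
    return int(low), (int(high) if high else None)
```

With these helpers, `char_in_set("7", "A-Z0-9")` is true, `char_in_set("b", "^abc")` is false, and `parse_bounds("{2,}")` yields `(2, None)`.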
The project maintains a well-defined and modular directory structure, promoting code readability and maintainability:
    line-watch/
    ├── .github/
    │   └── workflows/
    │       └── security.yml    # GitHub Actions workflow for automated security analysis and CI/CD
    ├── line_watch/
    │   ├── __init__.py         # Python package initialization file
    │   ├── cli.py              # Command-Line Interface (CLI) implementation
    │   ├── engine.py           # Core Regular Expression Engine: contains parsing and matching logic
    │   └── utils.py            # Utility functions, including character check helpers and regex component parsing
    ├── tests/
    │   ├── __init__.py
    │   └── test_engine.py      # Comprehensive unit tests for the regex engine and utility functions
    ├── .gitignore              # Specifies files and directories to be excluded from version control
    ├── LICENSE                 # Project licensing information (MIT License)
    ├── pyproject.toml          # Project metadata, build configuration, and command-line entry point (PEP 517/621 standard)
    └── README.md               # This project documentation file
To set up and run Line-watch on your local machine, follow these instructions:
- Clone the Repository: Begin by cloning the project repository:

      git clone https://github.com/SREEHARI-M-S/line-watch.git
      cd line-watch

- Create and Activate a Virtual Environment (Recommended): It's good practice to use a virtual environment for dependency management, ensuring project isolation.

      python3 -m venv venv
      source venv/bin/activate  # On Windows: `venv\Scripts\activate`

- Install Project Package: Install Line-watch as an editable package. This allows you to run the `lw` command directly from your terminal and ensures any changes you make to the source code are immediately reflected.

      pip install -e .

  This command will also install `pytest`, which is used for testing.
The lw CLI utility provides a streamlined interface for pattern matching.
The general command structure is:
    lw <regex_pattern> [-f <filename> | -s <string_to_match>]

- `<regex_pattern>`: The regular expression pattern you wish to match.
- `-f <filename>` / `--file <filename>`: Specify a file path. Line-watch will search for the pattern within each line of this file.
- `-s <string_to_match>` / `--string <string_to_match>`: Provide a direct string to match against. This is suitable for single-line pattern checks.
- No `-f` or `-s` argument: If neither `-f` nor `-s` is specified, `lw` will read input line by line from standard input (stdin).
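The option handling described above could be wired up roughly as follows. This is a sketch under stated assumptions, not the contents of `cli.py`: it assumes `-f` and `-s` are mutually exclusive (as the `|` in the usage line suggests), uses Python's built-in `re` module as a stand-in for the custom engine, and returns the matching lines instead of printing them. The names `build_parser`, `matching_lines`, and `main` are hypothetical.

```python
import argparse
import re  # stand-in for the custom engine, for illustration only
import sys


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="lw")
    parser.add_argument("pattern", help="regular expression to search for")
    # Assumption: the two input sources cannot be combined.
    source = parser.add_mutually_exclusive_group()
    source.add_argument("-f", "--file", help="search each line of this file")
    source.add_argument("-s", "--string", help="search this single string")
    return parser


def matching_lines(pattern, lines):
    """Yield (line_number, line) for every line containing a match."""
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if re.search(pattern, line):
            yield number, line


def main(argv=None, stdin=sys.stdin):
    args = build_parser().parse_args(argv)
    if args.string is not None:
        lines = [args.string]
    elif args.file is not None:
        with open(args.file, encoding="utf-8") as handle:
            lines = handle.readlines()
    else:
        lines = stdin  # fall back to reading standard input line by line
    return list(matching_lines(args.pattern, lines))
```

Calling `main(["^ERROR:", "-f", "data.log"])` would return the numbered lines of `data.log` that start with `ERROR:`; with no `-f` or `-s`, the same function iterates over standard input.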
- Matching a Pattern Against a Direct String Input (`-s`):

      lw "hello" -s "This is a hello world string."
      # Output: ✅ Matched: 'This is a hello world string.' with pattern 'hello'

      lw "^start" -s "The line starts here."
      # Output: ❌ Not Matched: 'The line starts here.' with pattern '^start'
- Utilizing Character Classes:

      lw "ID\\d+" -s "User ID12345"
      # Output: ✅ Matched: 'User ID12345' with pattern 'ID\d+'

      lw "func_\\w+" -s "Calling func_initialize."
      # Output: ✅ Matched: 'Calling func_initialize.' with pattern 'func_\w+'
- Employing Character Sets and Ranges:

      lw "[aeiou]pple" -s "apple"
      # Output: ✅ Matched: 'apple' with pattern '[aeiou]pple'

      lw "[A-Z0-9]{3}" -s "ABCDEF"
      # Output: ✅ Matched: 'ABCDEF' with pattern '[A-Z0-9]{3}'
- Demonstrating Quantifiers (`*`, `+`, `?`, `{n,m}`):

      lw "colou?r" -s "colour"
      # Output: ✅ Matched: 'colour' with pattern 'colou?r'

      lw "a{2,4}b" -s "aaab"
      # Output: ✅ Matched: 'aaab' with pattern 'a{2,4}b'

      lw "X.*Y" -s "This is XcontentY within text."
      # Output: ✅ Matched: 'This is XcontentY within text.' with pattern 'X.*Y'
- Processing Patterns Against a File (`-f`): First, create a sample `data.log` file:

      echo "INFO: Application started." > data.log
      echo "DEBUG: Variable x = 10." >> data.log
      echo "ERROR: File not found in /path/to/data." >> data.log
      echo "WARNING: Disk space low (20GB left)." >> data.log

  Then, execute Line-watch:

      lw "^ERROR:" -f data.log
      # Output:
      # ✅ Line 3: ERROR: File not found in /path/to/data.

      lw "\\d+GB" -f data.log
      # Output:
      # ✅ Line 4: WARNING: Disk space low (20GB left).
- Reading Input from Standard Input (stdin): If neither `-f` nor `-s` is provided, `lw` will read lines from standard input until an End-of-File (EOF) signal is received (`Ctrl+D` on Unix/macOS, `Ctrl+Z` then `Enter` on Windows).

      lw "keyword"
      # Output:
      # Enter text to match (Ctrl+D or Ctrl+Z to finish input):
      # This line contains a keyword.
      # ✅ Line 1: This line contains a keyword.
      # Another line without the word.
      # Final keyword test.
      # ✅ Line 3: Final keyword test.
      # (Press Ctrl+D or Ctrl+Z then Enter to finish input)
      # ❌ No matches found. (If no matches were found overall)
A comprehensive suite of unit tests is included to ensure the integrity, accuracy, and reliability of the regex engine's behavior across various scenarios.
To execute the test suite:
- Navigate to the project's root directory in your terminal.
- Ensure `pytest` is installed (it should be if you installed with `pip install -e .`).
- Run the tests using the `pytest` command:

      pytest

  A successful execution will indicate that all tests have passed, confirming the engine's functionality across a wide range of regex constructs and edge cases.
While Line-watch currently offers robust functionality for fundamental regex operations, the domain of regular expressions is expansive. Future enhancements I'm considering include:
- Performance Optimizations:
  - NFA to DFA Conversion: Investigating and implementing algorithms to convert the NFA-based matching to a Deterministic Finite Automaton (DFA) for potentially superior performance and more predictable matching times, particularly with complex or pathological regexes.
  - Optimized NFA Simulation: Exploring and integrating more advanced NFA simulation techniques that can significantly reduce redundant backtracking and improve overall efficiency.
- Advanced Regex Constructs:
  - Capturing Groups (`()`): Extending functionality to capture and extract specific matched substrings for further processing.
  - Alternation (`|`): Implementing "OR" logic within patterns (e.g., `(cat|dog)`).
  - Non-greedy Quantifiers (`*?`, `+?`, `??`, `{n,m}?`): Introducing support for matching the shortest possible string segment that satisfies the pattern.
  - Lookaheads and Lookbehinds: Implementing zero-width assertions that assert conditions about what follows or precedes the current matching position without consuming characters.
  - Word Boundaries (`\b`, `\B`): For precise matching at word edges.
  - Whitespace Classes (`\s`, `\S`): For matching any whitespace character or non-whitespace character, respectively.
- Enhanced Error Handling and Reporting: Developing more granular, specific, and user-friendly error messages for syntactically incorrect or malformed regex patterns.
- Formal Parser Implementation: Transitioning to a more structured regex parser, potentially incorporating a dedicated lexer (tokenizer) and an Abstract Syntax Tree (AST) builder, to enhance extensibility, maintainability, and improve error recovery mechanisms.
- Full Unicode Support: Expanding character matching and character class definitions to fully support the vast Unicode character set, enabling global application.
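As a rough idea of what the "Optimized NFA Simulation" item in the list above could look like, the sketch below tracks the set of all reachable pattern positions in lock-step with the input instead of backtracking, so each input character is examined once per state. It is an illustrative assumption about one possible direction, not planned code: the names are hypothetical, it matches the whole string only, and it supports just literals, `.`, `*`, `+`, and `?`.

```python
def tokenize(pattern: str):
    """Split a pattern into (atom, quantifier) pairs; quantifier may be ''."""
    tokens = []
    i = 0
    while i < len(pattern):
        atom, quant = pattern[i], ""
        if i + 1 < len(pattern) and pattern[i + 1] in "*+?":
            quant = pattern[i + 1]
            i += 1
        tokens.append((atom, quant))
        i += 1
    return tokens


def closure(states, tokens):
    """Add states reachable without consuming input (skipping '*' and '?')."""
    result, stack = set(), list(states)
    while stack:
        state = stack.pop()
        if state in result:
            continue
        result.add(state)
        if state < len(tokens) and tokens[state][1] in ("*", "?"):
            stack.append(state + 1)
    return result


def full_match(pattern: str, text: str) -> bool:
    """Return True if `pattern` matches all of `text`, without backtracking."""
    tokens = tokenize(pattern)
    accept = len(tokens)  # the state just past the last token
    current = closure({0}, tokens)
    for ch in text:
        following = set()
        for state in current:
            if state == accept:
                continue  # accept state has no outgoing transitions
            atom, quant = tokens[state]
            if atom == "." or atom == ch:
                following.add(state + 1)
                if quant in ("*", "+"):
                    following.add(state)  # repetition may continue
        current = closure(following, tokens)
        if not current:
            return False  # no live states: the match cannot succeed
    return accept in current
```

A pattern such as `a*a*a*b` against a long run of `a` characters is a typical pathological input for a backtracking matcher; here the work per character is bounded by the number of tokens, so `full_match("a*a*a*b", "a" * 30)` returns `False` quickly.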
I welcome and value contributions to the Line-watch project! Your insights and expertise can significantly help improve it. To contribute effectively, please adhere to the following guidelines:
- Fork the repository on GitHub.
- Create a new branch for your feature or bug fix. Use descriptive names like `feature/add-alternation-operator` or `bugfix/resolve-anchor-issue`.

      git checkout -b feature/your-feature-name

- Implement your changes, ensuring they align with the project's architectural principles and coding style.
- Write comprehensive unit tests for any new functionality or bug fixes. This is crucial for maintaining the project's stability and reliability.
- Commit your changes with a clear, concise, and descriptive commit message. Follow conventional commit guidelines where applicable (e.g., `feat: Add new feature X`, `fix: Resolve issue with nested quantifiers`).
- Push your branch to your forked repository:

      git push origin feature/your-feature-name

- Open a Pull Request (PR) against the `main` branch of the original repository. Provide a detailed description of your changes, their rationale, and any relevant test results.
Before submitting a Pull Request, please ensure your code adheres to Python's best practices and successfully passes all existing tests.
This project is open-source and distributed under the terms of the MIT License. A copy of the license is available in the LICENSE file within the repository.