Forgex—Fortran Regular Expression—is a regular expression engine written entirely in Fortran.
This project is managed with the Fortran Package Manager (FPM), provides basic regular expression processing, and is freely available under the MIT license. The engine's core algorithm uses a deterministic finite automaton (DFA) approach, chosen with runtime performance in mind.
- | Vertical bar, alternation
- * Asterisk, match zero or more
- + Plus, match one or more
- ? Question, match zero or one
- \ escape metacharacter
- . matches any character

- character class [a-z]
- inverted character class [^a-z]
- character class on UTF-8 code set [α-ωぁ-ん]
- shorthands in character class [\d]
Note that an inverted character class does not match control characters.
{num}, {,max}, {min,}, {min, max}, where num and max must NOT be zero.
To use a literal left curly brace {, escape it with a backslash: \{.
- ^, matches the beginning of a line
- $, matches the end of a line

- \t tab character
- \n new line character (LF or CRLF)
- \r return character (CR)
- \s blank character (white space, TAB, CR, LF, FF, "Zenkaku" space U+3000)
- \S non-blank character
- \w ([a-zA-Z0-9_])
- \W ([^a-zA-Z0-9_])
- \d digit character ([0-9])
- \D non-digit character ([^0-9])
- \x.., \x{...} hexadecimal escape sequences; for instance, \x63 matches c.
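As a quick way to try the syntax listed above, the following sketch feeds a few patterns to the .match. and .in. operators that `use forgex` provides. It assumes the operators accept character literals and that the compiler leaves backslashes in string literals untouched (the gfortran default); the program name is arbitrary. Only the \x63 case is stated explicitly in this document. The other lines should print T if the listed syntax behaves as described, so treat them as checks rather than guaranteed output.

```fortran
program syntax_demo
   use :: forgex
   implicit none

   ! Hexadecimal escape: \x63 is documented above to match the letter c.
   print *, '\x63' .match. 'c'

   ! Bounded repetition {min, max}: two or three occurrences of a.
   print *, 'a{2,3}' .match. 'aaa'

   ! A literal left curly brace must be escaped with a backslash.
   print *, 'a\{b' .match. 'a{b'

   ! Anchor: ^ ties the pattern to the beginning of a line.
   print *, '^foo' .in. 'foobar'

   ! Shorthand inside a character class.
   print *, '[\d]+' .match. '2024'
end program syntax_demo
```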
Note: It is the user's responsibility to ensure that the input text and the regular expression patterns passed to the API are encoded in UTF-8.
Version 4.2 adds handling for non-UTF-8 input strings. When the processor encounters a non-UTF-8 character, it replaces it byte by byte with U+FFFF and attempts to continue matching. Since this library is primarily focused on UTF-8 string processing, this feature should be considered experimental and preliminary. That is, it is intended for simple purposes, such as searching for a pattern that matches ASCII characters and classes in a text file created with a character encoding that includes ASCII.
The documentation is available in English and Japanese at https://shinobuamasaki.github.io/forgex.
Operation has been confirmed with the following compilers:
- GNU Fortran (gfortran) v11.4.0, v12.2.0, v13.2.1
- Intel Fortran Compiler (ifx) 2024.2.1 20240711
- LLVM Flang (flang-19, flang-20) v19.1.7, v20.1.0 (FPM v0.11.0 cannot build forgex with flang-19 or flang-20. Please use CMake instead for now.)

Note: Not available for Flang 18 and earlier.
It is assumed that you will use the Fortran Package Manager (fpm).
First of all, add the following to your project's fpm.toml:

[dependencies]
forgex = {git = "https://github.com/shinobuamasaki/forgex"}

If you use macOS, you can install this library by using MacPorts with the following command:

sudo port install forgex

In this case, the .mod files will be placed in /opt/local/include/forgex and the library file will be placed in /opt/local/lib,
so to compile your source code, run the following command:

gfortran main.f90 -I/opt/local/include/forgex -L/opt/local/lib -lforgex

If you are using this installation method and want to build using fpm, make the following changes to fpm.toml:

[build]
external-modules = [ "forgex" ]
link = [ "forgex" ]

Then you can build your program with the following command:

fpm build --flag "-I/opt/local/include/forgex" --link-flag "-L/opt/local/lib"

See also https://ports.macports.org/port/forgex/details
If you want to build this library with CMake, execute the following commands:

cd forgex
cmake -S . -B build
cmake --build build

Then, you can use the code in the test/ directory to test the library with the following command:

ctest -C Debug --test-dir build

When you write use forgex at the top of your program, the .in. and .match. operators, the regex subroutine, and the regex_f function become available.
program main
use :: forgex
implicit none

The .in. operator returns true if the pattern is contained in the string.
block
character(:), allocatable :: pattern, str
pattern = 'foo(bar|baz)'
str = "foobarbaz"
print *, pattern .in. str ! T
str = "foofoo"
print *, pattern .in. str ! F
end block

The .match. operator returns true if the pattern exactly matches the string.
block
character(:), allocatable :: pattern, str
pattern = '\d{3}-\d{4}'
str = '100-0001'
print *, pattern .match. str ! T
str = '1234567'
print *, pattern .match. str ! F
end block

Note that the .in. and .match. operators return false for invalid pattern inputs.
The regex subroutine returns, through its arguments, the substring of a string that matches a pattern.
block
character(:), allocatable :: pattern, str, res
integer :: length
pattern = 'foo(bar|baz)'
str = 'foobarbaz'
call regex(pattern, str, res)
print *, res ! foobar
! call regex(pattern, str, res, length)
! The value 6 is stored in the optional `length` variable.
end block

By using the from/to arguments, you can extract substrings from the given string.
block
character(:), allocatable :: pattern, str, res
integer :: from, to
pattern = '[d-f]{3}'
str = 'abcdefghi'
call regex(pattern, str, res, from=from, to=to)
print *, res ! def
! The `from` and `to` variables store the indices of the start and end points
! of the matched part of the string `str`, respectively.
! Cut out before the matched part.
print *, str(1:from-1) ! abc
! Cut out the matched part, which is equivalent to the result of `regex`.
print *, str(from:to) ! def
! Cut out after the matched part.
print *, str(to+1:len(str)) ! ghi
end block

The interface of the regex subroutine is as follows:
interface regex
module procedure :: subroutine__regex
end interface
pure subroutine subroutine__regex(pattern, text, res, length, from, to, status, err_msg)
implicit none
character(*), intent(in) :: pattern, text
character(:), allocatable, intent(inout) :: res
integer, optional, intent(inout) :: length, from, to, status
character(*), optional, intent(inout) :: err_msg

The list of all status values is defined in the source file at src/ast/syntax_tree_error_m.f90.
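A minimal sketch of how the optional status and err_msg arguments from the interface above might be used. The pattern is assumed to be rejected because its parenthesis is never closed. The buffer length of 256 is an arbitrary choice, and the concrete status codes are not reproduced here; consult the source file named above for them.

```fortran
program regex_status_demo
   use :: forgex
   implicit none
   character(:), allocatable :: res
   character(256) :: err_msg   ! fixed-length buffer; 256 is an arbitrary size
   integer :: status

   ! Both arguments are intent(inout), so give them defined values first.
   status = 0
   err_msg = ''

   ! Assumption: the unbalanced parenthesis makes this pattern invalid.
   call regex('foo(bar|baz', 'foobarbaz', res, status=status, err_msg=err_msg)

   print *, 'status  = ', status
   print *, 'message = ', trim(err_msg)
end program regex_status_demo
```

Passing the arguments by keyword keeps the call readable and avoids having to supply length, from, and to when only the error information is of interest.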
If you want the matched character string as the return value of a function,
consider using regex_f, defined in the forgex module.
interface regex_f
module procedure :: function__regex
end interface regex_f
pure function function__regex(pattern, text) result(res)
implicit none
character(*), intent(in) :: pattern, text
character(:), allocatable :: res

Before calling the APIs, you can validate a regex pattern using the is_valid_regex function, introduced in version 4.0. The interface of the is_valid_regex function is as follows:
interface is_valid_regex
module procedure :: is_valid_regex_pattern
end interface
pure elemental function is_valid_regex_pattern (pattern) result(res)
implicit none
character(*), intent(in) :: pattern
logical :: res

UTF-8 strings can be matched using regular expression patterns just like ASCII strings.
The following example demonstrates matching Chinese characters.
In this example, the length variable stores the byte length; in this case there are
10 3-byte characters, so the length is 30.
block
character(:), allocatable :: pattern, str, res
integer :: length
pattern = "夢.{1,7}胡蝶"
str = "昔者莊周夢爲胡蝶 栩栩然胡蝶也"
print *, pattern .in. str ! T
call regex(pattern, str, res, length)
print *, res ! 夢爲胡蝶 栩栩然胡蝶
print *, length ! 30 (is 3-byte * 10 characters)
end block

Version 3.2 introduces a command line tool called forgex-cli, which uses the Forgex engine for debugging, testing, and benchmarking regex matches. It performs matching with commands such as the one shown below, and outputs the results directly to standard output.
% forgex-cli find match lazy-dfa '([a-z]*g+)n?' .match. 'assign'
pattern: ([a-z]*g+)n?
text: 'assign'
parse time: 64.0μs
extract literal time: 6.9μs
runs engine: T
compile nfa time: 47.8μs
dfa initialize time: 5.4μs
search time: 704.9μs
matching result: T
memory (estimated): 10324
========== Thompson NFA ===========
state 1: (?, 5)
state 2: <Accepted>
state 3: (n, 2)(?, 2)
state 4: (g, 7)
state 5: (["a"-"f"], 6)(g, 6)(["h"-"m"], 6)(n, 6)(["o"-"z"], 6)(?, 4)
state 6: (?, 5)
state 7: (?, 8)
state 8: (g, 9)(?, 3)
state 9: (?, 8)
=============== DFA ===============
1 : ["a"-"f"]=>2
2 : ["o"-"z"]=>2 ["h"-"m"]=>2 g=>3
3A: n=>4
4A:
state 1 = ( 1 4 5 )
state 2 = ( 4 5 6 )
state 3A = ( 2 3 4 5 6 7 8 )
state 4A = ( 2 4 5 6 )
===================================
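To make the DFA table in the output above easier to read, here is a small sketch that transcribes its transitions by hand and replays the input 'assign' through them. It is an illustration of table-driven DFA matching only, not Forgex's internal data structures; the names next_state and run_dfa are hypothetical. The table is partial, presumably because the lazy DFA only builds the transitions it visits for this input, so the sketch is not a full matcher for the pattern.

```fortran
program dfa_trace_demo
   implicit none

   ! Replays the run shown above: 1 -> 2 -> 2 -> 2 -> 2 -> 3A -> 4A.
   print *, run_dfa('assign')   ! T

contains

   ! Transitions copied from the DFA section of the output above.
   ! A result of 0 means "no transition recorded in the printed table".
   pure function next_state(state, c) result(next)
      integer, intent(in) :: state
      character(1), intent(in) :: c
      integer :: next

      next = 0
      select case (state)
      case (1)
         if (c >= 'a' .and. c <= 'f') next = 2
      case (2)
         if (c >= 'h' .and. c <= 'm') next = 2
         if (c >= 'o' .and. c <= 'z') next = 2
         if (c == 'g') next = 3
      case (3)
         if (c == 'n') next = 4
      end select
   end function next_state

   ! Consumes the text one byte at a time; states 3 and 4 are the
   ! accepting states (marked 3A and 4A in the output).
   pure function run_dfa(text) result(accepted)
      character(*), intent(in) :: text
      logical :: accepted
      integer :: state, i

      state = 1
      do i = 1, len(text)
         state = next_state(state, text(i:i))
         if (state == 0) exit
      end do
      accepted = (state == 3 .or. state == 4)
   end function run_dfa

end program dfa_trace_demo
```

Each input byte costs one table lookup, which is the property that makes the DFA approach attractive for runtime performance.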
Starting with version 3.5, the command line tools are provided in a separate repository (ShinobuAmasaki/forgex-cli).
- A program built by gfortran on Windows and macOS may crash if an allocatable character is used in an OpenMP parallel block.
- If you use the command line tool with PowerShell on Windows, use UTF-8 as your system locale to properly input and output Unicode characters.
- As a result of internal API changes related to the addition of the is_valid_regex function, the .in. and .match. operators now return False for invalid pattern input (in versions prior to 3.5 they would terminate processing by executing an error stop statement).
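Because the operators return False both when nothing matches and when the pattern is invalid, the two cases cannot be told apart from the result alone. One possible approach, sketched below, is to check the pattern with is_valid_regex before matching. The pattern is assumed to be invalid because its parenthesis is never closed; the program name is arbitrary.

```fortran
program validate_demo
   use :: forgex
   implicit none
   character(:), allocatable :: pattern, text

   pattern = 'foo(bar|baz'   ! assumed invalid: unbalanced parenthesis
   text = 'foobarbaz'

   if (is_valid_regex(pattern)) then
      ! The pattern is well formed, so a False here means "no match".
      print *, pattern .in. text
      print *, regex_f(pattern, text)
   else
      print *, 'invalid pattern: ', pattern
   end if
end program validate_demo
```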
The following features are planned to be implemented in the future:
- Character class subtraction: [a-z--b-d]
- Add Unicode character class escape sequence: \p{...}
- Add hexadecimal escape sequence of Unicode: \x..
- Deal with invalid byte strings in UTF-8
- Recovery from invalid patterns
- Optimize by literal searching method
- Add a CLI tool for debugging and benchmarking => ShinobuAmasaki/forgex-cli
- Give all operators the pure elemental attribute
- Publish the documentation
- Support UTF-8 basic feature
- Construct DFA on-the-fly
- Support CMake building
- Add Time measurement tools (basic) => ShinobuAmasaki/forgex-cli
- Parallelize on matching
All code contained herein shall be written with a three-space indentation.
For the algorithms of the power set construction method and syntax analysis, I referred to Russ Cox's article and Yoshiyuki Kondo's book.
For the algorithm of extracting literals, I referred to the book by Navarro and Raffinot (2002).
The implementation of the priority queue was based on the code written by ue1221.
The idea of applying the .in. operator to strings was inspired by kazulagi's work.
The command-line interface design of forgex-cli was inspired in part by the regex-cli package of the Rust language.
The MacPorts package of forgex is maintained by @barracuda156.
- Russ Cox "Regular Expression Matching Can Be Simple And Fast", 2007
- 近藤嘉雪 (Yoshiyuki Kondo), "定本 Cプログラマのためのアルゴリズムとデータ構造", 1998, SB Creative.
- ue1221/fortran-utilities
- Haruka Tomobe (kazulagi), https://github.com/kazulagi, his article in Japanese
- rust-lang/regex/regex-cli
- Gonzalo Navarro and Mathieu Raffinot, "Flexible Pattern Matching in Strings -- Practical On-Line Search Algorithms for Texts and Biological Sequences", 2002, Cambridge University Press
Forgex is freely available under the MIT license. See LICENSE.