diff --git a/docs/schema.md b/docs/schema.md
index df5ae19a..cd455f6f 100644
--- a/docs/schema.md
+++ b/docs/schema.md
@@ -1,94 +1,87 @@
 # Schema specification

-A schema file specifies the delimiters and variables patterns (regular
-expressions) necessary for `log-surgeon` to parse log events. `log-surgeon` uses
-the delimiters to find tokens (as in, strings of non-delimiters) in
-the input, and categorizes any token that matches a variable pattern as a
-variable. Any tokens that are not categorized as variables are treated as static
-text. In essence, this allows the user to parse variables out of their
-unstructured log events.
-
-`log-surgeon` also assigns types to each variable based on the variable
-pattern's name in the schema file.
-
- Internally, `log-surgeon`'s lexer also treats a string of
-delimiters as a token, just not one that matches a variable pattern.
+A schema file defines the **delimiters** and **variable patterns** (regular expressions) that
+`log-surgeon` uses to parse log events. Delimiters conceptually divide the input into *tokens*,
+where each token is either a variable (matched by a pattern) or **static text**. Variable tokens may
+include delimiters and are treated as a single token. Static-text always begins and ends with a
+delimiter. This structure enables `log-surgeon` to extract variables from otherwise unstructured log
+events.

 ## Schema syntax

-A schema file essentially contains a list of *rules*, each of which has a *name*
-and a *pattern* (regular expression).
+A schema file consists of a list of *rules*, each defined by a *name* and *pattern* (regular
+expression). These rules dictate how `log-surgeon` identifies and categorizes parts of a log event.

 There are three types of rules in a schema file:

-* [Variables](#variables)
-* [Delimiters](#delimiters)
-* [Timestamps](#timestamps)
+* [Variables](#variables): Defines patterns for capturing specific pieces of the log.
+* [Delimiters](#delimiters): Specifies the characters that separate tokens in the log.
+* [Timestamps](#timestamps): Identifies the boundary between log events. Timestamps are also treated
+  as variables.
+
+For documentation, the schema allows for user comments by ignoring any text preceded by `//`.

 ### Variables

 **Syntax:**

-```
+```txt
 <variable-name>:<variable-pattern>
 ```
-* `variable-name` may contain any alphanumeric characters, but may not be
-  the reserved names `delimiters` or `timestamp`.
+
+* `variable-name` may contain any alphanumeric characters, but may not be the reserved names
+  `delimiters` or `timestamp`.
 * `variable-pattern` is a regular expression using the supported
-  [syntax](#regular-expression-syntax), but it **cannot** contain characters
-  defined as [delimiters](#delimiters).
+  [syntax](#regular-expression-syntax).

 Note that:
+
 * A schema file may contain zero or more variable rules.
-* Repeating the same variable name in another rule will `OR` the regular
-  expressions (preform an alternation).
-* If a token matches multiple patterns from multiple rules, the token will be
-  assigned the name of each rule, in the same order that they appear in the
-  schema file.
+* Repeating the same variable name in another rule will `OR` the regular expressions (perform an
+  alternation).
+* If a token matches multiple patterns from multiple rules, the token will be assigned the name of
+  each rule, in the same order that they appear in the schema file.

 ### Delimiters

 **Syntax:**

-```
+```txt
 delimiters:<characters>
 ```
-* `delimiters` is a reserved name for this rule
-* `characters` is a set of characters that should be treated as delimiters
+
+* `delimiters` is a reserved name for this rule.
+* `characters` is a set of characters that should be treated as delimiters. These characters define
+  the boundaries between tokens in the log.

 Note that:
-* A schema file must contain a single `delimiters` rule. If multiple
-  `delimiters` rules are specified, only the last one will be used.
+
+* A schema file must contain at least one `delimiters` rule. If multiple `delimiters` rules are
+  specified, only the last one will be used.

 ### Timestamps

 **Syntax:**

-```
+```txt
 timestamp:<timestamp-pattern>
 ```
-* `timestamp` is a reserved name for this rule
+
+* `timestamp` is a reserved name for this rule.
 * `timestamp-pattern` is a regular expression using the supported
-  [syntax](#regular-expression-syntax)
+  [syntax](#regular-expression-syntax).

 Note that:
-* Unlike [variable](#variables) patterns, timestamp patterns can contain
-  delimiters.
-* The parser uses a timestamp to denote the start of a new log event if:
-  * ... it appears as the first token in the input, or
-  * ... it appears after a newline character.
-* Until a timestamp is found, the parser will use a newline character to denote
-  the start of a new log event.
-* The timestamp pattern is not used to match text inside a log event; since the
-  pattern can contain delimiters, no token can match it.

-### Comments
-
-**Syntax:** Comments are any text preceded by `//`.
+* The parser uses a timestamp to denote the start of a new log event if:
+  * It appears as the first token in the input, or
+  * It appears after a newline character.
+* Until a timestamp is found, the parser will use a newline character to denote the start of a new
+  log event.

 ## Example schema file

-```
+```txt
 // Delimiters
 delimiters: \t\r\n:,!;%

@@ -101,35 +94,50 @@ float:\-{0,1}[0-9]+\.[0-9]+
 // Custom variables
 hex:[a-fA-F]+
 hasNumber:.*\d.*
-equals:.*=.*[a-zA-Z0-9].*
+equalsCapture:.*=(?<equals>.*[a-zA-Z0-9].*)
 ```
-* `delimiters: \t\r\n:,!;%` indicates that ` `, `\t`, `\r`, `\n`, `:`, `,`,
-  `!`, `;`, `%`, and `'` are delimiters. In a log file, consecutive delimiters,
-  e.g., N consecutive spaces, will be tokenized as static text.
+
+* `delimiters: \t\r\n:,!;%` indicates that ` `, `\t`, `\r`, `\n`, `:`, `,`, `!`, `;`, and `%` are
+  delimiters.
 * `timestamp` matches two different patterns:
-  * 2023-04-19 12:32:08.064
-  * [20230419-12:32:08]
-* `int`, `float`, `hex`, `hasNumber`, and `equals` all match different user
-  defined variables.
+  * `2023-04-19 12:32:08.064`
+  * `[20230419-12:32:08]`
+* `int`, `float`, `hex`, `hasNumber`, and `equalsCapture` all match different user defined
+  variables.
+* `equalsCapture` also contains the named capture group `equals`. This allows the user to extract
+  the substring following the equals sign.

 ## Regular Expression Syntax

-The following regular expression rules are supported by the schema. When
-building a regular expression, the rules are applied as they appear in this
-list, from top to bottom.
-```
-REGEX RULE   DEFINITION
-ab           Match 'a' followed by 'b'
-a|b          Match a OR b
-[a-z]        Match any character in the brackets (e.g., any lowercase letter)
-             - special characters must be escaped, even in brackets (e.g., [\.\(\\])
-[^a-zA-Z]    Match any character NOT in the brackets (e.g., non-alphabet character)
-a*           Match 'a' 0 or more times
-a+           Match 'a' 1 or more times
-a{N}         Match 'a' exactly N times
-a{N,M}       Match 'a' between N and M times
-(abc)        Subexpression (concatenates abc)
-\d           Match any digit 0-9
-\s           Match any whitespace character (' ', '\r', '\t', '\v', or '\f')
-.            Match any *non-delimiter* character
+The following regular expression rules are supported by the schema. When building a regular
+expression, the rules are applied as they appear in this list, from top to bottom.
+
+```txt
+REGEX RULE        EXAMPLE        DEFINITION
+Concatenation     ab             Match two expressions in sequence (e.g., 'a'
+                                 followed by 'b').
+Alternation       a|b            Match one of two expressions (e.g., 'a' or 'b').
+Range             [a-z]          Match any character within a specified range
+                                 (e.g., any lowercase letter).
+Negated Range     [^a-zA-Z]      Match any character not within the specified
+                                 range (e.g., any non-alphabet character).
+Kleene Star       a*             Match an expression zero or more times.
+Kleene Plus       a+             Match an expression one or more times.
+Repetition        a{N}           Match an expression exactly N times.
+Repetition Range  a{N,M}         Match an expression between N and M times.
+Digit             \d             Match any digit (i.e., 0-9).
+Whitespace        \s             Match any whitespace character (i.e., ' ', \r,
+                                 \t, \v, or \f).
+Wildcard          .              Match any non-delimiter character.
+Subexpression     (ab)           Match the expression in parentheses (e.g., ab).
+Named Capture     (?<var>[01]+)  Match an expression and assign it a name (e.g.,
+                                 capture binary as "var").
+
+* Special characters include: ( ) * + - . [ \ ] ^ { | } < > ?
+  - Escape these with '\' when used literally (e.g., \., \(, \\).
+  - Special characters must be escaped even in ranges.
+
+* For each regex rule, the expression(s) it contains can be formed by applying
+  any sequence of valid regex rules, including the rule itself, thus allowing
+  for recursive composition.
 ```
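
The rule syntax described in this patch (`name:pattern` lines, `//` comments, the reserved names `delimiters` and `timestamp`, repeated variable names being `OR`ed, and only the last `delimiters` rule being used) can be illustrated with a short Python sketch. This is not `log-surgeon`'s parser. The names `parse_schema`, `decode_delimiters`, and `alternation` are hypothetical, comments are assumed to occupy whole lines, the escape handling for the `delimiters` rule is an assumption, and the second `hex` rule in the sample schema is invented to show alternation.

```python
def decode_delimiters(chars):
    """Turn the text after `delimiters:` into a set of characters."""
    # Assumption: \t, \r, and \n are escape sequences; any other
    # backslash-escaped character stands for itself.
    escapes = {"t": "\t", "r": "\r", "n": "\n"}
    result = set()
    i = 0
    while i < len(chars):
        if chars[i] == "\\" and i + 1 < len(chars):
            result.add(escapes.get(chars[i + 1], chars[i + 1]))
            i += 2
        else:
            result.add(chars[i])
            i += 1
    return result


def alternation(patterns):
    """Combine the patterns of a repeated rule name with `|`."""
    return "|".join(f"({p})" for p in patterns)


def parse_schema(text):
    """Split schema text into delimiter, timestamp, and variable rules."""
    delimiters = None   # only the last `delimiters` rule is kept
    timestamps = []     # every `timestamp` pattern, in file order
    variables = {}      # variable name -> patterns, in file order
    for line in text.splitlines():
        # Assumption: a comment fills the whole line.
        if not line.strip() or line.lstrip().startswith("//"):
            continue
        # Split on the first ':' only; later colons belong to the pattern.
        name, colon, pattern = line.partition(":")
        if not colon:
            raise ValueError(f"not a rule: {line!r}")
        if name == "delimiters":
            delimiters = decode_delimiters(pattern)
        elif name == "timestamp":
            timestamps.append(pattern)
        elif name.isalnum():
            variables.setdefault(name, []).append(pattern)
        else:
            raise ValueError(f"invalid rule name: {name!r}")
    if delimiters is None:
        raise ValueError("a schema needs at least one `delimiters` rule")
    return {
        "delimiters": delimiters,
        "timestamps": timestamps,
        "variables": variables,
    }


EXAMPLE_SCHEMA = r"""// Delimiters
delimiters: \t\r\n:,!;%

// Custom variables
hex:[a-fA-F]+
hasNumber:.*\d.*
// Hypothetical second rule for `hex`, to show alternation
hex:0x[0-9a-fA-F]+
"""

schema = parse_schema(EXAMPLE_SCHEMA)
print(alternation(schema["variables"]["hex"]))
```

Splitting on the first `:` only matters because the delimiter set in the example schema itself contains `:`.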