docs, lregex: treatment of newlines #3110

Open
@hirooih

While working on PR #3109, I found that the description of the treatment of newlines might be wrong.
But I might be wrong. Let me know what I am missing.

From Regular expression (regex) engine:

A more subtle issue is this text from the Regular Expressions chapter: “the use of literal <newline>s or any escape sequence equivalent produces undefined results”. What that means is using a regex pattern with [^\n]+ is invalid, and indeed in glibc produces very odd results.

The description of the specification including before and after the quoted sentence is as follows.

In the functions processing regular expressions described in System Interfaces volume of POSIX.1-2017, the <newline> is regarded as an ordinary character and both a <period> and a non-matching list can match one. The Shell and Utilities volume of POSIX.1-2017 specifies within the individual descriptions of those standard utilities employing regular expressions whether they permit matching of <newline> characters; if not stated otherwise, the use of literal <newline> characters or any escape sequence equivalent in either patterns or matched text produces undefined results.
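
To make the first sentence of that quotation concrete, here is a minimal sketch using the POSIX `regcomp()`/`regexec()` interface directly. The helper name `newline_demo_matches` is hypothetical and is not part of ctags. It only shows what the System Interfaces volume describes: without `REG_NEWLINE`, both `.` and a non-matching list such as `[^x]` can match a newline, and with `REG_NEWLINE` they cannot.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not part of ctags): returns 1 if `pattern` matches
 * somewhere in `text`, 0 if it does not, and -1 if the pattern does not
 * compile. `cflags` is passed to regcomp() unchanged except for REG_NOSUB. */
int newline_demo_matches(const char *pattern, int cflags, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);

    /* REG_NOSUB: only a yes/no answer is needed, no match offsets. */
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Under these assumptions, `newline_demo_matches("a[^x]b", REG_EXTENDED, "a\nb")` returns 1, while the same call with `REG_EXTENDED | REG_NEWLINE` returns 0. Note that `"a\nb"` here is a C string containing a real newline character in the matched text, not a regex escape.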

It does not say "What that means is using a regex pattern with [^\n]+ is invalid". I can find a description of special treatment of <newline> in the spec.
Does this describe an issue specific to the glibc implementation?

And the next sentence follows:

Those utilities (like grep) that do not allow <newline> characters to match are responsible for eliminating any <newline> from strings before matching against the RE.

In the Universal Ctags case, this is similar to --regex-<LANG>, which processes input line by line. --regex-<LANG> does
not have to care about the setting of REG_NEWLINE, if I understand correctly. <newline> should be eliminated.
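
As a rough sketch of the grep-like approach (not the actual ctags implementation), the hypothetical helper `count_matching_lines` below splits a buffer into lines and hands each line to `regexec()` without its trailing newline. Because the matched text never contains a newline, the result does not depend on whether `REG_NEWLINE` is set.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not ctags code): counts the lines of `text` that
 * match `pattern`. Each line is matched with its newline removed.
 * Returns -1 if the pattern does not compile or memory runs out. */
int count_matching_lines(const char *pattern, const char *text)
{
    regex_t re;
    char *copy, *line, *next;
    int count = 0;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    /* Work on a private copy so newlines can be overwritten with NULs. */
    copy = malloc(strlen(text) + 1);
    if (copy == NULL) {
        regfree(&re);
        return -1;
    }
    strcpy(copy, text);

    for (line = copy; line != NULL && *line != '\0'; line = next) {
        char *nl = strchr(line, '\n');
        if (nl != NULL) {
            *nl = '\0';      /* eliminate the newline before matching */
            next = nl + 1;
        } else {
            next = NULL;     /* last line had no trailing newline */
        }
        if (regexec(&re, line, 0, NULL, 0) == 0)
            count++;
    }

    free(copy);
    regfree(&re);
    return count;
}
```

For example, `count_matching_lines("[^x]+y", "a\ny\n")` returns 0: the non-matching list never gets a chance to span the line break, because the lines `a` and `y` are matched separately.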

Never use \n in patterns for --regex-<LANG>,

This is OK. But I don't understand the following sentence:

and never use them in non-matching bracket expressions for --mline-regex-<LANG> patterns.

First, I don't understand what "non-matching bracket expressions" means. Of course brackets ([ and ]) should be paired, but I guess the sentence above means something different.

I think it is more portable to use ^ or $ than using \n because there are variations of line-break characters.

For the experimental --_mtable-regex-<LANG> you can safely use \n because that regex is not compiled with REG_NEWLINE.

We can also say we have to use \n because that regex is not compiled with REG_NEWLINE.
If I understand correctly, it is better to set REG_NEWLINE for --_mtable-regex-<LANG>, too.
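
To illustrate the trade-off, here is a small sketch against plain POSIX `regcomp()`. The helper `first_match_offset` is hypothetical, and the sketch says nothing about how ctags itself translates `\n` in an option value. It shows that `^` and `$` match at line boundaries inside a multi-line buffer only when `REG_NEWLINE` is set; without it, the line break has to be matched explicitly.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not ctags code): returns the byte offset of the
 * first match of `pattern` in `text`, -1 if there is no match, and -2 if
 * the pattern does not compile. */
long first_match_offset(const char *pattern, int cflags, const char *text)
{
    regex_t re;
    regmatch_t m[1];
    long offset = -1;

    if (regcomp(&re, pattern, cflags) != 0)
        return -2;

    /* m[0] receives the span of the whole match. */
    if (regexec(&re, text, 1, m, 0) == 0)
        offset = (long)m[0].rm_so;

    regfree(&re);
    return offset;
}
```

With the buffer `"foo\nbar\nbaz"`, the pattern `"^bar$"` gives -1 under `REG_EXTENDED` alone and 4 under `REG_EXTENDED | REG_NEWLINE`. The pattern `"\nbar\n"`, where the C compiler turns `\n` into a literal newline character, gives 3 under `REG_EXTENDED` alone.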
