Commit 03be4d2

pcre2test: add support for \N{U+hh...} escapes in subject (#528)
When providing escaped values in the subject, the syntax can be ambiguous, so add support for a new escape that always refers to a Unicode character and that the library already supports in UTF mode. While at it, refactor the code to support octal escapes and fix bugs with overlong numbers, as well as to simplify the logic that decides whether an escape is encoded as a code unit or as a Unicode character, which could require multiple code units.
1 parent b72cc97 commit 03be4d2

File tree

14 files changed: +233 -89 lines changed

doc/pcre2test.1

Lines changed: 41 additions & 35 deletions
@@ -76,8 +76,8 @@ possible to include binary zeros.
 .sp
 When testing the 16-bit or 32-bit libraries, there is a need to be able to
 generate character code points greater than 255 in the strings that are passed
-to the library. For subject lines, backslash escapes can be used. In addition,
-when the \fButf\fP modifier (see
+to the library. For subject lines and some patterns, backslash escapes can be
+used. In addition, when the \fButf\fP modifier (see
 .\" HTML <a href="#optionmodifiers">
 .\" </a>
 "Setting compilation options"
@@ -97,9 +97,8 @@ UTF-8 (in its original definition) is not capable of encoding values greater
 than 0x7fffffff, but such values can be handled by the 32-bit library. When
 testing this library in non-UTF mode with \fButf8_input\fP set, if any
 character is preceded by the byte 0xff (which is an invalid byte in UTF-8)
-0x80000000 is added to the character's value. This is the only way of passing
-such code points in a pattern string. For subject strings, using an escape
-sequence is preferable.
+0x80000000 is added to the character's value. For subject strings, using an
+escape sequence is preferable.
 .
 .
 .SH "COMMAND LINE OPTIONS"
@@ -493,36 +492,43 @@ space is removed, and the line is scanned for backslash escapes, unless the
 \fBsubject_literal\fP modifier was set for the pattern. The following provide a
 means of encoding non-printing characters in a visible way:
 .sp
-  \ea         alarm (BEL, \ex07)
-  \eb         backspace (\ex08)
-  \ee         escape (\ex27)
-  \ef         form feed (\ex0c)
-  \en         newline (\ex0a)
-  \er         carriage return (\ex0d)
-  \et         tab (\ex09)
-  \ev         vertical tab (\ex0b)
-  \ennn       octal character (up to 3 octal digits); always
-                a byte unless > 255 in UTF-8 or 16-bit or 32-bit mode
-  \eo{dd...}  octal character (any number of octal digits}
-  \exhh       hexadecimal byte (up to 2 hex digits)
-  \ex{hh...}  hexadecimal character (any number of hex digits)
-.sp
-The use of \ex{hh...} is not dependent on the use of the \fButf\fP modifier on
-the pattern. It is recognized always. There may be any number of hexadecimal
-digits inside the braces; invalid values provoke error messages.
-.P
-Note that \exhh specifies one byte rather than one character in UTF-8 mode;
-this makes it possible to construct invalid UTF-8 sequences for testing
-purposes. On the other hand, \ex{hh} is interpreted as a UTF-8 character in
-UTF-8 mode, generating more than one byte if the value is greater than 127.
-When testing the 8-bit library not in UTF-8 mode, \ex{hh} generates one byte
-for values that could fit on it, and causes an error for greater values.
-.P
-In UTF-16 mode, all 4-digit \ex{hhhh} values are accepted. This makes it
-possible to construct invalid UTF-16 sequences for testing purposes.
-.P
-In UTF-32 mode, all 4- to 8-digit \ex{...} values are accepted. This makes it
-possible to construct invalid UTF-32 sequences for testing purposes.
+  \ea           alarm (BEL, \ex07)
+  \eb           backspace (\ex08)
+  \ee           escape (\ex1b)
+  \ef           form feed (\ex0c)
+  \en           newline (\ex0a)
+  \eN{U+hh...}  Unicode character (any number of hex digits)
+  \er           carriage return (\ex0d)
+  \et           tab (\ex09)
+  \ev           vertical tab (\ex0b)
+  \eddd         octal number (up to 3 octal digits); represents a single
+                  code point unless larger than 255 with the 8-bit library
+  \eo{dd...}    octal number (any number of octal digits) representing a
+                  character in UTF mode or a code point
+  \exhh         hexadecimal byte (up to 2 hex digits)
+  \ex{hh...}    hexadecimal number (up to 8 hex digits) representing a
+                  character in UTF mode or a code point
+.sp
+Invoking \eN{U+hh...} or \ex{hh...} doesn't require the use of the \fButf\fP
+modifier on the pattern. It is always recognized. There may be any number of
+hexadecimal digits inside the braces; invalid values provoke error messages.
+.P
+Note that even in UTF-8 mode, \exhh (and depending on how large, \eddd)
+describe one byte rather than one character; this makes it possible to
+construct invalid UTF-8 sequences for testing purposes. On the other hand,
+\ex{hh...} is interpreted as a UTF-8 character in UTF-8 mode, only generating
+more than one byte if the value is greater than 127. To avoid the ambiguity
+it is preferred to use \eN{U+hh...} when describing characters. When testing
+the 8-bit library not in UTF-8 mode, \ex{hh} generates one byte for values
+that could fit on it, and causes an error for greater values.
+.P
+When testing the 16-bit library, not in UTF-16 mode, all 4-digit \ex{hhhh}
+values are accepted. This makes it possible to construct invalid UTF-16
+sequences for testing purposes.
+.P
+When testing the 32-bit library, not in UTF-32 mode, all 4 to 8-digit \ex{...}
+values are accepted. This makes it possible to construct invalid UTF-32
+sequences for testing purposes.
 .P
 There is a special backslash sequence that specifies replication of one or more
 characters:
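
As a rough illustration of the rules documented in this hunk, the following C sketch shows how a parsed escape value might be turned into bytes for the 8-bit library. The names `sketch_encode_8bit` and `sketch_ord2utf8` are hypothetical and are not pcre2test functions; the enum copies the one this commit adds to src/pcre2test.c, and the branch condition copies the one in that file's 8-bit path. The UTF-8 encoder here is written from scratch for the example (original definition, up to 6 bytes) and is only assumed to behave like pcre2test's own `ord2utf8`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same three states as the enum added to src/pcre2test.c in this commit. */
enum force_encoding { FORCE_NONE, FORCE_RAW, FORCE_UTF };

/* Hypothetical encoder: writes value as UTF-8 (original definition, up to
   6 bytes) and returns the number of bytes written. */
size_t sketch_ord2utf8(uint32_t value, uint8_t *out)
{
  static const uint32_t limits[] = { 0x7f, 0x7ff, 0xffff, 0x1fffff, 0x3ffffff };
  static const uint8_t lead[] = { 0x00, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc };
  size_t extra = 0;   /* number of continuation bytes */
  size_t k;

  assert(out != NULL && value <= 0x7fffffffu);
  while (extra < 5 && value > limits[extra]) extra++;
  for (k = extra; k > 0; k--)
    {
    out[k] = (uint8_t)(0x80 | (value & 0x3f));
    value >>= 6;
    }
  out[0] = (uint8_t)(lead[extra] | value);
  return extra + 1;
}

/* Hypothetical helper: returns the number of bytes written to out, or 0 if
   the value cannot be represented in the requested form. */
size_t sketch_encode_8bit(uint32_t value, int utf,
  enum force_encoding encoding, uint8_t *out)
{
  if (encoding == FORCE_RAW || !(utf || encoding == FORCE_UTF))
    {
    if (value > 0xffu) return 0;          /* does not fit in one byte */
    out[0] = (uint8_t)value;
    return 1;
    }
  if (value > 0x7fffffffu) return 0;      /* not encodable as UTF-8 */
  return sketch_ord2utf8(value, out);
}
```

Under these assumptions, a value of 0xff with `FORCE_RAW` (as for \exff in UTF-8 mode) yields the single byte 0xff, while the same value with `FORCE_UTF` (as for \eN{U+ff}) yields the two bytes 0xc3 0xbf even when `utf` is 0.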

perltest.sh

Lines changed: 5 additions & 1 deletion
@@ -32,8 +32,9 @@
 # Handle the shell script arguments.
 
 perl=perl
-perlarg=''
+perlarg=""
 prefix=''
+spc=""
 
 if [ $# -gt 0 -a "$1" = "-perl" ] ; then
   if [ $# -lt 2 ] ; then
@@ -47,11 +48,14 @@ fi
 
 if [ $# -gt 0 -a "$1" = "-w" ] ; then
   perlarg="-w"
+  spc=" "
   shift
 fi
 
 if [ $# -gt 0 -a "$1" = "-utf8" ] ; then
   prefix="use utf8; require Encode;"
+  perlarg="$perlarg$spc-CSD"
+
   shift
 fi
 

src/pcre2_compile.c

Lines changed: 2 additions & 4 deletions
@@ -1523,17 +1523,15 @@ else if ((i = escapes[c - ESCAPES_FIRST]) != 0)
 
       if (ptrend - p > 1 && *p == CHAR_U && p[1] == CHAR_PLUS)
         {
-#ifdef EBCDIC
-        *errorcodeptr = ERR93;
-#else
+#ifndef EBCDIC
         if (utf)
           {
           ptr = p + 2;
           escape = 0;   /* Not a fancy escape after all */
           goto COME_FROM_NU;
           }
-        else *errorcodeptr = ERR93;
 #endif
+        *errorcodeptr = ERR93;
         }
 
       /* Give an error in contexts where quantifiers are not allowed

src/pcre2test.c

Lines changed: 65 additions & 34 deletions
@@ -963,6 +963,13 @@ static coptstruct coptlist[] = {
 #undef SUPPORT_32
 #undef SUPPORT_EBCDIC
 
+/* Types for the parser, to be used in process_data() */
+
+enum force_encoding {
+  FORCE_NONE,   /* No preference, follow utf modifier */
+  FORCE_RAW,    /* Encode as a code point or error if too wide */
+  FORCE_UTF     /* Encode as a character or error if too wide */
+};
 
 /* ----------------------- Static variables ------------------------ */
 
@@ -7134,8 +7141,9 @@ in 16- and 32-bit modes, it can be forced to UTF-8 by the utf8_input modifier.
 
 while ((c = *p++) != 0)
   {
-  int32_t i = 0;
+  int i = 0;
   size_t replen;
+  enum force_encoding encoding = FORCE_NONE;
 
   /* ] may mark the end of a replicated sequence */
 
@@ -7157,6 +7165,7 @@ while ((c = *p++) != 0)
         fprintf(outfile, "** Repeat count too large\n");
         return PR_OK;
         }
+      i = (int)li;
 
       p = (uint8_t *)endptr;
       if (*p++ != '}')
@@ -7165,7 +7174,6 @@ while ((c = *p++) != 0)
         return PR_OK;
         }
 
-      i = (int32_t)li;
       if (i-- <= 0)
         {
         fprintf(outfile, "** Zero or negative repeat not allowed\n");
72437251
case '0': case '1': case '2': case '3':
72447252
case '4': case '5': case '6': case '7':
72457253
c -= '0';
7246-
while (i++ < 2 && isdigit(*p) && *p != '8' && *p != '9')
7254+
while (i++ < 2 && isdigit(*p) && *p < '8')
72477255
c = c * 8 + (*p++ - '0');
7256+
7257+
encoding = (utf && c > 255)? FORCE_UTF : FORCE_RAW;
72487258
break;
72497259

72507260
case 'o':
72517261
if (*p == '{')
72527262
{
72537263
uint8_t *pt = p;
72547264
c = 0;
7255-
for (pt++; isdigit(*pt) && *pt != '8' && *pt != '9'; pt++)
7265+
for (pt++; isdigit(*pt) && *pt < '8'; ++i, pt++)
72567266
{
7257-
if (++i == 12)
7258-
fprintf(outfile, "** Too many octal digits in \\o{...} item; "
7259-
"using only the first twelve.\n");
7267+
if (c >= 0x20000000l)
7268+
{
7269+
fprintf(outfile, "** \\o{ escape too large\n");
7270+
return PR_OK;
7271+
}
72607272
else c = c * 8 + (*pt - '0');
72617273
}
7262-
if (*pt == '}') p = pt + 1;
7263-
else fprintf(outfile, "** Missing } after \\o{ (assumed)\n");
7274+
if (i == 0 || *pt != '}')
7275+
{
7276+
fprintf(outfile, "** Malformed \\o{ escape\n");
7277+
return PR_OK;
7278+
}
7279+
else p = pt + 1;
72647280
}
72657281
break;
72667282
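
The overflow guard in the \o{...} branch can be read in isolation: before each multiplication by 8, the accumulated value must be below 0x20000000, otherwise the result would no longer fit in 32 bits. The C sketch below restates that idea as a standalone function. `sketch_parse_octal_braced` and its return codes are hypothetical and only approximate what the hunk does; in particular, it reports errors through return values instead of printing to `outfile`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: text points just past the '{' of an \o{...} escape.
   Returns 0 and stores the value and the position after '}' on success,
   -1 if the escape is malformed (no digits or no closing brace),
   -2 if the value would not fit in 32 bits. */
int sketch_parse_octal_braced(const char *text, uint32_t *value,
  const char **end)
{
  uint32_t c = 0;
  int digits = 0;

  assert(text != NULL && value != NULL && end != NULL);
  for (; *text >= '0' && *text <= '7'; text++, digits++)
    {
    /* c * 8 would overflow 32 bits once c reaches 2^29. */
    if (c >= 0x20000000u) return -2;
    c = c * 8 + (uint32_t)(*text - '0');
    }
  if (digits == 0 || *text != '}') return -1;
  *value = c;
  *end = text + 1;
  return 0;
}
```

With this sketch, the digit strings used in the new test inputs parse as expected: `177777` gives 0xffff, `11064` gives 0x1234, and `17777777777` gives 0x7fffffff, while `40000000000` is rejected as too large.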

@@ -7306,15 +7322,31 @@ while ((c = *p++) != 0)
         p++;
         }
 #if defined SUPPORT_PCRE2_8
-      if (utf && (test_mode == PCRE8_MODE))
-        {
-        *q8++ = c;
-        continue;
-        }
+      if (utf && (test_mode == PCRE8_MODE)) encoding = FORCE_RAW;
 #endif
       }
     break;
 
+    case 'N':
+    if (memcmp(p, "{U+", 3) == 0 && isxdigit(p[3]))
+      {
+      char *endptr;
+      unsigned long uli;
+
+      p += 3;
+      errno = 0;
+      uli = strtoul((const char *)p, &endptr, 16);
+      if (errno == 0 && *endptr == '}' && uli <= UINT32_MAX)
+        {
+        c = (uint32_t)uli;
+        p = (uint8_t *)endptr + 1;
+        encoding = FORCE_UTF;
+        break;
+        }
+      }
+    fprintf(outfile, "** Malformed \\N{U+ escape\n");
+    return PR_OK;
+
     case 0:     /* \ followed by EOF allows for an empty line */
     p--;
     continue;
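
To make the new `case 'N'` branch easier to follow, here is a minimal standalone C sketch of the same parsing steps: check for the `{U+` prefix followed by a hex digit, convert with `strtoul`, then require a closing brace and a value that fits in 32 bits. `sketch_parse_unicode_escape` is a hypothetical name, and the sketch uses `strncmp` rather than the `memcmp` of the original so that short NUL-terminated inputs are not read past their end. Note that `strtoul` with base 16 also accepts an optional `0x` prefix, so this sketch inherits that leniency.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: text points just past the "\N" of the escape.
   Returns 0 and stores the code point and the position after '}' on
   success, or -1 if the escape is malformed or the value is too large. */
int sketch_parse_unicode_escape(const char *text, uint32_t *value,
  const char **end)
{
  char *stop;
  unsigned long parsed;

  assert(text != NULL && value != NULL && end != NULL);
  if (strncmp(text, "{U+", 3) != 0 || !isxdigit((unsigned char)text[3]))
    return -1;

  errno = 0;
  parsed = strtoul(text + 3, &stop, 16);
  if (errno != 0 || *stop != '}' || parsed > UINT32_MAX) return -1;

  *value = (uint32_t)parsed;
  *end = stop + 1;
  return 0;
}
```

In pcre2test itself the successful path also sets `encoding = FORCE_UTF`, which is what makes \N{U+hh...} always denote a character rather than a code unit; the sketch leaves that decision to the caller.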
@@ -7340,24 +7372,13 @@ while ((c = *p++) != 0)
     }
 
   /* We now have a character value in c that may be greater than 255.
-  In 8-bit mode we convert to UTF-8 if we are in UTF mode. Values greater
-  than 127 in UTF mode must have come from \x{...} or octal constructs
-  because values from \x.. get this far only in non-UTF mode. */
+  Depending on how we got it, the encoding enum could be set to tell
+  us how to encode it, otherwise follow the utf modifier. */
 
 #ifdef SUPPORT_PCRE2_8
   if (test_mode == PCRE8_MODE)
     {
-    if (utf)
-      {
-      if (c > 0x7fffffff)
-        {
-        fprintf(outfile, "** Character \\x{%x} is greater than 0x7fffffff "
-          "and so cannot be converted to UTF-8\n", c);
-        return PR_OK;
-        }
-      q8 += ord2utf8(c, q8);
-      }
-    else
+    if (encoding == FORCE_RAW || !(utf || encoding == FORCE_UTF))
       {
       if (c > 0xffu)
         {
@@ -7368,27 +7389,37 @@ while ((c = *p++) != 0)
         }
       *q8++ = (uint8_t)c;
       }
+    else
+      {
+      if (c > 0x7fffffff)
+        {
+        fprintf(outfile, "** Character \\N{U+%x} is greater than 0x7fffffff "
+                         "and therefore cannot be encoded as UTF-8\n", c);
+        return PR_OK;
+        }
+      q8 += ord2utf8(c, q8);
+      }
     }
 #endif
 #ifdef SUPPORT_PCRE2_16
   if (test_mode == PCRE16_MODE)
     {
-    if (utf)
+    if (encoding == FORCE_UTF || utf)
       {
       if (c > 0x10ffffu)
         {
-        fprintf(outfile, "** Failed: character \\x{%x} is greater than "
-          "0x10ffff and so cannot be converted to UTF-16\n", c);
+        fprintf(outfile, "** Failed: character \\N{U+%x} is greater than "
+                         "0x10ffff and therefore cannot be encoded as "
+                         "UTF-16\n", c);
         return PR_OK;
         }
       else if (c >= 0x10000u)
         {
-        c-= 0x10000u;
+        c -= 0x10000u;
         *q16++ = 0xD800 | (c >> 10);
         *q16++ = 0xDC00 | (c & 0x3ff);
         }
-      else
-        *q16++ = c;
+      else *q16++ = c;
       }
     else
       {
testdata/testinput11

Lines changed: 18 additions & 0 deletions
@@ -356,9 +356,18 @@
 # We can use pcre2test's utf8_input modifier to create wide pattern characters,
 # even though this test is run when UTF is not supported.
 
+/a\x{ffff}b/utf8_input
+    a￿b
+    a\x{ffff}b
+    a\o{177777}b
+\= Expect no match
+    a\N{U+ffff}z
+
 /ab������z/utf8_input
     ab������z
     ab\x{7fffffff}z
+    ab\o{17777777777}z
+    ab\N{U+7fffffff}z
 
 /ab�������z/utf8_input
     ab�������z
@@ -367,6 +376,15 @@
 /ab�Az/utf8_input
     ab�Az
     ab\x{80000041}z
+\= Expect no match
+    abAz
+    aAz
+    ab\377Az
+    ab\xff\N{U+0041}z
+    ab\N{U+ff}\N{U+41}z
+
+/ab\x{80000041}z/
+    ab\x{80000041}z
 
 /(?i:A{1,}\6666666666)/
     A\x{1b6}6666666

testdata/testinput4

Lines changed: 3 additions & 0 deletions
@@ -2335,6 +2335,9 @@
 /[\N{U+1234}]/utf
     \x{1234}
 
+/(\x{1234}) \1/utf
+    \N{U+1234} \o{11064}
+
 # Test the full list of Unicode "Pattern White Space" characters that are to
 # be ignored by /x. The pattern lines below may show up oddly in text editors
 # or when listed to the screen. Note that characters such as U+2002, which are
