Commit b72cc97

Add support for Turkish I casefolding (#521)
New flag: PCRE2_EXTRA_TURKISH_CASING, with a matching pre-pattern flag (*TURKISH_CASING). Also added a pre-pattern flag (*CASELESS_RESTRICT) for the existing PCRE2_EXTRA_CASELESS_RESTRICT flag.
1 parent c9bf833 commit b72cc97
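
The case-equivalence rules that this commit documents for the four 'i' letters can be sketched as a small model. This is an illustration only: it does not call PCRE2, and the names `i_case_classes` and `caseless_equal` are hypothetical helpers invented for this note. The classes themselves are taken from the tables added to the pcre2unicode page in this commit.

```python
# Illustrative model only: this is not PCRE2 code and does not call PCRE2.
I_UPPER, I_LOWER = "I", "i"
DOTTED_CAPITAL_I = "\u0130"  # capital I with dot above
DOTLESS_SMALL_I = "\u0131"   # small dotless i


def i_case_classes(turkish_casing=False, caseless_restrict=False):
    """Return the case-equivalence classes for the four 'i' letters."""
    if turkish_casing and caseless_restrict:
        # The documentation states that the two options cannot be combined.
        raise ValueError("TURKISH_CASING cannot be combined with CASELESS_RESTRICT")
    if turkish_casing:
        # Turkish/Azeri rules: dotted letters pair up, dotless letters pair up.
        return [{I_LOWER, DOTTED_CAPITAL_I}, {DOTLESS_SMALL_I, I_UPPER}]
    # Default rules: only the two ASCII letters are equivalent to each other.
    return [{I_LOWER, I_UPPER}, {DOTTED_CAPITAL_I}, {DOTLESS_SMALL_I}]


def caseless_equal(a, b, **options):
    """True if a and b fall in the same class of this hypothetical model."""
    if a == b:
        return True
    return any(a in cls and b in cls for cls in i_case_classes(**options))
```

In this model, `caseless_equal("i", "I")` holds by default but not with `turkish_casing=True`, where 'i' instead pairs with U+0130 and 'I' with U+0131.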

38 files changed: +1641 -84 lines changed

doc/html/pcre2_set_compile_extra_options.html

Lines changed: 2 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -43,8 +43,10 @@ <h1>pcre2_set_compile_extra_options man page</h1>
4343
PCRE2_EXTRA_ESCAPED_CR_IS_LF Interpret \r as \n
4444
PCRE2_EXTRA_MATCH_LINE Pattern matches whole lines
4545
PCRE2_EXTRA_MATCH_WORD Pattern matches "words"
46+
PCRE2_EXTRA_NEVER_CALLOUT Disallow callouts in pattern
4647
PCRE2_EXTRA_NO_BS0 Disallow \0 (but not \00 or \000)
4748
PCRE2_EXTRA_PYTHON_OCTAL Use Python rules for octal
49+
PCRE2_EXTRA_TURKISH_CASING Use Turkish I case folding
4850
</pre>
4951
There is a complete description of the PCRE2 native API in the
5052
<a href="pcre2api.html"><b>pcre2api</b></a>

doc/html/pcre2api.html

Lines changed: 29 additions & 4 deletions
@@ -1697,12 +1697,21 @@ <h1>pcre2api man page</h1>
16971697
changed within a pattern by a (?i) option setting. If either PCRE2_UTF or
16981698
PCRE2_UCP is set, Unicode properties are used for all characters with more than
16991699
one other case, and for all characters whose code points are greater than
1700-
U+007F. Note that there are two ASCII characters, K and S, that, in addition to
1700+
U+007F.
1701+
</P>
1702+
<P>
1703+
Note that there are two ASCII characters, K and S, that, in addition to
17011704
their lower case ASCII equivalents, are case-equivalent with U+212A (Kelvin
17021705
sign) and U+017F (long S) respectively. If you do not want this case
17031706
equivalence, you can suppress it by setting PCRE2_EXTRA_CASELESS_RESTRICT.
17041707
</P>
17051708
<P>
1709+
One language family, Turkish and Azeri, has its own case-insensitivity rules,
1710+
which can be selected by setting PCRE2_EXTRA_TURKISH_CASING. This alters the
1711+
behaviour of the 'i', 'I', U+0130 (capital I with dot above), and U+0131
1712+
(small dotless i) characters.
1713+
</P>
1714+
<P>
17061715
For lower valued characters with only one other case, a lookup table is used
17071716
for speed. When neither PCRE2_UTF nor PCRE2_UCP is set, a lookup table is used
17081717
for all code points less than 256, and higher code points (available only in
@@ -2037,9 +2046,16 @@ <h1>pcre2api man page</h1>
20372046
upper/lower casing operations, even when PCRE2_UTF is not set. This makes it
20382047
possible to process strings in the 16-bit UCS-2 code. This option is available
20392048
only if PCRE2 has been compiled with Unicode support (which is the default).
2040-
The PCRE2_EXTRA_CASELESS_RESTRICT option (see below) restricts caseless
2049+
</P>
2050+
<P>
2051+
The PCRE2_EXTRA_CASELESS_RESTRICT option (see above) restricts caseless
20412052
matching such that ASCII characters match only ASCII characters and non-ASCII
2042-
characters match only non-ASCII characters.
2053+
characters match only non-ASCII characters. The PCRE2_EXTRA_TURKISH_CASING option
2054+
(see above) alters the matching of the 'i' characters to follow their behaviour
2055+
in Turkish and Azeri languages. For further details on
2056+
PCRE2_EXTRA_CASELESS_RESTRICT and PCRE2_EXTRA_TURKISH_CASING, see the
2057+
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
2058+
page.
20432059
<pre>
20442060
PCRE2_UNGREEDY
20452061
</pre>
@@ -2176,7 +2192,8 @@ <h1>pcre2api man page</h1>
21762192
ASCII letter K is case-equivalent to U+212a (Kelvin sign). This option disables
21772193
recognition of case-equivalences that cross the ASCII/non-ASCII boundary. In a
21782194
caseless match, both characters must either be ASCII or non-ASCII. The option
2179-
can be changed with a pattern by the (?r) option setting.
2195+
can be changed within a pattern by the (*CASELESS_RESTRICT) or (?r) option
2196+
settings.
21802197
<pre>
21812198
PCRE2_EXTRA_ESCAPED_CR_IS_LF
21822199
</pre>
@@ -2223,6 +2240,14 @@ <h1>pcre2api man page</h1>
22232240
returning PCRE2_ERROR_CALLOUT_CALLER_DISABLED. This is useful if the application
22242241
knows that a callout will not be provided to <b>pcre2_match()</b>, so that
22252242
callouts in the pattern are not silently ignored.
2243+
<pre>
2244+
PCRE2_EXTRA_TURKISH_CASING
2245+
</pre>
2246+
This option alters case-equivalence of the 'i' letters to follow the
2247+
alphabet used by Turkish and Azeri languages. The option can be changed within
2248+
a pattern by the (*TURKISH_CASING) start-of-pattern setting. Either the UTF or
2249+
UCP options must be set. In the 8-bit library, UTF must be set. This option
2250+
cannot be combined with PCRE2_EXTRA_CASELESS_RESTRICT.
22262251
<a name="jitcompiling"></a></P>
22272252
<br><a name="SEC21" href="#TOC1">JUST-IN-TIME (JIT) COMPILATION</a><br>
22282253
<P>

doc/html/pcre2pattern.html

Lines changed: 4 additions & 1 deletion
@@ -302,7 +302,10 @@ <h1>pcre2pattern man page</h1>
302302
equivalents, are case-equivalent with Unicode U+212A (Kelvin sign) and U+017F
303303
(long S) respectively when either PCRE2_UTF or PCRE2_UCP is set, unless the
304304
PCRE2_EXTRA_CASELESS_RESTRICT option is in force (either passed to
305-
<b>pcre2_compile()</b> or set by (?r) within the pattern).
305+
<b>pcre2_compile()</b> or set by (*CASELESS_RESTRICT) or (?r) within the
306+
pattern). If the PCRE2_EXTRA_TURKISH_CASING option is in force (either passed
307+
to <b>pcre2_compile()</b> or set by (*TURKISH_CASING) within the pattern), then
308+
the 'i' letters are matched according to Turkish and Azeri languages.
306309
</P>
307310
<P>
308311
The power of regular expressions comes from the ability to include wild cards,

doc/html/pcre2syntax.html

Lines changed: 12 additions & 10 deletions
@@ -436,17 +436,19 @@ <h1>pcre2syntax man page</h1>
436436
of the newline or \R sequences or options with similar syntax. More than one
437437
of them may appear. For the first three, d is a decimal number.
438438
<pre>
439-
(*LIMIT_DEPTH=d) set the backtracking limit to d
440-
(*LIMIT_HEAP=d) set the heap size limit to d * 1024 bytes
441-
(*LIMIT_MATCH=d) set the match limit to d
442-
(*NOTEMPTY) set PCRE2_NOTEMPTY when matching
443-
(*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
444-
(*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
439+
(*CASELESS_RESTRICT) set PCRE2_EXTRA_CASELESS_RESTRICT when matching
440+
(*LIMIT_DEPTH=d) set the backtracking limit to d
441+
(*LIMIT_HEAP=d) set the heap size limit to d * 1024 bytes
442+
(*LIMIT_MATCH=d) set the match limit to d
443+
(*NOTEMPTY) set PCRE2_NOTEMPTY when matching
444+
(*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
445+
(*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
445446
(*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
446-
(*NO_JIT) disable JIT optimization
447-
(*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
448-
(*UTF) set appropriate UTF mode for the library in use
449-
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
447+
(*NO_JIT) disable JIT optimization
448+
(*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
449+
(*TURKISH_CASING) set PCRE2_EXTRA_TURKISH_CASING when matching
450+
(*UTF) set appropriate UTF mode for the library in use
451+
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
450452
</pre>
451453
Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the value of
452454
the limits set by the caller of <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>,

doc/html/pcre2test.html

Lines changed: 1 addition & 0 deletions
@@ -673,6 +673,7 @@ <h1>pcre2test man page</h1>
673673
no_start_optimize set PCRE2_NO_START_OPTIMIZE
674674
no_utf_check set PCRE2_NO_UTF_CHECK
675675
python_octal set PCRE2_EXTRA_PYTHON_OCTAL
676+
turkish_casing set PCRE2_EXTRA_TURKISH_CASING
676677
ucp set PCRE2_UCP
677678
ungreedy set PCRE2_UNGREEDY
678679
use_offset_limit set PCRE2_USE_OFFSET_LIMIT

doc/html/pcre2unicode.html

Lines changed: 29 additions & 0 deletions
@@ -157,6 +157,35 @@ <h1>pcre2unicode man page</h1>
157157
counterparts can be disabled by setting the PCRE2_EXTRA_CASELESS_RESTRICT
158158
option. When this is set, all characters in a case equivalence must either be
159159
ASCII or non-ASCII; there can be no mixing.
160+
<pre>
161+
Without PCRE2_EXTRA_CASELESS_RESTRICT:
162+
'k' = 'K' = U+212A (Kelvin sign)
163+
's' = 'S' = U+017F (long S)
164+
With PCRE2_EXTRA_CASELESS_RESTRICT:
165+
'k' = 'K'
166+
U+212A (Kelvin sign) only case-equivalent to itself
167+
's' = 'S'
168+
U+017F (long S) only case-equivalent to itself
169+
</PRE>
170+
</P>
171+
<P>
172+
One language family, Turkish and Azeri, has its own case-insensitivity rules,
173+
which can be selected by setting PCRE2_EXTRA_TURKISH_CASING. This alters the
174+
behaviour of the 'i', 'I', U+0130 (capital I with dot above), and U+0131
175+
(small dotless i) characters.
176+
<pre>
177+
Without PCRE2_EXTRA_TURKISH_CASING:
178+
'i' = 'I'
179+
U+0130 (capital I with dot above) only case-equivalent to itself
180+
U+0131 (small dotless i) only case-equivalent to itself
181+
With PCRE2_EXTRA_TURKISH_CASING:
182+
'i' = U+0130 (capital I with dot above)
183+
U+0131 (small dotless i) = 'I'
184+
</PRE>
185+
</P>
186+
<P>
187+
It is not allowed to specify both PCRE2_EXTRA_CASELESS_RESTRICT and
188+
PCRE2_EXTRA_TURKISH_CASING together.
160189
</P>
161190
<P>
162191
From release 10.45 the Unicode letter properties Lu (upper case), Ll (lower

doc/pcre2_set_compile_extra_options.3

Lines changed: 2 additions & 0 deletions
@@ -43,8 +43,10 @@ options are:
4343
PCRE2_EXTRA_ESCAPED_CR_IS_LF Interpret \er as \en
4444
PCRE2_EXTRA_MATCH_LINE Pattern matches whole lines
4545
PCRE2_EXTRA_MATCH_WORD Pattern matches "words"
46+
PCRE2_EXTRA_NEVER_CALLOUT Disallow callouts in pattern
4647
PCRE2_EXTRA_NO_BS0 Disallow \e0 (but not \e00 or \e000)
4748
PCRE2_EXTRA_PYTHON_OCTAL Use Python rules for octal
49+
PCRE2_EXTRA_TURKISH_CASING Use Turkish I case folding
4850
.sp
4951
There is a complete description of the PCRE2 native API in the
5052
.\" HREF

doc/pcre2api.3

Lines changed: 28 additions & 4 deletions
@@ -1633,11 +1633,18 @@ letters in the subject. It is equivalent to Perl's /i option, and it can be
16331633
changed within a pattern by a (?i) option setting. If either PCRE2_UTF or
16341634
PCRE2_UCP is set, Unicode properties are used for all characters with more than
16351635
one other case, and for all characters whose code points are greater than
1636-
U+007F. Note that there are two ASCII characters, K and S, that, in addition to
1636+
U+007F.
1637+
.P
1638+
Note that there are two ASCII characters, K and S, that, in addition to
16371639
their lower case ASCII equivalents, are case-equivalent with U+212A (Kelvin
16381640
sign) and U+017F (long S) respectively. If you do not want this case
16391641
equivalence, you can suppress it by setting PCRE2_EXTRA_CASELESS_RESTRICT.
16401642
.P
1643+
One language family, Turkish and Azeri, has its own case-insensitivity rules,
1644+
which can be selected by setting PCRE2_EXTRA_TURKISH_CASING. This alters the
1645+
behaviour of the 'i', 'I', U+0130 (capital I with dot above), and U+0131
1646+
(small dotless i) characters.
1647+
.P
16411648
For lower valued characters with only one other case, a lookup table is used
16421649
for speed. When neither PCRE2_UTF nor PCRE2_UCP is set, a lookup table is used
16431650
for all code points less than 256, and higher code points (available only in
@@ -1986,9 +1993,17 @@ The second effect of PCRE2_UCP is to force the use of Unicode properties for
19861993
upper/lower casing operations, even when PCRE2_UTF is not set. This makes it
19871994
possible to process strings in the 16-bit UCS-2 code. This option is available
19881995
only if PCRE2 has been compiled with Unicode support (which is the default).
1989-
The PCRE2_EXTRA_CASELESS_RESTRICT option (see below) restricts caseless
1996+
.P
1997+
The PCRE2_EXTRA_CASELESS_RESTRICT option (see above) restricts caseless
19901998
matching such that ASCII characters match only ASCII characters and non-ASCII
1991-
characters match only non-ASCII characters.
1999+
characters match only non-ASCII characters. The PCRE2_EXTRA_TURKISH_CASING option
2000+
(see above) alters the matching of the 'i' characters to follow their behaviour
2001+
in Turkish and Azeri languages. For further details on
2002+
PCRE2_EXTRA_CASELESS_RESTRICT and PCRE2_EXTRA_TURKISH_CASING, see the
2003+
.\" HREF
2004+
\fBpcre2unicode\fP
2005+
.\"
2006+
page.
19922007
.sp
19932008
PCRE2_UNGREEDY
19942009
.sp
@@ -2128,7 +2143,8 @@ characters. The ASCII letter S is case-equivalent to U+017f (long S) and the
21282143
ASCII letter K is case-equivalent to U+212a (Kelvin sign). This option disables
21292144
recognition of case-equivalences that cross the ASCII/non-ASCII boundary. In a
21302145
caseless match, both characters must either be ASCII or non-ASCII. The option
2131-
can be changed with a pattern by the (?r) option setting.
2146+
can be changed within a pattern by the (*CASELESS_RESTRICT) or (?r) option
2147+
settings.
21322148
.sp
21332149
PCRE2_EXTRA_ESCAPED_CR_IS_LF
21342150
.sp
@@ -2177,6 +2193,14 @@ If this option is set, PCRE2 treats callouts in the pattern as a syntax error,
21772193
returning PCRE2_ERROR_CALLOUT_CALLER_DISABLED. This is useful if the application
21782194
knows that a callout will not be provided to \fBpcre2_match()\fP, so that
21792195
callouts in the pattern are not silently ignored.
2196+
.sp
2197+
PCRE2_EXTRA_TURKISH_CASING
2198+
.sp
2199+
This option alters case-equivalence of the 'i' letters to follow the
2200+
alphabet used by Turkish and Azeri languages. The option can be changed within
2201+
a pattern by the (*TURKISH_CASING) start-of-pattern setting. Either the UTF or
2202+
UCP options must be set. In the 8-bit library, UTF must be set. This option
2203+
cannot be combined with PCRE2_EXTRA_CASELESS_RESTRICT.
21802204
.
21812205
.
21822206
.\" HTML <a name="jitcompiling"></a>

doc/pcre2pattern.3

Lines changed: 4 additions & 1 deletion
@@ -278,7 +278,10 @@ ASCII characters, K and S, that, in addition to their lower case ASCII
278278
equivalents, are case-equivalent with Unicode U+212A (Kelvin sign) and U+017F
279279
(long S) respectively when either PCRE2_UTF or PCRE2_UCP is set, unless the
280280
PCRE2_EXTRA_CASELESS_RESTRICT option is in force (either passed to
281-
\fBpcre2_compile()\fP or set by (?r) within the pattern).
281+
\fBpcre2_compile()\fP or set by (*CASELESS_RESTRICT) or (?r) within the
282+
pattern). If the PCRE2_EXTRA_TURKISH_CASING option is in force (either passed
283+
to \fBpcre2_compile()\fP or set by (*TURKISH_CASING) within the pattern), then
284+
the 'i' letters are matched according to Turkish and Azeri languages.
282285
.P
283286
The power of regular expressions comes from the ability to include wild cards,
284287
character classes, alternatives, and repetitions in the pattern. These are

doc/pcre2syntax.3

Lines changed: 12 additions & 10 deletions
@@ -411,17 +411,19 @@ The following are recognized only at the very start of a pattern or after one
411411
of the newline or \eR sequences or options with similar syntax. More than one
412412
of them may appear. For the first three, d is a decimal number.
413413
.sp
414-
(*LIMIT_DEPTH=d) set the backtracking limit to d
415-
(*LIMIT_HEAP=d) set the heap size limit to d * 1024 bytes
416-
(*LIMIT_MATCH=d) set the match limit to d
417-
(*NOTEMPTY) set PCRE2_NOTEMPTY when matching
418-
(*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
419-
(*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
414+
(*CASELESS_RESTRICT) set PCRE2_EXTRA_CASELESS_RESTRICT when matching
415+
(*LIMIT_DEPTH=d) set the backtracking limit to d
416+
(*LIMIT_HEAP=d) set the heap size limit to d * 1024 bytes
417+
(*LIMIT_MATCH=d) set the match limit to d
418+
(*NOTEMPTY) set PCRE2_NOTEMPTY when matching
419+
(*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
420+
(*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
420421
(*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
421-
(*NO_JIT) disable JIT optimization
422-
(*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
423-
(*UTF) set appropriate UTF mode for the library in use
424-
(*UCP) set PCRE2_UCP (use Unicode properties for \ed etc)
422+
(*NO_JIT) disable JIT optimization
423+
(*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
424+
(*TURKISH_CASING) set PCRE2_EXTRA_TURKISH_CASING when matching
425+
(*UTF) set appropriate UTF mode for the library in use
426+
(*UCP) set PCRE2_UCP (use Unicode properties for \ed etc)
425427
.sp
426428
Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the value of
427429
the limits set by the caller of \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP,
