Ignoring case for unicode character leads to unexpected match #2200

t-b · 2022-05-06T18:22:04Z

I presume this is some Unicode thing I'm not aware of and not a bug.

If I do the following:

echo "$(printf '\u03A9')" | rg --ignore-case '\x{2126}'

I get a match, but from my (limited) understanding I don't know why. The echoed character is http://www.unicode-symbol.com/u/03A9.html and the one passed to rg is http://www.unicode-symbol.com/u/2126.html. The match is symmetric, though.

Windows 10 x64
$ rg --version
ripgrep 12.1.1
-SIMD -AVX (compiled)
+SIMD -AVX (runtime)

Any idea why?

Answered by BurntSushi

May 7, 2022

BurntSushi · 2022-05-07T18:26:56Z

Maintainer

There are two levels to "why" here. The first level is "because that's how Unicode defines its case folding tables." Here:

$ curl -LO https://www.unicode.org/Public/zipped/14.0.0/UCD.zip
$ unzip UCD.zip
$ rg '03A9' CaseFolding.txt
332:03A9; C; 03C9; # GREEK CAPITAL LETTER OMEGA
$ rg '2126' CaseFolding.txt
959:2126; C; 03C9; # OHM SIGN
$ rg '03C9' CaseFolding.txt
332:03A9; C; 03C9; # GREEK CAPITAL LETTER OMEGA
949:1FF3; F; 03C9 03B9; # GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI
951:1FF6; F; 03C9 0342; # GREEK SMALL LETTER OMEGA WITH PERISPOMENI
952:1FF7; F; 03C9 0342 03B9; # GREEK SMALL LETTER OMEGA WITH PERISPOMENI AND YPOGEGRAMMENI
957:1FFC; F; 03C9 03B9; # GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI
959:2126; C; 03C9; # OHM SIGN

Basically, the CaseFolding.txt file defines equivalence classes of codepoints. 03A9, 2126 and 03C9 all belong to the same equivalence class. We can actually trace this through the code pretty easily. The case folding table above is directly translated into a table in Rust source code in the regex engine. The table generation pre-computes the equivalence classes of every codepoint based on CaseFolding.txt. Each of the relevant codepoints has its own entry in that table.

So all three are in the same class. Each codepoint maps to a sequence containing the other two codepoints (for faster lookups).
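To make the class construction concrete, here is a minimal Python sketch of the idea, not the regex engine's actual generator (which produces a Rust table). The function name `build_classes` is hypothetical, the input rows are copied from the CaseFolding.txt output quoted above, and the sketch assumes that only the simple foldings (status `C` and `S`) take part in the classes.

```python
from collections import defaultdict

# Rows copied from the CaseFolding.txt output quoted above.
ROWS = """\
03A9; C; 03C9; # GREEK CAPITAL LETTER OMEGA
1FF3; F; 03C9 03B9; # GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI
2126; C; 03C9; # OHM SIGN
"""


def build_classes(text):
    """Map each codepoint to the other members of its folding class.

    Hypothetical helper: assumes only simple (C and S) foldings count.
    """
    by_target = defaultdict(set)
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        code, status, mapping = [f.strip() for f in data.split(";")[:3]]
        if status not in ("C", "S"):
            continue  # skip full (F) and Turkic (T) foldings
        target = int(mapping, 16)
        # The folding target belongs to the class it defines.
        by_target[target].update({int(code, 16), target})

    table = {}
    for members in by_target.values():
        for cp in members:
            table[cp] = sorted(members - {cp})
    return table


table = build_classes(ROWS)
for cp in sorted(table):
    others = ", ".join(f"U+{o:04X}" for o in table[cp])
    print(f"U+{cp:04X} -> {others}")
```

With these three rows, each of U+03A9, U+03C9 and U+2126 ends up mapped to the other two, which is the shape described above.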

And that's pretty much it. This isn't a unique example either. The same thing happens with the "micro sign" and the greek letter mu.
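Both pairs can be checked without ripgrep. The sketch below uses Python's built-in `str.casefold()`, which applies Unicode full case folding rather than the regex engine's own table, so it is only an approximation of what rg does. For these single-codepoint cases the two should agree. The helper name `same_caseless` is made up for illustration.

```python
import unicodedata

# (symbol, letter) pairs mentioned in this thread.
PAIRS = [
    ("\u2126", "\u03a9"),  # OHM SIGN vs. GREEK CAPITAL LETTER OMEGA
    ("\u00b5", "\u03bc"),  # MICRO SIGN vs. GREEK SMALL LETTER MU
]


def same_caseless(a, b):
    """Return True if both strings have the same case-folded form."""
    return a.casefold() == b.casefold()


for symbol, letter in PAIRS:
    print(
        f"{unicodedata.name(symbol)} / {unicodedata.name(letter)}: "
        f"equal={symbol == letter}, caseless_equal={same_caseless(symbol, letter)}"
    )
```

In each pair the two characters are distinct codepoints, so a plain comparison fails, yet they fold to the same lowercase Greek letter.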

As for the second level... Why did Unicode define it this way? I don't know that for certain, but the obvious answer to me is that they are confusable. One might be a symbol corresponding to a particular kind of unit and thus shouldn't have a "case" per se, yet, it looks exactly like the letter. I imagine there might also be legacy reasons in play, e.g., if keyboards would normally input one type over another. Then, it's a simple matter of expediency to generally treat both symbols and letters---even if they have distinct codepoint numbers for other reasons---as the same when it comes to case conversion (and likely other things).

1 reply

t-b May 9, 2022
Author

Thanks a lot for this exhaustive explanation. This does explain it!
