Ignoring case for unicode character leads to unexpected match #2200
I presume this is some Unicode thing I'm not aware of and not a bug. If I do the following: echo "$(printf '\u03A9')" | rg --ignore-case '\x{2126}' I get a match. But from my (limited) understanding I don't know why. The echoed one is http://www.unicode-symbol.com/u/03A9.html and the rg'ed one is http://www.unicode-symbol.com/u/2126.html. It is symmetric though: swapping the two codepoints also matches.
Any idea why?
Replies: 1 comment 1 reply
There are two levels to "why" here. The first level is, "because that's how Unicode defines its case folding tables." Basically, the Greek capital letter omega (U+03A9), the Greek small letter omega (U+03C9) and the ohm sign (U+2126) all fold to the same thing. So all three are in the same class. Each codepoint maps to a sequence containing the other two codepoints (for faster lookups). And that's pretty much it. This isn't a unique example either. The same thing happens with the "micro sign" and the Greek letter mu.

As for the second level... Why did Unicode define it this way? I don't know that for certain, but the obvious answer to me is that they are confusable. One might be a symbol corresponding to a particular kind of unit and thus shouldn't have a "case" per se, yet it looks exactly like the letter. I imagine there might also be legacy reasons in play, e.g., if keyboards would normally input one type over another. Then it's a simple matter of expediency to generally treat both symbols and letters (even if they have distinct codepoint numbers for other reasons) as the same when it comes to case conversion (and likely other things).
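
If you want to poke at this outside of ripgrep, here is a minimal sketch in Python. It is not how ripgrep does it: it uses Python's built-in `str.casefold()` (Unicode full case folding), whereas ripgrep's regex engine, as far as I understand, works from simple case folding tables. For the codepoints discussed here the two should agree. The helper name `fold_class` and the list of candidate characters are my own choices for illustration.

```python
import unicodedata


def fold_class(chars):
    """Group characters by their case-folded form (hypothetical helper)."""
    classes = {}
    for ch in chars:
        # Characters sharing a folded form land in the same equivalence class.
        classes.setdefault(ch.casefold(), []).append(ch)
    return classes


# Omega, small omega, ohm sign, then micro sign, small mu, capital mu.
candidates = ["\u03a9", "\u03c9", "\u2126", "\u00b5", "\u03bc", "\u039c"]

for folded, members in fold_class(candidates).items():
    names = ", ".join(f"U+{ord(c):04X} {unicodedata.name(c)}" for c in members)
    print(f"folds to U+{ord(folded):04X}: {names}")
```

The omega/ohm trio ends up in one group and the micro/mu trio in another, which is the same grouping that makes `--ignore-case` match across them.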