Open
Description
Kiichi Kuramoto opened BATCH-2540 and commented
Supplementary characters such as "𠮷" (U+20BB7) are represented in Java's UTF-16 strings by two char values (32 bits in total), which is referred to as a surrogate pair.
http://unicode.org/cgi-bin/GetUnihanData.pl?codepoint=20bb7
For example, suppose I want to get the first two characters of "𠮷田 太郎".
I expected "𠮷田", but the following code does not return that, because the indexes passed to String#substring count char values rather than characters:
"𠮷田 太郎".substring(0, 2); // => "𠮷"
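The mismatch can be checked by comparing the number of char values with the number of code points. The following is only a minimal sketch; the class name SurrogatePairDemo and the method describe are hypothetical names chosen for illustration, and the string is written with Unicode escapes so the result does not depend on the source file encoding.

```java
public final class SurrogatePairDemo {
    private SurrogatePairDemo() {}

    // Hypothetical helper: reports how many UTF-16 char values and how many
    // Unicode code points a string contains.
    public static String describe(String s) {
        int chars = s.length();                          // UTF-16 code units
        int codePoints = s.codePointCount(0, s.length()); // actual characters
        return chars + " chars, " + codePoints + " code points";
    }

    public static void main(String[] args) {
        // "𠮷田 太郎": U+20BB7 is encoded as the surrogate pair D842 DFB7
        String name = "\uD842\uDFB7\u7530 \u592A\u90CE";
        System.out.println(describe(name));
        // The first two char values are the two halves of "𠮷"
        System.out.println(Character.isHighSurrogate(name.charAt(0)));
        System.out.println(Character.isLowSurrogate(name.charAt(1)));
    }
}
```

Since "𠮷" alone occupies indexes 0 and 1, substring(0, 2) ends before "田".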
Therefore, the start and end positions passed to String#substring must be computed with surrogate pairs taken into account, by using String#offsetByCodePoints as below.
String str = "𠮷田 太郎";
int startIndex = 0;
int endIndex = 2;
int startIndexSurrogate = str.offsetByCodePoints(0, startIndex); // => 0
int endIndexSurrogate = str.offsetByCodePoints(0, endIndex); // => 3
String subStrSurrogate = str.substring(startIndexSurrogate, endIndexSurrogate); // => "𠮷田"
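The same conversion could be wrapped in a small helper so that callers pass positions counted in characters (code points). This is a sketch under assumptions, not an existing API: the class CodePointSubstring and the method substringByCodePoints are hypothetical names, and only String#offsetByCodePoints and String#substring are real library calls.

```java
public final class CodePointSubstring {
    private CodePointSubstring() {}

    /**
     * Hypothetical helper: returns the substring between two positions that
     * are counted in code points rather than in UTF-16 char values.
     * Out-of-range positions cause an IndexOutOfBoundsException.
     */
    public static String substringByCodePoints(String str, int beginCodePoint, int endCodePoint) {
        // Convert the code point positions to char indexes.
        int beginIndex = str.offsetByCodePoints(0, beginCodePoint);
        // Continue from beginIndex so the prefix is not scanned twice.
        int endIndex = str.offsetByCodePoints(beginIndex, endCodePoint - beginCodePoint);
        return str.substring(beginIndex, endIndex);
    }

    public static void main(String[] args) {
        // "𠮷田 太郎", written with escapes to avoid source encoding issues
        String name = "\uD842\uDFB7\u7530 \u592A\u90CE";
        System.out.println(substringByCodePoints(name, 0, 2)); // "𠮷田"
    }
}
```

For strings that contain no supplementary characters the helper returns the same result as String#substring, so it could be used in both cases.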
Affects: 3.0.7
0 votes, 8 watchers