FixedLengthTokenizer does not correct for the surrogate pair [BATCH-2540] #1062

Open
Kiichi Kuramoto opened BATCH-2540 and commented

Supplementary characters such as "𠮷" (U+20BB7) are represented in Java's UTF-16 strings by two char values (32 bits in total), known as a surrogate pair.
http://unicode.org/cgi-bin/GetUnihanData.pl?codepoint=20bb7

For example, suppose I want the first two characters of "𠮷田 太郎".
I expected "𠮷田", but the following code returns only the first character, because String#substring counts char units rather than characters.

"𠮷田 太郎".substring(0, 2); // => "𠮷"

Therefore, the start and end positions passed to String#substring must be computed with surrogate pairs taken into account, using String#offsetByCodePoints as shown below.

String str = "𠮷田 太郎";
int startIndex = 0;
int endIndex = 2;

int startIndexSurrogate = str.offsetByCodePoints(0, startIndex); // => 0
int endIndexSurrogate = str.offsetByCodePoints(0, endIndex); // => 3

String subStrSurrogate = str.substring(startIndexSurrogate, endIndexSurrogate); // => "𠮷田"
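
To show how the same idea might apply to fixed-length columns, here is a minimal sketch that is not Spring Batch's actual FixedLengthTokenizer code. The class CodePointFields and its methods substringByCodePoints and split are hypothetical names introduced only for illustration, and the sketch assumes column widths are meant to be counted in code points. It relies only on String#offsetByCodePoints, String#codePointCount and String#substring.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Spring Batch.
class CodePointFields {

    // Like String#substring, but beginCp and endCp count code points
    // instead of UTF-16 char units.
    static String substringByCodePoints(String s, int beginCp, int endCp) {
        int begin = s.offsetByCodePoints(0, beginCp);
        int end = s.offsetByCodePoints(begin, endCp - beginCp);
        return s.substring(begin, end);
    }

    // Splits a line into consecutive columns whose widths are given in
    // code points. Columns that run past the end of the line are
    // truncated (possibly to an empty string) rather than throwing.
    static List<String> split(String line, int... widths) {
        List<String> fields = new ArrayList<>();
        int total = line.codePointCount(0, line.length());
        int start = 0;
        for (int width : widths) {
            int from = Math.min(start, total);
            int to = Math.min(start + width, total);
            fields.add(substringByCodePoints(line, from, to));
            start += width;
        }
        return fields;
    }

    public static void main(String[] args) {
        // "𠮷田 太郎" written with escapes so the source compiles under any encoding.
        String line = "\uD842\uDFB7\u7530 \u592A\u90CE";
        System.out.println(line.length());                        // 6 char units
        System.out.println(line.codePointCount(0, line.length())); // 5 code points
        System.out.println(substringByCodePoints(line, 0, 2));
        System.out.println(split(line, 2, 1, 2));
    }
}
```

With widths 2, 1 and 2, the sample line splits into the family name, the separating space and the given name, whereas char-based ranges would cut the first column in the middle of the name.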

Affects: 3.0.7

0 votes, 8 watchers
