FixedLengthTokenizer does not correct for the surrogate pair [BATCH-2540] #1062

**[Kiichi Kuramoto](https://jira.spring.io/secure/ViewProfile.jspa?name=kuramotoki)** opened **[BATCH-2540](https://jira.spring.io/browse/BATCH-2540?redirect=false)** and commented

Supplementary characters such as "𠮷" (U+20BB7) are represented in UTF-16, and therefore in a Java String, by two char values (32 bits in total). Such a pair is referred to as a surrogate pair.
http://unicode.org/cgi-bin/GetUnihanData.pl?codepoint=20bb7

For example, suppose I want to get the first two characters of "𠮷田 太郎".
I expect "𠮷田", but the following code returns only "𠮷", because String#substring counts char values rather than characters.

```java
"𠮷田 太郎".substring(0, 2); // => "𠮷"
```
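The mismatch can be made visible by comparing the number of char values with the number of code points. The following is only a minimal sketch: the class name `SurrogatePairDemo` and the constant `NAME` are made up for illustration, and the string is written with Unicode escapes so that the result does not depend on the source file encoding.

```java
final class SurrogatePairDemo {
    // "𠮷田 太郎": U+20BB7 is written as the surrogate pair \uD842\uDFB7.
    static final String NAME = "\uD842\uDFB7\u7530 \u592A\u90CE";

    public static void main(String[] args) {
        // length() counts UTF-16 char values, so the surrogate pair counts twice.
        System.out.println(NAME.length());                          // 6
        // codePointCount() counts Unicode code points, i.e. user-visible characters here.
        System.out.println(NAME.codePointCount(0, NAME.length()));  // 5
        // The first two char values are the two halves of one character.
        System.out.println(Character.isHighSurrogate(NAME.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(NAME.charAt(1)));  // true
    }
}
```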

Therefore, the start and end positions passed to String#substring must be computed with surrogate pairs taken into account, for example by using String#offsetByCodePoints as below.

```java
String str = "𠮷田 太郎";
int startIndex = 0;
int endIndex = 2;

int startIndexSurrogate = str.offsetByCodePoints(0, startIndex); // => 0
int endIndexSurrogate = str.offsetByCodePoints(0, endIndex); // => 3

String subStrSurrogate = str.substring(startIndexSurrogate, endIndexSurrogate); // => "𠮷田"
```
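To illustrate how the same idea might apply to fixed-length tokenizing, here is a rough sketch that cuts a line into fields whose boundaries are given in code points. It is not the actual FixedLengthTokenizer code: the class `CodePointFixedLength`, its methods, and the 0-based, end-exclusive range convention are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointFixedLength {

    // Hypothetical helper: substring whose bounds are code point indexes
    // (0-based, end-exclusive) instead of char indexes.
    static String substringByCodePoints(String s, int beginCp, int endCp) {
        int begin = s.offsetByCodePoints(0, beginCp);
        // Continue from 'begin' so the prefix is not scanned twice.
        int end = s.offsetByCodePoints(begin, endCp - beginCp);
        return s.substring(begin, end);
    }

    // Hypothetical tokenizer: each range is {beginCp, endCp}.
    static List<String> tokenize(String line, int[][] ranges) {
        List<String> tokens = new ArrayList<>();
        for (int[] range : ranges) {
            tokens.add(substringByCodePoints(line, range[0], range[1]));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "𠮷田 太郎", with U+20BB7 written as the surrogate pair \uD842\uDFB7.
        String line = "\uD842\uDFB7\u7530 \u592A\u90CE";
        // Family name in code points 0..2, given name in code points 3..5.
        System.out.println(tokenize(line, new int[][] {{0, 2}, {3, 5}}));
    }
}
```

Note that String#offsetByCodePoints throws IndexOutOfBoundsException when a range extends past the end of the line, so a real tokenizer would still have to decide how to handle short lines.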



---

**Affects:** 3.0.7

0 votes, 8 watchers

