Commit 1c250e1

feat(c++): Support the UTF-8 to UTF-16 with SIMD (#1990)
## What does this PR do?

Adds UTF-8 to UTF-16 conversion, structured so that it can later be accelerated with SIMD:

``` c++
std::u16string utf8ToUtf16(const std::string &utf8, bool is_little_endian)
```

The logic of converting UTF-8 to UTF-16 isn't complicated, but there are still many optimizations that I haven't worked out yet. So I'll first land a version that is somewhat faster than the original one, and then think about further optimizations.

Judging from the tests, the logic is correct:

``` text
[----------] 9 tests from UTF8ToUTF16Test
[ RUN      ] UTF8ToUTF16Test.BasicConversion
[       OK ] UTF8ToUTF16Test.BasicConversion (0 ms)
[ RUN      ] UTF8ToUTF16Test.EmptyString
[       OK ] UTF8ToUTF16Test.EmptyString (0 ms)
[ RUN      ] UTF8ToUTF16Test.SurrogatePairs
[       OK ] UTF8ToUTF16Test.SurrogatePairs (0 ms)
[ RUN      ] UTF8ToUTF16Test.BoundaryValues
[       OK ] UTF8ToUTF16Test.BoundaryValues (0 ms)
[ RUN      ] UTF8ToUTF16Test.SpecialCharacters
[       OK ] UTF8ToUTF16Test.SpecialCharacters (0 ms)
[ RUN      ] UTF8ToUTF16Test.LittleEndian
[       OK ] UTF8ToUTF16Test.LittleEndian (0 ms)
[ RUN      ] UTF8ToUTF16Test.BigEndian
[       OK ] UTF8ToUTF16Test.BigEndian (0 ms)
[ RUN      ] UTF8ToUTF16Test.RoundTripConversion
[       OK ] UTF8ToUTF16Test.RoundTripConversion (0 ms)
```

<img width="264" alt="image" src="https://github.com/user-attachments/assets/7b9033ad-001f-4a36-a27e-6a8362f3a6df" />

Performance also improves compared to the serial, byte-at-a-time version (see the screenshot):

<img width="394" alt="image" src="https://github.com/user-attachments/assets/93379aeb-9080-449e-b889-567dd207f5fc" />

Note that this code does not actually use AVX2 or any other SIMD instruction set. The main reason is that UTF-8 is a variable-length encoding: each character takes 1 to 4 bytes, and the length is only known after inspecting the lead byte. Without a uniform element width, it is hard to process the bytes of a block independently, so the conversion does not map directly onto SIMD operations.

There are also some code style changes for consistency.

## Related issues

Close #1964

## Does this PR introduce any user-facing change?

- [x] Does this PR introduce any public API change?
- [ ] Does this PR introduce any binary protocol compatibility change?

## Benchmark
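On the description's point that variable-length sequences make UTF-8 hard to vectorize: one part that is uniform is a run of ASCII bytes, where each byte maps to exactly one UTF-16 code unit. As a rough illustration only (this is not part of the commit), the sketch below tests eight bytes at once with a 64-bit mask, which is a portable stand-in for a real SIMD comparison. `isAscii8` and `asciiPrefixLength` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper: true when the 8 bytes starting at p are all ASCII,
// i.e. none of them has its high bit set.
inline bool isAscii8(const char *p) {
  uint64_t word;
  std::memcpy(&word, p, sizeof(word)); // memcpy avoids an unaligned load
  return (word & 0x8080808080808080ULL) == 0;
}

// Hypothetical helper: length of the leading all-ASCII run of s.
// Whole 8-byte words are checked first, then the tail byte by byte.
inline std::size_t asciiPrefixLength(const std::string &s) {
  std::size_t i = 0;
  const std::size_t n = s.size();
  while (i + 8 <= n && isAscii8(s.data() + i)) {
    i += 8;
  }
  while (i < n && static_cast<unsigned char>(s[i]) < 0x80) {
    ++i;
  }
  return i;
}
```

A converter could copy the ASCII prefix straight into the output (widening each byte to 16 bits) and fall back to per-sequence decoding only from the first non-ASCII byte onward.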
1 parent 3c1df17 commit 1c250e1

3 files changed: +451 -3 lines changed
cpp/fury/util/string_util.cc

Lines changed: 208 additions & 1 deletion
@@ -58,6 +58,129 @@ inline void utf16SurrogatePairToUtf8(uint16_t high, uint16_t low, char *&utf8) {
   *utf8++ = static_cast<char>((code_point & 0x3F) | 0x80);
 }
 
+std::u16string utf8ToUtf16SIMD(const std::string &utf8, bool is_little_endian) {
+  std::u16string utf16;
+  utf16.reserve(utf8.size()); // Reserve space to avoid frequent reallocations
+
+  char buffer[64]; // Buffer to hold temporary UTF-16 results
+  char16_t *output =
+      reinterpret_cast<char16_t *>(buffer); // Use char16_t for output
+
+  size_t i = 0;
+  size_t n = utf8.size();
+
+  while (i + 32 <= n) {
+
+    for (int j = 0; j < 32; ++j) {
+      uint8_t byte = utf8[i + j];
+
+      if (byte < 0x80) {
+        // 1-byte character (ASCII)
+        *output++ = static_cast<char16_t>(byte);
+      } else if (byte < 0xE0) {
+        // 2-byte character
+        uint16_t utf16_char = ((byte & 0x1F) << 6) | (utf8[i + j + 1] & 0x3F);
+        if (!is_little_endian) {
+          utf16_char = (utf16_char >> 8) |
+                       (utf16_char << 8); // Swap bytes for big-endian
+        }
+        *output++ = utf16_char;
+        ++j;
+      } else if (byte < 0xF0) {
+        // 3-byte character
+        uint16_t utf16_char = ((byte & 0x0F) << 12) |
+                              ((utf8[i + j + 1] & 0x3F) << 6) |
+                              (utf8[i + j + 2] & 0x3F);
+        if (!is_little_endian) {
+          utf16_char = (utf16_char >> 8) |
+                       (utf16_char << 8); // Swap bytes for big-endian
+        }
+        *output++ = utf16_char;
+        j += 2;
+      } else {
+        // 4-byte character (surrogate pair handling required)
+        uint32_t code_point =
+            ((byte & 0x07) << 18) | ((utf8[i + j + 1] & 0x3F) << 12) |
+            ((utf8[i + j + 2] & 0x3F) << 6) | (utf8[i + j + 3] & 0x3F);
+
+        // Convert the code point to a surrogate pair
+        uint16_t high_surrogate = 0xD800 + ((code_point - 0x10000) >> 10);
+        uint16_t low_surrogate = 0xDC00 + (code_point & 0x3FF);
+
+        if (!is_little_endian) {
+          high_surrogate = (high_surrogate >> 8) |
+                           (high_surrogate << 8); // Swap bytes for big-endian
+          low_surrogate = (low_surrogate >> 8) |
+                          (low_surrogate << 8); // Swap bytes for big-endian
+        }
+
+        *output++ = high_surrogate;
+        *output++ = low_surrogate;
+
+        j += 3;
+      }
+    }
+
+    // Append the processed buffer to the final utf16 string
+    utf16.append(reinterpret_cast<char16_t *>(buffer),
+                 output - reinterpret_cast<char16_t *>(buffer));
+    output =
+        reinterpret_cast<char16_t *>(buffer); // Reset output buffer pointer
+    i += 32;
+  }
+
+  // Handle remaining characters
+  while (i < n) {
+    uint8_t byte = utf8[i];
+
+    if (byte < 0x80) {
+      *output++ = static_cast<char16_t>(byte);
+    } else if (byte < 0xE0) {
+      uint16_t utf16_char = ((byte & 0x1F) << 6) | (utf8[i + 1] & 0x3F);
+      if (!is_little_endian) {
+        utf16_char =
+            (utf16_char >> 8) | (utf16_char << 8); // Swap bytes for big-endian
+      }
+      *output++ = utf16_char;
+      ++i;
+    } else if (byte < 0xF0) {
+      uint16_t utf16_char = ((byte & 0x0F) << 12) |
+                            ((utf8[i + 1] & 0x3F) << 6) | (utf8[i + 2] & 0x3F);
+      if (!is_little_endian) {
+        utf16_char =
+            (utf16_char >> 8) | (utf16_char << 8); // Swap bytes for big-endian
+      }
+      *output++ = utf16_char;
+      i += 2;
+    } else {
+      uint32_t code_point = ((byte & 0x07) << 18) |
+                            ((utf8[i + 1] & 0x3F) << 12) |
+                            ((utf8[i + 2] & 0x3F) << 6) | (utf8[i + 3] & 0x3F);
+
+      uint16_t high_surrogate = 0xD800 + ((code_point - 0x10000) >> 10);
+      uint16_t low_surrogate = 0xDC00 + (code_point & 0x3FF);
+
+      if (!is_little_endian) {
+        high_surrogate = (high_surrogate >> 8) | (high_surrogate << 8);
+        low_surrogate = (low_surrogate >> 8) | (low_surrogate << 8);
+      }
+
+      *output++ = high_surrogate;
+      *output++ = low_surrogate;
+
+      i += 3;
+    }
+
+    ++i;
+  }
+
+  // Append the last part of the buffer to the utf16 string
+  utf16.append(reinterpret_cast<char16_t *>(buffer),
+               output - reinterpret_cast<char16_t *>(buffer));
+
+  return utf16;
+}
+
 #if defined(__x86_64__) || defined(_M_X64)
 
 bool isLatin(const std::string &str) {
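One thing to watch in the 32-byte block loop of `utf8ToUtf16SIMD`: when a multi-byte sequence starts near the end of a block, `j` runs past 32 while `i` still advances by exactly 32, so the next block begins on continuation bytes. Below is a minimal sketch of one way to avoid that, assuming it is acceptable to advance by whole sequences and to throw on malformed input (as the non-SIMD fallback in this commit does). `decodeOne` and `utf8ToUtf16Sketch` are hypothetical names, not functions from this commit. The sketch byte-swaps every code unit when `is_little_endian` is false, and it does not reject overlong encodings or encoded surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes the UTF-8 sequence starting at s[i] and
// advances i past it. Throws on a bad lead byte, a bad continuation byte,
// or a sequence that runs off the end of the string.
inline uint32_t decodeOne(const std::string &s, std::size_t &i) {
  auto byte = [&s](std::size_t k) { return static_cast<uint8_t>(s[k]); };
  const uint8_t lead = byte(i);
  std::size_t len = 0;
  uint32_t cp = 0;
  if (lead < 0x80) {
    len = 1; cp = lead;
  } else if ((lead & 0xE0) == 0xC0) {
    len = 2; cp = lead & 0x1F;
  } else if ((lead & 0xF0) == 0xE0) {
    len = 3; cp = lead & 0x0F;
  } else if ((lead & 0xF8) == 0xF0) {
    len = 4; cp = lead & 0x07;
  } else {
    throw std::invalid_argument("invalid UTF-8 lead byte");
  }
  if (len > s.size() - i) {
    throw std::invalid_argument("truncated UTF-8 sequence");
  }
  for (std::size_t k = 1; k < len; ++k) {
    if ((byte(i + k) & 0xC0) != 0x80) {
      throw std::invalid_argument("invalid UTF-8 continuation byte");
    }
    cp = (cp << 6) | (byte(i + k) & 0x3F);
  }
  i += len;
  return cp;
}

// Hypothetical converter: consumes one whole sequence per iteration, so no
// sequence is ever split across a processing boundary.
inline std::u16string utf8ToUtf16Sketch(const std::string &utf8,
                                        bool is_little_endian) {
  std::u16string out;
  out.reserve(utf8.size());
  auto put = [&out, is_little_endian](uint32_t unit) {
    uint16_t u = static_cast<uint16_t>(unit);
    if (!is_little_endian) {
      u = static_cast<uint16_t>((u >> 8) | (u << 8));
    }
    out.push_back(static_cast<char16_t>(u));
  };
  std::size_t i = 0;
  while (i < utf8.size()) {
    uint32_t cp = decodeOne(utf8, i);
    if (cp >= 0x10000) {
      cp -= 0x10000;
      put(0xD800 + (cp >> 10));   // high surrogate
      put(0xDC00 + (cp & 0x3FF)); // low surrogate
    } else {
      put(cp);
    }
  }
  return out;
}
```

A test input worth adding for any blocked variant is a string whose multi-byte character starts at byte 31, for example 31 ASCII bytes followed by a 3-byte character.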
@@ -168,6 +291,10 @@ std::string utf16ToUtf8(const std::u16string &utf16, bool is_little_endian) {
   return utf8;
 }
 
+std::u16string utf8ToUtf16(const std::string &utf8, bool is_little_endian) {
+  return utf8ToUtf16SIMD(utf8, is_little_endian);
+}
+
 #elif defined(__ARM_NEON) || defined(__ARM_NEON__)
 
 bool isLatin(const std::string &str) {
@@ -264,6 +391,10 @@ std::string utf16ToUtf8(const std::u16string &utf16, bool is_little_endian) {
   return utf8;
 }
 
+std::u16string utf8ToUtf16(const std::string &utf8, bool is_little_endian) {
+  return utf8ToUtf16SIMD(utf8, is_little_endian);
+}
+
 #elif defined(__riscv) && __riscv_vector
 
 bool isLatin(const std::string &str) {
@@ -365,6 +496,10 @@ std::string utf16ToUtf8(const std::u16string &utf16, bool is_little_endian) {
   return utf8;
 }
 
+std::u16string utf8ToUtf16(const std::string &utf8, bool is_little_endian) {
+  return utf8ToUtf16SIMD(utf8, is_little_endian);
+}
+
 #else
 
 bool isLatin(const std::string &str) {
@@ -414,6 +549,78 @@ std::string utf16ToUtf8(const std::u16string &utf16, bool is_little_endian) {
   return utf8;
 }
 
+// Fallback implementation without SIMD acceleration
+std::u16string utf8ToUtf16(const std::string &utf8, bool is_little_endian) {
+  std::u16string utf16;   // Resulting UTF-16 string
+  size_t i = 0;           // Index for traversing the UTF-8 string
+  size_t n = utf8.size(); // Total length of the UTF-8 string
+
+  // Loop through each byte of the UTF-8 string
+  while (i < n) {
+    uint32_t code_point = 0;   // The Unicode code point
+    unsigned char c = utf8[i]; // Current byte of the UTF-8 string
+
+    // Determine the number of bytes for this character based on its first byte
+    if ((c & 0x80) == 0) {
+      // 1-byte character (ASCII)
+      code_point = c;
+      ++i;
+    } else if ((c & 0xE0) == 0xC0) {
+      // 2-byte character
+      code_point = c & 0x1F;
+      code_point = (code_point << 6) | (utf8[i + 1] & 0x3F);
+      i += 2;
+    } else if ((c & 0xF0) == 0xE0) {
+      // 3-byte character
+      code_point = c & 0x0F;
+      code_point = (code_point << 6) | (utf8[i + 1] & 0x3F);
+      code_point = (code_point << 6) | (utf8[i + 2] & 0x3F);
+      i += 3;
+    } else if ((c & 0xF8) == 0xF0) {
+      // 4-byte character
+      code_point = c & 0x07;
+      code_point = (code_point << 6) | (utf8[i + 1] & 0x3F);
+      code_point = (code_point << 6) | (utf8[i + 2] & 0x3F);
+      code_point = (code_point << 6) | (utf8[i + 3] & 0x3F);
+      i += 4;
+    } else {
+      // Invalid UTF-8 byte sequence
+      throw std::invalid_argument("Invalid UTF-8 encoding.");
+    }
+
+    // If the code point is beyond the BMP range, use surrogate pairs
+    if (code_point >= 0x10000) {
+      code_point -= 0x10000; // Subtract 0x10000 to get the surrogate pair
+      uint16_t high_surrogate = 0xD800 + (code_point >> 10);  // High surrogate
+      uint16_t low_surrogate = 0xDC00 + (code_point & 0x3FF); // Low surrogate
+
+      // If not little-endian, swap bytes of the surrogates
+      if (!is_little_endian) {
+        high_surrogate = (high_surrogate >> 8) | (high_surrogate << 8);
+        low_surrogate = (low_surrogate >> 8) | (low_surrogate << 8);
+      }
+
+      // Add both high and low surrogates to the UTF-16 string
+      utf16.push_back(high_surrogate);
+      utf16.push_back(low_surrogate);
+    } else {
+      // For code points within the BMP range, directly store as a 16-bit value
+      uint16_t utf16_char = static_cast<uint16_t>(code_point);
+
+      // If not little-endian, swap the bytes of the 16-bit character
+      if (!is_little_endian) {
+        utf16_char = (utf16_char >> 8) | (utf16_char << 8);
+      }
+
+      // Add the UTF-16 character to the string
+      utf16.push_back(utf16_char);
+    }
+  }
+
+  // Return the resulting UTF-16 string
+  return utf16;
+}
+
 #endif
 
-} // namespace fury
+} // namespace fury
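The surrogate-pair arithmetic and the byte swap used throughout this file can be checked by hand on a single code point. The following is a small sketch with hypothetical helper names (`SurrogatePair`, `toSurrogates`, `fromSurrogates`, `swapBytes`) that do not exist in Fury; the `static_assert`s work through U+1F600 at compile time.

```cpp
#include <cstdint>

// Hypothetical value type holding one UTF-16 surrogate pair.
struct SurrogatePair {
  uint16_t high;
  uint16_t low;
};

// Splits a supplementary code point (U+10000..U+10FFFF) into a pair:
// the top 10 bits of (cp - 0x10000) go to the high surrogate,
// the bottom 10 bits go to the low surrogate.
constexpr SurrogatePair toSurrogates(uint32_t code_point) {
  return SurrogatePair{
      static_cast<uint16_t>(0xD800 + ((code_point - 0x10000) >> 10)),
      static_cast<uint16_t>(0xDC00 + ((code_point - 0x10000) & 0x3FF))};
}

// Inverse of toSurrogates.
constexpr uint32_t fromSurrogates(SurrogatePair p) {
  return 0x10000 + ((static_cast<uint32_t>(p.high - 0xD800) << 10) |
                    static_cast<uint32_t>(p.low - 0xDC00));
}

// Swaps the two bytes of a 16-bit code unit, as done in the diff when
// is_little_endian is false.
constexpr uint16_t swapBytes(uint16_t u) {
  return static_cast<uint16_t>((u >> 8) | (u << 8));
}

// U+1F600: 0x1F600 - 0x10000 = 0xF600
//          0xF600 >> 10   = 0x3D  -> 0xD800 + 0x3D  = 0xD83D
//          0xF600 & 0x3FF = 0x200 -> 0xDC00 + 0x200 = 0xDE00
static_assert(toSurrogates(0x1F600).high == 0xD83D, "high surrogate");
static_assert(toSurrogates(0x1F600).low == 0xDE00, "low surrogate");
static_assert(fromSurrogates(toSurrogates(0x1F600)) == 0x1F600, "round trip");
static_assert(swapBytes(0xD83D) == 0x3DD8, "byte swap");
```

Because 0x10000 is a multiple of 0x400, `code_point & 0x3FF` and `(code_point - 0x10000) & 0x3FF` give the same low 10 bits, which is why both spellings in the diff produce the same low surrogate.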

cpp/fury/util/string_util.h

Lines changed: 3 additions & 1 deletion
@@ -27,4 +27,6 @@ bool isLatin(const std::string &str);
 
 std::string utf16ToUtf8(const std::u16string &utf16, bool is_little_endian);
 
-} // namespace fury
+std::u16string utf8ToUtf16(const std::string &utf8, bool is_little_endian);
+
+} // namespace fury
