feat(c++): Support the UTF-8 to UTF-16 with SIMD (#1990)
## What does this PR do?
Adds support for converting UTF-8 to UTF-16, with the goal of accelerating the conversion using SIMD.
``` c++
std::string utf16ToUtf8(const std::u16string &utf16, bool is_little_endian)
```
The logic of converting UTF-8 to UTF-16 isn't that complicated, but
there are still many optimizations that I haven't worked out yet.
So I'll first land a version that's somewhat faster than the original
one, and then think about how to optimize it further.
Judging from the tests, the logic is correct:
``` text
[----------] 9 tests from UTF8ToUTF16Test
[ RUN ] UTF8ToUTF16Test.BasicConversion
[ OK ] UTF8ToUTF16Test.BasicConversion (0 ms)
[ RUN ] UTF8ToUTF16Test.EmptyString
[ OK ] UTF8ToUTF16Test.EmptyString (0 ms)
[ RUN ] UTF8ToUTF16Test.SurrogatePairs
[ OK ] UTF8ToUTF16Test.SurrogatePairs (0 ms)
[ RUN ] UTF8ToUTF16Test.BoundaryValues
[ OK ] UTF8ToUTF16Test.BoundaryValues (0 ms)
[ RUN ] UTF8ToUTF16Test.SpecialCharacters
[ OK ] UTF8ToUTF16Test.SpecialCharacters (0 ms)
[ RUN ] UTF8ToUTF16Test.LittleEndian
[ OK ] UTF8ToUTF16Test.LittleEndian (0 ms)
[ RUN ] UTF8ToUTF16Test.BigEndian
[ OK ] UTF8ToUTF16Test.BigEndian (0 ms)
[ RUN ] UTF8ToUTF16Test.RoundTripConversion
[ OK ] UTF8ToUTF16Test.RoundTripConversion (0 ms)
```
<img width="264" alt="image"
src="https://github.com/user-attachments/assets/7b9033ad-001f-4a36-a27e-6a8362f3a6df"
/>
Performance also improves compared to serial processing:
<img width="394" alt="image"
src="https://github.com/user-attachments/assets/93379aeb-9080-449e-b889-567dd207f5fc"
/>
Execution is significantly faster than the serial version.
Note that this code does not yet use SIMD instruction sets such as AVX2
or otherwise vectorize the conversion. The main reason is that UTF-8 is
a variable-length encoding: each character occupies 1 to 4 bytes, and
the length is only known after inspecting the lead byte, so the input
has to be analyzed byte by byte. Without a uniform element width, the
input cannot be split directly into fixed-size lanes, which makes it
hard to map the UTF-8 to UTF-16 conversion onto SIMD operations.
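
To illustrate the variable-length problem described above, here is a minimal scalar sketch of the conversion. It is not the code in this PR: `sketchSequenceLength` and `sketchUtf8ToUtf16` are hypothetical names, the output is in host byte order (no `is_little_endian` handling), and the input is assumed to be well-formed UTF-8 (overlong forms and encoded surrogates are not rejected).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: sequence length implied by the lead byte.
// Returns 0 for a continuation byte or an invalid lead byte.
inline std::size_t sketchSequenceLength(unsigned char lead) {
  if (lead < 0x80) return 1;            // 0xxxxxxx
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;
}

// Hypothetical scalar converter, assuming well-formed input.
// Invalid lead bytes and truncated sequences are skipped one byte at a time.
inline std::u16string sketchUtf8ToUtf16(const std::string &utf8) {
  std::u16string out;
  // UTF-16 never needs more code units than the UTF-8 input has bytes.
  out.reserve(utf8.size());
  std::size_t i = 0;
  while (i < utf8.size()) {
    const unsigned char lead = static_cast<unsigned char>(utf8[i]);
    const std::size_t len = sketchSequenceLength(lead);
    if (len == 0 || i + len > utf8.size()) {
      ++i;
      continue;
    }
    // Payload bits of the lead byte: 7, 5, 4, or 3 bits depending on len.
    std::uint32_t cp = (len == 1) ? lead : (lead & (0xFFu >> (len + 1)));
    // Each continuation byte contributes its low 6 bits.
    for (std::size_t k = 1; k < len; ++k) {
      cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3Fu);
    }
    i += len;  // the step size depends on the data, which is what blocks fixed-width lanes
    if (cp < 0x10000) {
      out.push_back(static_cast<char16_t>(cp));
    } else {
      // Code points above the BMP become a high/low surrogate pair.
      cp -= 0x10000;
      out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
      out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
  }
  return out;
}
```

The loop advances by `len`, which is only known after the lead byte has been read, so the position of each character depends on all the characters before it. That data dependency is what a SIMD version would have to work around.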
This PR also includes some code style changes for consistency.
## Related issues
Closes #1964
## Does this PR introduce any user-facing change?
- [x] Does this PR introduce any public API change?
- [ ] Does this PR introduce any binary protocol compatibility change?
## Benchmark