[fix](unicode) resolve truncation problem for Unicode code points above 0xFFFF using a compatible approach #284

Merged 3 commits on Mar 5, 2025

@airborne12 (Member) commented Feb 25, 2025

Related PR: #255

This pull request improves UTF-8 handling in the CLucene library and adds tests to verify the changes. The main changes modify the IndexInput and IndexOutput classes to handle UTF-8 encoding more accurately and add new test cases for UTF-8 characters.

Improvements to UTF-8 encoding handling:

  • src/core/CLucene/store/IndexInput.cpp: Reading of byte sequences now distinguishes incorrectly encoded data from correctly encoded UTF-8, as a temporary solution for handling 4-byte characters.
  • src/core/CLucene/store/IndexOutput.cpp: Writing of byte sequences for 4-byte characters now distinguishes the incorrect encoding from correct UTF-8, also as a temporary solution.
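
For readers unfamiliar with why code points above 0xFFFF need special care, here is a minimal sketch of standard UTF-8 encoding and decoding with the 4-byte form included. It is not the code from this PR: `encodeUtf8`, `decodeUtf8`, and `truncateTo16` are hypothetical helpers written for illustration, and the 16-bit truncation shown is only one way such data loss can arise, not necessarily the exact path in CLucene.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not CLucene's API): encode one Unicode code point as
// standard UTF-8, using the 4-byte form for code points above 0xFFFF.
std::string encodeUtf8(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);  // values outside the Unicode range are a caller error
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: decode one code point starting at `pos` and advance
// `pos` past it. Continuation bytes are not validated in this sketch;
// std::string::at throws std::out_of_range if the input ends early.
std::uint32_t decodeUtf8(const std::string& in, std::size_t& pos) {
    auto next = [&]() -> std::uint32_t {
        return static_cast<unsigned char>(in.at(pos++));
    };
    const std::uint32_t b0 = next();
    if (b0 < 0x80) {
        return b0;
    }
    if ((b0 & 0xE0) == 0xC0) {
        const std::uint32_t b1 = next();
        return ((b0 & 0x1F) << 6) | (b1 & 0x3F);
    }
    if ((b0 & 0xF0) == 0xE0) {
        const std::uint32_t b1 = next();
        const std::uint32_t b2 = next();
        return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
    }
    // Remaining case in this sketch: a 4-byte lead byte (0xF0..0xF7).
    const std::uint32_t b1 = next();
    const std::uint32_t b2 = next();
    const std::uint32_t b3 = next();
    return ((b0 & 0x07) << 18) | ((b1 & 0x3F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F);
}

// Illustration of the failure mode: keeping only the low 16 bits of a code
// point silently turns it into a different character.
std::uint16_t truncateTo16(std::uint32_t cp) {
    return static_cast<std::uint16_t>(cp);
}
```

With these helpers, U+1F600 encodes to the four bytes F0 9F 98 80 and decodes back to the same value, whereas `truncateTo16(0x1F600)` yields 0xF600, a different code point in the private use area. A reader that must stay compatible with data written by an older, incorrect encoder also needs some way to tell the two encodings apart, which is what the bullets above describe.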

Addition of new test cases:

@zzzxl1993 (Collaborator) left a comment

LGTM

@airborne12 airborne12 merged commit d837bb2 into apache:clucene Mar 5, 2025
3 of 4 checks passed
@airborne12 airborne12 deleted the fix-unicode branch March 5, 2025 09:02
airborne12 added a commit to airborne12/doris-thirdparty that referenced this pull request Mar 5, 2025
…ve 0xFFFF using a compatible approach (apache#284)

fix truncate problem and add write flag
airborne12 added a commit to airborne12/doris-thirdparty that referenced this pull request Mar 6, 2025
…ve 0xFFFF using a compatible approach (apache#284)

fix truncate problem and add write flag
airborne12 added a commit that referenced this pull request Mar 6, 2025
* [fix](unicode) fix 4 bytes unicode read and write bug (#255)

* [fix](unicode) fix 4 bytes unicode read and write bug

* [fix](unicode) resolve truncation problem for Unicode code points above 0xFFFF using a compatible approach (#284)

fix truncate problem and add write flag

* [test](unicode) add more ut

* fix ut
airborne12 added a commit to airborne12/doris-thirdparty that referenced this pull request Mar 6, 2025
* [fix](unicode) fix 4 bytes unicode read and write bug (apache#255)

* [fix](unicode) fix 4 bytes unicode read and write bug

* [fix](unicode) resolve truncation problem for Unicode code points above 0xFFFF using a compatible approach (apache#284)

fix truncate problem and add write flag

* [test](unicode) add more ut

* fix ut
airborne12 added a commit that referenced this pull request Mar 6, 2025
* [fix](build) fix build for clucene-2.0

* [fix](unicode) fix 4 bytes unicode read and write bug (#289)

* [fix](unicode) fix 4 bytes unicode read and write bug (#255)

* [fix](unicode) fix 4 bytes unicode read and write bug

* [fix](unicode) resolve truncation problem for Unicode code points above 0xFFFF using a compatible approach (#284)

fix truncate problem and add write flag

* [test](unicode) add more ut

* fix ut
airborne12 added a commit that referenced this pull request Mar 7, 2025
* [fix](unicode) fix 4 bytes unicode read and write bug (#255)

* [fix](unicode) fix 4 bytes unicode read and write bug

* [fix](unicode) resolve truncation problem for Unicode code points above 0xFFFF using a compatible approach (#284)

fix truncate problem and add write flag

* [test](unicode) add more ut

* fix ut
2 participants