Skip to content

fix: do not split "B"-encoded-words at UTF-8 char boundaries#42

Merged
mdecimus merged 1 commit intostalwartlabs:mainfrom
link2xt:link2xt/b-encoded-words-utf8-split
Aug 10, 2025
Merged

fix: do not split "B"-encoded-words at UTF-8 char boundaries#42
mdecimus merged 1 commit intostalwartlabs:mainfrom
link2xt:link2xt/b-encoded-words-utf8-split

Conversation

@link2xt
Copy link
Contributor

@link2xt link2xt commented Jul 27, 2025

Fixes #40

@link2xt
Copy link
Contributor Author

link2xt commented Jul 27, 2025

I also tested that it fixes the problem in chatmail/core#7039

@link2xt
Copy link
Contributor Author

link2xt commented Aug 7, 2025

@mdecimus Could you review this? This is the remaining piece of fixing the issue with long unicode group names in Delta Chat: chatmail/core#7039

let chunk = self.text.as_bytes().get(last_pos..pos).unwrap_or_default();
base64_encode_mime(chunk, &mut output, true)?;

output.write_all(b"?=\r\n\t=?utf-8?B?")?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is some code duplication here, =?utf-8?B? is being written here and also on line 47.
I prefer an approach using an iterator that returns the bytes at their correct offest which are then encoded and wrapped around =?utf-8?B? and ?=.
In fact, I don't think that a custom iterator is needed at all, you could use char_indices (and avoid having to do (ch as i8) >= -0x40), keep track of the offsets and then print the encoded word once you reached the right size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I force-pushed the change, now using char_indices.

@link2xt link2xt force-pushed the link2xt/b-encoded-words-utf8-split branch from 1cfb664 to 069efe1 Compare August 8, 2025 23:39
@mdecimus mdecimus merged commit 177fa27 into stalwartlabs:main Aug 10, 2025
1 check passed
@mdecimus
Copy link
Member

I made some changes to the B-encoding function, can you check that it works for you so I can release?

@link2xt
Copy link
Contributor Author

link2xt commented Aug 11, 2025

I tested with commit 2ec9d02, it works.

Why remove debug_assert though?

-        // There is always a header or continuation whitespace before inline text.
-        debug_assert!(bytes_written > 0);
-

If the function is called with bytes_written equal to 0, it starts the header with \t. This should not happen, so debug_assert ensured that this kind of calls will not be introduced in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Text headers break UTF-8 mid-character

2 participants