Skip to content

Fix TrailingCase::Unchanged handling for Dutch#7863

Merged
Manishearth merged 1 commit intounicode-org:mainfrom
Manishearth:dutch-trailing-case
Apr 10, 2026
Merged

Fix TrailingCase::Unchanged handling for Dutch#7863
Manishearth merged 1 commit intounicode-org:mainfrom
Manishearth:dutch-trailing-case

Conversation

@Manishearth
Copy link
Copy Markdown
Member

Fixes #7681

I couldn't figure out an easy way to keep the dutch handling within full_helper (where all other locale-sensitive stuff lives), since this is the one case that actually affects the titlecasing uppercase-to-lowercase state change.

I think this implementation works nicely. The dutch_i_at_beginning and dutch_ij_pair_at_beginning_count split comes from when I was trying to retain the old code and refactor it to be reusable, but I realized that I could do this without retaining the old code. I can merge these two functions, but I kind of like how it turned out. They're more testable this way.

They're not implemented on CaseMapContext since they run at the beginning of a string.

Changelog

icu_casemapping: Fix TrailingCase::Unchanged handling for Dutch

@Manishearth Manishearth requested a review from sffc April 9, 2026 19:51
@Manishearth Manishearth force-pushed the dutch-trailing-case branch from d7abc2e to cf41d72 Compare April 9, 2026 20:05
Comment on lines 102 to 104
} else {
break;
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code path with dutch_titlecase_count is a little confusing but fine

Optional: I would have expected something more like

Suggested change
} else if i == 0 && is_dutch && is_ij_start(c) {
// pass
} else if i == 1 && is_dutch && is_ij_end(c) && self.titlecase_tail_casing == TrailingCase::Lower {
mapping = MappingKind::Lower;
} else {
break;
}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's a lot more state in between ij_start and ij_end because you have to check for accents and stuff.

_ => return false,
/// Is there an i at the beginning of the string which may be relevant
/// for Dutch titlecasing?
fn dutch_i_at_beginning(s: &'_ str) -> Option<DutchIData<'_>> {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional:

Suggested change
fn dutch_i_at_beginning(s: &'_ str) -> Option<DutchIData<'_>> {
fn dutch_i_at_beginning(s: &'_ str) -> Option<DutchIData<'_>> {
let mut chars = s.chars();
let has_accent = match chars.next() {
Some('i') | Some('I') => match chars.peek() {
Some(ACUTE) => {
chars.next();
true
}
_ => false
},
Some('í') | Some('Í') => true,
_ => return None
}
Some(DutchIData { /* ... */ }
)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered this, and waffled on choice for a while

Firstly, .peek() doesn't work that way, you have to call .peekable() to make an intermediate iterator and then peek.

I think

let peekable = chars.peekable();
match peekable.peek() {
   Some(ACUTE) => peekable.next(), true,
   ...
}

is not very different from having a rest local variable. The main improvement is that my version is a little bit more verbose with the structs, which I don't like but I think it's fine. I like not having to track control flow in my version because each branch returns. Sometimes I wish Rust had field-order construction for local structs....

///
/// In dutch titlecasing mode, the first N characters should be uppercased:
/// `ijabc` should titlecase to `IJabc`.
fn dutch_ij_pair_at_beginning_count(s: &str, mapping: &CaseMap) -> Option<usize> {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion (optional): make a helper function that takes a &str and returns a struct similar to DutchIData but works for either I or J, so you can re-use the logic. It can take const generics for the four magic chars i, I, í, Í

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not reusable logic, the j logic is pretty different

@Manishearth Manishearth merged commit 5a3c265 into unicode-org:main Apr 10, 2026
34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants