@psg-19
Contributor

@psg-19 psg-19 commented Jan 13, 2026

Addresses #6996

Summary:

Adds a new field, right_boundary_property, to support iteration from the end, and implements DoubleEndedIterator for Utf16Indices and Latin1Indices.

  • Added right_boundary_property to grapheme.rs, word.rs, and sentence.rs.
  • Updated the iterator macro in iterator_helper.rs to expose next_back(), so wrapper segmenters can delegate reverse iteration.
  • Implemented next_back() for Utf16Indices (in indices.rs), which constructs Unicode code points from the end of the string backwards while correctly merging surrogate pairs.
  • Implemented next_back() for Latin1Indices (in indices.rs), which yields (byte index, u8 character) pairs starting from the end.
  • Implemented DoubleEndedIterator for RuleBreakIterator, which enables finding segmentation boundaries in reverse, iterating from the end of the text using rule-based logic.
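
For readers unfamiliar with the surrogate-pair merging mentioned above, here is a minimal sketch of reverse UTF-16 decoding. `Utf16Rev` is a hypothetical stand-in written for illustration; it is not the PR's `Utf16Indices` implementation, and it exposes the backward walk through a plain `Iterator` rather than `DoubleEndedIterator` to keep the example short.

```rust
/// Hypothetical illustration type, not the PR's actual `Utf16Indices`.
/// Walks a UTF-16 buffer from the end, yielding (code unit index, code point).
struct Utf16Rev<'a> {
    units: &'a [u16],
    /// Exclusive upper bound of the part of the buffer not yet consumed.
    end: usize,
}

impl<'a> Utf16Rev<'a> {
    fn new(units: &'a [u16]) -> Self {
        Self { units, end: units.len() }
    }
}

impl Iterator for Utf16Rev<'_> {
    type Item = (usize, u32);

    fn next(&mut self) -> Option<(usize, u32)> {
        if self.end == 0 {
            return None;
        }
        let i = self.end - 1;
        let unit = self.units[i];
        // A low (trailing) surrogate directly preceded by a high (leading)
        // surrogate forms a single supplementary-plane code point.
        if (0xDC00..=0xDFFF).contains(&unit) && i > 0 {
            let prev = self.units[i - 1];
            if (0xD800..=0xDBFF).contains(&prev) {
                let cp = 0x10000
                    + ((u32::from(prev) - 0xD800) << 10)
                    + (u32::from(unit) - 0xDC00);
                self.end = i - 1;
                return Some((i - 1, cp));
            }
        }
        // BMP code unit or unpaired surrogate: yield the unit unchanged.
        self.end = i;
        Some((i, u32::from(unit)))
    }
}
```

The index reported for a merged pair is the index of the high surrogate, so that the indices match what a forward pass over the same buffer would report.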

@psg-19 psg-19 marked this pull request as draft January 13, 2026 18:26
@psg-19 psg-19 marked this pull request as ready for review January 14, 2026 12:32
@psg-19 psg-19 requested a review from Manishearth as a code owner January 14, 2026 12:32
Member

@sffc sffc left a comment

Thanks for the work; we've been needing this for a long time. We need to be super careful that this produces correct results.

@makotokato @aethanyc @eggrobin

}

impl<Y: RuleBreakType> DoubleEndedIterator for RuleBreakIterator<'_, '_, Y> {
fn next_back(&mut self) -> Option<Self::Item> {
Member

I think you need to consider more than 1 code point back. See .next()
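
One well-known case where a single code point of context is not enough (possibly one instance of what this comment refers to) is the Regional Indicator rule in UAX #29, GB12/GB13: a break between two Regional Indicators is forbidden only when an odd number of them precede the position. The sketch below is illustrative only; `ri_break_allowed` is a hypothetical helper that ignores every other rule and assumes `pos` lies on a `char` boundary.

```rust
/// Returns true if `c` is a Regional Indicator symbol (U+1F1E6..=U+1F1FF).
fn is_regional_indicator(c: char) -> bool {
    ('\u{1F1E6}'..='\u{1F1FF}').contains(&c)
}

/// Hypothetical helper: is a grapheme break allowed at byte offset `pos`
/// as far as GB12/GB13 are concerned? All other rules are ignored.
/// Assumes `pos` is on a char boundary (slicing panics otherwise).
fn ri_break_allowed(text: &str, pos: usize) -> bool {
    let before = &text[..pos];
    let after = &text[pos..];
    let next_is_ri = after.chars().next().map_or(false, is_regional_indicator);
    // Length of the unbroken run of Regional Indicators ending at `pos`,
    // found by scanning backwards as far as the run extends.
    let run = before
        .chars()
        .rev()
        .take_while(|&c| is_regional_indicator(c))
        .count();
    if next_is_ri && run > 0 {
        // An odd run means `pos` falls in the middle of a flag pair.
        run % 2 == 0
    } else {
        true
    }
}
```

The backward scan is unbounded in principle: the answer at a given position depends on the parity of the whole preceding run, not just on the previous code point.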

/// The iterator over characters.
type IterAttr<'s>: Iterator<Item = (usize, Self::CharType)> + Clone + core::fmt::Debug;
type IterAttr<'s>: Iterator<Item = (usize, Self::CharType)>
+ DoubleEndedIterator
Member

Please add more test coverage; .next() and .next_back() should produce identical breakpoints. Just take a bunch of real test data (which we should already have) and run it both ways and check that the vectors are the same (just reversed).
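
A minimal sketch of the kind of check described here, assuming only that the iterator under test implements `DoubleEndedIterator` and `Clone`. The helper name `reversed_matches_forward` is hypothetical and not part of the crate's test suite.

```rust
/// Hypothetical test helper: runs the iterator forwards and backwards and
/// reports whether the backward result is exactly the forward result reversed.
fn reversed_matches_forward<I>(iter: I) -> bool
where
    I: DoubleEndedIterator + Clone,
    I::Item: PartialEq,
{
    let forward: Vec<I::Item> = iter.clone().collect();
    // `rev()` drives the iterator through `next_back()` only.
    let mut backward: Vec<I::Item> = iter.rev().collect();
    backward.reverse();
    forward == backward
}
```

As a smoke test, a standard-library iterator such as `"a\u{1F600}b".char_indices()` passes this check; in a spec test the same helper would be applied to a segmenter's break iterator for each line of test data.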

@Manishearth Manishearth removed their request for review January 15, 2026 14:10
@psg-19 psg-19 marked this pull request as draft January 15, 2026 17:33
@psg-19
Contributor Author

psg-19 commented Jan 15, 2026

thanks for the review, started working on it

@psg-19
Contributor Author

psg-19 commented Jan 18, 2026

Hi @sffc,

What I added:

  • impl DoubleEndedIterator for LineBreakIterator: enables reverse iteration for LineBreakIterator, fulfilling the requirement to support DoubleEndedIterator across all segmenters. It is currently a stub implementation that delegates to the cache.

  • get_break_property_with_state: simulates the state machine in reverse by scanning backwards to a safe point (currently the start of the text) and re-running the forward state logic to determine the correct break property and backtrack length for the current position. This handles complex look-ahead rules that simple table lookups miss.

  • Added code in spec_test.rs to test on real data, as you suggested.

  • However, my implementation currently fails 4 test cases:

icu4x> cargo test --test spec_test
running 12 tests
test run_grapheme_break_extra_test ... ok
test run_sentence_break_extra_test ... FAILED
test run_sentence_break_test ... FAILED
test run_word_break_extra_test ... ok
test run_grapheme_break_test ... ok
test run_line_break_extra_test ... ok
test run_word_break_test ... ok
test run_sentence_break_random_test ... FAILED
test run_line_break_test ... ok
test run_line_break_random_test ... ok
test run_word_break_random_test ... FAILED
test run_grapheme_break_random_test ... ok

failures:
    run_sentence_break_extra_test
    run_sentence_break_random_test
    run_sentence_break_test
    run_word_break_random_test

test result: FAILED. 8 passed; 4 failed; 0 ignored; 0 measured; 0 filtered out; finished in 2.81s

It's very complex to mimic the logic used in forward iteration. Could you review my current code and give me some insight? Note that the DoubleEndedIterator implementation for LineBreakIterator does not implement full reverse logic with LB9 support.
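
To make the "scan back to a safe point, then re-run the forward logic" idea concrete, here is a toy sketch. It is not the PR's get_break_property_with_state: `next_boundary` and `prev_boundary` are hypothetical functions, and the forward rule (break after every character, but keep Regional Indicators in pairs) is a deliberately tiny stand-in for the real rule tables. The start of the text is used as the safe point.

```rust
/// Returns true if `c` is a Regional Indicator symbol (U+1F1E6..=U+1F1FF).
fn is_ri(c: char) -> bool {
    ('\u{1F1E6}'..='\u{1F1FF}').contains(&c)
}

/// Toy forward rule (hypothetical): returns the first boundary after `start`,
/// breaking after every char except that Regional Indicators stay in pairs.
/// `start` must be a boundary previously produced by this function, or 0.
fn next_boundary(text: &str, start: usize) -> Option<usize> {
    let mut chars = text[start..].chars();
    let first = chars.next()?;
    let mut end = start + first.len_utf8();
    if is_ri(first) {
        if let Some(second) = chars.next() {
            if is_ri(second) {
                end += second.len_utf8();
            }
        }
    }
    Some(end)
}

/// Reverse step by rescanning: re-run the forward rule from the start of the
/// text and return the last boundary strictly before `back`.
fn prev_boundary(text: &str, back: usize) -> Option<usize> {
    if back == 0 {
        return None;
    }
    let mut pos = 0;
    loop {
        let next = next_boundary(text, pos)?;
        if next >= back {
            return Some(pos);
        }
        pos = next;
    }
}
```

Because every backward step replays the forward rule, the result agrees with forward iteration by construction, at the cost of O(n) work per step. A nearer safe point than the start of the text would reduce that cost, but choosing one correctly depends on the rule set.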

@psg-19 psg-19 requested a review from sffc January 18, 2026 10:00
2 participants