# Double Line Break Opportunities Around Spaces in Khmer Text #7218

## Summary

The `icu_segmenter` crate (version 2.1.1) incorrectly places line break opportunities **both before and after** space characters in Khmer text, creating consecutive invalid break points that isolate spaces as standalone segments.

## Environment

- **Crate**: `icu_segmenter` v2.1.1
- **Features**: `compiled_data`
- **Rust version**: rustc 1.91.0
- **Locale**: Khmer (`km`)

## Expected Behavior

Line break opportunities should appear **after** space characters only, following UAX #14. Spaces should remain attached to the preceding word or be consumed as part of the break opportunity, not isolated as standalone segments.

For the text `"អស់ នឹង មាន"`, expected breaks would be:

```
Position 10: after space → "អស់ " | "នឹង"
Position 20: after space → "អស់ នឹង " | "មាន"
```

## Actual Behavior

The segmenter creates **two consecutive break points** around each space character:

```
Position 9:  before space → "អស់" | " នឹង"
Position 10: after space  → "អស់ " | "នឹង"
Position 19: before space → "អស់ នឹង" | " មាន"
Position 20: after space  → "អស់ នឹង " | "មាន"
```

This creates segments where a single space character is isolated between two break opportunities.
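
One way to flag this pattern programmatically is to scan adjacent break offsets for a segment that consists of exactly one space. The helper below is a minimal sketch and not part of `icu_segmenter`; the name `isolated_space_segments` is hypothetical, and it assumes the offsets are sorted byte offsets into `text`, as returned by `segment_str`.

```rust
/// Returns the `(start, end)` byte ranges of segments that consist of a
/// single U+0020 SPACE, i.e. a break directly before and directly after it.
///
/// Hypothetical helper for illustration only. Assumes `breaks` is sorted
/// and that the offsets are byte offsets into `text`.
fn isolated_space_segments(breaks: &[usize], text: &str) -> Vec<(usize, usize)> {
    let bytes = text.as_bytes();
    breaks
        .windows(2)
        // A one-byte segment whose only byte is 0x20 is an isolated space.
        .filter(|w| w[1] == w[0] + 1 && bytes.get(w[0]) == Some(&b' '))
        .map(|w| (w[0], w[1]))
        .collect()
}
```

For the offsets `[0, 9, 10, 19, 20, 29]` and the text `"អស់ នឹង មាន"`, this should report the ranges `(9, 10)` and `(19, 20)`.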

## Reproduction

### Code

```rust
use icu_segmenter::LineSegmenter;
use icu_segmenter::options::LineBreakOptions;
use icu_locale_core::{locale, LanguageIdentifier};

fn main() {
    let khmer = "អស់ នឹង មាន ចន្លោះ។";
    let km_locale: LanguageIdentifier = locale!("km").into();

    let mut options = LineBreakOptions::default();
    options.content_locale = Some(&km_locale);

    let seg = LineSegmenter::new_auto(options);
    let breaks: Vec<usize> = seg.segment_str(khmer).collect();

    println!("Breaks: {:?}", breaks);

    // Show the problem: consecutive positions around spaces
    for window in breaks.windows(2) {
        let segment = &khmer[window[0]..window[1]];
        println!("[{}->{}] '{}'", window[0], window[1], segment);
    }
}
```

### Cargo.toml

```toml
[dependencies]
icu_segmenter = { version = "2.1.1", features = ["compiled_data"] }
icu_locale_core = "2.1.1"
```

### Output

```
Breaks: [0, 9, 10, 19, 20, 29, 30, 48, 52, 61, 70, 71, 80]

[0->9] 'អស់'
[9->10] ' '        ⚠️ ISOLATED SPACE
[10->19] 'នឹង'
[19->20] ' '       ⚠️ ISOLATED SPACE
[20->29] 'មាន'
[29->30] ' '       ⚠️ ISOLATED SPACE
[30->48] 'ចន្លោះ'
[48->52] '។ '
[52->61] 'អស់'
[61->70] 'នឹង'
[70->71] ' '       ⚠️ ISOLATED SPACE
[71->80] 'មានចន្'
```

## Verification Against ICU4C

The official ICU segmentation demo does **NOT** show this behavior:
https://icu4c-demos.unicode.org/icu-bin/icusegments#2/km_KH

Testing the same Khmer text on the ICU4C demo shows correct single break opportunities after spaces, not the double breaks observed in the Rust implementation.

## Scope

This bug affects:

1. **All segmenter modes**: LSTM, Dictionary, and Auto all show identical incorrect behavior
2. **With and without locale**: The bug appears regardless of whether Khmer locale is specified
3. **Real-world impact**: This affects applications like Typst and any other Rust software using `icu_segmenter` for Khmer text

## Additional Observations

- Tested with all three constructors, plus one run with no content locale set:
  - `new_lstm()`: shows the bug
  - `new_dictionary()`: shows the bug
  - `new_auto()`: shows the bug
  - No locale (default options): shows the bug
- The double-break pattern occurs consistently at every space in the text
- The behavior differs from ICU4C, suggesting an issue specific to the Rust port

## Potential Causes

Since ICU4C works correctly but ICU4X doesn't, the issue likely lies in:

1. Break position calculation logic specific to the Rust implementation
2. Post-processing of break iterator results
3. Khmer-specific compiled data generation or loading
4. Incorrect handling of the Unicode Line_Break property for spaces in Khmer context

## Workaround

Users can filter consecutive breaks as a temporary solution:

```rust
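/// Collapses each "break before space, break after space" pair into the single
/// break after the space. `breaks` must be sorted byte offsets into `text`.
fn filter_double_breaks(breaks: &[usize], text: &str) -> Vec<usize> {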
    let mut filtered = Vec::new();
    let mut i = 0;

    while i < breaks.len() {
        let current = breaks[i];

        if i + 1 < breaks.len() {
            let next = breaks[i + 1];

            // Check if segment between current and next is a single space
            if next == current + 1 && text.as_bytes().get(current) == Some(&b' ') {
                filtered.push(next); // Keep break after space, skip before
                i += 2;
                continue;
            }
        }

        filtered.push(current);
        i += 1;
    }

    filtered
}
```
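
A shorter variant of the same idea is sketched below. UAX #14 rule LB7 forbids a break directly before a space, so any reported offset whose next byte is U+0020 can be dropped. This is an illustrative alternative rather than a verified fix: the function name `drop_breaks_before_spaces` is hypothetical, and the sketch assumes that offset `0` (the start of the text) should always be kept.

```rust
/// Drops every break offset that sits directly before a U+0020 SPACE.
///
/// Hypothetical helper for illustration only. `breaks` must be byte offsets
/// into `text`. The byte 0x20 never occurs inside a multi-byte UTF-8
/// sequence, so comparing raw bytes is safe here.
fn drop_breaks_before_spaces(breaks: &[usize], text: &str) -> Vec<usize> {
    let bytes = text.as_bytes();
    breaks
        .iter()
        .copied()
        // Keep the start-of-text offset; otherwise keep only offsets whose
        // next byte is not a space (an offset at the end of text has no next byte).
        .filter(|&pos| pos == 0 || bytes.get(pos) != Some(&b' '))
        .collect()
}
```

Unlike `filter_double_breaks`, this variant does not require the break after the space to be present, and it also removes breaks that fall inside a run of several spaces.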

## Related

- This may be related to how spaces are classified in the Line_Break property for Southeast Asian scripts
- Similar patterns might exist for Thai, Lao, or Myanmar if they use similar space handling

## Request

Please investigate and fix the double-break issue for Khmer (and potentially other Southeast Asian languages). This bug causes incorrect line breaking behavior that affects text layout and rendering quality.

