Double Line Break Opportunities Around Spaces in Khmer Text #7218

@kenton-r

Summary

The icu_segmenter crate (version 2.1.1) places line break opportunities both before and after space characters in Khmer text. The break before each space is invalid, and the resulting pair of adjacent break points isolates the space as a standalone segment.

Environment

  • Crate: icu_segmenter v2.1.1
  • Features: compiled_data
  • Rust version: rustc 1.91.0
  • Locale: Khmer (km)

Expected Behavior

Per UAX #14, a line break opportunity should appear only after a space, never before it: rule LB7 forbids breaking before a space and rule LB18 allows a break after one. A space should therefore stay attached to the preceding word (or be consumed as part of the break opportunity) rather than being isolated as a standalone segment.

For the text "អស់ នឹង មាន", the expected breaks are (positions are UTF-8 byte offsets):

Position 10: after space → "អស់ " | "នឹង"
Position 20: after space → "អស់ នឹង " | "មាន"
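
The offsets above follow from UTF-8 arithmetic: each Khmer code point in this text occupies 3 bytes and each space 1 byte. A minimal sketch that reproduces them, assuming only the "break after a space" expectation (this is not an implementation of UAX #14, and `offsets_after_spaces` is a hypothetical helper, not part of icu_segmenter):

```rust
/// Byte offsets immediately after each ASCII space (U+0020) in `text`.
/// Hypothetical helper for illustration; it models only the
/// "break after a space" expectation, not the full UAX #14 rules.
fn offsets_after_spaces(text: &str) -> Vec<usize> {
    text.char_indices()
        // Keep only the positions of space characters.
        .filter(|&(_, c)| c == ' ')
        // The break opportunity sits just past the space.
        .map(|(i, c)| i + c.len_utf8())
        .collect()
}
```

Calling `offsets_after_spaces("អស់ នឹង មាន")` yields offsets 10 and 20, matching the two positions listed above.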

Actual Behavior

The segmenter creates two consecutive break points around each space character:

Position 9:  before space → "អស់" | " នឹង"
Position 10: after space  → "អស់ " | "នឹង"
Position 19: before space → "អស់ នឹង" | " មាន"
Position 20: after space  → "អស់ នឹង " | "មាន"

This creates segments where a single space character is isolated between two break opportunities.
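
One way to flag the pattern programmatically is to scan adjacent break offsets for segments that are exactly one space. This is a minimal sketch; `isolated_spaces` is a hypothetical helper, and the break list is assumed to be sorted byte offsets such as those returned by `segment_str`:

```rust
/// Returns the (start, end) byte ranges of segments that consist of
/// exactly one ASCII space. Hypothetical helper for illustration.
fn isolated_spaces(text: &str, breaks: &[usize]) -> Vec<(usize, usize)> {
    breaks
        .windows(2)
        // `str::get` returns None for out-of-range or non-boundary
        // offsets, so malformed input is skipped instead of panicking.
        .filter(|w| text.get(w[0]..w[1]) == Some(" "))
        .map(|w| (w[0], w[1]))
        .collect()
}
```

With the text "អស់ នឹង មាន" and the breaks 0, 9, 10, 19, 20, 29 (the four reported positions plus the assumed start and end offsets), this reports the ranges (9, 10) and (19, 20). With the expected breaks 0, 10, 20, 29 it reports nothing.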

Reproduction

Code

use icu_segmenter::LineSegmenter;
use icu_segmenter::options::LineBreakOptions;
use icu_locale_core::{locale, LanguageIdentifier};

fn main() {
    let khmer = "អស់ នឹង មាន ចន្លោះ។";
    let km_locale: LanguageIdentifier = locale!("km").into();

    let mut options = LineBreakOptions::default();
    options.content_locale = Some(&km_locale);

    let seg = LineSegmenter::new_auto(options);
    let breaks: Vec<usize> = seg.segment_str(khmer).collect();

    println!("Breaks: {:?}", breaks);

    // Show the problem: consecutive positions around spaces
    for window in breaks.windows(2) {
        let segment = &khmer[window[0]..window[1]];
        println!("[{}->{}] '{}'", window[0], window[1], segment);
    }
}

Cargo.toml

[dependencies]
icu_segmenter = { version = "2.1.1", features = ["compiled_data"] }
icu_locale_core = "2.1.1"

Output

Breaks: [0, 9, 10, 19, 20, 29, 30, 48, 52, 61, 70, 71, 80]

[0->9] 'អស់'
[9->10] ' '        ⚠️ ISOLATED SPACE
[10->19] 'នឹង'
[19->20] ' '       ⚠️ ISOLATED SPACE
[20->29] 'មាន'
[29->30] ' '       ⚠️ ISOLATED SPACE
[30->48] 'ចន្លោះ'
[48->52] '។ '
[52->61] 'អស់'
[61->70] 'នឹង'
[70->71] ' '       ⚠️ ISOLATED SPACE
[71->80] 'មានចន្'

Verification Against ICU4C

The official ICU segmentation demo does NOT show this behavior:
https://icu4c-demos.unicode.org/icu-bin/icusegments#2/km_KH

Testing the same Khmer text on the ICU4C demo shows correct single break opportunities after spaces, not the double breaks observed in the Rust implementation.

Scope

This bug affects:

  1. All segmenter modes: LSTM, Dictionary, and Auto all show identical incorrect behavior
  2. With and without locale: The bug appears regardless of whether Khmer locale is specified
  3. Real-world impact: This affects applications like Typst and any other Rust software using icu_segmenter for Khmer text

Additional Observations

  • Tested four configurations:

    • new_lstm() - Shows bug
    • new_dictionary() - Shows bug
    • new_auto() - Shows bug
    • No locale (default) - Shows bug
  • The double-break pattern occurs consistently at every space in the text

  • The behavior differs from ICU4C, suggesting an issue specific to the Rust port

Potential Causes

Since ICU4C works correctly but ICU4X doesn't, the issue likely lies in:

  1. Break position calculation logic specific to the Rust implementation
  2. Post-processing of break iterator results
  3. Khmer-specific compiled data generation or loading
  4. Incorrect handling of the Unicode Line_Break property for spaces in Khmer context

Workaround

Users can filter consecutive breaks as a temporary solution:

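/// Collapses each pair of breaks that surrounds a single ASCII space,
/// keeping only the break after the space. Offsets are UTF-8 byte offsets.
fn filter_double_breaks(breaks: &[usize], text: &str) -> Vec<usize> {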
    let mut filtered = Vec::new();
    let mut i = 0;

    while i < breaks.len() {
        let current = breaks[i];

        if i + 1 < breaks.len() {
            let next = breaks[i + 1];

            // Check if segment between current and next is a single space
            if next == current + 1 && text.as_bytes().get(current) == Some(&b' ') {
                filtered.push(next); // Keep break after space, skip before
                i += 2;
                continue;
            }
        }

        filtered.push(current);
        i += 1;
    }

    filtered
}
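
A shorter variant of the same idea is to drop any break that lands directly before a space, which is what LB7 forbids. This is a sketch under the assumption that a break before a space is never wanted except at offset 0; `drop_breaks_before_spaces` is a hypothetical name, not an icu_segmenter API:

```rust
/// Drops every break offset that lands directly before an ASCII space.
/// Hypothetical alternative to the pairwise filter above.
fn drop_breaks_before_spaces(text: &str, breaks: &[usize]) -> Vec<usize> {
    let bytes = text.as_bytes();
    breaks
        .iter()
        .copied()
        // Offset 0 is kept so the start-of-text boundary survives even
        // when the text begins with a space. Byte 0x20 never occurs
        // inside a multi-byte UTF-8 sequence, so a byte check is safe.
        .filter(|&off| off == 0 || bytes.get(off) != Some(&b' '))
        .collect()
}
```

Unlike the pairwise filter, this also handles runs of several spaces, because every break inside the run is followed by another space and is dropped.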

Related

  • This may be related to how spaces are classified in the Line_Break property for Southeast Asian scripts
  • Similar patterns might exist for Thai, Lao, or Myanmar if they use similar space handling

Request

Please investigate and fix the double-break issue for Khmer (and potentially other Southeast Asian languages). This bug causes incorrect line breaking behavior that affects text layout and rendering quality.

Metadata

    Labels

    C-segmentation (Component: Segmentation), T-bug (Type: Bad behavior, security, privacy)
