PyICU example was wrong: PyICU iterators don't work on Python strings #1

@behdad

On the front page, the code claiming ICU is buggy is doing it wrong. ICU returns boundaries as UTF-16 or UTF-8 offsets, not the "character" (code point) indices that Python string slicing expects. Here is a fix:

diff --git a/test.py b/test.py
index d040782..1318933 100644
--- a/test.py
+++ b/test.py
@@ -1,12 +1,13 @@
 import icu
 def iterate_breaks(text, break_iterator):
+    text = icu.UnicodeString(text)
     break_iterator.setText(text)
     lastpos = 0
     while True:
         next_boundary = break_iterator.nextBoundary()
         print(next_boundary)
         if next_boundary == -1: return
-        yield text[lastpos:next_boundary]
+        yield str(text[lastpos:next_boundary])
         lastpos = next_boundary
 bi = icu.BreakIterator.createCharacterInstance(icu.Locale.getRoot())
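
To see why the original slicing goes wrong, here is a minimal sketch that does not need PyICU at all. It assumes the boundaries are counted in UTF-16 code units, and the boundary list is written by hand rather than produced by a break iterator. The helper names `slice_by_utf16_offsets` and `split_at` are made up for this illustration and are not part of PyICU.

```python
def slice_by_utf16_offsets(text, start, end):
    """Slice a Python str using offsets counted in UTF-16 code units."""
    # UTF-16-LE without a BOM uses exactly 2 bytes per code unit.
    encoded = text.encode("utf-16-le")
    return encoded[2 * start:2 * end].decode("utf-16-le")


def split_at(text, boundaries, slicer):
    """Cut text at each boundary, using the given slicing function."""
    pieces, last = [], 0
    for boundary in boundaries:
        pieces.append(slicer(text, last, boundary))
        last = boundary
    return pieces


# 'a', one emoji outside the BMP (2 UTF-16 code units), then 'b'.
text = "a\U0001F600b"

# Hand-written boundaries in UTF-16 code units (an assumption for this demo).
utf16_boundaries = [1, 3, 4]

# Plain str slicing treats the offsets as code point indices and misaligns.
naive = split_at(text, utf16_boundaries, lambda s, i, j: s[i:j])

# Slicing in UTF-16 space lines up with the boundaries.
fixed = split_at(text, utf16_boundaries, slice_by_utf16_offsets)

print(naive)  # ['a', '😀b', '']
print(fixed)  # ['a', '😀', 'b']
```

Wrapping the text in `icu.UnicodeString`, as in the diff above, gets the same effect by letting ICU do the slicing in its own index space.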
