Bug when transliterating Unicode digraphs #202

@TDavLinguist

I have been working on a Sgaw Karen [ksw] to Thai [tha] language pair, and as part of that project I want to build a transliterator between the three+ Sgaw Karen orthographies. Using https://user.keio.ac.jp/~kato/SgawKarenRomei.pdf as a guide, I wrote a small test .dix, shown below:

<dictionary>
 <alphabet>abcdefghijklmnopqrstuvwxyz ဖခဘ ံ ိ ၣ်</alphabet>
 <sdefs/>
 <section id="consonants" type="inconditional">
   <e><p><l>hp</l><r>ဖ</r></p></e>
   <e><p><l>hk</l><r>ခ</r></p></e>
   <e><p><l>b</l><r>ဘ</r></p></e>
 </section>
 <section id="vowels" type="inconditional">
   <e><p><l>i</l><r>ံ</r></p></e>   <!-- U+1036, bytes: e1 80 b6 -->
   <e><p><l>o</l><r>ိ</r></p></e>   <!-- U+102D, bytes: e1 80 ad -->
   <e><p><l>a</l><r></r></p></e>   <!-- empty output is okay -->
   <e><p><l>f</l><r>ၣ်</r></p></e>
 </section>
</dictionary>

It appears to compile to a .bin with lt-comp without errors:

lt-comp lr rom-test.dix rom-test.bin
consonants@inconditional 4 5
vowels@inconditional 3 5

and lt-expand shows the correct mapping:

lt-expand rom-test.dix 
hp:ဖ
hk:ခ
b:ဘ
i:ံ
o:ိ
a:
f:ၣ်

However, when testing with lt-proc -t I get incorrect output:

printf 'hpi hpi\n' | lt-proc -t ./rom-test.bin
ဖi ဖi

(Expected output: ဖံ ဖံ)
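To make the expectation concrete, here is a minimal Python sketch of greedy longest-match transliteration over the pairs in rom-test.dix. It only models the output I expect; it is not how lt-proc is implemented, and the name `transliterate` is made up for this illustration.

```python
# Reference sketch only: greedy longest-match over the pairs in rom-test.dix.
# This models the expected output; it says nothing about lt-proc internals.
PAIRS = {
    "hp": "ဖ",
    "hk": "ခ",
    "b": "ဘ",
    "i": "ံ",
    "o": "ိ",
    "a": "",      # empty output, as in the .dix
    "f": "ၣ်",
}


def transliterate(text: str) -> str:
    """Replace the longest matching left-hand side at each position."""
    longest = max(len(key) for key in PAIRS)
    out = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first so that "hp" wins over a bare "h".
        for size in range(longest, 0, -1):
            chunk = text[pos:pos + size]
            if chunk in PAIRS:
                out.append(PAIRS[chunk])
                pos += len(chunk)
                break
        else:
            # No entry matches here: pass the character through unchanged.
            out.append(text[pos])
            pos += 1
    return "".join(out)


print(transliterate("hpi hpi"))  # consonant followed by vowel sign, twice
```

With this model, "hpi" becomes the consonant followed directly by the vowel sign, which is the behaviour I expected from lt-proc -t.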

It seems that none of the vowels is transliterated after a consonant, but a vowel on its own, or several vowels in succession, transliterate just fine:

printf 'i i\n' | lt-proc -t ./rom-test.bin
ံ

To be sure, I piped the output of the first command through hexdump, which confirmed that the 'i' passes through unchanged. So it seems to be a compilation problem, not a Unicode problem (or is it a compilation problem stemming from a Unicode problem?):
printf 'hpi hpi\n' | lt-proc -t ./rom-test.bin | hexdump -C
00000000 e1 80 96 69 20 e1 80 96 69 0a |...i ...i.|
0000000a
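As a cross-check on the dump, the following Python sketch decodes the bytes copied from the hexdump line above and compares them with the bytes I would expect. The expected string assumes ဖ is U+1016 (consistent with e1 80 96 in the dump) and uses U+1036 for the vowel, as noted in the .dix comment.

```python
# Bytes copied from the hexdump line in this report.
raw = bytes.fromhex("e1 80 96 69 20 e1 80 96 69 0a")
decoded = raw.decode("utf-8")

# List each decoded character as a code point.
code_points = [f"U+{ord(ch):04X}" for ch in decoded]
print(code_points)
# ['U+1016', 'U+0069', 'U+0020', 'U+1016', 'U+0069', 'U+000A']

# What the output bytes would be if "i" had been replaced by U+1036.
expected = "\u1016\u1036 \u1016\u1036\n".encode("utf-8")
print(expected.hex(" "))
# e1 80 96 e1 80 b6 20 e1 80 96 e1 80 b6 0a
```

The consonant is encoded correctly, and U+0069 is the plain ASCII 'i', so the vowel entry was never applied rather than being mangled on output.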

**Update**
Interestingly, the vowels transliterate without issue if there is a space between them and the digraphs:

echo "hk i" | lt-proc -t rom-test.bin
ခ ံ

Of course, the consonant and vowel need to be adjacent (ခံ). Single-letter (non-digraph) consonants do not show the problem:

echo "bi" | lt-proc -t rom-test.bin
ဘံ

Any help would be appreciated!
