講笑話 - a context-sensitive phrase #55

@henrymcl

Describe the bug

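講笑話 can be segmented either as 講笑/話 ("to joke" + the quoting verb "say") or as 講/笑話 ("to tell" + "joke"). Which one is right depends on context, but the 講笑/話 split is chosen even when nothing follows 話: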

>>> import pycantonese
>>> pycantonese.characters_to_jyutping('笑話')
[('笑話', 'siu3waa2')]
>>> pycantonese.characters_to_jyutping('佢好鐘意講笑話佢女朋友肥')
[('佢', 'keoi5'), ('好', 'hou2'), ('鐘意', 'zung1ji3'), ('講笑', 'gong2siu3'), ('話', 'waa6'), ('佢', 'keoi5'), ('女朋友', 'neoi2pang4jau5'), ('肥', 'fei4')]
>>> pycantonese.characters_to_jyutping('佢好鐘意講笑話')
[('佢', 'keoi5'), ('好', 'hou2'), ('鐘意', 'zung1ji3'), ('講笑', 'gong2siu3'), ('話', 'waa6')]
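
The output above is what a greedy left-to-right longest-match segmenter would produce, which may be the cause here. The following is a minimal, self-contained sketch of that technique and not pycantonese's actual code: the function `longest_match_segment` and the word list `TOY_LEXICON` are hypothetical, and whether pycantonese segments this way is an assumption (the `wordseg` entry in the pip list below hints at it, but nothing in this report confirms it).

```python
def longest_match_segment(text, lexicon, max_word_length=5):
    """Greedy left-to-right longest-match segmentation (toy version)."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches;
        # a single character is always accepted as the fallback.
        for length in range(min(max_word_length, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy lexicon for illustration; not pycantonese's real word list.
TOY_LEXICON = {"鐘意", "講笑", "笑話", "女朋友"}

print(longest_match_segment("佢好鐘意講笑話", TOY_LEXICON))
# ['佢', '好', '鐘意', '講笑', '話']
```

Because 講笑 is matched at 講 before 笑話 can ever be tried at 笑, a purely greedy pass has no way to prefer 講/笑話 at the end of a sentence.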

Expected behavior

>>> pycantonese.characters_to_jyutping('佢好鐘意講笑話')
[('佢', 'keoi5'), ('好', 'hou2'), ('鐘意', 'zung1ji3'), ('講', 'gong2'), ('笑話', 'siu3waa2')]

There probably aren't many other similar cases (perhaps none at all), since this one involves the quoting word 話 interacting with 講笑 in multiple ways.
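
If this really is an isolated case, one possible stopgap is a post-processing rule on the returned pairs. The sketch below is only an illustration under a strong assumption: that a clause-final 話 cannot be the quoting word, because nothing follows for it to quote. The helper `fix_gong_siu_waa` is hypothetical and not part of pycantonese, and it ignores trailing punctuation.

```python
def fix_gong_siu_waa(pairs):
    """Re-split a final 講笑 + 話 into 講 + 笑話 (hypothetical heuristic)."""
    fixed = list(pairs)
    # Only rewrite when 話 is the very last token: with nothing after it,
    # it is assumed not to be introducing a quoted clause.
    if (
        len(fixed) >= 2
        and fixed[-2] == ("講笑", "gong2siu3")
        and fixed[-1] == ("話", "waa6")
    ):
        fixed[-2:] = [("講", "gong2"), ("笑話", "siu3waa2")]
    return fixed


# Input copied from the actual output reported above.
reported = [("佢", "keoi5"), ("好", "hou2"), ("鐘意", "zung1ji3"),
            ("講笑", "gong2siu3"), ("話", "waa6")]
print(fix_gong_siu_waa(reported))
```

Applied to the longer sentence above, where 話 is followed by 佢女朋友肥, the rule leaves the 講笑/話 split untouched.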

System information:

System: Ubuntu 22.04.5 LTS

Output from pip list:

certifi            2025.10.5
charset-normalizer 3.4.3
idna               3.10
pip                22.0.2
pycantonese        3.4.0
pylangacq          0.16.2
python-dateutil    2.9.0.post0
requests           2.32.5
setuptools         59.6.0
six                1.17.0
tabulate           0.9.0
urllib3            2.5.0
wcwidth            0.2.14
wordseg            0.0.2
