Conversation

Contributor

Copilot AI commented Jan 6, 2026

The word_tokenize() function hangs indefinitely on LLM-generated text with repetitive patterns due to exponential complexity in the maximum matching algorithm.
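
For intuition, here is a minimal, self-contained sketch (hypothetical toy data, not code from this PR or from multi_cut.py) of why repetitive text is pathological: when a dictionary contains overlapping entries, the number of valid segmentations grows exponentially with input length, so any tokenizer that enumerates candidate segmentations can effectively hang.

from functools import lru_cache

# Toy dictionary with overlapping entries; every substring of "ban" is a word.
TOY_DICT = {"b", "a", "n", "ba", "an", "ban"}

def count_segmentations(text: str) -> int:
    """Count how many ways `text` splits into dictionary words."""
    @lru_cache(maxsize=None)
    def count(i: int) -> int:
        if i == len(text):
            return 1
        return sum(
            count(j)
            for j in range(i + 1, len(text) + 1)
            if text[i:j] in TOY_DICT
        )
    return count(0)

for k in (2, 4, 8, 16):
    s = "ban" * k
    print(len(s), count_segmentations(s))  # 4**k candidates: exponential growth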

Changes

  • Fork multi_cut.py from PyThaiNLP - Implements maximum matching with a safe mode (based on PyThaiNLP PR #302)

    • Chunks texts longer than 140 characters into ~120-character segments at natural break points (see the sketch after this list)
    • Prevents exponential blowup on high-ambiguity inputs
  • Update word_tokenize() - Uses the new implementation with safe_mode=True by default

    • Backward compatible: pass safe_mode=False for the original behavior
  • Add test coverage - Validates the fix with the problematic input from the issue
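
A rough sketch of the chunking strategy mentioned above, assuming the approach of PyThaiNLP PR #302; the constants, helper names, and break-point heuristic are illustrative, not the actual forked code:

# Hypothetical sketch of safe-mode chunking; not the real multi_cut.py.
MAX_SAFE_LEN = 140  # inputs longer than this get chunked
TARGET_LEN = 120    # aim for segments of roughly this size

def _chunk(text):
    """Split long text into ~120-char pieces, cutting at spaces when possible."""
    chunks = []
    while len(text) > MAX_SAFE_LEN:
        cut = text.rfind(" ", 0, TARGET_LEN)  # natural break point near target
        if cut <= 0:
            cut = TARGET_LEN                  # no space found: hard cut
        chunks.append(text[:cut])
        text = text[cut:].lstrip()
    if text:
        chunks.append(text)
    return chunks

def mm_safe(text, segment):
    """Run a maximum-matching segmenter on each bounded chunk and concatenate."""
    tokens = []
    for chunk in _chunk(text):
        tokens.extend(segment(chunk))
    return tokens

Because each chunk's length is bounded, the worst-case search per chunk is bounded too, and total runtime grows roughly linearly with input length.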

Example

from laonlp.tokenize import word_tokenize

# This text previously hung indefinitely
text = "ທ່ານ scott beaumont ປະທານ google ອາຊີ ອາຊີ - ປາຊີຟິກ ຕີລາຄາສູງບັນດາ..."
result = word_tokenize(text)  # Now completes in ~5s

# Disable safe mode if needed (not recommended)
result = word_tokenize(text, safe_mode=False)
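
The trade-off: a word that straddles a chunk boundary could be segmented differently than an exhaustive search over the whole input would segment it, which is presumably why chunks are cut at natural break points; in exchange, the worst case is bounded instead of exponential.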

Performance

  • Problematic text: previously hung indefinitely, now completes in ~5 s
  • Normal text: no impact
Original prompt

This section details the original issue you should resolve

<issue_title>word_tokenize stuck</issue_title>
<issue_description>I was trying to tokenize some sentences generated by an LLM. Here is an example of what I did:

from laonlp.tokenize import word_tokenize

a = "ທ່ານ scott beaumont ປະທານ google ອາຊີ ອາຊີ - ປາຊີຟິກ ຕີລາຄາສູງບັນດາເສດຖະກິດເສດຖະກິດ - ໃນສະພາບສໍາຄັນຂອງໂລກສາມາດ ນໍາ ຕີລາຄາສູງຫຼາຍ; ວ່າມີບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາເສດຖະກິດເສດຖະກິດ, ແລະ ຍູ້ແຮງບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາການເຄື່ອນໄຫວ ແລະ ຫວຽດນາມ."

print(word_tokenize(a))

The function got stuck and never returned a result. There might be an infinite loop inside the function.</issue_description>

<agent_instructions>Fork https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/tokenize/multi_cut.py and use PyThaiNLP/pythainlp#302 to fix the bug</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@wannaphong Hello! I think it is the same issue as https://github.com/PyThaiNLP/pythainlp/pull/302. I may fork `mm` into laonlp and add a tokenizer safe mode to avoid the issue.


Copilot AI changed the title from "[WIP] Fix word_tokenize being stuck on large input" to "Fix word_tokenize infinite loop on long repetitive text" on Jan 6, 2026
Copilot AI requested a review from wannaphong January 6, 2026 15:20
@wannaphong wannaphong marked this pull request as ready for review January 6, 2026 15:20
@wannaphong wannaphong merged commit 9c58cbd into master Jan 6, 2026
6 of 8 checks passed
@wannaphong wannaphong deleted the copilot/fix-word-tokenize-issue branch January 6, 2026 15:22