Conversation

Contributor

Copilot AI commented Jan 6, 2026

The word_tokenize() function hangs indefinitely on LLM-generated text with repetitive patterns due to exponential complexity in the maximum matching algorithm.
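
For intuition, here is a minimal, self-contained sketch (hypothetical toy data, not code from this PR or from multi_cut.py) of why repetitive text is pathological: when a dictionary contains overlapping entries, the number of valid segmentations grows exponentially with input length, so any tokenizer that enumerates candidate segmentations can effectively hang.

from functools import lru_cache

# Toy dictionary with overlapping entries; every substring of "ban" is a word.
TOY_DICT = {"b", "a", "n", "ba", "an", "ban"}

def count_segmentations(text: str) -> int:
    """Count how many ways `text` splits into dictionary words."""
    @lru_cache(maxsize=None)
    def count(i: int) -> int:
        if i == len(text):
            return 1
        return sum(
            count(j)
            for j in range(i + 1, len(text) + 1)
            if text[i:j] in TOY_DICT
        )
    return count(0)

for k in (2, 4, 8, 16):
    s = "ban" * k
    print(len(s), count_segmentations(s))  # 4**k candidates: exponential growth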

Changes

  • Fork multi_cut.py from PyThaiNLP - Implements maximum matching with a safe mode (based on PyThaiNLP PR #302)

    • Chunks texts longer than 140 characters into ~120-character segments at natural break points (see the sketch after this list)
    • Prevents exponential blowup on high-ambiguity inputs
  • Update word_tokenize() - Uses the new implementation with safe_mode=True by default

    • Backward compatible: pass safe_mode=False for the original behavior
  • Add test coverage - Validates the fix with the problematic input from the issue
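
A rough sketch of the chunking strategy mentioned above, assuming the approach of PyThaiNLP PR #302; the constants, helper names, and break-point heuristic are illustrative, not the actual forked code:

# Hypothetical sketch of safe-mode chunking; not the real multi_cut.py.
MAX_SAFE_LEN = 140  # inputs longer than this get chunked
TARGET_LEN = 120    # aim for segments of roughly this size

def _chunk(text):
    """Split long text into ~120-char pieces, cutting at spaces when possible."""
    chunks = []
    while len(text) > MAX_SAFE_LEN:
        cut = text.rfind(" ", 0, TARGET_LEN)  # natural break point near target
        if cut <= 0:
            cut = TARGET_LEN                  # no space found: hard cut
        chunks.append(text[:cut])
        text = text[cut:].lstrip()
    if text:
        chunks.append(text)
    return chunks

def mm_safe(text, segment):
    """Run a maximum-matching segmenter on each bounded chunk and concatenate."""
    tokens = []
    for chunk in _chunk(text):
        tokens.extend(segment(chunk))
    return tokens

Because each chunk's length is bounded, the worst-case search per chunk is bounded too, and total runtime grows roughly linearly with input length.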

Example

from laonlp.tokenize import word_tokenize

# This text previously hung indefinitely
text = "ທ່ານ scott beaumont ປະທານ google ອາຊີ ອາຊີ - ປາຊີຟິກ ຕີລາຄາສູງບັນດາ..."
result = word_tokenize(text)  # Now completes in ~5s

# Disable safe mode if needed (not recommended)
result = word_tokenize(text, safe_mode=False)
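
The trade-off: a word that straddles a chunk boundary could be segmented differently than an exhaustive search over the whole input would segment it, which is presumably why chunks are cut at natural break points; in exchange, the worst case is bounded instead of exponential.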

Performance

  • Problematic text: previously hung indefinitely, now completes in ~5 s
  • Normal text: no impact
Original prompt

This section details the original issue you should resolve

<issue_title>word_tokenize stuck</issue_title>
<issue_description>I was trying to tokenize some sentences generated by an LLM. Here is an example of what I did:

from laonlp.tokenize import word_tokenize

a = "ທ່ານ scott beaumont ປະທານ google ອາຊີ ອາຊີ - ປາຊີຟິກ ຕີລາຄາສູງບັນດາເສດຖະກິດເສດຖະກິດ - ໃນສະພາບສໍາຄັນຂອງໂລກສາມາດ ນໍາ ຕີລາຄາສູງຫຼາຍ; ວ່າມີບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາເສດຖະກິດເສດຖະກິດ, ແລະ ຍູ້ແຮງບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາບັນດາການເຄື່ອນໄຫວ ແລະ ຫວຽດນາມ."

print(word_tokenize(a))

The function got stuck and never returned a result. There might be an infinite loop inside the function.</issue_description>

<agent_instructions>Fork https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/tokenize/multi_cut.py and use PyThaiNLP/pythainlp#302 to fix the bug</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@wannaphong Hello! I think it is the same issue as https://github.com/PyThaiNLP/pythainlp/pull/302. I may fork `mm` into laonlp and add a tokenizer safe mode to avoid the issue.


Copilot AI changed the title from "[WIP] Fix word_tokenize being stuck on large input" to "Fix word_tokenize infinite loop on long repetitive text" on Jan 6, 2026
Copilot AI requested a review from wannaphong January 6, 2026 15:20
@wannaphong wannaphong marked this pull request as ready for review January 6, 2026 15:20
@wannaphong wannaphong merged commit 9c58cbd into master Jan 6, 2026
6 of 8 checks passed
@wannaphong wannaphong deleted the copilot/fix-word-tokenize-issue branch January 6, 2026 15:22