
Corpus not loading in Pyodide #52

@taviowong

Describe the bug
Pyodide is a Python distribution that runs Python packages in the browser. pycantonese currently cannot run in Pyodide because it uses multi-threading when loading corpus data.

To reproduce

  1. Go to the online REPL at https://pyodide.org/en/stable/console.html
  2. Run the following script
>>> import micropip
>>> await micropip.install('setuptools')
>>> await micropip.install('pycantonese')
>>> import pycantonese
>>> pycantonese.segment('但願人長久,千裡共嬋娟')
  3. An error is thrown: "RuntimeError: can't start new thread". The full stack trace follows.
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/lib/python3.12/site-packages/pycantonese/parsing.py", line 170, in parse_text
    _get_utterance(sent, segment_kwargs, pos_tag_kwargs, participant)
  File "/lib/python3.12/site-packages/pycantonese/parsing.py", line 56, in _get_utterance
    words, tags, jps = _parse_text(unparsed_sent, segment_kwargs, pos_tag_kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.12/site-packages/pycantonese/parsing.py", line 27, in _parse_text
    chars_jps = characters_to_jyutping(text, **(segment_kwargs or {}))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.12/site-packages/pycantonese/jyutping/characters.py", line 101, in characters_to_jyutping
    words_to_jyutping, chars_to_jyutping = _get_words_characters_to_jyutping()
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.12/site-packages/pycantonese/jyutping/characters.py", line 14, in _get_words_characters_to_jyutping
    corpus = hkcancor()
             ^^^^^^^^^^
  File "/lib/python3.12/site-packages/pycantonese/corpus.py", line 396, in hkcancor
    reader = _HKCanCorReader.from_dir(data_dir)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.12/site-packages/pylangacq/chat.py", line 187, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.12/site-packages/pylangacq/chat.py", line 1057, in from_dir
    return cls.from_files(
           ^^^^^^^^^^^^^^^
  File "/lib/python3.12/site-packages/pylangacq/chat.py", line 187, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.12/site-packages/pylangacq/chat.py", line 1005, in from_files
    strs = list(executor.map(_open_file, paths))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python312.zip/concurrent/futures/_base.py", line 608, in map
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python312.zip/concurrent/futures/thread.py", line 179, in submit
    self._adjust_thread_count()
  File "/lib/python312.zip/concurrent/futures/thread.py", line 202, in _adjust_thread_count
    t.start()
  File "/lib/python312.zip/threading.py", line 992, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread

Expected behavior
The sentence can be segmented without error:
['但願', '人', '長久', ',', '千', '裡', '共', '嬋娟']

System (please complete the following information):

  • Operating System: N/A
  • PyCantonese version: 3.4.0

Additional context

  • Pyodide does not support multi-threading (source: https://pyodide.org/en/stable/usage/wasm-constraints.html#included-but-not-working-modules).
  • Allowing pycantonese to run in the browser via JavaScript would open up many opportunities (e.g. Cantonese-themed web apps, browser extensions).
  • The _HKCanCorReader.from_dir() function supports disabling multi-threading using parallel=False. Preliminary testing shows that pycantonese works in Pyodide with multi-threading disabled.
  • Could you add an environment variable that overrides this argument, so that pycantonese can load properly in Pyodide?
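
To make the request concrete, here is a minimal sketch of how such an override might look. The variable name `PYCANTONESE_PARALLEL` and the helper `use_parallel` are hypothetical; neither exists in pycantonese today. The sketch also assumes it would be acceptable to default to no threading when `sys.platform` is `"emscripten"`, which is the value Pyodide reports.

```python
import os
import sys

# Hypothetical variable name; pycantonese does not define it as of this report.
ENV_VAR = "PYCANTONESE_PARALLEL"

# Values treated as "disable multi-threading" (an assumption for this sketch).
_FALSE_VALUES = {"0", "false", "no", "off"}


def use_parallel(environ=None, platform=None):
    """Decide whether corpus files should be read with a thread pool."""
    environ = os.environ if environ is None else environ
    platform = sys.platform if platform is None else platform

    value = environ.get(ENV_VAR)
    if value is not None:
        # An explicit setting always wins over platform detection.
        return value.strip().lower() not in _FALSE_VALUES

    # Pyodide reports sys.platform == "emscripten", where threads cannot start.
    return platform != "emscripten"


# Inside hkcancor(), the result could then be forwarded, for example:
#     reader = _HKCanCorReader.from_dir(data_dir, parallel=use_parallel())
```

With something like this in place, a Pyodide user could either rely on the platform check or set `os.environ["PYCANTONESE_PARALLEL"] = "0"` before the first call that loads the corpus.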
