Description
Describe the bug
Pyodide is a tool for running Python packages in the browser. Currently, pycantonese cannot run in Pyodide because it uses multi-threading when loading corpus data.
To reproduce
- Go to the online REPL at https://pyodide.org/en/stable/console.html
- Run the following script
>>> import micropip
>>> await micropip.install('setuptools')
>>> await micropip.install('pycantonese')
>>> import pycantonese
>>> pycantonese.segment('但願人長久,千裡共嬋娟')
- An error is thrown: "RuntimeError: can't start new thread". The full stack trace follows.
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/lib/python3.12/site-packages/pycantonese/parsing.py", line 170, in parse_text
_get_utterance(sent, segment_kwargs, pos_tag_kwargs, participant)
File "/lib/python3.12/site-packages/pycantonese/parsing.py", line 56, in _get_utterance
words, tags, jps = _parse_text(unparsed_sent, segment_kwargs, pos_tag_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lib/python3.12/site-packages/pycantonese/parsing.py", line 27, in _parse_text
chars_jps = characters_to_jyutping(text, **(segment_kwargs or {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lib/python3.12/site-packages/pycantonese/jyutping/characters.py", line 101, in characters_to_jyutping
words_to_jyutping, chars_to_jyutping = _get_words_characters_to_jyutping()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lib/python3.12/site-packages/pycantonese/jyutping/characters.py", line 14, in _get_words_characters_to_jyutping
corpus = hkcancor()
^^^^^^^^^^
File "/lib/python3.12/site-packages/pycantonese/corpus.py", line 396, in hkcancor
reader = _HKCanCorReader.from_dir(data_dir)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lib/python3.12/site-packages/pylangacq/chat.py", line 187, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/lib/python3.12/site-packages/pylangacq/chat.py", line 1057, in from_dir
return cls.from_files(
^^^^^^^^^^^^^^^
File "/lib/python3.12/site-packages/pylangacq/chat.py", line 187, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/lib/python3.12/site-packages/pylangacq/chat.py", line 1005, in from_files
strs = list(executor.map(_open_file, paths))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lib/python312.zip/concurrent/futures/_base.py", line 608, in map
fs = [self.submit(fn, *args) for args in zip(*iterables)]
^^^^^^^^^^^^^^^^^^^^^^
File "/lib/python312.zip/concurrent/futures/thread.py", line 179, in submit
self._adjust_thread_count()
File "/lib/python312.zip/concurrent/futures/thread.py", line 202, in _adjust_thread_count
t.start()
File "/lib/python312.zip/threading.py", line 992, in start
_start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
Expected behavior
The sentence can be segmented without error:
['但願', '人', '長久', ',', '千', '裡', '共', '嬋娟']
System (please complete the following information):
- Operating System: N/A
- PyCantonese version: 3.4.0
Additional context
- Pyodide does not support multi-threading (source: https://pyodide.org/en/stable/usage/wasm-constraints.html#included-but-not-working-modules).
- Allowing pycantonese to run in JavaScript/the browser would open up many opportunities (e.g. Cantonese-themed web apps, browser extensions).
- The _HKCanCorReader.from_dir() function supports disabling multi-threading with parallel=False. Preliminary testing shows that pycantonese works in Pyodide with multi-threading disabled.
- Could you add an environment variable that overrides this argument so that pycantonese can load properly in Pyodide?
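To make the request concrete, here is a minimal sketch of how such an override could be resolved. The variable name PYCANTONESE_PARALLEL and the helper use_parallel() are hypothetical and not part of pycantonese. The fallback assumes that Pyodide reports sys.platform as "emscripten".

```python
import os
import sys

# Hypothetical variable name, used for illustration only.
ENV_VAR = "PYCANTONESE_PARALLEL"


def use_parallel(env=None):
    """Decide whether corpus loading should use multi-threading."""
    env = os.environ if env is None else env
    override = env.get(ENV_VAR)
    if override is not None:
        # An explicit setting wins: "0", "false", "no", "off" disable threading.
        return override.strip().lower() not in {"0", "false", "no", "off"}
    # No override: assume threads are unavailable on Emscripten (Pyodide).
    return sys.platform != "emscripten"


# Possible call site (not executed here):
#     reader = _HKCanCorReader.from_dir(data_dir, parallel=use_parallel())
```

With a default like this, Pyodide users would not need to set anything, and the variable would remain available as an explicit switch in other environments.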