Description
Hi there,
I am the maintainer of xoscar and xorbits. xoscar is a lightweight actor programming framework that enables inter-process and inter-node communication. We use ucx-py to accelerate communication. We had no issues before, but recently ucx-py has been consistently reporting the following error.
It seems that some asyncio tasks have not finished?
Exception in callback <bound method BlockingMode._fd_reader_callback of <ucp.continuous_ucx_progress.BlockingMode object at 0x71df9c35c910>>
handle: <Handle BlockingMode._fd_reader_callback>
Traceback (most recent call last):
File "uvloop/cbhandles.pyx", line 61, in uvloop.loop.Handle._run
File "/home/xor/.conda/envs/xor/lib/python3.11/site-packages/ucp/continuous_ucx_progress.py", line 85, in _fd_reader_callback
assert self.asyncio_task is None or self.asyncio_task.done()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
Since this is only an assert statement, I commented out the line. With the assert removed, the entire program runs but reports another error:
Task was destroyed but it is pending!
task: <Task pending name='Task-102' coro=<BlockingMode._arm_worker() running at /fs/fast/share/pingtai_cc/envs/cudf/lib/python3.11/site-packages/ucp/continuous_ucx_progress.py:110> wait_for=<_SyncSocketReaderFuture pending cb=[Task.task_wakeup()]>>
In terms of communication and computation performance across compute nodes, ucx-py is now slightly slower than unixsocket. Previously, before this error appeared, ucx-py was faster than unixsocket.
This part feels difficult to debug. Are there any clues to help with debugging?