Implement WorkerDisconnectNotification (WDN) Protocol#497
magniloquency wants to merge 12 commits into finos:main
Conversation
- Add WorkerDisconnectNotification struct to Cap'n Proto definition.
- Implement WorkerDisconnectNotification Python class and register it in PROTOCOL.
- Update worker protocol documentation.
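As a rough sketch of what that registration pattern can look like (the `PROTOCOL` registry shape, field layout, and `new_msg` constructor here are assumptions for illustration, not the actual Scaler code):

```python
from dataclasses import dataclass

# Hypothetical registry: stands in for the project's PROTOCOL mapping of
# message types to their Python classes.
PROTOCOL: dict = {}

def register(cls: type) -> type:
    """Register a message class in PROTOCOL under its class name."""
    PROTOCOL[cls.__name__] = cls
    return cls

@register
@dataclass(frozen=True)
class WorkerDisconnectNotification:
    """Sent by a worker to tell the scheduler it is shutting down gracefully."""
    worker: bytes  # identity of the departing worker

    @classmethod
    def new_msg(cls, worker: bytes) -> "WorkerDisconnectNotification":
        return cls(worker=worker)

msg = WorkerDisconnectNotification.new_msg(b"worker-1")
```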
Force-pushed from 728ea31 to 0cbfc00.
- Update Scheduler and WorkerController to handle WorkerDisconnectNotification (WDN)
- Update Worker to send WDN upon graceful shutdown
- Update protocol documentation for WDN
- Apply formatting fixes (black/isort) to modified files
Force-pushed from 0cbfc00 to 99848a7.
Did you test with the case where ...? Now, you can switch the value ...

```python
self._loop.add_signal_handler(signal.SIGINT, lambda: asyncio.ensure_future(self.__graceful_shutdown()))
```

```python
async def __graceful_shutdown(self):
    await self._connector_external.send(DisconnectRequest.new_msg(self.identity))
```
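For context, the pattern under review can be sketched roughly like this; the connector is a recording stand-in, and the shutdown coroutine is invoked directly rather than by delivering a real SIGINT, so the sketch stays self-contained:

```python
import asyncio
import signal

class RecordingConnector:
    """Stand-in for the worker's external connector; records sent messages."""
    def __init__(self) -> None:
        self.sent: list = []

    async def send(self, message) -> None:
        self.sent.append(message)

class Worker:
    def __init__(self, identity: bytes) -> None:
        self.identity = identity
        self._connector_external = RecordingConnector()

    def install_signal_handlers(self, loop: asyncio.AbstractEventLoop) -> None:
        try:
            # On SIGINT, schedule the graceful-shutdown coroutine on the loop.
            loop.add_signal_handler(
                signal.SIGINT,
                lambda: asyncio.ensure_future(self.graceful_shutdown()),
            )
        except NotImplementedError:
            pass  # loop.add_signal_handler is Unix-only

    async def graceful_shutdown(self) -> None:
        # Notify the scheduler; a one-way notification, no reply expected.
        await self._connector_external.send(
            ("WorkerDisconnectNotification", self.identity)
        )

async def main() -> list:
    worker = Worker(b"worker-1")
    worker.install_signal_handlers(asyncio.get_running_loop())
    await worker.graceful_shutdown()  # trigger the shutdown path directly
    return worker._connector_external.sent

sent = asyncio.run(main())
```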
Are we not sending WorkerDisconnectNotification when receiving signals?
I changed it back to sending when the worker exits, as it's just a notification now and no response is expected. For YMQ I found that it sometimes gets a ConnectorSocketClosedByRemoteEnd exception, so I've added a check for that, but for some reason it seems to hang sometimes (difficult to reproduce; it seems kind of random). One time I also got a segmentation fault during the test_graph_fail test. But I'm not sure if these are related to my changes?
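The check described (a best-effort send that tolerates the scheduler having already closed the socket) might look like the following; the exception class here is a local stand-in for YMQ's ConnectorSocketClosedByRemoteEnd, and the helper name is hypothetical:

```python
import asyncio

class ConnectorSocketClosedByRemoteEnd(Exception):
    """Local stand-in for the YMQ exception of the same name."""

async def send_disconnect_notification(connector, message) -> bool:
    """Best-effort send: returns False if the remote end is already gone.

    Since WDN is a one-way notification, a closed socket is not an error;
    the scheduler will clean the worker up via its usual timeout path.
    """
    try:
        await connector.send(message)
        return True
    except ConnectorSocketClosedByRemoteEnd:
        return False

class ClosedConnector:
    """Simulates a peer that already closed the socket."""
    async def send(self, message):
        raise ConnectorSocketClosedByRemoteEnd()

delivered = asyncio.run(send_disconnect_notification(ClosedConnector(), "wdn"))
```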
I cannot reproduce the hang or the SegFault. Please try to reproduce on the main branch on your machine and give at least some traces.
> for some reason it seems to hang sometimes

Would be nice to have test cases it hangs on.

> But I'm not sure if these are related to my changes?

Likely not; the SegFault is definitely not.
It seems like test_graph_fail and test_graph_error cause it to hang or segfault (I've seen both), but it doesn't always happen; sometimes there's no error.
I just saw test_send_object hang. The hangs usually happen when exiting/quitting.
I have attached gdb to the frozen process; here are the 10 most recent stack frames:

```
#0  0x00007fe6c1f1872d in syscall () from /usr/lib/libc.so.6
#1  0x00007fe6c0edf000 in std::__atomic_futex_unsigned_base::_M_futex_wait_until (this=<optimized out>, __addr=0x55f483ca0b60, __val=2147483648, __has_timeout=<optimized out>, __s=...,
    __ns=...) at /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/futex.cc:122
#2  0x00007fe6bfd42b49 in std::__atomic_futex_unsigned<2147483648u>::_M_load_and_test_until (this=0x55f483ca0b60, __assumed=0, __operand=1, __equal=true,
    __mo=std::memory_order::acquire, __has_timeout=false, __s=std::chrono::duration = { 0s }, __ns=std::chrono::duration = { 0ns }) at /usr/include/c++/15.2.1/bits/atomic_futex.h:111
#3  0x00007fe6bfd41636 in std::__atomic_futex_unsigned<2147483648u>::_M_load_and_test (this=0x55f483ca0b60, __assumed=0, __operand=1, __equal=true, __mo=std::memory_order::acquire)
    at /usr/include/c++/15.2.1/bits/atomic_futex.h:160
#4  0x00007fe6bfd3f2a0 in std::__atomic_futex_unsigned<2147483648u>::_M_load_when_equal (this=0x55f483ca0b60, __val=1, __mo=std::memory_order::acquire)
    at /usr/include/c++/15.2.1/bits/atomic_futex.h:218
#5  std::__future_base::_State_baseV2::wait (this=0x55f483ca0b50) at /usr/include/c++/15.2.1/future:362
#6  0x00007fe6bfd41512 in std::__basic_future<void>::wait (this=0x7ffcabc784d0) at /usr/include/c++/15.2.1/future:725
#7  0x00007fe6bfd3de02 in scaler::ymq::IOContext::removeIOSocket (this=0x55f47e3d9630, socket=std::shared_ptr<scaler::ymq::IOSocket> (empty) = {...})
    at /home/xxx/work/opengris-scaler/src/cpp/scaler/ymq/io_context.cpp:94
#8  0x00007fe6bfce63a1 in PyIOSocket_dealloc (self=0x7fe6abe72f40) at /home/xxx/work/opengris-scaler/src/cpp/scaler/ymq/pymod_ymq/io_socket.h:40
#9  0x00007fe6c22fb85d in _Py_DECREF (op=<optimized out>) at ./Include/object.h:500
#10 _Py_XDECREF (op=<optimized out>) at ./Include/object.h:567
```
It seems like IOContext::removeIOSocket() is hanging for some reason
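Frames #5 through #7 show the dealloc path blocking on a std::future wait. Purely as an illustration of that failure mode (not the actual fix), the Python analogue is a wait on a future that nobody will ever fulfil; a timeout turns the silent hang into a visible error:

```python
import concurrent.futures

# A future that no thread will ever complete, analogous to removeIOSocket
# waiting on an IO thread that is no longer servicing requests.
orphan: concurrent.futures.Future = concurrent.futures.Future()

try:
    # Without a timeout this wait blocks forever, matching the observed hang;
    # with one, the stall surfaces as a TimeoutError instead.
    orphan.result(timeout=0.1)
    timed_out = False
except concurrent.futures.TimeoutError:
    timed_out = True
```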
maybe #445 is related? total guess
Likely not. In fact that issue addressed a possible cause of hanging.
Inspecting the log, I see

```
#7 0x00007fe6bfd3de02 in scaler::ymq::IOContext::removeIOSocket (this=0x55f47e3d9630, socket=std::shared_ptr<scaler::ymq::IOSocket> (empty) = {...})
```

which is not valid, because the assumption is that the argument must be a shared_ptr that is not empty.
I ran test_send_object 500 times and it didn't fail. Is there more info you can share?
@magniloquency please resolve this and open an issue about what's left (if you are handling it in the misc-fixes branch then there's no need for an issue).
Force-pushed from 71f0b7e to 832597e.
Force-pushed from 832597e to 0a84c8f.
Closes #465
Overview
This PR introduces the `WorkerDisconnectNotification` (WDN) message to the Scaler protocol. This allows workers to proactively notify the scheduler when they are shutting down gracefully, enabling immediate state cleanup on the scheduler side without waiting for timeouts or heartbeat failures.

Changes
1. Protocol Definition
- Added the `WorkerDisconnectNotification` struct to `message.capnp`.
- Implemented the `WorkerDisconnectNotification` class in `src/scaler/protocol/python/message.py` and registered it.
- Updated `src/scaler/protocol/worker.md` to include the new WDN message specification.

2. Worker Implementation
- Updated the `Worker` class to send a `WorkerDisconnectNotification` message to the scheduler at the end of its `run` method, ensuring the scheduler is informed before the process terminates.
- Removed the `__graceful_shutdown` async handler in favor of a unified shutdown sequence that ensures the notification is dispatched.

3. Scheduler Implementation
- Updated `Scheduler.py` to recognize the `WorkerDisconnectNotification` message and route it to the appropriate controller.
- Implemented `on_disconnect_notification` in `WorkerController` (and its mixin) to process the notification and trigger the immediate disconnection and cleanup of the worker state.
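The scheduler-side flow can be sketched as follows; apart from `WorkerDisconnectNotification` and `on_disconnect_notification`, every class and method name here is a simplified stand-in, not the real Scaler API:

```python
class WorkerDisconnectNotification:
    """Simplified WDN carrying only the worker identity."""
    def __init__(self, worker: bytes) -> None:
        self.worker = worker

class WorkerController:
    """Stand-in controller tracking connected workers."""
    def __init__(self) -> None:
        self.workers: set = {b"worker-1", b"worker-2"}

    def on_disconnect_notification(self, notification: WorkerDisconnectNotification) -> None:
        # Clean up immediately rather than waiting for a heartbeat timeout.
        self.workers.discard(notification.worker)

class Scheduler:
    """Stand-in scheduler that routes incoming messages to controllers."""
    def __init__(self, worker_controller: WorkerController) -> None:
        self._worker_controller = worker_controller

    def on_receive(self, message) -> None:
        # Route WDN messages to the worker controller; other message
        # types would be dispatched to their own controllers here.
        if isinstance(message, WorkerDisconnectNotification):
            self._worker_controller.on_disconnect_notification(message)

controller = WorkerController()
Scheduler(controller).on_receive(WorkerDisconnectNotification(b"worker-1"))
```

The point of routing through the controller is that the worker's state is removed as soon as the notification arrives, instead of lingering until a heartbeat timeout.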