Skip to content

Branch for YMQ integration to Scaler#430

Merged
sharpener6 merged 36 commits intofinos:mainfrom
gxuu:nov-ymq-integrate
Jan 6, 2026
Merged

Branch for YMQ integration to Scaler#430
sharpener6 merged 36 commits intofinos:mainfrom
gxuu:nov-ymq-integrate

Conversation

@gxuu
Copy link
Contributor

@gxuu gxuu commented Nov 14, 2025

READY FOR REVIEW.

The test cases when using tcp_zmq as SCALER_NETWORK_BACKEND is passing, after fixing a bug in OSS.

There are still two test cases that are not passed when SCALER_NETWORK_BACKEND is ymq. They likely are caused by the same underlying bug that I couldn't find.

There are still two test cases that are not passed when SCALER_NETWORK_BACKEND is ymq. They likely are caused by the same underlying bug that I couldn't find.

There is a bug in the worker side. The worker, when shutdown requested through signal handle, is sending a DisconnectRequest. It should wait for DisconnectResponse afterwards. This was never done. The old version of the code exits immediately after sending DisconnectRequest.

This results to hang in the Scheduler side. The Scheduler, upon receiving DisconnectRequest from downstream, tries to send DisconnectResponse back. The Scheduler equipped with ymq connectors will stuck on statement:

    async def on_disconnect(self, worker_id: WorkerID, request: DisconnectRequest):
        await self.__disconnect_worker(request.worker)
        await self._binder.send(worker_id, DisconnectResponse.new_msg(request.worker)) # <-- HERE

and no other recv/send operation can be completed due to the interface's implementation. Apparently, we don't have this problem when zmq connectors equipped. This is because their implementation contains internal polling in the interface side. We don't have it.

But the culprit is the worker not waiting on DisconnectResponse. Hence, this really is a bug in scaler itself, but was hidden by zmq connectors.

@gxuu gxuu marked this pull request as draft November 14, 2025 16:09
@gxuu gxuu force-pushed the nov-ymq-integrate branch 2 times, most recently from 79bf3e2 to 0332c19 Compare November 15, 2025 00:34
@gxuu gxuu marked this pull request as ready for review November 15, 2025 00:37
@gxuu gxuu requested review from rafa-be and sharpener6 November 17, 2025 19:15
@gxuu gxuu force-pushed the nov-ymq-integrate branch from 5d48fc1 to d998bda Compare November 19, 2025 01:55
@finos finos deleted a comment from linux-foundation-easycla bot Nov 19, 2025
gxuu and others added 16 commits November 24, 2025 22:38
Signed-off-by: gxu <georgexu420@163.com>
- It should cleanup by removing the connection and talk back to user
- Also, add a safety net for removeIOSocket

Signed-off-by: gxu <georgexu420@163.com>
Signed-off-by: gxu <georgexu420@163.com>
Signed-off-by: gxu <georgexu420@163.com>
Signed-off-by: gxu <georgexu420@163.com>

Reformat 2

Signed-off-by: gxu <georgexu420@163.com>

Reformat 3

Signed-off-by: gxu <georgexu420@163.com>

Reformat 4

Signed-off-by: gxu <georgexu420@163.com>
Signed-off-by: gxu <georgexu420@163.com>
Co-authored-by: rafa-be <raphael@noisycamp.com>
Signed-off-by: gxu <georgexu420@163.com>
Co-authored-by: rafa-be <raphael@noisycamp.com>
Co-authored-by: rafa-be <raphael@noisycamp.com>
Signed-off-by: gxu <georgexu420@gmail.com>
Signed-off-by: gxu <georgexu420@gmail.com>
This reverts commit d2f03ef.
Signed-off-by: gxu <georgexu420@gmail.com>
Signed-off-by: gxu <georgexu420@163.com>
Signed-off-by: gxu <georgexu420@163.com>
@gxuu gxuu force-pushed the nov-ymq-integrate branch from e09b168 to 60d18cc Compare November 24, 2025 14:42
@gxuu gxuu requested a review from sharpener6 December 15, 2025 21:15
@gxuu gxuu marked this pull request as draft January 5, 2026 20:21
Signed-off-by: gxu <georgexu420@163.com>
sharpener6 and others added 5 commits January 6, 2026 09:18
@gxuu gxuu marked this pull request as ready for review January 6, 2026 15:15
@sharpener6 sharpener6 merged commit 8f046ec into finos:main Jan 6, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants