Description
Problem:
With ZMQ_HEARTBEAT_IVL and ZMQ_HEARTBEAT_TIMEOUT enabled (e.g. ROUTER/DEALER over TCP), the process can crash with:
Assertion failed: !_io_error (src/stream_engine_base.cpp:316)
Cause: in_event_internal() sets _io_error = true and removes the fd from the poll set when the receive pipe hits backpressure (e.g. RCVHWM) or on other input-stop paths. The I/O thread’s poller can still deliver a POLLOUT and call out_event() before the engine is torn down. out_event() asserts !_io_error, so the process aborts. This is a race between teardown and a stale/speculative out_event callback.
Reproduction is more likely when the application stops reading from the socket (e.g. under load or in a “stuck” state): the receive pipe fills, backpressure sets _input_stopped then _io_error, and the poller may still invoke out_event().
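The backpressure step can be illustrated with a toy model. This is a minimal sketch, not libzmq code: `toy_pipe_t`, `toy_input_t` and `stalls_without_reader` are hypothetical names, and the pipe is reduced to a bounded queue whose limit stands in for RCVHWM. It only shows how a reader that stops draining the pipe causes the input side to flag itself as stopped.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical stand-in for a receive pipe bounded by a high-water mark.
struct toy_pipe_t
{
    explicit toy_pipe_t (std::size_t hwm_) : hwm (hwm_) { assert (hwm_ > 0); }

    // Returns false when the pipe is full, mirroring a write that would block.
    bool write (const std::string &msg_)
    {
        if (queue.size () >= hwm)
            return false;
        queue.push_back (msg_);
        return true;
    }

    // Application-side read; returns false when nothing is queued.
    bool read (std::string *msg_)
    {
        if (queue.empty ())
            return false;
        *msg_ = queue.front ();
        queue.pop_front ();
        return true;
    }

    std::size_t hwm;
    std::deque<std::string> queue;
};

// Hypothetical stand-in for the engine's input side.
struct toy_input_t
{
    bool input_stopped = false;

    // Push one decoded message; stop input once the pipe refuses it.
    void deliver (toy_pipe_t &pipe_, const std::string &msg_)
    {
        if (input_stopped)
            return;
        if (!pipe_.write (msg_))
            input_stopped = true;
    }
};

// Deliver count_ messages while nobody reads, and report whether the
// input side ended up stopped.
bool stalls_without_reader (std::size_t hwm_, std::size_t count_)
{
    toy_pipe_t pipe (hwm_);
    toy_input_t input;
    for (std::size_t i = 0; i < count_; ++i)
        input.deliver (pipe, "msg");
    return input.input_stopped;
}
```

In this model the input side stays healthy as long as the number of undelivered messages does not exceed the limit, and stops on the first message past it. In the real engine that stopped state is what later leads to _io_error being set.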
Solution:
In stream_engine_base.cpp, in out_event(), replace the assert with an early return so that when _io_error is already set we no-op and let teardown proceed:
    void zmq::stream_engine_base_t::out_event ()
    {
        if (_io_error)
            return;
        // ... rest unchanged
    }

That is, remove the line zmq_assert (!_io_error); from out_event(). Whenever _io_error is true, the correct behavior is to not run the rest of out_event(); the assert encoded an invariant that this race violates.
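To make the event ordering concrete, here is a minimal sketch of the race and the guard in isolation. It is a toy model and not the libzmq implementation: `toy_engine_t`, its fields, and `stale_out_event_is_ignored` are hypothetical, and the real out_event() does far more than bump a counter. The model assumes the sequence described above: backpressure stops input, the input handler drops the fd and sets the error flag, and a late output callback still arrives.

```cpp
#include <cassert>

// Hypothetical miniature of the engine state discussed above; not libzmq code.
struct toy_engine_t
{
    bool input_stopped = false;
    bool io_error = false;
    bool fd_registered = true;
    int writes_done = 0;

    // Models backpressure from the receive pipe.
    void stop_input () { input_stopped = true; }

    // Models the input path: once input is stopped, drop the fd and
    // flag the error so that teardown can follow.
    void in_event ()
    {
        if (input_stopped) {
            fd_registered = false;
            io_error = true;
        }
    }

    // Guarded output path: a late callback after io_error is a no-op.
    // Returns true if output work was done.
    bool out_event ()
    {
        if (io_error)
            return false;
        ++writes_done;
        return true;
    }
};

// Replays the sequence: backpressure, in_event, then a stale out_event.
// Returns true if the stale callback was ignored.
bool stale_out_event_is_ignored ()
{
    toy_engine_t engine;
    engine.stop_input ();
    engine.in_event ();
    const bool did_work = engine.out_event ();
    return engine.io_error && !did_work && engine.writes_done == 0;
}
```

With an assert in place of the early return, the same sequence would abort the process at the stale out_event() call; with the guard, the call does nothing and the error state is left for teardown to handle.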
Environment:
- libzmq version: 4.3.4 / 4.3.5 (and current master)
- OS: Linux (Debian-based); also reported on other platforms in #4364 ("【zmq 4.3.4】pub-sub(tcp protocol) mode crash by _io_error")
- Protocol: TCP, ROUTER (server) with many DEALER clients, heartbeat enabled
Steps to reproduce:
See #4364 (PUB/SUB with small heartbeat timeout). In our case: ROUTER with ZMQ_HEARTBEAT_IVL and ZMQ_HEARTBEAT_TIMEOUT set; multiple DEALER clients; stop calling recv on the ROUTER for several seconds (simulating overload). The receive pipe hits HWM, backpressure triggers the path that sets _io_error, and the assertion in out_event() can fire.
Expected result:
No crash. When _io_error is set, out_event() should return immediately; teardown continues without aborting.