Description
I'm still working on a reliable repro, but I wanted to raise this sooner rather than later. Under some conditions, a rogue kernel can push Jupyter Server into a state where it exhausts all available file descriptors. The culprit appears to be the nudge function, which accumulates file descriptors without cleaning them up.
lsof -p $(pgrep -f jupyter-lab) | awk '{print $5}' | sort | uniq -c | sort -nr
1031 a_inode
432 IPv4
197 REG
4 unix
3 CHR
2 FIFO
2 DIR
1 TYPE
1 sock
1 IPv6
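For anyone who wants to watch the same breakdown over time without shelling out to lsof, here is a rough Python sketch. It assumes a Linux host where /proc/<pid>/fd is readable; the function names are made up for illustration, and the categories are coarser than what lsof reports (all sockets are lumped together, as are regular files, directories and character devices).

```python
import os
from collections import Counter


def classify_fd_target(target: str) -> str:
    """Map a /proc/<pid>/fd symlink target to a coarse descriptor kind."""
    if target.startswith("anon_inode:"):
        return "a_inode"  # eventfd, epoll, ...; lsof reports these as a_inode
    if target.startswith("socket:"):
        return "socket"   # lsof splits these further into IPv4, IPv6, unix
    if target.startswith("pipe:"):
        return "FIFO"
    return "file"         # regular files, directories, character devices


def tally_fd_targets(targets) -> Counter:
    """Count descriptors per kind from an iterable of symlink targets."""
    return Counter(classify_fd_target(t) for t in targets)


def read_fd_targets(pid: int) -> list:
    """Read the symlink target of every open descriptor of a process (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    targets = []
    for name in os.listdir(fd_dir):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            continue  # descriptor was closed while we were listing
    return targets
```

Calling `tally_fd_targets(read_fd_targets(pid))` every few minutes and logging the result should show whether the a_inode and socket counts keep climbing.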
[
{
"id": "11421a39-5726-49f0-9b24-7ee6c1a02190",
"name": "python3",
"last_activity": "2025-03-21T20:42:09.082165Z",
"execution_state": "idle",
"connections": 26715
},
{
"id": "a54d1764-8b84-4ea0-907a-27dc64289436",
"name": "python310",
"last_activity": "2025-03-21T20:42:07.464654Z",
"execution_state": "idle",
"connections": 26656
},
{
"id": "3a53b14e-e44b-4f0f-ad44-6e5e801d8923",
"name": "python310",
"last_activity": "2025-03-21T20:45:46.949561Z",
"execution_state": "idle",
"connections": 26622
},
{
"id": "982322be-1070-4612-bf59-07d5f4702ade",
"name": "scala",
"last_activity": "2025-03-20T23:59:47.957756Z",
"execution_state": "idle",
"connections": 26665
},
{
"id": "8ba2719f-ec00-4ffe-a868-14ec6339e2d6",
"name": "spark33-scala",
"last_activity": "2025-03-20T23:59:40.736254Z",
"execution_state": "idle",
"connections": 29112
}
]
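A small sketch for spotting this condition from a kernel listing shaped like the one above. It assumes the listing is available as JSON text; the helper name and the threshold of 1000 are arbitrary choices for illustration, not anything defined by Jupyter Server. The sample data is two entries copied from the listing.

```python
import json


def kernels_over_threshold(kernels, threshold=1000):
    """Return (id, name, connections) for kernels above threshold, highest first."""
    flagged = [
        (k["id"], k["name"], k["connections"])
        for k in kernels
        if k.get("connections", 0) > threshold
    ]
    return sorted(flagged, key=lambda item: item[2], reverse=True)


# Two entries copied from the listing above.
SAMPLE = json.loads("""
[
  {"id": "11421a39-5726-49f0-9b24-7ee6c1a02190", "name": "python3",
   "execution_state": "idle", "connections": 26715},
  {"id": "8ba2719f-ec00-4ffe-a868-14ec6339e2d6", "name": "spark33-scala",
   "execution_state": "idle", "connections": 29112}
]
""")

for kernel_id, name, connections in kernels_over_threshold(SAMPLE):
    print(f"{name} ({kernel_id}): {connections} connections")
```

An idle kernel reporting tens of thousands of connections is a strong hint that connections are being counted but never released.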
The rogue kernel here is the Scala kernel: the buildup starts there and ultimately affects the Python kernels as well.
This is the intermediate state:
2025-03-21 04:34:06.580115500 [W 2025-03-21 04:34:06.580 ServerApp] WebSocket ping timeout after 90000 ms.
2025-03-21 04:34:07.540819500 [W 2025-03-21 04:34:07.540 ServerApp] Nudge: attempt 550 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:08.866015500 [W 2025-03-21 04:34:08.865 ServerApp] Replacing stale connection: 8ba2719f-ec00-4ffe-a868-14ec6339e2d6:baf135be-1cf9-4ba7-b2d0-8164caa4a2a9
2025-03-21 04:34:08.867461500 [I 2025-03-21 04:34:08.867 ServerApp] Adapting from protocol version 5.4 (kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6) to 5.3 (client).
2025-03-21 04:34:08.869069500 [I 2025-03-21 04:34:08.869 ServerApp] Connecting to kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6.
2025-03-21 04:34:09.155150500 [W 2025-03-21 04:34:09.155 ServerApp] Nudge: attempt 370 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:11.047986500 [W 2025-03-21 04:34:11.047 ServerApp] Nudge: attempt 740 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:11.308202500 [W 2025-03-21 04:34:11.308 ServerApp] Nudge: attempt 190 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:12.555067500 [W 2025-03-21 04:34:12.554 ServerApp] Nudge: attempt 560 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:13.380044500 [W 2025-03-21 04:34:13.379 ServerApp] Nudge: attempt 10 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:14.166065500 [W 2025-03-21 04:34:14.165 ServerApp] Nudge: attempt 380 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:16.062188500 [W 2025-03-21 04:34:16.062 ServerApp] Nudge: attempt 750 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:16.321077500 [W 2025-03-21 04:34:16.320 ServerApp] Nudge: attempt 200 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:17.568264500 [W 2025-03-21 04:34:17.568 ServerApp] Nudge: attempt 570 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:18.390559500 [W 2025-03-21 04:34:18.390 ServerApp] Nudge: attempt 20 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:19.176689500 [W 2025-03-21 04:34:19.176 ServerApp] Nudge: attempt 390 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:21.076157500 [W 2025-03-21 04:34:21.076 ServerApp] Nudge: attempt 760 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:21.333889500 [W 2025-03-21 04:34:21.333 ServerApp] Nudge: attempt 210 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:22.582934500 [W 2025-03-21 04:34:22.582 ServerApp] Nudge: attempt 580 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:23.401344500 [W 2025-03-21 04:34:23.401 ServerApp] Nudge: attempt 30 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:24.187958500 [W 2025-03-21 04:34:24.187 ServerApp] Nudge: attempt 400 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:26.090278500 [W 2025-03-21 04:34:26.090 ServerApp] Nudge: attempt 770 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:26.344238500 [W 2025-03-21 04:34:26.344 ServerApp] Nudge: attempt 220 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:27.598137500 [W 2025-03-21 04:34:27.598 ServerApp] Nudge: attempt 590 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:28.411274500 [W 2025-03-21 04:34:28.411 ServerApp] Nudge: attempt 40 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:29.199767500 [W 2025-03-21 04:34:29.199 ServerApp] Nudge: attempt 410 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
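The attempt numbers in this excerpt do not form one increasing sequence, which suggests several nudge loops are running against the same kernel at once. Here is a rough sketch for estimating how many. It assumes each loop logs every 10th attempt (inferred from the excerpt, not checked against the jupyter_server source), and the function name is made up for illustration.

```python
import re

NUDGE_RE = re.compile(r"Nudge: attempt (\d+) on kernel ([0-9a-f-]+)")


def count_nudge_loops(lines, step=10):
    """Estimate how many nudge loops are interleaved per kernel.

    A line continues an existing loop if its attempt number is exactly
    `step` more than that loop's last logged attempt; otherwise it is
    treated as the first line seen from a new loop.
    """
    last_attempts = {}  # kernel id -> last logged attempt of each loop
    for line in lines:
        match = NUDGE_RE.search(line)
        if not match:
            continue  # not a nudge line
        attempt, kernel = int(match.group(1)), match.group(2)
        loops = last_attempts.setdefault(kernel, [])
        if attempt - step in loops:
            loops[loops.index(attempt - step)] = attempt
        else:
            loops.append(attempt)
    return {kernel: len(loops) for kernel, loops in last_attempts.items()}


# Attempt numbers from the first eleven nudge lines of the excerpt above.
KERNEL = "8ba2719f-ec00-4ffe-a868-14ec6339e2d6"
ATTEMPTS = [550, 370, 740, 190, 560, 10, 380, 750, 200, 570, 20]
SAMPLE_LINES = [f"Nudge: attempt {n} on kernel {KERNEL}" for n in ATTEMPTS]

print(count_nudge_loops(SAMPLE_LINES))
```

On this sample it reports five interleaved loops for the one kernel. The newest loop (attempt 10) first appears right after the "Connecting to kernel" line, which would be consistent with each reconnect starting another nudge loop while the old ones keep running.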
Finally, we end up here:
2025-03-21 22:20:12.583433500 [E 2025-03-21 22:20:12.582 ServerApp] Uncaught exception GET /api/kernels/a54d1764-8b84-4ea0-907a-27dc64289436/channels?session_id=61927800-9c95-416b-b48f-b339ed5030c8 (2607:fb10:7011:1::c61)
2025-03-21 22:20:12.583439500 HTTPServerRequest(protocol='https', host='idafna-workbench-dev.workbench.prod.netflix.net:8888', method='GET', uri='/api/kernels/a54d1764-8b84-4ea0-907a-27dc64289436/channels?session_id=61927800-9c95-416b-b48f-b339ed5030c8', version='HTTP/1.1', remote_ip='2607:fb10:7011:1::c61')
2025-03-21 22:20:12.583439500 Traceback (most recent call last):
2025-03-21 22:20:12.583439500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/tornado/websocket.py", line 940, in _accept_connection
2025-03-21 22:20:12.583440500 await open_result
2025-03-21 22:20:12.583440500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_server/services/kernels/websocket.py", line 75, in open
2025-03-21 22:20:12.583440500 await self.connection.connect()
2025-03-21 22:20:12.583441500 ^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.583441500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_server/services/kernels/connection/channels.py", line 364, in connect
2025-03-21 22:20:12.583441500 self.create_stream()
2025-03-21 22:20:12.583442500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_server/services/kernels/connection/channels.py", line 155, in create_stream
2025-03-21 22:20:12.583442500 self.channels[channel] = stream = meth(identity=identity)
2025-03-21 22:20:12.583442500 ^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.583443500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_client/ioloop/manager.py", line 25, in wrapped
2025-03-21 22:20:12.583443500 socket = f(self, *args, **kwargs)
2025-03-21 22:20:12.583443500 ^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.583444500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_client/connect.py", line 664, in connect_iopub
2025-03-21 22:20:12.583444500 sock = self._create_connected_socket("iopub", identity=identity)
2025-03-21 22:20:12.583444500 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.583444500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_client/connect.py", line 654, in _create_connected_socket
2025-03-21 22:20:12.583445500 sock = self.context.socket(socket_type)
2025-03-21 22:20:12.583445500 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.583445500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/zmq/sugar/context.py", line 354, in socket
2025-03-21 22:20:12.583446500 socket_class( # set PYTHONTRACEMALLOC=2 to get the calling frame
2025-03-21 22:20:12.583446500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/zmq/sugar/socket.py", line 159, in __init__
2025-03-21 22:20:12.583446500 super().__init__(
2025-03-21 22:20:12.583447500 File "_zmq.py", line 690, in zmq.backend.cython._zmq.Socket.__init__
2025-03-21 22:20:12.583447500 zmq.error.ZMQError: Too many open files
2025-03-21 22:20:12.710088500 [W 2025-03-21 22:20:12.710 ServerApp] Replacing stale connection: 11421a39-5726-49f0-9b24-7ee6c1a02190:c7ab78fd-148f-4c1e-aafc-cbd646628767
2025-03-21 22:20:12.712058500 [I 2025-03-21 22:20:12.712 ServerApp] Connecting to kernel 11421a39-5726-49f0-9b24-7ee6c1a02190.
2025-03-21 22:20:12.716564500 [E 2025-03-21 22:20:12.716 ServerApp] Uncaught exception GET /api/kernels/11421a39-5726-49f0-9b24-7ee6c1a02190/channels?session_id=c7ab78fd-148f-4c1e-aafc-cbd646628767 (172.24.9.135)
2025-03-21 22:20:12.716565500 HTTPServerRequest(protocol='https', host='idafna-workbench-dev.workbench.prod.netflix.net:8888', method='GET', uri='/api/kernels/11421a39-5726-49f0-9b24-7ee6c1a02190/channels?session_id=c7ab78fd-148f-4c1e-aafc-cbd646628767', version='HTTP/1.1', remote_ip='172.24.9.135')
2025-03-21 22:20:12.716566500 Traceback (most recent call last):
2025-03-21 22:20:12.716566500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/tornado/websocket.py", line 940, in _accept_connection
2025-03-21 22:20:12.716567500 await open_result
2025-03-21 22:20:12.716567500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_server/services/kernels/websocket.py", line 75, in open
2025-03-21 22:20:12.716567500 await self.connection.connect()
2025-03-21 22:20:12.716567500 ^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.716568500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_server/services/kernels/connection/channels.py", line 364, in connect
2025-03-21 22:20:12.716568500 self.create_stream()
2025-03-21 22:20:12.716568500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_server/services/kernels/connection/channels.py", line 155, in create_stream
2025-03-21 22:20:12.716569500 self.channels[channel] = stream = meth(identity=identity)
2025-03-21 22:20:12.716569500 ^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.716569500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_client/ioloop/manager.py", line 25, in wrapped
2025-03-21 22:20:12.716570500 socket = f(self, *args, **kwargs)
2025-03-21 22:20:12.716570500 ^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.716570500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_client/connect.py", line 664, in connect_iopub
2025-03-21 22:20:12.716571500 sock = self._create_connected_socket("iopub", identity=identity)
2025-03-21 22:20:12.716571500 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.716571500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_client/connect.py", line 654, in _create_connected_socket
2025-03-21 22:20:12.716572500 sock = self.context.socket(socket_type)
2025-03-21 22:20:12.716572500 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.716572500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/zmq/sugar/context.py", line 354, in socket
2025-03-21 22:20:12.716572500 socket_class( # set PYTHONTRACEMALLOC=2 to get the calling frame
2025-03-21 22:20:12.716573500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/zmq/sugar/socket.py", line 159, in __init__
2025-03-21 22:20:12.716573500 super().__init__(
2025-03-21 22:20:12.716573500 File "_zmq.py", line 690, in zmq.backend.cython._zmq.Socket.__init__
2025-03-21 22:20:12.716574500 zmq.error.ZMQError: Too many open files
2025-03-21 22:20:12.821062500 [W 2025-03-21 22:20:12.821 ServerApp] Replacing stale connection: 982322be-1070-4612-bf59-07d5f4702ade:c34b488e-ca59-42a0-a61a-e505e79d9e03
2025-03-21 22:20:12.822449500 [I 2025-03-21 22:20:12.822 ServerApp] Adapting from protocol version 5.4 (kernel 982322be-1070-4612-bf59-07d5f4702ade) to 5.3 (client).
I'm still working on a repro, but I wanted to share this sooner rather than later so that anyone who has seen this before, or has an idea how to reproduce it, can chime in.
Thanks!