Nudge function leaking FDs, leading to all kernels stopping due to "Too many open files" #1506

Open
@ibdafna

I'm still working on a reliable repro, but I wanted to raise this sooner rather than later. Under some conditions, a rogue kernel can push Jupyter Server into a state where it exhausts all available file descriptors. The culprit appears to be the nudge function, which accumulates file descriptors without cleaning them up.

lsof -p $(pgrep -f jupyter-lab) | awk '{print $5}' | sort | uniq -c | sort -nr
   1031 a_inode
    432 IPv4
    197 REG
      4 unix
      3 CHR
      2 FIFO
      2 DIR
      1 TYPE
      1 sock
      1 IPv6
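To see what the `a_inode` entries in the tally above actually are, the descriptors can be broken down by their `/proc` link targets. The following is a minimal sketch, assuming a Linux host; the helper names `classify_fd_target` and `tally_fds` are hypothetical and not part of Jupyter Server. `a_inode` in lsof corresponds to anonymous inodes such as `eventfd` or `eventpoll` handles. libzmq typically uses an `eventfd` per socket for signalling on Linux, so a large `anon_inode:[eventfd]` count would be consistent with leaked ZMQ sockets, though that is an assumption this report does not confirm.

```python
import os
from collections import Counter


def classify_fd_target(target: str) -> str:
    """Map a /proc/<pid>/fd symlink target to a coarse category."""
    if target.startswith("anon_inode:"):
        # Keep the full label, e.g. "anon_inode:[eventfd]", since the
        # subtype is exactly what the lsof summary hides.
        return target
    if target.startswith("socket:"):
        return "socket"
    if target.startswith("pipe:"):
        return "pipe"
    return "file"


def tally_fds(pid: int) -> Counter:
    """Count open descriptors of a process by category (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    counts = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        counts[classify_fd_target(target)] += 1
    return counts
```

Calling `tally_fds(pid).most_common()` with the server's PID a few minutes apart should show which category is growing.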
[
  {
    "id": "11421a39-5726-49f0-9b24-7ee6c1a02190",
    "name": "python3",
    "last_activity": "2025-03-21T20:42:09.082165Z",
    "execution_state": "idle",
    "connections": 26715
  },
  {
    "id": "a54d1764-8b84-4ea0-907a-27dc64289436",
    "name": "python310",
    "last_activity": "2025-03-21T20:42:07.464654Z",
    "execution_state": "idle",
    "connections": 26656
  },
  {
    "id": "3a53b14e-e44b-4f0f-ad44-6e5e801d8923",
    "name": "python310",
    "last_activity": "2025-03-21T20:45:46.949561Z",
    "execution_state": "idle",
    "connections": 26622
  },
  {
    "id": "982322be-1070-4612-bf59-07d5f4702ade",
    "name": "scala",
    "last_activity": "2025-03-20T23:59:47.957756Z",
    "execution_state": "idle",
    "connections": 26665
  },
  {
    "id": "8ba2719f-ec00-4ffe-a868-14ec6339e2d6",
    "name": "spark33-scala",
    "last_activity": "2025-03-20T23:59:40.736254Z",
    "execution_state": "idle",
    "connections": 29112
  }
]
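The `connections` values in the listing above are in the tens of thousands for every kernel. A small filter over a kernel listing of this shape can flag that early. This is only a sketch: the function name `kernels_over_threshold` and the threshold of 100 are hypothetical choices for illustration, and the sample data is trimmed from the listing above.

```python
import json


def kernels_over_threshold(kernels, max_connections=100):
    """Return (id, name, connections) for kernels above the threshold.

    max_connections is an arbitrary illustrative cutoff, not a value
    defined by Jupyter Server.
    """
    return [
        (k["id"], k["name"], k["connections"])
        for k in kernels
        if k["connections"] > max_connections
    ]


# Two entries copied from the listing in this report.
sample = json.loads("""
[
  {"id": "11421a39-5726-49f0-9b24-7ee6c1a02190", "name": "python3",
   "execution_state": "idle", "connections": 26715},
  {"id": "8ba2719f-ec00-4ffe-a868-14ec6339e2d6", "name": "spark33-scala",
   "execution_state": "idle", "connections": 29112}
]
""")

flagged = kernels_over_threshold(sample)
for kernel_id, name, connections in flagged:
    print(f"{name} ({kernel_id}): {connections} connections")
```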

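The rogue kernel here is a Scala kernel: the nudge warnings below all target 8ba2719f-ec00-4ffe-a868-14ec6339e2d6, the spark33-scala kernel. It starts this process, which ultimately affects the Python kernels as well.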

This is the intermediate state:

2025-03-21 04:34:06.580115500  [W 2025-03-21 04:34:06.580 ServerApp] WebSocket ping timeout after 90000 ms.
2025-03-21 04:34:07.540819500  [W 2025-03-21 04:34:07.540 ServerApp] Nudge: attempt 550 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:08.866015500  [W 2025-03-21 04:34:08.865 ServerApp] Replacing stale connection: 8ba2719f-ec00-4ffe-a868-14ec6339e2d6:baf135be-1cf9-4ba7-b2d0-8164caa4a2a9
2025-03-21 04:34:08.867461500  [I 2025-03-21 04:34:08.867 ServerApp] Adapting from protocol version 5.4 (kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6) to 5.3 (client).
2025-03-21 04:34:08.869069500  [I 2025-03-21 04:34:08.869 ServerApp] Connecting to kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6.
2025-03-21 04:34:09.155150500  [W 2025-03-21 04:34:09.155 ServerApp] Nudge: attempt 370 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:11.047986500  [W 2025-03-21 04:34:11.047 ServerApp] Nudge: attempt 740 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:11.308202500  [W 2025-03-21 04:34:11.308 ServerApp] Nudge: attempt 190 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:12.555067500  [W 2025-03-21 04:34:12.554 ServerApp] Nudge: attempt 560 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:13.380044500  [W 2025-03-21 04:34:13.379 ServerApp] Nudge: attempt 10 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:14.166065500  [W 2025-03-21 04:34:14.165 ServerApp] Nudge: attempt 380 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:16.062188500  [W 2025-03-21 04:34:16.062 ServerApp] Nudge: attempt 750 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:16.321077500  [W 2025-03-21 04:34:16.320 ServerApp] Nudge: attempt 200 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:17.568264500  [W 2025-03-21 04:34:17.568 ServerApp] Nudge: attempt 570 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:18.390559500  [W 2025-03-21 04:34:18.390 ServerApp] Nudge: attempt 20 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:19.176689500  [W 2025-03-21 04:34:19.176 ServerApp] Nudge: attempt 390 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:21.076157500  [W 2025-03-21 04:34:21.076 ServerApp] Nudge: attempt 760 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:21.333889500  [W 2025-03-21 04:34:21.333 ServerApp] Nudge: attempt 210 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:22.582934500  [W 2025-03-21 04:34:22.582 ServerApp] Nudge: attempt 580 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:23.401344500  [W 2025-03-21 04:34:23.401 ServerApp] Nudge: attempt 30 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:24.187958500  [W 2025-03-21 04:34:24.187 ServerApp] Nudge: attempt 400 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:26.090278500  [W 2025-03-21 04:34:26.090 ServerApp] Nudge: attempt 770 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:26.344238500  [W 2025-03-21 04:34:26.344 ServerApp] Nudge: attempt 220 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:27.598137500  [W 2025-03-21 04:34:27.598 ServerApp] Nudge: attempt 590 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:28.411274500  [W 2025-03-21 04:34:28.411 ServerApp] Nudge: attempt 40 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:29.199767500  [W 2025-03-21 04:34:29.199 ServerApp] Nudge: attempt 410 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
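The attempt numbers in the log above do not form one increasing sequence. They interleave as several sequences that each advance by 10 (550, 560, 570, ... alongside 370, 380, 390, ... and so on), which suggests several nudge loops running concurrently against the same kernel. The sketch below counts such chains from log lines. It assumes that a warning is emitted every 10 attempts, as the log appears to show; the function name `count_nudge_chains` and the `step` parameter are hypothetical.

```python
import re
from collections import defaultdict

NUDGE_RE = re.compile(r"Nudge: attempt (\d+) on kernel ([0-9a-f-]+)")


def count_nudge_chains(lines, step=10):
    """Return {kernel_id: number of distinct nudge chains seen}.

    A line continues an existing chain when its attempt number is exactly
    `step` above that chain's last attempt; otherwise it starts a new one.
    """
    last_attempts = defaultdict(list)  # kernel_id -> last attempt per chain
    for line in lines:
        match = NUDGE_RE.search(line)
        if not match:
            continue
        attempt, kernel_id = int(match.group(1)), match.group(2)
        chains = last_attempts[kernel_id]
        if attempt - step in chains:
            chains[chains.index(attempt - step)] = attempt
        else:
            chains.append(attempt)
    return {kernel_id: len(chains) for kernel_id, chains in last_attempts.items()}


# Attempt numbers taken from the first seven nudge warnings in this report.
KERNEL = "8ba2719f-ec00-4ffe-a868-14ec6339e2d6"
sample_lines = [
    f"[W ServerApp] Nudge: attempt {n} on kernel {KERNEL}"
    for n in (550, 370, 740, 190, 560, 10, 380)
]
chain_counts = count_nudge_chains(sample_lines)
print(chain_counts)
```

On the excerpt above this yields five chains for the one kernel. If each chain holds its own set of channel sockets, that would fit the descriptor growth, but that link is an inference and not something the log proves.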

Finally, we end up here:

2025-03-21 22:20:12.583433500  [E 2025-03-21 22:20:12.582 ServerApp] Uncaught exception GET /api/kernels/a54d1764-8b84-4ea0-907a-27dc64289436/channels?session_id=61927800-9c95-416b-b48f-b339ed5030c8 (2607:fb10:7011:1::c61)
2025-03-21 22:20:12.583439500      HTTPServerRequest(protocol='https', host='idafna-workbench-dev.workbench.prod.netflix.net:8888', method='GET', uri='/api/kernels/a54d1764-8b84-4ea0-907a-27dc64289436/channels?session_id=61927800-9c95-416b-b48f-b339ed5030c8', version='HTTP/1.1', remote_ip='2607:fb10:7011:1::c61')
2025-03-21 22:20:12.583439500      Traceback (most recent call last):
2025-03-21 22:20:12.583439500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/tornado/websocket.py", line 940, in _accept_connection
2025-03-21 22:20:12.583440500          await open_result
2025-03-21 22:20:12.583440500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_server/services/kernels/websocket.py", line 75, in open
2025-03-21 22:20:12.583440500          await self.connection.connect()
2025-03-21 22:20:12.583441500                ^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.583441500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_server/services/kernels/connection/channels.py", line 364, in connect
2025-03-21 22:20:12.583441500          self.create_stream()
2025-03-21 22:20:12.583442500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_server/services/kernels/connection/channels.py", line 155, in create_stream
2025-03-21 22:20:12.583442500          self.channels[channel] = stream = meth(identity=identity)
2025-03-21 22:20:12.583442500                                            ^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.583443500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_client/ioloop/manager.py", line 25, in wrapped
2025-03-21 22:20:12.583443500          socket = f(self, *args, **kwargs)
2025-03-21 22:20:12.583443500                   ^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.583444500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_client/connect.py", line 664, in connect_iopub
2025-03-21 22:20:12.583444500          sock = self._create_connected_socket("iopub", identity=identity)
2025-03-21 22:20:12.583444500                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.583444500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_client/connect.py", line 654, in _create_connected_socket
2025-03-21 22:20:12.583445500          sock = self.context.socket(socket_type)
2025-03-21 22:20:12.583445500                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.583445500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/zmq/sugar/context.py", line 354, in socket
2025-03-21 22:20:12.583446500          socket_class(  # set PYTHONTRACEMALLOC=2 to get the calling frame
2025-03-21 22:20:12.583446500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/zmq/sugar/socket.py", line 159, in __init__
2025-03-21 22:20:12.583446500          super().__init__(
2025-03-21 22:20:12.583447500        File "_zmq.py", line 690, in zmq.backend.cython._zmq.Socket.__init__
2025-03-21 22:20:12.583447500      zmq.error.ZMQError: Too many open files
2025-03-21 22:20:12.710088500  [W 2025-03-21 22:20:12.710 ServerApp] Replacing stale connection: 11421a39-5726-49f0-9b24-7ee6c1a02190:c7ab78fd-148f-4c1e-aafc-cbd646628767
2025-03-21 22:20:12.712058500  [I 2025-03-21 22:20:12.712 ServerApp] Connecting to kernel 11421a39-5726-49f0-9b24-7ee6c1a02190.
2025-03-21 22:20:12.716564500  [E 2025-03-21 22:20:12.716 ServerApp] Uncaught exception GET /api/kernels/11421a39-5726-49f0-9b24-7ee6c1a02190/channels?session_id=c7ab78fd-148f-4c1e-aafc-cbd646628767 (172.24.9.135)
2025-03-21 22:20:12.716565500      HTTPServerRequest(protocol='https', host='idafna-workbench-dev.workbench.prod.netflix.net:8888', method='GET', uri='/api/kernels/11421a39-5726-49f0-9b24-7ee6c1a02190/channels?session_id=c7ab78fd-148f-4c1e-aafc-cbd646628767', version='HTTP/1.1', remote_ip='172.24.9.135')
2025-03-21 22:20:12.716566500      Traceback (most recent call last):
2025-03-21 22:20:12.716566500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/tornado/websocket.py", line 940, in _accept_connection
2025-03-21 22:20:12.716567500          await open_result
2025-03-21 22:20:12.716567500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_server/services/kernels/websocket.py", line 75, in open
2025-03-21 22:20:12.716567500          await self.connection.connect()
2025-03-21 22:20:12.716567500                ^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.716568500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_server/services/kernels/connection/channels.py", line 364, in connect
2025-03-21 22:20:12.716568500          self.create_stream()
2025-03-21 22:20:12.716568500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_server/services/kernels/connection/channels.py", line 155, in create_stream
2025-03-21 22:20:12.716569500          self.channels[channel] = stream = meth(identity=identity)
2025-03-21 22:20:12.716569500                                            ^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.716569500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_client/ioloop/manager.py", line 25, in wrapped
2025-03-21 22:20:12.716570500          socket = f(self, *args, **kwargs)
2025-03-21 22:20:12.716570500                   ^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.716570500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_client/connect.py", line 664, in connect_iopub
2025-03-21 22:20:12.716571500          sock = self._create_connected_socket("iopub", identity=identity)
2025-03-21 22:20:12.716571500                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.716571500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_client/connect.py", line 654, in _create_connected_socket
2025-03-21 22:20:12.716572500          sock = self.context.socket(socket_type)
2025-03-21 22:20:12.716572500                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.716572500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/zmq/sugar/context.py", line 354, in socket
2025-03-21 22:20:12.716572500          socket_class(  # set PYTHONTRACEMALLOC=2 to get the calling frame
2025-03-21 22:20:12.716573500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/zmq/sugar/socket.py", line 159, in __init__
2025-03-21 22:20:12.716573500          super().__init__(
2025-03-21 22:20:12.716573500        File "_zmq.py", line 690, in zmq.backend.cython._zmq.Socket.__init__
2025-03-21 22:20:12.716574500      zmq.error.ZMQError: Too many open files
2025-03-21 22:20:12.821062500  [W 2025-03-21 22:20:12.821 ServerApp] Replacing stale connection: 982322be-1070-4612-bf59-07d5f4702ade:c34b488e-ca59-42a0-a61a-e505e79d9e03
2025-03-21 22:20:12.822449500  [I 2025-03-21 22:20:12.822 ServerApp] Adapting from protocol version 5.4 (kernel 982322be-1070-4612-bf59-07d5f4702ade) to 5.3 (client).
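`ZMQError: Too many open files` corresponds to `EMFILE`. libzmq documents that error both for hitting the operating system's descriptor limit and for a context reaching its `ZMQ_MAX_SOCKETS` cap (documented default 1023), so it is worth checking which limit is actually being hit. The sketch below covers the OS side only, using `/proc` on Linux. The helper names are hypothetical, and the sample `limits` text in the usage lines is made up for illustration, not taken from the affected host.

```python
import os


def parse_max_open_files(limits_text):
    """Extract (soft, hard) 'Max open files' from /proc/<pid>/limits text.

    A limit of 'unlimited' is returned as None.
    """
    def to_int(value):
        return None if value == "unlimited" else int(value)

    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            # fields: ['Max', 'open', 'files', <soft>, <hard>, 'files']
            return to_int(fields[3]), to_int(fields[4])
    raise ValueError("no 'Max open files' row found")


def fd_usage(pid):
    """Return (open_fd_count, soft_limit) for a process (Linux only)."""
    with open(f"/proc/{pid}/limits") as fh:
        soft, _hard = parse_max_open_files(fh.read())
    return len(os.listdir(f"/proc/{pid}/fd")), soft


# Illustrative input in the format of /proc/<pid>/limits.
example_limits = (
    "Limit                     Soft Limit           Hard Limit           Units\n"
    "Max open files            1024                 4096                 files\n"
)
print(parse_max_open_files(example_limits))
```

If `fd_usage` shows the process well below its soft limit when the error appears, the ZMQ per-context socket cap becomes the more likely candidate.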

I'm sharing this early so that anyone who has seen this before, or has an idea how to reproduce it, can chime in.

Thanks!
