nnictl resume cannot open GUI #5803
Description
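As the title says, the web GUI cannot be opened after I resume the experiment.
I have observed that the GUI is only reachable while the output log stays at "Web portal URLs: ...". If the command continues past that line, as in the log below where the experiment is stopped immediately after the URLs are printed, the GUI cannot be opened. However, I can't find a way to keep the "nnictl resume ID" command running.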
Complete log of command:
[2024-08-26 13:10:28] Creating experiment, Experiment ID: z39sirw8
[2024-08-26 13:10:28] Starting web server...
[2024-08-26 13:10:29] INFO (main) Start NNI manager
[2024-08-26 13:10:29] INFO (RestServer) Starting REST server at port 8080, URL prefix: "/"
[2024-08-26 13:10:29] INFO (RestServer) REST server started.
[2024-08-26 13:10:29] INFO (NNIDataStore) Datastore initialization done
[2024-08-26 13:10:29] Setting up...
[2024-08-26 13:10:30] INFO (NNIManager) Resuming experiment: z39sirw8
[2024-08-26 13:10:30] INFO (NNIManager) Setup training service...
[2024-08-26 13:10:30] INFO (NNIManager) Setup tuner...
[2024-08-26 13:10:31] INFO (NNIManager) Number of current submitted trials: 621, where 0 is resuming.
[2024-08-26 13:10:31] INFO (NNIManager) Change NNIManager status from: INITIALIZED to: RUNNING
[2024-08-26 13:10:31] Web portal URLs: http://127.0.0.1:8080 http://172.17.0.2:8080
[2024-08-26 13:10:31] Stopping experiment, please wait...
[2024-08-26 13:10:31] Saving experiment checkpoint...
[2024-08-26 13:10:31] Stopping NNI manager, if any...
[2024-08-26 13:10:31] INFO (ShutdownManager) Initiate shutdown: REST request
[2024-08-26 13:10:31] INFO (RestServer) Stopping REST server.
[2024-08-26 13:10:31] ERROR (ShutdownManager) Error during shutting down NniManager: TypeError: Cannot read properties of undefined (reading 'getBufferedAmount')
at TunerServer.sendCommand (/usr/local/lib/python3.8/dist-packages/nni_node/core/tuner_command_channel.js:60:26)
at NNIManager.stopExperimentTopHalf (/usr/local/lib/python3.8/dist-packages/nni_node/core/nnimanager.js:303:25)
at NNIManager.stopExperiment (/usr/local/lib/python3.8/dist-packages/nni_node/core/nnimanager.js:292:20)
at /usr/local/lib/python3.8/dist-packages/nni_node/common/globals/shutdown.js:49:23
at Array.map (<anonymous>)
at ShutdownManager.shutdown (/usr/local/lib/python3.8/dist-packages/nni_node/common/globals/shutdown.js:47:51)
at ShutdownManager.initiate (/usr/local/lib/python3.8/dist-packages/nni_node/common/globals/shutdown.js:22:18)
at /usr/local/lib/python3.8/dist-packages/nni_node/rest_server/restHandler.js:366:40
at Layer.handle [as handle_request] (/usr/local/lib/python3.8/dist-packages/nni_node/node_modules/express/lib/router/layer.js:95:5)
at next (/usr/local/lib/python3.8/dist-packages/nni_node/node_modules/express/lib/router/route.js:144:13)
[2024-08-26 13:10:31] INFO (NNIManager) Change NNIManager status from: RUNNING to: STOPPING
[2024-08-26 13:10:31] INFO (NNIManager) Stopping experiment, cleaning up ...
[2024-08-26 13:10:31] INFO (ShutdownManager) Shutdown complete.
[2024-08-26 13:10:31] INFO (RestServer) REST server stopped.
[2024-08-26 13:10:31] Experiment stopped.
root@e44bc2dd4409:/workspace/MediaPipePyTorch# Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/nni/__main__.py", line 85, in <module>
main()
File "/usr/local/lib/python3.8/dist-packages/nni/__main__.py", line 58, in main
dispatcher = MsgDispatcher(url, tuner, assessor)
File "/usr/local/lib/python3.8/dist-packages/nni/runtime/msg_dispatcher.py", line 71, in __init__
super().__init__(command_channel_url)
File "/usr/local/lib/python3.8/dist-packages/nni/runtime/msg_dispatcher_base.py", line 47, in __init__
self._channel.connect()
File "/usr/local/lib/python3.8/dist-packages/nni/runtime/tuner_command_channel/channel.py", line 58, in connect
self._channel.connect()
File "/usr/local/lib/python3.8/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 23, in connect
self._ensure_conn()
File "/usr/local/lib/python3.8/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 75, in _ensure_conn
self._conn.connect()
File "/usr/local/lib/python3.8/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 65, in connect
self._ws = _wait(_connect_async(self._url))
File "/usr/local/lib/python3.8/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 121, in _wait
return future.result()
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 444, in result
return self.__get_result()
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
File "/usr/local/lib/python3.8/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 135, in _connect_async
return await websockets.connect(url, max_size=None) # type: ignore
File "/usr/local/lib/python3.8/dist-packages/websockets/legacy/client.py", line 655, in await_impl_timeout
return await self.await_impl()
File "/usr/local/lib/python3.8/dist-packages/websockets/legacy/client.py", line 659, in await_impl
_transport, _protocol = await self._create_connection()
File "/usr/lib/python3.8/asyncio/base_events.py", line 1033, in create_connection
raise OSError('Multiple exceptions: {}'.format(
OSError: Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 8080), [Errno 99] Cannot assign requested address
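
For anyone trying to reproduce this, here is a minimal sketch (standard library only, not part of NNI) that checks whether anything is still listening on the web portal port after `nnictl resume` returns. The helper name `is_port_open` is my own, and the host and port are simply the ones from the "Web portal URLs" line in the log above. In my case I would expect it to report that the port is closed, which would match the "Connect call failed ('127.0.0.1', 8080)" error at the end of the traceback.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) can be established.

    Hypothetical helper for diagnosis only; it is not an NNI API.
    """
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError)
        # when nothing is listening on the target port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Host and port are assumptions taken from the log: http://127.0.0.1:8080
    reachable = is_port_open("127.0.0.1", 8080)
    print("REST server reachable:", reachable)
```

If this prints `False` right after the resume command finishes, the REST server has already shut down, so the browser has nothing to connect to.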