Once the experiment reaches a certain point, it generally stops running and reports an error. #5802
Description
Describe the issue:
The experiment usually stops after running for a while, and the dispatcher reports the following error:
[2024-08-09 23:59:19] ERROR (nni.runtime.msg_dispatcher_base/Thread-1 (command_queue_worker)) 10
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.10/site-packages/nni/runtime/msg_dispatcher_base.py", line 108, in command_queue_worker
    self.process_command(command, data)
  File "/root/miniconda3/lib/python3.10/site-packages/nni/runtime/msg_dispatcher_base.py", line 154, in process_command
    command_handlers[command](data)
  File "/root/miniconda3/lib/python3.10/site-packages/nni/runtime/msg_dispatcher.py", line 148, in handle_report_metric_data
    self._handle_final_metric_data(data)
  File "/root/miniconda3/lib/python3.10/site-packages/nni/runtime/msg_dispatcher.py", line 201, in _handle_final_metric_data
    self.tuner.receive_trial_result(id_, _trial_params[id_], value, customized=customized,
  File "/root/miniconda3/lib/python3.10/site-packages/nni/algorithms/hpo/tpe_tuner.py", line 197, in receive_trial_result
    params = self._running_params.pop(parameter_id)
KeyError: 10
content: '{"parameter_id": 12, "trial_job_id": "YbXt7", "type": "PERIODICAL", "sequence": 199, "value": "0.2895440735801888"}'
}
[2024-08-10 00:00:06] ERROR (WsChannel.default) Channel closed. Ignored command {
type: 'ME',
content: '{"parameter_id": 12, "trial_job_id": "YbXt7", "type": "FINAL", "sequence": 0, "value": "0.2898187191127104"}'
}
[2024-08-10 00:00:07] INFO (NNIManager) Trial job YbXt7 status changed from RUNNING to SUCCEEDED
[2024-08-10 00:00:07] ERROR (WsChannel.default) Channel closed. Ignored command {
type: 'EN',
content: '{"trial_job_id":"YbXt7","event":"SUCCEEDED","hyper_params":"{\"parameter_id\": 12, \"parameter_source\": \"algorithm\", \"parameters\": {\"activate\": \"elu\", \"d_emb\": 64, \"d_hid\": 32, \"drop\": 0.3884039376983632, \"gamma\": 6.4905452738897065, \"l1\": 1.4578424787079767, \"l2\": 38.44410448714523, \"l4\": 0.29277084068918136, \"lr\": 9.015207683143664e-05, \"mask\": 0.004542790568841141, \"mode\": \"GAT\", \"t\": 0.6139793721895512, \"mask_edge\": 0.07705512469912157, \"instance_temperature\": 0.6737029785000441, \"cluster_temperature\": 0.5472419195458156}, \"parameter_index\": 0}"}'
}
[2024-08-10 00:00:07] ERROR (WsChannel.default) Channel closed. Ignored command { type: 'GE', content: '1' }
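For readers unfamiliar with the failing line, here is a minimal sketch of the pattern the traceback points at. It assumes `_running_params` is a plain dict keyed by parameter id, which is what the `.pop(parameter_id)` call suggests. `ToyTuner` and its methods are hypothetical stand-ins written for illustration, not NNI's actual tuner code. The sketch does not establish why id 10 was missing in this experiment. It only shows that a final result for an id the tuner is not currently tracking, whether never registered or already popped, raises the same `KeyError`.

```python
class ToyTuner:
    """Hypothetical stand-in for a tuner's bookkeeping; not NNI's implementation."""

    def __init__(self):
        # Assumption: maps parameter_id -> the parameters handed out for that trial.
        self._running_params = {}

    def generate_parameters(self, parameter_id):
        params = {"lr": 1e-4}  # placeholder values for illustration
        self._running_params[parameter_id] = params
        return params

    def receive_trial_result(self, parameter_id, value):
        # dict.pop without a default raises KeyError when the key is absent.
        params = self._running_params.pop(parameter_id)
        return params, value


tuner = ToyTuner()
tuner.generate_parameters(12)
tuner.receive_trial_result(12, 0.29)      # id 12 is tracked, so this succeeds

try:
    tuner.receive_trial_result(10, 0.29)  # id 10 was never registered here
except KeyError as exc:
    print("KeyError:", exc)
```

In this toy version, a second final result for id 12 would fail the same way, because the first call already removed the entry.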
Environment:
- NNI version:
- Training service (local|remote|pai|aml|etc):
- Client OS:
- Server OS (for remote mode only):
- Python version:
- PyTorch/TensorFlow version:
- Is conda/virtualenv/venv used?:
- Is running in Docker?:
Configuration:
- Experiment config (remember to remove secrets!):
- Search space:
Log message:
- nnimanager.log:
- dispatcher.log:
- nnictl stdout and stderr:
How to reproduce it?: