This repository was archived by the owner on Sep 18, 2024. It is now read-only.
Dispatcher crash with TPE KeyError #5798
Open
Description
Describe the issue:
The dispatcher occasionally crashes for me with no apparent cause, and when this happens, the experiment stops running.
Environment:
- NNI version: 3.0
- Training service (local|remote|pai|aml|etc): local
- Client OS: Linux (Ubuntu 22.04)
- Server OS (for remote mode only):
- Python version: 3.10.14
- PyTorch/TensorFlow version: 2.2.1+cu118 (PyTorch)
- Is conda/virtualenv/venv used?: No (pyenv is used)
- Is running in Docker?: No
Configuration:
- Experiment config (remember to remove secrets!):
from nni.experiment import Experiment
experiment = Experiment('local')
experiment.config.trial_command = 'python model.py'
experiment.config.trial_code_directory = '.'
experiment.config.search_space = search_space
experiment.config.tuner.name = 'TPE'
experiment.config.tuner.class_args['optimize_mode'] = 'minimize'
experiment.config.max_trial_number = 1000
experiment.config.trial_concurrency = 1
experiment.run(8080)
- Search space:
search_space = {
    "hidden_sizes": {
        "_type": "choice",
        "_value": [[], [256], [512], [1024], [1024, 512], [1024, 512, 256], [512, 256]]
    },
    "learning_rate": {
        "_type": "loguniform",
        "_value": [0.000001, 0.1]
    },
    "batch_size": {
        "_type": "choice",
        "_value": [32, 64, 128]
    },
    "num_epochs": {
        "_type": "randint",
        "_value": [100, 1000]
    },
    "dropout_prob": {
        "_type": "uniform",
        "_value": [0.0, 0.5]
    },
    "use_batch_norm": {
        "_type": "choice",
        "_value": [True, False]
    },
    "activation_fn": {
        "_type": "choice",
        "_value": ["relu", "leaky_relu", "sigmoid", "tanh", "elu", "selu"]
    },
    "patience": {
        "_type": "randint",
        "_value": [0, 10]
    }
}
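For context, model.py (not included in this issue) consumes one sample from this search space and reports metrics; the PERIODICAL values in nnimanager.log below come from intermediate reports. A hypothetical minimal stand-in, just to show the reporting calls involved:

import nni

# Hypothetical stub in place of the real model.py (not shown in this issue).
params = nni.get_next_parameter()  # one sample from the search space above

loss = float('inf')
for epoch in range(params['num_epochs']):
    loss = 1.0 / (epoch + 1)              # dummy loss in place of real training
    nni.report_intermediate_result(loss)  # -> the PERIODICAL metrics in the log

nni.report_final_result(loss)  # -> the final metric that TPE consumes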
Log message:
- nnimanager.log:
(relevant snippet)
[2024-07-03 17:10:21] INFO (NNIManager) submitTrialJob: form: {
  sequenceId: 46,
  hyperParameters: {
    value: '{"parameter_id": 46, "parameter_source": "algorithm", "parameters": {"hidden_sizes": [256], "learning_rate": 0.004027533073627928, "batch_size": 128, "num_epochs": 748, "dropout_prob": 0.18980965379785528, "use_batch_norm": true, "activation_fn": "selu", "patience": 6}, "parameter_index": 0}',
    index: 0
  },
  placementConstraint: { type: 'None', gpus: [] }
}
[2024-07-03 17:10:21] INFO (LocalV3.local) Created trial eDGmO
[2024-07-03 17:10:22] INFO (LocalV3.local) Trial parameter: eDGmO {"parameter_id": 46, "parameter_source": "algorithm", "parameters": {"hidden_sizes": [256], "learning_rate": 0.004027533073627928, "batch_size": 128, "num_epochs": 748, "dropout_prob": 0.18980965379785528, "use_batch_norm": true, "activation_fn": "selu", "patience": 6}, "parameter_index": 0}
[2024-07-03 17:10:29] ERROR (WsChannel.__default__) Channel closed. Ignored command {
  type: 'ME',
  content: '{"parameter_id": 46, "trial_job_id": "eDGmO", "type": "PERIODICAL", "sequence": 1, "value": "14.616238377757908"}'
}
[2024-07-03 17:10:29] ERROR (WsChannel.__default__) Channel closed. Ignored command {
  type: 'ME',
  content: '{"parameter_id": 46, "trial_job_id": "eDGmO", "type": "PERIODICAL", "sequence": 2, "value": "10.664397033219485"}'
}
- dispatcher.log:
[2024-07-03 16:45:49] INFO (nni.tuner.tpe/MainThread) Using random seed 668056533
[2024-07-03 16:45:49] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher started
[2024-07-03 16:45:49] INFO (nni.runtime.msg_dispatcher/Thread-1 (command_queue_worker)) Initial search space: {'hidden_sizes': {'_type': 'choice', '_value': [[], [256], [512], [1024], [1024, 512], [1024, 512, 256], [512, 256]]}, 'learning_rate': {'_type': 'loguniform', '_value': [1e-06, 0.1]}, 'batch_size': {'_type': 'choice', '_value': [32, 64, 128]}, 'num_epochs': {'_type': 'randint', '_value': [100, 1000]}, 'dropout_prob': {'_type': 'uniform', '_value': [0, 0.5]}, 'use_batch_norm': {'_type': 'choice', '_value': [True, False]}, 'activation_fn': {'_type': 'choice', '_value': ['relu', 'leaky_relu', 'sigmoid', 'tanh', 'elu', 'selu']}, 'patience': {'_type': 'randint', '_value': [0, 10]}}
[2024-07-03 17:10:21] ERROR (nni.runtime.msg_dispatcher_base/Thread-1 (command_queue_worker)) 45
Traceback (most recent call last):
  File "/home/josep/.pyenv/versions/3.10.14/lib/python3.10/site-packages/nni/runtime/msg_dispatcher_base.py", line 108, in command_queue_worker
    self.process_command(command, data)
  File "/home/josep/.pyenv/versions/3.10.14/lib/python3.10/site-packages/nni/runtime/msg_dispatcher_base.py", line 154, in process_command
    command_handlers[command](data)
  File "/home/josep/.pyenv/versions/3.10.14/lib/python3.10/site-packages/nni/runtime/msg_dispatcher.py", line 148, in handle_report_metric_data
    self._handle_final_metric_data(data)
  File "/home/josep/.pyenv/versions/3.10.14/lib/python3.10/site-packages/nni/runtime/msg_dispatcher.py", line 201, in _handle_final_metric_data
    self.tuner.receive_trial_result(id_, _trial_params[id_], value, customized=customized,
  File "/home/josep/.pyenv/versions/3.10.14/lib/python3.10/site-packages/nni/algorithms/hpo/tpe_tuner.py", line 197, in receive_trial_result
    params = self._running_params.pop(parameter_id)
KeyError: 45
[2024-07-03 17:10:28] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher exiting...
[2024-07-03 17:10:28] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher terminiated
How to reproduce it?:
This has happened more than once for me, occasionally and across different experiments. I tried lowering trial_concurrency to 1 to avoid it, but it still appears.
In this example, trial 45 evidently caused the crash. In the web UI, trial 45 shows as succeeded, with a recorded final metric. Yet when TPE receives that result, it apparently cannot find the trial's parameters in its _running_params dict (hence the KeyError: 45). A possible stop-gap is sketched below.
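A minimal defensive sketch, untested (SafeTpeTuner is a hypothetical name): a TpeTuner subclass that drops final results whose parameter_id is no longer being tracked, instead of raising. It is based only on the pop call visible in the traceback above, would need to run inside the dispatcher process (e.g. registered as a custom tuner per NNI's custom-tuner docs), and only masks the symptom rather than explaining why the id is missing:

from nni.algorithms.hpo.tpe_tuner import TpeTuner

class SafeTpeTuner(TpeTuner):
    # Hypothetical stop-gap: ignore results for parameter ids the tuner is
    # no longer tracking, instead of letting the KeyError kill the
    # dispatcher. self._running_params is the dict popped in the traceback
    # above (nni/algorithms/hpo/tpe_tuner.py, line 197).
    def receive_trial_result(self, parameter_id, parameters, value, **kwargs):
        if parameter_id not in self._running_params:
            return  # e.g. a duplicate or stale final metric, as with trial 45
        super().receive_trial_result(parameter_id, parameters, value, **kwargs)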