This repository was archived by the owner on Sep 18, 2024. It is now read-only.

Dispatcher crash with TPE KeyError #5798

Open
@Fripplebubby


Describe the issue:
The dispatcher occasionally crashes with a KeyError raised inside the TPE tuner (see dispatcher.log below). I have not been able to identify the cause, and once it happens the experiment stops running.

Environment:

  • NNI version: 3.0
  • Training service (local|remote|pai|aml|etc): local
  • Client OS: Linux (Ubuntu 22.04)
  • Server OS (for remote mode only):
  • Python version: 3.10.14
  • PyTorch/TensorFlow version: 2.2.1+cu118 (PyTorch)
  • Is conda/virtualenv/venv used?: No (pyenv is used)
  • Is running in Docker?: No

Configuration:

  • Experiment config (remember to remove secrets!):
# search_space is the dict listed under "Search space" below
from nni.experiment import Experiment
experiment = Experiment('local')
experiment.config.trial_command = 'python model.py'
experiment.config.trial_code_directory = '.'
experiment.config.search_space = search_space
experiment.config.tuner.name = 'TPE'
experiment.config.tuner.class_args['optimize_mode'] = 'minimize'
experiment.config.max_trial_number = 1000
experiment.config.trial_concurrency = 1
experiment.run(8080)
  • Search space:
search_space = {
    "hidden_sizes": {
        "_type": "choice",
        "_value": [[], [256], [512], [1024], [1024, 512], [1024, 512, 256], [512, 256]]
    },
    "learning_rate": {
        "_type": "loguniform",
        "_value": [0.000001, 0.1]
    },
    "batch_size": {
        "_type": "choice",
        "_value": [32, 64, 128]
    },
    "num_epochs": {
        "_type": "randint",
        "_value": [100, 1000]
    },
    "dropout_prob": {
        "_type": "uniform",
        "_value": [0.0, 0.5]
    },
    "use_batch_norm": {
        "_type": "choice",
        "_value": [True, False]
    },
    "activation_fn": {
        "_type": "choice",
        "_value": ["relu", "leaky_relu", "sigmoid", "tanh", "elu", "selu"]
    },
    "patience": {
        "_type": "randint",
        "_value": [0, 10]
    }
}

Log message:

  • nnimanager.log:
    (relevant snippet)
[2024-07-03 17:10:21] INFO (NNIManager) submitTrialJob: form: {
  sequenceId: 46,
  hyperParameters: {
    value: '{"parameter_id": 46, "parameter_source": "algorithm", "parameters": {"hidden_sizes": [256], "learning_rate": 0.004027533073627928, "batch_size": 128, "num_epochs": 748, "dropout_prob": 0.18980965379785528, "use_batch_norm": true, "activation_fn": "selu", "patience": 6}, "parameter_index": 0}',
    index: 0
  },
  placementConstraint: { type: 'None', gpus: [] }
}
[2024-07-03 17:10:21] INFO (LocalV3.local) Created trial eDGmO
[2024-07-03 17:10:22] INFO (LocalV3.local) Trial parameter: eDGmO {"parameter_id": 46, "parameter_source": "algorithm", "parameters": {"hidden_sizes": [256], "learning_rate": 0.004027533073627928, "batch_size": 128, "num_epochs": 748, "dropout_prob": 0.18980965379785528, "use_batch_norm": true, "activation_fn": "selu", "patience": 6}, "parameter_index": 0}
[2024-07-03 17:10:29] ERROR (WsChannel.__default__) Channel closed. Ignored command {
  type: 'ME',
  content: '{"parameter_id": 46, "trial_job_id": "eDGmO", "type": "PERIODICAL", "sequence": 1, "value": "14.616238377757908"}'
}
[2024-07-03 17:10:29] ERROR (WsChannel.__default__) Channel closed. Ignored command {
  type: 'ME',
  content: '{"parameter_id": 46, "trial_job_id": "eDGmO", "type": "PERIODICAL", "sequence": 2, "value": "10.664397033219485"}'
}
  • dispatcher.log:
[2024-07-03 16:45:49] INFO (nni.tuner.tpe/MainThread) Using random seed 668056533
[2024-07-03 16:45:49] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher started
[2024-07-03 16:45:49] INFO (nni.runtime.msg_dispatcher/Thread-1 (command_queue_worker)) Initial search space: {'hidden_sizes': {'_type': 'choice', '_value': [[], [256], [512], [1024], [1024, 512], [1024, 512, 256], [512, 256]]}, 'learning_rate': {'_type': 'loguniform', '_value': [1e-06, 0.1]}, 'batch_size': {'_type': 'choice', '_value': [32, 64, 128]}, 'num_epochs': {'_type': 'randint', '_value': [100, 1000]}, 'dropout_prob': {'_type': 'uniform', '_value': [0, 0.5]}, 'use_batch_norm': {'_type': 'choice', '_value': [True, False]}, 'activation_fn': {'_type': 'choice', '_value': ['relu', 'leaky_relu', 'sigmoid', 'tanh', 'elu', 'selu']}, 'patience': {'_type': 'randint', '_value': [0, 10]}}
[2024-07-03 17:10:21] ERROR (nni.runtime.msg_dispatcher_base/Thread-1 (command_queue_worker)) 45
Traceback (most recent call last):
  File "/home/josep/.pyenv/versions/3.10.14/lib/python3.10/site-packages/nni/runtime/msg_dispatcher_base.py", line 108, in command_queue_worker
    self.process_command(command, data)
  File "/home/josep/.pyenv/versions/3.10.14/lib/python3.10/site-packages/nni/runtime/msg_dispatcher_base.py", line 154, in process_command
    command_handlers[command](data)
  File "/home/josep/.pyenv/versions/3.10.14/lib/python3.10/site-packages/nni/runtime/msg_dispatcher.py", line 148, in handle_report_metric_data
    self._handle_final_metric_data(data)
  File "/home/josep/.pyenv/versions/3.10.14/lib/python3.10/site-packages/nni/runtime/msg_dispatcher.py", line 201, in _handle_final_metric_data
    self.tuner.receive_trial_result(id_, _trial_params[id_], value, customized=customized,
  File "/home/josep/.pyenv/versions/3.10.14/lib/python3.10/site-packages/nni/algorithms/hpo/tpe_tuner.py", line 197, in receive_trial_result
    params = self._running_params.pop(parameter_id)
KeyError: 45
[2024-07-03 17:10:28] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher exiting...
[2024-07-03 17:10:28] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher terminiated

How to reproduce it?:

I don't have a reliable reproduction. This is not a one-off: it happens occasionally, across different experiments. I lowered trial concurrency to 1 to try to avoid it, but the crash still occurs.

In this example, the crash was evidently triggered by trial 45 (the KeyError is for parameter_id 45). In the web UI, trial 45 shows as succeeded and has a recorded metric value. Yet when TPE looks up that trial's parameters in `_running_params`, the entry is not there.
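
To illustrate the failure mode in the traceback, here is a minimal sketch of how a `dict.pop(parameter_id)` without a default raises `KeyError` when the same id is handed back a second time. This is not NNI's code: the `RunningParams` class and its method names are hypothetical stand-ins, and the "final result delivered twice" scenario is only one assumed way the entry could go missing, not a confirmed diagnosis of this crash.

```python
class RunningParams:
    """Hypothetical stand-in for a tuner's table of in-flight trials (not NNI code)."""

    def __init__(self):
        self._running = {}  # parameter_id -> parameters of trials still running
        self.history = []   # (parameter_id, parameters, value) of finished trials

    def generate(self, parameter_id, params):
        # Remember the parameters until the trial reports its final result.
        self._running[parameter_id] = params
        return params

    def receive_strict(self, parameter_id, value):
        # Same shape as the failing line in the traceback:
        # pop() without a default raises KeyError if the id is absent.
        params = self._running.pop(parameter_id)
        self.history.append((parameter_id, params, value))

    def receive_tolerant(self, parameter_id, value):
        # Defensive variant: ignore results for ids that are not in flight.
        params = self._running.pop(parameter_id, None)
        if params is None:
            return False
        self.history.append((parameter_id, params, value))
        return True


table = RunningParams()
table.generate(45, {"batch_size": 128})
table.receive_strict(45, 14.6)       # first final result: entry is consumed

caught = None
try:
    table.receive_strict(45, 14.6)   # assumed duplicate delivery for the same id
except KeyError as err:
    caught = err                     # KeyError(45), as in dispatcher.log

ignored = table.receive_tolerant(45, 14.6)  # tolerant variant just returns False
```

If something like this is what happens here, checking whether the trial code can report a final result more than once for the same trial would be a reasonable first thing to rule out.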
