This repository was archived by the owner on Sep 18, 2024. It is now read-only.

Once the experiment reaches a certain point, it generally stops running and reports an error. #5802

Open
@EternityJune25

Describe the issue:

"Once the experiment reaches a certain point, it generally stops running and reports an error."

[2024-08-09 23:59:19] ERROR (nni.runtime.msg_dispatcher_base/Thread-1 (command_queue_worker)) 10
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.10/site-packages/nni/runtime/msg_dispatcher_base.py", line 108, in command_queue_worker
    self.process_command(command, data)
  File "/root/miniconda3/lib/python3.10/site-packages/nni/runtime/msg_dispatcher_base.py", line 154, in process_command
    command_handlers[command](data)
  File "/root/miniconda3/lib/python3.10/site-packages/nni/runtime/msg_dispatcher.py", line 148, in handle_report_metric_data
    self._handle_final_metric_data(data)
  File "/root/miniconda3/lib/python3.10/site-packages/nni/runtime/msg_dispatcher.py", line 201, in _handle_final_metric_data
    self.tuner.receive_trial_result(id_, _trial_params[id_], value, customized=customized,
  File "/root/miniconda3/lib/python3.10/site-packages/nni/algorithms/hpo/tpe_tuner.py", line 197, in receive_trial_result
    params = self._running_params.pop(parameter_id)
KeyError: 10

content: '{"parameter_id": 12, "trial_job_id": "YbXt7", "type": "PERIODICAL", "sequence": 199, "value": "0.2895440735801888"}'
}
[2024-08-10 00:00:06] ERROR (WsChannel.default) Channel closed. Ignored command {
type: 'ME',
content: '{"parameter_id": 12, "trial_job_id": "YbXt7", "type": "FINAL", "sequence": 0, "value": "0.2898187191127104"}'
}
[2024-08-10 00:00:07] INFO (NNIManager) Trial job YbXt7 status changed from RUNNING to SUCCEEDED
[2024-08-10 00:00:07] ERROR (WsChannel.default) Channel closed. Ignored command {
type: 'EN',
content: '{"trial_job_id":"YbXt7","event":"SUCCEEDED","hyper_params":"{\"parameter_id\": 12, \"parameter_source\": \"algorithm\", \"parameters\": {\"activate\": \"elu\", \"d_emb\": 64, \"d_hid\": 32, \"drop\": 0.3884039376983632, \"gamma\": 6.4905452738897065, \"l1\": 1.4578424787079767, \"l2\": 38.44410448714523, \"l4\": 0.29277084068918136, \"lr\": 9.015207683143664e-05, \"mask\": 0.004542790568841141, \"mode\": \"GAT\", \"t\": 0.6139793721895512, \"mask_edge\": 0.07705512469912157, \"instance_temperature\": 0.6737029785000441, \"cluster_temperature\": 0.5472419195458156}, \"parameter_index\": 0}"}'
}
[2024-08-10 00:00:07] ERROR (WsChannel.default) Channel closed. Ignored command { type: 'GE', content: '1' }
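
For anyone reading the traceback: the last frame fails on a plain `dict.pop` with no default, so any final result that arrives for a `parameter_id` no longer (or never) present in `_running_params` raises `KeyError`. The sketch below is not NNI's code. `SketchTuner` and `receive_trial_result_tolerant` are hypothetical names, and it only illustrates one way such an error could arise (the same id being reported twice). The logs above do not establish that this is what happened in the experiment.

```python
# Hypothetical stand-in for a tuner that tracks in-flight trials in a dict.
# Names echo the traceback for readability; this is NOT NNI's implementation.
class SketchTuner:
    def __init__(self):
        self._running_params = {}  # parameter_id -> parameters handed to a trial
        self.history = []          # (parameters, final metric) pairs

    def generate_parameters(self, parameter_id):
        params = {"lr": 1e-4}  # placeholder values, not a real search space
        self._running_params[parameter_id] = params
        return params

    def receive_trial_result(self, parameter_id, value):
        # dict.pop without a default raises KeyError when the id is absent.
        params = self._running_params.pop(parameter_id)
        self.history.append((params, value))

    def receive_trial_result_tolerant(self, parameter_id, value):
        # Assumed defensive variant: ignore results for unknown ids.
        params = self._running_params.pop(parameter_id, None)
        if params is None:
            return False
        self.history.append((params, value))
        return True


tuner = SketchTuner()
tuner.generate_parameters(10)
tuner.receive_trial_result(10, 0.29)      # first final result is accepted

try:
    tuner.receive_trial_result(10, 0.29)  # same id again: nothing left to pop
    error = None
except KeyError as exc:
    error = exc

print(repr(error))
```

The tolerant variant only hides the symptom. If the id is missing because of a duplicate or out-of-order final report, the underlying cause would still need to be tracked down in the dispatcher or trial code.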

Environment:

  • NNI version:
  • Training service (local|remote|pai|aml|etc):
  • Client OS:
  • Server OS (for remote mode only):
  • Python version:
  • PyTorch/TensorFlow version:
  • Is conda/virtualenv/venv used?:
  • Is running in Docker?:

Configuration:

  • Experiment config (remember to remove secrets!):
  • Search space:

Log message:

  • nnimanager.log:
  • dispatcher.log:
  • nnictl stdout and stderr:

How to reproduce it?:
