No OpKernel was registered to support Op 'HbSparseSegmentMeanGrad1' used by node #151

@karterotte

Description

Current behavior

I'm using the Docker image "alideeprec/deeprec-release:deeprec2306-gpu-py38-cu116-ubuntu20.04-hybridbackend" and hit an error during training.
This is the log:

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
2023-08-23 15:35:03.977041: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2900000000 Hz
2023-08-23 15:35:03.986474: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0xd8fa950 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-08-23 15:35:03.986505: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2023-08-23 15:35:03.989147: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcuda.so.1
2023-08-23 15:35:04.000592: E tensorflow/stream_executor/cuda/cuda_driver.cc:282] failed call to cuInit: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW
2023-08-23 15:35:04.000614: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] no NVIDIA GPU device is present: /dev/nvidia0 does not exist
INFO:tensorflow:run without loading checkpoint
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1374, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1357, in _run_fn
    self._extend_graph()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1397, in _extend_graph
    tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'HbSparseSegmentMeanGrad1' used by {{node gradients/wide_deep/input_layer/bg_play_all_albums_last_15d___deviceId___device__embedding/bg_play_all_albums_last_15d___deviceId___device__embedding_weights/embedding_lookup_sparse_grad/SparseSegmentMeanGrad}}with these attrs: [Tidx=DT_INT64, Tsegmentids=DT_INT32, T=DT_FLOAT, N=1]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
  device='GPU'; T in [DT_DOUBLE]; Tidx in [DT_INT64]; Tsegmentids in [DT_INT32]
  device='GPU'; T in [DT_DOUBLE]; Tidx in [DT_INT32]; Tsegmentids in [DT_INT32]
  device='GPU'; T in [DT_FLOAT]; Tidx in [DT_INT64]; Tsegmentids in [DT_INT32]
  device='GPU'; T in [DT_FLOAT]; Tidx in [DT_INT32]; Tsegmentids in [DT_INT32]

	 [[gradients/wide_deep/input_layer/bg_play_all_albums_last_15d___deviceId___device__embedding/bg_play_all_albums_last_15d___deviceId___device__embedding_weights/embedding_lookup_sparse_grad/SparseSegmentMeanGrad]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run.py", line 22, in <module>
    do_train(UserModel, user_params)
  File "/var/workspace/utils/job.py", line 100, in do_train
    hb.estimator.train_and_evaluate(
  File "/usr/local/lib/python3.8/dist-packages/hybridbackend/tensorflow/estimator/estimator.py", line 553, in train_and_evaluate
    return estimator.train_and_evaluate(train_spec, eval_spec, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/hybridbackend/tensorflow/estimator/estimator.py", line 336, in train_and_evaluate
    return self.train(
  File "/usr/local/lib/python3.8/dist-packages/hybridbackend/tensorflow/estimator/estimator.py", line 209, in train
    return super().train(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1174, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1206, in _train_model_default
    return self._train_with_estimator_spec(estimator_spec, worker_hooks,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1491, in _train_with_estimator_spec
    with training.MonitoredTrainingSession(
  File "/usr/local/lib/python3.8/dist-packages/hybridbackend/tensorflow/training/session.py", line 129, in HybridBackendMonitoredTrainingSession
    sess = fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 660, in MonitoredTrainingSession
    return MonitoredSession(
  File "/usr/local/lib/python3.8/dist-packages/hybridbackend/tensorflow/training/session.py", line 63, in __init__
    super(cls, self).__init__(  # pylint: disable=bad-super-call
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 805, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1287, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1292, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 958, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 718, in create_session
    return self._get_session_manager().prepare_session(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/session_manager.py", line 306, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 964, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1188, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1367, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1393, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'HbSparseSegmentMeanGrad1' used by node gradients/wide_deep/input_layer/bg_play_all_albums_last_15d___deviceId___device__embedding/bg_play_all_albums_last_15d___deviceId___device__embedding_weights/embedding_lookup_sparse_grad/SparseSegmentMeanGrad (defined at /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py:1748) with these attrs: [Tidx=DT_INT64, Tsegmentids=DT_INT32, T=DT_FLOAT, N=1]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
  device='GPU'; T in [DT_DOUBLE]; Tidx in [DT_INT64]; Tsegmentids in [DT_INT32]
  device='GPU'; T in [DT_DOUBLE]; Tidx in [DT_INT32]; Tsegmentids in [DT_INT32]
  device='GPU'; T in [DT_FLOAT]; Tidx in [DT_INT64]; Tsegmentids in [DT_INT32]
  device='GPU'; T in [DT_FLOAT]; Tidx in [DT_INT32]; Tsegmentids in [DT_INT32]

	 [[gradients/wide_deep/input_layer/bg_play_all_albums_last_15d___deviceId___device__embedding/bg_play_all_albums_last_15d___deviceId___device__embedding_weights/embedding_lookup_sparse_grad/SparseSegmentMeanGrad]]
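Reading the error above: the op only lists kernels for device 'GPU', while the session registered only CPU and XLA_CPU. A minimal sketch, using just the Python standard library, that pulls this mismatch out of the error text; `kernel_device_mismatch` is a hypothetical helper name (not part of TensorFlow, DeepRec or HybridBackend), and the sample string is shortened from the log above.

```python
import re


def kernel_device_mismatch(error_text):
    """Parse a 'No OpKernel was registered' message.

    Returns (registered_devices, kernel_devices, missing), where `missing`
    holds device types that have a kernel but are not registered in the session.
    """
    # Matches e.g. "Registered devices: [CPU, XLA_CPU]"
    m = re.search(r"Registered devices: \[([^\]]*)\]", error_text)
    registered = set()
    if m:
        registered = {d.strip() for d in m.group(1).split(",") if d.strip()}
    # Matches e.g. "device='GPU'; T in [DT_FLOAT]; ..."
    kernels = set(re.findall(r"device='([^']+)'", error_text))
    return registered, kernels, kernels - registered


# Shortened copy of the relevant part of the log above.
sample = """Registered devices: [CPU, XLA_CPU]
Registered kernels:
  device='GPU'; T in [DT_FLOAT]; Tidx in [DT_INT64]; Tsegmentids in [DT_INT32]
  device='GPU'; T in [DT_FLOAT]; Tidx in [DT_INT32]; Tsegmentids in [DT_INT32]
"""

registered, kernels, missing = kernel_device_mismatch(sample)
print("registered:", sorted(registered))
print("kernels for:", sorted(kernels))
print("kernel devices not registered:", sorted(missing))
```

For this log the only device type with a kernel is GPU, and it is not among the registered devices, which matches the earlier `cuInit` failure.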

System information

  • GPU model and memory: 3090
  • OS Platform: ubuntu20.04
  • Docker version:
  • GCC/CUDA/cuDNN version: cu116
  • Python/conda version: py38
  • TensorFlow/PyTorch version: tf1.15

Yes
