Current behavior
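I'm using the Docker image "alideeprec/deeprec-release:deeprec2306-gpu-py38-cu116-ubuntu20.04-hybridbackend" and hit an error during training. The GPU is not detected inside the container (cuInit fails), and session creation then aborts because 'HbSparseSegmentMeanGrad1' has no CPU kernel.
This is the log: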
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
2023-08-23 15:35:03.977041: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2900000000 Hz
2023-08-23 15:35:03.986474: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0xd8fa950 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-08-23 15:35:03.986505: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2023-08-23 15:35:03.989147: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcuda.so.1
2023-08-23 15:35:04.000592: E tensorflow/stream_executor/cuda/cuda_driver.cc:282] failed call to cuInit: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW
2023-08-23 15:35:04.000614: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] no NVIDIA GPU device is present: /dev/nvidia0 does not exist
INFO:tensorflow:run without loading checkpoint
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1374, in _do_call
return fn(*args)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1357, in _run_fn
self._extend_graph()
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1397, in _extend_graph
tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'HbSparseSegmentMeanGrad1' used by {{node gradients/wide_deep/input_layer/bg_play_all_albums_last_15d___deviceId___device__embedding/bg_play_all_albums_last_15d___deviceId___device__embedding_weights/embedding_lookup_sparse_grad/SparseSegmentMeanGrad}}with these attrs: [Tidx=DT_INT64, Tsegmentids=DT_INT32, T=DT_FLOAT, N=1]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
device='GPU'; T in [DT_DOUBLE]; Tidx in [DT_INT64]; Tsegmentids in [DT_INT32]
device='GPU'; T in [DT_DOUBLE]; Tidx in [DT_INT32]; Tsegmentids in [DT_INT32]
device='GPU'; T in [DT_FLOAT]; Tidx in [DT_INT64]; Tsegmentids in [DT_INT32]
device='GPU'; T in [DT_FLOAT]; Tidx in [DT_INT32]; Tsegmentids in [DT_INT32]
[[gradients/wide_deep/input_layer/bg_play_all_albums_last_15d___deviceId___device__embedding/bg_play_all_albums_last_15d___deviceId___device__embedding_weights/embedding_lookup_sparse_grad/SparseSegmentMeanGrad]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "run.py", line 22, in <module>
do_train(UserModel, user_params)
File "/var/workspace/utils/job.py", line 100, in do_train
hb.estimator.train_and_evaluate(
File "/usr/local/lib/python3.8/dist-packages/hybridbackend/tensorflow/estimator/estimator.py", line 553, in train_and_evaluate
return estimator.train_and_evaluate(train_spec, eval_spec, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/hybridbackend/tensorflow/estimator/estimator.py", line 336, in train_and_evaluate
return self.train(
File "/usr/local/lib/python3.8/dist-packages/hybridbackend/tensorflow/estimator/estimator.py", line 209, in train
return super().train(
File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1174, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1206, in _train_model_default
return self._train_with_estimator_spec(estimator_spec, worker_hooks,
File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1491, in _train_with_estimator_spec
with training.MonitoredTrainingSession(
File "/usr/local/lib/python3.8/dist-packages/hybridbackend/tensorflow/training/session.py", line 129, in HybridBackendMonitoredTrainingSession
sess = fn(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 660, in MonitoredTrainingSession
return MonitoredSession(
File "/usr/local/lib/python3.8/dist-packages/hybridbackend/tensorflow/training/session.py", line 63, in __init__
super(cls, self).__init__( # pylint: disable=bad-super-call
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 805, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1287, in __init__
_WrappedSession.__init__(self, self._create_session())
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1292, in _create_session
return self._sess_creator.create_session()
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 958, in create_session
self.tf_sess = self._session_creator.create_session()
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 718, in create_session
return self._get_session_manager().prepare_session(
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/session_manager.py", line 306, in prepare_session
sess.run(init_op, feed_dict=init_feed_dict)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 964, in run
result = self._run(None, fetches, feed_dict, options_ptr,
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1188, in _run
results = self._do_run(handle, final_targets, final_fetches,
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1367, in _do_run
return self._do_call(_run_fn, feeds, fetches, targets, options,
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1393, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'HbSparseSegmentMeanGrad1' used by node gradients/wide_deep/input_layer/bg_play_all_albums_last_15d___deviceId___device__embedding/bg_play_all_albums_last_15d___deviceId___device__embedding_weights/embedding_lookup_sparse_grad/SparseSegmentMeanGrad (defined at /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py:1748) with these attrs: [Tidx=DT_INT64, Tsegmentids=DT_INT32, T=DT_FLOAT, N=1]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
device='GPU'; T in [DT_DOUBLE]; Tidx in [DT_INT64]; Tsegmentids in [DT_INT32]
device='GPU'; T in [DT_DOUBLE]; Tidx in [DT_INT32]; Tsegmentids in [DT_INT32]
device='GPU'; T in [DT_FLOAT]; Tidx in [DT_INT64]; Tsegmentids in [DT_INT32]
device='GPU'; T in [DT_FLOAT]; Tidx in [DT_INT32]; Tsegmentids in [DT_INT32]
[[gradients/wide_deep/input_layer/bg_play_all_albums_last_15d___deviceId___device__embedding/bg_play_all_albums_last_15d___deviceId___device__embedding_weights/embedding_lookup_sparse_grad/SparseSegmentMeanGrad]]
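Reading the error text, the only kernels listed for 'HbSparseSegmentMeanGrad1' are GPU kernels, while the session reports only CPU and XLA_CPU devices, which is consistent with the cuInit failure earlier in the log. Below is a minimal sketch that pulls this mismatch out of the message. The helper name `summarize_kernel_mismatch` is hypothetical (it is not part of TensorFlow or HybridBackend), and the input is just an excerpt copied from the log above.

```python
import re

# Excerpt copied verbatim from the error message in the log above.
ERROR_TEXT = """\
Registered devices: [CPU, XLA_CPU]
Registered kernels:
device='GPU'; T in [DT_DOUBLE]; Tidx in [DT_INT64]; Tsegmentids in [DT_INT32]
device='GPU'; T in [DT_DOUBLE]; Tidx in [DT_INT32]; Tsegmentids in [DT_INT32]
device='GPU'; T in [DT_FLOAT]; Tidx in [DT_INT64]; Tsegmentids in [DT_INT32]
device='GPU'; T in [DT_FLOAT]; Tidx in [DT_INT32]; Tsegmentids in [DT_INT32]
"""


def summarize_kernel_mismatch(text):
    """Hypothetical helper: parse a 'No OpKernel was registered' message.

    Returns (registered_devices, kernel_devices, missing), where `missing`
    holds device types that have a kernel but are not visible to the session.
    """
    match = re.search(r"Registered devices: \[([^\]]*)\]", text)
    names = match.group(1).split(",") if match else []
    registered = {name.strip() for name in names if name.strip()}
    kernel_devices = set(re.findall(r"device='([^']+)'", text))
    missing = kernel_devices - registered
    return registered, kernel_devices, missing


registered, kernel_devices, missing = summarize_kernel_mismatch(ERROR_TEXT)
print("devices visible to the session:", sorted(registered))
print("devices with a kernel for the op:", sorted(kernel_devices))
print("kernel devices not visible:", sorted(missing))
```

If the last line lists GPU, the op cannot be placed until the GPU is visible inside the container, so the driver/CUDA compatibility error is probably the first thing to look at.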
System information
- GPU model and memory: RTX 3090
- OS Platform: Ubuntu 20.04
- Docker version:
- GCC/CUDA/cuDNN version: CUDA 11.6 (cu116)
- Python/conda version: Python 3.8 (py38)
- TensorFlow/PyTorch version: TensorFlow 1.15
Yes