Description
🐛 Describe the bug
I am trying to run inference with the OPT-30B model by following this example: https://github.com/pytorch/serve/tree/master/examples/large_models/deepspeed
However, as specified in the model-config.yaml file, a checkpoints.json file is required. That file is consumed here: https://github.com/pytorch/serve/blob/master/ts/handler_utils/distributed/deepspeed.py#L40
Because this file is not produced by following the example, the model fails to load. The error logs are attached below.
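For reference, below is a minimal sketch of how such a checkpoints.json could be generated from the downloaded weight shards. The schema ("type" / "checkpoints" / "version") and the snapshot path are assumptions based on common DeepSpeed inference examples, not something the TorchServe example documents:

```python
# Minimal sketch, NOT from the TorchServe example: build a checkpoints.json that
# lists the downloaded OPT-30B weight shards for DeepSpeed inference.
# Assumptions: the shards were fetched with the example's download step into
# `model_path`, and DeepSpeed accepts the {"type", "checkpoints", "version"} schema.
import glob
import json
import os

model_path = "model/models--facebook--opt-30b/snapshots/<hash>"  # hypothetical path

# Collect the *.bin shard file names, relative to model_path.
shards = sorted(
    os.path.basename(p) for p in glob.glob(os.path.join(model_path, "*.bin"))
)

checkpoints = {
    "type": "DS_MODEL",      # assumed checkpoint type for Hugging Face weights
    "checkpoints": shards,   # weight shard file names
    "version": 1.0,
}

with open(os.path.join(model_path, "checkpoints.json"), "w") as f:
    json.dump(checkpoints, f, indent=2)
```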
Error logs
2023-09-05T23:22:14,652 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Failed to load model opt, exception Cannot copy out of meta tensor; no data!
2023-09-05T23:22:14,652 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Traceback (most recent call last):
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 131, in load_model
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - service = model_loader.load(
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_loader.py", line 135, in load
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - initialize_fn(service.context)
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - File "/home/model-server/tmp/models/c1130e4b01c345b9be913ef8414518cb/custom_handler.py", line 55, in initialize
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - ds_engine = get_ds_engine(self.model, ctx)
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/handler_utils/distributed/deepspeed.py", line 35, in get_ds_engine
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - ds_engine = deepspeed.init_inference(
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/deepspeed/__init__.py", line 342, in init_inference
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - engine = InferenceEngine(model, config=ds_inference_config)
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 154, in __init__
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - self.module.to(device)
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2053, in to
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - return super().to(*args, **kwargs)
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1145, in to
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - return self._apply(convert)
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - module._apply(fn)
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - module._apply(fn)
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - module._apply(fn)
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 820, in _apply
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - param_applied = fn(param)
2023-09-05T23:22:14,654 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1143, in convert
2023-09-05T23:22:14,654 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
2023-09-05T23:22:14,654 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - NotImplementedError: Cannot copy out of meta tensor; no data!
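For context on the traceback: the failure happens when DeepSpeed moves the meta-initialized module to the GPU without ever materializing real weights. Below is a rough sketch of the pattern where init_inference is given a checkpoint descriptor so the meta tensors get filled; this is an assumption about the intended flow, not the handler's actual code:

```python
# Rough sketch of DeepSpeed inference init with a checkpoint descriptor; an
# assumption about the intended flow, not the TorchServe handler's exact code.
import deepspeed
import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("facebook/opt-30b")

# Build the model on the meta device (no weight data allocated yet).
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(config)

# Without `checkpoint=...`, DeepSpeed has no weights to load and
# self.module.to(device) raises "Cannot copy out of meta tensor; no data!",
# exactly as in the logs above.
engine = deepspeed.init_inference(
    model,
    mp_size=4,                      # assumption: tensor parallel across the 4 A10Gs
    dtype=torch.float16,
    replace_with_kernel_inject=True,
    checkpoint="checkpoints.json",  # hypothetical path to the descriptor file
)
model = engine.module
```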
Installation instructions
Docker image URI: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.0.1-gpu-py310-cu118-ubuntu20.04-ec2
EC2 instance: g5dn.24xlarge
Model Packaging
Created the model artifact by following this example:
https://github.com/pytorch/serve/tree/master/examples/large_models/deepspeed
config.properties
No response
Versions
------------------------------------------------------------------------------------------
Environment headers
------------------------------------------------------------------------------------------
Torchserve branch:
torchserve==0.8.1
torch-model-archiver==0.8.1
Python version: 3.10 (64-bit runtime)
Python executable: /opt/conda/bin/python3
Versions of relevant python libraries:
captum==0.6.0
numpy==1.22.4
nvgpu==0.10.0
psutil==5.9.5
requests==2.31.0
torch==2.0.1+cu118
torch-model-archiver==0.8.1
torchaudio==2.0.2+cu118
torchdata==0.6.1+cu118
torchserve==0.8.1
torchtext==0.15.2+cu118
torchvision==0.15.2+cu118
wheel==0.38.4
Java Version:
OS: N/A
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: N/A
CMake version: version 3.27.2
Is CUDA available: Yes
CUDA runtime version: 11.8.89
GPU models and configuration:
GPU 0: NVIDIA A10G
GPU 1: NVIDIA A10G
GPU 2: NVIDIA A10G
GPU 3: NVIDIA A10G
Nvidia driver version: 535.54.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.8.0
Repro instructions
Please follow the instructions here to reproduce this error: https://github.com/pytorch/serve/tree/master/examples/large_models/deepspeed
Possible Solution
No response