Failure in loading Deepspeed large model example #2569

Open
@sachanub

Description

🐛 Describe the bug

I am trying to perform inference with the OPT-30B model by following this example: https://github.com/pytorch/serve/tree/master/examples/large_models/deepspeed

However, as specified in the model-config.yaml file, a checkpoints.json file is required. It is consumed here: https://github.com/pytorch/serve/blob/master/ts/handler_utils/distributed/deepspeed.py#L40

Since that file is missing, the model fails to load. The error logs are attached below.
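A possible workaround is to generate the checkpoints.json manifest from the downloaded weight shards before starting TorchServe. This is a minimal sketch, not the example's own code: the `model_path` location and the `"ds_model"` checkpoint type are assumptions that should be verified against your snapshot layout and DeepSpeed version.

```python
import glob
import json
import os

def write_checkpoints_json(model_path, out_file="checkpoints.json"):
    """Collect the sharded *.bin weight files under model_path and write
    the checkpoint manifest that deepspeed.init_inference can consume."""
    shards = sorted(
        glob.glob(os.path.join(model_path, "**", "*.bin"), recursive=True)
    )
    manifest = {
        "type": "ds_model",  # assumed checkpoint type; verify for your DeepSpeed version
        "checkpoints": shards,
        "version": 1.0,
    }
    with open(out_file, "w") as f:
        json.dump(manifest, f)
    return manifest

# Hypothetical snapshot directory; adjust to where the OPT-30B weights were downloaded.
# write_checkpoints_json("model/models--facebook--opt-30b/snapshots")
```

The resulting checkpoints.json would then be packaged alongside model-config.yaml so the path referenced in deepspeed.py resolves at load time.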

Error logs

2023-09-05T23:22:14,652 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Failed to load model opt, exception Cannot copy out of meta tensor; no data!
2023-09-05T23:22:14,652 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Traceback (most recent call last):
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 131, in load_model
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     service = model_loader.load(
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/ts/model_loader.py", line 135, in load
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     initialize_fn(service.context)
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/home/model-server/tmp/models/c1130e4b01c345b9be913ef8414518cb/custom_handler.py", line 55, in initialize
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     ds_engine = get_ds_engine(self.model, ctx)
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/ts/handler_utils/distributed/deepspeed.py", line 35, in get_ds_engine
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     ds_engine = deepspeed.init_inference(
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/deepspeed/__init__.py", line 342, in init_inference
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     engine = InferenceEngine(model, config=ds_inference_config)
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 154, in __init__
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     self.module.to(device)
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2053, in to
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     return super().to(*args, **kwargs)
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1145, in to
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     return self._apply(convert)
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     module._apply(fn)
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     module._apply(fn)
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     module._apply(fn)
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 820, in _apply
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     param_applied = fn(param)
2023-09-05T23:22:14,654 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1143, in convert
2023-09-05T23:22:14,654 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
2023-09-05T23:22:14,654 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - NotImplementedError: Cannot copy out of meta tensor; no data!
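For context, the NotImplementedError at the bottom of the trace is PyTorch's standard behavior for meta tensors: parameters created on the meta device carry only shape and dtype, no storage, so a plain `.to(device)` has nothing to copy. The handler presumably loads the model with meta weights and relies on checkpoints.json to materialize real tensors inside `deepspeed.init_inference`; without it, the engine's `self.module.to(device)` hits this error. A minimal standalone reproduction:

```python
import torch

# Parameters on the "meta" device are shape-only placeholders with no data.
layer = torch.nn.Linear(4, 4, device="meta")

try:
    layer.to("cpu")  # copying out of a meta tensor is not possible
except NotImplementedError as e:
    print(e)  # "Cannot copy out of meta tensor; no data!"
```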

Installation instructions

Docker image URI: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.0.1-gpu-py310-cu118-ubuntu20.04-ec2
EC2 instance: g5dn.24xlarge

Model Packaging

Created the model artifact by following this example:
https://github.com/pytorch/serve/tree/master/examples/large_models/deepspeed

config.properties

No response

Versions

------------------------------------------------------------------------------------------
Environment headers
------------------------------------------------------------------------------------------
Torchserve branch: 

torchserve==0.8.1
torch-model-archiver==0.8.1

Python version: 3.10 (64-bit runtime)
Python executable: /opt/conda/bin/python3

Versions of relevant python libraries:
captum==0.6.0
numpy==1.22.4
nvgpu==0.10.0
psutil==5.9.5
requests==2.31.0
torch==2.0.1+cu118
torch-model-archiver==0.8.1
torchaudio==2.0.2+cu118
torchdata==0.6.1+cu118
torchserve==0.8.1
torchtext==0.15.2+cu118
torchvision==0.15.2+cu118
wheel==0.38.4

Java Version:


OS: N/A
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: N/A
CMake version: version 3.27.2

Is CUDA available: Yes
CUDA runtime version: 11.8.89
GPU models and configuration: 
GPU 0: NVIDIA A10G
GPU 1: NVIDIA A10G
GPU 2: NVIDIA A10G
GPU 3: NVIDIA A10G
Nvidia driver version: 535.54.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.8.0

Repro instructions

Please follow the instructions here to reproduce this error: https://github.com/pytorch/serve/tree/master/examples/large_models/deepspeed

Possible Solution

No response

Labels

example · question (Further information is requested) · triaged (Issue has been reviewed and triaged)
