
DeepSpeed on Power with CPU Accelerator and AutoTP #7108

Open
@vinithakv

Description


Hi,

I have been trying to use DeepSpeed on the Power Linux platform in a virtual environment with CPU accelerator support.
When I tried to enable AutoTP, the following compilation error was observed while building the deepspeed_shm_comm extension:

(ptenv26) $ deepspeed --num_accelerators=2 --bind_cores_to_rank --bind_core_list 0-40 driver --device=cpu --reps=1 --model="~/granite-3b" --model_class=GPTBigCodeForCausalLM --input_size=32 --output_size=200 --batch_size=1
[2025-02-28 01:57:06,419] [WARNING] [real_accelerator.py:181:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2025-02-28 01:57:06,428] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cpu (auto detect)
[2025-02-28 01:57:08,352] [WARNING] [runner.py:215:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2025-02-28 01:57:08,356] [INFO] [runner.py:607:main] cmd = /home/user/ptenv26/bin/python3.12 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None --bind_cores_to_rank --bind_core_list=0-40 driver-ds-fp32-v3 --device=cpu --reps=1 --model=/home/user/models/granite-3b --model_class=GPTBigCodeForCausalLM --input_size=32 --output_size=200 --batch_size=1

[2025-02-28 01:57:09,616] [WARNING] [real_accelerator.py:181:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2025-02-28 01:57:09,626] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cpu (auto detect)
[2025-02-28 01:57:11,535] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2025-02-28 01:57:11,535] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=2, node_rank=0
[2025-02-28 01:57:11,535] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2025-02-28 01:57:11,535] [INFO] [launch.py:164:main] dist_world_size=2
[2025-02-28 01:57:11,535] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2025-02-28 01:57:11,552] [INFO] [launch.py:256:main] process 2180496 spawned with command: ['numactl', '-m', '0', '-C', '0-19', '/home/user/ptenv26/bin/python3.12', '-u', 'driver-ds-fp32-v3', '--local_rank=0', '--device=cpu', '--reps=1', '--model=/home/user/models/granite-3b', '--model_class=GPTBigCodeForCausalLM', '--input_size=32', '--output_size=200', '--batch_size=1']
[2025-02-28 01:57:11,568] [INFO] [launch.py:256:main] process 2180499 spawned with command: ['numactl', '-m', '0', '-C', '20-39', '/home/user/ptenv26/bin/python3.12', '-u', 'driver-ds-fp32-v3', '--local_rank=1', '--device=cpu', '--reps=1', '--model=/home/user/models/granite-3b', '--model_class=GPTBigCodeForCausalLM', '--input_size=32', '--output_size=200', '--batch_size=1']
[2025-02-28 01:57:13,076] [WARNING] [real_accelerator.py:181:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2025-02-28 01:57:13,086] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cpu (auto detect)
[2025-02-28 01:57:13,087] [WARNING] [real_accelerator.py:181:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2025-02-28 01:57:13,097] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cpu (auto detect)
Loading checkpoint shards: 100%|
…

[2025-02-28 01:58:06,762] [INFO] [logging.py:128:log_dist] [Rank -1] DeepSpeed info: version=0.16.4, git-hash=unknown, git-branch=unknown
[2025-02-28 01:58:06,762] [INFO] [logging.py:128:log_dist] [Rank -1] DeepSpeed info: version=0.16.4, git-hash=unknown, git-branch=unknown
[2025-02-28 01:58:06,763] [INFO] [logging.py:128:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2025-02-28 01:58:06,763] [INFO] [logging.py:128:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2025-02-28 01:58:06,766] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-02-28 01:58:06,766] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-02-28 01:58:06,766] [INFO] [comm.py:689:init_distributed] Initializing TorchBackend in DeepSpeed with backend gloo
Using /home/user/.cache/torch_extensions/py312_cpu as PyTorch extensions root...
Using /home/user/.cache/torch_extensions/py312_cpu as PyTorch extensions root...

Creating extension directory /home/user/.cache/torch_extensions/py312_cpu/deepspeed_shm_comm...
Creating extension directory /home/user/.cache/torch_extensions/py312_cpu/deepspeed_shm_comm...
Emitting ninja build file /home/user/.cache/torch_extensions/py312_cpu/deepspeed_shm_comm/build.ninja...
Building extension module deepspeed_shm_comm...
…
c++ -MMD -MF shm.o.d -DTORCH_EXTENSION_NAME=deepspeed_shm_comm -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -I/home/user/ptenv26/lib64/python3.12/site-packages/deepspeed/ops/csrc/cpu/includes -isystem /home/user/ptenv26/lib64/python3.12/site-packages/torch/include -isystem /home/user/ptenv26/lib64/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /usr/include/python3.12 -D_GLIBCXX_USE_CXX11_ABI=1 -fPIC -std=c++17 -O2 -fopenmp -c /home/user/ptenv26/lib64/python3.12/site-packages/deepspeed/ops/csrc/cpu/comm/shm.cpp -o shm.o 
/home/user/ptenv26/lib64/python3.12/site-packages/deepspeed/ops/csrc/cpu/comm/shm.cpp:10:10: fatal error: immintrin.h: No such file or directory
   10 | #include <immintrin.h>
      |          ^~~~~~~~~~~~~
compilation terminated.


The file shm.cpp contains Intel-specific intrinsic code. I would like to extend this for Power CPU support.
Could anyone help with suggestions?
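
One direction I am considering (a sketch only, not a tested patch) is to guard the x86-only immintrin.h include and the AVX reduction kernels behind architecture checks, and fall back to VSX (altivec.h) or plain scalar code on Power. The names reduce_add_fp32, HAVE_X86_SIMD and HAVE_VSX below are hypothetical and only illustrate the guarding pattern, they are not symbols from shm.cpp:

```cpp
// Illustrative sketch only: reduce_add_fp32, HAVE_X86_SIMD and HAVE_VSX
// are hypothetical names, not the actual shm.cpp symbols.
#include <cstddef>
#include <cstdio>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>   // x86-only AVX/AVX-512 intrinsics
#define HAVE_X86_SIMD 1
#elif defined(__powerpc64__) && defined(__VSX__)
#include <altivec.h>     // VSX intrinsics on Power
#define HAVE_VSX 1
#endif

// Element-wise sum of `src` into `dst` -- the core of one reduce step.
static void reduce_add_fp32(float* dst, const float* src, size_t n) {
#if defined(HAVE_X86_SIMD) && defined(__AVX512F__)
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {               // 16 floats per AVX-512 register
        __m512 a = _mm512_loadu_ps(dst + i);
        __m512 b = _mm512_loadu_ps(src + i);
        _mm512_storeu_ps(dst + i, _mm512_add_ps(a, b));
    }
    for (; i < n; ++i) dst[i] += src[i];         // tail
#elif defined(HAVE_VSX)
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {                 // 4 floats per VSX register
        __vector float a = vec_xl(0, dst + i);
        __vector float b = vec_xl(0, src + i);
        vec_xst(vec_add(a, b), 0, dst + i);
    }
    for (; i < n; ++i) dst[i] += src[i];         // tail
#else
    for (size_t i = 0; i < n; ++i) dst[i] += src[i];  // portable scalar fallback
#endif
}

int main() {
    float a[8] = {1, 1, 1, 1, 1, 1, 1, 1};
    float b[8] = {2, 2, 2, 2, 2, 2, 2, 2};
    reduce_add_fp32(a, b, 8);
    std::printf("a[0] = %.1f\n", a[0]);  // expect 3.0
    return 0;
}
```

On Power, compiling with something like `g++ -O2 -mcpu=power9 -mvsx` should select the VSX path, while targets without VSX would fall back to the scalar loop. Would a change along these lines be acceptable for shm.cpp?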

Regards,
Vinitha Vijayan
