No-SSH multi-node multi-GPU training with DeepSpeed in a container environment #6685

Open
@Justin-12138

Description

Reminder

  • I have read the above rules and searched the existing issues.

System Info

llamafactory==0.9.1.dev0
transformers==4.46.1
deepspeed==0.15.4

Reproduction

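I would like to ask about running DeepSpeed pseudo multi-node, multi-GPU training in a container environment.
1: When I train with passwordless SSH configured, the final results are saved on the node listed on the first line of the hostfile, for example:
node3 slots=1
node4 slots=1
The DeepSpeed script then has to be run on that first node.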

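The final results are saved as shown below, and no problems occurred during training:
Image
I now want to do the same thing in a k8s environment. How should I go about it? As I read the official DeepSpeed documentation, the launch script seems to have to be run on every node: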

deepspeed --hostfile=myhostfile --no_ssh --node_rank=<n> \
    --master_addr=<addr> --master_port=<port> \
    <client_entry.py> <client args> \
    --deepspeed --deepspeed_config ds_config.json
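
To make the question concrete, here is a minimal sketch of how each pod might assemble that per-node command by itself. It is only an illustration under assumptions that are not part of my setup: the pods are assumed to come from a Kubernetes StatefulSet, so their names end in an ordinal (for example `trainer-0`, `trainer-1`), that ordinal is used as `--node_rank`, and pod 0 is used as the master. The pod names, port `29500`, `train.py`, and all helper function names are hypothetical. Only the `deepspeed` flags themselves are taken from the command above.

```python
import re
import shlex


def node_rank_from_hostname(hostname: str) -> int:
    """Return the trailing ordinal of a StatefulSet-style pod name ('trainer-1' -> 1)."""
    match = re.search(r"-(\d+)$", hostname)
    if match is None:
        raise ValueError(f"no trailing ordinal in hostname: {hostname!r}")
    return int(match.group(1))


def build_hostfile(hosts, slots):
    """Render hostfile text in the 'name slots=N' format shown earlier in this issue."""
    return "".join(f"{host} slots={slots}\n" for host in hosts)


def build_launch_command(node_rank, master_addr, master_port,
                         hostfile, entry, entry_args):
    """Assemble the no-SSH launch command as an argument list for one node."""
    return [
        "deepspeed",
        f"--hostfile={hostfile}",
        "--no_ssh",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        entry,
        *entry_args,
    ]


if __name__ == "__main__":
    # Hypothetical two-pod job; in a real pod the name could be read
    # with socket.gethostname() instead of being hard-coded.
    hosts = ["trainer-0", "trainer-1"]
    this_pod = "trainer-1"

    # Hostfile contents that every pod would need at the path passed below.
    print(build_hostfile(hosts, slots=1), end="")

    command = build_launch_command(
        node_rank=node_rank_from_hostname(this_pod),
        master_addr=hosts[0],   # assumption: pod 0 acts as the master
        master_port=29500,      # assumption: any free port reachable by all pods
        hostfile="myhostfile",
        entry="train.py",       # hypothetical stand-in for <client_entry.py>
        entry_args=["--deepspeed", "--deepspeed_config", "ds_config.json"],
    )
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(command))
```

Each pod would run the same script and differ only in the rank derived from its own name. Whether the hostfile entries must match the pod hostnames exactly when `--no_ssh` is used is something I have not verified.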

Others

No response

Labels: bug, pending