How to use ZeRO-3 within each node and data parallelism between nodes?

Suppose there is more than one compute node, and the total GPU memory on each node is enough to hold the model. What is the recommended way to configure DeepSpeed if I want to use ZeRO-3 within each node and data parallelism between the nodes? Thanks!
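
To make the layout I have in mind concrete, here is a minimal sketch of the kind of config I would expect to write. It assumes 8 GPUs per node (a made-up value) and uses the `mics_shard_size` field under `zero_optimization`, which, as far as I can tell, limits ZeRO-3 sharding to a group of that many ranks and replicates across groups. The helper `build_config` is just my own illustration, and whether this field is the recommended route is exactly what I am unsure about.

```python
import json

# Assumption: every node has the same number of GPUs (8 is a placeholder).
GPUS_PER_NODE = 8


def build_config(gpus_per_node, micro_batch_size=1):
    """Hypothetical helper: build a DeepSpeed config dict that shards
    ZeRO-3 state within a group of `gpus_per_node` ranks."""
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "zero_optimization": {
            "stage": 3,
            # Assumption: setting the shard group size to the number of GPUs
            # per node keeps parameter partitioning inside one node, so the
            # nodes act as data-parallel replicas of each other.
            "mics_shard_size": gpus_per_node,
        },
    }


ds_config = build_config(GPUS_PER_NODE)
print(json.dumps(ds_config, indent=2))
```

The resulting dict would be passed as the `config` argument of `deepspeed.initialize`, or written out as the JSON config file. My understanding is that this mode may also require building the model under `deepspeed.zero.MiCS_Init` instead of `deepspeed.zero.Init`, but I have not confirmed that.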