Hi @danijar, thanks for asking! We are designing the architecture for such workloads. One proposal we had was to have a single training job with H100 GPUs that hosts the trainer, and a SkyServe instance for the data preprocessing / stepping RL environments. This way, the trainer and the environments could scale and recover independently. Would that be something that could fit your case?
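A rough sketch of what that split could look like, using a standard SkyPilot task YAML plus a SkyServe service YAML. The file names, scripts, port, replica count, and the `$ENV_ENDPOINT` variable are placeholders, not a confirmed design; in practice the service endpoint would come from something like `sky serve status`.

```yaml
# trainer.yaml -- single SkyPilot job that hosts the trainer (placeholder sizing)
num_nodes: 2
resources:
  accelerators: H100:8
run: python train.py --env-endpoint $ENV_ENDPOINT  # endpoint of the SkyServe env service

---
# envs.yaml -- SkyServe service for data preprocessing / stepping RL environments
service:
  readiness_probe: /health
  replicas: 16
resources:
  cpus: 8+
  ports: 8080
run: python env_server.py --port 8080
```

Because the environment side is a SkyServe service, its replicas can be scaled or recovered without touching the training job, and vice versa.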
It might! We mainly want to be able to specify something like a "job group" in a single YAML file and launch/stop it with a single command. Each job in the group can have its own number of nodes, resource requirements, and entry point command. And then we'd need a way to connect the jobs within a group (e.g. pass the network addresses of one job as a flag to another job). We don't need cross-cloud communication for this, although it's a nice extra.
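For illustration, a hypothetical job-group YAML might look like the sketch below. This is not existing SkyPilot syntax; the `jobs:` field, the `{{ ... }}` address templating, and the scripts are all made up for the example.

```yaml
# group.yaml -- hypothetical "job group" spec; launched/stopped as one unit
jobs:
  trainer:
    num_nodes: 2
    resources:
      accelerators: H100:8
    # hypothetical templating: network addresses of the other job passed in as a flag
    run: python train.py --env-addrs {{ jobs.envs.addrs }}
  envs:
    num_nodes: 16
    resources:
      cpus: 8+
    run: python env_worker.py --port 8080
```

The whole group would then be launched or torn down with a single command (e.g. `sky launch group.yaml` / `sky down`, hypothetically extended to groups).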
For example, a single training job with 64 H100s plus 256 CPU nodes for data preprocessing or stepping RL environments.
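In SkyPilot resource terms, that could be expressed roughly as 8 nodes with 8 H100s each for the trainer plus 256 CPU-only nodes for the environments. A sketch of the two resource footprints, not a tested config:

```yaml
# trainer: 64 H100s as 8 nodes x H100:8
num_nodes: 8
resources:
  accelerators: H100:8

---
# data preprocessing / env stepping: 256 CPU-only nodes
num_nodes: 256
resources:
  cpus: 8+
```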