Hi @danijar, thanks for asking! We are designing the architecture for such workloads. One proposal we had was to have a single training job with H100 GPUs that hosts the trainer, and a SkyServe instance for the data preprocessing / stepping RL environments. This way, the trainer and the environments could scale and recover independently. Would that be something that could fit your case?
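A rough sketch of what that split could look like, using a standard SkyPilot task YAML plus a SkyServe service YAML. The file names, scripts, port, replica count, and the `$ENV_ENDPOINT` variable are placeholders, not a confirmed design; in practice the service endpoint would come from something like `sky serve status`.

```yaml
# trainer.yaml -- single SkyPilot job that hosts the trainer (placeholder sizing)
num_nodes: 2
resources:
  accelerators: H100:8
run: python train.py --env-endpoint $ENV_ENDPOINT  # endpoint of the SkyServe env service

---
# envs.yaml -- SkyServe service for data preprocessing / stepping RL environments
service:
  readiness_probe: /health
  replicas: 16
resources:
  cpus: 8+
  ports: 8080
run: python env_server.py --port 8080
```

Because the environment side is a SkyServe service, its replicas can be scaled or recovered without touching the training job, and vice versa.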
It might! We mainly want to be able to specify something like a "job group" in a single YAML file and launch/stop it with a single command. Each job in the group can have its own number of nodes, resource requirements, and entry point command. And then we'd need a way to connect the jobs within a group (e.g. pass the network addresses of one job as a flag to another job). We don't need cross-cloud communication for this, although it's a nice extra.
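For illustration, a hypothetical job-group YAML might look like the sketch below. This is not existing SkyPilot syntax; the `jobs:` field, the `{{ ... }}` address templating, and the scripts are all made up for the example.

```yaml
# group.yaml -- hypothetical "job group" spec; launched/stopped as one unit
jobs:
  trainer:
    num_nodes: 2
    resources:
      accelerators: H100:8
    # hypothetical templating: network addresses of the other job passed in as a flag
    run: python train.py --env-addrs {{ jobs.envs.addrs }}
  envs:
    num_nodes: 16
    resources:
      cpus: 8+
    run: python env_worker.py --port 8080
```

The whole group would then be launched or torn down with a single command (e.g. `sky launch group.yaml` / `sky down`, hypothetically extended to groups).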
For example, a single training job with 64 H100s plus 256 CPU nodes for data preprocessing or stepping RL environments.
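In SkyPilot resource terms, that could be expressed roughly as 8 nodes with 8 H100s each for the trainer plus 256 CPU-only nodes for the environments. A sketch of the two resource footprints, not a tested config:

```yaml
# trainer: 64 H100s as 8 nodes x H100:8
num_nodes: 8
resources:
  accelerators: H100:8

---
# data preprocessing / env stepping: 256 CPU-only nodes
num_nodes: 256
resources:
  cpus: 8+
```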